<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Sloth on Tarragon</title><link>https://tarrragon.github.io/blog/backend/06-reliability/vendors/sloth/</link><description>Recent content in Sloth on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Fri, 01 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/backend/06-reliability/vendors/sloth/index.xml" rel="self" type="application/rss+xml"/><item><title>Sloth：SLO YAML 與 Multi-burn-rate Alert 生成</title><link>https://tarrragon.github.io/blog/backend/06-reliability/vendors/sloth/slo-yaml-and-multi-burn-rate-alert-generation/</link><pubDate>Tue, 23 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/06-reliability/vendors/sloth/slo-yaml-and-multi-burn-rate-alert-generation/</guid><description>&lt;h2 id="問題情境">問題情境&lt;/h2>
&lt;p>SLO 從定義到 Prometheus 落地需要多層 rule。一個 SLO 對應 4 組 time window 的 recording rule（計算各窗口的 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/burn-rate/" data-link-title="Burn Rate" data-link-desc="說明 error budget 消耗速度如何支援告警與事故分級">burn rate&lt;/a>），再對應 fast burn 和 slow burn 兩組 alerting rule。手動維護這些 rule 容易出錯：window 參數不一致、新增 SLO 忘記補 alert、修改 SLI expression 只改了部分 rule。&lt;/p>
&lt;p>Sloth 的責任是把這個過程自動化。輸入一份 SLO YAML，產出一組完整的 Prometheus recording + alerting rules，讓 SLO 維護回到宣告式：改 YAML、重新生成、載入 Prometheus。&lt;/p>
&lt;h2 id="slo-yaml-設計">SLO YAML 設計&lt;/h2>
&lt;p>Sloth YAML 的核心結構是 &lt;code>version&lt;/code> → &lt;code>service&lt;/code> → &lt;code>slos[]&lt;/code>。每個 SLO 定義三件事：目標數字（objective）、量測方式（SLI）、告警等級（alerting）。&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="nt">version&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">prometheus/v1&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">service&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">checkout-api&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">slos&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">availability&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">objective&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">99.9&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">description&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;checkout API 的請求成功率&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">sli&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">events&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">error_query&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">sum(rate(http_requests_total{service=&amp;#34;checkout&amp;#34;,code=~&amp;#34;5..&amp;#34;}[{{.window}}]))&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">total_query&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">sum(rate(http_requests_total{service=&amp;#34;checkout&amp;#34;}[{{.window}}]))&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">alerting&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">CheckoutAvailability&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">page_alert&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">14&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">labels&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">15&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">severity&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">critical&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">16&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">ticket_alert&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">17&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">labels&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">18&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">severity&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">warning&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>SLI 有兩種類型。events-based SLI 用 error/total ratio 定義，Sloth 自動把 &lt;code>{{.window}}&lt;/code> 參數代入各 recording rule 的 range vector。raw SLI 直接寫 PromQL expression 算 error ratio，適合非 request-based 的 SLO（如 data freshness、replication lag）。&lt;/p>
&lt;p>raw SLI 範例 — data freshness：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">data-freshness&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">objective&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">99.5&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">sli&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">raw&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">error_ratio_query&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">|&lt;/span>&lt;span class="sd">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">&lt;span class="sd"> 1 - clamp_max(
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">7&lt;/span>&lt;span class="cl">&lt;span class="sd"> replication_lag_seconds{service=&amp;#34;checkout-db&amp;#34;} / 60,
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">8&lt;/span>&lt;span class="cl">&lt;span class="sd"> 1
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">9&lt;/span>&lt;span class="cl">&lt;span class="sd"> )&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>objective 數字的來源是 &lt;a href="https://tarrragon.github.io/blog/backend/06-reliability/slo-error-budget/" data-link-title="6.6 SLO 與 Error Budget 政策" data-link-desc="把可靠性目標轉成可驗證量測與凍結條件">6.6 SLO 政策&lt;/a> — 先從使用者旅程定義服務承諾，再把承諾轉成 objective。Sloth 不負責決定 objective 該是多少，只負責把 objective 轉成可執行的 Prometheus rule。&lt;/p>
&lt;p>alerting 分 page（嚴重，觸發即時通知）和 ticket（一般，產生工單）。兩者的 burn rate 門檻不同：page 用 fast burn window，ticket 用 slow burn window。label 設計跟 Alertmanager routing 對齊 — &lt;code>severity: critical&lt;/code> 走 PagerDuty / Slack alert channel，&lt;code>severity: warning&lt;/code> 走 ticket system（Jira / Linear）。&lt;/p></description><content:encoded><![CDATA[<h2 id="問題情境">問題情境</h2>
<p>SLO 從定義到 Prometheus 落地需要多層 rule。一個 SLO 對應 4 組 time window 的 recording rule（計算各窗口的 <a href="/blog/backend/knowledge-cards/burn-rate/" data-link-title="Burn Rate" data-link-desc="說明 error budget 消耗速度如何支援告警與事故分級">burn rate</a>），再對應 fast burn 和 slow burn 兩組 alerting rule。手動維護這些 rule 容易出錯：window 參數不一致、新增 SLO 忘記補 alert、修改 SLI expression 只改了部分 rule。</p>
<p>Sloth 的責任是把這個過程自動化。輸入一份 SLO YAML，產出一組完整的 Prometheus recording + alerting rules，讓 SLO 維護回到宣告式：改 YAML、重新生成、載入 Prometheus。</p>
<h2 id="slo-yaml-設計">SLO YAML 設計</h2>
<p>Sloth YAML 的核心結構是 <code>version</code> → <code>service</code> → <code>slos[]</code>。每個 SLO 定義三件事：目標數字（objective）、量測方式（SLI）、告警等級（alerting）。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="nt">version</span><span class="p">:</span><span class="w"> </span><span class="l">prometheus/v1</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w"></span><span class="nt">service</span><span class="p">:</span><span class="w"> </span><span class="l">checkout-api</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="nt">slos</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">  </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">availability</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="nt">objective</span><span class="p">:</span><span class="w"> </span><span class="m">99.9</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;checkout API 的請求成功率&#34;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">    </span><span class="nt">sli</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">      </span><span class="nt">events</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">        </span><span class="nt">error_query</span><span class="p">:</span><span class="w"> </span><span class="l">sum(rate(http_requests_total{service=&#34;checkout&#34;,code=~&#34;5..&#34;}[{{.window}}]))</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">        </span><span class="nt">total_query</span><span class="p">:</span><span class="w"> </span><span class="l">sum(rate(http_requests_total{service=&#34;checkout&#34;}[{{.window}}]))</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">    </span><span class="nt">alerting</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">      </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">CheckoutAvailability</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">      </span><span class="nt">page_alert</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w">        </span><span class="nt">labels</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w">          </span><span class="nt">severity</span><span class="p">:</span><span class="w"> </span><span class="l">critical</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w">      </span><span class="nt">ticket_alert</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w">        </span><span class="nt">labels</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="w">          </span><span class="nt">severity</span><span class="p">:</span><span class="w"> </span><span class="l">warning</span></span></span></code></pre></div><p>SLI 有兩種類型。events-based SLI 用 error/total ratio 定義，Sloth 自動把 <code>{{.window}}</code> 參數代入各 recording rule 的 range vector。raw SLI 直接寫 PromQL expression 算 error ratio，適合非 request-based 的 SLO（如 data freshness、replication lag）。</p>
<p>raw SLI 範例 — data freshness：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln">1</span><span class="cl"><span class="w">  </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">data-freshness</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">    </span><span class="nt">objective</span><span class="p">:</span><span class="w"> </span><span class="m">99.5</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="nt">sli</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">      </span><span class="nt">raw</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">        </span><span class="nt">error_ratio_query</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="sd">          1 - clamp_max(
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="sd">            replication_lag_seconds{service=&#34;checkout-db&#34;} / 60,
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="sd">            1
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="sd">          )</span></span></span></code></pre></div><p>objective 數字的來源是 <a href="/blog/backend/06-reliability/slo-error-budget/" data-link-title="6.6 SLO 與 Error Budget 政策" data-link-desc="把可靠性目標轉成可驗證量測與凍結條件">6.6 SLO 政策</a> — 先從使用者旅程定義服務承諾，再把承諾轉成 objective。Sloth 不負責決定 objective 該是多少，只負責把 objective 轉成可執行的 Prometheus rule。</p>
<p>alerting 分 page（嚴重，觸發即時通知）和 ticket（一般，產生工單）。兩者的 burn rate 門檻不同：page 用 fast burn window，ticket 用 slow burn window。label 設計跟 Alertmanager routing 對齊 — <code>severity: critical</code> 走 PagerDuty / Slack alert channel，<code>severity: warning</code> 走 ticket system（Jira / Linear）。</p>
<h2 id="multi-window-multi-burn-rate-alert">Multi-window Multi-burn-rate Alert</h2>
<p>Sloth 預設產生 Google SRE 推薦的 4-window alert 結構。每個 SLO 生成以下 recording rules 和 alerting rules。</p>
<table>
  <thead>
      <tr>
          <th>Window 組合</th>
          <th>責任</th>
          <th>觸發行動</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>5m / 1h</td>
          <td>Fast burn 偵測</td>
          <td>短時間大量消耗 → page 通知</td>
      </tr>
      <tr>
          <td>30m / 6h</td>
          <td>Moderate burn 偵測</td>
          <td>中速消耗 → page 或 ticket</td>
      </tr>
      <tr>
          <td>2h / 1d</td>
          <td>Slow burn 偵測</td>
          <td>緩慢消耗 → ticket</td>
      </tr>
      <tr>
          <td>6h / 3d</td>
          <td>Very slow 偵測</td>
          <td>長期趨勢退化 → ticket 或 review</td>
      </tr>
  </tbody>
</table>
<p>fast burn alert 回答「error budget 是否正在被快速吃掉」。當 5 分鐘窗口的 burn rate 超過 14.4 倍（代表如果持續下去，1 小時會消耗完整個月的 budget），觸發 page。這個門檻的設計邏輯是：越短的窗口允許越高的 burn rate 容忍，因為短窗口的 false positive 率較高，需要搭配較長窗口的確認。</p>
<p>slow burn alert 回答「error budget 是否在不被注意的情況下被緩慢消耗」。6 小時窗口的 burn rate 超過 1 倍（代表月底會剛好用完 budget），觸發 ticket。slow burn 常被忽略，但它是高變更頻率服務最常見的可靠性退化模式 — 每次小回歸都不夠大到觸發 fast burn，累積到月底才發現 budget 已透支。</p>
<p>burn rate alert 跟 <a href="/blog/backend/06-reliability/slo-error-budget/" data-link-title="6.6 SLO 與 Error Budget 政策" data-link-desc="把可靠性目標轉成可驗證量測與凍結條件">6.6 SLO error budget 政策</a> 直接對應：fast burn → 凍結變更；slow burn → 提高驗證門檻；budget 健康 → 正常發版。</p>
<p>Sloth 產出的 recording rule 範例（5m window）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln">1</span><span class="cl">- <span class="nt">record</span><span class="p">:</span><span class="w"> </span><span class="l">slo:sli_error:ratio_rate5m</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">  </span><span class="nt">expr</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="sd">    sum(rate(http_requests_total{service=&#34;checkout&#34;,code=~&#34;5..&#34;}[5m]))
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="sd">    /
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="sd">    sum(rate(http_requests_total{service=&#34;checkout&#34;}[5m]))</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">  </span><span class="nt">labels</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w">    </span><span class="nt">sloth_service</span><span class="p">:</span><span class="w"> </span><span class="l">checkout-api</span><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w">    </span><span class="nt">sloth_slo</span><span class="p">:</span><span class="w"> </span><span class="l">availability</span></span></span></code></pre></div><p>對應的 alerting rule（fast burn）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln">1</span><span class="cl">- <span class="nt">alert</span><span class="p">:</span><span class="w"> </span><span class="l">CheckoutAvailabilityFastBurn</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">  </span><span class="nt">expr</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="sd">    slo:sli_error:ratio_rate5m{sloth_slo=&#34;availability&#34;} &gt; (14.4 * 0.001)
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="sd">    and
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="sd">    slo:sli_error:ratio_rate1h{sloth_slo=&#34;availability&#34;} &gt; (14.4 * 0.001)</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">  </span><span class="nt">labels</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w">    </span><span class="nt">severity</span><span class="p">:</span><span class="w"> </span><span class="l">critical</span></span></span></code></pre></div><p>fast burn alert 要求 5m 和 1h 兩個窗口同時超過門檻，短窗口防止 spike false positive、長窗口確認趨勢持續。</p>
<h2 id="實作流程">實作流程</h2>
<h3 id="cli-生成">CLI 生成</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">sloth generate -i slo.yaml -o rules.yaml
</span></span><span class="line"><span class="ln">2</span><span class="cl">sloth validate -i slo.yaml</span></span></code></pre></div><p><code>generate</code> 產出的 <code>rules.yaml</code> 包含所有 recording rules 和 alerting rules，直接放入 Prometheus 的 <code>rule_files</code> 載入。<code>validate</code> 在 CI 中先行檢查 YAML 格式與 SLI expression 語法。</p>
<h3 id="k8s-operator-mode">K8s Operator mode</h3>
<p>Sloth 提供 K8s Operator，用 <code>PrometheusServiceLevel</code> CRD 定義 SLO。Operator 自動 reconcile，把 CRD 轉成 Prometheus rules 並同步到 Prometheus Operator 的 <code>PrometheusRule</code> 資源。</p>
<p>Operator mode 適合 K8s-native 環境：SLO 定義跟 service deployment 放在同一個 GitOps repo，變更 SLO 跟變更服務走同一套 PR review + CI 流程。</p>
<h3 id="ci--gitops-整合">CI / GitOps 整合</h3>
<p>在 CI pipeline 中跑 <code>sloth validate</code> 驗證 YAML，再跑 <code>sloth generate</code> 產出 rules，commit 進 GitOps repo。Prometheus 透過 config reload 或 Operator reconcile 載入新 rules。這條流程讓 SLO 變更有版本歷史、有 review、有 rollback 能力。</p>
<h2 id="邊界與陷阱">邊界與陷阱</h2>
<p>Sloth 只支援 Prometheus 作為後端。若觀測平台是 Datadog、New Relic、Honeycomb 或 Grafana Cloud，需要各平台自己的 SLO 功能或 <a href="/blog/backend/06-reliability/vendors/nobl9/" data-link-title="Nobl9" data-link-desc="SLO platform、跨 data source、企業 SLO 治理">Nobl9</a> 的 multi-source 整合。</p>
<p>SLI expression 錯誤是最常見的問題。分母為零（service 沒有流量）會產生 NaN，cascading 到所有 recording rule。label 不匹配（<code>service</code> label 拼錯）會產生空 series，alert 永遠不觸發。<code>sloth validate</code> 檢查語法但不檢查 Prometheus 中是否真的有對應 series — 上線後需要用 Prometheus query 確認 recording rule 產出非空結果。</p>
<p>SLO 數量增長會累積 recording rule 成本。每個 SLO 產生約 30 條 recording rule（4 windows × 多組 aggregation）。100 個 SLO 產生 3000 條 rule，Prometheus 的 rule evaluation 會消耗明顯的 CPU 和記憶體。定期監控 <code>prometheus_rule_evaluation_duration_seconds</code> 和 <code>prometheus_rule_group_rules</code>，在 rule 數量影響 evaluation latency 前調整。</p>
<p>升級路徑：Sloth YAML 跟 OpenSLO spec 部分相容。從 Sloth 移到 Nobl9 時，SLO 定義的語意可以保留，SLI expression 需要改寫成 Nobl9 的 data source query。這條路徑適合從 Prometheus-only 環境逐步擴展到 multi-source SLO governance。</p>
<h2 id="整合路由">整合路由</h2>
<ul>
<li>上游：<a href="/blog/backend/06-reliability/slo-error-budget/" data-link-title="6.6 SLO 與 Error Budget 政策" data-link-desc="把可靠性目標轉成可驗證量測與凍結條件">6.6 SLO 與 Error Budget 政策</a> — SLO 定義與 objective 來源</li>
<li>下游：<a href="/blog/backend/06-reliability/release-gate/" data-link-title="6.8 Release Gate 與變更節奏" data-link-desc="把驗證、migration、相容性納入放行判準">6.8 Release Gate</a> — burn rate alert 觸發凍結</li>
<li>平行：<a href="/blog/backend/06-reliability/vendors/nobl9/" data-link-title="Nobl9" data-link-desc="SLO platform、跨 data source、企業 SLO 治理">Nobl9</a>（SaaS multi-source）、Pyrra（K8s-native + UI）</li>
<li>案例回寫：<a href="/blog/backend/06-reliability/cases/google/error-budget-policy-and-release-gating/" data-link-title="Google：Error Budget 政策如何決定發布節奏" data-link-desc="把 SLO 消耗量轉成 release gate，讓可靠性與交付速度共用同一套決策語言。">Google G1</a>（error budget policy 原典）、<a href="/blog/backend/06-reliability/cases/honeycomb/burn-rate-driven-reliability-operations/" data-link-title="Honeycomb：以 Burn Rate 驅動的可靠性操作" data-link-desc="把 SLO burn rate 直接連到值班決策與改善優先序，降低高噪音告警造成的判讀失真。">Honeycomb HC1</a>（burn rate 驅動可靠性操作）</li>
</ul>
]]></content:encoded></item></channel></rss>