<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Chaos Mesh on Tarragon</title><link>https://tarrragon.github.io/blog/backend/06-reliability/vendors/chaos-mesh/</link><description>Recent content in Chaos Mesh on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Fri, 01 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/backend/06-reliability/vendors/chaos-mesh/index.xml" rel="self" type="application/rss+xml"/><item><title>Chaos Mesh: Workflow, Scope Control, and Steady State Probe</title><link>https://tarrragon.github.io/blog/backend/06-reliability/vendors/chaos-mesh/workflow-experiment-scope-and-steady-state-probe/</link><pubDate>Tue, 23 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/06-reliability/vendors/chaos-mesh/workflow-experiment-scope-and-steady-state-probe/</guid><description>&lt;h2 id="問題情境">Problem Context&lt;/h2>
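&lt;p>A single ChaosExperiment (PodChaos pod-kill, NetworkChaos delay) validates only one failure dimension. Realistic reliability verification needs multi-step orchestration: inject dependency latency, check whether the &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/steady-state/" data-link-title="Steady State" data-link-desc="說明可靠性實驗與事故恢復如何定義系統應維持的可接受狀態">steady state&lt;/a> holds, then inject a node failure, and finally verify the recovery path. Chaos Workflow provides this orchestration by composing multiple fault injections and health checks into a replayable verification flow.&lt;/p>
&lt;p>Precise control of experiment scope matters just as much. A chaos experiment whose selector matches every pod in production becomes a real incident. Scope control exists so that the &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/blast-radius/" data-link-title="Blast Radius" data-link-desc="說明事故影響面如何估算與隔離">blast radius&lt;/a> starts at the smallest possible scope and widens step by step, with a stop condition at every step.&lt;/p>
&lt;h2 id="chaos-workflow-設計">Chaos Workflow Design&lt;/h2>
&lt;p>A Chaos Workflow is a DAG (directed acyclic graph) of ChaosExperiment and StatusCheck nodes, with step order and branching conditions defined in YAML.&lt;/p>
&lt;h3 id="步驟類型">Step Types&lt;/h3>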
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Type&lt;/th>
 &lt;th>Responsibility&lt;/th>
 &lt;th>When to use&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Serial&lt;/td>
 &lt;td>Runs steps in order; each step starts only after the previous one finishes&lt;/td>
 &lt;td>Dependency fault → observe → node fault&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Parallel&lt;/td>
 &lt;td>Runs multiple injections concurrently&lt;/td>
 &lt;td>Hitting several dependencies at once to verify cross-effects&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Suspend&lt;/td>
 &lt;td>Pauses and waits for manual confirmation before continuing&lt;/td>
 &lt;td>Approval gate before a high-risk step&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>StatusCheck&lt;/td>
 &lt;td>Probes via HTTP / gRPC / custom script&lt;/td>
 &lt;td>Steady state verification before and after injection&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>StatusCheck is the core control point of a workflow. It runs a health check against the target endpoint before and after fault injection, and its pass/fail result decides whether the workflow continues. The StatusCheck success condition maps to the steady state thresholds in &lt;a href="https://tarrragon.github.io/blog/backend/06-reliability/steady-state-definition/" data-link-title="6.22 Steady State Definition" data-link-desc="在 chaos 與 failover 前先定義系統應維持的穩定狀態與可接受退化">6.22 steady state definition&lt;/a>: success rate, latency, and queue lag can all serve as probe criteria.&lt;/p>
&lt;p>A typical workflow sequence: NetworkChaos(delay 200ms) → StatusCheck(api-latency-ok) → PodChaos(pod-kill) → StatusCheck(recovery-within-30s). The first StatusCheck verifies that the service stays available after the latency injection; the second verifies that recovery time after the pod kill is acceptable.&lt;/p>
&lt;h3 id="suspend-的使用時機">When to Use Suspend&lt;/h3>
&lt;p>A Suspend step belongs right before the blast radius widens. For example, run chaos + StatusCheck in the canary namespace first; once that passes, Suspend until the on-call engineer confirms, then expand to the production namespace. Suspend keeps human judgment in the loop at the key decision points of an automated workflow.&lt;/p>
&lt;h2 id="experiment-scope-control">Experiment Scope Control&lt;/h2>
&lt;p>Scope control makes the impact of each ChaosExperiment predictable and bounded. Chaos Mesh controls it in two layers: selector and mode.&lt;/p>
&lt;h3 id="selector">Selector&lt;/h3>
&lt;p>The selector decides which pods are experiment targets.&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Selector type&lt;/th>
 &lt;th>Effect&lt;/th>
 &lt;th>Example&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>namespace&lt;/td>
 &lt;td>Restricts targets to specific namespaces&lt;/td>
 &lt;td>&lt;code>namespaces: [canary]&lt;/code>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>labelSelector&lt;/td>
 &lt;td>Filters by label&lt;/td>
 &lt;td>&lt;code>app: checkout, tier: backend&lt;/code>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>annotationSelector&lt;/td>
 &lt;td>Filters by annotation&lt;/td>
 &lt;td>&lt;code>chaos-eligible: &amp;quot;true&amp;quot;&lt;/code>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>fieldSelector&lt;/td>
 &lt;td>Filters by field (e.g. node name)&lt;/td>
 &lt;td>&lt;code>spec.nodeName: node-3&lt;/code>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>podPhase&lt;/td>
 &lt;td>Selects only pods in a given phase&lt;/td>
 &lt;td>&lt;code>Running&lt;/code>&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
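&lt;p>The safest starting point combines three layers, namespace + labelSelector + annotation: target only the canary namespace, and only pods of a specific service that carry the &lt;code>chaos-eligible&lt;/code> annotation. Annotation-based opt-in lets teams explicitly mark which pods chaos is allowed to touch.&lt;/p></description><content:encoded><![CDATA[<h2 id="問題情境">Problem Context</h2>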
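<p>A single ChaosExperiment (PodChaos pod-kill, NetworkChaos delay) validates only one failure dimension. Realistic reliability verification needs multi-step orchestration: inject dependency latency, check whether the <a href="/blog/backend/knowledge-cards/steady-state/" data-link-title="Steady State" data-link-desc="說明可靠性實驗與事故恢復如何定義系統應維持的可接受狀態">steady state</a> holds, then inject a node failure, and finally verify the recovery path. Chaos Workflow provides this orchestration by composing multiple fault injections and health checks into a replayable verification flow.</p>
<p>Precise control of experiment scope matters just as much. A chaos experiment whose selector matches every pod in production becomes a real incident. Scope control exists so that the <a href="/blog/backend/knowledge-cards/blast-radius/" data-link-title="Blast Radius" data-link-desc="說明事故影響面如何估算與隔離">blast radius</a> starts at the smallest possible scope and widens step by step, with a stop condition at every step.</p>
<h2 id="chaos-workflow-設計">Chaos Workflow Design</h2>
<p>A Chaos Workflow is a DAG (directed acyclic graph) of ChaosExperiment and StatusCheck nodes, with step order and branching conditions defined in YAML.</p>
<h3 id="步驟類型">Step Types</h3>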
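<table>
  <thead>
      <tr>
          <th>Type</th>
          <th>Responsibility</th>
          <th>When to use</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Serial</td>
          <td>Runs steps in order; each step starts only after the previous one finishes</td>
          <td>Dependency fault → observe → node fault</td>
      </tr>
      <tr>
          <td>Parallel</td>
          <td>Runs multiple injections concurrently</td>
          <td>Hitting several dependencies at once to verify cross-effects</td>
      </tr>
      <tr>
          <td>Suspend</td>
          <td>Pauses and waits for manual confirmation before continuing</td>
          <td>Approval gate before a high-risk step</td>
      </tr>
      <tr>
          <td>StatusCheck</td>
          <td>Probes via HTTP / gRPC / custom script</td>
          <td>Steady state verification before and after injection</td>
      </tr>
  </tbody>
</table>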
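<p>StatusCheck is the core control point of a workflow. It runs a health check against the target endpoint before and after fault injection, and its pass/fail result decides whether the workflow continues. The StatusCheck success condition maps to the steady state thresholds in <a href="/blog/backend/06-reliability/steady-state-definition/" data-link-title="6.22 Steady State Definition" data-link-desc="在 chaos 與 failover 前先定義系統應維持的穩定狀態與可接受退化">6.22 steady state definition</a>: success rate, latency, and queue lag can all serve as probe criteria.</p>
<p>A typical workflow sequence: NetworkChaos(delay 200ms) → StatusCheck(api-latency-ok) → PodChaos(pod-kill) → StatusCheck(recovery-within-30s). The first StatusCheck verifies that the service stays available after the latency injection; the second verifies that recovery time after the pod kill is acceptable.</p>
<h3 id="suspend-的使用時機">When to Use Suspend</h3>
<p>A Suspend step belongs right before the blast radius widens. For example, run chaos + StatusCheck in the canary namespace first; once that passes, Suspend until the on-call engineer confirms, then expand to the production namespace. Suspend keeps human judgment in the loop at the key decision points of an automated workflow.</p>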
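<p>A minimal sketch of where a Suspend step could sit in a staged rollout. The template names (<code>staged-rollout</code>, <code>canary-phase</code>, <code>approval-gate</code>, <code>production-phase</code>) are hypothetical, and the two phase templates are not shown. The sketch assumes a Suspend template is bounded by a <code>deadline</code>; how the pause is released (timer versus manual resume) is worth checking against the Chaos Mesh version in use.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"># Fragment of spec.templates in a Workflow; all names are illustrative.
- name: staged-rollout
  templateType: Serial
  children:
    - canary-phase        # chaos + StatusCheck scoped to the canary namespace
    - approval-gate       # pause before the blast radius widens
    - production-phase    # same steps, scoped to the production namespace
- name: approval-gate
  templateType: Suspend
  deadline: 30m           # assumed upper bound on the pause
</code></pre></div>
<p>Placing the gate as its own named template keeps the widening step visible in review: anyone reading the Serial children list can see exactly where the scope changes.</p>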
<h2 id="experiment-scope-control">Experiment Scope Control</h2>
<p>Scope control makes the impact of each ChaosExperiment predictable and bounded. Chaos Mesh controls it in two layers: selector and mode.</p>
<h3 id="selector">Selector</h3>
<p>The selector decides which pods are experiment targets.</p>
<table>
  <thead>
      <tr>
          <th>Selector type</th>
          <th>Effect</th>
          <th>Example</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>namespace</td>
          <td>Restricts targets to specific namespaces</td>
          <td><code>namespaces: [canary]</code></td>
      </tr>
      <tr>
          <td>labelSelector</td>
          <td>Filters by label</td>
          <td><code>app: checkout, tier: backend</code></td>
      </tr>
      <tr>
          <td>annotationSelector</td>
          <td>Filters by annotation</td>
          <td><code>chaos-eligible: &quot;true&quot;</code></td>
      </tr>
      <tr>
          <td>fieldSelector</td>
          <td>Filters by field (e.g. node name)</td>
          <td><code>spec.nodeName: node-3</code></td>
      </tr>
      <tr>
          <td>podPhase</td>
          <td>Selects only pods in a given phase</td>
          <td><code>Running</code></td>
      </tr>
  </tbody>
</table>
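<p>The safest starting point combines three layers, namespace + labelSelector + annotation: target only the canary namespace, and only pods of a specific service that carry the <code>chaos-eligible</code> annotation. Annotation-based opt-in lets teams explicitly mark which pods chaos is allowed to touch.</p>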
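<p>The three-layer combination can be sketched as a standalone PodChaos. The resource name is hypothetical, and the label and annotation values are taken from the table above. In the Chaos Mesh spec the selector fields are plural (<code>labelSelectors</code>, <code>annotationSelectors</code>, <code>podPhaseSelectors</code>).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml">apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: checkout-pod-kill-canary   # hypothetical name
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - canary                     # layer 1: namespace boundary
    labelSelectors:
      app: checkout                # layer 2: the specific service
      tier: backend
    annotationSelectors:
      chaos-eligible: &#34;true&#34;       # layer 3: explicit opt-in by the owning team
    podPhaseSelectors:
      - Running                    # skip pods that are not yet serving
</code></pre></div>
<p>The conditions are combined, so a pod must satisfy all of them to be eligible. Removing the annotation from a pod is enough to take it out of scope without touching the experiment definition.</p>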
<h3 id="mode">Mode</h3>
<p>Mode decides how many of the pods matched by the selector are actually chosen.</p>
<table>
  <thead>
      <tr>
          <th>Mode</th>
          <th>Behavior</th>
          <th>Blast radius</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>one</td>
          <td>Picks 1 pod at random</td>
          <td>Smallest</td>
      </tr>
      <tr>
          <td>fixed</td>
          <td>Picks a fixed N pods</td>
          <td>Controlled</td>
      </tr>
      <tr>
          <td>fixed-percent</td>
          <td>Picks N% of matched pods</td>
          <td>Proportional</td>
      </tr>
      <tr>
          <td>random-max-percent</td>
          <td>Picks a random share, up to N%</td>
          <td>Randomized</td>
      </tr>
      <tr>
          <td>all</td>
          <td>Picks every matched pod</td>
          <td>Largest</td>
      </tr>
  </tbody>
</table>
<p>Start with <code>mode: one</code> to validate the basic hypothesis. Once StatusCheck passes, step up to <code>mode: fixed-percent</code> with <code>value: &#34;25&#34;</code>, then <code>value: &#34;50&#34;</code>. Before each widening step, check that the steady state still holds; this cadence follows the progressive-widening principle in <a href="/blog/backend/06-reliability/experiment-safety-boundary/" data-link-title="6.20 Experiment Safety Boundary" data-link-desc="定義 chaos、load test、DR drill 的 [blast radius](/backend/knowledge-cards/blast-radius/)、停止條件與權限約束">6.20 experiment safety boundary</a>.</p>
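<p>In the Chaos Mesh spec, the N for <code>fixed</code>, <code>fixed-percent</code>, and <code>random-max-percent</code> is given in a separate string field, <code>value</code>, next to <code>mode</code>. A minimal sketch of the 25% step, with a hypothetical resource name:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml">apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: checkout-pod-kill-25pct   # hypothetical name
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: fixed-percent
  value: &#34;25&#34;                     # 25% of the pods matched by the selector
  selector:
    namespaces:
      - canary
    labelSelectors:
      app: checkout
</code></pre></div>
<p>Moving to the next step is then a one-line diff (<code>value: &#34;50&#34;</code>), which keeps each widening decision easy to review.</p>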
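<h3 id="duration-與-schedule">Duration and Schedule</h3>
<p>duration controls how long a single fault injection lasts; schedule controls how often the experiment repeats. A duration that is too short may miss the system's full degradation-and-recovery cycle; one that is too long adds real risk. As a starting point, set duration to 2-3 times the recovery SLA (for example, 60-90s for an RTO of 30s) so the observation window covers a full recovery.</p>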
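<p>A sketch of how the two settings fit together, using the Chaos Mesh <code>Schedule</code> resource. The resource name and the cron expression (Tuesdays at 10:00) are illustrative assumptions; the 90s duration applies the 3x rule to an assumed RTO of 30s.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml">apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: checkout-delay-weekly     # hypothetical name
  namespace: chaos-testing
spec:
  schedule: &#34;0 10 * * 2&#34;          # cron: every Tuesday at 10:00 (assumed cadence)
  concurrencyPolicy: Forbid       # do not start a run while one is still active
  historyLimit: 5                 # keep the last 5 runs for review
  type: NetworkChaos
  networkChaos:
    action: delay
    mode: one
    selector:
      namespaces:
        - canary
      labelSelectors:
        app: checkout
    delay:
      latency: &#34;200ms&#34;
    duration: &#34;90s&#34;               # 3x an assumed RTO of 30s
</code></pre></div>
<p><code>concurrencyPolicy: Forbid</code> is a conservative choice here: overlapping runs would stack faults and make the blast radius harder to reason about.</p>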
<h2 id="實作範例">Implementation Example</h2>
<p>A complete Chaos Workflow: inject network latency into the checkout service, verify the API is still available, then kill a pod and verify recovery.</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">chaos-mesh.org/v1alpha1</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w"></span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">Workflow</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">checkout-resilience-verification</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">  </span><span class="nt">namespace</span><span class="p">:</span><span class="w"> </span><span class="l">chaos-testing</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">  </span><span class="nt">entry</span><span class="p">:</span><span class="w"> </span><span class="l">main</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">  </span><span class="nt">templates</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">    </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">main</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">      </span><span class="nt">templateType</span><span class="p">:</span><span class="w"> </span><span class="l">Serial</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">      </span><span class="nt">children</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">        </span>- <span class="l">network-delay</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">        </span>- <span class="l">check-api-health</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w">        </span>- <span class="l">pod-kill</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w">        </span>- <span class="l">check-recovery</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w">    </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">network-delay</span><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w">      </span><span class="nt">templateType</span><span class="p">:</span><span class="w"> </span><span class="l">NetworkChaos</span><span class="w">
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="w">      </span><span class="nt">networkChaos</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="w">        </span><span class="nt">action</span><span class="p">:</span><span class="w"> </span><span class="l">delay</span><span class="w">
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="w">        </span><span class="nt">delay</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="w">          </span><span class="nt">latency</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;200ms&#34;</span><span class="w">
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="w">        </span><span class="nt">selector</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="w">          </span><span class="nt">namespaces</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">canary]</span><span class="w">
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="w">          </span><span class="nt">labelSelectors</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="w">            </span><span class="nt">app</span><span class="p">:</span><span class="w"> </span><span class="l">checkout</span><span class="w">
</span></span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="w">        </span><span class="nt">mode</span><span class="p">:</span><span class="w"> </span><span class="l">one</span><span class="w">
</span></span></span><span class="line"><span class="ln">27</span><span class="cl"><span class="w">        </span><span class="nt">duration</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;60s&#34;</span><span class="w">
</span></span></span><span class="line"><span class="ln">28</span><span class="cl"><span class="w">    </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">check-api-health</span><span class="w">
</span></span></span><span class="line"><span class="ln">29</span><span class="cl"><span class="w">      </span><span class="nt">templateType</span><span class="p">:</span><span class="w"> </span><span class="l">StatusCheck</span><span class="w">
</span></span></span><span class="line"><span class="ln">30</span><span class="cl"><span class="w">      </span><span class="nt">statusCheck</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">31</span><span class="cl"><span class="w">        </span><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l">HTTP</span><span class="w">
</span></span></span><span class="line"><span class="ln">32</span><span class="cl"><span class="w">        </span><span class="nt">http</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">33</span><span class="cl"><span class="w">          </span><span class="nt">url</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;http://checkout.canary/health&#34;</span><span class="w">
</span></span></span><span class="line"><span class="ln">34</span><span class="cl"><span class="w">          </span><span class="nt">criteria</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">35</span><span class="cl"><span class="w">            </span><span class="nt">statusCode</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;200&#34;</span><span class="w">
</span></span></span><span class="line"><span class="ln">36</span><span class="cl"><span class="w">        </span><span class="nt">timeoutSeconds</span><span class="p">:</span><span class="w"> </span><span class="m">30</span><span class="w">
</span></span></span><span class="line"><span class="ln">37</span><span class="cl"><span class="w">        </span><span class="nt">failureThreshold</span><span class="p">:</span><span class="w"> </span><span class="m">3</span><span class="w">
</span></span></span><span class="line"><span class="ln">38</span><span class="cl"><span class="w">    </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">pod-kill</span><span class="w">
</span></span></span><span class="line"><span class="ln">39</span><span class="cl"><span class="w">      </span><span class="nt">templateType</span><span class="p">:</span><span class="w"> </span><span class="l">PodChaos</span><span class="w">
</span></span></span><span class="line"><span class="ln">40</span><span class="cl"><span class="w">      </span><span class="nt">podChaos</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">41</span><span class="cl"><span class="w">        </span><span class="nt">action</span><span class="p">:</span><span class="w"> </span><span class="l">pod-kill</span><span class="w">
</span></span></span><span class="line"><span class="ln">42</span><span class="cl"><span class="w">        </span><span class="nt">selector</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">43</span><span class="cl"><span class="w">          </span><span class="nt">namespaces</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">canary]</span><span class="w">
</span></span></span><span class="line"><span class="ln">44</span><span class="cl"><span class="w">          </span><span class="nt">labelSelectors</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">45</span><span class="cl"><span class="w">            </span><span class="nt">app</span><span class="p">:</span><span class="w"> </span><span class="l">checkout</span><span class="w">
</span></span></span><span class="line"><span class="ln">46</span><span class="cl"><span class="w">        </span><span class="nt">mode</span><span class="p">:</span><span class="w"> </span><span class="l">one</span><span class="w">
</span></span></span><span class="line"><span class="ln">47</span><span class="cl"><span class="w">    </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">check-recovery</span><span class="w">
</span></span></span><span class="line"><span class="ln">48</span><span class="cl"><span class="w">      </span><span class="nt">templateType</span><span class="p">:</span><span class="w"> </span><span class="l">StatusCheck</span><span class="w">
</span></span></span><span class="line"><span class="ln">49</span><span class="cl"><span class="w">      </span><span class="nt">statusCheck</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">50</span><span class="cl"><span class="w">        </span><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l">HTTP</span><span class="w">
</span></span></span><span class="line"><span class="ln">51</span><span class="cl"><span class="w">        </span><span class="nt">http</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">52</span><span class="cl"><span class="w">          </span><span class="nt">url</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;http://checkout.canary/health&#34;</span><span class="w">
</span></span></span><span class="line"><span class="ln">53</span><span class="cl"><span class="w">          </span><span class="nt">criteria</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">54</span><span class="cl"><span class="w">            </span><span class="nt">statusCode</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;200&#34;</span><span class="w">
</span></span></span><span class="line"><span class="ln">55</span><span class="cl"><span class="w">        </span><span class="nt">timeoutSeconds</span><span class="p">:</span><span class="w"> </span><span class="m">60</span><span class="w">
</span></span></span><span class="line"><span class="ln">56</span><span class="cl"><span class="w">        </span><span class="nt">failureThreshold</span><span class="p">:</span><span class="w"> </span><span class="m">5</span></span></span></code></pre></div><h3 id="gitops-整合">GitOps Integration</h3>
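<p>Workflow definitions live in a git repo and are synced to the cluster with ArgoCD or Flux. Changes to a chaos experiment go through PR review, the same approval process as code changes. This makes the experiment's change history traceable and auditable.</p>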
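<p>A minimal sketch of the ArgoCD side, assuming the workflows are stored under a directory in a dedicated repository. The repository URL, path, and application name are placeholders, not values from this setup.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml">apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: chaos-workflows           # hypothetical name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/your-org/chaos-experiments.git   # placeholder
    targetRevision: main
    path: workflows/canary        # placeholder directory holding the Workflow YAML
  destination:
    server: https://kubernetes.default.svc
    namespace: chaos-testing
  # No automated syncPolicy: syncing creates the Workflow and starts the
  # experiment, so the sync is left as a deliberate manual action.
</code></pre></div>
<p>Leaving sync manual is a design choice for this sketch: the PR merge records what will run, and the sync records when someone decided to run it.</p>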
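<h3 id="rbac-約束">RBAC Constraints</h3>
<p>The Chaos Mesh ServiceAccount should hold minimal permissions. Chaos experiments in the production namespace should use a dedicated ServiceAccount that is granted only create/get/list on ChaosExperiment resources in the target namespace. Avoid running chaos under the cluster-admin role: with overly broad permissions, the impact of a misconfigured selector cannot be contained.</p>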
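<p>A sketch of a namespace-scoped grant under these constraints. The account and role names are hypothetical, and the resource list assumes only PodChaos and NetworkChaos are in use; extend it to match the experiment kinds the team actually runs, and verify the resource names against the CRDs installed in the cluster.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml">apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-runner              # hypothetical name
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role                        # Role, not ClusterRole: limited to one namespace
metadata:
  name: chaos-experimenter        # hypothetical name
  namespace: production
rules:
  - apiGroups: [&#34;chaos-mesh.org&#34;]
    resources: [&#34;podchaos&#34;, &#34;networkchaos&#34;]   # assumed set of experiment kinds
    verbs: [&#34;create&#34;, &#34;get&#34;, &#34;list&#34;]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-runner-binding      # hypothetical name
  namespace: production
subjects:
  - kind: ServiceAccount
    name: chaos-runner
    namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: chaos-experimenter
</code></pre></div>
<p>Because the grant is a Role bound inside a single namespace, a selector that accidentally names another namespace has no permission to act there.</p>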
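<h2 id="邊界與陷阱">Boundaries and Pitfalls</h2>
<p><strong>StatusCheck timeout too short</strong>: after a pod-kill, the service has to pass its readiness probe, wait for the load balancer to update, and warm its cache. If the StatusCheck timeoutSeconds is too short, the service is judged failed while it is still recovering, which reports a failure that is not real. As a starting point, set the timeout to twice the expected recovery time.</p>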
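<p>A sketch of sizing the check for an expected recovery time of 30s, so the budget is about 60s. It assumes that <code>timeoutSeconds</code> bounds each individual request while <code>intervalSeconds</code> and <code>failureThreshold</code> together bound how long the check keeps retrying, similar to Kubernetes probes; confirm these semantics against the StatusCheck reference for the Chaos Mesh version in use.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"># Fragment of spec.templates in a Workflow.
- name: check-recovery
  templateType: StatusCheck
  statusCheck:
    mode: Synchronous
    type: HTTP
    intervalSeconds: 10    # probe every 10s
    timeoutSeconds: 5      # assumed per-request timeout
    failureThreshold: 6    # 6 consecutive failures x 10s is roughly a 60s budget
    successThreshold: 1    # one passing probe is enough to continue
    http:
      url: &#34;http://checkout.canary/health&#34;
      criteria:
        statusCode: &#34;200&#34;
</code></pre></div>
<p>Under these assumptions the retry budget, not a single request timeout, is what needs to cover the recovery window.</p>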
<p><strong>Selector too broad</strong>: a namespace-level selector without a labelSelector matches every pod in that namespace, including non-target pods such as sidecars and monitoring agents. Always narrow the scope with a labelSelector or annotationSelector.</p>
<p><strong>Privilege requirements</strong>: Chaos Mesh's IOChaos and StressChaos require the SYS_ADMIN / SYS_PTRACE container capabilities, which the security team may restrict. If those privileges are unavailable, start with PodChaos + NetworkChaos (no extra capabilities needed) to build the chaos habit, then expand gradually.</p>
<p><strong>Kubernetes-only limitation</strong>: Chaos Mesh can only inject faults into workloads running on Kubernetes. Dependencies outside K8s (external SaaS, bare-metal DBs, third-party APIs) need to be covered with <a href="/blog/backend/06-reliability/vendors/toxiproxy/" data-link-title="Toxiproxy" data-link-desc="TCP-level fault injection proxy（Shopify 開源）">Toxiproxy</a> (TCP-level faults) or <a href="/blog/backend/06-reliability/vendors/gremlin/" data-link-title="Gremlin" data-link-desc="商業 chaos engineering 平台、跨平台與 GameDay">Gremlin</a> (cross-platform SaaS).</p>
<h2 id="整合路由">Integration Routes</h2>
<ul>
<li>Upstream concept: <a href="/blog/backend/06-reliability/experiment-safety-boundary/" data-link-title="6.20 Experiment Safety Boundary" data-link-desc="定義 chaos、load test、DR drill 的 [blast radius](/backend/knowledge-cards/blast-radius/)、停止條件與權限約束">6.20 Experiment Safety Boundary</a> (selector + mode map to blast radius design)</li>
<li>Upstream concept: <a href="/blog/backend/06-reliability/steady-state-definition/" data-link-title="6.22 Steady State Definition" data-link-desc="在 chaos 與 failover 前先定義系統應維持的穩定狀態與可接受退化">6.22 Steady State Definition</a> (StatusCheck maps to steady state thresholds)</li>
<li>Downstream handoff: <a href="/blog/backend/06-reliability/verification-evidence-handoff/" data-link-title="6.23 Verification Evidence Handoff" data-link-desc="把 SLO、load、chaos、DR 與 readiness 結果包成 release / incident 可用證據">6.23 Verification Evidence Handoff</a> (Workflow results serve as release gate evidence)</li>
<li>Parallel vendors: <a href="/blog/backend/06-reliability/vendors/litmuschaos/" data-link-title="LitmusChaos" data-link-desc="Kubernetes chaos engineering 平台（CNCF graduated）">LitmusChaos</a>, <a href="/blog/backend/06-reliability/vendors/gremlin/" data-link-title="Gremlin" data-link-desc="商業 chaos engineering 平台、跨平台與 GameDay">Gremlin</a>, <a href="/blog/backend/06-reliability/vendors/toxiproxy/" data-link-title="Toxiproxy" data-link-desc="TCP-level fault injection proxy（Shopify 開源）">Toxiproxy</a></li>
<li>Case write-back: <a href="/blog/backend/06-reliability/cases/netflix/steady-state-chaos-and-fit/" data-link-title="Netflix：Steady State、Chaos 與 FIT 的驗證路徑" data-link-desc="把故障注入從工具操作升級成可驗證流程：先定義穩態，再設計注入與回復條件。">Netflix N1</a> (steady state hypothesis), <a href="/blog/backend/06-reliability/cases/netflix/chaos-monkey-business-hours-guardrails/" data-link-title="Netflix：Business-Hours Chaos 與 Guardrails" data-link-desc="Chaos Monkey 為何刻意在 business hours 執行：把即時應變能力納入驗證，並用 guardrails 限制實驗風險。">Netflix N2</a> (business-hours guardrails map to scope control)</li>
</ul>
]]></content:encoded></item></channel></rss>