Vendor on Tarragon

Chaos Mesh：Workflow、Scope Control 與 Steady State Probe

Tue, 23 Jun 2026 00:00:00 +0000

問題情境

單一 ChaosExperiment（PodChaos pod-kill、NetworkChaos delay）只能驗證一個故障面向。真實的可靠性驗證需要多步驟編排：先注入依賴延遲，觀察 steady state 是否維持，再注入節點失效，最後驗證恢復路徑。Chaos Workflow 提供這個編排能力，把多個 fault injection 與 health check 組成可重播的驗證流程。

experiment scope 的精準控制同樣關鍵。selector 選到 production 全部 pod 的 chaos experiment 會變成真實事故。scope control 的責任是讓 blast radius 從最小範圍開始，逐步放大，每一步都有停止條件。

Chaos Workflow 設計

Chaos Workflow 是多個 ChaosExperiment 與 StatusCheck 組成的 DAG（有向無環圖），用 YAML 定義步驟順序與分支條件。

步驟類型

類型	責任	適用場景
Serial	順序執行，前一步完成才進下一步	依賴故障 → 觀察 → 節點故障
Parallel	平行執行多個注入	同時打多個依賴驗證交叉影響
Suspend	暫停等待人工確認後再繼續	高風險步驟前的 approval gate
StatusCheck	對 HTTP / gRPC / custom script 做 probe	注入前後的 steady state 驗證

StatusCheck 是 workflow 的核心控制面。它在故障注入前後對目標 endpoint 做 health check，pass/fail 決定 workflow 是否繼續。StatusCheck 的 success condition 對應 6.22 steady state definition 的穩態門檻：success rate、latency、queue lag 都能作為 probe 判準。

典型 workflow 編排：NetworkChaos(delay 200ms) → StatusCheck(api-latency-ok) → PodChaos(pod-kill) → StatusCheck(recovery-within-30s)。第一個 StatusCheck 驗證延遲注入後服務仍可用；第二個 StatusCheck 驗證節點失效後恢復時間可接受。
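The gating behaviour of that sequence can be sketched in plain JavaScript. This only models the control flow (a failing check stops the later injections); it is not how the Chaos Mesh controller is implemented, and `runSerial` and the step objects are hypothetical names introduced for illustration.

```javascript
// Runs steps in order; a failing check aborts everything after it.
function runSerial(steps) {
  const executed = [];
  for (const step of steps) {
    executed.push(step.name);
    const ok = step.run();
    if (step.kind === 'check' && !ok) {
      return { passed: false, executed, failedAt: step.name };
    }
  }
  return { passed: true, executed, failedAt: null };
}

// Stubbed steps: the latency check fails, so pod-kill must never run.
const result = runSerial([
  { name: 'network-delay', kind: 'inject', run: () => true },
  { name: 'api-latency-ok', kind: 'check', run: () => false },
  { name: 'pod-kill', kind: 'inject', run: () => true },
  { name: 'recovery-within-30s', kind: 'check', run: () => true },
]);
console.log(result);
```

The point of the sketch is the stop condition: the second fault is only injected once the first steady-state check has passed.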

Suspend 的使用時機

Suspend 步驟適合放在 blast radius 擴大之前。例如先在 canary namespace 跑完 chaos + StatusCheck，通過後 Suspend 等待值班工程師確認，再擴大到 production namespace。Suspend 讓自動化 workflow 在關鍵決策點保留人工判斷。

Experiment Scope Control

Scope control 的責任是讓每個 ChaosExperiment 的影響面可預測、可限制。Chaos Mesh 用 selector + mode 兩層控制。

Selector

Selector 決定哪些 pod 是實驗目標。

Selector field	Effect	Example
namespaces	Restrict to specific namespaces	`namespaces: [canary]`
labelSelectors	Filter by label	`app: checkout, tier: backend`
annotationSelectors	Filter by annotation	`chaos-eligible: "true"`
fieldSelectors	Filter by field (such as node name)	`spec.nodeName: node-3`
podPhaseSelectors	Select only pods in a given phase	`Running`

最安全的起點是 namespace + labelSelector + annotation 三層組合：只在 canary namespace、只選帶 chaos-eligible annotation 的特定服務 pod。annotation-based opt-in 讓團隊明確標記哪些 pod 可以被 chaos 觸及。

Mode

Mode 決定在 selector 命中的 pod 中選多少個。

Mode	行為	Blast radius
one	隨機選 1 個	最小
fixed	固定選 N 個	可控
fixed-percent	選命中 pod 的 N%	比例控制
random-max-percent	隨機選最多 N%	有隨機性
all	選全部命中的 pod	最大

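Start with mode: one to validate the basic assumption. Once the StatusCheck passes, step up to mode: fixed-percent with value "25", then value "50". Before each increase, check that steady state still holds; this cadence follows the progressive-expansion principle in 6.20 experiment safety boundary.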
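To make the escalation concrete, here is a minimal sketch that estimates the upper bound on pods touched at each step. `maxAffectedPods` is a hypothetical helper, and the rounding used for percentages is an assumption; verify how your Chaos Mesh version rounds before relying on exact counts.

```javascript
// Upper bound on pods an experiment can touch for a mode/value pair.
function maxAffectedPods(mode, value, matched) {
  switch (mode) {
    case 'one':
      return Math.min(1, matched);
    case 'all':
      return matched;
    case 'fixed':
      return Math.min(value, matched);
    case 'fixed-percent':
    case 'random-max-percent':
      // Assumption: round up; random-max-percent may select fewer pods.
      return Math.min(matched, Math.ceil((matched * value) / 100));
    default:
      throw new Error(`unknown mode: ${mode}`);
  }
}

// The escalation ladder from the text, against 12 matched pods.
const ladder = [
  { mode: 'one' },
  { mode: 'fixed-percent', value: 25 },
  { mode: 'fixed-percent', value: 50 },
];
const bounds = ladder.map((s) => maxAffectedPods(s.mode, s.value, 12));
console.log(bounds);
```

Writing the bound down before each step gives the stop condition something concrete to compare against.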

Duration 與 Schedule

duration 控制單次故障注入持續多久，schedule 控制實驗重複頻率。duration 太短可能看不到系統完整的退化與恢復循環；太長則增加實際風險。初始建議：duration 設為 recovery SLA 的 2-3 倍（例如 RTO 30s 則 duration 設 60-90s），讓觀測窗涵蓋完整恢復。

實作範例

一個完整的 Chaos Workflow：先對 checkout 服務注入網路延遲，驗證 API 仍可用，再 kill pod 驗證恢復。

apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  # Kubernetes object names must be lowercase ASCII (RFC 1123).
  name: checkout-resilience-validation
  namespace: chaos-testing
spec:
  entry: main
  templates:
    # Serial entry point: each child runs only after the previous one finishes.
    - name: main
      templateType: Serial
      children:
        - network-delay
        - check-api-health
        - pod-kill
        - check-recovery
    # Step 1: inject 200ms latency into one checkout pod in canary.
    - name: network-delay
      templateType: NetworkChaos
      networkChaos:
        action: delay
        delay:
          latency: "200ms"
        selector:
          namespaces: [canary]
          labelSelectors:
            app: checkout
        mode: one
        duration: "60s"
    # Step 2: the API must still answer 200 under the injected delay.
    - name: check-api-health
      templateType: StatusCheck
      statusCheck:
        type: HTTP
        http:
          url: "http://checkout.canary/health"
          criteria:
            statusCode: "200"
        timeoutSeconds: 30
        failureThreshold: 3
    # Step 3: kill one checkout pod.
    - name: pod-kill
      templateType: PodChaos
      podChaos:
        action: pod-kill
        selector:
          namespaces: [canary]
          labelSelectors:
            app: checkout
        mode: one
    # Step 4: verify the service recovers within the allowed window.
    - name: check-recovery
      templateType: StatusCheck
      statusCheck:
        type: HTTP
        http:
          url: "http://checkout.canary/health"
          criteria:
            statusCode: "200"
        timeoutSeconds: 60
        failureThreshold: 5

GitOps 整合

Workflow 定義存在 git repo，用 ArgoCD 或 Flux sync 到 cluster。變更 chaos experiment 走 PR review，跟 code 變更同樣的 approval 流程。這讓 experiment 的修改歷史可追蹤、可審計。

RBAC 約束

Chaos Mesh 的 ServiceAccount 權限需要最小化。production namespace 的 chaos experiment 應使用獨立 ServiceAccount，只授予目標 namespace 的 ChaosExperiment create/get/list 權限。避免使用 cluster-admin 角色跑 chaos — 權限過大會讓 selector 誤配時的影響面不可控。

邊界與陷阱

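StatusCheck timeout too short: after a pod-kill the service needs its readiness probe to pass, the load balancer to update, and caches to warm. If the StatusCheck timeoutSeconds is too short, the check fails while the service is still recovering, which is a false alarm rather than a real regression. As a starting point, set the timeout to twice the expected recovery time.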

Selector 太寬：namespace-level selector 不加 labelSelector 會命中該 namespace 所有 pod，包含 sidecar、monitoring agent 等非目標 pod。永遠用 labelSelector 或 annotationSelector 收窄範圍。

Privilege 需求：Chaos Mesh 的 IOChaos 和 StressChaos 需要 container 的 SYS_ADMIN / SYS_PTRACE capability。安全團隊可能限制這些 capability 的使用。若無法取得 privilege，可以先用 PodChaos + NetworkChaos（不需額外 capability）建立 chaos 習慣，再逐步推進。

K8s-only 限制：Chaos Mesh 只能注入 Kubernetes 上的故障。非 K8s 環境的依賴（外部 SaaS、bare-metal DB、第三方 API）需要用 Toxiproxy（TCP-level fault）或 Gremlin（跨平台 SaaS）補充。

整合路由

上游概念：6.20 Experiment Safety Boundary — selector + mode 對應 blast radius 設計
上游概念：6.22 Steady State Definition — StatusCheck 對應穩態門檻
下游交接：6.23 Verification Evidence Handoff — Workflow 結果作為 release gate 證據
平行 vendor：LitmusChaos、Gremlin、Toxiproxy
案例回寫：Netflix N1（steady state hypothesis）、Netflix N2（business-hours guardrails 對應 scope control）

k6：Threshold CI Gate 與 Scenario 設計

Tue, 23 Jun 2026 00:00:00 +0000

問題情境

Load test 跑完會產生大量指標，但 CI pipeline 需要的是 pass/fail 訊號。若沒有 threshold 把指標轉成判讀結論，效能退化只能靠人工看 dashboard 發現，等到看見時通常已經累積數個版本。

另一面，threshold 的判讀品質取決於 workload model 的真實度。用 --vus 10 --duration 30s 跑出來的結果跟 production 流量結構差距太大時，threshold 通過也無法證明 production 安全。

這篇處理兩個問題：怎麼設 threshold 讓 CI gate 可靠，怎麼設 scenario 讓 workload 接近真實。

Threshold 設計

Threshold 的責任是把 load test 指標轉成 CI 的 pass/fail 訊號。k6 在所有 threshold 都通過時回傳 exit code 0，任一 threshold 失敗就回傳非零 — CI pipeline 直接用 exit code 判斷。

多指標 threshold

單一指標 threshold 容易漏風險。latency 正常但 error rate 偏高代表系統在丟請求；throughput 正常但 latency 偏高代表排隊開始堆積。完整的 threshold 至少涵蓋三個面向：

export const options = {
  thresholds: {
    http_req_duration: ['p(95)<500', 'p(99)<1000'],
    http_req_failed:   ['rate<0.01'],
    http_reqs:         ['rate>100'],
  },
};

latency threshold 用 percentile 而不是 average — average 會被長尾稀釋，p95/p99 更接近使用者感知的最差體驗。

門檻來源

Threshold 的門檻從 production baseline 出發。先從 observability 系統（Grafana / Datadog）取最近 7-30 天的 p95/p99 latency 與 error rate，加上可接受退化幅度（通常 10-20%）作為 threshold。門檻太緊會讓 CI 環境噪音觸發 false positive；門檻太寬會讓真退化滑過去。
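Deriving thresholds from a baseline is simple arithmetic, sketched below. `buildThresholds`, the baseline numbers, and the 20% tolerance are illustrative assumptions, not values from a real system; the output strings follow the k6 threshold expression format used above.

```javascript
// Turns a production baseline plus an allowed degradation into k6 thresholds.
function buildThresholds(baseline, tolerance) {
  const limit = (ms) => Math.round(ms * (1 + tolerance));
  const errorLimit = (baseline.errorRate * (1 + tolerance)).toFixed(4);
  return {
    http_req_duration: [
      `p(95)<${limit(baseline.p95Ms)}`,
      `p(99)<${limit(baseline.p99Ms)}`,
    ],
    http_req_failed: [`rate<${errorLimit}`],
  };
}

// Hypothetical baseline taken from the observability system.
const thresholds = buildThresholds(
  { p95Ms: 400, p99Ms: 800, errorRate: 0.005 },
  0.2
);
console.log(thresholds);
```

Generating the object from recorded baseline numbers keeps the monthly recalibration a data change rather than a hand edit of the test script.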

校準節奏：每月或每次重大架構變更後重新對齊 production baseline，避免 threshold 跟真實系統漂移。

Path-level threshold

不同 API path 的效能特徵不同。checkout 路徑的 latency 容忍度可能比 listing 路徑低很多。k6 的 group + tag 機制讓 threshold 可以按 path 設定：

import { group } from 'k6';

export default function () {
  group('checkout', function () {
    // checkout requests
  });
  group('listing', function () {
    // listing requests
  });
}

export const options = {
  thresholds: {
    'http_req_duration{group:::checkout}': ['p(95)<300'],
    'http_req_duration{group:::listing}':  ['p(95)<800'],
  },
};

path-level threshold 讓 gate 的判讀粒度從「整體效能」細化到「關鍵路徑效能」。

Scenario 設計

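A scenario's job is to make the load test's traffic structure resemble production. k6 ships several scenario executors, and the choice depends on which variable you want to control. The table covers five of them; the iteration-count executors (shared-iterations, per-vu-iterations) are omitted.

Executor	Controlled variable	Typical use
constant-vus	Number of concurrent users	Simple smoke test
ramping-vus	Number of concurrent users	Stepwise ramp to find saturation
constant-arrival-rate	Fixed RPS	CI regression (stable input)
ramping-arrival-rate	Varying RPS	Simulating production peak / off-peak
externally-controlled	VU count adjusted at runtime through the k6 REST API or CLI	Tests steered by an external controller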

Executor 選擇判準

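constant-vus is the simplest, but throughput fluctuates with response time: when the server slows down, RPS drops automatically and hides the real pressure. constant-arrival-rate holds the request rate steady so threshold readings share one baseline, but it needs enough preAllocatedVUs. When k6 runs out of VUs it drops iterations (reported in the dropped_iterations metric), and the achieved rate falls below the target.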

CI regression 測試建議用 constant-arrival-rate：輸入固定、輸出可比較、版本間的差異才有意義。
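A rough way to size preAllocatedVUs is rate multiplied by expected iteration duration, plus headroom. The sketch below assumes that rule of thumb; `estimatePreAllocatedVUs` and the 1.5 headroom factor are illustrative choices, not k6 defaults.

```javascript
// VUs needed ~= arrival rate * iteration duration, padded with headroom
// so that slower responses do not immediately cause dropped iterations.
function estimatePreAllocatedVUs(ratePerSecond, iterationSeconds, headroom = 1.5) {
  if (!(ratePerSecond > 0) || !(iterationSeconds > 0)) {
    throw new RangeError('rate and iteration duration must be positive');
  }
  return Math.ceil(ratePerSecond * iterationSeconds * headroom);
}

// 200 iterations/s with iterations that take about 1s each.
const vus = estimatePreAllocatedVUs(200, 1);
console.log(vus);
```

If the dropped_iterations metric is non-zero after a run, the estimate was too low for the observed response times.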

Production traffic shape 對齊

用 ramping-arrival-rate 模擬 production 的流量形狀：

export const options = {
  scenarios: {
    peak_simulation: {
      executor: 'ramping-arrival-rate',
      startRate: 50,
      stages: [
        { target: 200, duration: '2m' },  // ramp up
        { target: 200, duration: '5m' },  // sustain peak
        { target: 50,  duration: '1m' },  // ramp down
      ],
      preAllocatedVUs: 300,
    },
  },
};

流量形狀的參數（startRate / target / duration）從 production access log 的 peak 時段推算。Shopify 的 BFCM 準備流程把 game day 的 load test scenario 跟實際峰值形狀對齊 — 短時間爆量加高寫入比例需要特別設計 scenario 來覆蓋。

Cohort 模擬

Production 流量不是單一類型。用多 scenario 並行模擬不同 cohort：

export const options = {
  scenarios: {
    read_traffic: {
      executor: 'constant-arrival-rate',
      rate: 150, exec: 'readFlow',
      preAllocatedVUs: 200,
      duration: '5m',
    },
    write_traffic: {
      executor: 'constant-arrival-rate',
      rate: 30, exec: 'writeFlow',
      preAllocatedVUs: 50,
      duration: '5m',
    },
  },
};

export function readFlow() { /* GET requests */ }
export function writeFlow() { /* POST requests */ }

讀寫比例從 production 的 access log 或 APM 資料推算。比例偏差會讓瓶頸位置失真 — 讀為主的模型抓不到寫入引起的 lock contention。
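Splitting a total request rate across cohorts is a proportional allocation, sketched here. `cohortRates` and the 5:1 read/write mix are assumptions for illustration; the real mix should come from access logs or APM data.

```javascript
// Allocates a total RPS across cohorts according to relative weights.
function cohortRates(totalRps, mix) {
  const totalWeight = Object.values(mix).reduce((sum, w) => sum + w, 0);
  if (totalWeight <= 0) throw new RangeError('mix weights must sum to > 0');
  return Object.fromEntries(
    Object.entries(mix).map(([name, weight]) => [
      name,
      Math.round((totalRps * weight) / totalWeight),
    ])
  );
}

// 180 RPS at an observed 5:1 read/write ratio.
const rates = cohortRates(180, { read: 5, write: 1 });
console.log(rates);
```

Each resulting value can be used as the `rate` of the matching constant-arrival-rate scenario.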

資料驅動

測試資料用 SharedArray 載入，避免每個 VU 各自載入造成記憶體浪費：

import { SharedArray } from 'k6/data';

const users = new SharedArray('users', function () {
  return JSON.parse(open('./users.json'));
});

資料來源可以是 production sample（脫敏後）或 synthetic generation。資料分佈需要接近 production — ID 範圍、key 分佈、payload 大小都會影響 query plan 與 cache 行為。

CI 整合實務

Fast path（每次 push）

固定 scenario + 短 duration（30s-2min），用 constant-arrival-rate 做 regression 偵測。threshold 設在 production baseline + 10%。這一層的目的是快速攔住明顯退化，不需要模擬完整峰值。

Slow path（merge gate）

完整 scenario + 較長 duration（5-15min），包含多 cohort 與 ramping 模擬。threshold 涵蓋 path-level 指標。這一層的目的是深層驗證變更在接近真實壓力下的行為。

結果留存

k6 結果預設輸出到 stdout。CI 整合時用 --out flag 把結果送到時序資料庫（InfluxDB / Prometheus Remote Write / Grafana Cloud k6），讓歷史趨勢可查詢。趨勢比較能偵測 threshold 內但持續惡化的 slow drift。

LinkedIn 的自動化壓測實踐把 load test 結果跟容量預測接在一起 — saturation point 隨時間的變化趨勢直接驅動擴容決策。

邊界與陷阱

Threshold variance：CI runner 的硬體差異（shared runner 的鄰居效應、network jitter、GC pause）會讓同一份 code 在不同 run 產生不同結果。控制方式：dedicated runner 消除鄰居效應、warmup iteration 丟棄前幾輪結果、多次 run 取中位數。若 variance 超過 threshold 的退化幅度，gate 判讀就不可信。
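The "median of several runs" control and the variance check can be sketched as follows. `gateIsTrustworthy` is a hypothetical helper, and using the min-max spread relative to the median as the variance measure is a simplifying assumption.

```javascript
// Median of a list of numbers (input is not mutated).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Compares run-to-run spread with the degradation margin the gate must detect.
function gateIsTrustworthy(p95Runs, tolerance) {
  const m = median(p95Runs);
  const spread = (Math.max(...p95Runs) - Math.min(...p95Runs)) / m;
  return { median: m, spread, trustworthy: spread < tolerance };
}

const stable = gateIsTrustworthy([410, 400, 430], 0.1);
const noisy = gateIsTrustworthy([300, 400, 500], 0.1);
console.log(stable, noisy);
```

When the spread exceeds the tolerance, fixing the runner environment comes before tuning the threshold.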

門檻過寬或過緊：threshold 永遠通過代表 gate 形同虛設；threshold 頻繁 false positive 會讓團隊忽略 CI 結果。兩者都會讓 gate 失去判讀價值。校準的判準是：過去 30 天的 threshold 結果中，真正需要關注的退化是否都被攔住，同時 false positive 率低於 5%。

Scenario 跟 production drift：production 的流量結構會隨產品演進改變。定期（每月或每次重大功能上線）用 access log 校準 scenario 的 RPS、cohort 比例與資料分佈，避免模型越跑越偏。

整合路由

上游概念：6.2 load testing 的 workload model 設計
下游能力：6.13 performance regression gate 的 baseline 管理與退化定位
平行 vendor：Gatling、Locust、JMeter
案例回寫：Shopify BFCM 容量治理（game day load test 對齊峰值形狀）、LinkedIn Automated Load Testing（持續壓測驅動容量預測）

Sloth：SLO YAML 與 Multi-burn-rate Alert 生成

Tue, 23 Jun 2026 00:00:00 +0000

問題情境

SLO 從定義到 Prometheus 落地需要多層 rule。一個 SLO 對應 4 組 time window 的 recording rule（計算各窗口的 burn rate），再對應 fast burn 和 slow burn 兩組 alerting rule。手動維護這些 rule 容易出錯：window 參數不一致、新增 SLO 忘記補 alert、修改 SLI expression 只改了部分 rule。

Sloth 的責任是把這個過程自動化。輸入一份 SLO YAML，產出一組完整的 Prometheus recording + alerting rules，讓 SLO 維護回到宣告式：改 YAML、重新生成、載入 Prometheus。

SLO YAML 設計

Sloth YAML 的核心結構是 version → service → slos[]。每個 SLO 定義三件事：目標數字（objective）、量測方式（SLI）、告警等級（alerting）。

version: prometheus/v1
service: checkout-api
slos:
  - name: availability
    objective: 99.9
    description: "Request success rate of the checkout API"
    sli:
      events:
        error_query: sum(rate(http_requests_total{service="checkout",code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{service="checkout"}[{{.window}}]))
    alerting:
      name: CheckoutAvailability
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning

SLI 有兩種類型。events-based SLI 用 error/total ratio 定義，Sloth 自動把 {{.window}} 參數代入各 recording rule 的 range vector。raw SLI 直接寫 PromQL expression 算 error ratio，適合非 request-based 的 SLO（如 data freshness、replication lag）。

raw SLI 範例 — data freshness：

  - name: data-freshness
    objective: 99.5
    sli:
      raw:
        # Error ratio rises with lag: 0 when fresh, 1 at 60s of lag or more.
        error_ratio_query: |
          clamp_max(
            replication_lag_seconds{service="checkout-db"} / 60,
            1
          )

objective 數字的來源是 6.6 SLO 政策 — 先從使用者旅程定義服務承諾，再把承諾轉成 objective。Sloth 不負責決定 objective 該是多少，只負責把 objective 轉成可執行的 Prometheus rule。

alerting 分 page（嚴重，觸發即時通知）和 ticket（一般，產生工單）。兩者的 burn rate 門檻不同：page 用 fast burn window，ticket 用 slow burn window。label 設計跟 Alertmanager routing 對齊 — severity: critical 走 PagerDuty / Slack alert channel，severity: warning 走 ticket system（Jira / Linear）。

Multi-window Multi-burn-rate Alert

Sloth 預設產生 Google SRE 推薦的 4-window alert 結構。每個 SLO 生成以下 recording rules 和 alerting rules。

Window 組合	責任	觸發行動
5m / 1h	Fast burn 偵測	短時間大量消耗 → page 通知
30m / 6h	Moderate burn 偵測	中速消耗 → page 或 ticket
2h / 1d	Slow burn 偵測	緩慢消耗 → ticket
6h / 3d	Very slow 偵測	長期趨勢退化 → ticket 或 review

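The fast burn alert answers whether the error budget is being consumed rapidly. A burn rate of 14.4 means the budget is spent 14.4 times faster than the SLO allows: sustained for 1 hour it consumes about 2% of a 30-day budget, and sustained indefinitely it exhausts the whole budget in about 50 hours. When both the 5-minute and 1-hour windows exceed 14.4, the page fires. Shorter windows are paired with higher burn-rate thresholds because short windows are noisier, so each one is confirmed by a longer window.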

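The slow burn alert answers whether the error budget is being consumed slowly without anyone noticing. When the 6-hour window, confirmed by the 3-day window, shows a burn rate above 1 (meaning the budget would be just used up by the end of the period), a ticket is opened. Slow burn is often overlooked, but it is the most common reliability degradation pattern for services with a high change frequency: no single small regression is large enough to trigger the fast burn alert, and the overdraft is only discovered at month end.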

burn rate alert 跟 6.6 SLO error budget 政策直接對應：fast burn → 凍結變更；slow burn → 提高驗證門檻；budget 健康 → 正常發版。
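The arithmetic behind burn-rate thresholds can be sketched as below, assuming a 30-day SLO period. The function names are illustrative; the burn rates are the ones paired with the 1h, 6h, and 3d windows in the table above.

```javascript
const PERIOD_HOURS = 30 * 24; // 30-day SLO period

// Fraction of the period's error budget spent if burnRate holds for windowHours.
function budgetConsumedFraction(burnRate, windowHours) {
  return (burnRate * windowHours) / PERIOD_HOURS;
}

// Hours until the whole budget is gone at a constant burn rate.
function hoursToExhaustion(burnRate) {
  return PERIOD_HOURS / burnRate;
}

const examples = [
  { burnRate: 14.4, windowHours: 1 },
  { burnRate: 6, windowHours: 6 },
  { burnRate: 1, windowHours: 72 },
].map((e) => ({
  ...e,
  consumed: budgetConsumedFraction(e.burnRate, e.windowHours),
  hoursLeft: hoursToExhaustion(e.burnRate),
}));
console.log(examples);
```

A burn rate of exactly 1 spends the budget over the full 720 hours, which is why it marks the boundary between sustainable and unsustainable error rates.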

Sloth 產出的 recording rule 範例（5m window）：

- record: slo:sli_error:ratio_rate5m
  expr: |
    sum(rate(http_requests_total{service="checkout",code=~"5.."}[5m]))
    /
    sum(rate(http_requests_total{service="checkout"}[5m]))
  labels:
    sloth_service: checkout-api
    sloth_slo: availability

對應的 alerting rule（fast burn）：

- alert: CheckoutAvailabilityFastBurn
  expr: |
    slo:sli_error:ratio_rate5m{sloth_slo="availability"} > (14.4 * 0.001)
    and
    slo:sli_error:ratio_rate1h{sloth_slo="availability"} > (14.4 * 0.001)
  labels:
    severity: critical

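The fast burn alert requires both the 5m and 1h windows to exceed the threshold: the long window confirms that a meaningful share of the budget has been burned and filters out brief spikes, while the short window confirms the burn is still happening, so the alert clears soon after recovery.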
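The two-window condition can be expressed as a small predicate. This is a sketch of the logic only, not code that Sloth generates; `fastBurnFiring` is a hypothetical name and the inputs stand in for the values of the 5m and 1h recording rules.

```javascript
// Fires only when both windows exceed burnRate * error budget.
function fastBurnFiring(errorRatio5m, errorRatio1h, objectivePercent, burnRate = 14.4) {
  const errorBudget = 1 - objectivePercent / 100; // about 0.001 for 99.9
  const threshold = burnRate * errorBudget;
  return errorRatio5m > threshold && errorRatio1h > threshold;
}

// Sustained incident: both windows are above roughly 0.0144.
const sustained = fastBurnFiring(0.05, 0.02, 99.9);
// Brief spike: the 1h window has not accumulated enough errors.
const spikeOnly = fastBurnFiring(0.05, 0.001, 99.9);
// Recovered: the 1h window is still high but the 5m window has cleared.
const recovered = fastBurnFiring(0.001, 0.02, 99.9);
console.log({ sustained, spikeOnly, recovered });
```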

實作流程

CLI 生成

sloth generate -i slo.yaml -o rules.yaml
sloth validate -i slo.yaml

generate 產出的 rules.yaml 包含所有 recording rules 和 alerting rules，直接放入 Prometheus 的 rule_files 載入。validate 在 CI 中先行檢查 YAML 格式與 SLI expression 語法。

K8s Operator mode

Sloth 提供 K8s Operator，用 PrometheusServiceLevel CRD 定義 SLO。Operator 自動 reconcile，把 CRD 轉成 Prometheus rules 並同步到 Prometheus Operator 的 PrometheusRule 資源。

Operator mode 適合 K8s-native 環境：SLO 定義跟 service deployment 放在同一個 GitOps repo，變更 SLO 跟變更服務走同一套 PR review + CI 流程。

CI / GitOps 整合

在 CI pipeline 中跑 sloth validate 驗證 YAML，再跑 sloth generate 產出 rules，commit 進 GitOps repo。Prometheus 透過 config reload 或 Operator reconcile 載入新 rules。這條流程讓 SLO 變更有版本歷史、有 review、有 rollback 能力。

邊界與陷阱

Sloth 只支援 Prometheus 作為後端。若觀測平台是 Datadog、New Relic、Honeycomb 或 Grafana Cloud，需要各平台自己的 SLO 功能或 Nobl9 的 multi-source 整合。

SLI expression 錯誤是最常見的問題。分母為零（service 沒有流量）會產生 NaN，cascading 到所有 recording rule。label 不匹配（service label 拼錯）會產生空 series，alert 永遠不觸發。sloth validate 檢查語法但不檢查 Prometheus 中是否真的有對應 series — 上線後需要用 Prometheus query 確認 recording rule 產出非空結果。

SLO 數量增長會累積 recording rule 成本。每個 SLO 產生約 30 條 recording rule（4 windows × 多組 aggregation）。100 個 SLO 產生 3000 條 rule，Prometheus 的 rule evaluation 會消耗明顯的 CPU 和記憶體。定期監控 prometheus_rule_evaluation_duration_seconds 和 prometheus_rule_group_rules，在 rule 數量影響 evaluation latency 前調整。

升級路徑：Sloth YAML 跟 OpenSLO spec 部分相容。從 Sloth 移到 Nobl9 時，SLO 定義的語意可以保留，SLI expression 需要改寫成 Nobl9 的 data source query。這條路徑適合從 Prometheus-only 環境逐步擴展到 multi-source SLO governance。

整合路由

上游：6.6 SLO 與 Error Budget 政策 — SLO 定義與 objective 來源
下游：6.8 Release Gate — burn rate alert 觸發凍結
平行：Nobl9（SaaS multi-source）、Pyrra（K8s-native + UI）
案例回寫：Google G1（error budget policy 原典）、Honeycomb HC1（burn rate 驅動可靠性操作）

Cloudflare WAF

Mon, 18 May 2026 00:00:00 +0000

Cloudflare WAF 是 edge-deployed 的 Web Application Firewall、跑在 Cloudflare 全球 anycast 網路上、攔截 HTTP/HTTPS 攻擊在抵達 origin 之前。它跟 AWS WAF / Fastly Next-Gen WAF 的核心差異是 跟其他 Cloudflare 產品深度整合：DDoS protection、Bot Management、Rate Limiting、Page Shield（JS supply chain）、API Shield（schema validation）、Zero Trust、Workers 邊緣計算共用同一個控制面。客戶選 Cloudflare WAF 通常不只是要 WAF、是要 整套 edge security suite。

服務定位

Cloudflare WAF 的核心定位是 把攻擊擋在 origin 之前的一站式 edge security。流量打到 Cloudflare anycast IP、經過 WAF / DDoS / Bot / Rate Limit / Page Shield 多層處理、再 proxy 到 origin。這跟 AWS WAF 跑在 AWS 內部 ALB / CloudFront / API Gateway 前是不同部署模型 — AWS WAF 流量 已經進到 AWS、Cloudflare WAF 流量 還沒到 origin。對 origin 是 任意雲 / on-prem 的客戶、Cloudflare 是天然選項；對 AWS-only 客戶、AWS WAF 整合更深但 edge 範圍小。

跟 Fastly Next-Gen WAF（前 Signal Sciences）相比、Cloudflare 走 signature + managed rule + ML 混合、Fastly NG-WAF 走 語意分析 + behavioral detection（不靠 regex signature）。Cloudflare managed rule 覆蓋廣但 false positive 較常見、需要 sensitivity tuning；Fastly NG-WAF 預設較低 FP 但需要 自己定義業務 anomaly。

關鍵張力：客戶信任的不只是 WAF rule 攔截能力、還包括 Cloudflare control plane 的安全性。Cloudflare 2023 control plane token 跟 Cloudflare 2026 route leak 兩個事件展示：vendor 自己被打進去 / 自動化配置失誤時、客戶側 直接修不了、只能等公告 + 客戶側 token rotation + emergency bypass。

本章目標

讀完本頁、讀者能判斷：

Cloudflare WAF 在 edge security stack 中承擔哪一段（DDoS / WAF / Bot / Page Shield / API Shield）、哪些要靠 origin 自己做
Managed Rule vs Custom Rule 的取捨、sensitivity tuning 跟 false positive curve
Cloudflare control plane 出事時的客戶側補強路徑（API token rotation、Origin Rules bypass、第二邊界 fallback）
何時用 Cloudflare、何時走 AWS WAF / Fastly NG-WAF 的取捨

最短判讀路徑

判斷 Cloudflare WAF 配置是否健康、最少看四件事：

誰能改 WAF 規則：Cloudflare account 的 admin / member role 配置、API token scope（不要用 Global API Key、用 scoped API token + 限定 zone / 限定 permission）、Audit Log 是否同步到 SIEM
規則覆蓋面：Managed Ruleset（OWASP Core Ruleset + Cloudflare Managed Ruleset + Exposed Credentials Check）是否開、Sensitivity（Low / Medium / High）對應的 FP rate 是否監控、Custom Rule 是否進版控（Terraform provider）
入口暴露：origin IP 是否曝光（DNS 直查 / 歷史 SAN cert / 子域名）、Argo Tunnel / Authenticated Origin Pull 是否強制、繞過 Cloudflare 直連 origin 的路徑是否封住
證據可回查：Security Events Log 是否同步到 SIEM（Logpush 推到 R2 / S3 / Splunk）、Page Shield 偵測異常 script 是否 alert、API token 異常操作（特別 zone settings 變更）是否 alert

四件事任一缺失、就是 Audit Log 與 Entry Point Protection 邊界的待補項目。

日常操作與決策形狀

Managed Ruleset 分層：Cloudflare 提供三類 managed rule — OWASP Core Ruleset（OWASP CRS、寬覆蓋、FP 較多）、Cloudflare Managed Ruleset（Cloudflare 維護、針對熱門 CMS / framework）、Exposed Credentials Check（檢測登入流量中的已洩漏 credential）。production 通常開全部三套 + 各設適當 sensitivity。Sensitivity 不是「敏感度越高越好」— High sensitivity 攔截更多 borderline traffic、business-critical endpoint 可能誤殺合法請求。建議從 Log Mode 開始、觀察 1-2 週的 FP pattern、再切到 Block。

Custom Rule（Cloudflare Rules）：用 Rules language（類 SQL 表達式）定義條件 + 動作（Block / Challenge / Log / JS Challenge / Managed Challenge）。常見用法：geo block（特定國家）、known bad IP（threat intel feed）、URI path-based limit（admin endpoint 限定 IP）、header anomaly（缺 User-Agent / 異常 Referer）。所有 Custom Rule 走 Terraform provider 進版控、避免 console 直接改、變更走 PR review。

Rate Limiting：跟 WAF rule 是 獨立 product、配置是 threshold + window + action（例：1000 req/min per IP → challenge）。Rate Limiting 比 WAF 適合處理 legitimate-looking high volume（credential stuffing、scraping、API abuse）。注意 NAT pool IP 的問題 — 一個公司 / ISP NAT 出口可能合法產生高 QPS、簡單 per-IP rate limit 會誤殺、需要組合 cf.threat_score 或 cookie-based identification。
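Why the counting key matters for NAT users can be shown with a small fixed-window counter. This is an illustrative model, not Cloudflare's implementation; `createLimiter`, `sessionId`, and the limits are hypothetical, and in Cloudflare the equivalent choice is made in the rate limiting rule configuration.

```javascript
// Fixed-window limiter keyed by session when present, otherwise by IP.
function createLimiter({ limit, windowMs }) {
  const counters = new Map(); // key -> { windowStart, count }
  return function allow(request, nowMs) {
    const key = request.sessionId
      ? `sid:${request.sessionId}`
      : `ip:${request.ip}`;
    const entry = counters.get(key);
    if (!entry || nowMs - entry.windowStart >= windowMs) {
      counters.set(key, { windowStart: nowMs, count: 1 });
      return true;
    }
    entry.count += 1;
    return entry.count <= limit;
  };
}

const allow = createLimiter({ limit: 2, windowMs: 60000 });
const natIp = '198.51.100.7'; // documentation-range address shared by two users
const decisions = [
  allow({ ip: natIp, sessionId: 'a' }, 0),
  allow({ ip: natIp, sessionId: 'a' }, 10),
  allow({ ip: natIp, sessionId: 'b' }, 20), // other user is unaffected
  allow({ ip: natIp, sessionId: 'a' }, 30), // session a is over its limit
  allow({ ip: natIp, sessionId: 'a' }, 60000), // a new window has started
];
console.log(decisions);
```

With a pure per-IP key, the third request would already count against the same bucket as the first two.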

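Bot Management (separate SKU): the free WAF plan does not include Bot Management. The full product, with per-request bot scores, is sold as an Enterprise add-on, while Pro and Business plans get the more limited Super Bot Fight Mode. Bot Management uses ML and behavioral fingerprinting to separate human, good bot (search engines), likely bot, and verified bot traffic, and assigns a bot score (1-99). In a Custom Rule, a condition such as cf.bot_management.score < 30 picks out likely bots for handling. Simple user-agent filtering cannot stop modern headless browsers; that requires Bot Management.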

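Page Shield (JS supply chain protection): Page Shield monitors the JS and connection sources that a customer's pages load, and alerts on newly seen scripts or scripts flagged as malicious by threat intelligence. Its purpose is to defend against supply chain attacks on third-party scripts (Magecart-style), which the WAF cannot block because the attack happens in the browser rather than in origin traffic. Visibility comes from browser reports: Cloudflare adds a Content-Security-Policy-Report-Only header to responses, so no extra monitoring script has to be embedded in the page.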

API Shield：用 OpenAPI schema validation、auto-discovery API endpoint、mTLS 驗證、JWT validation。對於有 schema 的 API、可以擋掉 schema 不符的請求（多餘欄位、型別錯誤、缺必要欄位）— 比 generic WAF rule 精準。

Origin 暴露面收緊：Cloudflare 唯一有效的前提是 流量必須經過 Cloudflare。如果攻擊者拿到 origin 真實 IP（DNS 歷史記錄、漏洞披露網站、SSL cert SAN）、可以繞過 Cloudflare 直打 origin。控制方法：origin firewall 只允許 Cloudflare IP range 入站、Argo Tunnel（origin 主動建 outbound 連線到 Cloudflare、不開任何入站 port）、Authenticated Origin Pull（origin 用 cert 驗證請求來自 Cloudflare）三選一或組合。
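The allowlist check behind an origin firewall reduces to CIDR matching, sketched here for IPv4. The two ranges are an illustrative subset only; a real rule set should be generated from Cloudflare's currently published IP list and should cover IPv6 as well. `isFromCloudflare` is a hypothetical helper.

```javascript
// Converts dotted IPv4 text to an unsigned 32-bit integer.
function ipv4ToInt(ip) {
  const parts = ip.split('.').map(Number);
  if (parts.length !== 4 || parts.some((p) => !Number.isInteger(p) || p < 0 || p > 255)) {
    throw new Error(`invalid IPv4 address: ${ip}`);
  }
  return ((parts[0] << 24) | (parts[1] << 16) | (parts[2] << 8) | parts[3]) >>> 0;
}

// True when ip falls inside the CIDR block.
function inCidr(ip, cidr) {
  const [base, bitsText] = cidr.split('/');
  const bits = Number(bitsText);
  const mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

// Assumption: example subset, not the complete or current list.
const EXAMPLE_EDGE_RANGES = ['173.245.48.0/20', '103.21.244.0/22'];

function isFromCloudflare(ip) {
  return EXAMPLE_EDGE_RANGES.some((cidr) => inCidr(ip, cidr));
}

console.log(isFromCloudflare('173.245.50.10'), isFromCloudflare('203.0.113.7'));
```

In practice the same check belongs in a security group or host firewall rather than in application code, so that direct-to-origin traffic is dropped before it reaches the app.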

API token 治理：避免 Global API Key（全帳號 root token）、改用 scoped API token（限 zone + 限 permission + 限 IP + 限 TTL）。token 進 Secret Management / Vault、定期 rotate。對應 Cloudflare control plane token 2023 揭示的 lesson：Cloudflare 自己也踩過 token 治理不足、客戶側不能假設 vendor 完美。

核心取捨表

取捨維度	Cloudflare WAF	AWS WAF	Fastly Next-Gen WAF
部署位置	Cloudflare global edge（300+ POP）	AWS region 內 ALB / CloudFront / API Gateway 前	Fastly edge + Agent + Module（自管 Nginx / Apache / Envoy / IIS）+ Cloud WAF proxy、三模型可混
Origin 中立性	強 — origin 可以是任何雲 / on-prem	弱 — 跟 AWS 緊耦合（限 AWS service 前）	強 — Fastly CDN / 任何 origin
偵測模型	Signature + Managed Rule + ML	Signature + Managed Rule + Lambda 自訂	Signal / behavioral（語意分析、低 FP）
DDoS 內建	是 — 跟 WAF 同套餐	AWS Shield Standard 內建、Advanced 加價	內建 + Fastly DDoS
Bot Management	Paid add-on (Enterprise); Super Bot Fight Mode on Pro / Business	AWS WAF Bot Control	Paid add-on
JS supply chain	Page Shield（Business+）	無原生、靠後端 CSP / 第三方	Inline JS monitoring（Next-Gen WAF 部分）
API schema	API Shield（Enterprise）	AWS WAF + API Gateway request validator	NG-WAF inline + sigsci-agent
學習曲線	中 — UI / Rules language 易上手、Terraform 完整	較陡 — JSON policy + 跟 AWS service 整合多軌	中 — agent 安裝 + Signal 語意設定
第三方信任成本	高 — Cloudflare 控制面（2023、2026 自家事件）	中 — AWS 控制面、跟 IAM 同套	中 — Fastly 控制面（規模小、事件少但社群影響也小）
適合場景	Multi-cloud / on-prem origin、要整套 edge security	AWS-heavy、ALB / CloudFront 是主要入口	高 FP 容忍度低、業務有 schema、想避 regex signature

選 Cloudflare WAF 的核心訴求：多雲 / on-prem origin + 需要 整套 edge security suite（DDoS + WAF + Bot + Page Shield + API Shield） + 接受 Cloudflare 控制面風險、且有預算做 Enterprise tier 才能拿到完整功能。純 AWS-internal app + ALB origin 用 AWS WAF 整合更直接。

進階主題

Workers + Workers AI 作為 custom logic：當 managed rule + custom rule 表達力不夠（例：根據 user account tier 決定 challenge 強度、整合內部 risk score API）、可以用 Cloudflare Workers 寫 JavaScript / TypeScript / Rust 在 edge 執行。Workers AI 提供 edge ML inference、可以做 inline content moderation 或 anomaly detection。代價是 Workers code 進 Cloudflare 控制面、變更要走部署流程、debug 跟 origin 是兩條 trace。

Logpush 跟 SIEM 整合：Cloudflare Security Events 量大、free / Pro 在 dashboard 看、Business / Enterprise 走 Logpush 到 R2 / S3 / Splunk / Datadog / Sumo Logic。production 必須走 Logpush、不能只在 dashboard — 事件 30 天保留期是 Cloudflare 端、SIEM 留更久。Logpush 也是 SIEM 上做 跨來源 correlation 的前提（WAF event + origin app log + IdP log）。

Multi-account / Tenant：大企業有多個 Cloudflare account（不同 BU / 不同產品線）、要走 Cloudflare for SaaS 或 Account-level access、API token scope 要限定 account。Single account 多 zone 是常見小組織配置、但跨組織 / 跨產品線必須拆 account 隔離 admin compromise blast radius。

Magic Transit / Zero Trust integration：Magic Transit 是 L3 DDoS（不只 HTTP、TCP / UDP 也 anycast）、Zero Trust 是 employee access（取代 VPN）。跟 WAF 是不同產品、但常一起部署 — Magic Transit 防 L3/L4 attack、WAF 防 L7、Zero Trust 防內部 east-west。

排錯與失敗快速判讀

Managed Rule 誤殺合法請求：High sensitivity 開後 business endpoint 變慢 / 報錯 — 看 Security Events 找 rule_id、用 Custom Rule skip 該 rule 在特定 path / 特定 user-agent、不要全 zone 關 rule
Bot Management 太嚴 / 太鬆：bot score threshold 設不對、合法 API client 被當 bot、或攻擊者拿到 verified bot 假冒 — 用 Bot Analytics 看分數分布、調整 threshold 同時加白名單（API key + IP CIDR）
Rate Limit 誤殺 NAT 用戶：per-IP rate limit 在 NAT 出口 IP 上炸 — 改 per-session（cookie-based）或 cf.threat_score 條件
Origin IP 外洩：DNS 歷史 + 漏洞披露 + cert SAN 揭露真實 origin、攻擊繞 Cloudflare 直打 — 換 IP + 開 origin firewall（只允許 Cloudflare CIDR）+ Argo Tunnel
API token over-scoped：CI / 第三方 SaaS 拿到 Global API Key、整 account 都被改 — 改 scoped token、限 zone + permission + IP、進 Vault
Security Events 沒進 SIEM：事件只在 dashboard、跨來源 correlation 沒法做 — 配 Logpush + alert 規則
Page Shield 沒裝：客戶端 JS 被植入、伺服器端日誌看不到攻擊、第三方 script CDN 被打 — 啟用 Page Shield + CSP report-uri 雙軌
第二邊界沒設：完全依賴 Cloudflare、Cloudflare 出事流量全停（2023 / 2026 自家事件）— 高 SLA 服務應該設 fallback origin / secondary DNS（如 Route53 health check failover 到 Fastly 或直連 origin）

何時改走其他服務

需求形狀	改走
AWS-only + ALB / CloudFront origin	AWS WAF
低 FP 容忍 / 業務有 schema	Fastly Next-Gen WAF
純內部 mTLS / east-west	SPIRE + service mesh
Cert lifecycle	cert-manager / Let’s Encrypt
客戶端 JS supply chain	Page Shield + supply chain integrity
DDoS L3/L4	Cloudflare Magic Transit / AWS Shield Advanced

不在本頁內的主題

Cloudflare 完整 product line（Workers / Pages / R2 / D1 / Magic Transit / Zero Trust 各自細節）
WAF Rules language 完整語法 reference
Page Shield / API Shield Enterprise tier 完整功能對照
各 PCI DSS / SOC 2 / FedRAMP 合規矩陣
Cloudflare 在中國的部署模式（JD Cloud Union 合作）

案例回寫

Cloudflare WAF 在 07 案例庫有 兩個直接 vendor-level 事件 + 多個 edge-exposure 對照：

案例	跟 Cloudflare WAF 的關係
Cloudflare Control Plane Token 2023	直接 — Cloudflare 自家 API token 治理不足、客戶側必須假設 vendor 也會被打、API token rotation 跟 IP allowlist 必做
Cloudflare Route Leak 2026	直接 — 自動化路由配置錯誤導致流量擁塞、客戶側應有 secondary DNS / failover origin 預案
Citrix Bleed 2023 Session Hijack	對照啟示 — WAF 攔不住 edge appliance zero-day、需要「修補 + session 失效 + 異常清查」三同步
Fortinet SSL-VPN CVE 2023-27997	對照啟示 — vendor patch 前的臨時 WAF rule + 收斂可達來源是修補窗口期的標準動作
Log4Shell CVE-2021-44228	對照啟示 — WAF rule 是 emergency mitigation、但 exploitation 過 WAF 後在後端執行、不能單靠 WAF 防後端 supply chain
Okta-Cloudflare 2023 Support Supply Chain	對照啟示 — 上游 IdP 出事傳導到 Cloudflare admin 帳號、API token / admin session 要立即 rotate、不等供應商公告

下一步路由

上游：7.3 入口治理與伺服器防護
平行：AWS WAF、Fastly Next-Gen WAF
下游：7.4 資料保護與遮罩治理（WAF block 不夠時、資料層也要遮罩）
跨類：HashiCorp Vault（Cloudflare API token 存放）、Okta（Cloudflare admin 走 SSO）
跨模組：8 事故處理 vendor 清單（WAF block 事件 / Cloudflare 自家事件如何 routing 進 IR）
官方：Cloudflare WAF Documentation

HashiCorp Vault

Mon, 18 May 2026 00:00:00 +0000

HashiCorp Vault 是 self-hosted 的 secret management 控制面、解決三個核心問題：static secret 集中保管（KV engine、跟 Secret Management 卡同概念）、dynamic credential 即用即發即收（database / cloud / SSH engine 在請求時動態建立短期憑證）、encryption-as-a-service 與內部 PKI（transit engine 把加解密外包給 Vault、PKI engine 自簽憑證）。三件事在 cloud-native 替代品（AWS Secrets Manager / Google Secret Manager / Azure Key Vault）裡通常拆成不同 service、且綁單一雲。

服務定位

Vault 的核心定位是 跨雲 + 跨環境 + 跨 secret 形態的單一 secret 控制面。當組織同時跑 AWS + GCP + on-prem K8s、又需要 dynamic database credential + 內部 PKI + envelope encryption、用三個 cloud-native service 拼起來會出現 secret 治理鏈不連續（AWS 的 secret 怎麼授權 GCP service 取用、on-prem app 怎麼拿短期 cloud credential、內部 CA 跟外部 ACM 怎麼分工）。Vault 把這層 統一抽象 — 應用端只跟 Vault 講話、Vault 後端接各雲 KMS / database / PKI。

跟 AWS Secrets Manager / Google Secret Manager 相比、Vault 多了：dynamic credential engine（cloud-native 對應產品有限）、transit engine 做 encryption-as-a-service、PKI engine 自簽內部憑證、跨雲統一介面。代價是 自管運維（HA cluster、auto-unseal、replication、upgrade）— 跟自管 Keycloak 的取捨同類。HCP Vault（HashiCorp Cloud Platform）是 HashiCorp 託管版、把運維交還、但綁 HashiCorp。

本章目標

讀完本頁、讀者能判斷：

哪些 secret 適合 Vault（dynamic credential、跨雲、PKI、encryption-as-a-service）、哪些直接用雲端 native service 即可
Vault deployment 的最低安全需求（auto-unseal、HA、audit device、policy、replication）
Vault 自己出事時的降級路徑（seal storm、root token 復原、audit log gap）
何時用 Vault、何時走 Secrets Manager / Google Secret Manager / Azure Key Vault 的取捨

最短判讀路徑

判斷 Vault deployment 是否健康、最少看五件事：

誰能做什麼：root token 是否已 revoke、policy 是否走 path-based least privilege、admin 是否走 OIDC / AWS IAM auth 而不是 token、break-glass token 是否離線存
Auth method 收緊：AppRole / Kubernetes / OIDC / JWT auth 哪些開、role 對應的 policy 是不是過寬、TTL 是否短、bound_* 條件是否鎖（namespace / audience / subject）
Secret engine 設定：KV v2 開 versioning？dynamic engine（database / aws / pki）lease TTL 多久、max TTL 限制是什麼、revocation 是否驗證生效
Seal / unseal 治理：是否走 auto-unseal（KMS-backed）、recovery key 持有者跟 Shamir threshold、replication 跟 DR cluster 是否同步
證據是否可回查：audit device（file / syslog / socket）是否多 channel、是否同步到 SIEM、replay 攻擊防護是否開（HMAC + nonce）

五件事任一缺失、就是 Audit Log 與 Secret Management 邊界的待補項目。

日常操作與決策形狀

Auth method 設計：AppRole 適合不在雲端 metadata 內的 workload（on-prem、CI runner）但 secret_id 本身要妥善保管；Kubernetes auth 適合 K8s 內 workload、用 ServiceAccount token + projected token；AWS IAM auth 適合 AWS 內 workload、走 STS 簽名驗證、不需要存 secret；OIDC / JWT 適合 human admin + CI（GitHub Actions / GitLab CI 走 OIDC token）。每個 auth method 對應 一組 role、role 綁 policy 跟 TTL。

Secret engine 分層：KV v2（static secret + version history）作為基線；dynamic database engine（PostgreSQL / MySQL / MongoDB）發短期 DB user、max_ttl = 1h 級別、過期 Vault 自動 revoke；AWS / Azure / GCP secret engine 對 cloud account 發短期 STS credential / service account key；PKI engine 自簽憑證、跟 cert-manager 整合做 K8s workload mTLS；transit engine 做 envelope encryption — app 把資料丟給 Vault 加密、key 不離 Vault。

Policy（path-based）：Vault policy 是 path + capabilities（create / read / update / delete / list / sudo）的 mapping。常見錯配：給 secret/* read 等於整個組織所有 secret 都看得到、應該用 secret/data/{team}/* 之類前綴限定；admin policy 不要給 sudo 太寬、policy 變更走 PR review + CI apply。
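The difference between a broad and a prefixed policy can be sketched with a simplified matcher. It only handles a trailing `*` as a prefix match and ignores other Vault policy features (the `+` segment wildcard, deny capability, most-specific-match priority), so treat it as an illustration rather than Vault's evaluation logic. The team names are hypothetical.

```javascript
// Simplified: a trailing * means prefix match, otherwise exact match.
function pathMatches(pattern, path) {
  return pattern.endsWith('*')
    ? path.startsWith(pattern.slice(0, -1))
    : path === pattern;
}

// True when any rule covers the path and grants the capability.
function isAllowed(policy, path, capability) {
  return policy.some(
    (rule) => pathMatches(rule.path, path) && rule.capabilities.includes(capability)
  );
}

const tooBroad = [{ path: 'secret/*', capabilities: ['read'] }];
const scoped = [{ path: 'secret/data/payments/*', capabilities: ['read'] }];

const otherTeamSecret = 'secret/data/search/prod/db';
console.log(
  isAllowed(tooBroad, otherTeamSecret, 'read'),
  isAllowed(scoped, otherTeamSecret, 'read')
);
```

A check like this is cheap to run in CI against proposed policy changes before they are applied.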

Rotation 跟 lease 治理：static secret（KV）的 rotation 是 app 自己做（拿新 secret 後手動 update）；dynamic secret 是 Vault 控制 lease 生命週期、app 只要在 TTL 內續租即可。對應 Failure: Credential Rotation Without Scope：static secret 的 rotation 必須有 scope map — 哪些 service 用了同一把 secret、哪個 service 支援零停機 rotation、誰是 last to be rotated。沒這份 map 就會發生「rotate 後某個被遺忘的 cron job 認證失敗、整個下游崩」。

Seal / unseal 設計：Vault 啟動時 sealed、必須 unseal 才能服務。Shamir secret sharing 是預設（5 key holders、3 threshold）— 任何重啟需要找齊 3 個人合 unseal、production 場景幾乎都該換 auto-unseal（用 AWS KMS / GCP KMS / Azure Key Vault 當 master key custodian）。代價是 把 master key 託給雲廠 — 不接受的組織保留 Shamir + 嚴格 key holder rotation。

Audit device 是必開：Vault 預設不開 audit、要手動 enable（vault audit enable file path=/var/log/vault_audit.log）。沒 audit device 在 production = 事故時 連 token 被誰用過都查不到。建議多 channel（file + syslog + 推到外部 SIEM）— 單一 channel 失效（disk full、socket broken）Vault 會拒絕請求、影響 availability、所以多 channel 是必要冗餘。

Break-glass 與 root token：初始化時產生的 root token 應該 用完立刻 revoke、改用 admin policy + OIDC auth。break-glass scenario 用 recovery key 重新發 root token、recovery key 走 Shamir 多人持有 + 離線存。

核心取捨表

取捨維度	Vault (self-hosted)	HCP Vault	AWS Secrets Manager	Google Secret Manager	Azure Key Vault
部署模型	自管 cluster（HA + replication）	HashiCorp 託管	AWS managed	GCP managed	Azure managed
跨雲	強 — 同一介面跨 AWS / GCP / Azure / on-prem	強	弱 — 綁 AWS	弱 — 綁 GCP	弱 — 綁 Azure
Dynamic credential	DB / cloud / SSH engine 完整	同 OSS	無 — 僅 RDS / Redshift static rotation Lambda	無 — 自寫 Cloud Function；secret-less 走 WIF	無 — 純 static；secret-less 走 Managed Identity
PKI / transit	內建 PKI engine + transit engine	同 OSS	走 AWS ACM + KMS	走 cloud KMS + Certificate Authority Service	走 Azure Key Vault cert 功能
運維成本	高 — HA、upgrade、replication、cert 自己顧	低 — HashiCorp 顧	低	低	低
第三方信任成本	低 — 自管	中 — HashiCorp 控制面	中 — AWS 控制面	中 — GCP 控制面	中 — Microsoft 控制面
適合場景	跨雲、需要 dynamic credential、內部 PKI、預算允許	想要 Vault 能力但不想自管	AWS-heavy + 簡單 static secret	GCP-heavy + Workload Identity 已主導	Azure-heavy + Managed Identity 已主導
退場成本	中 — 自己掌握資料、但 dynamic engine 接線多	中	低	低	低

選 Vault 的核心訴求：跨雲 + dynamic credential + 內部 PKI + transit encryption 至少滿足兩項、且能投入 SRE 量能跑 HA cluster、有 SIEM 接 audit log、能接受 self-hosted 的 upgrade / cert / DB 運維成本。單純需要 AWS-only static secret rotation、直接用 Secrets Manager 更便宜更簡單。

進階主題

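Lease lifecycle governance for dynamic credentials: every credential issued by a dynamic engine carries a lease ID, and Vault revokes it automatically when the TTL expires (the database engine really does DROP USER, the cloud engine really does DeleteAccessKey). Work out the connection lifetime of the app's connection pool at design time: a DB connection keeps using the credential it was opened with, so a connection that outlives its lease ends up holding a stale credential. Common practice: keep the pool's maximum connection lifetime shorter than the lease TTL, keep the lease TTL above twice the connection idle timeout, and add a lease renewal mechanism (the app renews proactively at 50% of the TTL).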
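The timing relationships can be checked with a few lines of arithmetic. `leasePlan` and its field names are hypothetical; the 50% renewal point follows the text, and treating "connection lifetime at or beyond the lease TTL" as the risk condition is a simplifying assumption.

```javascript
// Derives the renewal point and flags pools whose connections can outlive the lease.
function leasePlan({ leaseTtlSeconds, connectionMaxLifetimeSeconds }) {
  if (!(leaseTtlSeconds > 0)) throw new RangeError('lease TTL must be positive');
  return {
    renewAtSeconds: Math.floor(leaseTtlSeconds / 2),
    connectionOutlivesLease: connectionMaxLifetimeSeconds >= leaseTtlSeconds,
  };
}

// 1h lease with a pool that recycles connections every 30 minutes.
const safe = leasePlan({ leaseTtlSeconds: 3600, connectionMaxLifetimeSeconds: 1800 });
// Same lease with a pool that keeps connections for 2 hours.
const risky = leasePlan({ leaseTtlSeconds: 3600, connectionMaxLifetimeSeconds: 7200 });
console.log(safe, risky);
```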

Transit engine（encryption-as-a-service）：app 不持 encryption key、把 plaintext 丟給 Vault encrypt API、拿 ciphertext 回來；解密時把 ciphertext 給 Vault decrypt API。Key 完全不離 Vault、所有 cryptographic operation 在 Vault 內、app 只需要 encrypt / decrypt capability。對應 Storm-0558 signing key chain 的對照啟示：key 不能 export 是減 blast radius 的關鍵設計 — transit 把這個原則內建。

PKI engine + cert-manager 整合：Vault PKI engine 可以當內部 root CA + intermediate CA、issue 短期 cert（hours-level）給 K8s workload；cert-manager 用 Vault PKI issuer 自動更新 cert。比起手動跑 OpenSSL CA、Vault PKI 的優勢是 cert lifecycle 進 Vault audit、跟 secret rotation 用同一套 evidence chain（呼應 credential rotation scoped evidence）。

Namespace（Enterprise）跟 multi-tenancy：Enterprise 版 namespace 是 tenant 邏輯隔離、每個 namespace 有自己的 auth method、policy、secret engine。OSS 版沒 namespace — 多團隊共用 Vault 要靠 path 命名規約 + policy prefix 拼隔離、邊界較鬆。大組織通常需要 namespace 才能避免單一 admin 跨 team 越界。

Replication（Enterprise）：Performance Replication（主從 + 多 region active）跟 DR Replication（純 standby）是兩個獨立功能。production HA 通常需要 同 region 的 cluster + 跨 region 的 DR replication、recovery key 跟 unseal 機制要跨 cluster 一致。

排錯與失敗快速判讀

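Audit device not enabled: audit was never enabled at production startup, so no forensic data exists when an incident hits. The startup checklist must include "enable audit before serving traffic", and the SRE runbook verifies it with a health check.
Policy too broad: granting read on all of secret/* means a single token holds every secret in the company. Scope paths with a prefix such as {team}/{env}/*, and run policy review through PRs.
Dynamic credential lease too long / no max_ttl: a DB user is still alive a week later, so an attacker who obtains it once keeps long-term access. Set lease TTL = 1h and max_ttl = 24h.
Auto-unseal KMS access not monitored: no alert fires on abnormal use of the AWS KMS / GCP KMS key behind Vault auto-unseal. Set alerts on the KMS side (a spike in Decrypt / Encrypt calls).
Replication lag without alerts: Performance / DR replication falls minutes to hours behind, and failover picks up stale state. Monitor the vault.replication.* metrics in Prometheus.
Root token not revoked: the root token from initialization is still in use, bypassing policy / audit / OIDC entirely. The initialization checklist must force revocation, and CI runs vault token lookup to verify that the root token is unusable.
Sealed with an unseal key holder unreachable: the production cluster needs an emergency restart, the Shamir threshold is 3, but one key holder is on vacation. Production must use auto-unseal, with recovery keys handled through a break-glass flow.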
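The "policy too broad" item lends itself to an automated check during PR review. This is a rough sketch, assuming the policy paths have already been extracted from the policy files as plain strings; `is_overbroad` and the mount/team/env layout are hypothetical conventions, not a Vault API.

```python
def is_overbroad(path: str, min_scoped_segments: int = 2) -> bool:
    """Flag a policy path whose wildcard appears before team and env are pinned.

    Assumes the layout <mount>/<team>/<env>/..., so three fixed segments are
    expected ahead of any wildcard ('*' glob or '+' segment wildcard).
    """
    fixed = 0
    for segment in path.strip("/").split("/"):
        if "*" in segment or "+" in segment:
            break
        fixed += 1
    # The mount itself counts as one fixed segment on top of team and env.
    return fixed < min_scoped_segments + 1


for candidate in ["secret/*", "secret/payments/*", "secret/payments/prod/*"]:
    print(candidate, "overbroad" if is_overbroad(candidate) else "ok")
```

The check is deliberately conservative: it only looks at path shape and says nothing about capabilities, so it complements human review rather than replacing it.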

何時改走其他服務

需求形狀	改走
AWS-only + 簡單 static secret	AWS Secrets Manager
GCP-only + 已用 Workload Identity	Google Secret Manager
Azure-only + 已用 Managed Identity	Azure Key Vault
大型 cryptographic / HSM 需求	CloudHSM（FIPS 140-2 Level 3、Vault auto-unseal 後端）
公開憑證 PKI（serving cert）	AWS ACM / Let’s Encrypt
K8s workload cert 自動化	cert-manager（可用 Vault 當 issuer）
跨服務 workload identity (SPIFFE)	SPIRE
Secret 全公司 rotation 證據鏈	7.5 Credential Rotation Scoped Evidence

不在本頁內的主題

Vault 完整 API reference 跟 CLI 詳盡用法
每個 secret engine 的內部實作細節（DB connection pool、cloud SDK 呼叫順序）
Enterprise 各 license tier 的功能對照
Terraform / Ansible 跟 Vault 整合的完整步驟
各 auth method 的 OIDC / SAML provider 設定教學

案例回寫

Vault 在 07 案例庫沒有直接 vendor-level 事件、以下案例採對照引用：

案例	跟 Vault 的關係（對照）
Failure: Credential Rotation Without Scope	static secret rotation 必須有 scope map — Vault KV 多 service 共用同一把 secret 時、rotation 要分批 + 雙軌驗證窗口、不能一次 push 全域更新
Microsoft Storm-0558 Signing Key Chain (red-team)	transit engine 的設計啟示 — key 不離保護邊界、即使被讀也搬不走、跟 HSM-bound 同 mindset
CircleCI 2023 Secrets Rotation (red-team)	CI 平台 secret 集中化的 blast radius — Vault AppRole secret_id 散落在 CI runner 時、CI 出事 = 大量 AppRole credential 一次外洩、需 scope tag + 優先級 rotation
Okta Support System 2023	對照啟示 — Vault 自己的 support / debug tooling（root token、recovery key）也是 secret leak vector、HAR 級別的事件可發生在任何 admin console

下一步路由

上游：7.6 秘密管理與機器憑證治理、7.13 偵測覆蓋率與訊號治理
平行：AWS Secrets Manager、Google Secret Manager、Azure Key Vault
下游：AWS KMS / Google Cloud KMS（Vault auto-unseal master key custodian）
下游：cert-manager（用 Vault PKI engine 作為 K8s workload cert issuer）
跨模組：8 事故處理 vendor 清單（Vault 事件如何 routing 進 IR 流程）
官方：Vault Documentation

Okta

Mon, 18 May 2026 00:00:00 +0000

Okta 是 SaaS Identity Provider 的事實標準。它承擔三個責任：human identity 的 SSO 與 MFA、application / cloud account 的 federation gateway、SCIM-based lifecycle 自動化（joiners / movers / leavers）。當公司把 SSO 集中到 Okta、員工的工作信任邊界就從「每個應用各自的密碼」變成「Okta tenant + 客服流程 + signing key」三件事是否安全。在 0.22 能力級買 vs 建的光譜上、把企業 SSO 交給 Okta 是認證 commodity「買」的代表選擇（feature SaaS 深度）；這個外包深度與遷出代價的權衡見外包深度卡。

服務定位

Okta 是 人類身份的控制面、不是 cloud resource permission engine。把 cloud IAM（AWS IAM、Google Cloud IAM、Azure RBAC）的角色指派交給 Okta 是常見組合 — Okta 負責「這個人是誰」、雲端 IAM 負責「這個身份能對 resource 做什麼」。Workforce Identity Cloud（員工）跟 Customer Identity Cloud（消費者、原 Auth0）是兩個產品線、安全模型跟事件分布都不同（本頁聚焦 Workforce、Auth0 見 Auth0 vendor）。

跟自管 IdP（Keycloak）相比、Okta 把 issuer 信任、signing key 生命週期、support tooling 都託管出去 — 代價是 第三方控制面的事故會直接打到自己（Okta 2022 Sitel 環境洩漏、2023 support system HAR token 外洩、2023 cross-tenant impersonation）。跟 cloud-native SSO（AWS IAM Identity Center）相比、Okta 的核心優勢是 多雲 + SaaS app 數百個 integration 預先建好、不是綁單一雲廠。

本章目標

讀完本頁、讀者能判斷：

Okta 該承擔哪一段 identity 控制（SSO / MFA / lifecycle / federation）、哪一段該交給雲端 IAM
Okta tenant 的信任邊界與最低稽核需求（admin role、API token、SCIM、support workflow）
Okta 自己出事時的降級路徑（emergency access、break-glass、out-of-band MFA）
何時用 Okta、何時走 Auth0 / Keycloak / AWS IAM Identity Center 的取捨

最短判讀路徑

判斷 Okta 配置是否健康、最少看四件事：

誰能做什麼：Super Admin / Org Admin / Read-Only Admin 的人數、是否走 Okta 自己的 access request workflow、是否強制 phishing-resistant 認證
憑證在哪裡：API token 的 owner、scope、TTL、是否走 OAuth service app 而不是 personal API token；service account 是否獨立 audit
入口如何暴露：SSO 是 SAML 還是 OIDC、IdP-initiated 是否關閉、admin console 是否限 IP / device trust、helpdesk reset 是否要 callback / out-of-band 驗證
證據是否可回查：System Log 是否同步到 SIEM、admin / token / impersonation 事件是否 alert、是否保留 90 天以上

四件事任一缺失、就是 Audit Log 與 Authorization 邊界的待補項目。

日常操作與決策形狀

Onboarding / lifecycle：HR 系統推 SCIM 進 Okta、Okta 推 SCIM 到下游 SaaS / 雲端 SSO。決策點是 誰是 source of truth — HRIS 還是 Okta 自己。混用會造成 stale account 與例外帳號無法收。

Policy（authentication）：Sign-On Policy 跟 Authentication Policy（New Policy Framework）兩套並行、要避免規則交疊。高風險操作（admin login、寫權限應用）應該強制 phishing-resistant factor（WebAuthn / passkey）、不只是 push MFA（Uber 2022 揭露：純 push MFA 抗不過 fatigue）。

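MFA factor selection: avoid SMS / voice as the primary factor. In 2024 Okta pushed telephony onto customers as BYO (Okta BYO Telephony case): the trust boundary moved from "Okta manages everything" to "the customer picks its own SMS provider", and a team that does not update its threat model in step ends up absorbing the SIM swap risk.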

API token / OAuth service app：personal API token 容易隨人員離職 stale、應該走 OAuth service app（client credentials）並把 scope 收到最小。token 不存 source code、走 Secret Management 取用。

Exception / break-glass：至少 2 個 break-glass admin、credential 離線存（紙本保險箱 / secret management 隔離 tenant）、走獨立 MFA（hardware key、不依賴主要 Okta tenant 的 push）、季度驗證可用。Okta tenant 整個失聯時這是唯一退路。

Audit / handoff：System Log 推進 SIEM、特別 alert 三類事件 — admin role 變更、API token 建立、impersonation / support access。Okta 2023 support system 事件展示：如果客戶沒 alert support impersonation 的 session、就只能等 Okta 公告。
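The three alert categories can be expressed as a small filter over System Log events once they land in the SIEM pipeline. The sketch below assumes events arrive as dictionaries with an `eventType` field and an `actor.alternateId` field; the three event type strings are assumptions to be checked against the Okta event type catalog before use, and `classify` / `alerts` are hypothetical helper names.

```python
from typing import Optional

# Assumed eventType values; verify each against the Okta event type catalog.
ALERT_EVENT_TYPES = {
    "user.account.privilege.grant": "admin-role-change",
    "system.api_token.create": "api-token-created",
    "user.session.impersonation.initiate": "support-impersonation",
}


def classify(event: dict) -> Optional[str]:
    """Map a System Log event to one of the three alert categories, or None."""
    return ALERT_EVENT_TYPES.get(event.get("eventType", ""))


def alerts(events: list) -> list:
    """Return (category, actor) pairs for events that should page on-call."""
    found = []
    for event in events:
        category = classify(event)
        if category is not None:
            actor = event.get("actor", {}).get("alternateId")
            found.append((category, actor))
    return found


sample = [
    {"eventType": "user.session.start", "actor": {"alternateId": "a@example.com"}},
    {"eventType": "system.api_token.create", "actor": {"alternateId": "b@example.com"}},
]
print(alerts(sample))
```

Keeping the mapping in one dictionary makes the alert scope reviewable in a PR, which matches the audit requirement the paragraph describes.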

核心取捨表

取捨維度	Okta	自管 Keycloak	AWS IAM Identity Center
控制面責任	Okta 託管 issuer / signing / support	自己跑 issuer、key rotation、HA、support	AWS 託管、限 AWS 帳號 + 已整合 SAML app
Integration	7000+ SaaS app 預建	OIDC / SAML 通用、specific app 要自己接	AWS 帳號 + 中等規模 SaaS
第三方信任成本	高 — Okta 出事客戶被動受害（2022 / 2023 多起）	低 — 自管、自己承擔運維	中 — 綁 AWS 信任邊界
運維成本	低 — SaaS	高 — HA、DR、cert、DB、upgrade 都要顧	低 — AWS managed
適合場景	多雲、大量 SaaS、需要 lifecycle 自動化	預算 / 主權 / 自管要求、不接受 SaaS IdP	AWS-heavy、員工數中等、SaaS 少
退場成本	高 — SAML / SCIM 接線分散在數百 app	中 — 自己掌握資料	中 — AWS 內部換

選 Okta 的核心訴求：跨雲 + 大量 SaaS app + lifecycle 要自動化、且能接受第三方控制面風險、有預算做完整 SIEM / break-glass / 第三方應變流程。

進階主題

Federation 跟 workload identity：Okta 對人類 SSO 強、對 workload identity 較弱。CI / 服務間用 AWS IAM role 的 OIDC trust、Google workload identity federation 比把 Okta API token 散到服務裡更安全。

Cross-tenant 邊界：B2B 合作（partner、contractor）要清楚是「partner 用自己 IdP 做 federation 進來」還是「partner 在我的 Okta tenant 開帳號」。2023 cross-tenant impersonation 事件（Okta Cross-Tenant case）揭示：admin 工具若沒限定 tenant scope、單一 admin compromise 會跨多 tenant 擴散。

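Device trust / posture: Okta Device Trust + EDR signal is the next layer after phishing-resistant MFA: beyond confirming that the user is who they claim to be, it confirms that the device is healthy. Organizations with a high BYOD share that cannot build this layer are left relying on human factors.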

Identity Threat Protection / ITP：Okta 2024 推的事件偵測 add-on、補 session anomaly、credential stuffing、impossible travel 等場景。本質是把 SIEM detection 的一部分內建、不是取代外部 SIEM。

排錯與失敗快速判讀

Admin account 過多：經常超過必要 — 用 Group Rules + Access Request workflow 收斂、把日常操作用 Read-Only Admin + 特定權限 group 替代
API token stale / 散落：personal API token 跟著員工離職留下 — 季度盤點、改 OAuth service app
SMS MFA 還是預設：MFA enrollment 沒強制 WebAuthn / passkey、新員工選最弱 factor — Authentication Policy 應該限制可選 factor
System Log 沒進 SIEM：事件只在 Okta UI、alert 沒接 on-call — 用 Log Streaming（CloudWatch / S3 / Splunk HEC）打進 SIEM、特定事件接 alert runbook
Helpdesk reset 無 callback：MGM 2023 / Caesars 2023 都是 helpdesk social engineering、需要 callback + out-of-band 驗證、不是 ticket 上看到「我忘記密碼」就 reset
Support 工具 session 沒監控：Okta 2023 support 事件揭示需要 alert support impersonation session 進入我的 tenant 的事件 — System Log 有對應事件、但通常沒 default alert

何時改走其他服務

需求形狀	改走
Customer / B2C identity	Auth0 vendor
自管 / 不接受 SaaS IdP	Keycloak vendor
AWS-only 員工 SSO	AWS IAM Identity Center
Microsoft 365 / Azure 重度組織	Entra ID（Azure RBAC vendor 頁） — Entra ID 是 Microsoft 自家 workforce IdP、跟 Okta 直接競爭、M365 + Azure 為主的組織通常直接用 Entra ID 而非疊一層 Okta
Cloud resource permission（非人類身份）	AWS IAM / Google IAM / Azure RBAC
事件偵測（不只 Okta 內部）	04 SIEM / detection 工具（04 observability 跟 07 SIEM 章節）
Secret / API key 治理	7.6 秘密管理與機器憑證治理

不在本頁內的主題

Okta 完整 SAML / OIDC 規格細節、SCIM schema 客製
Workforce vs Customer Identity Cloud 完整功能對照
Okta 各定價層級的功能差異
各 SaaS app 的 SSO 接線教學

案例回寫

案例	跟 Okta 的關係
Okta Support System Incident 2023	支援工具鏈納入身份治理、HAR session 透過個人 Chrome profile 同步外洩、客戶側必須 alert impersonation session
Okta Cross-Tenant Impersonation 2023	admin tool 缺 tenant scope、單一 admin compromise 跨 tenant 擴散
Okta BYO Telephony Shift	telephony 供應商責任轉移、客戶要重新評估 SMS 路徑威脅模型
Cloudflare 2023 Okta Token Follow-Through	上游 IdP 事件後客戶側的 token / session rotation 節奏、不該等供應商公告
Uber 2022 MFA Fatigue	純 push MFA 抗不過 fatigue、高風險操作要求 phishing-resistant factor
MGM 2023 Identity Lateral Impact	helpdesk social engineering 是 Okta-customer 通用入口、callback / out-of-band 驗證是控制面
Twilio 2022 Social Engineering	員工身份即客戶風險面、IdP 對員工帳號異常的隔離速度決定下游受損規模
Failure: Credential Rotation Without Scope	Okta API token / OAuth service app credential 的 rotation 必須分域、不能把多 service app 共用同一批 rotation 命令打

下一步路由

上游：7.2 身分與授權邊界、7.13 偵測覆蓋率與訊號治理
平行：Auth0 vendor、Keycloak vendor、AWS IAM Identity Center
下游：AWS IAM / Google Cloud IAM / Azure RBAC（Okta 之後的 cloud resource permission 層）
跨模組：8 事故處理 vendor 清單（Okta 事件如何 routing 進 IR 流程）
官方：Okta Documentation

Splunk

Mon, 18 May 2026 00:00:00 +0000

Splunk 是 SIEM（Security Information and Event Management）的事實標準、大企業 / 金融 / 政府的 SOC 主流選擇。2024 年被 Cisco 收購、產品線維持獨立發展。它跟 Elastic Security / Datadog Security / Google Security Operations 的差異在 計費模型 + ecosystem maturity + detection content 深度、偵測能力本身相近 — Splunk 的 ingestion-based pricing 是業界最貴的 SIEM 計費模式、但 detection content 跟 SOC tooling ecosystem 也是最成熟的。

服務定位

Splunk 的核心定位是 任意 log source 的統一查詢平台、SIEM 是其上的 application layer（Splunk Enterprise Security app）。底層是 Splunk Enterprise（自管）或 Splunk Cloud Platform（SaaS）、頂層產品包含：Enterprise Security (ES) — premium SIEM app、含 correlation rule、Risk-Based Alerting、ITSI 整合；SOAR（前 Phantom）— security orchestration / automated response；UBA（User Behavior Analytics）— ML-based anomaly detection。

跟 Elastic Security 比、Splunk 走 deeper but more expensive — SPL 比 KQL / EQL 表達力更強、detection content（Splunk Security Content 公開 YAML rules）覆蓋廣、ES app 的 Risk-Based Alerting 是業界先驅；但 ingestion-based pricing 在 TB/day 級別會痛。跟 Datadog Security 比、Splunk 走 security-first、Datadog Cloud SIEM 是 observability platform 加上 security view；Datadog 適合 cloud-native + 中等規模、Splunk 適合 enterprise + 跨 on-prem。跟 Google Security Operations（前 Chronicle）比、Google Security Ops 走 fixed-price by data、massive scale、Splunk 是 per-GB 累進、超大規模反而 Google 划算。

關鍵張力：ingestion-based 計費 ↔ 偵測覆蓋率 是 Splunk 客戶最大的 trade-off。為了省錢選擇性 ingest log（只進 Windows Event Log 不進 Linux auth log、只進 prod 不進 dev）、結果 Storm-0558 / Uber MFA 那種跨來源 correlation 抓不到。要看清楚自己 容忍多少偵測盲點換多少預算。

本章目標

讀完本頁、讀者能判斷：

Splunk 在 SOC stack 中承擔哪一段（log aggregation / SIEM / SOAR / UBA）、哪些要外接（Vault 管 service token、IdP log 來源治理）
SPL / correlation rule / detection content 的 ownership 設計（誰寫、誰 review、誰調 false positive）
Ingestion pricing trap 的應對（log priority tiering、Cribl / Cribl Stream 做 pre-filter、Splunk SmartStore 把冷資料丟 S3）
何時用 Splunk、何時走 Elastic / Datadog / Google Security Ops 的取捨

最短判讀路徑

判斷 Splunk deployment 是否健康、最少看四件事：

誰能改 correlation rule：Splunk admin / ES admin / KV store admin 的人數、SPL search 跟 saved search 是否走版控（Git → git-fusion / Splunk Cloud Versioned Configs）、rule change 是否經 PR review
Ingestion 治理：哪些 source 進 Splunk（IdP audit log / cloud control plane log / endpoint log / network log / app log）、是否有 log priority tier（critical / standard / archive）、Cribl Stream 是否在前面做 pre-filter / routing
Detection content coverage：Splunk Security Content（公開 YAML rule library）有多少 enabled、是否跟 MITRE ATT&CK 對照、自家 custom rule 是否補 organization-specific anti-pattern
Alert quality / SOC handoff：alert volume per day、SOC analyst triage time、false positive rate、alert 是否進 SOAR playbook 自動處理低風險、跟 8 incident response 的 routing 是否定義

四件事任一缺失、就是 Detection Coverage and Signal Governance 邊界的待補項目。

日常操作與決策形狀

Ingestion architecture：log 進 Splunk 三種路徑 — Universal Forwarder / Heavy Forwarder（agent-based，自管 host）、HTTP Event Collector (HEC)（push log via HTTP endpoint、SaaS / serverless workload 預設）、Splunk Add-on for 各 cloud / SaaS（cloud-native log pull）。production 通常混用：endpoint 用 Universal Forwarder、cloud control plane 用 Add-on（AWS / GCP / Azure / Okta）、自家 app 用 HEC。在前面接 Cribl Stream 做 routing / filtering / sampling 是大型 deployment 的標準補位。

SPL（Search Processing Language）：類 Unix pipe 的 | 串接（index=ids sourcetype=auth | stats count by user | where count > 100）、表達力強但學習曲線陡。SPL 是 first-class concept、不只是查詢工具 — saved search 變 correlation rule、scheduled search 變 alert、accelerated search 變 data model 加速。SPL 寫得好不好直接決定 偵測規則品質 + 查詢成本。

Correlation rule / Notable Event：ES app 把 high-confidence finding 轉成 Notable Event、進 Incident Review queue。Correlation rule 的反例是 single-event alert（看到一個 SSH brute force attempt 就 alert、SOC analyst 一天看 10000 個沒意義）— production rule 應該是 time-bounded aggregation（過去 5min 內 100 個 brute force from same IP）+ cross-source correlation（brute force IP 同時出現在 cloud control plane access）。
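The difference between a single-event alert and a time-bounded, cross-source rule is easier to see outside SPL. The following is an illustrative Python sketch of the logic only, not a Splunk rule: it assumes authentication failures arrive as (timestamp, source IP) pairs in time order and that the set of IPs seen on the cloud control plane is already available. The function name and constants are hypothetical.

```python
from collections import defaultdict, deque

WINDOW_S = 300   # 5-minute window from the text
THRESHOLD = 100  # failures per source IP within the window


def notable_ips(auth_failures, cloud_access_ips):
    """Return IPs with >= THRESHOLD failures inside WINDOW_S that also
    appear in cloud control plane access (cross-source correlation)."""
    windows = defaultdict(deque)
    flagged = set()
    for ts, ip in auth_failures:
        window = windows[ip]
        window.append(ts)
        # Drop failures that fell out of the sliding window.
        while window and ts - window[0] > WINDOW_S:
            window.popleft()
        if len(window) >= THRESHOLD and ip in cloud_access_ips:
            flagged.add(ip)
    return flagged


burst = [(i, "203.0.113.7") for i in range(100)]
print(notable_ips(burst, {"203.0.113.7"}))
```

A slow attacker spreading the same 100 attempts over a longer period never fills the window, and a noisy IP that never touches the control plane is not promoted, which is the volume reduction the paragraph argues for.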

Detection content lifecycle：Splunk Security Content 是 Splunk 維護的 OSS detection rule library、YAML format、跟 MITRE ATT&CK 對應。組織通常 先 import 全部 baseline、再選擇性 disable noisy 規則 + 新增 organization-specific 規則。Rule change 走 PR review、staging tenant 跑 24-48hr 觀察 false positive curve 才 promote 到 production。對應 Detection Engineering Lifecycle 的章節原則。

Risk-Based Alerting (RBA)：ES app 7.0+ 引入、給每個 user / asset 累積 risk score（取代逐 finding alert）、累積到 threshold 才 alert。處理 alert fatigue 的工程化做法：5 個 low-confidence signal 加總超過 threshold 比單一 high-confidence alert 更接近真實 attack pattern。對應 Alert Fatigue and Signal Quality。

SOAR integration：Splunk SOAR（前 Phantom）接 alert + playbook 自動執行 — 例如 leaked credential 自動 rotate（拉 Vault API）、suspect IP 自動加 firewall block（拉 Cloudflare WAF custom rule）、suspect user 自動 force MFA re-enroll（拉 Okta API）。playbook 進版控、定期 dry-run、不能黑箱 production fire-and-forget。

Ingestion pricing 治理：Splunk 按 ingestion volume（GB/day）計費、TB-scale deployment 年費千萬美元級別。實務治理：tier 1 log（IdP / cloud control plane / payment processor / DB audit）進 Splunk hot index、tier 2 log（app log / web access log）按 sampling / filtering 進 Splunk、tier 3 log（debug / verbose）走 SmartStore 到 S3 / GCS 冷儲存、或繞過 Splunk 直接打到 Elastic / data lake。Cribl Stream 在 forwarder 前 pre-filter 是業界標準作法、可省 30-50% ingestion cost。
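A back-of-the-envelope model helps when arguing about which tier a source belongs in. This is a sketch under stated assumptions: the source names, raw volumes, and the per-tier keep ratios (tier 2 sampled to 30%, tier 3 kept out of Splunk entirely) are made-up illustration values, not figures from Splunk or Cribl.

```python
# Fraction of raw volume that still reaches Splunk, per tier (assumed values).
KEEP_RATIO = {1: 1.0, 2: 0.3, 3: 0.0}


def daily_ingest_gb(sources, keep_ratio=KEEP_RATIO):
    """sources: list of (name, tier, raw_gb_per_day). Returns billed GB/day."""
    return sum(raw_gb * keep_ratio[tier] for _name, tier, raw_gb in sources)


sources = [
    ("idp_audit", 1, 50),      # tier 1: always ingested in full
    ("web_access", 2, 400),    # tier 2: sampled / filtered
    ("debug_verbose", 3, 550), # tier 3: routed to cold storage instead
]
raw_total = sum(raw for _n, _t, raw in sources)
billed = daily_ingest_gb(sources)
print(f"raw={raw_total} GB/day, billed={billed:.0f} GB/day")
```

The useful part is the shape of the calculation: detection-critical sources stay at ratio 1.0, and the savings come only from tiers where a blind spot is an explicit, reviewed decision.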

SmartStore 跟冷熱分離：SmartStore 把 indexer 的 warm + cold bucket 放到 S3 / Azure Blob / GCS、indexer 只保留 hot data + cache。意義是 retention 從幾個月延長到幾年但 cost 不線性漲。production deployment 幾乎都該開、不開等於每年砸錢買 EBS。

核心取捨表

取捨維度	Splunk	Elastic Security	Datadog Security	Google Security Operations
計費模型	Ingestion-based（GB/day、累進）	Resource-based（node / cluster size）	Per-host + per-event（events/month）	Fixed price by data tier（PB-scale 划算）
學習曲線	陡 — SPL 表達力強但 idiom 多	中 — KQL / EQL 較直觀	緩 — 沿用 Datadog observability 語法	中 — YARA-L 是新語法但結構清楚
部署模型	Self-hosted (Splunk Enterprise) / SaaS (Cloud)	Self-hosted / Elastic Cloud / Serverless	SaaS only	SaaS only（Google Cloud）
Detection content	Splunk Security Content（最豐富、社群活躍）	Elastic Prebuilt rules + Sigma 支援	Datadog Security Rules（中等）	Google YARA-L 內建 + Google threat intel
SOAR / Response	Splunk SOAR（前 Phantom、業界先驅）	內建 Cases + Endpoint response（Elastic Defend）	Workflow Automation（基本）	SOAR 內建（前 Siemplify）
跨來源 correlation	強 — data model + SPL 支撐	強 — EQL sequence + Lucene	中 — log + metrics + trace 同 plane	強 — UDM normalization + cross-tenant
Multi-cloud	強 — Add-on 覆蓋三大雲	強 — Beats / Agent 跨雲	強 — Datadog Agent 跨雲	GCP-first、跨雲靠 Forwarder
適合場景	Enterprise + 跨 on-prem / 多雲、預算允許	OSS-friendly、中大型、Elastic stack 已用	Cloud-native、observability 已用 Datadog	超大規模 ingestion、Google 雲 + 多雲 SOC
退場成本	高 — SPL / detection content / dashboard 量多	中 — Sigma / Lucene 較可移植	中	中

選 Splunk 的核心訴求：Enterprise scale + 跨 on-prem + detection content 跟 SOC tooling ecosystem 成熟、且能投入預算（千萬美元級別 license + Cribl pre-filter + SmartStore 冷儲存治理）+ 有 SOC team 維護 correlation rule 跟 SOAR playbook。中等規模 cloud-native 直接走 Datadog / Google Security Ops 更划算。

進階主題

Enterprise Security app 的 Risk-Based Alerting：RBA 把「事件 → alert」改成「事件 → risk score → 累積 → alert」、是 alert fatigue 的工程化解法。實作要決定 risk decay window（多久後 risk score 衰減）、risk attribution（同一台 EC2 上多 user 的 risk 怎麼分）、per-asset vs per-user threshold。配對 Uber 2022 MFA Fatigue 的 lesson：單一 MFA fail 不該 alert、5min 內 50 個 fail + 新裝置 + 異常地理就是 high risk。
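The "event → risk score → accumulate → alert" flow and the decay window can be sketched in a few lines. This is a simplified illustration in Python, not how the ES app implements RBA: it uses hard expiry instead of gradual decay, and `RiskLedger`, its parameters, and the scores are hypothetical.

```python
class RiskLedger:
    """Accumulate per-entity risk and alert only when the sum crosses a threshold."""

    def __init__(self, threshold: float, decay_window_s: int):
        self.threshold = threshold
        self.decay_window_s = decay_window_s
        self._events = {}  # entity -> list of (timestamp_s, score)

    def add(self, entity: str, ts: int, score: float) -> bool:
        """Record one low-confidence signal; return True if the entity should alert."""
        events = self._events.setdefault(entity, [])
        events.append((ts, score))
        # Hard expiry: contributions older than the decay window no longer count.
        cutoff = ts - self.decay_window_s
        events[:] = [(t, s) for t, s in events if t >= cutoff]
        return sum(s for _t, s in events) >= self.threshold


ledger = RiskLedger(threshold=50, decay_window_s=3600)
results = [ledger.add("alice", ts, 10) for ts in (0, 60, 120, 180, 240)]
print(results)
```

Per-asset versus per-user thresholds would map to separate ledgers or separate key spaces; the sketch leaves risk attribution across shared hosts out of scope.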

Common Information Model (CIM) + Data Model：Splunk CIM 把不同 source 的欄位 normalize 到統一 schema（authentication / network_traffic / web 等 data model）。意義是 SPL 跨 source 寫一次、不用為 Okta log / Azure AD log / CrowdStrike log 各寫一份。CIM 配合 Add-on 自動 mapping、organization 寫 custom source 需要自己定 CIM mapping。

Multi-tenant deployment：MSSP / 大型集團多 BU 共用一個 Splunk 部署、用 index（隔離 data）+ role / capability（隔離 access）+ App（隔離 dashboard / search）三層。注意 Splunk admin 在跨 tenant 場景是高權限角色、應該走 break-glass 流程 + audit。

Cisco 整合（2024+）：Cisco 收購後 Splunk 跟 Cisco XDR / Talos threat intel / Cisco Secure Endpoint 整合加速。對 Cisco-heavy 環境是 ecosystem 一致性增加；對非 Cisco 環境暫時影響有限、但長期 roadmap 會有 Cisco-specific 加值。

排錯與失敗快速判讀

Alert volume 爆炸 / SOC 看不完：correlation rule 寫成 single-event alert、或 false positive baseline 沒調 — 用 RBA 改 risk-based、staging tenant 跑 48hr 觀察再 promote
Detection coverage 出事故時才發現缺：critical log source 沒進 Splunk（為了省錢）— 補回 tier 1 log priority、用 Cribl Stream 對 tier 2 / 3 做 sampling 而非整批不 ingest
Ingestion cost 暴衝：新 source 加入沒 review、debug log 直接打進 Splunk — Cribl Stream 前置 + license usage dashboard alert + indexer ingestion quota
SPL search 慢 / 卡 search head：full-fidelity search on 1TB raw event、沒用 data model acceleration — 改用 accelerated data model、限定 time range、用 tstats 而非 stats
Correlation rule false positive 多：rule 寫得太寬、env-specific noise 沒 tune — staging tenant 跑 1 週統計 FP、tune threshold、加 lookup table 排除已知合法 source
SOAR playbook 黑箱 fire-and-forget：自動 disable account 結果誤殺 CEO — playbook 走 approval gate for high-impact action、defaults to containment not deletion
Splunk admin 太多 / 沒 break-glass：日常運維用 admin token、admin compromise blast radius 太大 — 收 admin 角色、改 power user + 特定 capability、break-glass 走 Vault

何時改走其他服務

需求形狀	改走
OSS-friendly / 預算敏感	Elastic Security
Cloud-native + observability 已用	Datadog Security
超大規模 ingestion + Google 雲	Google Security Operations
DLP / sensitive data discovery	Google DLP / Microsoft Purview
Endpoint detection 為主	CrowdStrike Falcon / Microsoft Defender for Endpoint
Pre-filter / log routing	Cribl Stream（前置 forwarder、不是替代 SIEM）
Incident routing	8 事故處理 vendor 清單

不在本頁內的主題

SPL 完整語法 reference、saved search 跟 macro 進階用法
Splunk Cloud Platform vs Splunk Enterprise 的功能對照細節
Splunk Observability Cloud（前 SignalFx 收購、跟 Datadog 直接競爭、屬 observability 不屬 security）
ITSI（IT Service Intelligence）— 屬 ITSM / observability、不在資安範圍
SOAR playbook 的具體實作（Phantom Python SDK）

案例回寫

Splunk 在 07 案例庫沒有直接 vendor-level 事件、但所有 detection-related case 都是 SIEM 偵測覆蓋率的對照：

案例	跟 Splunk 的關係（對照啟示）
Uber 2022 MFA Fatigue	MFA 請求密度應是 Splunk correlation rule first-class signal、5min window count > N 直接 alert + RBA 升級高風險 user score
Microsoft Storm-0558 Signing Key Chain	跨租戶 token 異常驗證需 Splunk Add-on for Azure AD + cloud control plane log 同時 ingest、跨來源 correlation 才能秒級偵測
Snowflake 2024 Credential Abuse	資料平台 query volume + 跨 schema scan + 來源 IP 異常的複合 correlation rule、不只看 audit log 也要 query metrics correlation
SolarWinds 2020 Sunburst	簽章驗證通過但 runtime 行為異常需 endpoint log + network log correlation、不靠 IoC-only 規則
Detection Engineering Lifecycle (section)	Splunk Security Content + 自家 custom rule 走 propose → staging tune → promote → review 的工程 lifecycle、不是 console 直改
Alert Fatigue and Signal Quality (section)	RBA 是工程化解 alert fatigue、不是「忽略低風險」、要設 risk decay + threshold tuning lifecycle

下一步路由

上游：7.13 偵測覆蓋率與訊號治理、Detection Engineering Lifecycle
平行：Elastic Security、Datadog Security、Google Security Operations
下游：Google DLP / Microsoft Purview（DLP signal 進 Splunk）
跨類：Okta（IdP log source）、HashiCorp Vault（SOAR playbook 拉 API）、Cloudflare WAF（WAF log + auto-block）
跨模組：8 事故處理 vendor 清單（Notable Event → IR routing）、4 observability（log pipeline 共用）
官方：Splunk Documentation

k6

Fri, 15 May 2026 00:00:00 +0000

k6 的核心責任是把 workload model 轉成可重跑、可版本化、可接到 CI 的壓測 scenario。它適合 API、HTTP、gRPC、WebSocket 與 browser-style flow 的負載驗證，重點在用程式化腳本描述使用者行為、負載階段、threshold 與結果輸出。

服務定位

k6 是 Grafana Labs 旗下的 scriptable load testing 工具、2021 年被 Grafana 收購。產品線分兩層：k6 OSS（Go 寫的 engine + JS API 描述 scenario、CLI 為主、output 可丟 Prometheus / InfluxDB / JSON / CSV）跟 Grafana Cloud k6（前 k6 Cloud、SaaS 多 region runner + 結果保存 + 跟 Grafana Cloud dashboard / Loki / Tempo 同 plane）。底層 engine 是 Go、不是 JS — JS 只是 scenario 描述層、runtime 由 Go 跑、所以單機 VU 容量比 Python-based 工具高出一個量級。

跟 JMeter 比、k6 走 code-first + CI-friendly、JMeter 走 XML / GUI + plugin ecosystem；JMeter 在 protocol 廣度（JDBC / LDAP / JMS / FTP）跟非工程團隊操作勝出、k6 在版控、PR review、artifact pipeline 勝出。跟 Locust 比、k6 用 JS、Locust 用 Python；Locust 對 Python team 自然、但 Python GIL 讓單機 VU 容量受限、需多 worker、k6 單機可跑數千 VU。跟 Gatling 比、Gatling 走 JVM + Scala/Java/Kotlin DSL、適合 JVM-heavy 團隊；k6 的 threshold + Grafana ecosystem 整合在 release gate 場景更直接。

定位

k6 適合把壓測納入工程流程。當團隊已經能描述 traffic shape、endpoint mix、arrival rate、think time 與 stop condition，k6 可以把這些模型寫成腳本，讓每次 release、capacity review 或 peak-event readiness 都能重跑同一組驗證。

這個定位讓 k6 接到三個主章。它從 9.2 Workload Modeling 接收流量模型，從 9.4 Saturation Discovery 接收 ramp-up 與 knee point 判讀，從 9.10 Production-Side 驗證接收 canary、dark launch 或 production-like load test 的安全邊界。

適用場景

API 壓測是 k6 最穩定的入口。Checkout、login、search、order query、payment callback mock 與 internal API 都可以用 scenario 表達，並用 threshold 把 latency、error rate 與 throughput 轉成 pass / fail 訊號。

CI performance gate 是 k6 的常見價值。團隊可以在 merge、nightly、pre-release 或 game day 前跑固定 baseline，觀察 p95 / p99、error rate、throughput 與 regression trend，再把結果交給 6.13 Performance Regression Gate。

Peak readiness rehearsal 適合用 k6 表達階段式負載。活動前可以用 ramping arrival rate 模擬 T-90、T-30、T-7、T-1 與 T-0 的負載階段，並把結果回寫到 9.11 高峰事件準備。

最短判讀路徑

判斷 k6 deployment 是否健康、最少看四件事：

Scenario design：用 executor: ramping-arrival-rate 而非 constant-vus、把 RPS / arrival rate 設成 first-class、VU 由 engine 自動算；scenario 描述跟 9.2 Workload Modeling 的 endpoint mix、think time、cohort 對得起來
Threshold gate：thresholds 區塊明確寫 p95 / p99 / error rate / throughput、CI fail 條件清楚、不靠人眼看 summary 判斷 pass / fail
Output 進 observability stack：--out experimental-prometheus-rw 把 metric remote-write 到 Prometheus、Grafana dashboard 接 k6 同 datasource、結果跟 target service 的 saturation metric 在同一張圖上看
k6 Cloud vs CLI 邊界：本地 CLI 跑 baseline + CI、Grafana Cloud k6 跑跨 region / 大規模 / 結果 retention；不要把 CI gate 放 Cloud（成本 + 時間不對）、也不要本地單機硬跑 100k VU（runner 自身瓶頸假象）

四件事任一缺失、就是 scenario 已經寫得不完整、threshold gate 失效、或 runner 觀測缺失。
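With an arrival-rate executor the VU count follows from the target rate rather than being chosen directly. A quick sizing estimate, based on Little's law (concurrency ≈ arrival rate × time in system), can sanity-check the pre-allocated VU pool. The helper below is a hypothetical sketch; the 25% headroom is an assumed safety margin, not a k6 default.

```python
import math


def vus_needed(target_rps: float, iteration_s: float, headroom: float = 1.25) -> int:
    """Estimate VUs required to sustain target_rps when one iteration
    takes iteration_s seconds on average (Little's law plus headroom)."""
    return math.ceil(target_rps * iteration_s * headroom)


# 500 iterations/s at roughly 350 ms per iteration
print(vus_needed(500, 0.35))
```

If the estimate approaches what a single runner can host, that is an early hint that the run belongs on distributed runners rather than one local machine.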

選型判準

判準	k6 的價值	需要補的能力
腳本化	scenario、threshold、setup / teardown 可版本化	production traffic 抽樣與模型校正
CI 友善	CLI 與 artifact 容易接 pipeline	長期趨勢儲存與 release gate 語意
API 導向	HTTP / gRPC / WebSocket 等常見 API 場景清楚	複雜瀏覽器互動與端到端資料準備
團隊學習成本	JavaScript 腳本容易被多數 backend 團隊接手	大型分散式 runner 與測試資料治理

腳本化價值來自可重跑。一次性的壓測只能回答當天配置能撐多少；可版本化 scenario 可以回答 release 後容量曲線有沒有漂移，並讓退化調查回到同一份 workload model。

CI 友善價值來自交接成本低。壓測結果要能轉成 artifact、threshold、trend 與 gate decision，才會從「工程師手動跑工具」變成 release 流程的一部分。

API 導向價值來自後端路徑明確。k6 很適合 checkout API、search API、internal API 與 webhook receiver；如果主要問題是完整 browser UX、第三方真實支付或多裝置同步，文章要把資料準備、side effect 與環境隔離另外寫清楚。

跟其他工具的取捨

k6 和 JMeter 的主要差異是工作方式。k6 偏程式化腳本、CLI、CI artifact 與工程流程；JMeter 偏 GUI、protocol plugin、既有企業測試流程與非工程團隊協作。

k6 和 Gatling 的主要差異是生態與語言。k6 使用 JavaScript-style 腳本，Gatling 偏 JVM / Scala / Java / Kotlin 生態；團隊語言能力與既有 pipeline 會影響維護成本。

k6 和 Locust 的主要差異是團隊技能與模型表達。Locust 使用 Python，對 Python 團隊與 custom user behavior 很自然；k6 的 threshold、CLI 與雲端 / Grafana 生態讓 release gate 整合更直接。

k6 和 Vegeta 的主要差異是場景複雜度。Vegeta 適合簡單 HTTP load、CLI workflow 與快速 saturation 探測；k6 適合較完整的 multi-step scenario、threshold 與長期 baseline。

核心取捨表

取捨維度	k6	JMeter	Locust	Gatling
Scenario 語言	JavaScript（ES6+）	XML（GUI 編輯）/ Groovy	Python	Scala / Java / Kotlin DSL
Engine runtime	Go	JVM	Python（gevent）	JVM（Akka）
單機 VU 容量	高（thousands+）	中（JVM heap-bound）	中低（GIL、需 multi-worker）	高（Akka actor）
CI 友善度	強 — CLI + threshold + JSON / Prometheus	中 — 需 plugin / Jenkins integration	中 — CLI 友善但 result reporting 較弱	強 — CLI + HTML report + Maven/Gradle plugin
Protocol 廣度	HTTP / gRPC / WebSocket / Browser	最廣（JDBC / LDAP / JMS / FTP / SMTP）	HTTP 為主、其他靠 custom client	HTTP / WebSocket / JMS / MQTT
Browser test	k6 Browser (Playwright-compatible API)	No native support (Selenium plugin)	No native support	No native support
Distributed	k6 Cloud / k6 Operator on k8s	Master / Slave（運維重）	Master / Worker	Gatling Enterprise / FrontLine
適合場景	API-first + CI gate + Grafana ecosystem	企業 + protocol 多 + 非工程團隊	Python team + custom user behavior	JVM team + DSL 表達力

選 k6 的核心訴求：API-first scenario + CI gate + Grafana / Prometheus ecosystem 已用、且團隊接受 JS DSL。Protocol 廣度需求大、走 JMeter；Python team、走 Locust；JVM-heavy、走 Gatling。

進階主題

k6 Browser：基於 Chromium + Playwright API、跑在 k6 同 scenario 內、可混 protocol-level 跟 browser-level load（前段 API call、後段真實 browser flow）。意義是「pure API load 跟 real user UX 在同一份 scenario」、不用維護兩套工具。但 browser VU 比 protocol VU 重幾十倍、runner cost 要重新算。

xk6 extensions：用 Go 寫 k6 extension、補 protocol（Kafka / Redis / SQL / AMQP）或 output（custom backend）。xk6 build 生出客製 binary、organization 可維護自家 extension。意義是 k6 不只跑 HTTP — Kafka producer load / Redis hot-key probe 都能用同一個 scenario harness。

Grafana Cloud k6（前 k6 Cloud）：SaaS 跑 multi-region runner、結果保存、跟 Grafana Cloud dashboard / Loki / Tempo / Prometheus 同 plane。適合 跨 region 真實延遲驗證、大規模 distributed run、結果 retention + team share。跟 Grafana Cloud 已用的團隊 ecosystem 一致；只用 OSS 的團隊走 k6 Operator on k8s。

Distributed execution：自管 distributed 走 k6 Operator on Kubernetes、scenario 拆 instance、結果 aggregate 到 output。意義是不需要 k6 Cloud 也能跑跨機器 load、但 runner pool 自管成本 + 結果 aggregation 自己處理。

Output integration：--out experimental-prometheus-rw 直接 remote-write 到 Prometheus、Grafana dashboard 一張圖看 k6 client metric + target service saturation；--out cloud 上 Grafana Cloud k6；--out json=... 落地檔案給 CI artifact；--out influxdb 接 InfluxDB（legacy）。Loki 用來接 k6 console log、Tempo 用來接 k6 trace（若 scenario 帶 W3C trace context）。

排錯與失敗快速判讀

VU 跑不上去 / runner CPU 滿：scenario 寫了重 JS 邏輯（big JSON parse、複雜 regex、crypto）— 把 setup-once 邏輯搬 setup()、不要每 VU iteration 重算
Resource throttling 假象：runner 機器 CPU / network bandwidth / file descriptor 自身瓶頸、target service 還沒到 saturation — 換大機 / 多 runner / 看 runner 自身 saturation metric 排除
Threshold 設過嚴 / CI 一直 red：threshold 抄 production SLO 不留 budget — staging tenant 跑 5-10 次抓 baseline distribution、threshold 設 baseline + buffer、不是 SLO 直接搬
p95 看起來好但 user 抱怨慢：scenario endpoint mix 跟 production traffic shape 不符 — 補 production endpoint distribution、按 weight 配 scenario、跟 9.2 Workload Modeling 對齊
Script logic 太重 / VU iteration 不穩：在 scenario 內做 token refresh / large payload 處理、iteration 時間漂移 — 用 executor: ramping-arrival-rate 鎖 RPS 而非 VU count、iteration 時間漂移由 engine 吸收
結果無法回放 / 找不到 baseline：output 沒落 artifact、Grafana dashboard 沒存 time range — 每次 run 強制 --out json + tag scenario version + push 到 evidence package
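The "baseline + buffer" advice for thresholds can be turned into a small calculation. The sketch below assumes the p95 values from several baseline runs have been collected by hand or from JSON output; taking the worst run and adding a 25% buffer is one possible policy among several, and the function name is hypothetical.

```python
def threshold_from_baseline(baseline_p95_ms, buffer_ratio=0.25):
    """Derive a latency threshold from repeated baseline runs:
    worst observed p95 plus a relative buffer."""
    if not baseline_p95_ms:
        raise ValueError("need at least one baseline run")
    return max(baseline_p95_ms) * (1 + buffer_ratio)


runs = [180, 200, 190, 185, 195]  # p95 in ms from five baseline runs (made-up values)
limit_ms = threshold_from_baseline(runs)
# k6 threshold expressions on http_req_duration take the form 'p(95)<N'
expression = f"p(95)<{round(limit_ms)}"
print(expression)
```

Regenerating the number whenever the baseline is re-measured keeps the gate tied to observed behavior instead of a copied SLO figure.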

操作成本

k6 的主要成本是 workload model 維護。腳本本身容易寫，真正的成本在 production endpoint mix、資料分布、tenant / region / user cohort、think time 與 peak shape 的持續校正。

Runner 成本會隨負載規模上升。單機 runner 適合小型 API baseline；跨 region、數十萬 RPS 或長時間 soak test 需要分散式 runner、網路成本、目標服務隔離與觀測儲存。

測試資料治理是高風險成本。Checkout、payment、order、email、notification 與 webhook 路徑都可能產生 side effect，因此 scenario 要明確定義 test tenant、idempotency key、mock boundary、cleanup 與 stop condition。

Evidence Package

k6 結果應回寫到 evidence package。最小欄位包括 scenario version、target environment、time range、VUs / arrival rate、threshold、p95 / p99、error rate、throughput、target service saturation metric、known gap 與 owner。

欄位	k6 證據來源
Source	k6 summary、JSON output、dashboard link
Time range	test start / end
Query link	Grafana / Prometheus / APM 查詢連結
Data quality	scenario coverage、test data freshness
Confidence	production similarity、runner capacity
Known gap	未覆蓋 endpoint、未模擬第三方、資料偏差

Evidence package 的核心用途是讓 release gate 能判斷。k6 的 threshold pass 只是其中一個訊號；gate 還要看 target service 的 CPU、connection、DB latency、cache hit rate、queue lag 與 cloud cost。
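To make the point that a threshold pass is only one signal, here is a minimal sketch of a gate that combines it with target-side saturation limits. The `EvidencePackage` fields are a subset of the minimum fields listed earlier, and the metric names and limits are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EvidencePackage:
    scenario_version: str
    target_environment: str
    threshold_passed: bool                           # k6 threshold result
    saturation: dict = field(default_factory=dict)   # metric name -> observed value
    known_gaps: list = field(default_factory=list)


def gate_decision(pkg: EvidencePackage, limits: dict):
    """Return (passed, reasons). limits maps metric name -> maximum allowed value."""
    reasons = []
    if not pkg.threshold_passed:
        reasons.append("k6 threshold failed")
    for metric, limit in limits.items():
        observed = pkg.saturation.get(metric)
        if observed is None:
            reasons.append(f"missing metric: {metric}")
        elif observed > limit:
            reasons.append(f"{metric}={observed} exceeds {limit}")
    return (not reasons, reasons)


pkg = EvidencePackage(
    scenario_version="v12",
    target_environment="staging",
    threshold_passed=True,
    saturation={"cpu_util": 0.92, "db_latency_ms": 40},
)
print(gate_decision(pkg, {"cpu_util": 0.8, "db_latency_ms": 50}))
```

Treating a missing metric as a failure reason is a deliberate choice here: a run without target-side evidence should not pass the gate by default.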

案例回寫

k6 目前在 09 案例庫中主要作為工具類承接點，案例主角仍是負載形狀與驗證節奏。它可回寫到 9.C15 Tixcraft 售票壓測的 pre-event load test 判讀、9.C1 Prime Day readiness 的 staged validation、9.C28 FanDuel 雙峰 workload 的多模型壓測需求、9.C2 GR8 Tech FIFA World Cup readiness 的 54000 TPS @ 25ms p95 驗證、以及 9.C7 Lyft 8x peak 跨 100+ 微服務的獨立 threshold 設計。

這些案例提供的是負載形狀與工程節奏。k6 頁引用案例時，要把 case 轉成 workload model、ramp-up、threshold、runner 規模與 stop condition，並讓工具回到可替換的承載選項 — 例如 GR8 Tech 25ms p95 是 threshold pass / fail 的硬目標、Lyft 的「8x 是特定服務、不是全部 8x」要拆成 per-service scenario。

下一步路由

PostgreSQL

Wed, 13 May 2026 00:00:00 +0000

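PostgreSQL is the safe choice as the default relational database for backends. The ecosystem is complete, SQL features are rich, the MVCC and transaction model is stable, and new releases keep evolving actively (pg17 added JSON_TABLE and vacuum memory-management improvements; pg18 added io_uring-based async I/O). Aurora (AWS managed), CockroachDB, Aurora DSQL (2024-12 preview / 2025-05 GA), and Spanner (2024 PostgreSQL dialect) all treat the PostgreSQL wire protocol as a compatibility target: it is the lingua franca of the SQL DB world.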

教學路線：SQL baseline 與交易演進

PostgreSQL 服務頁的教學目標是建立 SQL baseline。讀者讀完後要能用 PostgreSQL 理解 transaction、schema evolution、query boundary、connection pressure 與 managed / distributed SQL 的比較基準。

學習段	核心問題	對應段落
SQL baseline	PostgreSQL 為什麼常作為 OLTP 預設比較基準	定位、適用場景
容量邊界	connection、write throughput、replica、storage 如何限制服務	容量特性、容量規劃要點
交易與查詢	複雜 SQL、JSONB、GIS、全文檢索如何影響資料模型	適用場景、跟其他 vendor 的取捨
演進與維護	vacuum、partition、index、replication 如何成為長期責任	容量規劃要點、常見陷阱
替代路由	何時轉 Aurora、CockroachDB、Spanner、DynamoDB 或 OLAP	不適用場景、跟其他 vendor 的取捨

定位：OLTP 預設、SQL 工程深度

PostgreSQL 跟 MySQL 是兩大 SQL OLTP 主流、但設計取捨明顯不同：

PostgreSQL 偏 特性深度 — JSON、GIS、full-text search、partial index、CTE、window function 都成熟
MySQL 偏 簡單 query 效能 + 分片生態 — Vitess / PlanetScale 提供超大規模 database sharding

選 PostgreSQL 的核心訴求：需要進階 SQL 特性、需要長期 schema evolution 彈性、信任 community-driven 演進、想避免單一 vendor lock-in（PostgreSQL 是 open source、可跨雲 / on-prem）。

容量特性

PostgreSQL 沒有「vendor 給的容量數字」、要靠 instance 配置 + tuning 推估。但有幾個工程上限要知道：

單一 primary 寫吞吐：

一般 m5.4xlarge 級 instance：5K-10K WPS（依 schema、index、commit fsync）
高階 r6i.16xlarge + io2 storage：30K-50K WPS
超過這個級別 → 應用層 database sharding 或換 Aurora / Spanner

Connection 上限：

預設 100 connection、每個 connection ~10MB RAM
1000+ connection 必須 pgBouncer / PgCat 共享 pool
對應 9.C29 Lemino case — RDB connection limit 是 surge 場景的隱性 bottleneck

Read replica：

streaming replication：1 個 primary + 多個 standby（async / sync）
跨 AZ replication lag 通常 < 100ms、跨 region 可能秒級
跟 Aurora 比、自管 PostgreSQL replication lag 較大

Storage 上限：

單一 table 32 TB（PostgreSQL 設計上限）
實務上單表超過 1 TB 開始有 vacuum / index 問題、建議 partition

適用場景

1. 多用途 OLTP、複雜查詢：

複雜 JOIN、CTE、window function、subquery
訂單系統、會員系統、訂閱方案、權限 RBAC
需要 strong consistency + ACID transaction

2. JSON / 半結構化資料：

JSONB column 支援 indexing、partial query
比 MongoDB 適合 主要結構化 + 部分 JSON workload
不適合主要 document workload（用 MongoDB / Cosmos DB）

3. 地理 / 全文檢索：

PostGIS 是業界標準 GIS extension
Full-text search (tsvector) is enough at medium scale; use Elasticsearch at very large scale

4. 進階特性需求：

partial index（WHERE 條件下才建 index）
exclusion constraints（避免 booking 重疊）
range types（時間 / 數字範圍）
logical decoding / CDC（Debezium、pgcapture）
foreign data wrapper（query 跨 DB）

5. 跨雲 / on-prem 部署：

不想 vendor lock-in
可用 Patroni / Stolon / pg_auto_failover 做 HA
對應 1.11 全球分散式 OLTP 的 CockroachDB / Aurora DSQL 比較段

6. 中小規模高峰場景：

流量 < 10K WPS 級別、PostgreSQL 自管或 RDS 通常夠
流量更高、考慮 Aurora（同 wire protocol、storage 升級）

不適用場景

1. 極高寫入吞吐（單機 > 50K WPS）：

必須進入 database sharding 或分散式 SQL
替代：CockroachDB、TiDB、Spanner、應用層 sharding

2. 全球 multi-region active-active write：

PostgreSQL 是 single primary、不支援 multi-region active-active
替代：Aurora DSQL、Spanner、CockroachDB multi-region

3. KV 簡單查詢 + sub-10ms p99：

PostgreSQL connection 開銷 + parsing + planning 已經 1-3ms
KV-pattern workload 用 DynamoDB / Redis / Cosmos DB 更便宜更快

4. 大規模 OLAP：

PostgreSQL 定位在 OLTP，analytics workload 交給 OLAP 系統
大數據分析用 ClickHouse / BigQuery / Snowflake / Redshift / Synapse

5. 連線量極大 SaaS（每個用戶一個 connection）：

即使有 pgBouncer、超大連線量仍是 PostgreSQL 結構性限制
對應 9.C29 Lemino 案例 — 流量上升 connection 爆是換 DynamoDB 的主因

跟其他 vendor 的取捨

vs MySQL：

PostgreSQL：SQL 特性深、JSON / GIS / window 完整、replication 較簡單但 lag 較大
MySQL：簡單 query 效能好、replication 機制成熟、Vitess 分片生態強
選 PostgreSQL：需要進階 SQL、複雜 query、JSON workload
選 MySQL：高併發簡單 query、需要 sharding、已用 MySQL 生態

vs Aurora（同 PostgreSQL wire protocol）：

PostgreSQL：自管 / RDS、特性接近 upstream、跨雲可用
Aurora：AWS managed、storage / compute 分離、更多 read replica
選 PostgreSQL：跨雲、想最新特性、預算敏感
選 Aurora：AWS 生態、需要更快 failover + 更多 read replica
詳見 Aurora vendor page

vs CockroachDB（PostgreSQL wire protocol 相容）：

PostgreSQL：single-primary OLTP、SQL 特性完整
CockroachDB：multi-region 強一致 SQL、PostgreSQL wire 相容但部分特性缺
選 PostgreSQL：single-region 或 read replica 跨 region 夠
選 CockroachDB：必須 multi-region active-active write
詳見 1.11 全球分散式 OLTP

vs Spanner / Aurora DSQL（全球分散式 SQL）：

PostgreSQL：傳統設計、跨 region 是 async replication
Spanner / Aurora DSQL：全球線性化、跨 region 強一致
選 PostgreSQL：90% 場景夠用、便宜、容易
選 Spanner / Aurora DSQL：金融交易、ticketing inventory、必須全球強一致

vs DynamoDB：

詳見 1.10 KV / Document DB 容量規劃的 connection model 對比段

vs Neon（PostgreSQL serverless）：

PostgreSQL：standard、自管或 RDS
Neon：branch-based、scale-to-zero、適合 dev / preview environment
選 Neon：dev / preview、稀疏 workload、CI 用
選 PostgreSQL：production sustained workload

容量規劃要點

1. Connection pool 必須有：

直接連 1000+ connection 會壓垮 PostgreSQL
pgBouncer（最簡單、transaction pooling）
PgCat (a more advanced alternative written in Rust, with sharding support)
application 層 pool（HikariCP、SQLAlchemy pool）
通常組合使用：application pool 30-50 connection × 多 instance → pgBouncer 共享 → PostgreSQL 200 connection
對應 Connection Pool 卡片
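The fan-in described above can be checked with a few lines of arithmetic. This is a minimal sketch, not a sizing tool: the figures of 30 connections per application pool and 200 server connections come from the text, while the instance count of 40 and the `PoolPlan` name are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PoolPlan:
    app_instances: int        # number of application instances (assumed value below)
    pool_per_instance: int    # application-level pool size per instance
    server_pool_size: int     # connections the pooler opens to PostgreSQL

    @property
    def client_connections(self) -> int:
        """Connections that would hit PostgreSQL directly without a pooler."""
        return self.app_instances * self.pool_per_instance

    @property
    def fan_in_ratio(self) -> float:
        """How many client connections share one server connection."""
        return self.client_connections / self.server_pool_size


plan = PoolPlan(app_instances=40, pool_per_instance=30, server_pool_size=200)
print(plan.client_connections, plan.fan_in_ratio)
```

With these assumed numbers, 1,200 client connections share 200 server connections. A high fan-in ratio only works when transactions are short, because transaction pooling hands a server connection back after each transaction.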

2. Replication 配置：

streaming replication：async / sync / quorum
跨 AZ async：lag 通常 < 100ms、failover 1-2 分鐘
跨 AZ sync：lag 接近 0、但寫入要等 standby ack、會降寫吞吐
跨 region 通常 async
HA 工具：Patroni（最常見）、pg_auto_failover、Stolon
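The cost of synchronous replication can be approximated with a rough latency model. The sketch below assumes each session commits serially and that a commit takes the local flush time plus one round trip to the standby. It ignores group commit and pipelining, and the millisecond figures are illustrative assumptions rather than measurements.

```python
def max_commits_per_second(local_flush_ms: float,
                           standby_rtt_ms: float,
                           sessions: int) -> float:
    """Upper bound on commits/s when every session waits for each commit."""
    per_commit_ms = local_flush_ms + standby_rtt_ms
    return sessions * 1000.0 / per_commit_ms


# Hypothetical numbers: 1 ms local flush, 1 ms cross-AZ round trip, 50 sessions.
async_rate = max_commits_per_second(local_flush_ms=1.0, standby_rtt_ms=0.0, sessions=50)
sync_rate = max_commits_per_second(local_flush_ms=1.0, standby_rtt_ms=1.0, sessions=50)
print(async_rate, sync_rate)
```

Under this model the throughput drop depends on the ratio between round-trip time and local flush time, so it should be measured on the real topology before choosing sync mode.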

3. Vacuum 跟 bloat 治理：

PostgreSQL MVCC 會留下 dead tuples、必須 vacuum
autovacuum 配置：throttle 大表、避免在 peak 跑
Bloat monitoring: in pg_stat_user_tables, track the ratio of n_dead_tup to n_live_tup
大表 vacuum 可能要 hours、影響 maintenance window
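To see why large tables need their own autovacuum settings, it helps to compute when autovacuum triggers. The sketch below uses the documented trigger formula (threshold plus scale factor times row count) with PostgreSQL's default values of 50 and 0.2. The function names and the one-billion-row table are illustrative assumptions.

```python
def autovacuum_trigger(reltuples: float,
                       threshold: int = 50,
                       scale_factor: float = 0.2) -> float:
    """Dead tuples needed before autovacuum vacuums a table.

    Mirrors: autovacuum_vacuum_threshold
             + autovacuum_vacuum_scale_factor * reltuples
    """
    return threshold + scale_factor * reltuples


def dead_tuple_ratio(n_live_tup: int, n_dead_tup: int) -> float:
    """Share of dead tuples, from pg_stat_user_tables counters."""
    total = n_live_tup + n_dead_tup
    return n_dead_tup / total if total else 0.0


rows = 1_000_000_000  # hypothetical large table
default_trigger = autovacuum_trigger(rows)                    # default scale factor
tuned_trigger = autovacuum_trigger(rows, scale_factor=0.01)   # per-table override
print(default_trigger, tuned_trigger, dead_tuple_ratio(900, 100))
```

With defaults, a billion-row table accumulates about 200 million dead tuples before autovacuum starts, which is why a lower per-table scale factor is a common adjustment for large tables.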

4. 大表 partitioning：

單表 > 1 TB 建議 partition（按時間、按 tenant）
partition pruning 讓 query 只掃需要的 partition
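Partitioning limits: a unique constraint must include the partition key, so uniqueness on other columns is not enforced across partitions; joins that span partitions are slower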

5. Index 策略：

預設 B-tree、適合大多數 query
partial index 對 boolean / status column 特別有用
GIN / GiST 對 JSON / full-text / GIS
index 太多會拖累寫入、定期 review 未用 index（pg_stat_user_indexes）

安全、DR 與角色分工

PostgreSQL 的 production 完整性不只來自 SQL 特性，也來自資料存取、備份復原、升級責任與事故證據的分工。這一段補上 PG baseline 原本留在 limitation 的三個缺口：Security / RLS / audit logging、cross-region DR、application developer vs DBA / SRE 視角。

責任面	PostgreSQL 要回答的問題	主要引用路徑
Access control / RLS	table、row、function、extension 與 service account 權限如何切	Security / RLS / Audit Logging、7.4 Data Protection、Audit Log
TLS / credential	application 連線、DB user、憑證與 secret rotation 如何治理	TLS / mTLS、Credential、Secret Management
Cross-region DR	region 失效時要 async replica、PITR、Aurora Global Database 還是 distributed SQL	Cross-region DR、RPO、RTO、Failover、PITR + WAL Archiving
Developer / DBA split	application schema、migration、query、index 與 rollback 誰負責	Developer / DBA Responsibility Split、1.2 Schema Design、1.6 Migration Playbook
Incident evidence	資料事故中要留下哪些 query、timeline、restore 與 decision evidence	4.20 Observability Evidence Package、8.19 Incident Decision Log

Access control / RLS 的判讀重點是把資料責任放在資料層與 application 層之間分工。PostgreSQL 支援 role、grant、schema、function security 與 row-level security；但 RLS 會把授權邏輯拉進 database，適合 multi-tenant row isolation、資料平台或共享 reporting schema，日常 OLTP 仍要保留 application authorization 與 audit trail。

TLS / credential 的判讀重點是連線安全與憑證生命週期。Self-managed PostgreSQL 要處理 server cert、client cert、DB user rotation 與 connection pool 重連；managed PostgreSQL 常把 certificate、IAM auth 或 secret integration 交給平台，但 application pool、migration tool 與 read replica 仍要一起更新。

Cross-region DR 的判讀重點是 RPO / RTO 與資料一致性。自管 PostgreSQL 可用 streaming replication、WAL archiving、PITR 與 Patroni 做 region failover；Aurora 把 backup、PITR 與 Global Database 交給 AWS；真正 active-active 或 global strong consistency 需求要回到 CockroachDB、Spanner 或 Aurora DSQL，single-primary PostgreSQL 保留為 region failover 與 async DR 路線。

Developer / DBA split 的判讀重點是把日常責任寫進流程。Application developer 擁有 query shape、transaction boundary、repository adapter 與 migration contract；DBA / SRE 擁有 backup、replication、pooler、extension、vacuum、index maintenance 與 DR drill；release gate 需要把兩邊 evidence 合在同一份 decision log。

Managed PG 與相容變體路由

PostgreSQL wire protocol 已成為 managed SQL 與 distributed SQL 的相容目標。選型時要區分「PostgreSQL 本體」、「managed PostgreSQL」、「PostgreSQL-compatible distributed SQL」與「PostgreSQL extension ecosystem」四種不同責任。

變體	適合情境	主要代價 / 檢查點	下一步路由
RDS / self-managed PG	想接近 upstream、保留跨雲與 extension 彈性	團隊承擔 HA、backup、upgrade、vacuum 與 pooler	Patroni HA、PITR + WAL Archiving
Aurora PostgreSQL	AWS 內 production OLTP、想轉移 HA / storage ops	extension whitelist、cost model、cluster endpoint	→ Aurora、Aurora vendor
Cloud SQL / AlloyDB	GCP 內 managed PostgreSQL 與 Google operation model	extension / version matrix、IAM / backup / cost model	Managed PG Comparison
Azure Cosmos DB for PostgreSQL	Citus-based distributed PostgreSQL、tenant / shard workload	coordinator / worker topology、Citus 語意	Citus distributed、Database Sharding、Cosmos DB vendor
Neon / serverless PG	preview、branch、稀疏 workload、dev environment	cold start、connection、production sustained workload	本頁 vs Neon 段、後續 serverless PG comparison
Aurora DSQL / CockroachDB	global write、distributed SQL、region resiliency	transaction retry、extension gap、latency / cost	→ Aurora DSQL、→ CockroachDB

Managed PG 變體的引用規則是先查 compatibility，再談 migration。Extension whitelist、backup / restore API、logical replication 支援、connection endpoint 行為與 pricing 都是時間敏感 claim；實作前要回到官方文件確認版本，並把確認日期留在 migration plan 或 decision log。

Deep article + Migration playbook（已完成）

主題	文章	類型
Streaming replication topology + LSN + slot	replication-topology	Deep article
pg_repack / pg-osc 跟 PG 內建 ALTER 行為	online-schema-change	Deep article
Process-per-connection model + pooler 必要性	connection-scaling	Deep article
pgBouncer + PgCat connection pool	pgbouncer-config	Deep article
Patroni HA + DCS-based failover	patroni-ha	Deep article
Autovacuum tuning + bloat 治理	autovacuum-tuning	Deep article
Logical replication + Debezium CDC	logical-replication-debezium	Deep article
Citus distributed extension	citus-distributed	Deep article
BDR / pgEdge / Bucardo multi-master	bdr-multi-master	Deep article
MVCC + lock model（PG 並行控制核心）	mvcc-lock-model	Deep article
EXPLAIN / auto_explain / pg_hint_plan	query-optimization	Deep article
Index method 選型決策樹（B-tree / GIN / GiST / BRIN）	index-selection	Deep article
Declarative partitioning + pg_partman	declarative-partitioning	Deep article
JSONB binary storage + GIN index	jsonb-deep-dive	Deep article
Full-text search（tsvector + pg_trgm）	full-text-search	Deep article
Extension ecosystem（pgvector / TimescaleDB 等）	extension-ecosystem	Deep article
TimescaleDB hypertable + CAGG + compression	timescaledb-deep-dive	Deep article
pgvector HNSW / IVFFlat ANN search	pgvector-deep-dive	Deep article
PostGIS geometry / geography + GiST	postgis-deep-dive	Deep article
PITR + WAL archiving	pitr-wal-archiving	Deep article
Replication slot management（含 PG 17 failover slot）	replication-slot-management	Deep article
SQL features baseline + MySQL 對比	sql-features-baseline	Deep article
Hands-on 操作路線	hands-on	操作型章節群
Major version upgrade（N → N+1 pg_upgrade）	major-version-upgrade	Migration playbook（5-type 漏類 / 接近 Type B 但需 upgrade-specific audit）
→ Aurora PostgreSQL	migrate-to-aurora	Migration playbook（Type C）
→ Aurora DSQL（PG wire-compat distributed）	migrate-to-aurora-dsql	Migration playbook（Type E）
→ CockroachDB	migrate-to-cockroachdb	Migration playbook（Type E）
Multi-region + GDPR rollout	multi-region-gdpr-rollout	Migration playbook（Type F）
Partition redesign	partition-redesign	Migration playbook（Type F）

補充正文路由

當前 deep article、migration playbook、補充正文與 hands-on 已 cover replication / HA / OSC / connection / CDC / sharding / multi-master / MVCC / query opt / index / partitioning / JSONB / FTS / extension（含 TimescaleDB / pgvector / PostGIS）/ backup / slot / SQL features / upgrade / migration / security / DR / managed variant 等維度。下列補充正文用來承接 overview 中提到的延伸議題：

Logical decoding plugins deep dive：wal2json / pgoutput / decoderbufs 對位、CDC pipeline 整合
pg_partman advanced：retention 跟 child partition 自動 management
Connection pooler comparison: a detailed comparison of PgBouncer, PgCat, and Odyssey
Aurora I/O-Optimized vs standard：cost model 取捨
AlloyDB / Cloud SQL 比較：GCP managed PG 選型

上述補充篇已完成正文，並保留既有引用路徑。Logical decoding 接 Logical Replication + Debezium 與 Replication Slot Management；pg_partman advanced 接 Declarative Partitioning；pooler comparison 接 Connection Scaling 與 pgBouncer Config；Aurora cost 接 → Aurora；AlloyDB / Cloud SQL 接 Managed PG Comparison。

案例對照

PostgreSQL 沒有直接的 09 case（多數 09 case 用 managed vendor）、但作為 baseline 跟遷移源頭 在許多 case 出現：

案例	跟 PostgreSQL 的關係
9.C23 Netflix Aurora consolidation	從多套 RDBMS（含 PostgreSQL）統一到 Aurora
9.C32 Clearent Azure SQL Hyperscale	Azure 生態替代 PostgreSQL 的選擇
9.C29 Lemino RDB connection limit	PostgreSQL/MySQL 都有的 connection 限制

已知 Limitation 與 Audit 紀錄

本 vendor 頁的 22 篇 deep article + 6 篇 migration playbook 經過 4-reviewer audit（A 寫作規範 / B 跨檔一致性 / C 技術準確性 / D 框架偏誤）、Phase 1-3 修法完成。承認以下 limitation：

PG narrative bias：pgvector / TimescaleDB / extension-ecosystem / Citus 四篇對「PG 取代專業 DB」描述偏 PG-favoring；對手 vendor（Pinecone / InfluxDB / Vitess）的優勢段相對簡短。讀者選型時、請以 cost / ops / scale 三軸綜合判斷、不依本 vendor 頁單一視角。
Anti-recommendation 深度不一：bdr-multi-master / extension-ecosystem 有「99% 不需要」明確邊界、其他篇章邊界較柔（如「Vector 量 > 5-20M」是粗略門檻）。實際 production 決策請參考多 vendor 對照 + 自家 workload 量測。
Sibling cross-link 狀態：MySQL ↔ PG sibling、PG 既有 ↔ 新章節 cross-link 已補（refer #136 卡）；本輪同步補 Aurora / CockroachDB / Spanner / Cosmos DB / DynamoDB vendor 頁的反向 sibling 路由，剩餘精修可在各 migration playbook 補更細的 step-by-step 對照。
時間敏感 vendor claim：Aurora DSQL（2024-12 preview / 2025-05 GA）/ pgvector（0.8 iterative scan）/ TimescaleDB version matrix / DSQL extension 支援範圍持續演進、本 vendor 頁以 2025-2026 公開狀態為準、實作前請以 vendor 官方 docs 為準（refer #137 卡）。
補充維度已正文化：Security / RLS / audit logging、cross-region DR、application developer vs DBA 視角分工、YugabyteDB / TiDB migration playbook、specialized PG variants 已補成正文。本輪也補上跨 vendor 反向連結與時間敏感 claim 路由；下一輪可集中在 migration playbook 的操作步驟與 lab 化。

詳細 audit findings 跟修法見 #136 Sibling Vendor Cross-Link Bidirectionality / #137 Vendor Feature 時間敏感性 / #138 Cross-Reviewer Convergence。

常見陷阱

connection 沒 pool 直接連：1000 application instance × 30 connection = 30K connection、PostgreSQL 撐不住
沒 vacuum 治理：dead tuple 累積、table bloat、query 變慢
大表沒 partition：> 1 TB 單表的 vacuum / index rebuild 變成事故
index 不 review：寫吞吐被舊 index 拖垮
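Cross-AZ sync replication on a write-heavy workload: every commit waits for the standby ack, so commit latency grows by at least one cross-AZ round trip and write throughput can drop sharply (the exact drop depends on concurrency and round-trip time)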
logical replication 拖太多 publication：可能造成 primary WAL 堆積、disk 爆

下一步路由

完整 T1 對照：01-database vendors index
平行：MySQL vendor、Aurora vendor（managed PostgreSQL）
操作：PostgreSQL Hands-on（local lab、pool、PITR、migration evidence、HA drill）
上游：1.1 高併發資料存取、1.3 Transaction Boundary
下游：1.10 KV / Document DB 容量規劃（PostgreSQL 不適用時的替代）/ 1.11 全球分散式 OLTP（PostgreSQL 不夠用時的升級路徑）
跨模組：9.5 瓶頸定位流程 — connection / replication lag / vacuum 都是 PostgreSQL 常見 bottleneck 源
官方：PostgreSQL Documentation

GitHub Actions

Fri, 01 May 2026 00:00:00 +0000

GitHub Actions 是 GitHub 原生的 CI/CD 工具、承擔三個責任：PR check workflow（test / lint / coverage）、release 自動化 + environment protection rules、跨 platform matrix testing。設計取捨偏向「跟 GitHub 深度整合 + marketplace action 生態 + OIDC 認證雲端 + self-hosted runner」、是 GitHub-hosted 專案的預設 CI 選擇。

本章目標

讀完本章後、你應該能：

寫 workflow（.github/workflows/*.yml）
設計 PR check + matrix testing
用 reusable workflows / composite actions 復用
配置 environment protection + approval gate
用 OIDC + cloud auth（無 long-lived secret）

最短路徑：5 分鐘把 GitHub Actions 跑起來

# .github/workflows/ci.yml
name: CI
on: [pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm test

日常操作與決策形狀

Workflow 設計

子議題：

on triggers（push / pull_request / schedule / workflow_dispatch / repository_dispatch）
job / step / action
Matrix（OS / language version / test split）
對應指令範例：gh workflow run、gh run list

Cache 策略

子議題：

actions/cache（語言依賴 / build cache）
Cache key 設計（hashFiles + version）
Cache scope（per branch / per repo）
對應 build speed optimization
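The cache key design above (a content hash plus a manual version) can be illustrated outside of GitHub Actions. The sketch below imitates the idea behind `hashFiles`, not its exact algorithm. The `cache_key` function, the `deps` prefix, and the lockfile contents are assumptions for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_key(os_name: str, version: str, lockfiles: list[Path]) -> str:
    """Build a cache key that changes only when a lockfile's bytes change."""
    digest = hashlib.sha256()
    for path in sorted(lockfiles):  # sort so argument order never changes the key
        digest.update(path.read_bytes())
    return f"{os_name}-deps-{version}-{digest.hexdigest()[:16]}"


with tempfile.TemporaryDirectory() as tmp:
    lock = Path(tmp) / "package-lock.json"
    lock.write_text('{"lockfileVersion": 3}')
    first = cache_key("linux", "v1", [lock])
    second = cache_key("linux", "v1", [lock])   # same bytes: same key (cache hit)
    lock.write_text('{"lockfileVersion": 3, "packages": {}}')
    third = cache_key("linux", "v1", [lock])    # changed bytes: new key (cache miss)
    bumped = cache_key("linux", "v2", [lock])   # manual version bump: new key
```

Hashing only the lockfile keeps the key stable across unrelated commits. Hashing files that change on every build (timestamps, generated output) is a typical cause of permanent cache misses.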

Reusable workflows / composite actions

子議題：

Reusable workflow：跨 repo 引用整個 workflow
Composite action：把多 step 包成 action
對應 knowledge cards reusable-action (對應 DRY)

進階主題（按需閱讀）

Self-hosted runner

子議題：

內網資源 / 特殊硬體（GPU）/ macOS
Runner group + scaling
Security：ephemeral runner（每次新建）
對應 07 security

OIDC + cloud auth

子議題：

GitHub OIDC provider
AWS / GCP / Azure 信任 GitHub
無 long-lived access key
對應 supply chain security

Environment protection

子議題：

environment（dev / staging / prod）
Required reviewers
Wait timer
Secrets per-environment
對應 6.8 Release Gate

Workflow security

子議題：

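pull_request vs pull_request_target (the latter runs in the base repository's context with access to secrets, which is dangerous if the workflow checks out and runs untrusted PR code)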
third-party action pinning（commit SHA）
GITHUB_TOKEN permissions（最小化）

Deploy workflow

子議題：

Deploy on tag / release
Rolling deploy / blue-green / canary
Rollback action

排錯快速判讀

Workflow 沒觸發

操作原則：on trigger 配置 / branch filter / paths filter。判讀：Actions tab 看 trigger event。

Permission denied

操作原則：GITHUB_TOKEN permissions 不夠。判讀：workflow 加 permissions: 區段。

Cache miss

操作原則：cache key 不穩定 / hashFiles input 變化。

Secret 沒生效

操作原則：secret name / environment 不對 / pull_request from fork 不能用 secret。

Self-hosted runner 卡住

操作原則：runner offline / job queue 滿 / runner group 配置不對。

何時改走其他服務

需求形狀	改走
進階 cache / parallelism	CircleCI
非 GitHub-hosted	GitLab CI / Bitbucket Pipelines / CircleCI
Self-hosted enterprise	Jenkins / Buildkite / Tekton
複雜 pipeline DAG	Tekton / Argo Workflows
Bazel-native CI	BuildBuddy / EngFlow

不在本頁內的主題

各 Marketplace action 細節
GitHub Enterprise self-host
Actions pricing
各語言 setup-* action 細節

案例回寫

案例方向	對應主題
Google：Error Budget 與 Release Gating	把 SLO 消耗轉成 release gate / freeze 的 workflow 入口
Stripe：Idempotency 與零停機遷移	canary deploy / staged rollout 的 CI 節奏
Microsoft：變更治理與可靠性門檻	environment protection + approval gate 對應變更分層

待補 GitHub Actions customer case：大規模 monorepo Actions 採用、OIDC migration、self-hosted runner scaling 案例。

下一步路由

上游概念：6.8 Release Gate
平行 vendor：CircleCI
下游能力：07 security（supply chain）、5 deployment（deploy gate）

Kubernetes

Fri, 01 May 2026 00:00:00 +0000

Kubernetes 是 container orchestration 事實標準、承擔三個責任：workload lifecycle（pod / deployment / probe / rolling update）、cluster networking（service / ingress / DNS）、resource scheduling（resource limit / QoS / autoscaling）。設計取捨偏向「declarative + control loop + extensible」、是 cloud-native 生態的核心抽象。可自管或用 cloud managed（GKE / EKS / AKS）。

對「多服務多實例 container orchestration、需要 rolling update / blue-green / canary、跨雲 / 跨環境統一抽象」這條路徑、Kubernetes 是首選。

本章目標

讀完本章後、你應該能：

用 kubectl 部署 Deployment + Service、配置 probe / resource limit
設計 rolling update / pod disruption budget 避免服務中斷
選 Ingress controller（nginx / traefik / GLBC / ALB Controller）
看懂 pod stuck / probe fail / OOMKilled / drain timeout 訊號
評估 managed（GKE / EKS / AKS）vs 自管 vs Operator 進階場景

最短路徑：5 分鐘把 Kubernetes 跑起來

# 1. Run kind locally (requires kind + docker installed first)
kind create cluster --name dev

# 2. Deploy a Deployment + Service
kubectl create deployment nginx --image=nginx:stable-alpine
kubectl expose deployment nginx --port=80 --type=ClusterIP

# 3. Verify
kubectl get pods,svc,deploy
kubectl port-forward svc/nginx 8080:80

日常操作與決策形狀

kubectl 核心指令

子議題：

資源生命週期：apply / create / delete / get / describe / logs / exec
Rolling update：set image / rollout status / rollout undo
Debug：events / port-forward / cp / top
Example commands: kubectl get pods -A, kubectl describe pod <pod-name>, kubectl logs -f <pod-name>

Workload 設計

Pod lifecycle 是 K8s 的核心抽象。子議題：

Deployment（stateless）/ StatefulSet（stateful）/ DaemonSet（per-node）/ Job / CronJob
Pod 多 container（sidecar / init container）
對應 5.2 K8s deployment

Probe / Resource limit / QoS

子議題：

Liveness（活著嗎）/ Readiness（接流量嗎）/ Startup（啟動完了嗎）— 三 probe 各自責任
Resource limit（requests / limits）+ QoS class（Guaranteed / Burstable / BestEffort）
對應 Platform lifecycle contract
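How requests and limits map to a QoS class can be written as a small decision function. This is a simplified sketch of the documented rules, assuming only cpu and memory are considered and that quantities are compared as plain strings (so "1" and "1000m" would not be treated as equal here). The `qos_class` name and the input dictionary shape are assumptions.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod from its containers' resource requests and limits."""
    def section(container: dict, kind: str) -> dict:
        return container.get(kind) or {}

    # BestEffort: no container sets any request or limit.
    if all(not section(c, "requests") and not section(c, "limits") for c in containers):
        return "BestEffort"

    # Guaranteed: every container has cpu and memory limits, and requests
    # (which default to the limit when omitted) equal those limits.
    for c in containers:
        limits, requests = section(c, "limits"), section(c, "requests")
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            if limit is None or requests.get(resource, limit) != limit:
                return "Burstable"
    return "Guaranteed"


print(qos_class([{"limits": {"cpu": "500m", "memory": "256Mi"}}]))
print(qos_class([{"requests": {"cpu": "100m"}}]))
print(qos_class([{}]))
```

The class matters under node memory pressure: BestEffort pods are evicted first and Guaranteed pods last, which is why production workloads usually set both requests and limits.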

進階主題（按需閱讀）

Rolling update / disruption budget

對應案例 5.C9 反例：cutover without drain。子議題：

maxSurge / maxUnavailable 配置
PodDisruptionBudget 限制 voluntary disruption
Preemption / priority class
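The effect of maxSurge / maxUnavailable on a rollout is easiest to see as arithmetic. The sketch below assumes the documented rounding behavior for percentage values (maxSurge rounds up, maxUnavailable rounds down). The `rollout_bounds` function name and the returned dictionary keys are illustrative.

```python
def rollout_bounds(replicas: int,
                   max_surge: "int | str" = "25%",
                   max_unavailable: "int | str" = "25%") -> dict:
    """Pod-count envelope a Deployment stays within during a rolling update."""
    def resolve(value, round_up: bool) -> int:
        if isinstance(value, str) and value.endswith("%"):
            percent = int(value[:-1])
            if round_up:
                return -(-replicas * percent // 100)   # integer ceiling
            return replicas * percent // 100           # integer floor
        return int(value)

    surge = resolve(max_surge, round_up=True)
    unavailable = resolve(max_unavailable, round_up=False)
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


print(rollout_bounds(10))          # default 25% / 25%
print(rollout_bounds(3, 1, 0))     # absolute values: never drop below 3 ready pods
```

Setting maxUnavailable to 0 trades rollout speed for capacity: the rollout proceeds only as fast as new pods become ready.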

Ingress / Service mesh integration

子議題：

Ingress controller 選擇（nginx / Traefik / ALB Controller）
Gateway API（next gen Ingress）
Service mesh integration (Istio, which is Envoy-based / Linkerd, which uses its own proxy)
對應 5.C7 Airbnb Istio

Operator pattern / CRD

子議題：

CRD（CustomResourceDefinition）+ Controller 模式
Operator framework（OperatorSDK / kubebuilder）
常見 Operator：Prometheus / Cert-manager / Argo CD

Managed vs self-managed

對應案例 5.C1 Tradeshift self-managed → EKS、5.C2 Condé Nast EKS、5.C3 Orbitera managed K8s、5.C4 Mobileye EKS、5.C5 Miro EKS。子議題：

Self-managed（kubeadm / Cluster API）的 control plane 維運成本
Managed（GKE / EKS / AKS）的限制（版本鎖定 / managed addon）
遷移路徑跟回退設計

Multi-cluster / Federation

子議題：

Federation v2 / Cluster API multi-cluster
Cross-cluster service mesh（Istio multi-cluster）
對應 5.C6 Airbnb cluster scaling

Cluster autoscaling

子議題：

Horizontal Pod Autoscaler / Vertical Pod Autoscaler
Cluster Autoscaler / Karpenter
跟 09 performance capacity 對照

排錯快速判讀

Pod stuck（Pending / CrashLoopBackOff）

操作原則：先 kubectl describe pod 看 events、再 kubectl logs 看 container 訊息。

kubectl describe pod <pod-name>            # check the Events section for scheduling / pull / probe messages
kubectl logs <pod-name> --previous         # check the container log from the run before the crash

判讀路徑：Pending → resource 不足 / nodeSelector 不匹配；CrashLoopBackOff → exit code + log 找原因。

Probe failure 造成不停 restart

操作原則：probe path / initial delay / timeout 配置錯。判讀：describe pod 看 probe events。

OOMKilled

操作原則：memory limit 太低、container 被殺。判讀：describe pod 看 last state reason。修法：raise limit 或優化 application memory。

Rolling update stuck

對應 5.C9 反例。判讀路徑：新 pod 起不來 → readiness 失敗 → 舊 pod 不下線 → 卡住。

Drain timeout

操作原則：kubectl drain 失敗、PDB 限制太緊。判讀：kubectl describe pdb。

何時改走其他服務

需求形狀	改走
單機服務（VM / bare metal）	systemd
Local dev / CI	Docker Compose
AWS managed runtime（不要 K8s）	ECS / Fargate
極簡 PaaS	Cloud Run / Heroku / Fly.io
Alternative orchestrator	Nomad / Rancher (Rancher manages Kubernetes clusters rather than replacing Kubernetes)
Edge / IoT 場景	K3s / MicroK8s

不在本頁內的主題

完整 kubectl 指令 reference
YAML manifest 完整 schema
各 Operator 細節
各語言 client-go API

案例回寫

直接相關案例

案例	主討論議題
5.C1 Tradeshift self-managed → EKS	自管 K8s 遷 managed、零停機切流
5.C2 Condé Nast EKS	多團隊異質集群整併到單一控制面
5.C3 Orbitera managed K8s	平台重置不中斷產品的能力遷移
5.C4 Mobileye EKS	大規模 workload 分批遷 EKS
5.C5 Miro EKS	Managed K8s 跟團隊維運模型對齊
5.C6 Airbnb cluster scaling	手動擴縮 → 自動化容量治理
5.C7 Airbnb Istio	Service mesh 升級分批治理
5.C9 反例：cutover without drain	Rolling update / drain 沒做的傷
5.C10 規模對照	小型 systemd → 中型 K8s → 大型 multi-cluster

下一步路由

上游概念：5.2 K8s deployment
平行 vendor：Docker、Envoy
下游能力：6 reliability（release gate）、8 incident response

OpenTelemetry

Fri, 01 May 2026 00:00:00 +0000

OpenTelemetry（OTel）是 CNCF 開放標準、承擔三個責任：定義 traces / metrics / logs 的資料模型（spec）、提供 vendor-neutral 的 SDK 跟 auto-instrumentation、以 OTel Collector 作為 instrumentation 跟 backend 之間的抽象層。設計取捨偏向「抽象優於 vendor-specific feature」、避免 vendor lock-in 是核心動機。多數現代 observability 平台（Datadog / Honeycomb / Grafana Cloud / Cloud Operations）都接受 OTLP。

本頁先給最短路徑、再展開日常 instrumentation 跟 Collector 部署、最後進階治理（sampling / semantic conventions / logs 成熟度）跟排錯。

本章目標

讀完本章後、你應該能：

用 OTel SDK 或 auto-instrumentation 對應用程式做 instrumentation
配置 OTLP exporter 把 telemetry 送到任一 backend
部署 OTel Collector（agent / gateway 模式）作為 backend 切換抽象層
區分 head-based vs tail-based sampling、選擇對應策略
評估從 vendor SDK 遷移到 OTel SDK 的相容性風險

最短路徑：5 分鐘把 OTel 跑起來

# 1. Add auto-instrumentation to the application (example: Python)
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
opentelemetry-instrument --traces_exporter otlp --metrics_exporter otlp python app.py

# 2. Start the OTel Collector
docker run -p 4317:4317 -p 4318:4318 otel/opentelemetry-collector-contrib

# 3. Collector configuration example
# TODO: otel-collector-config.yaml with otlp receiver + exporter to backend

最短路徑驗證 telemetry 從 app → Collector → backend 串通。實際 production 要評估 sampling、retention、cardinality。

日常操作與決策形狀

Instrumentation 模式

子議題：

Auto-instrumentation：Java / Python / Node / .NET / Ruby / Go 各語言成熟度不同
Manual instrumentation：開發者寫 trace span / metric instrument
Library instrumentation: opentelemetry-instrumentation-* packages (HTTP client / DB / framework)

OTLP exporter 配置

子議題：

OTLP gRPC（4317）vs HTTP（4318）
Endpoint / headers / authentication 配置
對應指令範例：環境變數 OTEL_EXPORTER_OTLP_ENDPOINT、OTEL_EXPORTER_OTLP_HEADERS

Collector 部署模式

子議題：

Agent：跟應用程式同 host / pod、做 local buffer + enrichment
Gateway：集中部署、跨多 agent 接收、做 sampling / routing
Sidecar：K8s sidecar pattern、跟 pod 同生命週期
對應配置：receivers / processors / exporters pipeline

深入：OTel Collector 部署模式：agent / gateway / sidecar 與 pipeline 設計（三種位置責任分工、pipeline 設計、collector 失效 / 記憶體壓力 / backpressure 故障演練、容量成本邊界）。

進階主題（按需閱讀）

Auto-instrumentation 跨語言成熟度

子議題：

Java：最成熟、auto-instrumentation 廣度最大
Python：成熟、覆蓋主流 framework
Node：成熟、async context propagation 較複雜
Go：較弱（runtime 不支援 monkey patching）、多用 manual
.NET：成熟、跟 Application Insights 對齊
Ruby / PHP：相對較弱、覆蓋主流 framework

Sampling 策略

對應案例 4.C7 Datadog OTel migration。子議題：

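Head-based sampling: the keep/drop decision is made when the trace starts; cheap, but the decision cannot take into account errors or latency that appear later in the trace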
Tail-based sampling：trace 完成後決定（依錯誤 / 延遲）、Collector 要 buffer 整個 trace
Sampling rate 配置（global / per-service / probabilistic）
對應工具：OTel Collector 的 tail_sampling processor、Refinery（Honeycomb）
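The difference between the two sampling strategies can be sketched in plain Python without any SDK. The head-based function below follows the general idea of ratio sampling on the trace ID (a deterministic decision so every service agrees). The tail-based function keeps a trace only after all spans are known. Function names, the span dictionary shape, and the thresholds are assumptions, not the OTel SDK or Collector API.

```python
def head_sample(trace_id: int, ratio: float) -> bool:
    """Decide at trace start, using only the trace ID.

    Deterministic: every service that sees the same trace ID reaches the
    same decision, so traces are kept or dropped as a whole.
    """
    bound = int(ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound


def tail_sample(spans: list[dict], latency_threshold_ms: float) -> bool:
    """Decide after the trace completes: keep errors and slow traces."""
    return any(s["error"] or s["duration_ms"] > latency_threshold_ms for s in spans)


fast_ok = [{"error": False, "duration_ms": 12.0}]
slow_ok = [{"error": False, "duration_ms": 900.0}]
fast_err = [{"error": True, "duration_ms": 8.0}]

print(head_sample(0x4BF92F3577B34DA6A3CE929D0E0E4736, 0.1))
print(tail_sample(fast_ok, 500), tail_sample(slow_ok, 500), tail_sample(fast_err, 500))
```

The tail-based decision needs every span of the trace in one place, which is why the Collector has to buffer complete traces and why gateway memory becomes the main capacity concern.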

Semantic conventions

子議題：

HTTP / DB / messaging / RPC 等的 attribute 命名規範
Resource attributes（service.name / service.version / deployment.environment）
Span name / status code convention
Migration：應用層用 OTel semantic conventions、避免 vendor-specific naming

Logs in OTel

子議題：

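Logs entered the OTel spec later than metrics / traces, so the logs signal reached stable status more recently and SDK maturity varies more by language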
Log signal 設計：log record 跟 span 關聯（trace_id / span_id）
跟 Loki / Elastic / CloudWatch 的整合
從現有 logging library 移轉的路徑（log-forwarding vs SDK）

Vendor SDK vs OTel SDK 遷移

對應案例 4.C4 X-Ray to OpenTelemetry 與 4.C7 Datadog OTel。子議題：

動機：避免 vendor lock-in、多 backend 並存、開源治理
風險：vendor-specific feature 損失（profiling / RUM 整合）
遷移路徑：dual ship → cutover → cleanup
對應 4.C9 反例：OTel migration signal drift

Resource detection

子議題：

自動偵測 cloud provider（AWS / GCP / Azure）resource attributes
K8s resource detector（pod / namespace / cluster）
Container resource detector
對應配置：OTEL_RESOURCE_ATTRIBUTES

排錯快速判讀

Telemetry 沒到 backend

操作原則：先確認 SDK 配置正確、再看 Collector 是否收到、最後看 exporter 是否成功。

# TODO: set OTEL_LOG_LEVEL=debug to see the SDK's internal logs
# TODO: check the Collector's internal metrics (zPages / Prometheus exporter)

判讀路徑：SDK → Collector → backend、三段各自獨立、要逐層 isolate。

Cardinality explosion

操作原則：metric attribute 含 high-cardinality 值（user_id / session_id）會爆 backend 成本。判讀：看 backend 的 series 數量、找 attribute 來源。
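A quick way to reason about cardinality is to multiply the number of distinct values per attribute, which gives the worst-case series count for one metric. The attribute names and counts below are illustrative assumptions.

```python
import math


def series_upper_bound(attribute_cardinalities: dict[str, int]) -> int:
    """Worst-case number of time series for one metric.

    Every combination of attribute values can become its own series, so the
    bound is the product of the distinct-value counts.
    """
    return math.prod(attribute_cardinalities.values())


bounded = {"http.request.method": 5, "http.route": 40, "http.response.status_code": 6}
with_user_id = {**bounded, "user_id": 100_000}   # high-cardinality attribute added

print(series_upper_bound(bounded))
print(series_upper_bound(with_user_id))
```

Real series counts are usually below this bound because not every combination occurs, but a single unbounded attribute still dominates the product.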

Trace span gap

操作原則：trace 不完整、看 context propagation 是否在跨 service / 跨 thread 邊界丟失。
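When checking whether context survives a service boundary, it helps to inspect the propagated header directly. The sketch below assumes the W3C Trace Context `traceparent` header, which is the default propagator in OTel SDKs (the text above does not name a propagation format). The `parse_traceparent` helper is hypothetical.

```python
import re

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str):
    """Return trace context fields, or None if the header is unusable."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the Trace Context spec.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(parse_traceparent(incoming))
print(parse_traceparent("not-a-traceparent"))
```

Logging the parsed trace ID on both sides of a hop shows where the gap starts: if the downstream service reports a different trace ID or no header at all, propagation was lost at that boundary.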

Auto-instrumentation 不生效

操作原則：確認 SDK 版本跟 library version 對應、agent 啟動方式正確。對應 4.C7 Datadog OTel migration 的踩坑經驗。

Sampling 過頭 / 不足

操作原則：sampling rate 跟 backend 預算 + debug 需求對齊。判讀：debug 時找不到 trace（sampling 過頭）vs backend 成本爆（sampling 不足）。

何時改走其他服務

需求形狀	改走
需要 metrics 後端	Prometheus / Mimir
需要 SaaS APM 整合	Datadog / New Relic
需要 logs 後端	Elastic Stack / Loki
需要 high-cardinality debug	Honeycomb
AWS-native	CloudWatch + X-Ray
GCP-native	Cloud Operations
Error tracking	Sentry

不在本頁內的主題

各語言 SDK 完整 API
OTLP protocol binary format
各 backend 的 OTel 整合細節（見各 backend vendor 頁）
OTel project governance / sig 細節

案例回寫

直接相關案例

案例	主討論議題
4.C4 X-Ray to OTel	從 vendor SDK 遷出 OTel
4.C5 Cloud Trace OTLP	GCP Cloud Trace 接受 OTLP
4.C6 ADOT EKS pipeline	AWS Distro for OTel + EKS
4.C7 Datadog OTel migration	OTLP ingestion / vendor SDK 移轉
4.C9 OTel migration signal drift	（反例）雙軌遷移期的 signal 漂移

跨 vendor 對照

案例	對 OTel 的對應
4.C8 Airbnb K8s scale signals	K8s 規模化下 OTel Collector 拓撲 / 資源訊號分層
4.C10 規模對照	小型直接 SDK / 中型加 Collector / 大型 multi-backend

下一步路由

上游概念：4.17 Telemetry Data Quality
平行 vendor：所有 04 vendor 都可作 OTel backend
下游能力：4.20 Observability Evidence Package

PagerDuty

Fri, 01 May 2026 00:00:00 +0000

PagerDuty 是 on-call / alerting 的事實標準 SaaS、承擔三個責任：alert routing + escalation policy + schedule、incident workflow + response play + runbook automation、postmortem 整合（Jeli 收購）。從 paging 工具演化成完整 IR 平台。

服務定位

PagerDuty 的核心定位是 signal → human → action 的中介層、把 alert source（觀測、SIEM、合成監控、cloud control plane）變成具體某個人手機震動 + 24 小時內可追蹤的 incident timeline。它是 routing engine + on-call schedule 的事實標準、定位有別於 alert source 和溝通平台。

跟上游 07 章的 detection stack 是直接 wire：Splunk ES app 產生的 Notable Event 透過 Splunk-PagerDuty integration 或 SOAR playbook 變成 PagerDuty incident、severity 直接帶過來；Cloudflare WAF 的高分 rate-limit / bot block 透過 webhook 進 PagerDuty Event API v2、再經 Event Orchestration 判斷是丟 SecOps schedule 還是 platform schedule。這條鏈最常壞在 severity 對應不一致（Splunk medium 在 PagerDuty 變 P1）、跟 integration 沒 deduplication key（一次 attack 100 個 Notable Event 各起 100 個 incident）。

跟 Opsgenie / incident.io / Grafana OnCall 的差異在 ecosystem 跟 IR 模型 — PagerDuty 走 enterprise + AIOps + Process Automation 重資料堆疊、incident.io 走 Slack-native + collab-first、Opsgenie 綁 Atlassian、Grafana OnCall 是 OSS 自管。選 PagerDuty 的核心理由通常是 AIOps + Process Automation + Jeli postmortem 整合的 ecosystem maturity、不是 paging 功能本身。

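Key tension: alert volume ↔ responder burnout is the most common trade-off among PagerDuty customers. To avoid missing any alert, teams configure grouping / deduplication so narrowly that almost every alert pages; on-call ends up being woken 20 times a week, and people leave after three months. Decide explicitly how much missed-alert risk you accept in exchange for how much responder sustainability, instead of routing every alert source to PagerDuty as insurance.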

本章目標

讀完本頁、讀者能判斷：

PagerDuty 在 alert pipeline 中承擔哪一段（routing / schedule / incident workflow）、哪些要外接（Slack 通訊、Jeli postmortem、Process Automation 對接 runbook）
Service / escalation policy / schedule 的 ownership 設計（誰建 service、誰改 escalation、誰能 override schedule）
Event Orchestration 的 deduplication / grouping / dynamic routing 設計、跟上游 SIEM 的 severity mapping 一致性
何時用 PagerDuty、何時走 Opsgenie / incident.io / Grafana OnCall 的取捨

本頁不教 PagerDuty console 操作步驟、也不列 pricing tier — 那些 vendor 官方文件已經完整。本頁重點在 判讀問題：怎麼看一個 PagerDuty deployment 健康與否、哪些 config 是 high blast radius、跟上下游（07 detection / 04 observability / Jeli postmortem）怎麼接。

最短判讀路徑

判斷 PagerDuty deployment 是否健康、最少看四件事：

誰能 ack / escalate / resolve：on-call rotation 有沒有人、escalation policy 第二層第三層是不是同一個人、有沒有 break-glass 流程（primary 失聯時誰補位）。schedule override 是否走 PR / approval、還是 console 直改沒留痕。
Escalation policy 設計：每層 escalation timeout（5min / 10min / 15min）是否符合 SLO、是否有 無人 ack 自動上報主管 規則、跨時區 schedule 是否避免半夜 page 給 off-shift 區域
Event Orchestration 設定：alert deduplication key 是否正確（同一 host + 同一 alert type 合併）、grouping rule 是否避免 alert storm、dynamic routing 是否依 service / severity / time 分軌到不同 schedule
SOAR / Process Automation playbook 觸發點：哪些 incident 自動觸發 runbook（restart / rotate token / scale up）、approval gate 是否設在高風險動作、playbook 失敗有沒有 fallback 回 human page

四件事任一缺失、就是 Drills and On-call Readiness 的待補項目。

日常操作與決策形狀

Service / team / escalation

PagerDuty 的 service 對應一個應用 / component、是 incident 的最小 ownership 單位。一個 service 綁一個 escalation policy（N 層、每層 X 分鐘 timeout）、一個 schedule（rotation + override）。production 部署用 Terraform PagerDuty provider 進版控、不在 console 直改 — 因為 schedule / escalation 是高 blast radius config、誤改可能讓半夜 alert 漏掉。Service 通常按 Service Ownership 對齊組織結構、不是按技術 stack 切：把一個微服務 stack 拆成 10 個 service 看似乾淨、但 incident 起來時 responder 要同時 ack 10 個 incident 對 SLO 不利、合理粒度通常是 一個 product team 一個 service。

Event Orchestration + Response Play

Event Orchestration 是 alert → incident 的工程化路由層、處理 deduplication / grouping / dynamic routing 三件事。deduplication 用 dedup_key（同 host + 同 check type 合併、避免 100 個 alert 起 100 個 incident）、grouping 用 time window + tag（同一服務 5min 內多個 alert 合一）、dynamic routing 依 severity / time / service tag 分軌到不同 schedule。Response Play 則是 incident 起來後自動執行的動作 bundle — page additional responder、建 Slack channel、發 status page、call conference bridge。Response Play 應該走 PR review、不能 console 直加 — 一個誤設的 Response Play 可能在每個 P1 自動 page 整個 leadership。
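The dedup rule described above (same host plus same check type collapses into one incident) can be sketched as an event builder. The field names follow the PagerDuty Events API v2 event format. The hashing scheme for `dedup_key`, the `build_trigger_event` helper, and the placeholder routing key are assumptions for illustration. The sketch only builds the payload and does not send it.

```python
import hashlib
import json


def build_trigger_event(routing_key: str, host: str, check_type: str,
                        summary: str, severity: str) -> dict:
    """Build an Events API v2 trigger event with a stable dedup_key."""
    # Same host + same check type always yields the same key, so repeated
    # alerts update one incident instead of opening a new one each time.
    dedup_key = hashlib.sha256(f"{host}:{check_type}".encode()).hexdigest()
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "dedup_key": dedup_key,
        "payload": {
            "summary": summary,
            "source": host,
            "severity": severity,   # critical / error / warning / info
        },
    }


first = build_trigger_event("ROUTING_KEY_PLACEHOLDER", "db-01", "disk_usage",
                            "Disk usage 91% on db-01", "warning")
repeat = build_trigger_event("ROUTING_KEY_PLACEHOLDER", "db-01", "disk_usage",
                             "Disk usage 93% on db-01", "warning")
other = build_trigger_event("ROUTING_KEY_PLACEHOLDER", "db-01", "replication_lag",
                            "Replication lag 40s on db-01", "error")
print(json.dumps(first, indent=2))
```

Which fields go into the key is the actual design decision: adding too few fields merges unrelated problems, and adding volatile fields (timestamps, metric values) defeats deduplication entirely.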

Severity mapping 跟上游一致性

上游 source（Splunk Notable Event / Datadog monitor / Cloudflare WAF alert）的 severity 跟 PagerDuty incident urgency 要 對應表化、不是各自為政。常見錯位：Splunk medium 在 PagerDuty 變成 high urgency（半夜被吵醒）、或 Cloudflare 高分 bot block 進來只標 low（真實 attack 漏報）。實務做法是寫一張 severity translation table 進 Event Orchestration、source severity → PagerDuty urgency 一對一寫死、變更走 PR review。對應 Incident Severity Trigger 的判讀標準。
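A severity translation table of the kind described above can be expressed as a fixed lookup that fails loudly on unmapped values. The mappings below are hypothetical examples, not recommended values. The source labels and the `translate` function are assumptions. The target values use the Events API v2 severities and PagerDuty's high / low urgency.

```python
# (source system, source severity) -> (PagerDuty event severity, incident urgency)
SEVERITY_TABLE = {
    ("splunk", "critical"): ("critical", "high"),
    ("splunk", "high"): ("error", "high"),
    ("splunk", "medium"): ("warning", "low"),
    ("cloudflare_waf", "high_score_bot_block"): ("error", "high"),
    ("datadog", "warn"): ("warning", "low"),
}


def translate(source: str, source_severity: str) -> tuple[str, str]:
    """Look up the mapping; refuse to guess for anything not in the table."""
    try:
        return SEVERITY_TABLE[(source, source_severity)]
    except KeyError:
        raise ValueError(
            f"no severity mapping for {source}/{source_severity}; "
            "add one through review instead of defaulting"
        ) from None


print(translate("splunk", "medium"))
```

Raising on unknown input is a deliberate choice in this sketch: a silent default is how a medium alert ends up paging at night or a real attack ends up as low urgency.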

核心取捨表

取捨維度	PagerDuty	Opsgenie	incident.io	Grafana OnCall
定位	Enterprise IR platform、AIOps + automation	Atlassian 生態 paging	Slack-native IR collaboration	OSS / 自管 OnCall
部署模型	SaaS only	SaaS（Atlassian Cloud）	SaaS only	Self-hosted（Grafana stack）/ SaaS
Alert routing	Event Orchestration（dedup + group + dyn）	Alert policy + integration	Slack-first、簡化 routing	Integrations + routes（OSS 等效）
Schedule	強 — rotation / override / multi-tz	強 — 跟 Jira / Confluence 整合	中 — schedule 較簡化	中 — 基本 rotation
Workflow / Play	Response Play + Process Automation	Atlassian Automation	Slack-driven workflow（強）	基本 webhook
Postmortem	Jeli（收購、深度整合）	Confluence template	內建 postmortem + learning loop	外接
AIOps	Machine Learning alert clustering、PRCC	基本 grouping	無	無
Pricing	Per-user + 按 feature tier、enterprise 貴	按 user、Atlassian bundle 划算	Per-responder、中等	OSS 免費 / Grafana Cloud 按 active
適合場景	Enterprise + 多 service + AIOps 需求	Atlassian 已用 + 預算敏感	Startup / mid-size + Slack-first 文化	OSS-friendly + Grafana stack 已用
退場成本	高 — schedule / policy / Play 量多	中 — Atlassian 內可遷	中 — Slack 工作流綁深	低 — OSS、可帶走 config

選 PagerDuty 的核心訴求：多 service 大組織 + AIOps 對 alert storm 有 ROI + Process Automation 對接 runbook + Jeli postmortem 整合需求。Slack-first 小組直接 incident.io、Atlassian-heavy 走 Opsgenie、預算敏感 OSS 走 Grafana OnCall。

進階主題

Event Orchestration deduplication / grouping：deduplication 跟 grouping 是兩個層次 — dedup 是 同一事件多次發送只算一個（用 dedup_key）、grouping 是 多個相關事件合成一個 incident（用 time window + service / tag）。設定太寬會漏 alert（不同 root cause 被合併、漏報重要事件）、設定太窄會 alert storm。實務做法是 先寬後窄 — 上線初期用較寬 grouping 觀察、再依 false-merge 案例收窄。

AIOps Machine Learning：PagerDuty AIOps 用 ML 做 alert clustering + probable root cause + change correlation — 多個 alert 自動歸成 cluster、推測 root cause、跟近期 deploy / config change 對照。風險是黑箱：ML 把不相關 alert 合一、SOC analyst 看不到原始事件就 ack；或把真實 incident 歸到 noise cluster。production 應該開、但 保留 manual ungroup 機制 + 定期 audit cluster accuracy。

Process Automation + Splunk SOAR 整合：PagerDuty Process Automation（前 Rundeck）做 runbook 自動執行 — restart / scale / rollback / rotate token。對接 Splunk SOAR 形成 incident enrichment + auto-remediation 鏈：Splunk SOAR 在 incident 起來時自動拉 context（user / host / IP recent activity）寫進 PagerDuty incident note、再依 playbook 觸發 PagerDuty Process Automation 做動作。高風險動作（disable account、rotate prod credential）必走 approval gate、不能 fire-and-forget。

Jeli postmortem 整合（2023 收購後）：PagerDuty incident resolve 後可以一鍵 import 進 Jeli、自動帶 timeline / responder list / Slack transcript、開始做 interview + narrative。對應 Jeli vendor — Jeli 走「learning from incident」方法論、不是只生 root cause report、強調 near miss 跟 human factor 也要分析。

Service ownership / Service Standards：PagerDuty Service Standards 把 service 的 escalation policy / runbook link / business criticality / oncall coverage 做成 checklist、organization 可以看哪些 service 沒達標。對 platform team 是治理工具、避免某 service「沒人 oncall 但有 alert source」。配對 Repeated Incident Toil 的反模式：service 沒人 own 但 alert 一直響、最後變 noise 被全部靜音、真實 incident 進來時也漏報。

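Status page integration: PagerDuty incidents can sync automatically to an external status page such as Atlassian Statuspage or Instatus, but automatic sync is a double-edged sword: an internal P1 is not necessarily customer-facing, and a mistaken announcement damages the brand. In practice, sync only incidents with customer-facing severity: have Event Orchestration add a tag (customer_facing: true) that gates the status page update, and publish all other incidents manually.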

排錯與失敗快速判讀

Escalation 漏配 / primary 失聯沒人補：escalation policy 第二層第三層是同一個人、或 off-shift 時無人 ack — 改成跨層異人 + break-glass policy（自動 page manager-on-call）+ 半年 audit
Schedule 跨時區算錯：把 UTC schedule 套到亞太工程師、結果半夜 page off-shift — schedule 用 follow-the-sun rotation、或在 schedule layer 加 time restriction
Event Orchestration deduplication 太寬：不同 root cause 的 alert 被 dedup 成同一 incident、漏報 — 收窄 dedup_key（加 service + alert_type）、保留 manual unmerge
Event Orchestration grouping 太窄：同一事故 100 個 alert 各起 100 個 incident、alert storm、on-call 看不完 — 放寬 time window grouping、或開 AIOps clustering
AIOps ML 黑箱誤合：真實 incident 被歸到 noise cluster、responder 沒看到 — 開 ML cluster audit dashboard、每月 sample review、保留 manual ungroup 機制
Slack notification stale：PagerDuty Slack app token 過期 / channel 改名、incident 通知沒進 Slack — Slack integration health check + fallback channel + on-call 應該收 mobile push 不只看 Slack
Response Play 自動誤觸：Play 設成 P1 自動 page leadership、結果一個 noise P1 把整個 C-level 半夜叫起來 — Play 必走 PR review、defaults to additional engineer not leadership、leadership page 走人工升級

何時改走其他服務

PagerDuty 不是所有 IR 場景都適合：

需求形狀	改走
Atlassian 生態	Opsgenie
OSS / 預算敏感	Grafana OnCall
Slack-first IR	incident.io
Microsoft Teams	FireHydrant
No-code workflow + AI	Rootly
Postmortem only	Jeli
Status page only	Atlassian Statuspage / Instatus

選對需求形狀比選 vendor 重要：startup 一開始走 Slack-native incident.io、規模上來 alert storm 多了再評 PagerDuty AIOps、Atlassian 重度用戶 Opsgenie bundle 划算。

Topics outside this page

Full setup of each integration / pricing details / internals of the AIOps ML algorithms
Concrete playbook implementation for Response Play and Process Automation (Rundeck DSL)
Jeli's narrative + interview workflow (belongs to the postmortem chapter)

Case cross-references

PagerDuty's public customers are mostly large SaaS companies and platforms. The cases below are reading context for how paging design affects detect → ack → mitigate time and how it chains to 07 detection:

Case	Relation to PagerDuty (comparative takeaway)
GitHub cases	Multi-round paging and rotation in large-platform incidents; Event Orchestration grouping design + cross-service escalation
Cloudflare cases	Separate paging tracks for control plane vs data plane; different severities use different schedules + Response Plays
Slack cases	Fallback paging channel when the communication platform itself fails; PagerDuty mobile push is the fallback for Slack-first IR
Datadog cases	Self-paging and external fallback when the observability platform is the incident; AIOps clustering avoids a self-incident alert storm
Microsoft Storm-0558 Signing Key Chain	Splunk Notable Event becomes a PagerDuty incident; SOAR playbook auto-rotates the Azure AD app credential; approval gate sits on the force re-auth action
Snowflake 2024 Credential Abuse	Abnormal query volume enters PagerDuty; Process Automation triggers Snowflake user disable + IP block; Response Play pages legal / customer success in parallel
Microsoft 365 2023 Auth Incident	Auth-chain incident spanning many services; Event Orchestration grouping + dynamic routing concentrate auth alerts on the identity team schedule

Next-step routing

Upstream: Drills and On-call Readiness, Incident Severity Trigger
Parallel: Opsgenie, Grafana OnCall, incident.io
Downstream: Incident Decision Log, Jeli (postmortem handoff)
Cross-category: Splunk (Notable Event source), Cloudflare WAF (WAF alert source)
Official: PagerDuty Documentation

RabbitMQ

Fri, 01 May 2026 00:00:00 +0000

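RabbitMQ is a classic broker that implements the AMQP protocol. It carries three responsibilities: message persistence and retry (durable queue + ack/nack), flexible routing (exchange + routing key + binding), and cross-service task dispatch (worker pool + DLQ). Its design leans toward "processing is a commitment: the broker redelivers, the consumer owns idempotency", so reliability rests on the ack mechanism rather than on replication.

For task queues, worker pools, complex routing, and RPC over messaging, RabbitMQ is the industry mainstream. This page gives the shortest path first, then covers day-to-day publisher / consumer operations and exchange design, and finishes with advanced governance (quorum queue, cluster, federation) and troubleshooting.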

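Chapter goals

After this chapter you should be able to:

Start RabbitMQ + management UI with docker and verify broker health
Create exchanges, queues, and bindings with the CLI / Management API
Choose an exchange type (direct / fanout / topic / headers) that matches the routing requirement
Read queue depth, unacked, and connection / channel counts to locate the failing layer
Evaluate scale-out topics such as quorum queue, stream, federation, and shovel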

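Shortest path: RabbitMQ running in 5 minutes

# 1. Start RabbitMQ with the management plugin (give the broker a few seconds to boot)
docker run -d --name rabbitmq -p 5672:5672 -p 15672:15672 rabbitmq:3-management

# 2. Create exchange / queue / binding (rabbitmqadmin is reproducible; Management UI at http://localhost:15672, default guest/guest)
docker exec rabbitmq rabbitmqadmin declare exchange name=demo.direct type=direct
docker exec rabbitmq rabbitmqadmin declare queue name=demo.q
docker exec rabbitmq rabbitmqadmin declare binding source=demo.direct destination=demo.q routing_key=demo

# 3. Enqueue one message and read it back
docker exec rabbitmq rabbitmqadmin publish exchange=demo.direct routing_key=demo payload=hello
docker exec rabbitmq rabbitmqadmin get queue=demo.q

# 4. Verify broker state with rabbitmqctl
docker exec rabbitmq rabbitmqctl list_queues
docker exec rabbitmq rabbitmqctl list_exchanges
docker exec rabbitmq rabbitmqctl list_bindings

The shortest path verifies that the broker is up, the UI is reachable, and a message can be enqueued and dequeued. Real applications use an AMQP client; see Day-to-day operations.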

Day-to-day operations and decision shapes

CLI and client API

Subtopics:

CLI command reference (rabbitmqctl / rabbitmq-diagnostics / rabbitmqadmin)
Management API shape (HTTP API, suited to automation)
AMQP client configuration: connection / channel / consumer prefetch / publisher confirm
Example command: rabbitmqctl list_queues name messages messages_unacknowledged consumers

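Exchange types and routing design

The exchange is responsible for splitting message flow; each type carries different routing semantics. Subtopics:

Direct: exact routing key match (point-to-point)
Fanout: ignores the routing key and broadcasts to every bound queue
Topic: hierarchical, dot-separated routing key (* matches exactly one word, # matches zero or more words)
Headers: routes on message headers (rarely used)
Commands: CLI and client examples for declaring exchange / queue / binding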
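The topic wildcard rules are easy to misread, so here is a minimal sketch of the matching semantics in plain Python. It is not RabbitMQ's implementation and `topic_matches` is a hypothetical helper name; it only illustrates that `*` consumes exactly one dot-separated word while `#` consumes zero or more.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Return True if routing_key matches an AMQP-style topic pattern."""
    p = pattern.split(".")
    k = routing_key.split(".") if routing_key else []

    def match(i: int, j: int) -> bool:
        if i == len(p):
            # Pattern exhausted: match only if the key is exhausted too.
            return j == len(k)
        if p[i] == "#":
            # '#' may absorb zero or more words.
            return any(match(i + 1, n) for n in range(j, len(k) + 1))
        if j == len(k):
            return False
        if p[i] == "*" or p[i] == k[j]:
            # '*' absorbs exactly one word; literals must be equal.
            return match(i + 1, j + 1)
        return False

    return match(0, 0)
```

A binding of `order.*.created` therefore receives `order.eu.created` but not `order.created`, whereas `order.#` receives both.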

Queue design and ack/nack strategy

Ack/nack is RabbitMQ's delivery control point. Subtopics:

Durable queue vs transient queue
Manual ack vs auto ack (auto ack is equivalent to at-most-once)
Prefetch setting (backpressure + concurrency control)
Dead-letter exchange (DLX) configuration
Message TTL and queue length limit

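Advanced topics (read as needed)

The topics in this section have been expanded into deep articles: classic vs quorum vs stream selection, network partition and cluster consistency, and DLQ retry escalation. The subtopic lists below remain as entry points for choosing what to read.

Classic queue vs Quorum queue vs Stream

Subtopics:

Classic queue: the original persistent queue; mirrored queues are deprecated
Quorum queue: Raft-based, replaces mirrored queues, consistent across nodes
Stream (3.9+): append-only log model, similar to Kafka but still inside the RabbitMQ ecosystem
How to choose among the three (throughput, retention, replay requirements)

Federation and Shovel

Subtopics:

Federation: links upstream / downstream brokers, suited to loosely coupled cross-datacenter setups
Shovel: point-to-point forwarding, suited to plain message relocation
Choosing between them for cross-region / multi-cluster scenarios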

Erlang clustering and network partition

Subtopics:

Cluster topology (disc node, ram node)
cluster_partition_handling strategies (ignore, autoheal, pause_minority)
Split-brain detection and handling

Multiple vhosts / multi-tenancy

Subtopics:

Vhost isolation (namespace, ACL, user permission)
User / Role / Permission design
Per-vhost resource limits (max connections, max queues)

Prefetch and consumer concurrency control

Subtopics:

Effect of prefetch count on throughput / fairness
Channel-level vs consumer-level prefetch
Combining with a retry budget to cap retry pressure

RabbitMQ Cluster Operator (K8s)

Subtopics:

Cluster Operator vs a self-managed StatefulSet
Persistent volumes (PVC) and data protection
Upgrade flow (rolling restart and data integrity)

Plugin mechanism and multi-protocol support

Subtopics:

MQTT plugin (IoT scenarios, bridging device-to-broker)
STOMP plugin
How these protocols bridge to the QoS / ACK mechanics in 3.1 broker basics

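Fast troubleshooting triage

Queue backlog (messages growing, unacked not converging)

Operating principle: first check whether a consumer exists, then compare ack rate with publish rate, and finally look at prefetch / poison messages.

rabbitmqctl list_queues name messages messages_unacknowledged consumers

Triage path: no consumer (client crash) → slow consumer (downstream stuck) → stuck poison message (check the redelivery count of a single message).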
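The triage path above can be sketched as a small classifier over the tab-separated output of that `rabbitmqctl list_queues` command. This is a rough illustration, not an official tool: the function names, the returned labels, and the `backlog_threshold` of 1000 are all assumptions to tune per system. It relies on `messages` being ready plus unacknowledged.

```python
def parse_list_queues(output: str):
    """Parse 'name messages messages_unacknowledged consumers' rows."""
    rows = []
    for line in output.splitlines():
        parts = line.split("\t")
        # Skip banner and header lines that are not four numeric-tailed columns.
        if len(parts) != 4 or not all(p.isdigit() for p in parts[1:]):
            continue
        name, messages, unacked, consumers = parts
        rows.append((name, int(messages), int(unacked), int(consumers)))
    return rows


def triage(messages: int, unacked: int, consumers: int,
           backlog_threshold: int = 1000) -> str:
    """Classify one queue following the no-consumer -> slow-consumer path."""
    if consumers == 0 and messages > 0:
        return "no-consumer"             # client crashed or never connected
    if unacked >= backlog_threshold:
        return "slow-or-stuck-consumer"  # deliveries held without ack
    if messages - unacked >= backlog_threshold:
        return "consumers-saturated"     # ready backlog outpaces consumption
    return "ok"
```

A poison message still needs a per-message look (for example at redelivery counts); queue-level counters alone cannot confirm it.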

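Connection / Channel limit

Operating principle: a poorly designed client can exhaust connections / channels, so check the channel count per connection.

rabbitmqctl list_connections
rabbitmqctl list_channels

Disk alarm triggered

Operating principle: when free disk drops below disk_free_limit, the broker blocks publishers. Likely causes: retention too long, large messages, or an unconsumed queue that has grown too big.

Memory alarm triggered

Operating principle: when memory crosses the watermark, the broker pages messages to disk and throttles or blocks publishers. Triage path: message accumulation, lost consumers, queue misconfiguration.

Network partition (split brain)

Operating principle: cluster nodes cannot reach each other; check cluster_partition_handling and the partition log. See the semantic-mismatch reasoning in 3.C9.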

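When to use a different service

Requirement shape	Use instead
High-throughput event stream, long-term replay	Kafka
Managed queue (AWS ecosystem)	AWS SQS
Managed pub/sub (GCP ecosystem)	Google Pub/Sub
Lightweight messaging + microservices	NATS
Stream inside the Redis ecosystem	Redis Streams
IoT device ingestion	EMQX / HiveMQ / Mosquitto (MQTT brokers, or use the RabbitMQ MQTT plugin)
Workflow + durable execution	Temporal (T4 candidate)

Topics outside this page

Full AMQP client API per language (see official docs)
Details of every plugin (only mainstream plugins are listed)
Detailed comparison of RabbitMQ Streams and Kafka (see the Kafka vendor page)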

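Case cross-references

RabbitMQ-specific cases (C23-C33)

Case	Main topic
3.C23 Bloomberg vhost multi-tenancy	Multiple vhosts + self-service platform
3.C24 SoundCloud fan-out	Splitting an audio processing pipeline across queues
3.C25 Indeed Delay + DLQ	Three-tier retry escalation
3.C26 GoCardless Hutch	A single topic exchange serving the service mesh
3.C27 Zalando AWS	Automatic master selection in the cloud / federation upgrade
3.C28 WeWork hash ordering	Consistent hash exchange / per-key ordering
3.C29 WeWork Bunny channel pool	AMQP channels must not be shared across threads
3.C30 Runtastic mirrored bottleneck	Network cost of mirrored queues
3.C31 Mozilla Pulse	ACL + naming instead of vhosts (the reverse approach)
3.C32 LoyaltyLion monitoring	Monitoring a large queue topology
3.C33 Wargaming game portal	Asynchronous decoupling of game server / portal

Cross-vendor comparison

Case	Mapping to RabbitMQ
3.C9 Counter-example: semantic mismatch	Three-layer responsibility boundary of manual ack + DLX + idempotency
3.C10 Scale comparison	Small: use directly / medium: add idempotency / large: split vhosts

The MQTT plugin and the Cluster Operator lack a direct customer case. RabbitMQ's official native MQTT blog and the K8s Operator docs can fill the gap, and a customer case can be added later if one appears.

Next-step routing

Upstream concepts: 0.3 async selection, 3.1 broker basics
Parallel vendors: Kafka, NATS
Downstream capabilities: 3.2 durable queue, 3.4 consumer design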

Redis

Fri, 01 May 2026 00:00:00 +0000

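Redis is an in-memory data structure store with three responsibilities: cache serving layer (with eviction), data structure operations (string / hash / list / sorted set / stream / hyperloglog / geo), and lightweight persistence (AOF / RDB). Its design leans toward "memory first + rich data types + optional persistence". Caching is the main use, but the data types carry it into session store, counter, leaderboard, and lock scenarios. In 2024 the license changed to RSALv2 / SSPL (not OSI-approved), which triggered the Valkey fork.

For general-purpose caching, session store, rate limit counters, leaderboards, and distributed locks, Redis is the de facto standard. This page gives the shortest path first, then covers day-to-day CLI / API use and key design, and finishes with advanced governance (cluster / persistence / modules) and troubleshooting.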

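Chapter goals

After this chapter you should be able to:

Start Redis with docker and verify it with redis-cli
Operate with SET / GET / EXPIRE / DEL / SCAN (KEYS only on small datasets, because it blocks the server while it walks the whole keyspace) and tell which of the main data types fits which scenario
Design key naming + TTL + eviction policy so that they line up with cache-miss behavior
Read hit rate / memory pressure / eviction / replication lag signals
Evaluate Cluster vs Sentinel, AOF/RDB, modules, and the options after the license change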

Shortest path: Redis running in 5 minutes

# 1. Start Redis
docker run -d --name redis -p 6379:6379 redis:7

# 2. Connect
docker exec -it redis redis-cli

# 3. Verify SET / GET / EXPIRE (typed at the redis-cli prompt)
SET foo bar
GET foo
EXPIRE foo 60
TTL foo

The shortest path verifies that Redis is up and can read, write, and expire keys. For real applications see Day-to-day operations.

Day-to-day operations and decision shapes

CLI and client API

Subtopics:

redis-cli command reference (SET / GET / DEL / EXPIRE / TTL / KEYS / SCAN / MGET / MSET)
Client library configuration: connection pool / timeout / pipeline / cluster mode
Choosing between Pub/Sub and Streams
Example commands: INFO replication, CLIENT LIST, SLOWLOG GET

Key design and data types

Each data type fits a different data shape. Subtopics:

String: cache / counter / config flag
Hash: object cache (avoids repeated serialization)
List: queue / activity feed (small scale)
Set: membership / tag
Sorted set: leaderboard / time-series sliding window
Stream: log-style queue / event stream
HyperLogLog / Geo: approximate count / geographic coordinates

Key naming convention: separate hierarchy levels with `:`, and avoid big keys (a single key over 10KB, or a list longer than 10K elements).

TTL and eviction strategy

TTL and eviction are the core knobs of cache behavior. Subtopics:

Explicit EXPIRE vs setting the TTL with SET EX
maxmemory + maxmemory-policy (allkeys-lru / allkeys-lfu / volatile-lru / volatile-ttl / noeviction)
TTL design: fixed TTL vs dynamic TTL vs no TTL
Commands: CONFIG SET maxmemory 2gb, CONFIG SET maxmemory-policy allkeys-lfu

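Advanced topics (read as needed)

Cluster vs Sentinel

Subtopics:

Sentinel: HA mode without sharding, suited to cases where a single master has enough capacity
Cluster: sharding mode with 16384 hash slots, scales capacity horizontally
Hash tag {...} forces multi-key operations onto the same shard
Impact of Cluster failover on the PEL (Streams) and on distributed locks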
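To make the hash slot and hash tag rules concrete, here is a minimal sketch of the slot calculation (CRC16 of the key, modulo 16384, hashing only the hash tag when one is present). `key_hash_slot` is a hypothetical helper name; the sketch uses Python's `binascii.crc_hqx`, which computes the CRC-CCITT (XMODEM) variant, as a stand-in for the server's own CRC16 routine.

```python
import binascii


def key_hash_slot(key: str) -> int:
    """Compute a Redis Cluster hash slot for a key (sketch)."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty {...} section counts as a hash tag.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % 16384


# Keys sharing a hash tag land in the same slot, so multi-key commands work.
same = key_hash_slot("{user:42}:profile") == key_hash_slot("{user:42}:cart")
```

On a live cluster, `CLUSTER KEYSLOT <key>` reports the authoritative value and is the safer check.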

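AOF / RDB persistence strategy

Subtopics:

AOF (append-only file): fsync policy (always / everysec / no), rewrite
RDB (snapshot): save policy, backup and restore
Hybrid mode: AOF + RDB
Trade-offs of persistence in cache scenarios (is persistence a warm-up aid or the source of truth?)

Eviction policy in detail

Subtopics:

LRU vs LFU: how the access pattern drives the choice
volatile-* vs allkeys-*: evict only keys with a TTL vs any key
Effect of sampling in approximate LRU
See 2.3 TTL eviction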

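Distributed lock

Subtopics:

SETNX + EXPIRE pattern (in practice the single atomic command SET key value NX EX ttl, so a crash between two separate calls cannot leave a lock with no expiry)
Redlock algorithm (multi-master quorum) and the debate over its trade-offs
When Redlock is not enough: fence token / lease renewal
See 2.5 distributed lock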
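The fence token idea is easier to see in a toy model. The sketch below is an in-memory illustration only: `LockService` and `FencedResource` are hypothetical classes, not Redis or Redlock APIs. It assumes the lock service hands out a strictly increasing token with every grant and that the protected resource rejects writes carrying a token older than the newest one it has seen.

```python
class LockService:
    """Hypothetical lock service: every grant gets a larger token."""

    def __init__(self) -> None:
        self._counter = 0

    def acquire(self) -> int:
        self._counter += 1
        return self._counter


class FencedResource:
    """Storage that refuses writes from holders of an expired lease."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.value = None

    def write(self, token: int, value: str) -> bool:
        if token < self.highest_token:
            return False  # stale holder: its lease was superseded
        self.highest_token = token
        self.value = value
        return True


locks = LockService()
store = FencedResource()
token_a = locks.acquire()   # client A gets the lock, then stalls (GC pause)
token_b = locks.acquire()   # lease expires, client B gets the lock
store.write(token_b, "from-b")
late_write_accepted = store.write(token_a, "from-a")  # A wakes up too late
```

The point of the design is that correctness no longer depends on the lock's TTL being longer than every pause; the resource itself filters out the stale writer.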

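Pub/Sub vs Streams

Subtopics:

Pub/Sub: fire-and-forget; subscribers that are offline miss messages
Streams: append-only log, consumer group + PEL
When to replace Pub/Sub with Streams
For Redis Streams details see the Redis Streams vendor page in the 03 messaging module

Redis Modules

Subtopics:

RedisJSON / RedisSearch / RedisTimeSeries / RedisBloom / RedisGraph
Modules are affected by the license change; Valkey forks some of them
Limits on module support in ElastiCache

License change and selection impact

Subtopics:

Scope of the 2024 RSALv2 / SSPL change
Impact on managed services (ElastiCache switching its default to Valkey)
Compatibility path for migrating from Redis to Valkey
Boundary between commercial and OSS

Hot key handling

Subtopics:

Hot key detection (redis-cli --hotkeys, which requires an LFU maxmemory-policy; use MONITOR with care)
Hot key remedies: two tiers of local cache + Redis, key splitting (for read-heavy, write-light cases)
See 2.6 high concurrency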

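Fast troubleshooting triage

Hit rate drops

Operating principle: first check whether the cache pattern changed (new feature / shorter TTL), then check whether origin load has grown.

INFO stats
# compare keyspace_hits with keyspace_misses

Triage path: TTL too short → eviction too aggressive → key naming change causing cache misses → origin failures feeding a retry storm.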
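As a small illustration of the ratio mentioned above, the sketch below parses the text returned by `INFO stats` and computes hits / (hits + misses). `hit_rate` is a hypothetical helper, and the sample string is made up for the example. The counters are cumulative since server start, so for a trend you would diff two samples rather than read one.

```python
from typing import Optional


def hit_rate(info_stats: str) -> Optional[float]:
    """Compute the keyspace hit rate from raw INFO stats output."""
    fields = {}
    for line in info_stats.splitlines():
        line = line.strip()
        # Section headers start with '#'; data lines are 'name:value'.
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, _, value = line.partition(":")
        fields[name] = value
    hits = int(fields.get("keyspace_hits", 0))
    misses = int(fields.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else None


sample = "# Stats\r\nkeyspace_hits:900\r\nkeyspace_misses:100\r\n"
rate = hit_rate(sample)
```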

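Memory pressure / abnormal eviction

Operating principle: first check the maxmemory + maxmemory-policy settings, then the key size distribution.

INFO memory
MEMORY USAGE <key>
redis-cli --bigkeys

Hot key

See case 2.C5 Shopify Write-Through. Triage path: one key's QPS is far above the rest, a single shard's CPU approaches 100%, and the other shards sit idle.

Replication lag

Operating principle: measure the gap between replica and master by comparing master_repl_offset with slave_repl_offset in INFO replication. Compare with 2.C1 Meta Cache Consistency.

Cache stampede (thundering herd)

See counter-example 2.C9 Cache Stampede Rollout. Triage path: TTLs expire at the same moment → mass cache misses → the origin is overwhelmed → cascading failure. Fixes: jittered TTL, early refresh, the singleflight pattern.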
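Of the three fixes, TTL jitter is the simplest to show. The sketch below spreads expirations around a base TTL so that keys written together do not all expire in the same second. `jittered_ttl` and the 10% default ratio are assumptions for illustration; the right spread depends on how long the origin needs to refill the cache.

```python
import random


def jittered_ttl(base_seconds: int, jitter_ratio: float = 0.1,
                 rng: random.Random = random.Random()) -> int:
    """Return base_seconds randomly shifted by up to +/- jitter_ratio."""
    spread = base_seconds * jitter_ratio
    # Never return less than 1 second, even for tiny base values.
    return max(1, int(base_seconds + rng.uniform(-spread, spread)))


# Usage: pass the result as the EX argument when writing the cache entry,
# e.g. SET product:123 <payload> EX <jittered_ttl(300)>.
ttls = [jittered_ttl(300, 0.1, random.Random(seed)) for seed in range(50)]
```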

When to use a different service

Requirement shape	Use instead
Needs an OSI-approved open-source license	Valkey
Pure cache, no data types needed	Memcached
Very high throughput / multi-core	DragonflyDB
Managed in the AWS ecosystem	AWS ElastiCache
Durable, Redis-compatible	AWS MemoryDB (between cache and DB)
Large-scale event stream	Kafka / Redis Streams
Process-local cache	Caffeine / Guava Cache (in-JVM, no network)
Search / full-text	Elasticsearch / OpenSearch (outside this module)

Topics outside this page

Full Redis client API per language
Redis command encyclopedia (see redis.io/commands)
Details of the commercial Redis Stack modules
Internal binary format of AOF / RDB

Case cross-references

Directly related cases

Case	Mapping to Redis
2.C3 Shopify serialization	Shopify's dual-track Marshal → MessagePack migration on Redis; evolution of payload encoding
2.C5 Shopify write-through	Shopify uses Redis write-through on read-heavy paths; maps to hot key / hit rate governance
2.C1 Meta cache consistency	Invalidation / shard move consistency; Redis Cluster and replica scenarios share the same triage framework

Cross-vendor comparison

Case	Mapping to Redis
2.C9 Cache Stampede	A Redis TTL switch or key rename can both trigger a stampede; needs jitter / singleflight / early refresh
2.C10 Scale comparison	Small: single instance + AOF / medium: Sentinel + replica / large: Cluster + hash tag
2.C2 Meta mcrouter	Memcached routing-layer case; the Redis equivalent is Cluster + proxy (Envoy / Twemproxy) or client-side routing
2.C4 Meta CacheLib + Kangaroo	Tiered cache (DRAM + flash) comparison; cost-decision reference for Redis on flash (RoF / Speedb)
2.C6 Netflix EVCache	EVCache is built on Memcached + cross-AZ replication; the Redis equivalent is active-active CRDB / Global Datastore
2.C8 Meta TAO	Graph cache evolution case; the Redis equivalent is RedisGraph (deprecated) or a self-built graph index
2.C7 Cloudflare Cache Reserve	Edge tiered (HTTP cache) comparison; the Redis equivalent is a self-built hot tier + S3 cold tier

Next-step routing

Upstream concepts: 2.2 Cache Aside, 2.3 TTL eviction
Parallel vendors: Valkey, Memcached
Downstream capabilities: 2.5 distributed lock, 2.6 high concurrency

GitHub Actions: Environment Protection and OIDC Cloud Auth

Tue, 23 Jun 2026 00:00:00 +0000

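Problem context

After the test stage ends, reliability verification in a CI pipeline still needs two control planes to be complete. The first is the deploy approval gate: who may approve a production deploy, and under what conditions it is released. The second is credential security: deploys need cloud credentials, but long-lived secrets stored in the CI environment widen the leak surface.

GitHub Actions handles the first with environment protection rules and the second with OIDC federation. Together they let the deploy flow satisfy both the release control of 6.8 release gate and the minimal credential exposure principle of 07 security.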

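Environment Protection Rules

An environment is the deploy tiering unit in GitHub Actions. Each environment (staging / canary / production) can have its own protection rules, so deploys at different risk levels follow different release flows.

Protection rule types

Rule	Responsibility	Typical setting
Required reviewers	A listed person or team must approve before the deploy runs	production lists the service's on-call lead and tech lead (up to 6 reviewers; one approval releases the job)
Wait timer	Forces a wait before deploy so a last-minute stop is possible	production waits 15 minutes
Deployment branch policy	Only specific branches may deploy to the environment	production accepts only main / release/*

Required reviewers are the release gate at the deploy layer. When a workflow job declares environment: production, GitHub pauses the job until a listed reviewer approves. Reviewer choice should follow service ownership: the service's on-call lead or tech lead approves, which keeps approval rights from being either too concentrated or too scattered.

The wait timer provides a buffer window. Waiting N minutes before deploy gives the team time to check staging results, confirm no incident is in progress, or cancel the deploy when a problem is found. Timer length follows the service's risk level: low-risk services can use 0 minutes, transaction paths 15-30 minutes.

The deployment branch policy limits which branches can trigger a deploy to a given environment. This prevents a feature branch from being deployed to production by accident. Production usually accepts only main or release branches.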

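Tiering recommendations

Staging deploys automatically: a push to the staging branch triggers the workflow directly with no approval, which maximizes feedback speed. Production uses required reviewers + wait timer, so every production deploy passes human confirmation and a buffer. Canary sits in between: it can deploy automatically but with a wait timer, so observability metrics have time to react.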

OIDC Cloud Auth

Risk of long-lived credentials

CI deploys need cloud credentials (AWS access key / GCP service account key / Azure service principal). The traditional approach stores them as GitHub repository secrets or environment secrets. The risks of long-lived credentials: an attacker can use them for a long time after a leak, rotation requires manually updating CI settings, and the credential scope is often wider than actually needed.

How OIDC federation works

GitHub Actions can act as an OIDC identity provider. At run time a workflow requests a short-lived OIDC token from GitHub, and the cloud provider, trusting that token, issues a short-lived cloud credential. The whole flow needs no long-lived secret in the CI environment.

Flow: workflow starts → requests a token from the GitHub OIDC provider → the token carries claims such as repo / branch / environment → the cloud provider's trust policy validates the claims → a short-lived credential is issued (typically valid for 1 hour).

Cloud provider configuration

AWS: configure an OIDC identity provider in IAM (issuer: token.actions.githubusercontent.com), then create an IAM role whose trust policy restricts repo + branch + environment. The workflow obtains session credentials with the aws-actions/configure-aws-credentials action.

GCP: set up a Workload Identity Federation pool + provider, then create a service account and bind it to the pool. The workflow obtains a short-lived token with the google-github-actions/auth action.

Azure: configure a federated credential on the app registration in Azure AD, restricted to repo + branch + environment. The workflow uses the azure/login action.

Security boundary of the trust policy

The OIDC trust policy must be restricted to a specific repo, branch, and environment. A wildcard trust policy (trusting every repo in the GitHub org) lets any workflow in any repo of the org obtain cloud credentials. Least privilege: the trust policy for the production environment trusts only repo:org/service:environment:production and no other environment or branch.
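As a sketch of what such a narrowly scoped trust policy can look like on AWS, the helper below builds the policy document as a Python dict. The function name and the sample account / org / repo values are placeholders for illustration; the structure (federated principal, `sts:AssumeRoleWithWebIdentity`, `StringEquals` on the `aud` and `sub` claims) follows the usual shape of an IAM trust policy for the GitHub OIDC provider, and should be checked against current AWS and GitHub documentation before use.

```python
import json

PROVIDER = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(account_id: str, owner: str, repo: str,
                             environment: str) -> dict:
    """Build an IAM trust policy pinned to one repo and one environment."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{PROVIDER}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                # StringEquals (not StringLike) so no wildcard can widen it.
                "StringEquals": {
                    f"{PROVIDER}:aud": "sts.amazonaws.com",
                    f"{PROVIDER}:sub":
                        f"repo:{owner}/{repo}:environment:{environment}",
                }
            },
        }],
    }


policy = github_oidc_trust_policy("123456789012", "org", "service", "production")
policy_json = json.dumps(policy, indent=2)
```

Generating one policy per environment keeps the staging and production roles from ever sharing a `sub` condition.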

Implementation example

# .github/workflows/deploy.yml
name: Deploy
on:
  push:
    branches: [main]

# id-token: write lets the job request an OIDC token from GitHub
permissions:
  id-token: write
  contents: read

jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/staging-deploy
          aws-region: ap-northeast-1
      - run: ./scripts/deploy.sh staging

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    # Pauses here until the environment's protection rules are satisfied
    environment: production
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/production-deploy
          aws-region: ap-northeast-1
      - run: ./scripts/deploy.sh production

The staging job triggers automatically. The production job waits for staging to finish, then pauses until a reviewer configured in the environment protection rules approves. The two jobs use different IAM roles, so their scopes are separated.

Environment secret vs repository secret: an environment secret is available only to jobs in that environment. Store production-only settings (such as a database connection string) as production environment secrets rather than repository secrets, so a staging workflow cannot reach production resources by accident.

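Boundaries and pitfalls

Environment protection rules on private repositories depend on the plan. Public repositories can use required reviewers and wait timers on every plan, while Free-plan private repositories cannot. Which paid plan unlocks them has varied: GitHub's documentation has listed required reviewers and wait timers for private repositories under GitHub Enterprise, with environments and deployment branch rules available on Pro and Team, so confirm against the current plan documentation before designing around them.

A common OIDC trust policy mistake is a subject claim that is too broad. The sub claim has the form repo:{owner}/{repo}:environment:{name} when the job uses an environment, or repo:{owner}/{repo}:ref:refs/heads/{branch} when it does not. A wildcard match, or omitting the environment restriction, lets unintended workflows obtain credentials.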
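The difference between a tight and a loose subject condition can be shown with a few lines of pattern matching. This sketch uses Python's `fnmatchcase` as an approximation of wildcard (`StringLike`-style) matching; it is not the cloud provider's evaluator, and `sub_allowed` plus the sample repo names are hypothetical.

```python
from fnmatch import fnmatchcase


def sub_allowed(pattern: str, sub: str) -> bool:
    """Approximate wildcard matching of an OIDC sub claim against a pattern."""
    return fnmatchcase(sub, pattern)


strict = "repo:org/service:environment:production"
loose = "repo:org/*"

prod_sub = "repo:org/service:environment:production"
feature_sub = "repo:org/service:ref:refs/heads/feature-x"
other_repo_sub = "repo:org/other-repo:ref:refs/heads/main"

results = {
    "strict/prod": sub_allowed(strict, prod_sub),
    "strict/feature": sub_allowed(strict, feature_sub),
    "loose/feature": sub_allowed(loose, feature_sub),
    "loose/other-repo": sub_allowed(loose, other_repo_sub),
}
```

With the loose pattern, a feature branch and even an unrelated repository in the same org both pass, which is exactly the exposure the paragraph above warns about.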

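Align the wait timer with the service's risk level. A uniform 30-minute wait timer for every service slows the deploy velocity of low-risk services. A reasonable alignment: low-risk services 0 minutes, medium risk 5-10 minutes, high risk (transaction paths) 15-30 minutes.

Size the required reviewer list to the team. GitHub lets an environment list up to six reviewers and only one of them has to approve, so the list controls who may approve, not how many approvals are collected. A single listed reviewer becomes a bottleneck and a single point of failure; a very long list dilutes ownership. Listing 2-3 reviewers per service is a workable balance for most teams, and enabling the prevent self-review option keeps the person who triggered the deploy from approving it, which is what actually provides the four-eyes principle.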

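Integration routing

Upstream: 6.1 CI pipeline (the deploy stage starts only after the CI gate passes)
Downstream: 6.8 release gate (environment protection is the release gate at the deploy layer)
Downstream: 6.23 verification evidence handoff (deploy results serve as release evidence)
Parallel: CircleCI contexts + approval jobs (a different implementation of the same capability)
Case cross-references: Microsoft change tiering (change risk tiers map to environment tiers), Google Error Budget (raise the gate threshold as the error budget burns, for example by tightening the reviewer list or lengthening the wait timer)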

Auth0

Mon, 18 May 2026 00:00:00 +0000

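Auth0 is the representative Customer Identity Cloud option. It carries three responsibilities: hosting the user login flow for B2C / B2B apps, acting as token broker for social and enterprise connections, and storing user profiles and metadata. Once a product hands login to Auth0, the trust boundary shifts from "my app manages its own password table" to whether three things are healthy: tenant configuration, Action hook code, and hosted signing keys. In 0.22 capability-level buy vs build, authentication is the typical commodity to buy, and Auth0 is its feature SaaS (dev-tool side) example. Whether to buy and how deep to outsource is covered in the outsourcing depth card.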

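Service positioning

Auth0 is the control plane for customer identity, not employee SSO (employees use Okta Workforce or AWS IAM Identity Center). Auth0 was acquired by Okta in 2021 and now belongs to the "Customer Identity Cloud" product line, but it and Workforce Okta are the same company with different control planes: tenant clusters, incident distribution, and signing key custody paths are all separate, and the Okta Workforce incidents (2022 Sitel, 2023 support system HAR) did not directly hit Auth0 customers.

Compared with self-managed Keycloak, Auth0 hosts the Universal Login UI, prebuilt social connections, the Rules / Action runtime, and attack protection. The price is SaaS billing: token issuance and login attempts are metered, so a high-traffic B2C deployment that does not block credential stuffing pays for it. Compared with AWS Cognito / Firebase Auth, Auth0's core advantages are a developer-first tenant experience, prebuilt social connections (dozens, including Google / Facebook / Apple / Microsoft), and customization through JS Action hooks.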

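Chapter goals

After this page the reader can judge:

Which part of customer identity control Auth0 should own (login flow / token broker / profile store / B2B Organizations) and which part should stay in the app
The trust boundary of an Auth0 tenant and its minimum audit needs (admin roles, Management API tokens, Action code, connection settings)
The degradation path when Auth0 traffic fails or the parent company has an incident (fallback connection, token rotation, anomaly throttle)
When to use Auth0 and when to choose Cognito / Firebase Auth / Keycloak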

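Shortest triage path

To judge whether an Auth0 tenant is healthy, check at least four things:

Who can do what: Dashboard admins, the owner and scope of each Management API token, whether Actions go through code review, and whether tenants (dev / staging / prod) are separated with independent authorization
Where the credentials live: scope and TTL of Management API tokens / M2M clients, where social connection client secrets are stored, the rotation cadence of the (per-tenant) signing key, and whether Custom Domain is enabled (so the token issuer does not expose a *.auth0.com domain)
How the entry point is exposed: whether login uses Universal Login (hosted UI) or Embedded Login (inside your own app), whether Cross-Origin Authentication is on, and how strongly Attack Protection is configured (bot detection / brute-force / breached password / suspicious IP throttling)
Whether evidence can be traced back: whether the Tenant Log syncs to a SIEM (Log Stream to HTTP / Datadog / Splunk), whether login failures / Action exceptions / Management API changes raise alerts, and whether retention meets compliance requirements

If any of the four is missing, it is an open item on the Audit Log and Authentication boundary.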

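Day-to-day operations and decision shapes

Tenant and environment separation: an Auth0 tenant is a logically isolated unit of a multi-tenant SaaS, not a physical cluster. Give each environment (dev / staging / prod) its own tenant, so that a dev Action bug cannot hit prod traffic and a shared client secret cannot leak across environments. Sync configuration between tenants with auth0-deploy-cli and keep Action code in version control.

Connection design: Database Connection (Auth0 hosts the credential store) and Social / Enterprise Connection (OIDC / SAML federation to Google / Microsoft / Okta) are two different sources. The decision point is whether the user should enter the Auth0 profile store: pure federation stores no password, while a pure Database Connection means Auth0 manages the credential table for the app. When mixing them, be clear about the merge rules for primary identity and linked accounts.

Risk of Action / Rule hooks: Action (new framework) and Rule (old framework) let a tenant admin inject JS code into the login pipeline (pre / post login, M2M, send email, and so on). This is Auth0's strength and also its largest supply chain attack surface: an Action can require() npm packages, and a malicious dependency runs in every login flow. Pin dependency versions, require code review, use the least-privileged Management API scope, and scan dependency CVEs regularly (same mindset as the red team supply chain case).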

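Universal Login vs Embedded Login: Universal Login hosts the login UI on the Auth0 domain (or a Custom Domain); the user is redirected there, logs in, and is redirected back to the app, so Auth0 bears the cost of defending against phishing / CSRF. Embedded Login places the login form inside your own app and calls the /co/authenticate endpoint. The UX looks smoother, but you must defend against XSS, CSRF, CORS issues, and credential leaks yourself, and you must enable Cross-Origin Authentication (an extra attack surface). Default to Universal Login; enable Embedded only when the UX need is strong and you can carry the security cost.

Management API token / M2M client: the Management API controls the whole tenant (create users, change client secrets, change Action code). Tokens should not live long-term in code or CI. Use an M2M Application (client credentials grant) to obtain short-lived tokens, shrink the scope to the minimum (read:users ≠ update:users ≠ update:actions), and fetch credentials through Secret Management.

Attack Protection configuration: B2C traffic is large, and because login attempts are billed they are themselves an attack surface. Enable all four mechanisms: Brute-force Protection (many failures from one IP lock the user), Suspicious IP Throttling (many failures from one IP throttle the IP), Breached Password Detection (known-leaked passwords are rejected), and Bot Detection (CAPTCHA / risk score). Otherwise credential stuffing both burns budget and raises the odds of account takeover.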

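Break-glass and fallback: B2C has no concept of an "employee backup admin"; break-glass means ensuring users can still log in while Auth0 is temporarily unavailable. Common approaches: the app tolerates temporary Auth0 failures, offers an alternative login path via magic link / email OTP (through an independent ESP), or issues long-TTL refresh tokens in advance to ride out short outages. On the tenant management side, keep at least 2 independent admins with credentials stored offline.

Audit / handoff: push the Tenant Log to a SIEM through Log Stream and alert on three event classes: Management API changes to Actions / Connections / Clients (supply chain), sudden spikes in login anomalies (credential stuffing), and records of support impersonation / Auth0 employee access to the tenant (control plane).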

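Core trade-off table

Trade-off dimension	Auth0	AWS Cognito	Firebase Auth	Self-managed Keycloak
Control plane responsibility	Auth0 hosts issuer / signing / Action runtime	AWS-hosted, bounded by AWS account trust	Google-hosted, tied to Firebase / GCP	You run issuer, keys, HA, support
Social connection	Dozens prebuilt, complete UI / token broker	Facebook / Google / Amazon / Apple built in; others via generic OIDC / SAML	Google / Apple / Facebook prebuilt, others self-integrated	Generic OIDC / SAML, specific providers self-configured
Customization	Strong JS Action hooks, highly customizable Universal Login	Lambda Triggers, limited UI customization	Cloud Function Triggers, moderate UI customization	Anything, you own the code
Billing model	Monthly active users (MAU) + B2B Organizations + paid advanced features	MAU tiers, plus other AWS resource costs	MAU + SMS / phone auth billed separately	Self-managed infrastructure cost
Cost cliff	Large MAU, credential stuffing, Adaptive MFA surcharge	Complex Cognito Identity Pool federation scenarios	Usually cheap, but phone auth cost is noticeable	Operations cost at scale (HA, DR, cert, upgrade)
Best fit	B2C / B2B SaaS, needs social login, developer-first	AWS-heavy backend, no need for social breadth	Mobile-first, inside the Firebase ecosystem	Sovereignty / self-management requirements, SaaS IdP not acceptable
Exit cost	Medium-high: users / password hashes exportable, Actions must be rewritten	Medium: user pool profiles exportable (password hashes are not, so users reset or migrate on login), policies rewritten	Medium: Firebase users exportable	Low: you own everything

Core reasons to choose Auth0: customer identity + many social / enterprise connections + developer-customized login flow, while accepting SaaS billing and third-party control plane risk and being able to invest in SIEM, Action code governance, and attack protection configuration.

The Microsoft ecosystem (Entra External ID / formerly Azure AD B2C) is another B2C / B2B option that this table does not list as a main competitor. It is a reasonable choice inside organizations heavy on M365 / Azure, but its breadth of prebuilt social connections and its developer-centric tenant experience still trail Auth0. Organizations that are M365-heavy with B2C needs can evaluate the External ID product line of Entra ID in parallel.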

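Advanced topics

Supply chain governance for Action / Rule: Action code lives in version control, goes through PR review, and is deployed with auth0-deploy-cli. Pin the npm dependencies an Action references, avoid ^ / ~ ranges, and run SCA in CI to scan for CVEs. New Actions get a read-only scope by default; write operations require a separate upgrade. Action secrets (OAuth credentials, API keys) go through Action Secret management and are never hard-coded.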
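The "pin, no ^ / ~" rule is easy to enforce mechanically in CI. The sketch below flags any dependency whose version spec is not an exact `major.minor.patch`. It assumes the Action's dependencies have already been extracted into a plain name-to-version mapping; that mapping, the function name, and the sample packages are illustrative and not part of any Auth0 tooling.

```python
import re

# Exact versions only: 1.2.3 passes; ^1.2.3, ~1.2.3, 1.x, latest do not.
EXACT_VERSION = re.compile(r"^\d+\.\d+\.\d+$")


def unpinned_dependencies(deps: dict) -> list:
    """Return the sorted names of dependencies that are not pinned exactly."""
    return sorted(
        name for name, spec in deps.items()
        if not EXACT_VERSION.match(spec.strip())
    )


action_deps = {
    "axios": "1.6.8",
    "lodash": "^4.17.21",
    "jsonwebtoken": "~9.0.0",
    "uuid": "latest",
}
offenders = unpinned_dependencies(action_deps)
```

A CI job can fail the pull request whenever the returned list is non-empty, which keeps the review burden on the person introducing the range.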

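B2B Organizations: Auth0 Organizations logically isolates multiple customers (B2B) inside one tenant; each organization has its own connections, branding, and members. The design points are whether a user is an organization member or a tenant-wide user, and whether admins who operate across organizations are bound by an organization scope. Isolation between organizations is a logical layer inside the tenant on a shared control plane and must not be treated as equivalent to separate tenants.

Adaptive MFA / Step-up Authentication: Auth0 Adaptive MFA uses device / location / behavioral signals to raise the MFA requirement dynamically (impossible travel, new device, low-trust IP). It is a paid add-on that in essence builds in risk-based authentication. For B2C it is friendlier than forcing MFA on every user, but the risk threshold and the tolerance for false positives must be set explicitly, or legitimate users who are challenged repeatedly will churn.

Custom Domain: the default login domain is a *.auth0.com subdomain, which reveals both the use of Auth0 and the tenant name, and the issuer is an Auth0 subdomain. Custom Domain changes the issuer to your own domain (such as login.example.com), so users see a consistent URL and a phishing page is easier to tell apart. It is a paid feature and should be on by default for production apps.

Attack surface of Cross-Origin Authentication: Embedded Login must enable Cross-Origin Authentication so the app domain can call Auth0's /co/authenticate directly. The risks are token theft through XSS, forged login through CSRF, and silent auth breaking when third-party cookie policies change. Universal Login does not need it, so those risks do not apply, and that is the core reason Universal Login is recommended.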

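Troubleshooting and fast failure triage

Management API tokens scattered / over-privileged: CI and backend services each store a token, all scoped update:users / update:actions. Fix: M2M Application + minimal scope, regular rotation, central retrieval through Secret Management.
Action requires an unpinned npm package: the login flow picks up whatever version is newest and a malicious dependency runs directly. Fix: pin versions, code review, regular CVE scans.
Login attempts spike / billing spikes: Attack Protection is off or its thresholds are too loose, and credential stuffing eats the quota. Fix: enable Bot Detection, Brute-force Protection, Suspicious IP Throttling, combined with Anomaly Detection.
Embedded Login without XSS control: one XSS in your app and tokens are stolen. Fix: move to Universal Login, or add strict CSP / DOM protections and regular pen tests.
Tenant Log not in the SIEM: events exist only in the Dashboard and cannot be correlated across systems. Fix: configure Log Stream into the SIEM and wire specific events to an alert runbook.
No Custom Domain: phishing pages are harder to tell apart and the issuer exposes the vendor. Fix: configure Custom Domain, with a self-managed or Auth0-managed TLS cert.
B2B Organizations lacks scope limits: admin tooling is not scoped per organization, so one compromised admin spreads across organizations. Same lesson as Okta Cross-Tenant 2023.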

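When to use a different service

Requirement shape	Use instead
Employee SSO / Workforce identity	Okta vendor / AWS IAM Identity Center
Self-managed / SaaS IdP not acceptable	Keycloak vendor
AWS-only application	AWS Cognito
Firebase / mobile-first ecosystem	Firebase Authentication
Cloud resource permissions (non-human identity)	AWS IAM / Google IAM / Azure RBAC
Event detection (cross-system)	7.13 detection coverage and signal governance
Secret / API key governance	7.6 secret management and machine credential governance

Topics outside this page

Full details of Auth0's OIDC / OAuth2 specification support
Full Action / Rule API and trigger list
Full B2B Organizations schema and SDK integration tutorial
Detailed feature comparison of Auth0 pricing tiers
OAuth app registration steps for each social connection provider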

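Case cross-references

Auth0 has no direct case in 07 (incidents at the parent company Okta did not directly hit Auth0 customers), so the cases below are cited by comparison to extract lessons for Auth0 customers. The absence of a direct case does not mean the vendor carries no risk: Auth0 has not disclosed a major vendor-level incident since the 2021 Okta acquisition, but historical incidents at comparable SaaS IdPs (the Okta group, signing key custody, credential stuffing) are all foreseeable risk surfaces for Auth0 customers, and controls should not wait for the first incident:

Case	Relation to Auth0 (comparison)
Okta Support System Incident 2023	Parent-company Workforce incident, Auth0 customers not directly affected; lesson: when signing keys are hosted, break-glass and an alternative login path are mandatory
Failure: Credential Rotation Without Scope	Rotation of Management API tokens / connection client secrets must be segmented; multiple tenants / connections cannot share one credential
Cloudflare 2023 Okta Token Follow-Through	Customer-side token rotation cadence after an upstream IdP incident; Auth0 customers should rotate Management API tokens proactively rather than wait for a vendor notice
Uber 2022 MFA Fatigue	Design goal of Auth0 Adaptive MFA / step-up: require a phishing-resistant factor for high-risk actions and avoid plain push fatigue
Red team supply chain case	Supply chain attack surface of npm dependencies referenced by Action / Rule; same mindset as the build pipeline, but it happens in the login flow

Next-step routing

Upstream: 7.2 identity and authorization boundary, 7.13 detection coverage and signal governance
Parallel: Okta vendor, Keycloak vendor, AWS IAM Identity Center
Downstream: AWS IAM / Google Cloud IAM (the cloud resource permission layer after Auth0 authentication)
Cross-module: 8 incident response vendor list (how Auth0 anomalies route into the IR flow)
Official: Auth0 Documentation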

AWS Secrets Manager

Mon, 18 May 2026 00:00:00 +0000

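AWS Secrets Manager is AWS's native service for central custody of static secrets. Its core capability is storing secrets encrypted with KMS, adding built-in rotation Lambdas (for RDS / Redshift / DocumentDB) and a two-layer grant of Resource Policy + IAM Policy, which locks the secret lifecycle inside the AWS account / IAM boundary. Its design trade-offs differ from Vault: Secrets Manager does no dynamic credentials, no transit encryption, and no internal PKI; it takes the single path of static secret + AWS-native DB rotation as far as it goes.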

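Service positioning

Secrets Manager is the static secret control plane for AWS-only workloads. It overlaps with SSM Parameter Store SecureString at the "store a secret" layer, but the design purposes differ. Parameter Store is parameter management (standard parameters are free, advanced parameters cost about $0.05 per parameter per month, KMS-encrypted but without staging labels or rotation Lambda). Secrets Manager is secret management ($0.40 per secret per month + API calls, with staging labels / rotation Lambda / Resource Policy / Cross-Region Replica). The price gap is 8x or more, and the choice hinges on whether you need rotation and cross-account sharing.

Compared with Vault, Secrets Manager is single-cloud, simple, and low-maintenance, while Vault is cross-cloud, dynamic-credential, and highly expressive. An AWS-only organization running Vault takes on the operations cost of an HA cluster just to get a KV engine and RDS rotation, which is poor ROI. Conversely, a cross-cloud organization on Secrets Manager ends up with a separate secret store per cloud and a broken governance chain. Compared with Google Secret Manager / Azure Key Vault, the design philosophy is similar (cloud-vendor managed, KMS encryption, IAM authorization) but rotation differs per vendor: Secrets Manager uses a built-in four-step Lambda flow, GSM uses Pub/Sub events to trigger a Cloud Function you write, and Azure uses Key Vault rotation policy + Event Grid.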

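Chapter goals

After this page the reader can judge:

Which secrets belong in Secrets Manager, which can move down to Parameter Store, and which should use Vault's dynamic credentials
How the two-layer grant model (Resource Policy + IAM Policy) and KMS encryption key custody fit together
The design boundary between built-in rotation and a Custom Rotation Lambda, and the role of staging labels in zero-downtime rotation
When Secrets Manager is no longer enough and it is time to move to Vault / a cross-cloud broker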

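Shortest triage path

To judge whether a Secrets Manager deployment is healthy, check at least four things:

Who can call GetSecretValue: does the IAM Policy restrict secretsmanager:GetSecretValue to specific secret ARNs (not *), does the Resource Policy allow only specific principals (not Principal: *), and is cross-account sharing narrowed with ABAC tags
KMS key custody: is the secret encrypted with the AWS managed key (aws/secretsmanager) or a customer-managed key (CMK). Production should use CMKs throughout, with a key policy that limits use to calls made through Secrets Manager (for example with the kms:ViaService condition), and the KMS key owner should be separate from the secret owner
Rotation settings: is rotation on, what is the interval, what is the Lambda's historical success rate, and do staging labels promote in order during rotation (AWSPENDING → AWSCURRENT → AWSPREVIOUS)
CloudTrail coverage: GetSecretValue is recorded as a read-only management event, so it is visible only if the trail logs Read management events and retention extends beyond the default event history. A trail that keeps write events only shows CreateSecret / UpdateSecret but not who read the secret during an incident

If any of the four is missing, it is an open item on the Secret Management and Audit Log boundary.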

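Day-to-day operations and decision shapes

Two-layer grant of Resource Policy + IAM Policy: Secrets Manager follows the same model as an S3 bucket policy. The IAM Policy controls what the principal may do, the Resource Policy controls whom the secret admits, and cross-account access needs both to agree. A common misconfiguration: the Resource Policy uses Principal: "*" plus an aws:SourceAccount condition to attempt cross-account sharing, and when the condition is missing or wrong the secret becomes publicly readable. For cross-account sharing always list the principal explicitly, such as Principal: arn:aws:iam::123456789012:role/AppRole, and do not rely on wildcard + condition for isolation.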
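A short sketch of the "explicit principal, never wildcard" rule: one helper builds a resource policy that names each reader role, and another flags any policy whose principal is `*`. Both function names and the sample ARNs are illustrative assumptions, and the lint covers only the simple principal shapes shown here, not every form a policy document can take.

```python
def secret_resource_policy(reader_role_arns: list) -> dict:
    """Resource policy that admits only the listed role ARNs."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": sorted(reader_role_arns)},
            "Action": "secretsmanager:GetSecretValue",
            # In a secret's resource policy, "*" refers to the secret itself.
            "Resource": "*",
        }],
    }


def has_wildcard_principal(policy: dict) -> bool:
    """Flag statements whose principal is '*' in any of the simple forms."""
    for statement in policy.get("Statement", []):
        principal = statement.get("Principal")
        if principal == "*":
            return True
        if isinstance(principal, dict):
            aws = principal.get("AWS")
            values = aws if isinstance(aws, list) else [aws]
            if "*" in values:
                return True
    return False


good = secret_resource_policy(["arn:aws:iam::123456789012:role/AppRole"])
bad = {"Statement": [{"Effect": "Allow", "Principal": "*",
                      "Action": "secretsmanager:GetSecretValue",
                      "Resource": "*"}]}
```

Running the lint in the pipeline that applies resource policies catches the wildcard case before a missing condition can expose the secret.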

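Fine-grained IAM Policy authorization: secretsmanager:GetSecretValue should be limited to specific secret ARNs (not *), combined with an ABAC tag condition (secretsmanager:ResourceTag/team = payments) to shrink the blast radius. See CircleCI 2023 Secrets Rotation: when CI is compromised you need to list, by tag, every secret a CI runner could read; without tags the only option is to rotate everything blindly.

Choose a CMK, not the default KMS key: each secret is encrypted with one KMS key, the default is the AWS managed key aws/secretsmanager, and production should switch to a customer-managed key (CMK). The difference is who controls the key policy. The AWS managed key's policy is fixed by AWS and lets any principal in the account that is authorized for Secrets Manager use it; a CMK's key policy can be locked down so that only calls through Secrets Manager and only specific roles may Decrypt. Comparative takeaway from Storm-0558: a key's blast radius comes from its key policy, and writing a narrow policy on a CMK is the key action for reducing it.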

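Built-in rotation Lambda covers AWS-native databases only: the built-in rotation templates cover RDS (PostgreSQL / MySQL / MariaDB / Oracle / SQL Server) / Aurora / Redshift / DocumentDB. You take the AWS-provided Lambda template and set a rotation schedule (AWS documents intervals from 4 hours up to 1000 days), and Secrets Manager triggers it on schedule. Other databases (self-hosted PostgreSQL, MongoDB Atlas, Snowflake) or API keys need a Custom Rotation Lambda that follows the four-step state machine: createSecret (generate the new credential and store it as AWSPENDING), setSecret (write the new credential to the target system), testSecret (verify connectivity with the new credential), finishSecret (promote AWSPENDING → AWSCURRENT). If any step fails, AWSCURRENT stays on the old version because the label moves only in finishSecret, and Secrets Manager retries the rotation; whether the old credential still works on the target depends on what setSecret has already changed.

Staging labels (AWSCURRENT / AWSPENDING / AWSPREVIOUS): a staging label is a pointer to a version. Apps always call GetSecretValue without VersionStage and receive AWSCURRENT. During rotation Secrets Manager first labels the new credential AWSPENDING, promotes it to AWSCURRENT after testSecret passes, and demotes the old one to AWSPREVIOUS. The design intent is zero-downtime rotation, but it only works if the old credential stays valid for a while: at the moment rotation completes some app instances still hold the old credential, so the target system should accept both AWSCURRENT and AWSPREVIOUS (the DB rotation templates keep the old user for a period). See Failure: Credential Rotation Without Scope: with no scope map and an AWSPREVIOUS window that is too short, a long-tail batch job holding the old credential simply fails.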
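The label movement across the four steps can be modelled in memory. The class below is a toy simulation, not the Secrets Manager API or a rotation Lambda: `SecretVersions` and its methods are hypothetical, setSecret and testSecret are omitted because they act on the target system, and the idempotent `create_secret` reflects the assumption that any step may be retried.

```python
class SecretVersions:
    """Toy model of one secret's versions and staging labels."""

    def __init__(self, version_id: str, value: str) -> None:
        self.values = {version_id: value}
        self.labels = {"AWSCURRENT": version_id}

    def create_secret(self, version_id: str, value: str) -> str:
        # Idempotent: a retry must reuse the pending version, not mint another.
        if "AWSPENDING" in self.labels:
            return self.labels["AWSPENDING"]
        self.values[version_id] = value
        self.labels["AWSPENDING"] = version_id
        return version_id

    def finish_secret(self) -> None:
        # Only here do callers without VersionStage start seeing the new value.
        pending = self.labels.pop("AWSPENDING")
        self.labels["AWSPREVIOUS"] = self.labels["AWSCURRENT"]
        self.labels["AWSCURRENT"] = pending

    def get(self, stage: str = "AWSCURRENT") -> str:
        return self.values[self.labels[stage]]


secret = SecretVersions("v1", "old-password")
secret.create_secret("v2", "new-password")
retried = secret.create_secret("v3", "another-password")  # retry of step 1
before_finish = secret.get()
secret.finish_secret()
after_finish = secret.get()
```

Until `finish_secret` runs, readers of AWSCURRENT keep receiving the old value, which is why a failure in an earlier step does not disturb running applications.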

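Cross-Region Replica: a multi-region app replicates the secret to other regions. The replica has its own ARN in the replica region and is encrypted with a KMS key in that region, so KMS access must be set up per region. The replica is a read copy: writes and rotation happen only in the primary region, and after rotation the new version syncs to replicas automatically (with a delay of seconds). On failover the app reads the replica region ARN directly and needs no cross-region call.

Cross-Account Sharing: sharing a secret across accounts uses two-way authorization through the Resource Policy and the other account's IAM Policy. The Resource Policy lists the specific role ARN in the other account, and that role's IAM Policy grants GetSecretValue on the secret's ARN. The KMS key also needs cross-account authorization (the key policy grants Decrypt to the other role). If the KMS grant is missing, the secret policies look correct yet GetSecretValue still fails with a KMS access-denied error, which is confusing to debug.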

核心取捨表

Dimension	AWS Secrets Manager	SSM Parameter Store SecureString	Vault	Google Secret Manager	Azure Key Vault
Deployment model	AWS managed	AWS managed	Self-managed cluster	GCP managed	Azure managed
Multi-cloud	Weak (tied to AWS)	Weak (tied to AWS)	Strong	Weak (tied to GCP)	Weak (tied to Azure)
Approximate list price	~$0.40 per secret per month + API calls	Standard tier free; advanced ~$0.05 per parameter per month	Self-hosting cost	~$0.06 per active version per month + API calls	~$0.03 per 10k operations
Built-in rotation	Built-in Lambda for RDS / Redshift / DocumentDB	None	Dynamic engines issue short-lived credentials automatically	None built in	Key Vault rotation policy (mainly for keys)
Staging label	AWSCURRENT / AWSPENDING / AWSPREVIOUS	None; uses version numbers	KV v2 versions	Versions	Versions
Cross-account sharing	Resource policy + IAM	Limited (advanced-tier parameters via AWS RAM; otherwise same account only)	Vault namespace + policy	IAM across projects	RBAC across tenants
Dynamic credentials	None (a rotation Lambda swaps one static credential for another)	None	Yes (DB / cloud / SSH engines)	Weak (IAM impersonation)	Weak (Managed Identity)
Best fit	AWS-only, static secrets, mainly RDS rotation	AWS-only, large volume of low-sensitivity config, no rotation needed	Multi-cloud, dynamic credentials, internal PKI	GCP-only where Workload Identity already dominates	Azure-only where Managed Identity already dominates
Exit cost	Low	Low	Medium	Low	Low

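The core case for Secrets Manager: AWS-only, most secrets are static or AWS-native database credentials, cross-account sharing or rotation Lambdas are needed, and the team does not want (or lacks the capacity) to run Vault. If the goal is only to store config (feature flags, non-sensitive endpoints), Parameter Store is free at the standard tier and roughly 8x cheaper per item at the advanced tier (about $0.05 versus $0.40 a month). If the requirement is multi-cloud plus dynamic credentials, transit encryption, or PKI, only Vault covers it.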

進階主題

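Custom rotation Lambda design: the four-step state machine is an idempotent contract. Secrets Manager may retry any step, and the Lambda must tolerate that without corrupting state. Two common implementation traps: createSecret does not check whether AWSPENDING already exists, so a retry generates a second credential and AWSPENDING no longer matches what setSecret wrote to the target; and setSecret does not handle "the target system already has a user with this name", so the second run gets stuck. The AWS-provided PostgreSQL template for the alternating-users strategy takes a cloning approach: it clones the user inside the database, sets the new password on the clone, and leaves the previous user valid across one rotation cycle, so the old credential only stops working at the following rotation.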
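The idempotency contract can be illustrated without any AWS calls. The sketch below is a minimal in-memory model, not the AWS template: FakeSecretStore, create_secret, and finish_secret are hypothetical names, and the token argument stands in for the ClientRequestToken that the rotation event carries. It shows the two checks the paragraph above asks for: reuse an existing AWSPENDING for the same token, and treat an already-promoted version as success.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class FakeSecretStore:
    """In-memory stand-in for one secret: staging label -> (version token, value)."""
    stages: dict = field(default_factory=dict)

    def get(self, stage):
        return self.stages.get(stage)


def create_secret(store, token, generate=lambda: secrets.token_urlsafe(24)):
    """createSecret: generate a candidate only if this token has none yet."""
    pending = store.get("AWSPENDING")
    if pending is not None and pending[0] == token:
        return pending[1]  # retry: reuse the credential generated earlier
    value = generate()
    store.stages["AWSPENDING"] = (token, value)
    return value


def finish_secret(store, token):
    """finishSecret: promote AWSPENDING, keep the old version as AWSPREVIOUS."""
    current = store.get("AWSCURRENT")
    if current is not None and current[0] == token:
        return  # retry after a successful promotion: nothing left to do
    pending = store.get("AWSPENDING")
    if pending is None or pending[0] != token:
        raise ValueError("no AWSPENDING version for this token")
    if current is not None:
        store.stages["AWSPREVIOUS"] = current
    store.stages["AWSCURRENT"] = pending
    del store.stages["AWSPENDING"]
```

Calling create_secret twice with the same token returns the same value, and calling finish_secret twice leaves the labels unchanged after the first call. A real Lambda would make the same checks against the Secrets Manager API instead of a dictionary.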

Resource Policy + ABAC tag 跨帳號：跨帳號 share 時用 ABAC tag 條件比硬列 role ARN 有彈性 — Resource Policy 寫 Condition: aws:PrincipalTag/team = payments、對方 account 任何帶該 tag 的 role 都可讀。代價是 tag 治理 變成 critical control：對方 account 內誰能 attach tag = 誰能拿 secret、IAM Policy 要鎖 iam:TagRole 跟 iam:UntagRole 權限。
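As a rough illustration of the ABAC condition described above, the following sketch builds a secret resource policy as a Python dictionary. The function name, the account ID, and the tag values are placeholders chosen for the example; the policy elements themselves (Principal, Condition, aws:PrincipalTag) are standard IAM policy grammar. Treat it as a starting shape to review, not a vetted production policy.

```python
import json


def abac_read_policy(account_id, tag_key, tag_value):
    """Allow GetSecretValue to principals in account_id that carry the given tag."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "CrossAccountReadByTag",
                "Effect": "Allow",
                # Delegates to the other account; the Condition narrows it by tag.
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": "secretsmanager:GetSecretValue",
                # In a resource policy attached to a secret, "*" means this secret.
                "Resource": "*",
                "Condition": {
                    "StringEquals": {f"aws:PrincipalTag/{tag_key}": tag_value}
                },
            }
        ],
    }


if __name__ == "__main__":
    # 111122223333 is a placeholder account ID.
    print(json.dumps(abac_read_policy("111122223333", "team", "payments"), indent=2))
```

The trade-off stated above still applies: with this policy, whoever can set the team tag on a role in the other account effectively controls access to the secret.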

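Monitoring signals for rotation failure: a failed Lambda run leaves an invocation error in CloudWatch and Secrets Manager records the rotation as failed, but the secret stays usable because AWSCURRENT still points at the old version. The resulting blind spot is a secret that has not rotated successfully for six months while the app looks healthy. Alert on the RotationFailed event (Secrets Manager records it in CloudTrail, so an EventBridge rule can match it), and separately compare LastRotatedDate against the rotation interval, alerting once the gap exceeds 1.5x the interval. LastRotatedDate is a field returned by DescribeSecret rather than a CloudWatch metric, so this check needs a scheduled job that polls it.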
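The 1.5x rule above reduces to a date comparison. The sketch below assumes the caller has already read LastRotatedDate and the rotation interval (for example from a DescribeSecret response); rotation_is_stale and its parameter names are made up for this example.

```python
from datetime import datetime, timedelta, timezone


def rotation_is_stale(last_rotated, interval_days, now=None, tolerance=1.5):
    """Return True when the last successful rotation is older than tolerance x interval.

    last_rotated is a timezone-aware datetime, or None if the secret never rotated.
    """
    if last_rotated is None:
        return True  # rotation configured but never succeeded
    now = now or datetime.now(timezone.utc)
    return now - last_rotated > timedelta(days=interval_days * tolerance)
```

For a 30-day interval the alert threshold is 45 days: a secret last rotated 40 days ago passes, one last rotated 50 days ago is flagged.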

跟 AWS IAM 整合：誰可以 GetSecretValue 完全由 IAM 控制、最佳實踐是 workload role 拿 secret（EC2 instance role / ECS task role / Lambda execution role / EKS IRSA）、不要硬把 AWS credential 塞進 secret 再給 application read。Secret 內容應該是 DB password / API token / third-party credential、不應該是 AWS credential（AWS credential 用 IAM role 短期 STS 拿就好）。

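CloudTrail cost trade-off: every GetSecretValue call is recorded in CloudTrail. As the CloudTrail documentation describes it, Secrets Manager API calls are management events (GetSecretValue is a read-only one) rather than data events, so the cost does not come from switching on a data-event selector; it comes from additional trail copies, CloudTrail Lake ingestion, and the downstream storage that receives a high-volume stream. A high-QPS application can produce millions of events a day, and S3 / CloudWatch Logs / SIEM ingestion cost rises with it. Ways to cut cost: filter in EventBridge so only events for specific sensitive secrets reach the SIEM, and keep CloudWatch Logs retention short (7 to 30 days as hot data, with the long tail in S3 + Athena).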

排錯與失敗快速判讀

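GetSecretValue returns AccessDenied although the IAM policy looks right: check whether the resource policy restricts the source account / VPC, and whether the KMS key policy allows the role to Decrypt. Access needs three things (IAM policy, resource policy, KMS key policy); a gap in any one produces AccessDenied.
Cross-account secret cannot be read: the resource policy does not list the other account's role, or the KMS key policy does not grant that role Decrypt. Cross-account access has to be configured in three places together (resource policy + the other account's IAM policy + KMS key policy).
Rotation keeps failing and nobody notices: there is no EventBridge alert on RotationFailed, AWSCURRENT stays on the old version, and the app works while the secret ages. Always add a scheduled check on LastRotatedDate.
App breaks after rotation because it holds a stale secret: the app uses a client-side cache (for example the AWS Secrets Manager caching client) and the cache is not invalidated when rotation completes. Keep the cache TTL shorter than the staging-label overlap window, or implement retry-on-auth-fail that triggers a cache refresh.
CloudTrail does not show who read the secret: the trail being queried is not capturing read-only management events (GetSecretValue is a read event), or it covers a different Region. Set the trail's management events to include Read and confirm the Region.
Rotation appears broken on a cross-Region replica: rotation was expected to run in the replica Region. Replicas are read-only; rotate on the primary, let the new version propagate, and check the replication status if the replica lags.
AWSPREVIOUS fallback does not take effect and a batch job fails: the rotation Lambda's finishSecret drops the old user too quickly, and the batch job's old credential can no longer connect to the database. The database rotation templates keep the old user for one rotation cycle by default; a custom Lambda has to implement that overlap window itself.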
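The stale-cache item in the list above mentions retry-on-auth-fail as an alternative to a short TTL. One way to picture it is the sketch below. SecretCache, AuthFailed, and connect_with_retry are hypothetical names, and fetch stands for whatever function actually calls GetSecretValue; the AWS caching client has its own API, which this does not reproduce.

```python
import time


class AuthFailed(Exception):
    """Raised by the caller-supplied connect() when the target rejects a credential."""


class SecretCache:
    """Cache one secret value with a TTL and an explicit refresh path."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = 0.0

    def get(self, force_refresh=False):
        now = self._clock()
        if force_refresh or self._value is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value


def connect_with_retry(cache, connect):
    """Try the cached credential; on an auth failure, refresh once and retry."""
    try:
        return connect(cache.get())
    except AuthFailed:
        return connect(cache.get(force_refresh=True))
```

The single retry is deliberate: if the freshly fetched credential is rejected too, the failure is not a cache problem and should surface to the caller.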

何時改走其他服務

需求形狀	改走
大量低敏 config / feature flag	SSM Parameter Store（free tier、無 rotation 需求）
跨雲統一 secret 控制面	HashiCorp Vault
Dynamic DB credential（non-AWS DB）	Vault database engine
Workload 拿 AWS credential	AWS IAM role（EC2 instance role / ECS task role / IRSA）— 不要把 AWS credential 塞 secret
Encryption-as-a-service / envelope encryption	AWS KMS Encrypt / Decrypt API、或 Vault transit engine
內部 PKI / mTLS workload cert	cert-manager + AWS Private CA
Secret rotation 跨服務 scope 治理	7.5 Credential Rotation Scoped Evidence

不在本頁內的主題

Secrets Manager 完整 API reference 跟 SDK 用法
每種 RDS engine 的 rotation Lambda template 內部 SQL 細節
AWS pricing 詳細計算（每 region 略有差異）
Terraform / CDK 跟 Secrets Manager 的 IaC 整合
AWS account organization / SCP 怎麼限制 secret 建立

案例回寫

Secrets Manager 在 07 案例庫沒有直接 vendor-level 事件、以下案例採對照引用：

案例	跟 Secrets Manager 的關係（對照）
Failure: Credential Rotation Without Scope	Secrets Manager rotation 必須有 scope map — 跨服務共用同一把 secret 時、AWSPREVIOUS 窗口期 + 雙軌驗證要對齊長尾 batch job、不能單靠 Lambda 自動 promote
CircleCI 2023 Secrets Rotation (red-team)	CI 出事時 Secrets Manager 內所有 CI runner role 可拿的 secret 都要 rotate — 必須事先以 ABAC tag 標 blast radius、不然只能盲掃整個 account
Microsoft Storm-0558 Signing Key Chain (red-team)	對照啟示 — Secrets Manager 的 KMS encryption key 必須走 CMK 而非 AWS-managed key、key policy 限定 only Secrets Manager service principal 且 only specific role 可 Decrypt、把 blast radius 鎖在 key policy 內

下一步路由

上游：7.6 秘密管理與機器憑證治理、7.13 偵測覆蓋率與訊號治理
平行：HashiCorp Vault、Google Secret Manager、Azure Key Vault
下游：AWS KMS（Secrets Manager 加密 key custodian、CMK 與 key policy 治理）
下游：AWS IAM（誰可以 GetSecretValue、跨帳號 share 的 principal 來源）
跨模組：8 事故處理 vendor 清單（secret 外洩事件如何 routing 進 IR 流程）
官方：AWS Secrets Manager Documentation

AWS WAF

Mon, 18 May 2026 00:00:00 +0000

AWS WAF 是 AWS-internal 的 Web Application Firewall、掛在 ALB、CloudFront、API Gateway、App Runner、AppSync 與 Cognito User Pool 的前面，攔截 HTTP/HTTPS 攻擊。它跟 Cloudflare WAF / Fastly Next-Gen WAF 的核心差異是 部署位置在 AWS 內部：流量先經 AWS 邊界進來、再進 Web ACL 過濾、最後抵達 origin；不是在 Cloudflare anycast edge 提早攔。對 AWS-heavy 客戶、AWS WAF 的價值是 跟 AWS IAM / VPC / AWS Shield 同一個控制面；對 multi-cloud / on-prem origin、AWS WAF 觸不到、要回到 edge WAF。

服務定位

AWS WAF 的核心定位是 跟 AWS 服務深度耦合的 L7 防護層。Web ACL 直接掛 AWS resource、規則用 IAM policy 管理、log 進 Kinesis Firehose / CloudWatch Logs / S3、跟 AWS Shield Standard（內含、L3/L4 DDoS）自動整合。這跟 Cloudflare WAF 在 origin 之前的 edge 攔截不同 — AWS WAF 流量 已經進到 AWS 邊界、不是擋在外部。對 origin 跑在 ALB / CloudFront / API Gateway 後的客戶、AWS WAF 是天然選項；origin 在其他雲或地端、AWS WAF 觸不到。

跟 Fastly Next-Gen WAF 相比、AWS WAF 走 signature + managed rule group 偵測模型、不像 Fastly NG-WAF 走語意 / behavioral；AWS WAF 的 Managed Rule Group 來自 AWS Managed 與 AWS Marketplace 第三方（Fortinet、F5、Imperva 等）、客戶端 看不到 rule logic、debug 時要靠 sampled request 反推。

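The billing model is another key difference: AWS WAF charges per Web ACL + per rule + per request ($5 per Web ACL per month, $1 per rule per month, $0.60 per 1M requests). Each rule group added to a Web ACL is billed as a rule, paid managed groups (Bot Control, ATP, ACFP, Marketplace subscriptions) add their own fees, and a Web ACL that exceeds the included WCU allowance pays an extra per-request charge, so the bill rises noticeably when many rulesets are enabled on high traffic. Cloudflare bills by plan tier (Pro / Business / Enterprise), and the price does not grow linearly with the number of rules.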
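The per-ACL + per-rule + per-request formula above is simple enough to put in a helper for budgeting. The sketch uses only the list prices quoted in the paragraph as defaults; waf_monthly_cost is a made-up name, and add-on fees (Bot Control, Marketplace subscriptions, extra WCU) are deliberately left out.

```python
def waf_monthly_cost(web_acls, rules, requests_millions,
                     acl_price=5.00, rule_price=1.00,
                     request_price_per_million=0.60):
    """Estimate the base AWS WAF bill in USD per month (add-ons excluded)."""
    total = (web_acls * acl_price
             + rules * rule_price
             + requests_millions * request_price_per_million)
    return round(total, 2)
```

For one Web ACL with 10 rules and 100 million requests a month, the three terms are 5, 10, and 60 dollars, so the request term dominates well before the rule count does.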

本章目標

讀完本頁、讀者能判斷：

AWS WAF 在 AWS-internal 防護 stack 中承擔哪一段、哪些要靠 AWS Shield / VPC / CloudFront 補位
Web ACL scope（Regional vs CloudFront）的選擇與跨 region 部署成本
Managed Rule Group / Custom Rule / Rate-based Rule 的取捨、Bot Control add-on 是否值得開
何時用 AWS WAF、何時走 Cloudflare WAF / Fastly NG-WAF 的判準

最短判讀路徑

判斷 AWS WAF 配置是否健康、最少看四件事：

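Is the Web ACL scope right: a CloudFront distribution must use a CloudFront-scope Web ACL (which has to be created in us-east-1); ALB / API Gateway must use a Regional-scope Web ACL (one per Region). A Web ACL with the wrong scope cannot be associated. For multi-Region deployments, check whether IaC (Terraform / CloudFormation) keeps the per-Region copies in sync.
Managed rule groups and sensitivity: are AWSManagedRulesCommonRuleSet (CRS), AWSManagedRulesAmazonIpReputationList (known malicious IPs), AWSManagedRulesAnonymousIpList (VPN / proxy / Tor), and AWSManagedRulesKnownBadInputsRuleSet (known exploit patterns) enabled, and were Marketplace rules observed in Count mode for 1 to 2 weeks of false positives (FP) before being switched to Block?
Is logging on: Web ACL logging is off by default and a Kinesis Data Firehose / CloudWatch Logs / S3 destination has to be configured by hand. Do events reach the SIEM (see 7.13 偵測覆蓋率與訊號治理), and can rule behaviour be reconstructed from sampled requests?
IAM boundary: who can update the Web ACL (wafv2:UpdateWebACL, wafv2:UpdateRuleGroup), is modification limited to an admin role, does CI hold only wafv2:Get* / List* for verification, and do sensitive changes go through change management and an audit log?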

四件事任一缺失、就是 Entry Point Protection 邊界的待補項目。

日常操作與決策形狀

Web ACL 與 scope：Web ACL 是 AWS WAF 的 規則容器、必須 attach 到 AWS resource。Scope 兩種：Regional（給 ALB / API Gateway / App Runner / AppSync / Cognito User Pool、每 region 獨立）與 CloudFront（給 CloudFront distribution、必須在 us-east-1 建立、全球生效）。同一個 ACL 不能跨 scope 共用；跨 region 部署同一套規則必須複製 ACL、用 Terraform / CloudFormation 管理避免 drift。

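Five rule actions: when a rule matches it can Block (403 by default), Allow (terminates evaluation and lets the request through), Count (does not block, only records; used as a dry run to observe FP), CAPTCHA (presents a puzzle that a human can solve and a bot should not), or Challenge (a silent JavaScript challenge the user does not see). The standard procedure for a new rule is to run it in Count for 1 to 2 weeks, inspect samples, and switch to Block only once FP is within tolerance. CAPTCHA and Challenge are ordinary rule actions that work without Bot Control, but they are billed separately by usage; the targeted level of Bot Control builds on them.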

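Managed rule groups (managed by AWS or Marketplace sellers): the AWS Managed baseline groups come with no subscription fee beyond normal WAF charges and cover the Common Rule Set (mapped to the OWASP Top 10), Known Bad Inputs, SQL Database, Linux OS, POSIX OS, Windows OS, Anonymous IP List, and Amazon IP Reputation List. Bot Control, Account Takeover Prevention (ATP), and Account Creation Fraud Prevention (ACFP) are also AWS Managed groups, but they carry additional fees. AWS Marketplace groups (paid) come from Fortinet / F5 / Imperva / Cyber Security Cloud and others. Marketplace rules do not disclose their rule logic, so a wrong block can only be investigated by working backwards from sampled requests, which makes debugging harder than with AWS Managed groups.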

Custom Rule（statement + 條件）：Custom Rule 用 statement（match condition + transformation）組合：IP Set match、Geo match、Regex Pattern Set、Size constraint、SQL injection match、XSS match、String match（含 header / body / URI / query 各部位）。複雜條件用 AND / OR / NOT 組合、上限是每 Web ACL 5,000 Web ACL Capacity Units（WCU）— 規則越複雜 WCU 越高、Marketplace 大型 rule group 可能直接吃掉一半 budget。

IP Set / Regex Pattern Set：IP Set 存 IPv4 / IPv6 CIDR 清單、Regex Pattern Set 存正則表達式集合。兩者都是 獨立資源、可在多個 Web ACL 引用、單獨更新（不必動 Web ACL 結構）。實務上 threat intel feed 應該 push 到 IP Set、用 Lambda 自動 sync、不用手動加。

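Rate-based rule: limits the number of requests per aggregate key within a trailing evaluation window (5 minutes by default; newer WAF versions let you choose a shorter or longer window) and triggers the action once the threshold is exceeded. The aggregate key can be the source IP, the forwarded IP (taken from X-Forwarded-For), HTTP method, URI path, a header, a cookie, or a combination. The key trap: a Web ACL on an origin ALB behind CloudFront must aggregate on the forwarded IP. Otherwise the rate-based rule sees only CloudFront edge IPs, all real users are counted together, and the rule either blocks everyone or blocks no one.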
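The forwarded-IP trap above is easy to reproduce with a toy model. The sketch below is not how AWS WAF is implemented (the real counter is approximate and managed by the service); RateLimiter and aggregate_key are hypothetical names, the IPs come from documentation ranges, and taking the first X-Forwarded-For entry is a simplification made for this example.

```python
from collections import defaultdict, deque


class RateLimiter:
    """Toy sliding-window counter: at most `limit` requests per key per window."""

    def __init__(self, limit, window_seconds=300):
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)

    def allow(self, key, now):
        hits = self._hits[key]
        while hits and hits[0] <= now - self.window:
            hits.popleft()  # drop requests that fell out of the window
        hits.append(now)    # blocked requests still count toward the rate
        return len(hits) <= self.limit


def aggregate_key(request, use_forwarded_ip):
    """Pick the key to count on: the forwarded client IP or the connection source IP."""
    if use_forwarded_ip:
        first_hop = request.get("x_forwarded_for", "").split(",")[0].strip()
        if first_hop:
            return first_hop
    return request["source_ip"]  # fallback when the header is missing
```

When one noisy client and one quiet client arrive through the same edge IP, counting on source_ip blocks both once the shared counter passes the limit, while counting on the forwarded IP blocks only the noisy one.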

Logging 必須手動開：Web ACL log 預設關閉、destination 三選一：Kinesis Data Firehose（推到 S3 / Splunk / Datadog）、CloudWatch Logs（簡單但貴）、S3（直寫、需自己處理 partition）。production 通常走 Kinesis Firehose → S3 + Athena query、配合 SIEM 拉 alert。沒開 log 等於 攻擊發生時沒證據、事後無法回查。

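Integration with AWS Shield: every AWS customer gets Shield Standard automatically (L3/L4 DDoS, free, baseline protection against SYN floods, UDP reflection, and similar attacks). Shield Advanced is a paid add-on ($3,000/month per organization + per-resource fee + data transfer out fee) that provides 24/7 access to the Shield Response Team (SRT, formerly the DDoS Response Team, DRT), cost protection (credits for AWS scaling charges incurred during a DDoS attack), and advanced analytics. Shield Standard is enough for most customers; finance, government, and high-profile brands are the ones that need Shield Advanced for the response team and cost protection.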

Lambda@Edge / CloudFront Functions 補位：當 WAF rule statement 表達不出複雜業務邏輯（geofencing + business hour + user tier 組合、JWT claim 解析後判斷 routing）、用 Lambda@Edge（Node.js / Python、跑在 CloudFront 邊緣節點、4 個 phase：viewer-request / origin-request / origin-response / viewer-response）或 CloudFront Functions（純 JS、輕量、低延遲、只在 viewer-request / viewer-response）補位。Lambda@Edge 適合複雜邏輯、CloudFront Functions 適合 header rewrite / 簡單 routing；兩者都不能取代 WAF managed rule、但補位 WAF 表達力上限。

跟 AWS IAM 整合：誰能改 Web ACL 是 IAM policy 決定（wafv2:CreateWebACL、wafv2:UpdateWebACL、wafv2:AssociateWebACL、wafv2:UpdateRuleGroup 等 action）。production 標準配置：admin role 才能 update、CI / 開發者只有 wafv2:Get* / List* 用來 verify、敏感變更走 Change Management + CloudTrail audit log。

核心取捨表

取捨維度	AWS WAF	Cloudflare WAF	Fastly Next-Gen WAF
部署位置	AWS 內部（ALB / CloudFront / API Gateway 前）	Cloudflare global edge（300+ POP）	Fastly global edge / 各 origin agent
Origin 適配	強耦合 — origin 必須在 AWS	強中立 — 任意雲 / on-prem	強中立 — Fastly CDN / 任何 origin
計費模型	per-ACL + per-rule + per-request	plan tier（Free / Pro / Business / Enterprise）	request-based + plan
Managed Rule	AWS Managed（免費）+ Marketplace（付費、logic 不透明）	Cloudflare Managed + OWASP CRS + Exposed Credentials	Signal-based（語意、低 FP、不靠 regex signature）
Rate Limiting	Rate-based Rule（含在 WAF、5 分鐘 window）	Rate Limiting 獨立 product	inline rate limit + Signal
Bot 對應	AWS WAF Bot Control（add-on、付費）	Bot Management（Pro+ add-on）	NG-WAF behavioral bot detection
DDoS 內建	Shield Standard 自動含（L3/L4）、Advanced 加價	同套餐內建	內建 + Fastly DDoS
控制面整合	跟 IAM / CloudTrail / Shield / VPC 同 plane	Cloudflare 控制面、跟其他 Cloudflare 產品同套	Fastly 控制面、agent 跑在 origin
學習曲線	中陡 — Web ACL + WCU + scope + IAM policy 多軌	中 — UI / Rules language / Terraform 完整	中 — agent 安裝 + Signal 語意設定
適合場景	AWS-heavy、ALB / CloudFront 是主要入口	Multi-cloud / on-prem origin、要整套 edge security	高 FP 容忍度低、業務有 schema、想避 regex signature

選 AWS WAF 的核心訴求：AWS-internal app + origin 跑在 ALB / CloudFront / API Gateway / App Runner 後 + 想跟 IAM / CloudTrail / Shield 同套 control plane 治理。Origin 不在 AWS、或要 把攻擊擋在抵達雲之前、應該走 Cloudflare WAF 或 Fastly NG-WAF。

進階主題

AWS WAF Bot Control（add-on）：付費 add-on、用 AWS 自家 bot fingerprinting 區分 verified bot（搜尋引擎）/ signal: automated browser（headless Chrome 等）/ signal: known bot（已標記 IoT / scraper），給每個請求 bot category label。Custom Rule 在 label 上做條件、決定 Block / Challenge / CAPTCHA。比 user-agent 過濾準很多、但要額外計費（per-request）。Bot Control 有兩個 inspection level：common（便宜、基礎指紋）與 targeted（貴、含 JavaScript challenge、CAPTCHA、token-based）。

Fraud Control（ATP / ACFP）：Account Takeover Prevention（ATP）跟 Account Creation Fraud Prevention（ACFP）是 Managed Rule Group 的特殊類別、需付費啟用。ATP 看登入端點的 credential stuffing、ACFP 看註冊端點的 bot signup。兩者都用 AWS 自家 threat intel（被竊憑證 list、行為模型）打 label、客戶側用 Custom Rule 處理。對有 login / signup 端點的 SaaS / 電商有價值、純內部後台不必開。

CAPTCHA / Challenge：AWS WAF 內建 CAPTCHA puzzle 與 silent JS Challenge、可在 rule action 直接呼叫。Challenge 在客戶端執行 proof-of-work、合法瀏覽器無感、headless 工具卡住；CAPTCHA 是視覺題、人類解、bot 不會。Production 標準做法：Bot Control 給 label → Custom Rule 看 label → likely bot 走 Challenge、known bad 走 Block、人類流量直接 Allow。

ACM Private CA + WAF 對 mTLS：AWS WAF 本身不做 mTLS 驗證、mTLS 是 ALB / API Gateway / CloudFront 自己的功能（搭配 AWS ACM Private CA 簽發 client cert）。WAF 在 mTLS 完成後才看 L7 流量、可以用 HTTP header match（mTLS 後 ALB 注入 client cert 資訊到 header）做進一步 rule。Internal API 用 mTLS + WAF 是常見組合。

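Lambda@Edge for inline business logic: complex decisions (user tier x geo x business hour x A/B test) cannot be expressed in WAF rule statements, so Lambda@Edge handles them in the viewer-request phase by parsing the JWT, calling an internal risk API, and either answering directly or adding a request header. On a CloudFront distribution the Web ACL is evaluated before edge functions run, so that header cannot feed the same Web ACL; it is useful to a Regional Web ACL on the origin (ALB / API Gateway) or to the origin itself. Costs: Lambda@Edge functions can only be created in us-east-1, a code update takes minutes to propagate to edges worldwide, and debugging means reading CloudWatch Logs scattered across Regions.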

排錯與失敗快速判讀

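Web ACL cannot be attached to CloudFront: the scope was set to Regional and CloudFront rejects the association. The Web ACL must be CloudFront scope and created in us-east-1; conversely, ALB / API Gateway accept only Regional scope.
Rate-based rule blocks everyone or no one: the origin behind CloudFront sees only CloudFront IPs and the aggregate key was not changed. Switch the aggregate key to the forwarded IP (X-Forwarded-For) and set the fallback behavior.
Managed rule group blocks legitimate requests: after CRS is enabled (or the SQLi sensitivity level is raised to High), file-upload or rich-text-editor endpoints get blocked. Identify the matching rule in sampled requests or in the log's terminatingRuleId / ruleGroupList fields, then either override that one rule to Count or add a scope-down statement so the group skips the affected path; do not disable the whole group.
Marketplace rule blocks traffic for unknown reasons: the rule logic is not disclosed, and sampled requests show the rule label but not the reason. Switch that rule to Count and observe; if there is no sign of an attack, replace it with the equivalent AWS Managed rule.
WCU limit exceeded: the Web ACL ceiling is 5,000 WCU, and a Marketplace group plus several AWS Managed groups can reach it. Check the capacity used, remove overlapping rules, and simplify custom rule expressions (fewer chained transformations).
Logging missing or misconfigured: after an incident there is no complete log to query, only sampled requests (retained for 3 hours and sampled probabilistically). Always enable the logging configuration to Kinesis Data Firehose / S3 / CloudWatch Logs, confirm the destination name starts with aws-waf-logs-, and confirm the delivery path has permission to write (for example firehose:PutRecord).
IAM permissions too broad: the CI account holds wafv2:* and can change every Web ACL in the account. Reduce it to read-only wafv2:Get* / List*, and limit sensitive writes to an admin role with MFA and change management.
Drift across Regions: someone edits the us-east-1 Web ACL in the console and the other Regions are not updated. Manage Web ACLs with Terraform / CloudFormation, require PR review, and have CI run a plan to detect drift.
Shield Standard cannot stop a large L7 DDoS: Standard covers only L3/L4, and L7 attacks are handled by WAF rate-based rules + Bot Control. If large L7 attacks recur, evaluate whether Shield Advanced's response team and cost protection are worth the fee.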

何時改走其他服務

需求形狀	改走
Multi-cloud / on-prem origin	Cloudflare WAF
低 FP 容忍 / 業務有 schema	Fastly Next-Gen WAF
L3/L4 DDoS 進階防護	AWS Shield Advanced / Cloudflare Magic Transit
純內部 mTLS / east-west	SPIRE + service mesh
Cert lifecycle	AWS ACM / cert-manager
Secrets / API key	AWS Secrets Manager / Vault
複雜業務邏輯 inline 處理	Lambda@Edge / CloudFront Functions

不在本頁內的主題

AWS WAF Classic（v1）的遷移細節 — 本頁全以 WAFv2 為準
完整 WCU 計算規則與每個 statement 的 WCU cost reference
Marketplace 第三方 rule group 各家功能矩陣
AWS WAF 在 GovCloud / China region 的差異
Bot Control / ATP / ACFP 完整 label schema reference

案例回寫

AWS WAF 在 07 案例庫無直接 vendor-level case、但多個 case 對應 WAF 作為 修補窗口期臨時控制 與 entry point 治理 的角色：

案例	跟 AWS WAF 的關係
Log4Shell CVE-2021-44228	對照啟示 — AWS Managed Rule Group 當時推出 Log4Shell 規則作為 emergency mitigation；但 exploitation 通過 WAF 後在後端執行，不能單靠 WAF 防 supply chain
Citrix Bleed 2023 Session Hijack	對照啟示 — WAF 攔不住 edge appliance zero-day、需要「修補 + session 失效 + 異常清查」三同步
Fortinet SSL-VPN CVE 2023-27997	對照啟示 — vendor patch 前的臨時 AWS WAF Custom Rule + Shield Advanced + Origin lockdown 是修補窗口期動作
7.3 入口治理與伺服器防護	AWS WAF 是 entry point protection 的工具、章節原則對應 WAF rule lifecycle 治理（Count → Block、IaC、IAM 收斂）

下一步路由

上游：7.3 入口治理與伺服器防護
平行：Cloudflare WAF、Fastly Next-Gen WAF
下游：7.4 資料保護與遮罩治理（WAF block 不夠時、資料層也要遮罩）
跨類：AWS IAM（誰能改 Web ACL）、AWS ACM（mTLS client cert）、AWS Secrets Manager（rule update 用的 API key）
跨模組：8 事故處理 vendor 清單（WAF block 事件如何 routing 進 IR）
官方：AWS WAF Documentation

Elastic Security

Mon, 18 May 2026 00:00:00 +0000

Elastic Security 是 Elastic Stack（Elasticsearch + Kibana + Beats / Agent）上的 SIEM + EDR + Cloud Security 套件、OSS 起源、現屬 Elastic 商業版的 Solution。它跟 Splunk / Datadog Security / Google Security Operations 的差異在 計費模型 + 查詢語言模型 + ecosystem 開放度、偵測能力本身相近 — Elastic 走 resource-based pricing（按 cluster size 而非 ingestion volume）、且提供 KQL / EQL / Lucene / ES|QL 四種互補的查詢語言。

服務定位

Elastic Security 的核心定位是 Elastic Stack 上的 security solution、底層是 Elasticsearch（資料層）+ Kibana（查詢與 UI 層）+ Fleet / Elastic Agent（採集層）、頂層產品分三條：Elastic SIEM（log aggregation + detection rule + Case + Timeline）、Elastic Defend（前 Endgame 收購而來、EDR + endpoint protection、跟 CrowdStrike / SentinelOne 同層）、Elastic Cloud Security（CSPM + CWP、雲端資源 misconfig 與 workload 防護）。

跟 Splunk 比、Elastic 走 OSS-friendly + resource-based pricing — TB-scale ingestion 不直接漲費用（要 scale node 但邊際成本遠低於 Splunk per-GB 累進）、Sigma rule 社群可直接 import 5000+ 規則；但 Splunk Security Content 跟 SOAR / RBA 等 detection content + SOC tooling 成熟度仍高一個量級。跟 Datadog Security 比、Elastic 跨 on-prem + 多雲、可自管也可 Elastic Cloud SaaS；Datadog 是 SaaS-only、適合純 cloud-native。跟 Google Security Operations 比、Elastic 多查詢語言（KQL / EQL / Lucene / ES|QL）、Google 走 YARA-L 單一統一語言、超大規模 ingestion Google 反而划算。

關鍵張力：多查詢語言模型 同時是 Elastic 的優勢跟負擔。EQL 寫 attack chain sequence 比 SPL correlation 更直接、KQL 過濾快、ES|QL 寫 aggregation 像 SQL 直覺、Lucene 處理 full-text；但 SOC team 要決定哪個 rule 用哪個語言、不能讓每個 analyst 各寫各的。

本章目標

讀完本頁、讀者能判斷：

Elastic Security 在 SOC stack 中承擔哪一段（log aggregation / SIEM / EDR / CSPM）、哪些要外接（Okta IdP log、Vault secret rotation）
KQL / EQL / Lucene / ES|QL 四種查詢語言的職責分工（誰用在哪種 rule、誰負責教育 SOC）
Resource-based pricing 的治理（cluster sizing、hot-warm-cold tier、Searchable Snapshots、Elastic Cloud Serverless）
何時用 Elastic、何時走 Splunk / Datadog / Google Security Ops 的取捨

最短判讀路徑

判斷 Elastic Security deployment 是否健康、最少看四件事：

誰能改 detection rule：Elastic Security app 的 rule editor 權限、detection-rules repo（Elastic 官方 OSS rule 庫）有沒有 fork 進組織版控、rule change 是否走 PR review + staging space 驗證
採集治理：Fleet 統一管 Elastic Agent policy / 還是散落 Beats（filebeat / metricbeat / auditbeat / winlogbeat）各自設定、log source 是否分 hot / warm / cold tier、Searchable Snapshots 是否開
Detection content coverage：Elastic Prebuilt rules + Sigma 社群規則 import 多少 enabled、是否跟 MITRE ATT&CK 對照、EQL sequence 規則覆蓋多少 attack chain pattern
Alert quality / SOC handoff：alert volume per day、Case 跟 Timeline 是否進入日常 SOC workflow、ML anomaly job 是否在線 + threshold 是否 tuned、跟 8 incident response 的 routing 是否定義

四件事任一缺失、就是 Detection Coverage and Signal Governance 邊界的待補項目。

日常操作與決策形狀

Ingestion architecture：log 進 Elastic 三種主路徑 — Elastic Agent + Fleet（現代部署的預設、單一 agent 收 system / endpoint / cloud / app log、中央 Fleet server 統一管 policy）、Beats（filebeat / metricbeat / auditbeat / winlogbeat 等專用 agent、Fleet 推出前的傳統做法、現在持續支援但建議遷移到 Elastic Agent）、Logstash（pipeline-style ETL、用在 enrich / filter / route 複雜場景）。production 通常 Elastic Agent + Fleet 為主、Logstash 補 ETL 缺口。

KQL / EQL / Lucene / ES|QL 的職責分工：四種查詢語言各有 first-class 場景。KQL（Kibana Query Language）是 Kibana 預設過濾語法、user.name : "alice" and event.action : "logon-failed"、簡單直觀、適合 dashboard / Discover 過濾。EQL（Event Query Language）做 sequence pattern matching、sequence by user.name [authentication where event.outcome=="failure"] [authentication where event.outcome=="success" and source.geo.country != "TW"]、表達 attack chain 比 SPL correlation 更直接。Lucene 是底層 full-text query、特殊需要時直接寫。ES|QL（Elasticsearch Query Language、2024+）是新版 SQL-like、FROM logs-* | WHERE event.category == "authentication" | STATS count = COUNT(*) BY user.name、寫 aggregation 直覺；屬新語言、production 採用 cadence 還在跟進中。

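Detection rule types: Elastic Security treats rule types as first-class concepts, and there is more than a single "query rule". The six covered here are: Query rule (triggered by KQL / Lucene), EQL rule (sequence patterns), Threshold rule (an aggregate exceeds a threshold, for example more than 100 login failures from one IP within 5 minutes), ML rule (bound to an Elastic ML anomaly job and triggered when the anomaly score exceeds a threshold), New terms rule (an entity value seen for the first time, for example a user's first login from a given country), and Indicator match rule (events are compared with a threat intel feed and an IoC hit triggers the rule). Recent versions also add an ES|QL rule type. Production rule sets usually combine several: query rules for coarse filtering, EQL rules for sequences, threshold and ML rules for baseline anomalies.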

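Sigma rule import: Sigma is an open, generic detection rule format (YAML, portable across SIEMs) with several thousand community-maintained rules. Sigma rules are not loaded into Elastic as-is; they are converted into Elastic queries or detection rules with the Sigma tooling (sigma-cli / pySigma with an Elasticsearch backend) and then imported. This is the open-source lever that sets Elastic apart from commercial SIEMs. In practice: import a Sigma baseline, run all of it in a staging space to observe false positives, and only then enable rules in production. Do not enable everything at once; because Sigma rules are generic across SIEMs, the environment-specific tuning is always your own work.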

Case + Timeline：Case 是 incident 容器、聚合 alert + comment + assignment + status；Timeline 是 SOC analyst 的 investigation workspace、可以 pin event / annotate / link related alert、產出 investigation narrative。兩者組合是 Elastic 的 SOC workflow first-class、不是外掛 — 對應 Splunk ES 的 Notable Event + Incident Review、但 Elastic 走 OSS 化、Case 可 export markdown 進 ticketing。

Elastic Defend（EDR）：前 Endgame 收購整合、提供 endpoint detection + prevention（malware block / ransomware protection / behavior detection）、跟 CrowdStrike Falcon / SentinelOne 同層。Elastic Defend 跑在 Elastic Agent 內、policy 從 Fleet 推。實務上多數 SIEM 客戶不會用內建 EDR、而是外接專業 EDR feed 進 Elastic SIEM；但 OSS-friendly + 預算敏感的中型客戶可以直接整合到一個 stack。

Cross-cluster search：跨多個 Elastic cluster 統一查詢（remote_cluster:index-name）、適合 multi-region / multi-tenant SOC、不需要把所有 log 搬到單一 cluster。對應 Splunk Cloud federated search。實務場景：歐洲 GDPR 資料留在 EU cluster、美國 cluster query 過去做 incident investigation 而不複製資料。

ML jobs（anomaly detection）：Elastic ML 內建 unsupervised anomaly detection、pre-built ML job library 覆蓋 SOC 常見場景（user behavior baseline、host login pattern、port scan detection、rare process）。ML rule 綁 ML job、anomaly score 超過閾值觸發 detection rule。對應 Splunk UBA、但 Elastic ML 是 stack 內建、不是 add-on app。

Resource-based pricing 治理：Elastic Cloud 按 cluster size（node count × node size）計費、不按 ingestion volume — 意義是 ingest 多 log 不直接漲費用、但要 scale node 維持查詢效能。實務治理：hot tier（最近 7-30 天、SSD 高效能 node）、warm tier（30-90 天、低 IO node）、cold tier / frozen tier（90 天以上、Searchable Snapshots on S3 / GCS、查詢慢但成本極低）。對應 Splunk SmartStore、但 Elastic frozen tier 把 retention 從幾個月延長到幾年、cost 不線性漲。
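The tier boundaries above can be written down as a small lookup, which is handy when estimating how much data sits in each tier. This is only a sketch of the age thresholds quoted in the paragraph; tier_for_age is a made-up helper, and a real deployment would express the same boundaries as ILM policy phases rather than application code.

```python
def tier_for_age(age_days, hot_days=30, warm_days=90):
    """Map an index age in days to the storage tier described above."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= hot_days:
        return "hot"      # recent data on SSD-backed nodes
    if age_days <= warm_days:
        return "warm"     # lower-IO nodes
    return "frozen"       # cold / frozen: searchable snapshots on object storage


def tier_breakdown(index_ages):
    """Count how many indices fall into each tier."""
    counts = {"hot": 0, "warm": 0, "frozen": 0}
    for age in index_ages:
        counts[tier_for_age(age)] += 1
    return counts
```

Running tier_breakdown over the current index ages gives a quick view of whether the hot tier is carrying data that the policy says should already have moved.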

核心取捨表

取捨維度	Elastic Security	Splunk	Datadog Security	Google Security Operations
計費模型	Resource-based（node / cluster size）	Ingestion-based（GB/day、累進）	Per-host + per-event（events/month）	Fixed price by data tier（PB-scale 划算）
查詢語言	KQL / EQL / Lucene / ES\|QL 四種互補	SPL（單一強表達力）	Datadog Query（沿用 observability 語法）	YARA-L（統一、結構清楚）
Sequence 表達	EQL `sequence by` 直接表達 attack chain	SPL transaction / streamstats	log + metrics + trace 同 plane	UDM + YARA-L 多事件 rule
部署模型	Self-hosted / Elastic Cloud / Serverless	Self-hosted (Enterprise) / SaaS (Cloud)	SaaS only	SaaS only（Google Cloud）
Detection content	Elastic Prebuilt rules + Sigma 社群 5000+	Splunk Security Content（最豐富、社群活躍）	Datadog Security Rules（中等）	Google YARA-L + Google threat intel
EDR 整合	Elastic Defend 內建（前 Endgame）	外接 CrowdStrike / Defender	Workload Security（容器 focus）	外接（透過 forwarder）
SOAR / Response	Cases + Endpoint response（Elastic Defend）	Splunk SOAR（前 Phantom、業界先驅）	Workflow Automation（基本）	SOAR 內建（前 Siemplify）
適合場景	OSS-friendly、中大型、Elastic stack 已用	Enterprise + 跨 on-prem、預算允許	Cloud-native + observability 已用 Datadog	超大規模 ingestion、Google 雲 + 多雲 SOC
退場成本	中 — Sigma / Lucene / EQL 部分可移植	高 — SPL / detection content / dashboard 量多	中	中

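The core case for Elastic: an OSS-friendly culture, resource-based pricing that suits the workload, Elastic Stack already in use for observability, a team able to work across four query languages (or at least to divide EQL and KQL duties clearly), and acceptance of the trade-off in detection content and SOAR maturity. The biggest draw at TB-scale ingestion is license cost: savings of 60 to 80% versus Splunk are the figure usually quoted, but the real number depends on retention, tiering, and query load, and it has to be weighed against the hidden cost of cluster sizing and SRE operations.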

進階主題

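EQL sequence pattern (time-ordered attack chain): EQL's sequence by is Elastic's first-class tool for expressing an attack chain, and it is more direct than SPL correlation. MFA fatigue, for example, can be written as sequence by user.name with maxspan=5m [authentication where event.outcome == "failure"] [authentication where event.outcome == "failure"] [authentication where event.outcome == "success"], which states the ordering logic directly. EQL has no built-in notion of a "known IP", so a condition such as source.ip != known_ip is shorthand: the new-device or new-IP test has to come from an enrichment field added at ingest or from a separate New terms rule. This pairs with the Uber 2022 MFA Fatigue lesson: a run of MFA failures followed by a success from a new device should trigger directly.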
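To make the sequence semantics concrete, the sketch below mimics a three-stage sequence grouped by user with a maximum span, in plain Python. It is a simplified model for reasoning about rule logic, not Elastic's EQL engine: find_sequences is a hypothetical helper, events are plain dictionaries with ts in seconds, and only the most recent partial match per stage is kept.

```python
def find_sequences(events, stages, maxspan_seconds, key="user"):
    """Return lists of events that match `stages` in order, per key, within maxspan."""
    matches = []
    partial = {}  # (key value, index of last matched stage) -> matched events so far
    for event in sorted(events, key=lambda e: e["ts"]):
        who = event[key]
        # Check later stages first so one event cannot advance a sequence twice.
        for i in range(len(stages) - 1, -1, -1):
            if not stages[i](event):
                continue
            if i == 0:
                partial[(who, 0)] = [event]  # start (or restart) a sequence
                continue
            prefix = partial.get((who, i - 1))
            if prefix is None or event["ts"] - prefix[0]["ts"] > maxspan_seconds:
                continue  # nothing to extend, or the span has expired
            del partial[(who, i - 1)]
            if i == len(stages) - 1:
                matches.append(prefix + [event])
            else:
                partial[(who, i)] = prefix + [event]
    return matches


def is_failure(event):
    return event["outcome"] == "failure"


def is_success(event):
    return event["outcome"] == "success"
```

Passing stages=[is_failure, is_failure, is_success] with maxspan_seconds=300 corresponds to the failure, failure, success sequence above: a user with two failures and a success inside five minutes matches, while a user whose success arrives after the span does not.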

Elastic Defend endpoint response：除偵測外、Defend 支援 host isolation（隔離受感染 endpoint 但保留 SOC 連線）、process kill、file quarantine 等 response action、直接從 Kibana Security app 觸發。對應 CrowdStrike Real Time Response。production 採用前要設 approval gate、避免 SOC analyst 誤觸動 production server。

CSPM / CWP（Elastic Cloud Security）：CSPM（Cloud Security Posture Management）對 AWS / GCP / Azure 帳號做 misconfig 掃描（S3 bucket public、IAM over-permission、security group 0.0.0.0/0）、對照 CIS Benchmark；CWP（Cloud Workload Protection）對 Kubernetes workload 跑 runtime detection。屬較新的功能、跟 Wiz / Lacework 等專業 CNAPP 比覆蓋還在追趕。

Cross-cluster search 跨環境 federated query：multi-region SOC 的 first-class 工具 — query 寫 FROM logs-auth-*, eu-cluster:logs-auth-*、Elastic 自動路由跨 cluster。實務注意：跨 cluster query 延遲較高、要設 timeout；資料合規（GDPR）必須留意 query 結果是否包含跨境資料、不是搬資料但 query 結果回傳算不算傳輸要法務確認。

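Sigma rule community: Sigma is the open, generic format for detection rules, and Elastic is one of its main consumers (an Elasticsearch backend exists for the Sigma converter, with Elastic engineers participating in Sigma upstream). In practice: fork the SigmaHQ repo into the organization's version control, have a CI pipeline convert Sigma rules into Elastic detection rules automatically, run them in a staging space to measure the false-positive curve, and then promote them to production. Do not import by hand each time.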

Elastic Cloud Serverless（2024+）：新模型、按 workload type（search / observability / security）計費、不再按 cluster size — 減少 sizing 決策、autoscaling 由 Elastic 託管。屬新模型、production 採用 cadence 還在跟進中、適合 greenfield 部署或 PoC、existing cluster 遷移 roadmap 還在演進。

排錯與失敗快速判讀

Alert volume 爆炸 / SOC 看不完：Sigma rule 全 enable 沒 tune、或 threshold rule 閾值太低 — staging space 跑 1 週統計 FP、tune threshold、加 exception list 排除已知合法 source、ML rule 補 user-specific baseline
EQL sequence rule 跑不動 / timeout：sequence span 太長（24h）或 by field cardinality 太高、查詢成本爆炸 — 縮短 maxspan、限定 index pattern、加 pre-filter 條件
Cluster 查詢慢 / Kibana 卡：hot tier 塞太多舊資料、沒做 hot-warm-cold tier 分層 — 開 ILM（Index Lifecycle Management）policy 自動 rollover、warm tier 用便宜 node、cold / frozen 走 Searchable Snapshots
Fleet agent enrollment 失敗：Fleet server 跟 Elasticsearch 之間網路 / 憑證 / token 問題 — 檢查 Fleet server health、確認 enrollment token 未過期、agent log 看 specific 錯誤
Sigma rule import 後大量 FP：Sigma rule 是 cross-SIEM 通用、沒有 environment-specific exclusion — 不要全 enable、staging tune 後再 promote、加 exception list（known scanner IP / 內部測試帳號）
Resource-based pricing 超預算：node 過度 scale 或 hot tier 留太多 — 開 hot-warm-cold ILM、把 retention 超過 30 天的 index 推到 frozen tier on S3、Searchable Snapshots 是預設應該開
ML job anomaly score 不準：training data 包含已 compromise 期間、baseline 被汙染 — 確認 training window 在乾淨期、定期重訓、配 detection rule 用 anomaly_score > 75 而非 > 50

何時改走其他服務

需求形狀	改走
Enterprise + detection content 最豐富	Splunk
Cloud-native + observability 已用 Datadog	Datadog Security
超大規模 ingestion + Google 雲	Google Security Operations
DLP / sensitive data discovery	Google DLP / Microsoft Purview
Endpoint detection 為主、不要全 stack	CrowdStrike Falcon / Microsoft Defender for Endpoint / SentinelOne
CNAPP 為主（雲端 posture + workload）	Wiz / Lacework / Prisma Cloud（Elastic Cloud Security 較新）
Incident routing	8 事故處理 vendor 清單

不在本頁內的主題

KQL / EQL / ES|QL 完整語法 reference、Lucene query DSL 進階用法
Elasticsearch index sharding / replica / ILM tuning 細節（屬 observability / 資料工程範圍）
Elastic Observability（APM / logs / metrics）— 屬 observability 不屬 security
Elastic Cloud Serverless 詳細 sizing 與 pricing 模型（2024+ 新模型、變動中）
Elastic Stack 自管的維運（cluster upgrade、Kibana plugin 開發）

案例回寫

Elastic Security 在 07 案例庫沒有直接 vendor-level 事件、但所有 detection-related case 都是 SIEM 偵測覆蓋率的對照：

案例	跟 Elastic Security 的關係（對照啟示）
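Uber 2022 MFA Fatigue	Elastic EQL `sequence by user.name` expresses the MFA fatigue pattern directly: repeated authentication failures followed by a success from a new device. A form such as `[auth fail count > 50 in 5min] [auth success from new device]` is shorthand rather than valid EQL; a count that high fits a Threshold rule, and the new-device condition needs an enrichment field or a New terms rule. Community Sigma rules offer a starting point once converted.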
Microsoft Storm-0558 Signing Key Chain	跨租戶 token 異常驗證需 Elastic Cross-cluster search 跨 Azure AD log + GCP audit log + 自家 app log 同時 query、不需先搬資料
3CX 2023 Desktop App Supply Chain	Elastic Defend 直接看到 desktop app process spawn + 異常網路 callback、不需外接 EDR feed；EQL `sequence` 抓 process → DNS → C2 行為鏈
Detection Engineering Lifecycle (section)	Elastic rule 走 `detection-rules` repo（OSS、Elastic 官方維護）+ Sigma fork + staging space + promote 工程化 lifecycle、不是 Kibana UI 直改
Alert Fatigue and Signal Quality (section)	Elastic 沒有 Splunk RBA 對應、用 ML anomaly rule + threshold rule severity + Case grouping 三層降噪、要設 ML job 重訓 lifecycle

下一步路由

上游：7.13 偵測覆蓋率與訊號治理、Detection Engineering Lifecycle
平行：Splunk、Datadog Security、Google Security Operations
下游：Google DLP / Microsoft Purview（DLP signal 進 Elastic SIEM）
跨類：Okta（IdP log source）、HashiCorp Vault（secret rotation API）、Cloudflare WAF（WAF log + Sigma rule 對接）
跨模組：8 事故處理 vendor 清單（Case → IR routing）、4 observability（Elastic Stack 共用 log pipeline）
官方：Elastic Security Documentation、detection-rules repo

Apache JMeter

Fri, 15 May 2026 00:00:00 +0000

JMeter 的核心責任是把多 protocol 測試與既有企業測試資產轉成可重跑的負載驗證。它適合 GUI 驅動、plugin 生態成熟、HTTP 之外還需要 JDBC、JMS、FTP、mail 或 legacy protocol 的團隊，重點在把測試流程保留成可審查、可交接、可在 non-GUI mode 跑的 artifact。

服務定位

JMeter 是 Apache Software Foundation 的 OSS load testing tool、Java 寫、用 XML 描述 thread group / sampler / listener 組成的 test plan（.jmx 檔）、支援 GUI 與 CLI（non-GUI / headless）雙模式。它是業界最老牌、protocol 覆蓋最廣的壓測工具 — sampler 直接覆蓋 HTTP、JDBC、JMS、SOAP、FTP、SMTP、IMAP、TCP、JUnit、OS process 等。

跟 k6 比、JMeter 走 GUI-driven + protocol 廣、k6 走 code-first（JavaScript）+ HTTP 為主；JMeter 適合 QA 團隊維護、k6 適合 dev / SRE 寫進 CI。跟 Locust 比、JMeter 用 XML + plugin、Locust 用純 Python class、custom client 彈性 Locust 強但 protocol 內建支援 JMeter 廣。跟 Gatling 比、JMeter 偏 GUI / 多 protocol、Gatling 偏 JVM DSL（Scala / Java / Kotlin）+ async runtime、單機 throughput Gatling 較高但 protocol 廣度與既有資產承接 JMeter 勝。

關鍵張力：GUI / protocol 廣度 ↔ 單機 throughput / CI 友善度 是選 JMeter 的根本取捨。GUI 適合 QA 團隊與跨角色協作、.jmx 又有 plugin 生態與十多年累積；代價是 XML diff 難 review、GUI listener 吃記憶體、CI 整合相比 k6 / Gatling 多一層 packaging。

JMeter 適合測試資產已經存在的組織。當團隊有大量 .jmx 測試計畫、QA 團隊用 GUI 維護 scenario、或壓測需要跨 HTTP、JDBC、JMS 與其他 plugin protocol，JMeter 的價值在於承接組織流程，而不只是產生 HTTP 負載。這個定位讓 JMeter 接到 9.3 壓測工具選型與 9.10 Production-Side 驗證。它能支援 production-like test 的多系統 dependency，但 evidence package 要補上測試計畫版本、plugin 版本、runner 配置與結果保存方式。

適用場景

多 protocol 壓測是 JMeter 的主要入口。企業服務常同時需要測 HTTP API、JDBC query、JMS queue、FTP 或 mail flow，JMeter 的 sampler 與 plugin 生態能讓同一份測試計畫覆蓋多種 dependency。

GUI 協作適合非純工程團隊。QA、測試中心或受監管環境常需要可視化測試設計、審核與交接，JMeter 的 GUI 能降低跨角色溝通成本。

Legacy 測試資產適合保留 JMeter。既有 .jmx 檔案、listener、plugin 與報表流程如果已經運作多年，重寫到 k6、Gatling 或 Locust 的機會成本要用維護收益抵銷。

最短判讀路徑

判斷 JMeter deployment 是否健康、最少看四件事：

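Thread group design: do thread count / ramp-up / loop count / duration reflect the real traffic model? Where the load shape matters, is a plugin thread group used (Concurrency Thread Group or Stepping Thread Group to shape concurrency, combined with a throughput-shaping timer when the target is an arrival rate), rather than treating each thread as one "user"?
Listener configuration: GUI listeners (View Results Tree / Aggregate Report / Graph) are enabled only during design and debugging. Real runs must switch to Simple Data Writer and write a JTL file, with analysis left to the offline HTML report or an external Grafana.
Distributed mode: a single machine tops out at roughly 3,000 to 5,000 threads (bounded by JVM heap and thread context switching; the real ceiling depends on the test plan and hardware). Beyond that, use master + slave (remote engines; recent JMeter documentation calls them controller and worker nodes). Plugins, JMeter version, and JVM parameters on the slave machines must match the master, otherwise the results cannot be trusted.
GUI versus CLI mode: the GUI is for design and debugging only, and production load always runs as jmeter -n -t plan.jmx -l result.jtl. Running a large test in the GUI lets listeners exhaust memory, which distorts the results.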

四件事任一缺、就是 9.3 壓測工具選型邊界的待補項目。

選型判準

判準	JMeter 的價值	需要補的能力
多 protocol	sampler 與 plugin 覆蓋廣	plugin 版本治理與測試環境一致性
GUI 協作	非工程角色可讀可改	code review、diff 與版本控制紀律
既有資產	`.jmx`、listener、報表可延續	scenario cleanup 與 artifact 標準化
分散式執行	remote engine 可擴負載	runner sizing、網路瓶頸與結果合併

多 protocol 價值來自 dependency coverage。當 workload model 包含 database、queue、file transfer 或 legacy endpoint，JMeter 可以把不同 dependency 的壓力放在同一個測試計畫中觀察。

GUI 協作價值來自跨角色可見性。這個優點會帶來版本控制成本，因為 XML diff 不容易 review；團隊要補上 naming、folder structure、parameterization 與 review checklist。

跟其他工具的取捨

JMeter 和 k6 的主要差異是 workflow。JMeter 偏 GUI、plugin 與既有企業流程；k6 偏 code-first、CLI、threshold 與 CI artifact。

JMeter 和 Gatling 的主要差異是 scenario 表達。JMeter 用 test plan、thread group、sampler 與 listener 組裝；Gatling 用 JVM DSL 描述 simulation，較適合工程團隊維護複雜 flow。

JMeter 和 Locust 的主要差異是自訂能力。JMeter 依賴 plugin 與 sampler，Locust 可以直接用 Python library 實作 custom client；如果 protocol 特別特殊，Python 團隊可能更適合 Locust。

JMeter 和 Vegeta 的主要差異是複雜度。Vegeta 適合快速 HTTP saturation probe；JMeter 適合多步驟、多 dependency 與可交接測試計畫。

取捨維度	JMeter	k6	Locust	Gatling
描述語言	XML（`.jmx`）+ GUI	JavaScript	Python（class-based）	Scala / Java / Kotlin DSL
Protocol 覆蓋	HTTP/JDBC/JMS/SOAP/FTP/SMTP/TCP	HTTP/WebSocket/gRPC	HTTP + 任何 Python lib custom	HTTP/JMS/MQTT
單機 throughput	中（thread-per-user）	高（Go goroutine）	中（gevent / async）	高（Akka async）
Runtime model	JVM thread	Go runtime	Python gevent	JVM async actor
CI 友善度	需 packaging `.jmx` + plugin	強 — 單一 JS file + CLI	強 — pip + Python file	強 — sbt / Maven + Scala file
GUI	完整 GUI（design / debug）	無（CLI only）	Web UI（runtime monitoring）	無（HTML report only）
Distributed	Master + Slave（remote engine）	k6 Cloud / Operator	Master + Worker	Gatling Enterprise / FrontLine
適合場景	Enterprise QA + 多 protocol	Dev / SRE + HTTP-heavy + CI	Python 團隊 + custom protocol	JVM 團隊 + 複雜 scenario

操作成本

JMeter 的主要成本是測試計畫治理。.jmx 檔案可以累積大量 listener、debug sampler、hard-coded variable 與過期 assertion，長期不整理會讓壓測結果失去可追溯性。

Runner 成本來自 JVM 與 listener。GUI listener 適合開發階段觀察，不適合大規模壓測；正式測試要使用 non-GUI mode，把結果輸出成 JTL、HTML report 或外部 metrics。

Plugin 成本來自版本漂移。不同 runner、不同工程師機器或 CI image 的 plugin 版本如果不一致，同一份測試計畫可能產生不同結果，因此要把 plugin 清單、JMeter 版本與 container image 固定下來。

Evidence Package

JMeter 結果應回寫到 evidence package。最小欄位包括 test plan version、JMeter version、plugin list、runner topology、thread group 設定、ramp-up、duration、p95 / p99、error rate、throughput、target saturation metric 與 known gap。

欄位	JMeter 證據來源
Source	`.jmx`、JTL、HTML report、dashboard link
Time range	test start / end
Query link	APM / Prometheus / DB / queue 查詢連結
Data quality	test plan version、plugin version
Confidence	runner topology、production similarity
Known gap	未覆蓋 protocol、資料偏差、listener overhead

Evidence package 的核心用途是讓結果可審查。JMeter 測試計畫常由多人維護，gate decision 要能追到哪一版 .jmx、哪一組 runner、哪一批測試資料與哪一個目標環境。
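The p95 / p99, error rate and throughput fields above can be derived offline from a JTL file. The sketch below assumes the JTL was written as CSV with a header row containing the `timeStamp`, `elapsed` and `success` columns (JMeter's default CSV output includes them). The function names are hypothetical, and throughput is approximated from the first and last sample start times.

```python
import csv
import io
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile on an already sorted list."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize_jtl(jtl_text):
    """Summarize a CSV-format JTL into evidence-package fields."""
    rows = list(csv.DictReader(io.StringIO(jtl_text)))
    if not rows:
        raise ValueError("JTL contains no samples")
    elapsed = sorted(int(r["elapsed"]) for r in rows)
    errors = sum(1 for r in rows if r["success"].strip().lower() != "true")
    stamps = [int(r["timeStamp"]) for r in rows]
    # Approximation: window measured between first and last sample start.
    duration_s = max((max(stamps) - min(stamps)) / 1000, 0.001)
    return {
        "samples": len(rows),
        "p95_ms": percentile(elapsed, 95),
        "p99_ms": percentile(elapsed, 99),
        "error_rate": errors / len(rows),
        "throughput_rps": len(rows) / duration_s,
    }
```

In practice the input would be the contents of `result.jtl`; storing the returned dictionary next to the `.jmx` version and runner topology keeps the gate decision traceable.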

進階主題

JMeter Plugins 生態：jmeter-plugins.org 社群維護的 plugin 集合補齊原版 JMeter 的不足 — Custom Thread Groups（Stepping / Ultimate / Concurrency / Arrivals）讓 thread schedule 反映真實 arrival rate、PerfMon 抓 remote server CPU / memory、Throughput Shaping Timer 直接以 RPS 為目標而非 thread count、Dummy Sampler 拿來 mock dependency。Plugin Manager 統一安裝、CI image 要把 plugin 清單固定（PluginsManagerCMD.sh install ）避免漂移。

BlazeMeter Cloud / Distributed execution：自建 distributed mode（master + slave 跨多 VM）成本高 — slave 機器要同 JMeter 版本、同 plugin、同 JVM 參數、RMI port 開通、結果回傳網路足夠。BlazeMeter（Perforce / 前 CA）是 JMeter SaaS、直接吃 .jmx 跑 cloud-scale 壓測、附 geo-distributed runner、適合短期 spike 測試不想自建 distributed cluster 的團隊。trade-off 是 vendor lock-in 跟 per-test 計費 — 長期高頻測試自建較划算。

Distributed mode 細節：master 機器發 control plane（thread group 配置、test plan 分發）、slave 跑 thread 並回傳 sample 結果。瓶頸常出在 master 收結果（RMI / 自訂 protocol），不是 slave 跑不動 — 大規模測試應該關掉 GUI listener、用 Backend Listener 把 metric 即時推到外部時序資料庫、master 只收彙整指標而非每個 sample。同步要點：所有 slave 用同一份 .jmx 與 test data CSV，CSV 不能依賴 master local path。

Backend Listener + Grafana 整合：JMeter 原生 Backend Listener 支援 InfluxDB / Graphite / Elasticsearch、把 active thread / response time / hit / error 即時推出去、Grafana 配 official JMeter dashboard 即時看 throughput / latency curve。這個組合取代 GUI listener、是 distributed mode 的標準觀測方式 — listener overhead 從 master 移到外部時序系統、master 不再被 GUI 拉爆。配合 4 observability 的時序資料庫已有時、JMeter metric 進同一個 Grafana、跟 application 端的 latency / error 並列、加速 6.13 Performance Regression Gate 的對照判讀。

排錯與失敗快速判讀

GUI 模式吃記憶體爆 / OOM：GUI listener（View Results Tree / Graph）會把所有 sample 留在 heap、跑大規模就 OutOfMemoryError — 設計階段才開 GUI、正式跑切 jmeter -n non-GUI、listener 用 Simple Data Writer 寫 JTL 而非 in-memory aggregate
Listener 拖累 throughput / 結果失真：太多 listener 同時開、每個 sample 都被多個 listener 處理、JMeter 自身成為瓶頸 — 正式測試只留 Simple Data Writer + Backend Listener、結果分析離線跑 jmeter -g result.jtl -o report/ 產 HTML
Thread group 計算錯 / 真實流量對不上：把 thread 當「user」直接設、忽略 think time + ramp-up、結果壓出來的是 thread 全速跑而非業務流量 — 改用 Concurrency Thread Group 或 Throughput Shaping Timer 直接以 RPS 為目標、配 Constant Timer 模擬 think time
Distributed mode 結果跟單機對不上：slave 機器 plugin / JMeter version / JVM heap 不一致、或 CSV 路徑只存在 master — 把 slave 環境 container 化（同 Docker image）、CSV 隨 .jmx 一起分發、--remote-start 統一啟動
.jmx XML diff 不可 review / merge conflict 多：多人同時改測試計畫、GUI 改完 XML 結構大變 — 拆 fragment（Test Fragment + Module Controller）、scenario 分檔、parameterization 走外部 CSV / properties、PR review 看截圖 + 跑結果而非 raw XML diff
Plugin 版本漂移 / CI 結果不可重現：dev 機器 plugin 跟 CI image 不同版 — 固定 plugin manifest、CI image 用 PluginsManagerCMD.sh install-for-jmx plan.jmx 從 plan 自動安裝、版本鎖到 image tag
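HTTPS / TLS handshake explosion: each JMeter thread opens its own connections, and by default connection and SSL state is reset on every thread-group iteration (httpclient.reset_state_on_thread_group_iteration=true), so a large thread count can saturate the server's TLS layer and the test ends up measuring TLS rather than the app. Enable KeepAlive on the HTTP samplers, consider turning off the per-iteration reset when the modeled clients reuse connections, and tune httpclient4.idletimeout if needed.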

案例回寫

JMeter 在 09 案例庫中適合作為 enterprise load test 承接點。它可回寫到 9.C15 Tixcraft 售票壓測的 pre-event validation、9.C17 BookMyShow ticketing 的售票流量模型、9.C1 Prime Day readiness 的 staged validation、9.C13 Hotstar IPL 1860 萬同時觀看的全球直播 pre-event rehearsal、以及 9.C14 Standard Chartered 跨 7 個受監管市場的 Aurora 4000 TPS 容量驗證。

這些案例提供的是複雜業務流程與活動前驗證節奏。JMeter 頁引用案例時，要把 case 轉成 thread group、ramp-up、data set、dependency sampler 與 result artifact，並讓負載數字回到業務流程判讀 — 例如 Hotstar 的「集中地理區 CDN 壓力」要在 JMeter 用 per-region thread group 模擬、不是把全球流量塞進單一 runner。

下一步路由

MySQL

Wed, 13 May 2026 00:00:00 +0000

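MySQL is a common choice for large internet services: simple-query performance is strong and the database sharding ecosystem (Vitess / PlanetScale) is mature. Core OLTP at large-scale services such as GitHub, Shopify, Slack, Facebook and YouTube (which started on MySQL and later built Vitess) largely runs on MySQL. InnoDB's row-level locking, clustered index and buffer pool tuning have all been validated in depth at that scale.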

教學路線：高併發 OLTP 與分片生態

MySQL 服務頁的教學目標是把「簡單 SQL 查詢」推進到高併發 OLTP、replication、online schema change 與 sharding governance。讀者讀完後要能判斷 MySQL 何時是成熟預設、何時已經進入 Vitess / PlanetScale 或 application sharding 的討論。

學習段	核心問題	對應段落
OLTP 基線	MySQL 適合哪種大量簡單查詢與交易路徑	定位、適用場景
Replication	replica、failover、lag 與 read scaling 如何影響服務	容量特性、容量規劃要點
Schema change	online schema change 與 migration 如何保護高流量服務	容量規劃要點、預計實作話題
Sharding	Vitess、PlanetScale 與 application sharding 何時變成主線	跟其他 vendor 的取捨
替代路由	何時轉 PostgreSQL、Aurora、DynamoDB 或 distributed SQL	不適用場景、下一步路由

定位：高併發簡單 SQL + 強分片生態

MySQL 跟 PostgreSQL 是 SQL OLTP 兩大主流、但設計取捨明顯不同：

MySQL 偏 簡單 query 效能 + 分片生態 — InnoDB clustered index 對 primary key range query 特別快、Vitess 提供超大規模透明 database sharding
PostgreSQL 偏 特性深度 — 詳見 PostgreSQL vendor page

選 MySQL 的核心訴求：需要超大規模分片（> 100 TB、> 100K WPS）、簡單 query 為主、已用 MySQL 生態工具鏈（gh-ost、pt-online-schema-change）。

容量特性

單一 primary 寫吞吐：

標準 InnoDB：10K-30K WPS（依 row size、commit sync、index 數量）
高階 instance + 優化 schema：50K-100K WPS
超過此級別 → Vitess sharding 或 PlanetScale

Connection 上限：

預設 max_connections = 151、實務常設 1000-5000
每個 connection thread stack ~3 MB + session buffer 累積、active 高峰時 ~8-10 MB（thread + sort/join buffer）
仍建議 ProxySQL / connection pool 限制 backend connection 數
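The connection figures above turn into a simple memory budget. The sketch below is a back-of-the-envelope calculation only; the function names are hypothetical, and the per-connection figure is whatever your own measurements give (the ~10 MB used in the example comes from the peak estimate above).

```python
def connection_memory_gib(max_connections: int, per_conn_mib: float) -> float:
    """Worst-case memory if every connection is active at per_conn_mib."""
    return max_connections * per_conn_mib / 1024


def max_safe_connections(ram_gib: float, buffer_pool_gib: float,
                         os_reserve_gib: float, per_conn_mib: float) -> int:
    """Connections that fit in the RAM left after buffer pool and OS reserve."""
    free_mib = (ram_gib - buffer_pool_gib - os_reserve_gib) * 1024
    return max(0, int(free_mib // per_conn_mib))


if __name__ == "__main__":
    # 5000 connections at ~10 MiB each is roughly 48.8 GiB on its own.
    print(connection_memory_gib(5000, 10))
    # 64 GiB host, 45 GiB buffer pool, 4 GiB OS reserve: about 1536 connections.
    print(max_safe_connections(64, 45, 4, 10))
```

The gap between a configured `max_connections` and the number that actually fits in RAM is the argument for capping backend connections at a pool or proxy layer.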

Replication：

async / semi-sync / GTID-based
跨 AZ async lag 通常 < 100ms
跨 region 通常用 chain replication 或 binlog 同步

Storage 上限：

單一 table 64 TB（InnoDB 設計上限）
實務超過 1 TB 表建議分片

適用場景

1. 大規模 OLTP + 分片需求：

流量 > 50K WPS、必須進入 database sharding 設計
用 Vitess / PlanetScale 透明 sharding、應用層幾乎不必改
對應產業：超大網路服務（GitHub、Shopify、Slack）

2. 簡單 query 為主：

primary key lookup、簡單 range query
不太用 CTE、window function、複雜 JOIN
InnoDB clustered index 對這類 workload 特別快

3. 既有 MySQL 生態工具：

gh-ost / pt-online-schema-change（online schema migration）
Orchestrator（HA topology 管理）
ProxySQL（query routing + connection pool）
Maxwell / Debezium MySQL（CDC）

4. 強一致 transaction 但容忍部分 SQL 功能缺失：

不需 partial index、不需 JSONB indexing
不需 PostGIS、用 spatial extension 夠

5. Aurora MySQL（managed 路徑）：

從自管 MySQL 上 AWS、保留 wire protocol
詳見 Aurora vendor page

不適用場景

1. 需要 PostgreSQL 等級的 SQL / JSON 特性：

複雜 CTE、recursive query、window function
JSON Schema validation、JSONB GIN indexing
PostGIS 等深度 extension

2. 全球 multi-region active-active write：

MySQL 設計是 single primary、跨 region 是 async
替代：Aurora DSQL、Spanner、Vitess multi-cluster

3. 大規模 OLAP：

MySQL 定位在 OLTP，analytics workload 交給 OLAP 系統
替代：ClickHouse、BigQuery、Snowflake

4. KV 簡單查詢 + sub-10ms p99：

跟 PostgreSQL 一樣有 parsing / planning 開銷
替代：DynamoDB、Redis

跟其他 vendor 的取捨

vs PostgreSQL：

詳見 PostgreSQL vendor page 對比段
摘要：MySQL 適合超大規模分片、PostgreSQL 適合進階 SQL 特性

vs Aurora MySQL（同 wire protocol）：

MySQL（自管 / RDS）：可跨雲、彈性高
Aurora MySQL：AWS managed、storage / compute 分離、更多 read replica
選自管 MySQL：跨雲需求、預算敏感
選 Aurora MySQL：AWS 生態深、需要 storage scaling

vs PlanetScale（Vitess managed）：

MySQL（自管 + Vitess）：完全控制、可自管分片
PlanetScale：managed Vitess、branch-based schema migration
選 MySQL + Vitess：team 有能力管 Vitess、預算敏感
選 PlanetScale：想 zero ops、branch-based workflow

vs TiDB：

MySQL：single-primary、傳統分片靠 Vitess
TiDB：MySQL wire protocol 相容、HTAP（OLTP + OLAP 同庫）、跨 region 強一致
選 MySQL：已有 MySQL 投資、不想換引擎
選 TiDB：需要跨 region 強一致 + OLAP 同庫

vs Vitess（self-managed sharding layer）：

Vitess 本質是 MySQL 上層的 sharding layer
由 YouTube 設計、捐贈 CNCF
適合超大規模 MySQL 集群、需要透明 sharding

vs DynamoDB（document/KV 替代）：

MySQL：SQL、有 transaction、ad-hoc query、connection-based
DynamoDB: KV, transparent partitioning, no connection limit, 99.99% SLA for single-region tables (the 99.999% figure applies to global tables)
選 MySQL：需要 ad-hoc query、複雜 JOIN、SQL transaction
選 DynamoDB：access pattern 固定、AWS-only、想避免 connection limit 問題
詳見 1.10 KV / Document DB 容量規劃的 connection model 對比

vs Spanner / CockroachDB / Aurora DSQL（distributed SQL）：

MySQL + Vitess：自管 sharding、operational 重、跨雲可用
Spanner / CockroachDB / Aurora DSQL：分散式 SQL、跨 region 強一致、transparent sharding
選 MySQL + Vitess：已有 MySQL 投資、有能力管 Vitess、預算敏感
選 distributed SQL：需要 multi-region 強一致、不想自管 sharding
詳見 1.11 全球分散式 OLTP

vs MongoDB（document 替代）：

MySQL：SQL + JSON column 補充
MongoDB：document 為主、aggregation pipeline 強、schema-flexible
選 MySQL：主要結構化、少量半結構化
選 MongoDB：document 占主要 schema、aggregation 工作負載

容量規劃要點

1. Sharding 是 MySQL 大規模的核心：

單一 MySQL primary 寫吞吐有上限
Vitess / PlanetScale 用 keyspace + shard 切分
shard key 設計類似 DynamoDB partition key — 必須均勻
大規模案例：Shopify（多 shard 分散）、Slack（per-team sharding）
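The requirement that a shard key be evenly distributed can be checked before any resharding work. The sketch below hashes candidate key values into shards and reports how much the busiest shard exceeds the mean. It is an illustration only: the hash-and-modulo routing is an assumption for the example and is not how Vitess vindexes are implemented.

```python
import hashlib
from collections import Counter


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a stable hash (illustrative routing only)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def skew_ratio(keys, num_shards: int) -> float:
    """Rows on the busiest shard divided by the mean rows per shard.
    1.0 is perfectly even; large values indicate a hot shard."""
    keys = list(keys)
    if not keys or num_shards <= 0:
        raise ValueError("need at least one key and one shard")
    counts = Counter(shard_for(k, num_shards) for k in keys)
    return max(counts.values()) / (len(keys) / num_shards)


if __name__ == "__main__":
    even = [f"user-{i}" for i in range(10_000)]
    hot = ["team-1"] * 900 + [f"team-{i}" for i in range(2, 102)]
    print(round(skew_ratio(even, 4), 2))  # close to 1.0
    print(round(skew_ratio(hot, 4), 2))   # at least 3.6: one tenant dominates
```

Running the check against a sample of real key values (for example tenant IDs weighted by row count) surfaces a hot tenant before it becomes a hot shard.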

2. Online schema change 是必備：

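Running ALTER TABLE directly can lock the whole table: changes that MySQL cannot apply as instant or in-place DDL rebuild the table and block writes, and even online DDL briefly takes a metadata lock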
gh-ost（GitHub）/ pt-online-schema-change（Percona）/ Vitess online DDL 用 ghost table 漸進 migrate
大表 schema change 可能跑 hours / days、要排程

3. Replication 跟 GTID：

GTID-based replication 比 binlog position 容易管 topology
semi-sync replication 保證至少一個 standby ack 才 commit
async replication 高吞吐但 lag 較大

4. Connection management：

ProxySQL 是 MySQL 生態的 connection pool 標準
提供 query routing（讀 → replica、寫 → primary）
對應 9.C29 Lemino case — RDB connection limit 議題對 MySQL 同樣適用

5. InnoDB tuning：

innodb_buffer_pool_size：dedicated server 70-75%、shared server 30-50%（詳見 InnoDB Tuning）
innodb_flush_log_at_trx_commit：1（durable）vs 2（faster）vs 0（fastest, 不安全）
innodb_io_capacity：依 storage 類型調整

Anti-recommendation 與升級路由

MySQL 的成熟生態容易讓讀者過早引入重工具。這一段補上 deep article audit 提到的 anti-recommendation 缺口：先說何時維持簡單 MySQL 路徑，再說何時升級到 ProxySQL、Orchestrator、gh-ost、Vitess、PlanetScale 或 distributed SQL。

機制	維持簡單設計的條件	升級訊號	主要引用路徑
Replication	單 primary + 1-2 replica，lag 可被 read routing 容忍	failover 反覆手動、GTID gap、semi-sync fallback	Replication Topology、Orchestrator Failover
Online schema change	小表、maintenance window 足夠、MySQL 8.0 instant DDL 可 cover	大表 ALTER 需 hours、metadata lock 影響 production	Online Schema Change Tools、6.11 Migration Safety
ProxySQL	application pool + primary endpoint 已能控制連線	read/write routing、lag-aware routing、connection storm	ProxySQL Config、Connection Pool
Vitess / sharding	單 primary 寫入與資料量仍在可維護範圍	> 50K WPS、> 100 TB、shard key 已明確、跨 shard query 可接受	Vitess Sharding、Database Sharding
PlanetScale	團隊已有 DBA / SRE 能力管理 Vitess 或自管 MySQL	想把 Vitess ops、schema branch workflow 與 failover 交給平台	→ PlanetScale、Vitess → PlanetScale
Distributed SQL	workload 仍是 single-region OLTP 或 Vitess 可解	multi-region 強一致、cross-shard transaction 是核心需求	1.11 全球分散式 OLTP

Replication 的簡單路徑是 GTID + async replica + 明確 read routing。當 failover 仍靠人工判斷、replica re-pointing 反覆出錯、或 semi-sync fallback 沒有被監控時，才需要把 Orchestrator、ProxySQL 與 incident runbook 放進同一條 HA 路徑。

Online schema change 的簡單路徑是先判斷 MySQL 8.0 instant / inplace DDL 能否 cover。只有大表 rewrite、長時間 metadata lock、FK / trigger 複雜互動或 maintenance window 不足時，才讓 gh-ost / pt-online-schema-change 成為主線工具。

Sharding 的簡單路徑是延後到資料形狀穩定後再做。Vitess 能把 MySQL 推到超大規模，但它也引入 VTGate、VTTablet、VReplication、VSchema、resharding workflow 與跨 shard transaction 邊界；shard key 還沒穩定時，應先用 schema、index、read replica、partition 與容量治理延長單 primary 壽命。

Managed sharding 的簡單路徑是先確認團隊想轉移哪一層責任。PlanetScale 解的是 Vitess operation、branch-based schema workflow 與 managed failover；FK、cross-shard query、connection pool 與 cost model 仍要在 migration playbook 中驗證。

Deep article + Migration playbook（已完成）

主題	文章	類型
Replication topology（async / semi-sync / GTID）配置	replication-topology	Deep article
gh-ost / pt-online-schema-change 對比	online-schema-change-tools	Deep article
ProxySQL 配置跟 query routing	proxysql-config	Deep article
Orchestrator failover 設計	orchestrator-failover	Deep article
InnoDB tuning（buffer pool / log / IO）	innodb-tuning	Deep article
Binary log + Maxwell / Debezium CDC	binlog-cdc	Deep article
Vitess sharding 設計	vitess-sharding	Deep article
8.0 modern SQL（CTE / window / JSON_TABLE）	modern-sql-features	Deep article
Group Replication / InnoDB Cluster 部署	group-replication	Deep article
Query optimization deep dive	query-optimization	Deep article
Partitioning（range / list / hash / sub-partition）	partitioning	Deep article
PITR + Backup strategy	pitr-backup	Deep article
Lock contention（gap / next-key / deadlock）	lock-contention	Deep article
Hands-on 操作路線	hands-on	操作型章節群
5.7 → 8.0 major version upgrade	major-version-upgrade	Migration playbook（Type E）
從自管 MySQL 遷到 Aurora MySQL	migrate-to-aurora	Migration playbook（Type C）
從自管 MySQL 遷到 PlanetScale	migrate-to-planetscale	Migration playbook（Type E）
自管 Vitess 遷到 PlanetScale	migrate-vitess-to-planetscale	Migration playbook（Type C）
從 MySQL 遷到 PostgreSQL	migrate-to-postgresql	Migration playbook

補充正文路由

當前 deep article、migration playbook、補充正文與 hands-on 已 cover ops / schema / failover / tuning / SQL features / sharding / backup / migration / security / audit / document / OLAP / memory / metadata lock 等維度。下列補充正文用來承接 overview 中提到的延伸議題：

Encryption at rest + TLS in transit + key management：對應 PG TLS-mTLS 議題
Audit log + SIEM 整合：MySQL Enterprise Audit Plugin 跟 Splunk / Elastic Security 整合
MySQL Document Store（X-Protocol）：少用但對特定 use case 有興趣
Multi-source replication topology：1 個 replica 從 N 個 primary 拉、用於 sharded environment 整合
HeatWave（MySQL OLAP add-on）：Oracle 推的 HTAP solution、跟 ClickHouse / Snowflake 對比
Cross-buffer memory contention deep dive：buffer pool / connection thread / temp table / sort buffer 之間的 RAM 競爭、跟 OS swap 互動
Metadata lock deep dive：DDL / long-running SELECT / FK 互動造成的 stalls

上述補充篇已完成正文，並保留既有路由。Encryption / TLS / key management 接 TLS / mTLS 與 Secret Management；audit log 接 Audit Log 與 07 資安資料保護；Document Store 接 MongoDB vendor 與 1.10 KV / Document DB 容量規劃；multi-source replication 接 Replication Topology；HeatWave 接 OLAP 替代路由；memory contention 接 InnoDB Tuning；metadata lock 接 Lock Contention 與 Online Schema Change Tools。

已知 limitation（多輪 audit 結論）

17 篇 batch 跑過 4-reviewer audit（寫作規範 / 跨檔一致性 / 技術準確性 / 結構性質疑）後留下的 limitation：

Framework bias：5 篇 migration playbook 全落在 Type A / C / E、沒一篇 Type B / D / F。這反映 MySQL 領域 migration 的本質（多數情境是 schema 差 / operational 轉手 / paradigm shift）、也可能反映 6 type framework 的覆蓋限制
Anti-recommendation 已補 overview 路由：本頁新增「Anti-recommendation 與升級路由」作為總入口；各 deep article 之後仍可逐篇補「何時維持簡單設計」段。
Real case anchor 已下沉：本頁「真實案例 anchor」把 Shopify、Slack、GitHub gh-ost、YouTube / Vitess 與既有 09 case 串回 deep article；Shopify CDC、gh-ost workflow、YouTube / Vitess 與 Netflix Aurora consolidation 已補到對應 deep article 的 production case 段。
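PG comparison narrative: the comparison sections are reasonably fair, but PostgreSQL's weak points (vacuum operational overhead, the connection-per-process model, replication slot governance) are rarely developed from the MySQL perspective, so the one-sided comparison occasionally skews against MySQL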

案例對照

MySQL 沒有直接的 09 case（大規模 MySQL 多在 engineering blog、不在 vendor case study）、但作為 baseline / 遷移源在多處出現：

案例	跟 MySQL 的關係
9.C23 Netflix Aurora consolidation	從多套 RDBMS（含 MySQL）統一到 Aurora MySQL
9.C20 Zomato TiDB → DynamoDB	TiDB（MySQL 相容）→ DynamoDB 對比
9.C29 Lemino RDB connection limit	MySQL connection 限制問題（同 PostgreSQL）

真實案例 anchor

MySQL 真實案例的責任是把大規模 OLTP 的機制壓力放回正文。案例不只證明「某公司使用 MySQL」，而是提供 schema change、CDC、sharding、connection、queue 整合或 managed migration 的壓力來源。

案例 / 來源	回收的工程訊號	對應正文路由
Shopify Debezium CDC over sharded MySQL	100+ shard、~150 Debezium connector、BFCM 100K records/sec、snapshot lock 與 oversized payload	Binary Log + CDC、Database Sharding、Kafka vendor
Slack Job Queue 演進到 Kafka + Redis	成長期把背景工作拆成多條傳遞路徑，揭露單一資料路徑與 queue 路徑分工	MySQL 只承擔 OLTP source of truth；queue / cache 路徑回 03 Message Queue
gh-ost / GitHub operation workflow	大表 schema change 需要 throttle、pause / resume、cutover 控制	Online Schema Change Tools
YouTube / Vitess	MySQL sharding layer 需要 VTGate、VTTablet、VReplication、VSchema	Vitess Sharding、Database Sharding、→ PlanetScale
9.C23 Netflix Aurora consolidation	多套 RDBMS 整併到 managed Aurora，揭露 operation transfer driver	→ Aurora、Aurora vendor
9.C29 Lemino RDB connection limit	surge 場景 connection limit 讓 RDB 退到 DynamoDB 類 access pattern	ProxySQL Config、1.10 KV / Document DB 容量規劃

案例下沉規則是先放 overview，再進 deep article。當某個案例只支撐服務定位，留在本頁；當案例提供具體操作訊號，例如 Shopify 的 Debezium connector scaling、GitHub 的 gh-ost workflow 或 YouTube 的 Vitess topology，對應 deep article 要保留 production case 段、讓讀者能從機制直接跳到案例。

常見陷阱

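Running ALTER TABLE directly on a large table: a change that needs a table rebuild can block writes for hours and stall production; first confirm whether instant / in-place DDL covers the change, otherwise use an online schema change tool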
不用 GTID：replication topology 變更困難、recover from failure 容易出錯
buffer pool 太小：cache miss 高、IOPS 飆升
shard key 選錯：hot shard 出現、整體吞吐達不到名義
connection 沒 pool：跟 PostgreSQL 同樣問題、用 ProxySQL
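semi-sync on high-throughput workloads: every commit waits for a replica ack, so commit latency grows by about one network round trip and write throughput can drop by as much as half, depending on round-trip time and commit concurrency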

下一步路由

完整 T1 對照：01-database vendors index
平行：PostgreSQL vendor、Aurora vendor（managed MySQL）
操作：MySQL Hands-on（local lab、ProxySQL、OSC、replication failover、backup restore、Vitess sandbox）
上游：1.1 高併發資料存取、1.3 Transaction Boundary
下游：1.10 KV / Document DB 容量規劃（MySQL 不適用時的替代）
跨模組：9.5 瓶頸定位流程 — connection / replication / lock contention 常見 MySQL bottleneck
官方：MySQL Documentation、Vitess、PlanetScale

Apache Kafka

Fri, 01 May 2026 00:00:00 +0000

Kafka 是 distributed event streaming platform、承擔三個責任：log-based 訊息儲存（partition + replication）、事件流分發（consumer group 各自進度）、跨系統事件總線（schema-aware contract）。設計取捨偏向「寫入即承諾、可長期保留、多 consumer 各自 replay」、broker 級可靠性與 consumer 端 idempotency 拆開、broker 不負責業務正確性。

對「事件驅動架構、CDC、跨系統事件分發、長期保留 + replay」這條路徑、Kafka 是業界事實標準。本頁先給最短路徑、再展開日常 producer / consumer 操作與 topic 設計、最後進階治理（多租戶、跨區、自動修復）跟排錯。

本章目標

讀完本章後、你應該能：

用 docker-compose 跑起 Kafka + KRaft、驗證 broker 健康
用 CLI 建 topic、produce / consume 訊息、看 partition 分布
設計 producer acks / idempotence / consumer commit 策略對齊 delivery semantics
看懂 consumer lag、ISR shrink、rebalance 訊號、定位故障層
評估 multi-tenant、cross-region、tiered storage、self-healing 等規模化議題

最短路徑：5 分鐘把 Kafka 跑起來

最短路徑用 KRaft 模式（取代 ZooKeeper、單節點即可跑）、避免初學者卡在 ZK 安裝。

# 1. Start Kafka (the apache/kafka image ships with KRaft; a single container runs both broker and controller)
docker run -d --name kafka -p 9092:9092 apache/kafka:latest

# 2. Create a topic (the CLI tools live in /opt/kafka/bin/ inside the container)
docker exec kafka /opt/kafka/bin/kafka-topics.sh --create --topic demo --partitions 3 \
  --bootstrap-server localhost:9092
docker exec kafka /opt/kafka/bin/kafka-topics.sh --describe --topic demo \
  --bootstrap-server localhost:9092

# 3. Verify produce / consume
docker exec kafka bash -c "echo hello | /opt/kafka/bin/kafka-console-producer.sh \
  --topic demo --bootstrap-server localhost:9092"
docker exec kafka /opt/kafka/bin/kafka-console-consumer.sh --topic demo \
  --from-beginning --max-messages 1 --bootstrap-server localhost:9092

最短路徑只驗證「broker 起來、能寫能讀」。實際寫程式用 producer / consumer client、見日常操作。

日常操作與決策形狀

CLI 與 client API

子議題：

CLI 指令對照表（kafka-topics / kafka-configs / kafka-consumer-groups / kafka-acls）
Producer client 配置：acks / batch.size / linger.ms / compression / enable.idempotence
Consumer client 配置：auto.offset.reset / enable.auto.commit / max.poll.records / max.poll.interval.ms
對應指令範例：kafka-topics.sh --describe、kafka-consumer-groups.sh --describe --group

Topic 設計

Topic 承擔事件的邏輯邊界。子議題：

Partition 數規劃（並行度 vs metadata 成本）
Replication factor 與 min.insync.replicas（資料保護等級）
Retention policy（time-based vs size-based、compact vs delete）
Key 策略（ordering 範圍、hot partition 避免）
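For the partition-count item above, a widely used rule of thumb sizes partitions from measured per-partition throughput: take the larger of target / producer-rate and target / consumer-rate. The sketch below encodes that heuristic; the function name, the `headroom` multiplier and the example rates are assumptions, and the result is a starting point to be weighed against metadata cost, not a definitive formula.

```python
import math


def partitions_needed(target_mb_s: float,
                      producer_mb_s_per_partition: float,
                      consumer_mb_s_per_partition: float,
                      headroom: float = 1.0) -> int:
    """Partitions required so neither producers nor consumers are the limit.
    headroom > 1.0 reserves capacity for growth (assumed policy knob)."""
    if producer_mb_s_per_partition <= 0 or consumer_mb_s_per_partition <= 0:
        raise ValueError("per-partition throughput must be positive")
    need = max(target_mb_s / producer_mb_s_per_partition,
               target_mb_s / consumer_mb_s_per_partition) * headroom
    return max(1, math.ceil(need))


if __name__ == "__main__":
    # 100 MB/s target, 10 MB/s per partition on produce, 5 MB/s on consume.
    print(partitions_needed(100, 10, 5))        # consumer side dominates
    print(partitions_needed(100, 10, 5, 1.5))   # with 50% growth headroom
```

The consumer rate usually dominates because per-message processing is slower than appending to the log, which is why the measurement should come from the real consumer code path.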

Producer 與 Consumer 設計

設計決定 delivery semantics 實際達成。子議題：

Producer：acks=0/1/all 對應的可靠性取捨、idempotence、transaction 邊界
Consumer：commit 策略（auto vs manual）、commit 時機與 at-least-once / at-most-once 對應
Consumer group：rebalance protocol（eager vs cooperative）、static membership
對應指令：producer 配置範例、consumer 配置範例、kafka-consumer-groups.sh --describe
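The link between commit timing and delivery semantics can be seen without a broker. The sketch below is a pure simulation, not Kafka client code: one consumer crashes once at a chosen offset and restarts from its last committed offset. Every name in it is hypothetical.

```python
def consume(messages, commit_before_processing, crash_at=None):
    """Simulate a consumer that crashes once at offset crash_at and then
    restarts from the last committed offset. Returns what was processed."""
    committed = 0
    processed = []
    crashed = False
    offset = committed
    while offset < len(messages):
        if commit_before_processing:
            committed = offset + 1              # commit, then process
            if offset == crash_at and not crashed:
                crashed = True
                offset = committed              # restart: message is skipped
                continue
            processed.append(messages[offset])
        else:
            processed.append(messages[offset])  # process, then commit
            if offset == crash_at and not crashed:
                crashed = True
                offset = committed              # restart: message is re-read
                continue
            committed = offset + 1
        offset += 1
    return processed


if __name__ == "__main__":
    msgs = ["a", "b", "c"]
    print(consume(msgs, commit_before_processing=True, crash_at=1))   # at-most-once
    print(consume(msgs, commit_before_processing=False, crash_at=1))  # at-least-once
```

Committing first loses the in-flight message; committing after processing replays it, which is why at-least-once consumers need idempotent handling downstream.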

進階主題（按需閱讀）

本段主題多數已展開為 deep article：consumer rebalance 與 lag 診斷、replication / ISR / exactly-once、retention 與 tiered storage、Schema Registry 與 schema 演進、multi-tenant quota 與 ACL 治理。下列子議題段保留每個主題的選題判讀入口。

Multi-tenant 與配額治理

對應案例 3.C6 Uber Kafka 事件平台。子議題：

Producer / Consumer quota（byte rate、request rate）
ACL 設計（principal、resource、operation）
Topic 命名規範與 ownership
對應指令：kafka-configs.sh --alter --add-config 'producer_byte_rate=...'、kafka-acls.sh --add

Cross-region 與分層叢集

對應案例 3.C1 Meta FOQS 與 3.C4 LinkedIn Tiered Clusters。子議題：

MirrorMaker 2 配置（active-active vs active-passive）
分層叢集策略（critical / standard / experimental）
跨區 consumer 路徑與 routing freshness

Topic 生命週期治理

對應案例 3.C3 LinkedIn TopicGC。子議題：

Topic 活躍判準（last produce / consume timestamp）
自動回收條件與稽核
Metadata 壓力訊號（controller log、partition 數量上限）

Replication 與 exactly-once 升級

對應案例 3.C9 反例：語義誤配。子議題：

acks=all + min.insync.replicas ≥ 2 + producer idempotence
Kafka transaction 與 read_committed 邊界
端到端 exactly-once（Kafka Streams 場景）
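The first item above implies a simple availability budget: with acks=all, writes keep succeeding as long as the in-sync replica set stays at or above min.insync.replicas. A small sketch of that arithmetic, with a hypothetical function name:

```python
def tolerated_broker_failures(replication_factor: int, min_insync_replicas: int) -> int:
    """Replicas that can drop out of the ISR before acks=all writes are rejected."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("need 1 <= min.insync.replicas <= replication factor")
    return replication_factor - min_insync_replicas


if __name__ == "__main__":
    print(tolerated_broker_failures(3, 2))  # one replica can fail
    print(tolerated_broker_failures(3, 3))  # any single failure blocks writes
```

Setting min.insync.replicas equal to the replication factor maximizes durability but leaves no room for a rolling restart, which is why RF=3 with min.insync.replicas=2 is the common pairing.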

Self-healing 與自動修復

對應案例 3.C7 LinkedIn Self-Healing。子議題：

可自動修復故障類型（disk full、broker offline、under-replicated partition）
自動修復 vs 人工升級邊界
修復過程的證據鏈納入觀測

KRaft 與 Schema Registry

子議題：

KRaft mode 取代 ZooKeeper（運維簡化、metadata 治理）
Schema Registry（Confluent / Apicurio）與 Avro / Protobuf
Schema 演進策略（forward / backward / full compatibility）

Tiered storage

子議題：

冷熱分層（hot tier on local disk、cold tier on S3）
Retention 設計與成本
Read 路徑差異（hot vs cold）

Kafka Connect 與 CDC

子議題：

Source connector / Sink connector 模型
Debezium CDC pipeline 與 outbox 整合
Connect cluster 治理與 schema evolution

排錯快速判讀

Consumer lag 暴增

操作原則：先看 lag 是「均勻分布」還是「集中在少數 partition」、再定位 consumer 慢 vs partition 不平衡。

kafka-consumer-groups.sh --describe --group  --bootstrap-server localhost:9092
# Output lists CURRENT-OFFSET / LOG-END-OFFSET / LAG per partition, showing which partitions the lag concentrates on

判讀路徑：consumer 慢（CPU / GC / 下游 I/O）→ producer 突增 → partition 不平衡（key 分布）。
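The "uniform versus concentrated" check above can be automated once per-partition lag has been read from the describe output. In the sketch below, the function name and the `hot_factor` threshold of 3 are assumptions chosen for illustration; tune the threshold to your own traffic.

```python
def classify_lag(lag_by_partition, hot_factor=3.0):
    """Label lag as 'no-lag', 'concentrated' or 'uniform'.
    'concentrated' means the worst partition lags at least hot_factor times
    the average of the remaining partitions."""
    lags = list(lag_by_partition.values())
    total = sum(lags)
    if total == 0:
        return "no-lag"
    if len(lags) < 2:
        return "uniform"  # a single partition cannot show imbalance
    top = max(lags)
    others_mean = (total - top) / (len(lags) - 1)
    if top >= hot_factor * max(others_mean, 1):
        return "concentrated"
    return "uniform"


if __name__ == "__main__":
    print(classify_lag({0: 9000, 1: 100, 2: 120}))   # suspect key skew
    print(classify_lag({0: 1000, 1: 1100, 2: 950}))  # suspect slow consumers
```

A "concentrated" result points at key distribution or a stuck consumer instance; "uniform" points at overall consumer capacity or a producer burst.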

ISR shrink 與 under-replicated partition

操作原則：ISR 縮小代表 follower 跟不上 leader、看 broker 健康 / 網路 / disk。

kafka-topics.sh --describe --under-replicated-partitions --bootstrap-server localhost:9092
# Empty output means every partition is fully in sync; any partition listed has replicas that have fallen out of the ISR

Rebalance storm

操作原則：consumer 頻繁加入 / 離開觸發 rebalance、看 session.timeout.ms 與 max.poll.interval.ms。

Offset reset 或重複消費

對應反例 3.C9。判讀路徑：commit 策略錯誤、broker 端 offset 過期、auto.offset.reset = earliest。

Schema 不相容

操作原則：producer 升級 schema、consumer 未升、看 compatibility level。

何時改走其他服務

需求形狀	改走
任務隊列（中等吞吐、複雜 routing）	RabbitMQ
Managed queue（AWS 生態、簡單）	AWS SQS
Managed pub/sub（GCP 生態）	Google Pub/Sub（遷移路徑見 Kafka → Pub/Sub）
輕量 messaging + 微服務通訊	NATS
Redis 生態內 stream	Redis Streams
Managed Kafka	AWS MSK / Confluent Cloud（見 3.C2）
Kafka 相容、單 binary	Redpanda（T2 候選）
多租戶 + 分層儲存原生	Apache Pulsar（T2 候選）

不在本頁內的主題

各語言 client API reference（依官方文件）
Kafka Streams / ksqlDB（另開 stream processing 章節）
Confluent 商業功能（Confluent Cloud、Control Center）

案例回寫

既有通用案例（C1-C10）

案例	主討論議題
3.C1 Meta FOQS	跨區 queue、tenant 遷移節奏
3.C2 VMware → MSK	自管轉 managed、ACL / cutover
3.C3 LinkedIn TopicGC	Topic 生命週期治理
3.C4 LinkedIn Tiered Clusters	分層叢集策略
3.C5 Slack Kafka+Redis	多 broker 組合拓樸
3.C6 Uber Kafka	多租戶 + 平台治理
3.C7 LinkedIn Self-Healing	自動修復
3.C8 Cloudflare Queues	全球交付（對比）
3.C9 反例：語義誤配	Replication + idempotence 升級
3.C10 規模對照	不同規模下的佇列模型

Kafka 專屬案例（C11-C22）

案例	主討論議題
3.C11 Pinterest Tiered Storage	Broker-decoupled tiered storage / S3
3.C12 Pinterest Shallow Mirror	MirrorMaker CPU/memory 優化
3.C13 Shopify Debezium CDC	Sharded MySQL CDC pipeline
3.C14 Yelp Schematizer	Schema Registry + 強制 compatibility
3.C15 Airbnb Spark Streaming	Partition-task 解耦 / data skew
3.C16 Robinhood Faust	Python stream processing 生態
3.C17 Walmart MPS	Partition-consumer 1:1 解耦 / K8s 擴張
3.C18 Wix Greyhound	TLLSR consumer troubleshooting
3.C19 Wix Multi-cluster	Metadata scaling ceiling / 分群
3.C20 Spotify 遷出 Kafka	（反例）early Kafka 版本可靠性硬限制
3.C21 Goldman Sachs MSK	MM2 + LB + timeout 整合 pitfall
3.C22 Trivago KEDA	Consumer lag 驅動 scale-to-zero

KRaft 缺直接 customer case：目前依官方 KIP-833 / Confluent 公告為準、後續若有 customer 一手案例可補。

下一步路由

上游概念：0.3 非同步選型、3.1 broker basics
平行 vendor：RabbitMQ、NATS
下游能力：3.4 consumer 設計、6.12 idempotency / replay

CircleCI

Fri, 01 May 2026 00:00:00 +0000

CircleCI 是獨立 CI/CD 平台、承擔三個責任：強進階 cache（layer-aware）+ parallelism（test splitting）、跨 VCS（GitHub / Bitbucket / GitLab）、resource class 彈性（含 macOS / ARM / GPU）。設計取捨偏向「進階 cache + 並行加速 + cross-VCS」、適合需要極致 build speed 跟 macOS runner 的團隊。

本章目標

讀完本章後、你應該能：

寫 .circleci/config.yml workflow
設計 cache + workspace 加速 build
用 parallelism + test splitting
選 resource class（CPU / memory / macOS / GPU）
評估 CircleCI vs GitHub Actions 的選用

最短路徑：5 分鐘把 CircleCI 跑起來

# .circleci/config.yml
version: 2.1
jobs:
  test:
    docker: [{image: cimg/node:20}]
    steps:
      - checkout
      - run: npm test
workflows:
  ci:
    jobs: [test]

日常操作與決策形狀

Pipeline / workflow / job 模型

子議題：

Pipeline（一次 trigger 的執行）
Workflow（多 job 編排、DAG）
Job（一組 step）
對應指令範例：circleci local execute（本地測 config）

Orb 重用

子議題：

Orb = a package of reusable config (commands / jobs / executors)
Public orb registry（circleci.com/developer/orbs）
Private orb for company

Cache + workspace

子議題：

Cache：跨 build 保留（dependency / build artifact）
Workspace：同 workflow 內 job 之間傳遞
Cache key 設計（與 GitHub Actions 類似）
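Cache key design comes down to deriving the key from whatever determines the cached content, usually a lockfile checksum, with a less specific prefix as a fallback. The sketch below models that idea in plain Python; the key layout, function names and `arch` segment are illustrative assumptions, not CircleCI's internal format.

```python
import hashlib


def cache_key(prefix: str, lockfile_bytes: bytes, arch: str = "linux-amd64") -> str:
    """Exact key: changes whenever the lockfile content changes."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()[:16]
    return f"{prefix}-{arch}-{digest}"


def restore_candidates(prefix: str, lockfile_bytes: bytes, arch: str = "linux-amd64"):
    """Keys to try in order: the exact key, then a prefix that matches the
    most recent cache for the same prefix and architecture."""
    return [cache_key(prefix, lockfile_bytes, arch), f"{prefix}-{arch}-"]


if __name__ == "__main__":
    old = b'{"lodash": "4.17.20"}'
    new = b'{"lodash": "4.17.21"}'
    print(cache_key("deps-v1", old))
    print(cache_key("deps-v1", new))          # different key: dependencies changed
    print(restore_candidates("deps-v1", new))  # exact key first, prefix fallback second
```

The version segment in the prefix (`v1` here) is a manual escape hatch: bumping it invalidates every existing cache when a poisoned cache needs to be discarded.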

進階主題（按需閱讀）

Parallelism + test splitting

子議題：

Job parallelism N
Test splitting by timings / name / filesize (circleci tests split --split-by=...)
對應 test suite 加速

Resource class

子議題：

small / medium / large / xlarge / 2xlarge
macOS / Arm / GPU classes
跟 cost 平衡

Self-hosted runner

子議題：

Runner agent
適合：內網 / 特殊環境

OIDC integration

子議題：

OIDC token → AWS / GCP（無 long-lived secret）
跟 GitHub Actions 同 pattern

Approval job

子議題：

type: approval job：人工介入
對應 6.8 Release Gate

Cross-VCS support

子議題：

GitHub / Bitbucket / GitLab
跟 GitHub Actions 只 GitHub 對比

排錯快速判讀

Build 慢

操作原則：cache miss / test 沒 split / resource class 太小。

Cache 不命中

操作原則：cache key 設計問題 / key change。

Parallelism 不均勻

操作原則：test split strategy（timing 最好但要 historical data）。
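Why timing-based splitting evens out parallel nodes, and why it needs historical data, is easiest to see in a small model. The sketch below uses a greedy longest-first assignment; it is an assumption-laden illustration of the idea, not CircleCI's actual splitting algorithm, and the timings are made up.

```python
import heapq


def split_by_timing(timings, parallelism):
    """Assign test files to nodes, longest first, always to the node with the
    smallest accumulated runtime. timings maps file name -> seconds."""
    if parallelism < 1:
        raise ValueError("parallelism must be >= 1")
    # Heap entries: (accumulated seconds, node index, assigned files).
    buckets = [(0.0, i, []) for i in range(parallelism)]
    heapq.heapify(buckets)
    for name, secs in sorted(timings.items(), key=lambda kv: (-kv[1], kv[0])):
        total, idx, files = heapq.heappop(buckets)
        files.append(name)
        heapq.heappush(buckets, (total + secs, idx, files))
    return [files for _, _, files in sorted(buckets, key=lambda b: b[1])]


if __name__ == "__main__":
    history = {"a": 10, "b": 8, "c": 5, "d": 4, "e": 3}
    print(split_by_timing(history, 2))  # node totals: 14 s and 16 s
```

Splitting the same five files by name would put `a`, `b`, `c` (23 s) on one node and `d`, `e` (7 s) on the other, which is the uneven parallelism this troubleshooting entry describes.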

Approval 卡住

操作原則：approval job 沒人按 / on-call 不在。

何時改走其他服務

需求形狀	改走
GitHub-hosted	GitHub Actions
Self-hosted enterprise	Jenkins / Buildkite / Tekton
GitLab-hosted	GitLab CI
複雜 DAG / K8s-native	Tekton / Argo Workflows
預算敏感	GitHub Actions / self-hosted Jenkins

不在本頁內的主題

各 Orb 細節
CircleCI Server（self-host enterprise）
Pricing 細節

案例回寫

案例方向	對應主題
Stripe：Idempotency 與零停機遷移	canary deploy / approval job 的部署節奏
Shopify：BFCM 容量治理與 Game Day	峰值前 CI workflow 跑 capacity test
Microsoft：變更治理與可靠性門檻	approval job 對應變更分層審查

待補 CircleCI customer case：大規模 CircleCI 採用、macOS / iOS CI 加速案例、CircleCI → GitHub Actions 遷移案例。

下一步路由

上游概念：6.8 Release Gate
平行 vendor：GitHub Actions
下游能力：07 security、5 deployment

Docker

Fri, 01 May 2026 00:00:00 +0000

Docker 是最早 popularize container 的工具、承擔三個責任：container image build（Dockerfile / BuildKit）、local container runtime（docker run / Compose）、image distribution（Docker Hub / private registry）。設計取捨偏向「dev experience + image format standard」、production orchestration 多被 Kubernetes + containerd 取代、但 image build / dev workflow / OCI image 仍是事實標準。

對「Local dev / CI container 工具、image build pipeline、小規模 dev 環境」這條路徑、Docker 是首選。

本章目標

讀完本章後、你應該能：

寫 Dockerfile + 跑 docker build / run
用 multi-stage build / BuildKit 優化 image
用 Docker Compose 編排 dev 環境
配置 image registry + scanning + SBOM
評估 Docker Desktop license 對團隊的影響、選替代（Podman / Rancher Desktop）

最短路徑：5 分鐘把 Docker 跑起來

# 1. Install (macOS, pick one)
brew install --cask docker            # Docker Desktop (commercial enterprises need a paid license)
# brew install podman                 # Alternative: Podman (daemonless, free)

# 2. Run a container
docker run -d -p 8080:80 --name web nginx:stable-alpine
docker ps && docker logs web

# 3. Build + push an image
docker build -t myapp:1 .
docker tag myapp:1 ghcr.io//myapp:1
docker push ghcr.io//myapp:1

日常操作與決策形狀

Dockerfile 設計

子議題：

FROM / RUN / COPY / WORKDIR / EXPOSE / CMD / ENTRYPOINT
Multi-stage build（build stage + runtime stage 分離）
Layer cache 設計（COPY 順序影響 cache hit）
對應指令：docker build --no-cache、docker history
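The effect of COPY order on cache hits follows from one rule: a layer is reusable only if its instruction, its inputs and every layer before it are unchanged. The sketch below models that rule with chained hashes. It is a simplified model for reasoning about Dockerfile order, not BuildKit's real cache implementation, and the instruction strings and fingerprints are made up.

```python
import hashlib


def chain_ids(layers):
    """layers is a list of (instruction, input_fingerprint). Each layer id
    depends on its own content and on the id of the layer before it."""
    ids, parent = [], ""
    for instruction, inputs in layers:
        parent = hashlib.sha256(f"{parent}|{instruction}|{inputs}".encode()).hexdigest()
        ids.append(parent)
    return ids


def cache_hits(previous_build, current_build):
    """True for each layer of current_build that the previous build can supply."""
    known = set(chain_ids(previous_build))
    return [layer_id in known for layer_id in chain_ids(current_build)]


def source_first(src):
    return [("FROM node:20", ""), ("COPY . .", src), ("RUN npm ci", "")]


def lockfile_first(src):
    return [("FROM node:20", ""), ("COPY package*.json ./", "lock-v1"),
            ("RUN npm ci", ""), ("COPY . .", src)]


if __name__ == "__main__":
    # Only application source changed between the two builds.
    print(cache_hits(source_first("src-v1"), source_first("src-v2")))
    print(cache_hits(lockfile_first("src-v1"), lockfile_first("src-v2")))
```

With the source copied first, a one-line code change forces the dependency install to rerun; copying the lockfile first keeps that expensive layer cached.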

BuildKit / Buildx

子議題：

BuildKit：新 builder、parallel + cache mount + secret + SSH agent
Buildx：cross-platform build（amd64 / arm64）
Cache backend（local / registry / S3 / GHA）
對應指令：docker buildx create --use、docker buildx build --platform=linux/amd64,linux/arm64

Docker Compose

子議題：

docker-compose.yml：service / network / volume 配置
適合：local dev 多 container（DB + cache + app）
不適合：production（用 K8s）
對應 5.2 K8s deployment

進階主題（按需閱讀）

Image security / scanning / SBOM

子議題：

Trivy / Grype / Snyk image vulnerability scanning
SBOM 產生（syft / Docker scout）
Sign image（cosign / notary v2）
對應 07 security supply chain

Image registry 選擇

子議題：

Docker Hub（public + rate limit issue）
雲端：ECR / GCR / Artifact Registry / ACR
Self-host：Harbor / GitLab Container Registry / Nexus
對應 image pull credentials 管理

Docker Desktop license

子議題：

2021 改授權：商業企業（> 250 員工 / > $10M）需付費
替代：Podman Desktop / Rancher Desktop / Colima / Lima
替代品的 daemon / rootless 差異
對應企業 IT 採購決策

Containerd / CRI-O 在 production

子議題：

K8s 1.24+ 移除 dockershim、改用 containerd / CRI-O
Docker image 跟 containerd 相容（OCI standard）
production 不用 Docker、用 containerd

Image size 優化

子議題：

Base image 選擇（distroless / alpine / scratch）
Multi-stage build + layer combine
Build context（.dockerignore）
跟 image scanning 跟 deploy speed 對應

Rootless / 安全強化

子議題：

Rootless mode（Docker / Podman 都支援）
User namespace mapping
Seccomp / AppArmor / SELinux profile
對應 07 security container security

排錯快速判讀

Image build cache 不命中

操作原則：COPY 順序錯、.dockerignore 缺、變動的 layer 在前面。

docker build --progress=plain --no-cache -t myapp:debug .   # per-layer output; compare which layer takes the time
docker history myapp:debug                                  # show the size of each layer

Image 過大

操作原則：base image 太重 / 沒 multi-stage / build context 過大。判讀：docker history 看 layer 大小。

Container 起不來

操作原則：docker logs + docker inspect 看 exit code + state。

Network port 不通

操作原則：-p mapping vs EXPOSE 差異、host network vs bridge network、firewall。

Volume 權限問題

操作原則：container UID 跟 host UID 不對齊、rootless mode 特別容易踩。

何時改走其他服務

需求形狀	改走
Production orchestration	Kubernetes
Rootless / 安全強化	Podman
替代 Docker Desktop（cost）	Rancher Desktop / Colima / Lima
純單機 service	systemd
雲端 managed container	ECS / Cloud Run / Container Apps
Build-only（無 daemon）	Buildah / Kaniko / BuildKit standalone

不在本頁內的主題

Dockerfile 完整 reference
Docker Compose v2 進階配置
Container runtime spec（runc / OCI）
各 registry 完整 API

案例回寫

跨 vendor 對照

案例	對 Docker 的對應
5.C3 Orbitera managed K8s	Container image 是平台遷移的可攜介面、orchestrator 換但 image 不換
5.C10 規模對照	小規模直接 Docker / Compose、中大型才走 K8s（Docker 退到 build only）

待補 Docker 案例：Docker Hub rate limit incident、企業 license 遷移到 Podman 案例、image scanning supply chain 案例。

下一步路由

上游概念：5.1 container runtime
平行 vendor：Kubernetes、systemd
下游能力：07 security（image scanning / SBOM）

Opsgenie

Fri, 01 May 2026 00:00:00 +0000

Opsgenie 是 Atlassian 出品的 on-call 平台、承擔三個責任：alert routing + escalation policy、跟 Atlassian 套件（Jira Service Management / Statuspage / Confluence）深度整合、heartbeat monitoring（被動觀察 service 是否還在）。已被併入 Jira Service Management Cloud、原獨立服務逐漸 deprecated。

服務定位

Opsgenie 的核心定位是 Atlassian 生態內的 on-call 元件、跟 PagerDuty 比、它的差異在 跟 Jira Service Management / Confluence / Statuspage 的整合深度、paging 能力本身相近：ticket、runbook、status page、incident 都在同一個身份體系（Atlassian Identity）內、不用跨 SaaS 串 SSO 跟 webhook。Atlassian-heavy enterprise 通常已經買了 JSM / Confluence / Statuspage、再買獨立 PagerDuty 等於多一條供應商線、ROI 不一定划算。

2025 年 Atlassian 公開宣布 Opsgenie 將在 2027 年 4 月 EOL、原 Opsgenie standalone 客戶要遷移到 Jira Service Management Premium / Enterprise 內建的 on-call 能力。這是現有 Opsgenie 客戶在 2025-2027 期間的最大議題、新案不該再選 Opsgenie standalone。

本章目標

配置 Opsgenie team / schedule / escalation
設計 alert routing 與 deduplication
整合 Jira Service Management / Statuspage / Confluence
用 Heartbeat monitoring 守護 cron / scheduled job
評估 Opsgenie → JSM Cloud 遷移路徑

最短判讀路徑

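To judge whether an Opsgenie deployment is healthy, check at least four things:

Who can ack an alert: whether the schedule rotation actually has someone on shift, whether overrides are being abused (a permanent override hides a staffing gap), and whether the final step of the escalation policy falls back to a team instead of looping forever
JSM migration plan: whether the feature gap between standalone Opsgenie and JSM on-call has been inventoried, whether existing integrations (Datadog / Prometheus webhook, Slack routing, custom API) have parity in JSM on-call, and what the conversion path is for API tokens and Terraform config
Atlassian Identity integration: whether the deployment uses Atlassian Access (IdP SSO + SCIM provisioning + audit log) or still relies on Opsgenie's own user store; the latter is a trap for migration, offboarding, and compliance
Slack notification routing: whether routing rules fan out to every team channel (noisy) or are priority-based (P1 → on-call DM + channel, P3 → channel only); Slack is the de facto incident war room, so wrong routing means the on-call responder misses the page

If any of the four is missing, it is an open item under Drills and On-call Readiness.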

最短路徑

# 1. Enable Opsgenie / JSM as an Atlassian admin
# 2. Create the team and schedule
# 3. Configure integrations (Datadog / Prometheus webhook)
# 4. Fire a test alert and verify the escalation

日常操作與決策形狀

Team / schedule / escalation

子議題：

Team 對應 service 或 component
Schedule rotation / override
Escalation policy（多 step / responder）

Alert routing + Atlassian 套件整合

子議題：

Routing rule（priority / source）+ deduplication
Jira Service Management（ITSM workflow）
Statuspage（incident → public update）
Confluence runbook
Slack / Teams 通知

核心取捨表

取捨維度	Opsgenie	PagerDuty	incident.io	Grafana OnCall	JSM Premium on-call
生態錨點	Atlassian（JSM / Confluence / Statuspage）	獨立 SaaS、整合廣	Slack-first、incident workflow	Grafana stack（OSS-friendly）	Atlassian 內建
計費模型	按 user / month	按 user / month + add-on	按 user / month	OSS 免費 / Grafana Cloud 付費	包在 JSM Premium / Enterprise license
身份整合	Atlassian Identity / Access SSO	自家 + SAML / SCIM	Slack identity + SAML	Grafana auth + OAuth	Atlassian Identity（原生）
Runbook / postmortem	Confluence runbook + 基本 postmortem	Runbook Automation + Jeli postmortem	內建 incident timeline + retrospective	Grafana dashboard runbook（弱）	Confluence + JSM workflow
長期路徑	2027/4 EOL、移到 JSM on-call	持續演進、Process Automation 加深	持續演進、IR workflow 強化	持續演進、OSS 路線	跟 JSM 同步演進
適合場景	既有 Opsgenie 客戶 migration 期、無新案	不在 Atlassian 生態、跨工具堆疊	Slack-native IR、incident workflow 重	OSS / 預算敏感、Grafana 已用	Atlassian-heavy enterprise

選 Opsgenie 的核心訴求現在 只有一個：既有客戶在 EOL 前的 migration 緩衝期。新案應該直接走 JSM Premium on-call（已在 Atlassian 生態）、PagerDuty（不在 Atlassian 生態）或 incident.io（Slack-native）。

進階主題（按需閱讀）

Heartbeat monitoring

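Subtopics: active ping monitoring, scheduled heartbeats (guarding cron / batch jobs). A heartbeat backstops passive alerting: the cron job pings when it finishes, and a missing ping raises an alert. A common pitfall is a network path or outbound proxy that drops the ping: the cron job ran fine, Opsgenie never hears from it, and the result is a false positive that pages someone at night.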
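
A cron job can be wrapped so that the heartbeat is pinged only after a successful run. The sketch below assumes the Opsgenie Heartbeat API ping endpoint (`/v2/heartbeats/<name>/ping` with a `GenieKey` authorization header). The heartbeat name, the `OPSGENIE_API_KEY` variable, and both function names are hypothetical, and the retry settings are illustrative values meant to absorb transient proxy or network drops.

```shell
OPSGENIE_API_URL="${OPSGENIE_API_URL:-https://api.opsgenie.com}"
CURL="${CURL:-curl}"   # overridable so the wrapper can be dry-run without network access

# Ping the named heartbeat; retries cover transient proxy / network failures.
ping_heartbeat() {
  "$CURL" --silent --show-error --fail --max-time 10 --retry 3 \
    --request GET \
    --header "Authorization: GenieKey ${OPSGENIE_API_KEY:-}" \
    "${OPSGENIE_API_URL}/v2/heartbeats/$1/ping"
}

# Run the job; ping only on exit 0 so that a failed job lets the heartbeat expire.
run_with_heartbeat() {
  local name="$1"; shift
  if "$@"; then
    ping_heartbeat "$name" ||
      echo "job succeeded but heartbeat ping failed: ${name}" >&2
  else
    local rc=$?
    echo "job failed (exit ${rc}); heartbeat not pinged: ${name}" >&2
    return "$rc"
  fi
}
```

A crontab entry would then call something like `run_with_heartbeat nightly-backup /opt/jobs/backup.sh` (both names are placeholders). Logging the "ping failed" case separately makes it possible to tell a broken job from a blocked network path when the alert fires.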

Atlassian 整合深度

子議題：Issue creation / sync、SLA / OLA tracking、audit log。跟 PagerDuty + Jira webhook 比、Opsgenie 的差異是 同身份體系 + native field mapping — incident 直接綁 JSM ticket、Statuspage component 跟 Opsgenie service 同 schema、Confluence runbook 在 Opsgenie alert 內可直接 inline 預覽。

Team-based routing 跟 service ownership

子議題：team 對應 service / component 的 ownership model、global schedule 跟 team-local schedule 的分層、cross-team escalation（DB team alert escalate 到 platform team）。跟 PagerDuty 比 Opsgenie 的 team 是 first-class concept、跟 JSM project / Confluence space 雙向綁、ownership 邊界比 PagerDuty service 更貼近組織結構。

Atlassian Identity SSO + audit

子議題：Atlassian Access 統一 IdP SSO（Okta / Azure AD / Google Workspace）+ SCIM 自動 provision / deprovision、audit log 集中。沒走 Atlassian Access 的 Opsgenie 是 身份孤島 — 離職員工 JSM 已 deprovision 但 Opsgenie schedule 還在、半夜還會被 page。

Opsgenie → JSM Cloud / JSM Premium on-call 過渡

子議題：原 Opsgenie 用戶遷移時程（Atlassian 官方公告 2027/4 EOL）、功能 parity 盤點（migration 前確認 integration / API / Terraform config 都有對應）、API 兼容（Opsgenie REST API 在 JSM 上是否保留 / 改路徑）。migration 不是換工具、是換產品架構 — schedule / escalation / integration / runbook 的 ID 都會變、要規劃 parallel run 期 而非 cutover。

排錯快速判讀

Alert 不觸發：integration / API key / routing rule
Heartbeat false alarm：cron 跑了但 ping 沒到 / network
Atlassian 整合斷裂：JSM permission / project mapping
通知 missed：mobile app / push / SMS provider
Escalation 跨時區壞掉：schedule timezone 設錯（team timezone vs user timezone）、override 把全 24hr 都蓋掉、final step 沒 fallback team — 跑 game day 驗證實際 paging 路徑、不只看 config
Stale schedule：有人離職但 schedule 沒撤、半夜叫到前同事；走 Atlassian Access SCIM auto-deprovision、或定期 schedule audit
Atlassian Cloud authentication trap：API token 過期 / 換 region / Atlassian Access policy 變更導致 integration 全斷；token 走 secret manager、Atlassian Access policy 變更前先 dry-run integration
JSM migration drift：migration 期間 standalone Opsgenie 跟 JSM on-call 兩邊 schedule / escalation 不同步、alert 兩邊都觸發或都沒觸發；parallel run 期要有 single source of truth 跟 reconciliation script

何時改走其他服務

需求形狀	改走
不在 Atlassian 生態	PagerDuty
OSS 偏好	Grafana OnCall
Slack-native IR	incident.io
Microsoft Teams + IR	FireHydrant
新案、Atlassian-heavy	JSM Premium / Enterprise 內建 on-call（取代 Opsgenie standalone）

不在本頁內的主題

Jira Service Management 完整 ITSM workflow / Atlassian Cloud admin / Statuspage 細節
JSM Premium on-call 完整 feature set（屬 Atlassian product roadmap、跟 Opsgenie EOL 公告同期演進）
Atlassian Access 完整 IdP / SCIM 設定（屬 identity 模組）

案例回寫

Opsgenie 是 Atlassian 自家產品：Atlassian 內部 incident routing / on-call 走 Opsgenie + Jira Service Management、其多租戶事故的協作流程是 Opsgenie 在大型 IR 場景的代表樣本。Atlassian-heavy enterprise 看這個案例的角度不是「PagerDuty 也能做」、而是「同身份體系 + JSM ticket / Confluence runbook / Statuspage 在 14 天事故內怎麼協作」— 這是 Opsgenie 在生態整合上的代表性場景。

案例	對應主題
Atlassian cases	14 天事故的 incident commander 輪值與 paging 節奏

下一步路由

Prometheus

Fri, 01 May 2026 00:00:00 +0000

Prometheus 是 CNCF graduated 的 metrics 系統、承擔三個責任：pull-based metrics scraping（service discovery + scrape）、PromQL 查詢與 recording rules、Alertmanager 告警與路由。設計取捨偏向「短中期 metrics + 簡單部署 + cloud-native 整合」、長期儲存交給 Mimir / Thanos / Cortex。是 Kubernetes 生態 metrics 的事實標準。

對「K8s metrics、service metrics、需要 PromQL 表達能力、自管 metrics 棧」這條路徑、Prometheus 是首選。

本章目標

讀完本章後、你應該能：

用 docker 跑起 Prometheus、配置 scrape target
用 PromQL 查詢 metrics、寫 recording rules / alerting rules
設計 service discovery（K8s / Consul / file_sd）
看懂 cardinality 訊號、避免 label explosion
評估長期儲存（Thanos / Mimir / Cortex）跟 remote write 的選擇

最短路徑：5 分鐘把 Prometheus 跑起來

先建最小 config 檔（Prometheus scrape 自己）：

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

啟動並驗證：

# 1. Start Prometheus
docker run -d --name prom -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus

# 2. Confirm the target is healthy (wait 15 seconds for the first scrape)
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].health'

# 3. Verify with a query
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq '.data.result[].value[1]'

An `up` value of "1" means Prometheus can scrape itself. Open http://localhost:9090 in a browser to run PromQL interactively in the UI. A real production deployment also needs retention settings, alerting rules, and HA.

日常操作與決策形狀

Scrape 配置與 service discovery

子議題：

Static config：手動列 target、適合小規模
File SD：動態檔案、適合外部系統推送
Kubernetes SD：K8s API server 動態發現
Consul SD：跟 Consul service registry 整合
對應配置：scrape_configs 區段
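
As an illustration of the File SD option above, the sketch below writes a target file and a matching `scrape_configs` fragment. The IP addresses, labels, and file names are hypothetical, and the `/etc/prometheus/targets/` path assumes the local `targets/` directory is mounted there inside the container.

```shell
mkdir -p targets

# Target file: an external system can rewrite this at any time;
# Prometheus picks up changes without a restart.
cat > targets/node.json <<'EOF'
[
  {
    "targets": ["10.0.0.5:9100", "10.0.0.6:9100"],
    "labels": { "env": "staging" }
  }
]
EOF

# Scrape config fragment that reads every JSON file in the mounted directory.
cat > prometheus-file-sd.yml <<'EOF'
scrape_configs:
  - job_name: "node"
    file_sd_configs:
      - files:
          - "/etc/prometheus/targets/*.json"
        refresh_interval: 1m
EOF
```

The appeal of this form is that the producer of the target list needs no Prometheus API access; it only has to write a file.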

PromQL 查詢

子議題：

Instant query vs range query
Aggregation：sum / avg / max / min / count + by / without
Rate / increase（counter 處理）
Histogram quantile（histogram_quantile + bucket）
對應指令：HTTP API /api/v1/query
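
The query types above map onto two HTTP API endpoints. The helpers below are a minimal sketch: the function and variable names are hypothetical, and the two expressions use Prometheus's own self-scrape metrics so they should return data on the quick-start instance.

```shell
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Instant query: evaluate the expression at a single point in time (now).
prom_instant() {
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Range query: evaluate the expression at every step over the last N seconds.
prom_range() {
  local expr="$1" window="${2:-3600}" step="${3:-60}"
  local end start
  end="$(date +%s)"
  start="$((end - window))"
  curl -sG "${PROM_URL}/api/v1/query_range" \
    --data-urlencode "query=${expr}" \
    --data-urlencode "start=${start}" \
    --data-urlencode "end=${end}" \
    --data-urlencode "step=${step}"
}

# Counter handling: per-second HTTP request rate served by Prometheus itself.
RATE_EXPR='sum by (job) (rate(prometheus_http_requests_total[5m]))'

# Histogram quantile: keep the "le" label when aggregating the buckets.
P99_EXPR='histogram_quantile(0.99, sum by (le) (rate(prometheus_http_request_duration_seconds_bucket[5m])))'
```

For example, `prom_instant "$RATE_EXPR" | jq '.data.result'` returns one value per job, while `prom_range "$P99_EXPR" 1800 30` returns a series of p99 values for the last 30 minutes. Using `--data-urlencode` avoids hand-escaping braces and brackets in the URL.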

Recording rules / Alerting rules

子議題：

Recording rules：預先計算昂貴 query、降低 dashboard 查詢成本
Alerting rules：定義 alert condition + for duration + labels / annotations
Alertmanager：去重 / 抑制 / 分組 / routing
對應配置：rule_files
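
A rule file combining both rule types might look like the sketch below. The metric `http_requests_total`, its `code` label, the 5% threshold, and the file name `rules.yml` are illustrative assumptions; the file would be referenced from `rule_files` in `prometheus.yml`.

```shell
# Write a small rule file: one recording rule and one alert built on top of it.
cat > rules.yml <<'EOF'
groups:
  - name: example-sli
    interval: 30s
    rules:
      # Recording rule: precompute the per-job 5xx ratio over 5 minutes.
      - record: job:http_requests_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
      # Alerting rule: fire only after the ratio stays above 5% for 10 minutes.
      - alert: HighErrorRatio
        expr: job:http_requests_errors:ratio_rate5m > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "5xx ratio above 5% for job {{ $labels.job }}"
EOF

# Validate the syntax when promtool is installed locally (it ships with Prometheus).
if command -v promtool >/dev/null 2>&1; then
  promtool check rules rules.yml
fi
```

Building the alert on the recorded series keeps the alert expression cheap to evaluate, and the `for:` clause is the first knob to turn when an alert flaps.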

Deep Article

Prometheus 容量規劃與故障模式：單機容量邊界、cardinality 與 retention 的資源模型、常見故障模式與判讀
PromQL 與 Recording Rules 實務：常見 SLI 查詢模式、recording rules 設計慣例、效能陷阱與故障判讀
Remote Write 與長期儲存整合：remote write 配置、Mimir / Thanos / Cortex 三家比較、故障模式與容量規劃

進階主題（按需閱讀）

High availability

子議題：

Prometheus 沒原生 HA — 跑兩個 instance scrape 同 target、靠下游去重
Thanos：sidecar 模式、跨 Prometheus instance 查詢統一
Mimir：fully replicated metric storage（多 Prometheus → Mimir）
對應案例 4.C8 Airbnb K8s scale signals

Cardinality 管理

對應案例 4.C2 Gaming peak cardinality。子議題：

Cardinality = unique label combinations 數量
High-cardinality label（user_id / request_id / trace_id）會炸 Prometheus
偵測：prometheus_tsdb_head_series metric
修法：drop label / aggregation / 改用 traces backend（Honeycomb）
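
To find which metrics drive the series count, two lookups are usually enough. The sketch below assumes a reachable Prometheus at `PROM_URL`; the function and variable names are hypothetical.

```shell
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Ten metric names with the most active series. The selector touches every
# series, so run it sparingly on a large server.
TOP_METRICS_EXPR='topk(10, count by (__name__) ({__name__=~".+"}))'

top_metrics_by_series() {
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=${TOP_METRICS_EXPR}"
}

# Head-block statistics (series count per metric name, value count per label
# name) served directly by the TSDB status endpoint.
tsdb_status() {
  curl -s "${PROM_URL}/api/v1/status/tsdb"
}
```

`tsdb_status | jq '.data.labelValueCountByLabelName'` tends to point straight at the offending label (for example a `user_id` with many distinct values), which is the candidate for dropping or aggregating away.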

Remote write / read

子議題：

Remote write：Prometheus → 長期儲存（Mimir / Cortex / Thanos / Datadog / Grafana Cloud）
Remote read：查詢時拉長期儲存資料
用 receiver / agent 模式（無 local TSDB）
對應配置：remote_write / remote_read

Exporters 生態

子議題：

Node exporter（host metrics）
Blackbox exporter（HTTP / TCP / ICMP probing）
Database exporters（postgres / mysql / redis）
應用層 metrics：用 client library（prometheus_client）原生暴露
對應 ServiceMonitor / PodMonitor（Prometheus Operator）

Prometheus Operator（K8s）

子議題：

CRD：Prometheus / ServiceMonitor / PodMonitor / PrometheusRule / Alertmanager
自動發現 ServiceMonitor 物件、不手動改 scrape config
kube-prometheus-stack Helm chart
對應 4.C6 ADOT EKS 對照

Pull vs Push model

子議題：

Pull model（Prometheus default）：service discovery、health check 自然
Push model（Pushgateway）：適合 short-lived job、不建議常駐 service
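Why Pushgateway is discouraged for long-running services: pushed series persist until they are explicitly deleted (stale data and hard-to-manage cardinality), the `up` metric no longer reflects the health of the job itself, and the gateway becomes a single point of failure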

排錯快速判讀

Scrape failure

操作原則：先看 target 是否健康、再看 network 跟認證。

curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health, lastError}'

Cardinality explosion

操作原則：series 數量持續增長、可能 OOM。

curl -s 'http://localhost:9090/api/v1/query?query=prometheus_tsdb_head_series' | jq '.data.result[].value[1]'

對應 4.C2 Gaming peak 的處理路徑。

Query 過慢

操作原則：query 過大範圍 / aggregation 過多 → Recording rules 預先聚合。

Alert flapping / noise

操作原則：alert 觸發頻繁但無實際問題、調整 for: duration、加 absent() check、用 Alertmanager inhibition。

Memory pressure

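Operating principle: memory is driven mainly by the number of active series (cardinality) in the head block, while retention mostly determines disk usage. Reading: when cardinality is too high, reduce series first (drop labels, aggregate); remote write offloads long-term storage but does not shrink the head block.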

何時改走其他服務

需求形狀	改走
長期 retention（年級）	Thanos / Mimir / Cortex / Grafana Cloud
需要 logs / traces	Grafana Stack (Loki/Tempo) / Elastic
Auto-instrumentation	OpenTelemetry + Prometheus exporter
SaaS turnkey	Datadog
High-cardinality debug	Honeycomb
AWS-native	CloudWatch + Managed Prometheus
Pure push model	StatsD / InfluxDB（不在本模組）

不在本頁內的主題

PromQL 完整 syntax reference（prometheus.io/docs/prometheus/latest/querying/）
Exporter 內部實作
Alertmanager routing tree 細節
Operator CRD spec

案例回寫

直接相關案例

案例	主討論議題
4.C2 Gaming peak cardinality	Cardinality 管理 / freshness 取捨
4.C6 ADOT EKS	AWS Distro + Prometheus 整合
4.C8 Airbnb K8s scale	K8s metrics + Prometheus 規模化

跨 vendor 對照

案例	對 Prometheus 的對應
4.C7 Datadog OTel migration	從 Prometheus + Datadog 雙軌走向 OTel 對齊
4.C9 OTel migration signal drift	（反例）Prometheus 指標跟新管線的語意對不齊
4.C10 規模對照	小型單 instance / 中型 Operator / 大型 + Mimir

下一步路由

上游概念：Metrics Basics
平行 vendor：Grafana Stack（Mimir）、OpenTelemetry
下游能力：4.20 Observability Evidence Package

Valkey

Fri, 01 May 2026 00:00:00 +0000

Valkey 是 2024 年從 Redis 7.2.4 fork 的開源專案、承擔三個責任：維持 Redis API 相容（drop-in 替換）、提供 OSI 認可的開源授權（BSD 3-clause）、由 Linux Foundation 託管避免單一公司控制。設計取捨偏向「相容 Redis 既有 client / 工具 + 開源治理透明 + 多雲廠商共同維護」、不追求功能超越 Redis Inc。

對「既有 Redis 部署、需要 OSI 認可授權、多雲避免 vendor lock-in、合規敏感」這條路徑、Valkey 是 Redis 的替代首選。AWS / Google / Oracle / Ericsson 等共同支援、AWS ElastiCache 已把 Valkey 設為 default engine。

本章目標

讀完本章後、你應該能：

跑起 Valkey、用 redis-cli 驗證 API 相容性
評估從 Redis 遷移到 Valkey 的相容性風險（module / Stack 功能）
看懂 Valkey vs Redis Inc 的版本對應跟功能差距
Judge when a cloud-managed Valkey (ElastiCache) is the right choice
區分 Valkey 跟 Redis 商業版本對你的合規 / 採購 / SLA 影響

最短路徑：5 分鐘把 Valkey 跑起來

# 1. Start Valkey (Redis API compatible, so redis-cli works as well)
docker run -d --name valkey -p 6379:6379 valkey/valkey:8

# 2. Verify reads and writes (valkey-cli and redis-cli accept the same commands)
docker exec valkey valkey-cli SET foo bar   # → OK
docker exec valkey valkey-cli GET foo       # → bar

# 3. Check versions: Valkey reports both the compatible redis_version and its own valkey_version
docker exec valkey valkey-cli INFO server | grep -E "redis_version|valkey_version|server_name"
# redis_version:7.2.4    ← client libraries use this to judge compatibility (forked from Redis 7.2.4)
# server_name:valkey
# valkey_version:8.1.8   ← Valkey's own version

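Step 3 is the key compatibility evidence: an existing Redis client library sees redis_version:7.2.4 and behaves as it would against Redis 7.2.4, with no code changes; valkey_version tracks Valkey's own release line. Verified on the valkey/valkey:8 image, last checked 2026-06-16. For the actual migration path, see the advanced topic 從 Redis 遷移 (migrating from Redis).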

日常操作與決策形狀

CLI 與 client API

子議題：

valkey-cli vs redis-cli：兩個 binary 都可連 Valkey、命令一致
Client library 配置：所有 Redis client 自動相容（無需 Valkey-specific client）
Example command: INFO server reports both redis_version (the compatibility version, 7.2.4) and valkey_version (Valkey's own version)

跟 Redis 的相容邊界

子議題：

Core data types / commands：100% 相容（fork 自 Redis 7.2.4）
Eviction / persistence / cluster：相容
Pub/Sub / Streams：相容
不相容：Redis 7.4+ 引入的功能、Redis Stack 商業 modules

遷移評估

子議題：

AOF / RDB 文件格式相容、可直接拷貝資料目錄
Client library 完全相容、無需改 code
監控工具相容（RedisInsight 雖偏 Redis Inc、但基本命令通用）
需確認 modules 使用狀況（Stack modules 未必有 Valkey fork）

進階主題（按需閱讀）

從 Redis 遷移

子議題：

評估 module 使用：列出當前使用的 Redis modules、確認 Valkey 對應替代
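Check whether you rely on features introduced in Redis 7.4 or later (for example hash field expiration via HEXPIRE); Functions and CLIENT NO-TOUCH predate the fork (Redis 7.0 and 7.2) and are not a migration blocker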
遷移路徑：rolling restart with replica swap / 雙寫 / 直接 cutover
對應雲端 managed：AWS ElastiCache for Valkey 自動遷移工具
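
The replica-swap path listed above can be sketched with standard replication commands. This is a minimal sketch under stated assumptions: the host names are hypothetical, the source is assumed to be Redis 7.2.x or older so that Valkey can read its RDB format, and the function names and the overridable `VALKEY_CLI` variable are illustrative only.

```shell
OLD_HOST="${OLD_HOST:-redis-old.internal}"     # hypothetical existing Redis primary
NEW_HOST="${NEW_HOST:-valkey-new.internal}"    # hypothetical new Valkey node
PORT="${PORT:-6379}"
VALKEY_CLI="${VALKEY_CLI:-valkey-cli}"

# 1. Attach the new Valkey node as a replica of the existing Redis primary.
start_sync() {
  "$VALKEY_CLI" -h "$NEW_HOST" -p "$PORT" REPLICAOF "$OLD_HOST" "$PORT"
}

# 2. Succeeds only once the replication link is up and the initial sync is done.
sync_complete() {
  "$VALKEY_CLI" -h "$NEW_HOST" -p "$PORT" INFO replication | tr -d '\r' |
    awk -F: '
      $1 == "master_link_status"      { link = $2 }
      $1 == "master_sync_in_progress" { syncing = $2 }
      END { exit !(link == "up" && syncing == "0") }'
}

# 3. Promote Valkey to primary; clients are repointed after this step.
promote() {
  "$VALKEY_CLI" -h "$NEW_HOST" -p "$PORT" REPLICAOF NO ONE
}
```

A cutover script would call `start_sync`, poll `sync_complete` until it succeeds, stop writes on the old primary, and only then call `promote`. Writes accepted by the old primary after promotion are lost, so the write freeze is the step that needs the rehearsal.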

授權合規評估

子議題：

為何 Redis 改 RSALv2 / SSPL — OSI 認知（不算 OSI 認可開源）
Valkey BSD 3-clause — 商業使用無限制
對 SaaS 供應商：Redis 限制把 Redis 當成 service 對外提供、Valkey 無此限制
對企業 / 公部門：開源合規政策可能要求 OSI 認可、Valkey 通過、Redis 不過

Module 生態相容性

子議題：

Valkey 計畫自有 modules（valkey-search / valkey-bloom 等）
Redis Stack modules（RedisJSON / RedisSearch）部分有 fork
評估你用的 modules 是否有 Valkey 替代、否則考慮遷 module-free 設計

雲端 managed Valkey

子議題：

AWS ElastiCache for Valkey（成本比 Redis 低 ~20%、AWS 推）
GCP Memorystore（規劃 Valkey 支援）
Azure Cache（規劃中）
managed 邊界跟 ElastiCache for Redis 一致

跟 Redis 8 的功能差距

子議題：

Redis 8 新功能對 Valkey 的影響（功能落後幾個月）
Valkey 自有 roadmap（valkey.io/blog 追蹤）
何時 Redis 新功能值得遷回（罕見、通常 Valkey 跟上）

排錯快速判讀

Client 連不上（API 相容問題）

操作原則：先確認 Valkey 回報的相容版本、再對照 client library 支援到 Redis 哪個版本。

valkey-cli INFO server | grep -E "redis_version|valkey_version"
# redis_version:7.2.4    ← client libraries use this to judge compatibility
# valkey_version:8.1.8

In most cases the connection works unchanged; when it fails, the usual cause is a client library too old to support Redis 7.2.
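
The same check can be turned into a one-line summary for inventory scripts. The sketch below parses `INFO server` output from stdin; the function name is hypothetical, and treating a server that reports no `server_name` field as Redis is an assumption of this sketch.

```shell
# Reads `INFO server` output on stdin and prints
# "<server> <own version> compat=<redis_version>".
describe_server() {
  tr -d '\r' | awk -F: '
    $1 == "redis_version"  { compat = $2 }
    $1 == "valkey_version" { own = $2 }
    $1 == "server_name"    { name = $2 }
    END {
      if (name == "") name = "redis"   # assumption: no server_name means Redis
      if (own == "")  own = compat     # Redis has no separate own-version field
      printf "%s %s compat=%s\n", name, own, compat
    }'
}
```

Usage: `valkey-cli -h some-host INFO server | describe_server`. The `tr -d '\r'` step matters because INFO lines end in CRLF, which otherwise leaks into the parsed values.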

Module 不可用

操作原則：Valkey 對 Redis Stack modules 不一定有 fork、看 Valkey modules 清單。

監控工具相容性

操作原則：RedisInsight 連 Valkey 可能 partial 工作（部分 vendor-specific 命令缺）、用通用工具（valkey-cli、Prometheus + redis_exporter）較穩。

Performance regression（vs Redis）

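Operating principle: Valkey uses Redis 7.2.4 as its baseline, so performance should be close; a gap under 5% is normal. For a clear regression, check whether the Valkey roadmap lists a known issue.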

何時改走其他服務

需求形狀	改走
依賴 Redis Stack 商業 modules	Redis（Redis Inc 商業版）
純 KV cache 不需 data types	Memcached
極高 throughput / 多核	DragonflyDB
AWS managed	AWS ElastiCache（已 default Valkey）
Durable Redis-compatible	AWS MemoryDB
跨雲 fully-portable	Valkey self-host（無 vendor lock-in）

不在本頁內的主題

完整 Valkey command reference（valkey.io/commands）
Linux Foundation governance 細節
各語言 client compatibility matrix
Redis Stack module 對應替代清單

案例回寫

直接相關案例（沿用 Redis 同源案例 + 待補 Valkey-specific case）

Valkey 從 Redis 7.2.4 fork、API 與行為 100% 相容、Redis-on-Valkey 同源案例可直接套用。截至本文時 Valkey-specific production case 仍累積中。

案例	對 Valkey 的對應
2.C3 Shopify serialization	Payload 雙軌遷移策略 client-side 實作、Valkey 跟 Redis 行為一致
2.C5 Shopify write-through	Write-through 在 Valkey 上跟 Redis 同樣 API、無遷移風險
2.C1 Meta cache consistency	invalidation / shard move 一致性議題、Valkey Cluster 沿用 Redis Cluster 模型

待補 Valkey-specific 案例：Linux Foundation Valkey customer adoption stories、AWS ElastiCache for Valkey 客戶遷移個案、re:Invent 2025+ talks、企業 OSI 合規驅動的遷移路徑公開分享。

跨 vendor 對照

案例	對 Valkey 的對應
2.C10 規模對照	Valkey 跟 Redis 規模化路徑一致（fork 同源）、小型 single / 中型 Sentinel / 大型 Cluster
2.C9 Cache Stampede	TTL jitter / singleflight 通用、Valkey 行為跟 Redis 一致
2.C2 Meta mcrouter	Memcached routing 案例、Valkey 對應為 Cluster + client-side routing 或 Envoy Redis proxy
2.C6 Netflix EVCache	EVCache 為 Memcached based、Valkey 對應為 Global Datastore（ElastiCache for Valkey）

下一步路由

上游概念：2.2 Cache Aside
平行 vendor：Redis（fork 源頭）、ElastiCache
下游能力：跟 Redis 完全一致、見 Redis vendor 頁的下游連結

Datadog Security

Mon, 18 May 2026 00:00:00 +0000

Datadog Security 是 Datadog observability platform 上的 security 套件、跟 Datadog logs / metrics / APM / infrastructure 共用同一個 control plane 與 data plane。它的設計起點不是 SIEM、是 把資安訊號當成 observability 的一個維度：alert 不只看 log、可以同時 pivot 到 APM trace、infra metrics 與 host context。這個定位決定了它的優勢（cloud-native + 混合 incident 偵測）與限制（SaaS-only + 計費隨 host 量線性漲、不適合 on-prem-heavy 或預算敏感場景）。

服務定位

Datadog Security 由四個 product 構成、共用 Datadog Agent 與 backend：Cloud SIEM（log-based detection、跟 Splunk Enterprise Security 同類）、Cloud Security Management (CSM) — 涵蓋 CSPM（cloud config posture）與 Cloud Workload Security (CWS)（container / Linux runtime via eBPF）、App and API Protection (AAP、前 ASM) — RASP-style 在 app runtime 收 attack signal、Sensitive Data Scanner — scan log 中的 PII / credential 並 redact。

跟 Splunk 比、Datadog 走 observability-first + security 是 view、Splunk 是 security-first。Splunk 在 enterprise SOC tooling 深度（SOAR playbook、RBA、CIM data model）與跨 on-prem 部署上更成熟、Datadog SaaS-only 但跟 APM / Infra 同 plane、混合 incident（latency 異常是攻擊還是容量？）的判讀路徑更短。跟 Elastic Security 比、Elastic 可跨 on-prem + OSS、Datadog 只給 SaaS；Elastic 要自己整合 observability 訊號、Datadog 出廠就有。跟 Google Security Operations 比、Google 走 fixed-price by data、PB-scale 划算、Datadog 隨 host 線性漲、中等規模友善但破千 host 後 cost 曲線變陡。

關鍵張力：observability 與 security 同 plane 是 Datadog 最大賣點、也是 cost 風險來源。host count 跟 events/month 同時是 observability 跟 security 的計費基準、security 加上去後 bill 不會獨立 — 預算要從 整個 Datadog 帳單 看、不是 security 單列。

本章目標

讀完本頁、讀者能判斷：

Datadog Security 在 SOC stack 中承擔哪一段（log SIEM / CSPM / 容器 runtime / WAF-runtime / log DLP）、哪些要外接（Vault、Okta IdP log、edge WAF）
observability + security 同 plane 的優勢何時成立、何時是 vendor lock-in 風險
Cloud SIEM 計費（events/month + indexed）跟 Standard / Flex Logs retention tier 的成本治理
何時用 Datadog、何時走 Splunk / Elastic / Google Security Ops 的取捨

最短判讀路徑

判斷 Datadog Security 部署是否健康、最少看四件事：

Datadog Agent coverage：agent 是否裝在所有 host / container / serverless wrapper、log forwarder 是否覆蓋 cloud control plane（AWS CloudTrail / GCP Audit Log / Azure Activity Log）、IdP（Okta）audit log 是否進來 — 缺一個就是 detection 盲點
Detection rule ownership：Cloud SIEM rule 是用內建還是 custom、custom rule 是否走 Git 版控（Terraform datadog_security_monitoring_rule）、staging 環境是否 dry-run 24-48hr 才 promote production
CSPM compliance check 治理：CIS / NIST / PCI baseline 開哪些、findings 是否進 ticket workflow、misconfig 修復 SLA 有沒有定義（critical 24hr、high 7d、medium 30d）
Events/month + Indexed Log 預算：Cloud SIEM 按 events/month + indexed event 計費、新加 source 前是否估算 ingestion impact、Standard / Flex Logs retention tier 是否依 log priority 分流

四件事任一缺失、就是 Detection Coverage and Signal Governance 邊界的待補項目。

日常操作與決策形狀

Datadog Agent 採集：log / metrics / trace / security event 走同一個 Agent、用 integration（150+）抓 cloud / SaaS / database / queue。security event 跟 observability event 在後端用 attribute tag（env、service、host、trace_id）關聯、查 incident 時可以從 log alert pivot 到同 trace_id 的 APM trace 看 attack 發生的 application context。

Cloud SIEM detection rule：rule 形式類似 SPL 的 query — source:okta @evt.name:user.authentication.auth_via_mfa @outcome:failure 加 signal aggregation（rolling window count、new value、anomaly detection、impossible travel）。內建 rule 跟 MITRE ATT&CK 對應、跟 Splunk Security Content 同類但 rule 數量較少；custom rule 走 Terraform provider 進版控、不在 UI 直改 production。

CSPM compliance check：scan AWS / GCP / Azure 配置 vs CIS / NIST 800-53 / PCI / SOC 2 baseline、發現 misconfig（public S3 bucket、overly permissive IAM、不安全 SG rule）。跟 Wiz / Prisma Cloud 同類但跟 Datadog Infra 同 dashboard、findings 可以直接看到 affected resource 的 metrics / log。優勢是 資安發現可以直接看業務影響、限制是 graph-based attack path（Wiz 強項）不及專業 CNAPP。

Cloud Workload Security（CWS）：用 Linux eBPF probe 在 kernel 層觀察 container / process behavior、偵測 cryptominer / privilege escalation / 異常 syscall / file integrity 變動。跟 Falco 同類但跟 Datadog Infra 同 plane、CWS alert 可以直接 pivot 到該 container 的 CPU / memory / trace。Linux eBPF 對 kernel 版本敏感、舊 kernel 部份功能不可用、production 前要確認 fleet kernel matrix。

App and API Protection（AAP）：RASP-style protection、Datadog APM library 在 application runtime 收 attack signal（SQLi / XSS / SSRF / 異常 traffic pattern）。跟 Cloudflare WAF / AWS WAF 不同層 — WAF 在 edge / CDN、AAP 在 app runtime 看到的是真實 request handler / DB query。兩者互補不互斥：edge WAF 擋 volumetric attack 跟已知 pattern、AAP 補 app-specific business logic abuse。

Sensitive Data Scanner：scan ingest 進來的 log、用內建或 custom pattern 偵測 PII / credential / payment card / API key、發現後可以 redact、quarantine 或 alert。是 DLP-lite — 比不上 Google DLP / Microsoft Purview 的 sensitive data discovery / classification / lineage 全套、但對 log 中誤洩 secret 的場景夠用、是 detection signal source 也是 DLP 補位。

Notebooks + Workflow Automation: Notebooks are query workbooks for incident investigation that mix log queries, metric charts, APM traces, and annotations; compared with Splunk Search they are closer to a Jupyter notebook for the SOC. Workflow Automation is a lightweight SOAR that connects to PagerDuty / Slack / Jira / Webhook / Vault API, with playbooks built in a visual builder plus Python. Its SOAR depth falls short of Splunk SOAR, but it covers the common response actions (rotate credential / block IP / open ticket) of a mid-sized SOC (10-50 people).

Standard Logs / Flex Logs + retention tier：log 進 Datadog 後分 Indexed（hot、可全文搜尋、貴）、Flex Logs（warm、retention 長、查詢延遲較高、cost 1/3-1/5）、Archive（cold、丟 S3 / GCS、純儲存）三層。Cloud SIEM detection 跑在 indexed log 上、所以 哪些 log 走 indexed 直接決定 detection coverage 跟 bill。tier 1 source（IdP / cloud control plane / payment）必 indexed、tier 2 source（app log）按 sampling、tier 3（debug）走 Flex 或 Archive。

核心取捨表

取捨維度	Datadog Security	Splunk	Elastic Security	Google Security Operations
設計起點	Observability + security 同 plane	Security-first、log 統一查詢平台	Search-first、ELK stack 延伸	Massive scale ingestion、Google threat intel
計費模型	Per-host + per-event（events/month）	Ingestion-based（GB/day、累進）	Resource-based（node / cluster）	Fixed price by data tier（PB-scale 划算）
部署模型	SaaS only	Self-hosted / SaaS	Self-hosted / Cloud / Serverless	SaaS only（Google Cloud）
觀測整合	Native — log + APM + metrics + infra 同 query	需自接（Splunk Observability 另收）	需自接（Elastic Observability 另開）	弱 — 跨產品 federation
雲端 posture (CSPM)	內建（CSM）	第三方 add-on / Cisco 整合	第三方 / Wazuh	第三方 / Mandiant 整合
容器 runtime	內建 CWS（eBPF）	需 Falco / 第三方	Elastic Defend	需 Falco / 第三方
App runtime（RASP）	內建 AAP	需第三方	第三方	第三方
SOAR / Response	Workflow Automation（輕量）	Splunk SOAR（業界先驅）	Cases + Endpoint response	SOAR 內建（前 Siemplify）
適合場景	Cloud-native + 已用 Datadog + 中等規模 SOC	Enterprise + 跨 on-prem、預算允許	OSS-friendly、Elastic stack 已用	超大規模 ingestion、Google 雲

選 Datadog 的核心訴求：已經用 Datadog observability、cloud-native 為主、SOC 規模中等（10-50 人）、需要 observability + security 同 plane 的 incident 判讀路徑。on-prem 為主、預算敏感（host 量 1000+）、需要 enterprise SOAR / RBA 深度、走 Splunk；OSS-friendly、跨 on-prem、走 Elastic。

進階主題

Cross-product correlation（log + APM + metrics 同 trace_id）：Datadog 最特別的偵測形狀 — security alert 不只 log line、而是綁 trace_id 的 integrated incident view。例如 API endpoint 出現 SQLi 嘗試、Cloud SIEM 開 signal、同時 APM 看到該 request 的 DB query 跟 latency、infra 看到該 host 的 CPU。對「query latency 異常是不是被攻擊」這種混合 incident 偵測有結構性優勢、跟 Snowflake 2024 Credential Abuse 的調查路徑直接對應。

CWS Linux eBPF 行為偵測：eBPF probe 在 kernel 層、不需要 kernel module、不影響 process performance（< 1% overhead）。可以偵測的行為包括 file integrity（/etc/passwd 被改）、process tree（bash → curl → /tmp/payload 異常 chain）、network connection（容器對外連 cryptominer pool）、syscall pattern（ptrace 用於 process injection）。跟 Falco 同樣用 eBPF、差別是 Datadog CWS 不需要單獨部署 + 跟 Datadog 其他 signal 同 plane。

Datadog Threat Intelligence：內建 threat feed（malicious IP / domain / file hash）、自動標記 log / network event 命中 IoC。可以加自家 STIX/TAXII feed、不過深度比不上 Mandiant / Recorded Future / 專業 TI platform；中等規模 SOC 夠用、嚴重 APT 對抗場景要外接專業 TI。

跟 Datadog Incident Management 整合：security signal 可以直接開 Datadog Incident（內建 incident channel + timeline + post-mortem template）、跟 PagerDuty 同類但跟 observability 同 plane。對 資安事件升級成全公司 incident 的場景（Change Healthcare 2024 Operations Impact 那種規模）可以共用 incident commander 視角、不用兩套 timeline 拼起來。

排錯與失敗快速判讀

Cloud SIEM 偵測 lag / 沒 alert：events 沒進 indexed log（走了 Flex）、retention tier 設錯 — 檢查 log pipeline rule 是否把 security-critical source 標 indexed
Events/month 暴衝：debug log / verbose log 進 Cloud SIEM index、CWS event 量爆 — log pipeline 前置 filter（Datadog Observability Pipeline 或 Cribl）、CWS rule 收斂 noisy 行為
CSPM findings 100+ 沒人修：findings 沒進 ticket workflow、沒分 priority — 整合 Jira / ServiceNow、severity 對應 SLA、findings 老化超 30 天升級
CWS 在舊 kernel host 沒資料：eBPF feature 對 kernel 版本敏感（< 4.18 部份功能不支援）— 升級 kernel 或標記該 host 為 CWS-incompatible、補位用 host-based agent
AAP false positive 卡 user：RASP 在 app runtime 直接 block、誤殺正常 request — AAP 先走 monitor mode 1-2 週收 baseline、tune 後再轉 protect mode
Sensitive Data Scanner miss PII：custom pattern 沒寫對、log format 嵌套（JSON 內又是 JSON）— 用 sample log 跑 dry-run、scanner 跑在 ingest 階段不是 retroactive
Workflow Automation playbook 黑箱：自動 rotate credential 結果誤殺 prod service account — playbook high-impact action 走 approval gate、default 走 containment 不走 deletion

何時改走其他服務

需求形狀	改走
Enterprise + 跨 on-prem、預算允許	Splunk
OSS-friendly / Elastic stack 已用	Elastic Security
超大規模 ingestion + Google 雲	Google Security Operations
嚴格 DLP / 資料分類	Google DLP / Microsoft Purview
Cloud posture graph / attack path	Wiz / Prisma Cloud / Lacework
Edge WAF / volumetric attack	Cloudflare WAF / AWS WAF
Endpoint EDR	CrowdStrike Falcon / Microsoft Defender for Endpoint
Incident routing	8 事故處理 vendor 清單

不在本頁內的主題

Datadog Agent 完整 configuration reference、custom check 撰寫
Datadog observability（APM / RUM / Synthetics / DBM）細節 — 屬 4 observability 模組
Cloud SIEM rule 完整語法 reference
CWS eBPF probe 撰寫（custom rule via Agent Expression Language）細節
Datadog Incident Management workflow（屬 8 IR 模組）

案例回寫

Datadog Security 在 07 案例庫沒有直接 vendor-level 事件、但 observability + security 同 plane 的偵測形狀讓部份案例的調查路徑變短、值得對照：

案例	跟 Datadog Security 的關係（對照啟示）
Snowflake 2024 Credential Abuse	Query volume + 連接數 + CPU 負載異常是 Datadog 同 plane 的強項、Cloud SIEM rule + DBM metrics 同 query 不用 SIEM + 監控工具拼接
Change Healthcare 2024 Operations Impact	業務中樞事件的影響評估、APM + Infra 可秒級判斷 latency 異常源自資安 vs 容量、Datadog Incident 共用 IC 視角
Mailchimp 2023 Support Tool Abuse	APM span correlation 可看到單一 operator 短時間跨多 tenant access 的 trace pattern、log-only SIEM 看不到 application-level tenant 切換
Uber 2022 MFA Fatigue	Cloud SIEM detection rule 配 Okta MFA log + APM error rate correlation、不靠單一 log source
Detection Coverage and Signal Governance (section)	Standard / Flex Logs + retention tier 是 detection coverage 治理的工具、tier 1 source 必 indexed、tier 2 / 3 走 Flex / Archive

下一步路由

上游：7.13 偵測覆蓋率與訊號治理、Detection Engineering Lifecycle
平行：Splunk、Elastic Security、Google Security Operations
下游：Google DLP / Microsoft Purview（DLP signal 進 Datadog）
跨類：Okta（IdP log source）、HashiCorp Vault（Workflow Automation 拉 API）、Cloudflare WAF / AWS WAF（edge WAF log 進 Cloud SIEM、AAP 在 app 層補位）
跨模組：4 observability（同 Agent / 同 plane）、8 事故處理 vendor 清單（Datadog Incident → IR routing）
官方：Datadog Security Documentation

Fastly Next-Gen WAF

Mon, 18 May 2026 00:00:00 +0000

Fastly Next-Gen WAF（NG-WAF）的核心定位是 用語意分析 + behavioral detection 取代 regex signature 的 web application firewall。它前身是 2020 年被 Fastly 收購的 Signal Sciences、跟 Cloudflare WAF / AWS WAF 的根本差異不在覆蓋面、在 偵測 mindset — 不靠 pattern 比對、靠解析請求語意（這段內容像不像 SQL、像不像 shell command）跟跨請求行為模式（同一 token 在多 endpoint 連續觸發異常）下判斷。產出是 低 false positive 的 inline block 模式可以直接上 production、不需要先養 Log Mode 兩週、不需要 SOC 全職人員跟 rule 戰。

服務定位

Fastly NG-WAF 設計的第一順位是 production 可直接走 Block 模式。Signature WAF 的成本不在 rule 本身、在 false positive — 一條 SQLi pattern 可能誤判合法 SQL-like 字串（搜尋查詢、CSV 上傳）、production 開 Block 立刻炸合法流量、所以多數 signature WAF 跑在 Detect / Log Only 模式、攔不下真正攻擊。Fastly NG-WAF 走 Signal 模型：每個請求被解析後標記若干 Signal（SQLi、XSS、CMDI、Traversal、Anomaly 等）、再依 threshold-based rule（N 個 Signal 在 M 秒內聚集）才動作 — false positive 自然降低、Block 模式可開。

跟 Cloudflare WAF 的對照：Cloudflare 走 signature + managed rule + ML 三層、覆蓋廣但需要 sensitivity tuning；Fastly NG-WAF 預設低 FP 但需要 客戶自己定義業務語意（哪些 path 是 admin、哪些 header 不該出現、哪些 anomaly 對自家業務代表攻擊）— 用 Tag + Match Conditions 表達。跟 AWS WAF 的對照：AWS WAF 跟 ALB / CloudFront / API Gateway 整合深、跨雲弱；Fastly NG-WAF 部署模型多樣（Edge / Agent / Cloud）、跨 AWS / GCP / on-prem / K8s 一致。

關鍵張力：低 FP 的代價是要花時間理解自家業務語意。Signature WAF 是「裝上就有保護」、Fastly NG-WAF 是「裝上有 baseline、業務 anomaly 要自己標」。沒有人定義 Tag + Power Rules、就只用到產品 30% 能力。

本章目標

讀完本頁、讀者能判斷：

Fastly NG-WAF 的 Signal / Tag / Rule / Mode 四個核心 first-class concept 各承擔什麼責任
Edge / Agent + Module / Cloud Proxy 三種部署模型的選擇條件
Account Takeover Protection、Bot Protection、API discovery 三個進階 module 的適用情境
何時用 Fastly NG-WAF、何時走 Cloudflare WAF / AWS WAF 的取捨

最短判讀路徑

判斷 Fastly NG-WAF 配置是否健康、最少看四件事：

部署模型對齊架構：Fastly Edge inline（流量本來就過 Fastly CDN）/ Agent + Module（自管 Nginx / Apache / IIS / Envoy / .NET 加 sigsci-agent local process）/ Cloud Proxy（Fastly 接 origin proxy）三選一或混用、是否覆蓋所有入口（含 admin、internal API、staging）
Signal 與 Tag 設計：預設 Signal（SQLi / XSS / CMDI / Traversal / Backdoor / Anomaly）是否全開、業務語意 Tag（admin-path、internal-only、payment-flow）是否定義並掛上 Match Conditions、Power Rules 是否組合多 Signal / Tag 走 threshold-based action
Rule mode 與 threshold：Site-level 跟 Corp-level Rule 是 Block 還是 Off、threshold（連續幾個 Signal / 多久窗口）是否依 endpoint 業務調整、Template Rule（ATO、Bot）是否啟用
Logging 與 sigsci-agent token 治理：Syslog / HTTP webhook / S3 / SIEM（Splunk / Datadog / Sumo Logic）整合是否 production-grade、sigsci-agent 連回控制面的 token 是否進 HashiCorp Vault、跨環境 token 是否分離

四件事任一缺失、就是 Audit Log 與 Entry Point Protection 邊界的待補項目。

日常操作與決策形狀

部署模型選擇：Fastly Edge inline 是最簡部署、流量已過 Fastly CDN 就 inline 加 NG-WAF、沒有額外 agent 要管；Agent + Module 是 self-managed Nginx / Apache / IIS / Envoy / HAProxy / .NET / Java（Tomcat）等加裝 sigsci-module（process 內 module 攔請求）+ sigsci-agent（本機 daemon、跟 Fastly 控制面 sync rule、collect event）— 適合 origin 不過 Fastly CDN、或 internal API；Cloud Proxy 是 Fastly 提供 reverse proxy 端點、客戶 DNS 指過去、origin 在後面 — 適合不想改 origin、又沒用 Fastly CDN。三種混用常見、大企業 edge 用 Fastly Edge、internal service 用 Agent + Module。

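Signals are known attack indicators: the predefined Signals in Fastly NG-WAF include SQLi / XSS / CMDI (command injection) / Traversal (path traversal) / Backdoor / RCE / Anomaly. A Signal is the result of semantic parsing: the request body is decomposed by a parser (JSON / form / multipart) and each field is checked for whether it resembles a class of attack, rather than matched against a regex. The practical effect is that encoding tricks do not evade detection (base64 / URL encoding / Unicode normalization are all decoded first), in clear contrast to the brittleness of a signature WAF.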

Tag 是客戶自定 Signal：用 Match Conditions（path / method / IP / header / body content / query 參數）定義「什麼樣的請求叫某 tag」、例：Path: /admin/* AND Source IP NOT IN internal_cidr → tag: admin-external-access。Tag 之後可以走 Rule 處理（看到 admin-external-access 就 alert / block）。Tag 是 Fastly NG-WAF 表達 業務語意 的主要工具、不是用來補強 Signal。

Rule 三層：Site-level Rule（單一 site / property）/ Corp-level Rule（整個 organization 共用、用於 corp-wide block list、跨 BU 統一 policy）/ Template Rule（Fastly 提供的預設複合 rule、如 ATO template、Bot template）。Rule 表達式組合 Signal / Tag / Source IP / Path / Method、走 Block / Off。Power Rules 是進階版 — 支援 threshold + 時間窗口 + 多條件 AND/OR、例：「同 IP 在 60 秒內觸發 5 個 SQLi Signal 就 Block 10 分鐘」。
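
The threshold logic in the Power Rules example (5 SQLi Signals from one IP within 60 seconds, then Block for 10 minutes) can be illustrated with a small sliding-window filter. This is not Fastly's rule syntax or implementation; it is a sketch of the decision logic only, and the input format, the `SQLI` signal name, and the function name are assumptions.

```shell
# Input: lines of "<epoch_seconds> <ip> <signal>", sorted by time.
# Output: one "BLOCK <ip> until <epoch>" line each time an IP crosses the threshold.
power_rule() {
  awk -v signal="${1:-SQLI}" -v limit="${2:-5}" -v window="${3:-60}" -v block="${4:-600}" '
    $3 != signal { next }
    {
      t = $1 + 0; ip = $2
      if (t < blocked_until[ip] + 0) next      # already blocked: nothing to decide
      n = ++count[ip]
      hits[ip, n] = t
      # Slide the window: forget hits that are "window" seconds old or older.
      while (hits[ip, first[ip] + 1] <= t - window) first[ip]++
      if (n - first[ip] >= limit) {
        blocked_until[ip] = t + block
        print "BLOCK", ip, "until", blocked_until[ip]
        first[ip] = n                          # reset the window after acting
      }
    }'
}
```

Feeding it a log extract, for example `power_rule SQLI 5 60 600 < signals.log` (a hypothetical file), shows which sources would have been blocked and when. Replaying historical traffic this way is one means of choosing a threshold per endpoint before the rule is enabled.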

Mode 兩種：Block（攔截、回 406 / 自訂 status）/ Off（不動作、純 log）。沒有 Cloudflare 的 Sensitivity 滑桿 — 因為 Signal 本身已是語意判讀結果、不需要敏感度調整、調整在 threshold（多少 Signal 才動作）。

Account Takeover Protection（ATO）：偵測 credential stuffing pattern — 同 IP 多 login fail、跨 IP 同 account 多 login、impossible travel、unusual UA。Fastly NG-WAF 內建 login endpoint detection（自動 / 手動標記 /login、/auth/signin 等）、配合 ATO Template Rule 直接 inline 處理（rate limit、challenge、block）。對應 Identity Boundary 的 ATO 對策、但是在 WAF 層直接攔、不等 IdP 內 ATO 邏輯。

Bot Protection：跟 Cloudflare Bot Management 同類、走 behavioral + browser fingerprint + JS challenge、區分 verified bot / likely bot / human。比 user-agent 過濾穩、headless browser 攔得住。

API discovery：Fastly NG-WAF 自動學習 site 的 API endpoint 與 schema、偵測 schema drift（突然出現的多餘欄位、缺欄位、type mismatch）— 比手動維護 OpenAPI schema 輕量、適合內部 API 多但沒寫完整 OpenAPI 的團隊。

Logging 與 sigsci-agent 治理：所有 event 走 Fastly NG-WAF 控制面 + 客戶端 Syslog / HTTP webhook / S3 / SIEM（Splunk / Datadog / Sumo Logic）。sigsci-agent 連回控制面用 Site API key — 該 key 進 HashiCorp Vault、跨環境 prod / staging 分離、rotation 走標準 secret rotation 流程、不能寫死在 agent 配置檔。

核心取捨表

取捨維度	Fastly Next-Gen WAF	Cloudflare WAF	AWS WAF
偵測模型	Signal / 語意分析 / behavioral（低 FP）	Signature + Managed Rule + ML	Signature + Managed Rule + Lambda 自訂
部署位置	Fastly Edge / Agent + Module / Cloud Proxy	Cloudflare global edge	AWS region 內 ALB / CloudFront / API Gateway 前
Block 模式可行性	高 — 預設低 FP、production 可直開	中 — 需 sensitivity tuning + Log Mode 觀察	中 — managed rule FP 需排除、custom rule 自管
業務語意表達	Tag + Match Conditions + Power Rules（threshold）	Custom Rule（Rules language）+ Bot Score	JSON policy + Lambda 自訂
自管伺服器支援	強 — sigsci-agent + module 覆蓋 Nginx / Apache / IIS	弱 — 必須流量過 Cloudflare edge	弱 — 必須走 AWS service
ATO 內建	是 — Template Rule 直接 inline	Exposed Credentials Check（部分覆蓋）	AWS WAF Fraud Control（加價）
Bot Protection	內建（同層產品）	加價 add-on（Pro / Business / Enterprise）	AWS WAF Bot Control（加價）
API discovery	內建（auto schema learning）	API Shield（Enterprise）	API Gateway request validator
學習曲線	中 — Signal / Tag mindset 要轉、agent 安裝要熟	中 — UI 易上手、Rules language 表達力強	較陡 — JSON policy + 多 AWS service 整合
價格	較高 — Enterprise tier 為主、按請求量計	分層（Free / Pro / Business / Enterprise）	按 rule + request 量、起步低
適合場景	低 FP 要求、API 重、自管伺服器多、跨雲 / on-prem	多雲 / on-prem origin、要整套 edge security suite	AWS-heavy、ALB / CloudFront / API Gateway 是主入口

選 Fastly NG-WAF 的核心訴求：production 直接 Block + API / schema-rich 業務 + 自管伺服器需要 inline agent + 跨雲 / on-prem mix、且有預算支付 Enterprise tier。純 AWS-internal 簡單 web app 用 AWS WAF 整合更直接；要整套 edge security suite 用 Cloudflare。

進階主題

VCL + Edge custom rule：Fastly Edge 部署模式下、NG-WAF 跟 Fastly CDN 的 VCL（Varnish Configuration Language）共存、複雜邏輯可寫 VCL 在 NG-WAF 處理前後攔截 — 例：geo block 在 VCL 做、NG-WAF 處理通過的請求。Compute@Edge（Fastly 的 edge serverless、類 Cloudflare Workers）也可以接 NG-WAF 結果做進一步處理。代價是 VCL / Compute@Edge code 變另一條 ops trace、要有版控與 staging。

ATO 進階 — credential stuffing 場景：login endpoint 接 ATO Template Rule 後、可進一步整合 已洩漏 credential check（類 Have I Been Pwned 整合）、failed login burst → progressive challenge（先 CAPTCHA、再 block）。對應 Identity Boundary 的 IdP ATO 邏輯、Fastly 在 WAF 層攔的好處是 攻擊不會打到 IdP、減少 IdP 端 rate limit 壓力。

Bot Protection 進階：browser fingerprint + behavioral pattern + JS challenge 三層、可掛 bot score threshold 在 Power Rules 內、配合 ATO 做 high-risk login flow（bot score 高 + login endpoint → 強 challenge）。

Agent + Module 在 K8s / VM：K8s 場景 sigsci-agent 走 sidecar 或 DaemonSet、sigsci-module 在 ingress controller（Nginx Ingress Controller 加 sigsci-nginx module）；VM 場景 sigsci-agent 走 systemd service、module 隨 web server 啟動。跨環境 token 隔離（prod / staging / dev）走 Vault dynamic secret 或環境變數注入、不寫死配置檔。

Corp-level Rule 共用：多 BU / 多產品線在同一 Corp（Fastly NG-WAF 的 organization 概念）下、Corp Rule 跨所有 Site 生效 — 適合表達「全公司禁 IP X」「全公司 ATO Template 都開」、避免每個 Site 重複配置。

排錯與失敗快速判讀

Signal 沒觸發、攻擊穿過：Encoding 異常 / parser 沒解析該 content-type — 確認 Content-Type 正確、body 大小沒超過 sigsci-module 限制（預設 100KB）、Signal scope 是否包含該 endpoint
Tag 沒掛上：Match Conditions 寫錯（path 大小寫、trailing slash、wildcard 語意）— 在 Fastly NG-WAF console 用 Rule Evaluation 工具測試 request 是否命中
Block 模式誤殺：Power Rules threshold 太低、單一合法請求觸發多 Signal — 調 threshold 或加 Site Rule exception 排除特定 path / source
sigsci-agent 跟控制面失聯：Site API key 過期 / firewall block out-bound / agent 版本太舊 — agent log 看 connection status、輪換 token 走 Vault、保持 agent 在 supported version range
sigsci-module load 失敗：web server 啟動報 module 載入錯 — 確認 module 版本跟 web server major version 對齊（Nginx 1.20 對 sigsci-nginx 對應版本）
ATO Template 沒攔到：login endpoint detection 沒標到自家 path — 手動在 console 標記 login endpoint 路徑
Logging gap：Syslog / webhook 送失敗、SIEM 沒收到 — 確認 destination accept、TLS cert 沒過期、retry policy
跨環境 token 漏氣：staging token 流到 prod、改 staging 影響 prod rule — Vault 環境分離、token 加標籤、定期 audit token usage

何時改走其他服務

需求形狀	改走
AWS-only + ALB / CloudFront origin	AWS WAF
多雲 + 要整套 edge security suite	Cloudflare WAF
純 internal mTLS / east-west	SPIRE + service mesh
Cert lifecycle	cert-manager / Let’s Encrypt
Bot management 為主要訴求、預算敏感	Cloudflare Bot Management 入門 / AWS WAF Bot Control
DDoS L3/L4 為主	Cloudflare Magic Transit / AWS Shield Advanced

不在本頁內的主題

Signal Sciences 收購前的 product line 演進細節
完整 Signal 清單與每個 Signal 的內部解析邏輯
VCL / Compute@Edge 完整語法 reference
Fastly CDN 本身的 caching / TLS / origin shielding 細節
Enterprise 合約細節、各國資料駐留選項

案例回寫

Fastly NG-WAF 沒有直接 vendor-level 公開事件、案例庫對照引用以「behavioral detection 在 zero-day / supply chain 場景的 inline mitigation 角色」為主：

案例	跟 Fastly NG-WAF 的關係
Log4Shell CVE-2021-44228	對照啟示 — Anomaly Signal 對 JNDI pattern 有 immediate inline detection、不需等 vendor signature 更新；但 exploitation 進後端後仍要靠 supply chain 治理
Citrix Bleed 2023 Session Hijack	對照啟示 — WAF 攔不住 edge appliance zero-day、需要「修補 + session 失效 + 異常清查」三同步、NG-WAF Power Rules 可在窗口期提供臨時 anomaly 偵測
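Fortinet SSL-VPN CVE-2023-27997	Lesson by analogy: before the vendor patch lands, Power Rules + Tags allow a temporary mitigation to be deployed quickly; narrowing the reachable sources is the standard move during the patch window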
7.3 入口治理與伺服器防護	Fastly NG-WAF 是 entry point protection 的工具、低 FP 設計讓 production Block 模式可行、跟 signature WAF 的部署成本曲線根本不同

下一步路由

上游：7.3 入口治理與伺服器防護
平行：Cloudflare WAF、AWS WAF
下游：7.4 資料保護與遮罩治理（WAF block 不夠時、資料層也要遮罩）
跨類：HashiCorp Vault（sigsci-agent Site API key 存放）、Okta（Fastly admin 走 SSO）
跨模組：8 事故處理 vendor 清單（WAF block 事件 routing 進 IR）
官方：Fastly Next-Gen WAF Documentation

Google Secret Manager

Mon, 18 May 2026 00:00:00 +0000

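Google Secret Manager (GSM) is GCP's native service for centralized custody of static secrets, deliberately kept simple: it handles only secret storage, versioning, IAM authorization, and envelope encryption integrated with Cloud KMS. Rotation orchestration, cross-region replication policy, and dynamic credential issuing are not done by GSM itself; they are left to the layer above, assembled from Cloud Functions / Cloud Run. The biggest difference from AWS Secrets Manager is the absence of a built-in rotation Lambda: you write the rotation logic yourself, and GSM only supplies a rotation schedule + Pub/Sub event as the trigger.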

服務定位

GSM 的定位是 GCP-native 的 secret 集中點、解決三件事：把 secret 從 environment variable / Cloud Build substitution / GitHub secret 收回單一受控位置；用 Google Cloud IAM 的 role binding on secret resource 控制誰能讀；走 Workload Identity Federation 讓 GKE / Cloud Run / 外部 workload（GitHub Actions / AWS / Azure）安全取用、避免長期 service account key 散落。

跟 Vault 比、GSM 沒有 dynamic credential engine、沒有 transit / PKI engine、沒有跨雲統一介面 — 但運維成本接近於零、跟 GCP IAM / KMS / Cloud Logging 的整合是 first-class。跟 AWS Secrets Manager 比、GSM 把 rotation orchestration 推給應用層、自由度高但代價是 rotation 流程要自己設計；跟 Azure Key Vault 比、兩者 mindset 相近（單雲、IAM-driven、CMEK 整合）、各自綁雲。

本章目標

讀完本頁、讀者能判斷：

哪些 secret 適合 GSM（GCP-only、static、靠 IAM 授權即可）、哪些該走 Vault 或其他雲端 native
GSM 最低安全設定（CMEK、Data Access audit、Workload Identity Federation、IAM Conditions）
自寫 rotation Cloud Function 時必須處理的 版本切換窗口 跟 fallback 邏輯
何時 GSM 不夠用、要往 Vault / Berglas / Cloud HSM 走

最短判讀路徑

判讀一個 GSM deployment 是否健康、最少看四件事：

誰能讀 secret：secret resource 上的 IAM binding 是不是用最小單位授權（per-secret、不是 project-level roles/secretmanager.secretAccessor）、有沒有上 IAM Conditions 限定時間 / IP / resource tag
Key custody 分離：encryption key 是 Google-managed default key、還是 Cloud KMS CMEK？CMEK 的 key 持有 admin 跟 secret access admin 是不是分人
取用路徑：workload 取 secret 是走 service account key（壞模式、長期憑證散落）還是 Workload Identity Federation（GKE WIF / 外部 OIDC token exchange）
證據是否可回查：Admin Activity audit 預設開、Data Access audit（AccessSecretVersion 誰呼叫）預設關、production 要手動 enable + 接 Cloud Logging sink 推到 SIEM

四件事任一缺失、就是 Audit Log 與 Secret Management 邊界的待補項目。

日常操作與決策形狀

IAM Conditions 收 scope：GSM 的 secretAccessor role 預設綁到 secret resource、但組織常見錯配是給整個 project 上 roles/secretmanager.secretAccessor — 等於整個 project 所有 secret 都能讀。應該用 per-secret binding、再加 IAM Conditions（resource.name.endsWith('prod-db-password')、request.time < timestamp('...')）限縮時間窗口。對應 Okta Cloudflare 2023 supply chain 的對照啟示：第三方 token scope 過寬時、上游事件直接傳導下游、IAM Conditions 是收 scope 的工具。
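A per-secret binding with an IAM Condition can be sketched as the policy fragment below. The helper `secret_binding`, the member, and the project number are hypothetical; the expression follows the CEL style quoted above, and the resource-name layout (`projects/<number>/secrets/<id>/versions/<n>`) is an assumption to verify against the Secret Manager documentation before use.

```python
def secret_binding(member, project_number, secret_id, expires_at):
    """Build one IAM binding that grants read access to a single secret until `expires_at`."""
    prefix = f"projects/{project_number}/secrets/{secret_id}/versions/"
    expression = (
        f'resource.name.startsWith("{prefix}") && '
        f'request.time < timestamp("{expires_at}")'
    )
    return {
        "role": "roles/secretmanager.secretAccessor",
        "members": [member],
        "condition": {
            "title": f"{secret_id}-until-{expires_at}",
            "expression": expression,
        },
    }
```

Attaching this binding to the secret resource itself, rather than to the project, is what keeps the grant from covering every secret in the project.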

Secret Version + Alias 模型：每個 secret 有 monotonic version（v1、v2、v3…）、預設 alias latest 指向最新 enabled version。rotation 不是「更新現有 secret」、是 建立新 version + 把舊 version disable。應用端要支援 讀新 version 失敗時 fallback 舊 version、或在 rotation Cloud Function 內實作 雙軌驗證窗口（新版本上線後一段時間舊版還能讀、確認所有 consumer 切過去再 destroy 舊版）。沒這層設計、一次 rotation 就會打掉沒及時更新的 consumer。
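The consumer-side fallback described above can be sketched with an in-memory stand-in for the secret store. `FakeSecretStore`, `VersionDisabled`, and `connect_with_fallback` are hypothetical names for illustration; a real consumer would call the Secret Manager client instead of this fake.

```python
class VersionDisabled(Exception):
    """Raised by the fake store when a disabled version is accessed."""


class FakeSecretStore:
    """In-memory stand-in for one secret with monotonically numbered versions."""

    def __init__(self):
        self.versions = {}  # version number -> [state, payload]

    def add(self, payload):
        number = len(self.versions) + 1
        self.versions[number] = ["ENABLED", payload]
        return number

    def disable(self, number):
        self.versions[number][0] = "DISABLED"

    def access(self, number):
        state, payload = self.versions[number]
        if state != "ENABLED":
            raise VersionDisabled(number)
        return payload


def connect_with_fallback(store, try_login):
    """Try the newest enabled version first, then fall back to older enabled versions."""
    for number in sorted(store.versions, reverse=True):
        try:
            payload = store.access(number)
        except VersionDisabled:
            continue
        if try_login(payload):
            return number
    raise RuntimeError("no enabled secret version was accepted")
```

While the upstream system still accepts only the old credential, the consumer lands on the previous version; once the old version is disabled, a consumer that never picked up the new one fails, which is the outage the dual-track window is meant to prevent.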

CMEK（Customer-Managed Encryption Key）：GSM 預設用 Google-managed key、production 應該指向 Cloud KMS CMEK。意義是 把 key 持有跟 secret 取用分離 — 即使 secret admin 被攻破、沒有 CMEK 的 decrypt 權限拿不到明文。代價是 CMEK key region 跟 secret replication 要對齊（key 在 us-central1 但 secret 設 automatic replication = key 進不去其他 region、secret access 會失敗）。

Replication 策略：automatic 是 GCP 自動跨 region replicate（高可用、不需要管 region 一致性、但 data residency 受 GCP 全球策略支配）；user-managed 是手動指定 region list（精細控制資料駐留、適合有 GDPR / 跨境合規需求的場景、但 region 加減要自己管 + CMEK key 要在每個指定 region 都存在）。一個常見錯配：選 user-managed 但只設一個 region — 等於沒有跨 region 冗餘、該 region 出事 secret 完全讀不到。
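The two user-managed misconfigurations above (a single replica region, and a replica region without a CMEK key) lend themselves to a pre-deploy check. A minimal sketch, assuming the region lists have already been collected from whatever inventory you use; `replication_findings` is a hypothetical helper.

```python
def replication_findings(replica_regions, cmek_key_regions):
    """Return human-readable findings for a user-managed replication policy."""
    findings = []
    replicas = set(replica_regions)
    if len(replicas) < 2:
        findings.append("single replica region: no cross-region redundancy")
    for region in sorted(replicas - set(cmek_key_regions)):
        findings.append(f"no CMEK key in {region}")
    return findings
```

An empty result means both conditions hold; it does not prove the keys are enabled or that the right identity can use them.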

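Rotation is a self-managed schedule: GSM does not provide rotation logic. It provides a rotation schedule (a next rotation time plus a fixed rotation period, not a cron expression); when the time arrives it publishes a Pub/Sub message to the configured topic, and a Cloud Function / Cloud Run service you write yourself subscribes to that topic and performs the actual rotation (call the upstream system API to issue a new credential, write it as a new secret version, disable the old version). Compare Failure: Credential Rotation Without Scope: the rotation Cloud Function has to handle the scope map (which consumers share the same secret) and the dual-track verification window (confirm every consumer has moved to the new version before disabling the old one) on its own, unlike AWS Secrets Manager, which has a built-in four-step flow (createSecret → setSecret → testSecret → finishSecret).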

Workload Identity Federation 取用：external workload（GitHub Actions / AWS workload / Azure workload / on-prem K8s）用 WIF 拿 GSM secret 是現代預設模式 — workload 用自己的 OIDC token（GitHub OIDC、AWS STS）跟 GCP STS 交換 short-lived access token、再用 token 呼叫 GSM。避開了「長期 service account JSON key 散落 CI / 第三方環境」的問題。GKE 內 workload 走 GKE Workload Identity（pod ServiceAccount → GCP service account 綁定）取 secret、也是同 mindset。

Audit log 治理：GSM 的 audit 分兩層 — Admin Activity（create / delete / IAM 變更、預設開、免費）、Data Access（AccessSecretVersion、預設關、開啟有 log 量跟 BigQuery export cost）。production 不開 Data Access = 事故時 連 secret 被誰取過都查不到、必須在 project IAM Audit Config 開、Cloud Logging sink 推到 SIEM 或 BigQuery（見 7.13 偵測覆蓋率與訊號治理）。

核心取捨表

取捨維度	Google Secret Manager	HashiCorp Vault	AWS Secrets Manager	Azure Key Vault
部署模型	GCP managed	自管 cluster（HA + replication）	AWS managed	Azure managed
跨雲	弱 — 綁 GCP	強 — 同一介面跨 AWS / GCP / Azure / on-prem	弱 — 綁 AWS	弱 — 綁 Azure
Rotation 模型	自寫 Cloud Function（Pub/Sub trigger）	dynamic engine 自動 lease	built-in Lambda 四階段 flow	自寫 Function App（Event Grid trigger）
Dynamic credential	無（靠 IAM impersonation 替代）	DB / cloud / SSH engine 完整	RDS rotation 有、cloud STS 較弱	較弱（依靠 Managed Identity）
Encryption key	Google-managed default / Cloud KMS CMEK	自管 / KMS auto-unseal	AWS KMS CMK	Azure Key Vault key
External workload	Workload Identity Federation（成熟）	AppRole / Kubernetes / OIDC auth	IAM Roles Anywhere（較新）	Managed Identity / Workload Identity
運維成本	低	高 — HA、upgrade、replication 自己顧	低	低
適合場景	GCP-heavy + WIF 已主導 + static secret 為主	跨雲、dynamic credential、內部 PKI	AWS-heavy + 需要 built-in rotation 收斂	Azure-heavy + Managed Identity 已主導
退場成本	低	中 — dynamic engine 接線多	低	低

選 GSM 的核心訴求：workload 主要跑在 GCP（GKE / Cloud Run / Cloud Build）、已經用 Workload Identity Federation 收 service account key、secret 形態以 static 為主（DB password、third-party API key、private key）、rotation 邏輯願意用 Cloud Function 自寫。要跨雲、要 dynamic credential、要內建 rotation flow、需要 transit encryption — 走 Vault。

進階主題

CMEK + Cloud KMS 雙軌權限分離：production 應該至少把 prod secret 的 CMEK key 跟 secret IAM 分到不同 admin group — secret admin 可以建 / 改 secret 但不能 decrypt（沒 KMS cloudkms.cryptoKeyDecrypter），KMS admin 可以管 key 但不能讀 secret 內容。對應 Microsoft Storm-0558 signing key chain 的對照啟示：key 不離 KMS 邊界、跟 HSM-bound 同 mindset；CMEK 是把這個原則內建到 secret 路徑。

Berglas（OSS pattern）：Berglas 是 Google 開源的 GSM client library + CLI、在 Cloud Run / Cloud Function / GKE 啟動時把 sm://... 參考自動 resolve 成實際 secret value、注進環境變數或檔案。比起應用端寫 SDK 取 secret 的好處：secret 不進 container image / build manifest、只有 runtime 取得；缺點是多一層 dependency、且 Berglas 自己有 IAM 需求要管。

GKE Workload Identity 取用：GKE pod 用 ServiceAccount → IAM service account 綁定（透過 iam.gke.io/gcp-service-account annotation）、pod 內呼叫 GSM API 自動帶 GCP service account 身份、metadata server 簽 token。比起把 service account JSON key mount 進 pod、Workload Identity 沒有長期 credential 在 pod 內、credential rotation 由 GCP metadata 自動處理。

Secret rotation Cloud Function 樣板：訂閱 secret 的 rotation topic（Pub/Sub）、message 帶 secret name 跟 trigger reason；Function 內呼叫上游系統 API（DB / SaaS）生新 credential、用 secretmanager.AddSecretVersion 寫新 version、等一段時間（雙軌驗證窗口）後 DisableSecretVersion 舊 version、最後 DestroySecretVersion 完成 rotation。雙軌窗口的長度必須大於 consumer 的最長 cache TTL、否則沒及時 refresh 的 consumer 會在 disable 後失敗。
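The timing constraint above (the dual-track window must exceed the longest consumer cache TTL) can be made explicit when scheduling the rotation steps. This is a sketch of the scheduling arithmetic only; `plan_rotation`, `window_factor`, and `destroy_grace_s` are hypothetical, and the step names merely echo the operations mentioned in the paragraph.

```python
def plan_rotation(now_s, consumer_cache_ttls_s, window_factor=2.0, destroy_grace_s=86400):
    """Return (step, earliest_time_s) pairs for one rotation cycle."""
    if window_factor <= 1.0:
        raise ValueError("window must exceed the longest consumer cache TTL")
    # The window is sized from the slowest consumer, not the average one.
    window_s = max(consumer_cache_ttls_s) * window_factor
    disable_at = now_s + window_s
    return [
        ("add_new_version", now_s),
        ("disable_old_version", disable_at),
        ("destroy_old_version", disable_at + destroy_grace_s),
    ]
```

Keeping a grace period between disable and destroy leaves room to re-enable the old version if a forgotten consumer surfaces.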

Pub/Sub event subscription（new in 2023+）：除了 rotation schedule 自動發 event、GSM 也支援對 secret 任意變更（new version、IAM change）發 Pub/Sub message、可接 SOAR / SIEM 做 secret 異常變更告警（例：非 CI service account 在週末新增 secret version）。

排錯與失敗快速判讀

取 secret 拿到 PERMISSION_DENIED：通常是 IAM binding 在 project 層但 secret 在某 sub-resource、或 IAM Conditions 把當前 caller 排除 — 用 gcloud secrets get-iam-policy 直接看 binding、確認 condition 表達式
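Secret suddenly unreadable after enabling CMEK: the CMEK key location does not line up with the secret's replication regions, or the Secret Manager service agent has lost access to the key. Confirm that a key exists for every replication region and that the Secret Manager service identity (not the calling workload) holds roles/cloudkms.cryptoKeyEncrypterDecrypter on it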
Rotation Cloud Function 跑了但 consumer 認證失敗：雙軌窗口太短或 consumer 沒實作 latest version 失敗 fallback、舊版 disable 後孤兒 consumer 直接斷 — 把雙軌窗口拉到 cache TTL × 2、補 fallback 邏輯
Data Access audit 沒紀錄：預設關、要在 project IAM Audit Config 明確開 secretmanager.googleapis.com 的 DATA_READ — 不開等於沒辦法回答「事故當下誰讀了 secret」
External workload 拿不到 secret：Workload Identity Federation 的 provider attribute mapping 沒對齊（GitHub OIDC token 的 repository claim 沒被 map 到 attribute condition）— 走 gcloud iam workload-identity-pools providers describe 看 mapping、用 token introspection 驗實際 claim
Secret version 累積過多：rotation 只 disable 不 destroy、版本無限長 — 加 lifecycle policy（手動 / Cloud Function 排程）destroy 超過 N 個版本以前的舊版
GKE pod 用 Workload Identity 但拿不到 secret：通常是 GKE 沒 enable Workload Identity feature、或 iam.gke.io/gcp-service-account annotation 拼錯、或 GCP service account 沒給 K8s ServiceAccount iam.workloadIdentityUser — 三層都要對才能通
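For the version-accumulation item in the list above, the pruning decision can be sketched as a pure function. `versions_to_destroy` is a hypothetical helper; it only selects candidates and deliberately never picks a version that is still enabled.

```python
def versions_to_destroy(versions, keep=3):
    """versions: iterable of (number, state). Return DISABLED versions older than the newest `keep`."""
    newest_first = sorted(versions, key=lambda v: v[0], reverse=True)
    return [number for number, state in newest_first[keep:] if state == "DISABLED"]
```

A scheduled job could feed this the listed versions and destroy the returned numbers; the actual list and destroy calls are left out here.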

何時改走其他服務

需求形狀	改走
跨雲 secret 統一介面	HashiCorp Vault
需要 dynamic database / cloud credential	HashiCorp Vault dynamic engine
需要 built-in 四階段 rotation flow	AWS Secrets Manager（若可遷 AWS）
Encryption-as-a-service / 內部 PKI	HashiCorp Vault transit / PKI engine
FIPS 140-2 Level 3 HSM 需求	Cloud HSM（KMS 後端可改 HSM）
公開憑證 PKI	Google Certificate Authority Service / Let’s Encrypt
K8s workload cert 自動化	cert-manager
Secret rotation 證據鏈	7.5 Credential Rotation Scoped Evidence

不在本頁內的主題

GSM 完整 REST API 跟 gcloud secrets 詳盡子命令
Cloud KMS key lifecycle 跟 rotation 細節（看 Google Cloud KMS 章）
Workload Identity Federation 完整設定步驟（attribute mapping、condition expression、provider 設定看 Google Cloud IAM 章）
Berglas 完整 CLI 用法
Cloud Function / Cloud Run 部署細節
GCP Organization Policy 跟 secret 跨 project 共享的進階場景

案例回寫

GSM 在 07 案例庫沒有直接 vendor-level 事件、以下案例採對照引用：

案例	跟 GSM 的關係（對照）
Failure: Credential Rotation Without Scope	GSM rotation 是自寫 Cloud Function、scope map 跟雙軌驗證窗口都要自己設計、不像 AWS Secrets Manager 有 built-in 四階段 flow — 設計時就要把 consumer scope 跟 cache TTL 算進 rotation 排程
Microsoft Storm-0558 Signing Key Chain (red-team)	對照啟示 — GSM CMEK 把 encryption key 放 Cloud KMS、key 不離 KMS 邊界、跟 HSM-bound 同 mindset；secret admin 跟 KMS admin 分人是減 blast radius 的關鍵
Okta Cloudflare 2023 Support Supply Chain (red-team)	對照啟示 — GSM 管的第三方 token（GitHub PAT / Slack token / SaaS API key）scope 過寬時、上游事件直接傳導下游、要走 IAM Conditions 收 caller scope 跟過期時間

下一步路由

上游：7.6 秘密管理與機器憑證治理、7.13 偵測覆蓋率與訊號治理
平行：HashiCorp Vault、AWS Secrets Manager、Azure Key Vault
下游：Google Cloud KMS（GSM CMEK 後端、key custody 分離）
下游：Google Cloud IAM（secret IAM binding、Workload Identity Federation 設定）
跨模組：8 事故處理 vendor 清單（GSM 事件如何 routing 進 IR 流程）
官方：Secret Manager Documentation

Keycloak

Mon, 18 May 2026 00:00:00 +0000

Keycloak 是 open source 自管 Identity Provider、Red Hat 主導維護（商業支援版本為 Red Hat build of Keycloak、前身 Red Hat SSO）。它承擔的責任跟 SaaS IdP 相同 — SSO、MFA、federation、user lifecycle — 但 整個控制面留在組織自己手上：issuer signing key、support tooling、底層 PostgreSQL、HA cluster、CVE patch cadence 全部自管。決定上 Keycloak 不是技術偏好、是組織決定把 SaaS IdP 的「第三方信任成本」換成「自家 SRE 運維成本 + 安全責任」。在 0.22 能力級買 vs 建的光譜上、Keycloak 是認證能力「建」側的 canonical 例子 — 把 feature SaaS（Auth0 / Okta）的第三方信任成本、換成自管控制面的運維成本；什麼訊號該翻到這一側、見 0.22 與外包深度卡。

服務定位

Keycloak 是 自管控制面 的 human identity 與 federation engine、不是 cloud resource permission engine。跟 Okta / Auth0 的本質差異在於信任邊界落點：SaaS IdP 把 signing key、tenant 隔離、support workflow 都託管出去、客戶承擔「供應商出事我也跟著被打」的風險；Keycloak 把整條控制面收回自家機房或自家 VPC、客戶承擔「signing key 過期 / DB 崩 / Java app CVE 沒跟上」的運維風險。

跟 cloud-native SSO（AWS IAM Identity Center）相比、Keycloak 的核心優勢是 不綁雲廠 + 可深度客製 authentication flow + 資料不出境。適合垂直：金融、政府、醫療某些不接受 SaaS IdP 的場景；以及預算敏感、員工數中等、SRE 量能足以接 24/7 on-call 的組織。

本章目標

讀完本頁、讀者能判斷：

Keycloak 該承擔哪一段 identity 控制（SSO / MFA / federation / brokering）、哪一段該交給雲端 IAM 或下游應用
自管 IdP 的最低運維基線（HA、DB DR、cert / signing key rotation、CVE cadence、SIEM 接點）
Realm / Client / User Federation / Identity Broker / Authentication Flow / SPI 各自的決策時機與陷阱
何時用 Keycloak、何時改走 SaaS（Okta / Auth0）或其他 OSS（Authentik / Zitadel）

最短判讀路徑

判斷 Keycloak 部署是否健康、最少看 SaaS IdP 的四件事加上自管特有的四個維度：

誰能做什麼：master realm admin 的人數、是否走 access request workflow、admin console 是否限 IP / device trust、是否強制 phishing-resistant 認證
憑證在哪裡：client secret 是否走 secret management、realm signing key 的 rotation 排程、admin token 的 TTL
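How the entry point is exposed: which realms face the internet, whether the reverse proxy / Ingress applies rate limiting, and whether the admin console (/admin on current Quarkus-based releases, /auth/admin on legacy WildFly-based deployments or when a custom relative path is set) is restricted to the internal network or zero trust access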
證據是否可回查：Event Listener SPI 是否接 SIEM、admin event 跟 login event 是否分流、保留期是否符合稽核
DB 健康：PostgreSQL / MySQL 是否跨 AZ、是否有 PITR、是否做過 restore 演練（不是只有備份成功訊息）
Cert lifecycle：TLS cert 與 realm signing key 各自的 rotation 排程、是否走 Website Certificate Lifecycle 自動化
HA topology：Keycloak cluster 是否多節點、Infinispan cache 是否跨 AZ、單節點重啟是否會踢掉所有 session
Upgrade cadence：Keycloak 每年 major release、CVE patch 是否能在 SLA 內上、是否有 staging 跑 DB migration

八個維度任一缺失、都是自管 IdP 常見事故的入口。

日常操作與決策形狀

Realm 設計：Realm 是 Keycloak 的隔離邊界、每個 realm 有獨立的 user store、client、role、signing key。multi-tenancy 走 realm 是正確選擇、但 master realm 能管所有 realm、master realm 的 admin compromise = 全公司 IdP compromise。把 master realm 鎖在內網、operational realm 才對外、是基本姿勢。

Client 註冊與 secret：每個應用是一個 client、confidential client 有 secret、public client（SPA / mobile）走 PKCE 不存 secret。client secret 不存 source code、走 secret management 注入。client 數量爆炸時要設 naming convention 跟 ownership 標記、不然 stale client 會堆積。

User Federation：把既有 LDAP / Active Directory 接進 Keycloak、user 還是住在原 directory、Keycloak 做 protocol 翻譯（LDAP → OIDC / SAML）。這是 Keycloak 強項之一 — 不需要 user migration、漸進接入。陷阱是 LDAP 連線健康 = IdP 健康、LDAP 慢 = 全公司 login 慢。

Identity Brokering：把外部 IdP（Google、Microsoft、其他 SAML / OIDC provider）federate 進來、Keycloak 當中介。B2B 合作常見模式 — partner 用自己的 IdP、不在我的 user store 開帳號。決策點是 trust mapping：外部 claim 怎麼對應到內部 role、外部 IdP 的 MFA 狀態怎麼信任。

Authentication Flow：Keycloak 把 login / registration / reset password 做成可編輯的 flow DAG、可以插入自訂 step。這是 Keycloak 跟 SaaS IdP 最大差異點之一 — 想要 step-up MFA、device fingerprint、risk-based 判斷都可以自己接。雙面刃是 自訂 flow 容易留漏洞：跳過必要步驟、condition 寫錯讓 MFA 變可選、custom Authenticator SPI 沒處理 race condition。

Theme / 客製 UI：Keycloak 支援 theme override、可以改 login page HTML / CSS / JS。custom JS 在 login page = 自己注入 XSS 風險 — theme 寫進去之後就是 IdP 本體的攻擊面、不是普通網頁。CSP 跟 input sanitization 要當成 IdP 安全規範看待。

Event Listener / Audit：Keycloak 預設只把 event 寫進 DB、UI 上能查、但 不會自動推到外部 SIEM。生產環境必須接 Event Listener SPI（內建 jboss-logging、或自寫 Kafka / file listener）把 admin event 跟 login event 推進 SIEM。沒接的話 audit trail 只在 IdP 本機、IdP 出事就拿不到 evidence。

Exception / break-glass：master realm 留至少 2 個 break-glass admin、credential 離線存、走獨立 MFA（hardware key）。Keycloak cluster 整個失聯時、用 break-glass 直連 DB / 直連單一節點救回。

核心取捨表

取捨維度	Keycloak（自管 OSS）	Okta（SaaS）	Auth0（SaaS / B2C）	Authentik / Zitadel（其他 OSS）
控制面責任	自己跑 issuer / signing / HA / DB / upgrade	Okta 託管	Auth0 託管	自己跑、但社群規模小於 Keycloak
客製化深度	高 — Authenticator SPI / theme / event listener	中 — Workflows / Hooks、限定範圍	高 — Actions（JS hook）	中 — Authentik flow 視覺化、彈性中等
第三方信任成本	低 — 自管、自己承擔運維	高 — 供應商事件直接波及	高 — 同 Okta（同集團）	低 — 自管
運維成本	高 — HA、DR、cert、DB、CVE 都自管	低 — SaaS	低 — SaaS	高 — 同 Keycloak、生態系更小
適合場景	資料主權、預算敏感、需深度客製、有 SRE 量能	多雲、大量 SaaS、lifecycle 自動化	B2C、消費者 identity、developer-centric	規模小、Keycloak 太重、想要更現代 UI
退場成本	中 — 自己掌握資料、protocol 標準可遷移	高 — SAML / SCIM 接線散在數百 app	高 — Actions / Rules 客製綁定深	中 — 同 Keycloak

選 Keycloak 的核心訴求：資料主權 + 預算控制 + 客製 flow 需求、且有 SRE 團隊能 24/7 on-call、能接受自管的運維重量。團隊小於 50 人沒 SRE 量能、應用主要在 SaaS（pre-built integration 用不上 Keycloak 強項）、需要快速接 7000+ SaaS app — 都該回頭看 Okta / Auth0。

進階主題

User Federation 跟 LDAP 整合：企業環境常見「Active Directory 是 user source of truth、Keycloak 做 protocol 層」。注意 LDAP 同步策略（read-only / writable / import）、LDAP 健康直接影響 IdP 可用性、LDAP timeout 要設嚴格避免 login 卡住整個 cluster。

Identity Brokering 跟外部 IdP：把 Google / Microsoft / 其他 SAML IdP federate 進來、外部 user 進來時 Keycloak 自動建 link。trust mapping 是關鍵 — 外部 IdP 宣稱「這個 user 已 MFA」、要不要信？外部 group claim 怎麼對應到內部 role？沒有預設答案、要用 authorization 邊界決定。

Fine-Grained Authorization（UMA / Authorization Services）：Keycloak 內建 policy engine、可以做 resource-level 授權（不只是 role-based）。適合需要中央化 policy decision 的場景、但會把應用的授權邏輯綁進 Keycloak、退場成本變高。多數場景應該把 authorization 留在應用內、Keycloak 只做 authentication + role token 發行。

Custom Authenticator SPI：用 Java 寫自訂 authenticator、插進 Authentication Flow。能做 step-up MFA、device posture、risk score 判斷。陷阱是 SPI 程式碼就是 IdP 本體的一部分、bug = IdP 漏洞、必須走完整 code review + 安全測試流程、不能當普通 feature 開發。

Realm signing key rotation：每個 realm 有自己的 RSA / EC signing key、用來簽 ID token / SAML assertion。rotation 必須跟下游 client 協調（key rollover 期間 client 要能接受新舊 key）、否則 rotation 當天全公司 login 失敗。分域分批是必做的、參考 Failure: Credential Rotation Without Scope。
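The rollover requirement above (clients must accept both old and new keys during the overlap) comes down to looking up the verification key by key id. The sketch below uses HMAC from the standard library purely as a stand-in for RSA / EC signatures, which is an assumption made so the example stays self-contained; `sign` and `verify` are hypothetical helpers, not Keycloak APIs.

```python
import hashlib
import hmac


def sign(kid, key, payload):
    """Issue a toy token tagged with the id of the key that signed it."""
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"kid": kid, "payload": payload, "sig": signature}


def verify(token, published_keys):
    """Accept the token only if its key id is still published and the signature matches."""
    key = published_keys.get(token["kid"])
    if key is None:
        return False
    expected = hmac.new(key, token["payload"], hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, token["sig"])
```

During rollover the published set holds both keys, so tokens signed before and after the switch verify; removing the old key before outstanding tokens expire is what turns a rotation into a login outage.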

排錯與失敗快速判讀

DB 是 SPOF：Keycloak 所有 state 在 PostgreSQL / MySQL、DB 出事 = IdP 停 = 全公司 SSO 停。跨 AZ replication + PITR + 季度 restore 演練、不是 nice-to-have
Cert / signing key 過期：自管 IdP 最常見事故、TLS cert 過期擋對外 endpoint、realm signing key 過期讓所有 token 變無效。走 Certificate Rotation 自動化、過期前 30 天 alert
Cluster split-brain：Infinispan cache 跨節點同步、網路分區時 session 狀態不一致、user 看起來登入但下一個 request 又被踢出。HA topology 設計要考慮 cache mode（distributed vs replicated）、network 健康監控要 alert split-brain
Major upgrade 卡 DB migration：每年 major release 帶 schema migration、staging 沒跑過就 production 升級 = 數小時 downtime。upgrade plan 包含 rollback DB snapshot + staging full rehearsal
Custom theme / Authenticator 留漏洞：theme JS 引入 XSS、custom Authenticator 跳過 MFA、SPI 沒處理 race condition。把 IdP 客製當成 supply chain 看待、走 code review + 安全測試
Event 沒進 SIEM：預設只在 Keycloak DB、IdP 出事就拿不到 evidence。Event Listener SPI 接 Kafka / file / SIEM、admin event 跟 login event 各自接 alert runbook
Master realm admin 過多：日常工作不該用 master realm admin、應該在 operational realm 開有限權限 admin。master realm 是 single point of compromise
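For the certificate and signing-key expiry item in the list above, the 30-day warning can be sketched as a simple inventory check. `expiring` is a hypothetical helper and assumes the expiry dates have already been collected into a mapping.

```python
from datetime import datetime, timedelta, timezone


def expiring(certs, now, warn_days=30):
    """certs: name -> not_after datetime. Return names due within `warn_days`, most urgent first."""
    limit = now + timedelta(days=warn_days)
    due = [(not_after, name) for name, not_after in certs.items() if not_after <= limit]
    return [name for _, name in sorted(due)]
```

Already-expired entries sort first, so the same check doubles as an incident triage list.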

何時改走其他服務

需求形狀	改走
不想自管、要 SaaS IdP	Okta / Auth0
AWS-only 員工 SSO	AWS IAM Identity Center
Cloud resource 權限	AWS IAM / Google IAM / Azure RBAC
小團隊、Keycloak 太重	Authentik / Zitadel / Ory Hydra（更輕量 OSS、生態系較小）
事件偵測（不只 Keycloak event）	04 SIEM / detection 工具（04 observability 跟 07 SIEM 章節）
Secret / signing key 治理	7.6 秘密管理與機器憑證治理

不在本頁內的主題

Keycloak 完整 SAML / OIDC 規格細節、SPI Java API 文件
Red Hat build of Keycloak 商業支援的差異與授權細節
Keycloak Operator（Kubernetes deployment）的逐步部署教學
LDAP / Active Directory 各種 schema 對應規格

案例回寫

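Keycloak has no direct vendor-level public incident (OSS has no counterpart to a vendor incident). Failure modes of a self-managed IdP are grouped into two kinds below: isomorphic failures shared across vendors are mapped to existing cases, and failure scenarios specific to self-managed IdPs get a narrative description.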

對照引用（跨 vendor 同構失效）：

案例	跟 Keycloak 的關係
Azure AD Identity Control Plane 2021	對所有自管 IdP 的啟示：IdP 控制面故障會外溢到下游所有依賴 SSO 的服務、降級策略（local fallback、cached session）必須事先設計
Failure: Credential Rotation Without Scope	Keycloak realm signing key rotation 必須分域分批、一次 rotate 全部 realm = 全公司 login 同時失敗
Uber 2022 MFA Fatigue	純 push MFA 抗不過 fatigue、Keycloak 自訂 Authentication Flow 應該強制高風險操作走 phishing-resistant factor

自管 IdP 特有的失效情境（沒有對應公開 vendor case、來自自管運維常見事故樣態）：

Cert 過期讓全公司 SSO 卡死：Keycloak signing cert / TLS cert / 後端 DB cert 都自己管、任何一張過期 = login 全停。Okta / Auth0 客戶不會遇到這個失效面（vendor 自己 rotate）— 自管組織必須有 cert lifecycle monitoring（Prometheus exporter + alert）+ 季度 rotate rehearsal、不能等 Let’s Encrypt / 公司 PKI 發過期通知才動
Major upgrade 卡 DB migration 變數小時 downtime：Keycloak 每年 major release 帶 schema migration、若 staging 沒 full rehearsal 就 production 升級、可能遇到 migration 比預期慢 5-10 倍、整個維護視窗炸掉。對照 Okta / Auth0：vendor 自己升、客戶感知是 minutes-level、不是 hours-level
Realm scope 在小規模時用法跟大規模衝突：Contrast: Identity Governance by Scale 揭示不同規模治理模式差異 — 小團隊用單一 realm 順、團隊長大後該拆 realm 卻沒拆、最後 admin compromise blast radius 變整個組織。Keycloak 比 SaaS IdP 更容易踩到、因為 realm 拆分要自己決定時機、沒 vendor 推使用者升級 tier
DB 是 SPOF、自管沒做好 = SSO 跟 DB 一起死：Keycloak 用 PostgreSQL / MySQL 存 user / session / signing key、DB 出事 = IdP 停。跨 AZ HA + 跨 region DR + 季度 failover 演練是硬性要求、不是 nice-to-have；SaaS IdP 客戶不會遇到這個層次的失效面

下一步路由

上游：7.2 身分與授權邊界、7.13 偵測覆蓋率與訊號治理
平行：Okta vendor、Auth0 vendor、AWS IAM Identity Center
下游：AWS IAM / Google Cloud IAM / Azure RBAC（Keycloak 之後的 cloud resource permission 層）
跨模組：8 事故處理 vendor 清單（自管 IdP 事件如何 routing 進 IR 流程）
官方：Keycloak Documentation

Gatling

Fri, 15 May 2026 00:00:00 +0000

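Gatling's core responsibility is turning complex user flows into maintainable JVM simulations. It fits teams in the JVM ecosystem, a strongly typed DSL, scenarios over HTTP / WebSocket / JMS / MQTT, and load testing workflows that need to bind injection profile, assertions, report, and CI pipeline together.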

服務定位

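Gatling is a load testing tool that originated in Scala and now leads with its Java DSL. It runs on the JVM, and its async / non-blocking engine (built on Akka / Netty) lets a single injector node drive high RPS. All four tools can generate load; what separates Gatling from k6 / JMeter / Locust is language ecosystem + engine efficiency + scenario expressiveness: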

vs k6 — k6 走 Go runtime + JavaScript scripting、CLI / Grafana 生態友善；Gatling 走 JVM + Java/Scala/Kotlin DSL、適合既有 JVM 工具鏈與強型別 review
vs JMeter — JMeter 走 GUI / XML test plan、適合非工程角色協作；Gatling 走 code-first、適合 PR / build pipeline / refactor 工作流
vs Locust — Locust 走 Python coroutine、scripting 自由度高；Gatling 走 DSL + injection profile、scenario 結構化程度更高
engine efficiency — async / non-blocking model 讓 Gatling 在單機可推到數萬 RPS、JMeter thread-per-user 在同等資源下 throughput 較低

產品線分兩層：Gatling OSS（開源 simulation runner + HTML report）與 Gatling Enterprise（前身 FrontLine、加上 distributed injector、cluster orchestration、live monitoring、long-term result storage、role-based access）。OSS 適合單機 baseline / CI smoke、Enterprise 適合 cross-region distributed / 大型活動前壓測 / 結果長期治理。

最短判讀路徑

判斷 Gatling 在壓測流程裡是否健康、最少看四件事：

Scala DSL vs Java DSL 版本：Gatling 3.7+（2022）正式加 Java DSL、2024 後新專案多走 Java DSL；舊 Scala simulation 仍可跑、但團隊要決定 維持 Scala 還是漸進改寫 Java、避免雙語言治理
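Injection profile design: whether the simulation clearly separates the open model (atOnceUsers / rampUsers / constantUsersPerSec / rampUsersPerSec, which control the arrival rate and mimic real arrivals) from the closed model (constantConcurrentUsers / rampConcurrentUsers, which hold a fixed pool of concurrent users), matching the traffic shape from 9.2 Workload Modeling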
Assertion gate: whether the simulation attaches a hard gate such as setUp(...).assertions(global.responseTime.percentile3.lt(500)) so that CI fails the build when the run misses it; a simulation without assertions is only a load test, not a release gate
Enterprise vs OSS 邊界：是否清楚知道哪些能力只 Enterprise 有（distributed injector / multi-region / long-term result storage / live dashboard）、避免用 OSS 拼湊 Enterprise 級需求
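When reviewing an open-model injection profile from the checklist above, it helps to know how many users a profile actually injects and what concurrency that arrival rate implies. The sketch below is plain arithmetic, not Gatling code: the step tuples reuse the names constantUsersPerSec and rampUsersPerSec only as labels, and `expected_concurrency` applies Little's law under the assumption of a stable mean session duration.

```python
def users_injected(steps):
    """steps: ("constantUsersPerSec", rate, seconds) or ("rampUsersPerSec", start, end, seconds)."""
    total = 0.0
    for step in steps:
        kind = step[0]
        if kind == "constantUsersPerSec":
            _, rate, seconds = step
            total += rate * seconds
        elif kind == "rampUsersPerSec":
            _, start_rate, end_rate, seconds = step
            # A linear ramp injects the average of both rates over its duration.
            total += (start_rate + end_rate) / 2 * seconds
        else:
            raise ValueError(f"unsupported step: {kind}")
    return total


def expected_concurrency(arrival_rate_per_s, mean_session_s):
    """Little's law: concurrent users = arrival rate x mean time each user stays in the system."""
    return arrival_rate_per_s * mean_session_s
```

If the target slows down under an open model, sessions last longer and concurrency grows, which is the behaviour a fixed-pool closed model hides.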

定位

Gatling 適合 code-first 且 JVM 能力強的團隊。當 workload model 需要多步驟 flow、資料 feeder、條件分支、session state 與明確 injection profile，Gatling 能用 simulation 把這些行為寫成工程 artifact。

這個定位讓 Gatling 接到 9.2 Workload Modeling 與 9.4 Saturation Discovery。它的價值在於把 traffic shape 寫進 injection profile，讓 ramp-up、constant users、stress peak 與 soak test 都能被版本化。

適用場景

JVM 團隊適合用 Gatling 承接壓測。Java、Scala 或 Kotlin 團隊能把 simulation 當成一般程式碼 review，並用既有 build、dependency、CI 與 artifact 流程維護。

複雜 scenario 適合用 Gatling 表達。登入、搜尋、加入購物車、checkout、payment mock、order query 這類 multi-step flow 可以用 session 與 feeder 管理資料。

高品質 report 適合 release review。Gatling 的 report 能幫 reviewer 看到 response time distribution、request group、error 與 injection profile，適合在 release gate 中保留可讀證據。

選型判準

判準	Gatling 的價值	需要補的能力
JVM DSL	simulation 可 code review	Scala / Java / Kotlin 維護能力
Injection profile	負載階段可精準表達	production traffic shape 校正
Session / feeder	多步驟資料與狀態容易管理	測試資料治理與敏感資料遮罩
Report	release review 可讀性高	長期趨勢儲存與 cross-run comparison

JVM DSL 價值來自可維護性。壓測 scenario 如果需要被長期 review、重構、抽 helper 或接 build pipeline，Gatling 的 code-first workflow 會比 GUI test plan 更適合工程團隊。

Injection profile 價值來自負載形狀精準。團隊可以把 steady load、spike、ramp、open model 與 closed model 放到 simulation 中，讓 9.4 Saturation Discovery 的 knee point 判讀更可重現。

跟其他工具的取捨

Gatling 和 k6 的主要差異是語言與生態。Gatling 適合 JVM 團隊與強型別 simulation；k6 適合 JavaScript-style scripting、CLI workflow 與 Grafana 生態。

Gatling 和 JMeter 的主要差異是維護模式。Gatling 偏 code review、build pipeline 與 simulation abstraction；JMeter 偏 GUI、plugin 與跨角色測試資產。

Gatling 和 Locust 的主要差異是自訂語言。Locust 適合 Python 團隊與任意 Python client；Gatling 適合 JVM 團隊與 report / injection profile 的結構化壓測。

Gatling 和 Vegeta 的主要差異是 scenario 深度。Vegeta 適合快速 HTTP pressure test；Gatling 適合需要 session、feeder、assertion 與多 request group 的長期測試。

操作成本

Gatling 的主要成本是 JVM 團隊能力。非 JVM 團隊要承擔語言、build tool、dependency 與 simulation pattern 的學習成本；這個成本只有在 scenario 複雜度夠高時才划算。

測試資料成本來自 feeder 與 session。多步驟 flow 需要 account、cart、order、token、region 與 tenant 資料，資料過期或分布偏差會讓壓測結果失真。

Enterprise / distributed 成本要提前評估。單機 Gatling 適合中小型 baseline；跨 region、大型活動前驗證或長時間 soak test 需要 runner topology、結果集中與雲端成本治理。

Evidence Package

Gatling 結果應回寫到 evidence package。最小欄位包括 simulation version、injection profile、feeder source、target environment、assertion、response time distribution、error rate、throughput、target service saturation metric、known gap 與 owner。

欄位	Gatling 證據來源
Source	simulation code、HTML report、dashboard link
Time range	test start / end
Query link	APM / metrics / logs 查詢連結
Data quality	feeder freshness、scenario coverage
Confidence	production similarity、runner capacity
Known gap	未覆蓋 flow、資料偏差、下游 mock 限制
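The evidence package fields above can be captured as a small record so that a missing field is caught before the review. This is a sketch with hypothetical field names derived from the list and table; it is not a Gatling feature.

```python
from dataclasses import asdict, dataclass


@dataclass
class EvidencePackage:
    """Minimal record of one load test run; empty strings mean "not yet filled in"."""
    simulation_version: str = ""
    injection_profile: str = ""
    feeder_source: str = ""
    target_environment: str = ""
    assertion: str = ""
    time_range: str = ""
    query_link: str = ""
    known_gap: str = ""
    owner: str = ""


def missing_fields(package):
    """Return the names of fields that are still blank, in declaration order."""
    return [name for name, value in asdict(package).items() if not value.strip()]
```

A CI step could refuse to archive the report while `missing_fields` is non-empty.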

Evidence package 的核心用途是讓 simulation 可回放。Reviewer 要能從 report 回到 injection profile、scenario code、feeder 與目標環境，才有辦法判斷一次壓測是容量訊號還是測試設計偏差。

核心取捨表

取捨維度	Gatling	k6	JMeter	Locust
語言 / DSL	Java / Kotlin / Scala DSL（JVM）	JavaScript（Go runtime）	GUI / XML test plan（JVM）	Python（coroutine / gevent）
Engine model	Async / non-blocking（Akka + Netty）	Async（Go goroutine）	Thread-per-user（同步）	Async coroutine
單機 RPS 上限	高（數萬 RPS）	高（數萬 RPS）	中（thread overhead）	中（GIL + coroutine）
Scenario 表達力	強（session / feeder / 條件分支內建）	中（JS function 自寫）	中（GUI 拖拉 + listener）	中（Python class + task）
Report quality	高（HTML report 內建、distribution / group 詳細）	中（CLI 摘要 + Grafana 串接）	中（GUI listener、不適合 headless）	中（web UI 即時、無 historical）
CI integration	強（Maven / Gradle / sbt + assertion gate）	強（CLI + JSON output）	中（CLI mode 可、但 GUI-first）	強（CLI + Python ecosystem）
Distributed	OSS 自建 / Enterprise 內建	k6 Cloud / OSS 自建	自建（master-slave）	自建（master-worker）
商業版本	Gatling Enterprise（前 FrontLine）	Grafana Cloud k6	無（純 OSS）	無（純 OSS）
適合場景	JVM 團隊、複雜 scenario、release gate、高 RPS efficiency	全棧團隊、CLI workflow、Grafana 生態	跨角色團隊、legacy test plan、protocol 多樣	Python 團隊、自訂 client、輕量 setup

選 Gatling 的核心訴求：JVM 團隊 + 複雜 scenario（session / feeder / 多 group）+ 高 RPS 單機效率 + HTML report 作為 release gate 證據。Java DSL 在 2024 後降低了 Scala 學習門檻、讓 Java/Kotlin 後端團隊不必再為了壓測導入 Scala。

進階主題

Gatling Enterprise（前 FrontLine）：商業版加 distributed injector cluster（跨 region / 跨 cloud 推大型負載）、live monitoring dashboard（real-time RPS / response time 趨勢、不用等 simulation 結束看 HTML）、long-term result storage（cross-run comparison、retention policy）、role-based access（QA / dev / SRE 不同權限）。對只跑單機 baseline 的團隊 OSS 已夠；要跑黑五 / 春晚級活動前壓測或多 region 同時施壓、需要 Enterprise 或自建 distributed topology。

Java DSL 取代 Scala 成主流（2022-2024）：Gatling 3.7（2022）正式釋出 Java DSL、3.9+ 文件 Java / Kotlin / Scala 三語並列、2024 後新教學多以 Java 為主。對 Java 後端團隊降低 onboarding 成本、但要注意 Gatling 2.x → 3.x 的 Scala syntax 不向後相容（scenario builder、http config、feed 用法都改寫）— 舊 simulation 升級時等於改寫一遍。

Distributed execution（OSS）：OSS 沒有內建 cluster orchestration、要靠 multiple injector + result aggregation：每台 injector 跑同一份 simulation（按 user count 切割）、結束後把 simulation.log 蒐集到一處用 gatling.sh 重跑 report stage。常見補位是用 Kubernetes Job + 共享 PVC、或直接走 Gatling Enterprise。
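Splitting one simulation across injectors by user count, as described above, needs the shares to add back up to the intended total. A minimal sketch; `split_users` is a hypothetical helper, and how each share is passed to an injector is left to your launcher.

```python
def split_users(total_users, injectors):
    """Divide `total_users` across `injectors` so the shares differ by at most one."""
    if injectors < 1:
        raise ValueError("need at least one injector")
    base, extra = divmod(total_users, injectors)
    return [base + (1 if index < extra else 0) for index in range(injectors)]
```

Rounding each share independently would silently change the total load, which then makes runs with different injector counts incomparable.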

HTML report 與 release gate：simulation 跑完自動產 HTML report、含 response time percentile distribution（mean / p50 / p95 / p99 / max）、per-request-group breakdown、active users over time、error log。release gate 的標準做法是：CI job 跑 simulation → assertion gate fail 直接 break build → HTML report 存成 build artifact 供 reviewer 翻查、配合 Evidence Package 治理。

CI integration 模式：Jenkins / GitLab CI / GitHub Actions 都靠 mvn gatling:test / gradle gatlingRun / sbt gatling:test 入口、CI 設定 baseline simulation（每 PR 跑、catch regression）+ release simulation（release branch / nightly 跑、長時間 soak）。staging environment 跑壓測時要隔離噪音來源（其他 QA 流量 / cron job）、否則 RPS 數字會被污染。

排錯與失敗快速判讀

Scala learning curve 拖累進度：團隊沒人會 Scala、被 implicit / case class / pattern match 卡住 — 改用 Java DSL（3.7+）或 Kotlin DSL、保留 Gatling 表達力但去除 Scala 學習成本
Gatling 2.x → 3.x 升級 simulation 全紅：bootstrap import path / scenario builder API / feed 語法都變了 — 走 新專案直接 3.x、舊專案維持 2.x 雙軌、或安排專門 sprint 改寫、避免邊跑邊踩雷
JVM heap OOM / GC pause 拖慢 RPS：高 RPS 下 default heap 不夠、Young Gen GC 頻繁 — 調 -Xmx4G -Xms4G、用 G1GC / ZGC、監控 injector 的 GC log 跟 CPU、不是只看 target service
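Wrong injection profile leads to a misread saturation point: the simulation holds a fixed pool with constantConcurrentUsers(1000) (closed model) while real traffic is open arrival, so the knee point lands in the wrong place. Check the production traffic shape first: use constantUsersPerSec / rampUsersPerSec for an open model, and constantConcurrentUsers / rampConcurrentUsers only when the real system caps concurrency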
Single injector node 撞 client-side bottleneck：injector CPU / network / file descriptor / source port 用滿、看起來 target saturate 其實是 injector saturate — 監控 injector resource、scale out 成 distributed 或走 Enterprise
Feeder data 過期 / 分布偏差：用同一份 users.csv 反覆壓、cache hit rate 失真、production 看不到的 cache miss 路徑沒被測 — feeder 走 random / shuffle、定期 regenerate、覆蓋 long-tail key
HTML report 看起來綠但 production 出事：assertion gate 只設 average response time、p99 / error rate 沒設、release 後尖峰時段才爆 — assertion 要明確設 p95 / p99 + error rate threshold、不只看 mean
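The last item in the list above (a gate on the mean hides tail latency) can be shown with numbers. The sketch below computes nearest-rank percentiles and applies a p95 / p99 / error-rate gate; `percentile` and `gate` are hypothetical helpers that mirror what an assertion gate checks, not Gatling's own implementation.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def gate(latencies_ms, error_count, p95_max_ms, p99_max_ms, error_rate_max):
    """Return the list of violated thresholds; an empty list means the gate passes."""
    failures = []
    if percentile(latencies_ms, 95) > p95_max_ms:
        failures.append("p95")
    if percentile(latencies_ms, 99) > p99_max_ms:
        failures.append("p99")
    if error_count / len(latencies_ms) > error_rate_max:
        failures.append("error_rate")
    return failures
```

With 98 requests at 100 ms and 2 at 2000 ms the mean is 138 ms and would pass a 500 ms mean-only gate, while the p99 check fails.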

案例回寫

Gatling 適合回寫多步驟與多負載模型案例。它可接 9.C28 FanDuel 雙峰 workload 的直播與投注雙模型、9.C16 SeatGeek waiting room 的 token / admission flow、9.C17 BookMyShow ticketing 的售票流程壓力、9.C4 DraftKings Aurora 金融帳本的「比賽期讀爆量 + payout 時寫爆量」雙峰錯位，以及 9.C2 GR8 Tech 的「投注 / 結算 / 賠率更新」三類請求 group 的 injection profile。

這些案例的重點是 scenario 與 injection profile。Gatling 頁引用案例時，要把業務流程拆成 request group、session state、feeder、assertion 與 stop condition — 例如 DraftKings 雙峰錯位要寫成兩個 scenario 平行注入、各自有獨立 assertion budget。

下一步路由

MongoDB

Wed, 13 May 2026 00:00:00 +0000

MongoDB 是 document database 的事實標準。schema flexibility、aggregation pipeline、跨雲 managed（Atlas）讓它成為許多 startup 的 default 選擇。Microsoft 365、Disney+ 早期、Uber 等大規模平台都從 MongoDB 起家，後來依 workload 壓力把部分路徑遷移到 KV / 雲商專屬服務（Cosmos DB、DynamoDB）。

教學路線：Document shape 與 schema governance

MongoDB 服務頁的教學目標是把 document model、schema flexibility、index、aggregation pipeline 與 sharding 放回資料形狀治理。讀者讀完後要能判斷資料是否適合 aggregate root，並知道 schema governance 如何影響長期維護成本。

學習段	核心問題	對應段落
Document shape	哪些資料適合 aggregate root 與 nested document	定位、適用場景
Schema governance	schema flexibility 如何搭配 validation、版本與 migration	容量規劃要點、預計實作話題
Query / index	index、aggregation pipeline、ad-hoc query 如何影響成本	容量特性、常見陷阱
Sharding	shard key、chunk、balancer 如何把資料形狀變容量問題	容量規劃要點、Database Sharding
替代路由	何時轉 PostgreSQL、DynamoDB、Cosmos DB 或 search	不適用場景、跟其他 vendor 的取捨

定位：JSON document + 跨雲彈性

MongoDB 是以 document model 為主體的 DB。PostgreSQL JSONB 適合「SQL 為主、少量半結構化欄位」；MongoDB 則把 BSON document、aggregation pipeline、database sharding 與 schema governance 放在核心設計裡。近年版本加入 time series、change streams、queryable encryption、CSFLE 等能力。

選 MongoDB 的核心訴求：document model 是主要 use case、需要跨雲 managed（Atlas）、想避免 vendor lock-in（也可自管）。

容量特性

單一 instance 吞吐：

一般 m5.4xlarge：5K-15K WPS（依 doc size、index）
高階 instance + tuning：30K-50K WPS
超過此級別 → sharding

Sharding：

MongoDB 原生支援 sharded cluster
mongos router + config servers + shard
MongoDB sharding 要主動設計 shard key，並和 Hot Partition 風險一起看

Replication：

Replica set（primary + secondary、async）
跨 region 通常 async
自動 failover < 30 秒（mongod 內建）

Storage：

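A single collection has no official size limit, but changing the shard key used to be major surgery (4.4 added refineCollectionShardKey to extend an existing key; 5.0+ supports reshardCollection)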

適用場景

1. Document model 主要 workload：

schema 變化頻繁的早期產品
nested document 自然表達領域模型（訂單含多個 item、用戶含多個 preference）
對應案例：9.C30 Microsoft 365 — 從 MongoDB 遷移到 Cosmos DB MongoDB API、保留 document model

2. Aggregation pipeline 重 workload：

複雜的 $group / $match / $project chain
報表、analytics、ETL prep
比 RDBMS 寫複雜 query 更直觀（對某些 team）

3. 跨雲 managed（Atlas）：

MongoDB Atlas 跨 AWS / GCP / Azure
跟 DynamoDB（AWS only）、Cosmos DB（Azure only）、Spanner（GCP only）相反
適合多雲策略、避免單一 vendor lock-in

4. Time series workload (5.0+):

time series collection 專屬優化
不過 InfluxDB / TimescaleDB 仍是更專業選擇

5. 已有 MongoDB 生態 + 想轉移操作責任：

Atlas 提供 backup、failover、monitoring、auto-scale
想把 MongoDB DBA / SRE 操作責任交給 Atlas

不適用場景

1. 強 ACID multi-document transaction：

MongoDB Transaction 支援多 document、但跨 shard 有性能影響
高頻金融交易仍建議 SQL 系統
替代：PostgreSQL、Aurora、Spanner

2. 複雜 JOIN：

MongoDB $lookup 適合少量相鄰資料，JOIN-heavy workload 應回 SQL 系統
schema design 階段要把常用讀取路徑 denormalize 成 document shape
替代：SQL 系統做 JOIN-heavy workload

3. 純 KV + sub-ms latency：

MongoDB document model 比 KV 多一層 BSON parsing
替代：Redis、DynamoDB、Bigtable

4. 大規模 OLAP：

aggregation 對中等資料量還行、TB 級不適合
替代：ClickHouse、BigQuery、Spark on Delta Lake

5. 嚴格資料模型 + schema enforcement：

MongoDB schema flexibility 可能導致 production data inconsistency
替代：SQL DB（schema 強制）+ JSONB column 處理半結構化

跟其他 vendor 的取捨

vs Cosmos DB MongoDB API：

MongoDB Atlas：跨雲、原生 MongoDB 行為
Cosmos DB MongoDB API：Azure-only、global distribution + 5 consistency levels
選 MongoDB Atlas：跨雲、需要原生 MongoDB features
選 Cosmos DB：Azure 生態、需要更好 global distribution
對應案例：9.C30 Microsoft 365 — 從 MongoDB 遷到 Cosmos DB MongoDB API，主要保留 document model

vs DynamoDB：

MongoDB：document model、aggregation 強、跨雲
DynamoDB: KV / single-table design, AWS integration, 99.999% availability SLA with global tables (99.99% for single-Region tables)
選 MongoDB：document 為主、跨雲
選 DynamoDB：KV 為主、AWS 生態
詳見 DynamoDB vendor page 對比段

vs PostgreSQL JSONB：

MongoDB：document 為主、schema-less
PostgreSQL：SQL 為主、JSONB 補充
選 MongoDB：document 占主要 schema
選 PostgreSQL JSONB：主要結構化、少量半結構化欄位

vs Couchbase / CouchDB / Firestore:

Couchbase：MongoDB 替代、有 N1QL（SQL-like）
CouchDB：偏小規模、master-master replication
Firestore：GCP-only、realtime updates
MongoDB 在這群裡是生態最廣的

vs Elasticsearch 作為 search 替代：

兩者分屬不同類別：MongoDB 是 OLTP / document、Elasticsearch 是 search + analytics
通常搭配用：MongoDB 主、Elasticsearch 處理 full-text search

容量規劃要點

1. Shard key 設計是命脈：

跟 DynamoDB partition key 同樣關鍵
不均勻 → hot shard、實際容量達不到名義
Resharding is possible on 5.0+ (reshardCollection), but it is still major surgery
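
To illustrate why an uneven shard key caps real capacity, here is a minimal simulation sketch. It is not MongoDB code: the chunk boundaries, the shard count, and the md5-based hash are assumptions chosen for illustration. It compares where new writes land when the key grows monotonically under ranged placement versus hashed placement.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # assumed cluster size for the illustration


def ranged_shard(key, boundaries):
    """Ranged placement: shard i owns keys below boundaries[i]; the last shard owns the rest."""
    for shard, upper in enumerate(boundaries):
        if key < upper:
            return shard
    return len(boundaries)


def hashed_shard(key, num_shards=NUM_SHARDS):
    """Hashed placement: the key is hashed first, so neighbouring keys scatter."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def max_share(assignments):
    """Fraction of writes received by the busiest shard."""
    counts = Counter(assignments)
    return max(counts.values()) / len(assignments)


# Boundaries were cut from historic keys 0..2999; new keys keep increasing.
boundaries = [1000, 2000, 3000]
new_keys = list(range(3000, 13000))

ranged_hot = max_share([ranged_shard(k, boundaries) for k in new_keys])
hashed_hot = max_share([hashed_shard(k) for k in new_keys])

if __name__ == "__main__":
    print(f"ranged: busiest shard takes {ranged_hot:.0%} of new writes")
    print(f"hashed: busiest shard takes {hashed_hot:.0%} of new writes")
```

In a real cluster the balancer eventually splits and migrates the hot range, but the write pressure still concentrates on one shard at any given moment.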

2. Replica set 是 HA 基礎：

至少 3 個 member（1 primary + 2 secondary）
secondary 可 read（read preference）但要注意 lag
failover 通常 < 30 秒

3. Atlas managed 服務：

提供 auto-scaling、auto-backup、跨雲部署
Tier 從 M0（free）到 M700（高階）
Atlas Online Archive 自動把舊資料移到便宜 storage

4. Index 限制：

單 collection 最多 64 個 index
compound index 有順序敏感（{a:1, b:1} 跟 {b:1, a:1} 不同）
TTL index 自動 expire 過期 document
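
The order sensitivity of compound indexes follows from the prefix rule: a query can only use the leading fields of an index without gaps. The sketch below is a deliberately simplified model, assuming equality predicates only and ignoring sort and range handling as well as any planner optimizations; usable_prefix is a hypothetical helper, not a MongoDB API.

```python
def usable_prefix(index_fields, equality_fields):
    """Return the leading index fields that a set of equality predicates can use."""
    prefix = []
    for field in index_fields:
        if field not in equality_fields:
            break  # a gap ends the usable prefix
        prefix.append(field)
    return prefix


if __name__ == "__main__":
    query = {"a"}  # e.g. a filter on field a only
    print(usable_prefix(["a", "b"], query))  # index {a:1, b:1}
    print(usable_prefix(["b", "a"], query))  # index {b:1, a:1}
```

A filter on a alone can walk the first index but gets nothing from the second, which is why the two definitions are not interchangeable.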

5. Change streams（CDC）：

Native change streams are available from 3.6 (collection level); 4.0 extended them to database and deployment level
對接 Kafka / event bus 做 event sourcing

Anti-recommendation 與升級路由

MongoDB 的 schema flexibility 會降低早期建模成本，也會把 schema governance 延後到 production。這一段先說何時維持 document model，再說何時升級 Atlas、sharding、Cosmos DB、DynamoDB 或 SQL。

機制 / 路線	維持簡單設計的條件	升級訊號	主要引用路徑
單一 replica set	document size 穩定、working set 可控、primary 寫入足夠	storage / write / working set 接近上限、failover 演練不足	Replication Lag、RPO
Atlas managed	團隊仍能管理 backup、upgrade、monitoring 與 scaling	DBA / SRE 責任想轉交平台、跨雲部署與 backup 成為主要壓力	Audit Log、Secret Management
Sharded cluster	single replica set 還能承擔容量與維護窗口	shard key 穩定、tenant / user / region 可分、hot shard 可觀測	Database Sharding、Hot Partition
Cosmos DB MongoDB API	Azure 只是部署選項，原生 MongoDB 行為仍重要	Azure global distribution、multi-region write 或 RU governance 成主題	Cosmos DB vendor
DynamoDB / KV	query 仍需要 document traversal 與 aggregation	access pattern 固定、sub-10ms p99、connection-free scaling 成主題	DynamoDB vendor
PostgreSQL	document 是主要資料形狀	JOIN-heavy、transaction-heavy、schema 約束是主要價值	PostgreSQL vendor

MongoDB 的簡單路徑是先把 document boundary 寫清楚。資料可以彈性演進，但 application 仍要知道哪些欄位是正式契約、哪些欄位只是相容期，並用 validation、migration 與 data quality check 管住版本漂移。

Sharding 的升級路徑要等 shard key 與 query shape 足夠穩定。過早切 shard 會把 aggregation、transaction 與 index 成本提前放大；過晚切 shard 則會讓 resharding、chunk migration 與 balancer 壓力進入 production 高峰期。

Deep article（已完成）

本批 6 篇 deep article 已完成、覆蓋 MongoDB 從 schema 設計到 production 跨層架構的核心 production 議題：

主題	文章	對應 production 議題
Schema contract 該放 DB 層 validator 還是 app 層 abstraction	schema-design-pattern	Toyota polymorphic governance、Forbes abstraction layer
Shard key 選型 + 單 cluster vs 多 cluster blast radius	shard-key-selection	Toyota 20 DB blast radius、跟 DynamoDB 可逆性對比
Read preference + causal session 跟 cache 層 freshness token	replica-set-read-preference	DB 層 + cache 層讀後一致性兩層合用
Aggregation pipeline 順序 / index / memory boundary	aggregation-pipeline-optimization	report dashboard 跑爆 primary 的 anti-pattern 治理
Change streams resume token + Kafka connector 治理	change-streams-kafka	at-least-once 語義 + idempotency + resume token 過期防護
Driver × deployment × cache × predictive scaling 三層協作	connection-management-and-cache-layer	Coinbase mongobetween + freshness token + ML 預測擴容三件套

跨 vendor entry：先看 DB3 vendor selection（MongoDB / DynamoDB / Cosmos DB 三方選型 + workload shape 前置判讀），再進本 vendor 的 deep article。

後續擴充（仍待補）

Index 設計跟覆蓋
從自管 MongoDB 遷到 Atlas
從 MongoDB 遷到 Cosmos DB MongoDB API（保留 document model）
從 MongoDB 遷到 DynamoDB（access pattern 需要重設計）
Queryable Encryption and CSFLE (two separate features)

案例對照

案例	跟 MongoDB 的關係
9.C30 Microsoft 365	從 MongoDB 遷到 Cosmos DB MongoDB API、planet-scale analytics
9.C36 Coinbase	MongoDB 為主資料層、自建 mongobetween 解決 Ruby 連線爆炸、users 服務 1.5M reads/sec
9.C37 Forbes	自管 MongoDB → Atlas on GCP、6 個月遷完、build 25→9 分鐘、120M MAU
9.C38 Toyota Connected	Atlas 撐 900 萬車 telematics、月 180 億 transaction、緊急訊號 3 秒內到 agent

MongoDB case 的讀法分三組：

作為 production 主角持續演進（Coinbase、Toyota Connected）：document model 撐住核心 OLTP / IoT、配 connection proxy / cache / event-driven 處理擴展周邊。
自管 → managed 遷移（Forbes）：同 document model、換託管模式、ROI 集中在 DBA 責任轉移跟跨雲彈性、不是性能改善。
遷出 MongoDB 保留 API（Microsoft 365）：document model 保留、底層換到 Cosmos DB MongoDB API、換取 Azure global distribution。

讀 case 時要區分 MongoDB 在「主角 / 遷入 / 遷出」三種位置的差異，三種位置揭露的工程議題完全不同。

常見陷阱

schema 長期 schema-less：production 出現 data inconsistency、難 query
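Shard key on a monotonically increasing _id (for example the default ObjectId) with ranged sharding: all new writes land on the shard that owns the highest range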
$lookup 過度使用：跨 collection JOIN-heavy workload 應在 schema design 時 denormalize 或回 SQL
index 太多：寫吞吐被拖垮、定期 review 未用 index
secondary read 不檢查 lag：用戶讀到 stale data
不規劃 Atlas tier upgrade 路徑：流量上來才發現 tier 跟不上、緊急升級費用高

下一步路由

完整 T1 對照：01-database vendors index
平行：Cosmos DB vendor（MongoDB API replacement）、DynamoDB vendor（KV alternative）
上游：1.2 schema design、1.10 KV / Document DB 容量規劃
下游：1.12 大規模 DB 遷移實戰（MongoDB 遷出範例）
跨模組：9.6 容量規劃模型、9.4 Saturation Discovery（shard key 跟 hot shard）
官方：MongoDB Manual、MongoDB Atlas

Grafana OnCall

Fri, 01 May 2026 00:00:00 +0000

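Grafana OnCall is an OSS-friendly on-call platform maintained by Grafana Labs. It originates from Amixr.io, acquired in 2021, and the OSS edition is released under AGPLv3. It carries three responsibilities: alert routing + schedule + escalation (an OSS alternative to PagerDuty), alert convergence inside the Grafana ecosystem (Grafana / Alertmanager / Mimir / Loki alerts enter one routing layer), and phone / SMS notification through providers such as Twilio. Since 2024 Grafana Labs has offered the Grafana IRM (Incident Response Management) bundle, which ties Grafana OnCall and Grafana Incident (formerly Grafana Incident Response & Communications) into a single alert-to-resolve workflow and is positioned directly against the integrated IR offerings of PagerDuty and incident.io.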

服務定位

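Grafana OnCall is positioned as the on-call layer inside the Grafana ecosystem, not as a standalone IR platform. The product line underneath: Grafana OnCall OSS (self-hosted, Helm chart, AGPLv3), Grafana Cloud OnCall (SaaS, included in Grafana Cloud Pro/Advanced), and the Grafana IRM bundle (OnCall + Incident integrated, the main route since 2024). It can also be used on its own in environments that are not Grafana-heavy, but its ecosystem breadth does not match PagerDuty.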

跟 PagerDuty 比、Grafana OnCall 走 OSS-first + 預算敏感、核心 schedule / escalation / phone-call 功能對齊、但 advanced workflow（global event orchestration、business service mapping、analytics depth）較弱。跟 Opsgenie 比、Grafana OnCall 不綁 Atlassian 生態、適合已用 Grafana stack 的團隊。跟 incident.io 比、Grafana IRM bundle 在 alert routing 強、但 Slack-native incident channel 體驗 incident.io 仍領先。

關鍵張力：OSS 路徑的維運成本 ↔ 商業 SaaS 的 SLA。Self-hosted OSS 要自管 PostgreSQL / Redis / Celery worker / Twilio account、出事故時自家 on-call 平台不能掛（chicken-and-egg）；Grafana Cloud OnCall 解這層、但脫離了 OSS 自管的成本優勢。中型團隊通常走 Grafana Cloud、小型 OSS-first 團隊走自管 + Twilio。

本章目標

讀完本頁、讀者能判斷：

自管 Grafana OnCall（Helm chart）vs Grafana Cloud OnCall vs Grafana IRM bundle 的取捨
配置 schedule / escalation chain / Twilio phone-call 的最短路徑
Grafana / Alertmanager / 自家 webhook 進 OnCall 的 routing 設計
跟 SIEM（Splunk / Elastic）webhook 整合的 alert 收斂模式
評估 Grafana OnCall vs PagerDuty / Opsgenie / incident.io 取捨

最短判讀路徑

判斷 Grafana OnCall deployment 是否健康、最少看四件事：

Slack / Teams integration：on-call notification 是否進團隊主 chat channel、ack / resolve 是否能直接在 Slack 操作不切換 UI、@here / @channel 跟 phone-call 是否分層（低風險 Slack only、高風險才打電話）
Escalation chain：N step escalation 是否覆蓋 primary → secondary → manager、每階是否有 timeout（5min / 15min / 30min）、節假日 / 跨時區 schedule 是否走 rotation 而非單人值班、override 機制是否清楚
Webhook integration to SIEM：Splunk / Elastic Notable Event 進 OnCall 的 webhook 是否走 correlation rule 過濾後 才轉發、HMAC / token auth 是否正確、failed delivery 是否有 retry 跟 dead-letter queue
Grafana dashboard alert routing：Grafana / Alertmanager alert 是否走 severity-based routing（critical / warning / info 分流到不同 escalation chain）、alert grouping / deduplication 是否啟用避免 alert storm、跟 observability-reliability-incident-loop 的 signal-to-incident 邊界是否定義

四件事任一缺失、就是 drills-and-oncall-readiness 的待補項目。

日常操作與決策形狀

Schedule + escalation chain：rotation 走 weekly / daily / custom、可掛 calendar import（iCal / Google Calendar）做休假 override。Escalation chain 是 N step + timeout 結構（例：notify primary → 5min no ack → notify secondary → 15min no ack → notify manager + phone-call）。反例是 single-step chain — 一個人 ack 不到整個 incident 卡住、production chain 至少要 3 step + 跨時區 fallback。
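
The N step + timeout structure can be sketched as a small simulation. This is an illustrative model only, not the Grafana OnCall API: the Step type, the target names, and the 30 minute wait on the last step are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    target: str
    wait_minutes: int  # how long to wait for an ack before escalating


# Mirrors the example chain: primary -> 5 min -> secondary -> 15 min -> manager + phone call.
CHAIN = [Step("primary", 5), Step("secondary", 15), Step("manager+phone", 30)]


def notified_before_ack(chain, ack_after_minutes):
    """Return the targets paged before the alert is acknowledged.

    ack_after_minutes=None means nobody ever acknowledges.
    """
    notified, elapsed = [], 0
    for step in chain:
        notified.append(step.target)
        elapsed += step.wait_minutes
        if ack_after_minutes is not None and ack_after_minutes < elapsed:
            break  # acked inside this step's window, escalation stops
    return notified


if __name__ == "__main__":
    print(notified_before_ack(CHAIN, 3))
    print(notified_before_ack(CHAIN, 7))
    print(notified_before_ack(CHAIN[:1], None))  # single-step chain: nobody else is paged
```

The last call shows the single-step anti-pattern: if the only person paged never acks, the chain has nowhere left to go.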

Alert grouping + Notification：alert source 包含 Alertmanager（Prometheus / Mimir）、Grafana alert（unified alerting 推送）、generic webhook（自家 app / SIEM）、Sentry / Datadog 等第三方。Grouping 用 integration template 寫 Jinja2 抽欄位（service / severity / region）做 deduplication。Notification channel 分層：Slack / Teams 走低成本通知、Twilio phone-call / SMS 留給 P0 / P1、Mobile push 走 Grafana IRM mobile app。

Grafana 生態整合：Grafana Cloud 帳號內 OnCall 直接啟用、不另外 deploy。Grafana unified alerting 推 alert 到 OnCall integration、Loki / Tempo 的 metric-from-log / trace-anomaly alert 一條 pipeline 進 OnCall。對應 Grafana Stack 的 alert 出口。Grafana SLO（Service Level Objective）違反 burn rate threshold 也可直接路由到 OnCall escalation。

Grafana IRM bundle（2024+）：Grafana 把 OnCall（alert routing）+ Incident（incident lifecycle / war room / timeline）打包、目標是把 alert paged → IC declared → channel created → timeline auto-recorded → post-incident review 收進一個 console。對 Grafana-heavy 環境的吸引力是 少一個 vendor seam；對 Slack-native 團隊則跟 incident.io / FireHydrant 競爭、要看 Slack 體驗深度。

OnCall webhook 整合 SIEM / 第三方：generic webhook integration 接 Splunk Notable Event、Elastic Security alert、Datadog monitor、自家 app exception。Webhook payload 走 integration template 轉成 OnCall alert 欄位、加 routing label 進對應 escalation chain。注意 webhook auth 走 token / HMAC、不要用 anonymous webhook 接外網 — 對應 incident-workflow-automation-boundary 的入口治理。

Maintenance mode：planned maintenance window 期間 suppress alert、避免 deploy / DB migration 觸發大量假 alert。設定 integration-level mute 或 route-level mute、附 reason 跟 expiry time、不要無限期 mute（容易遺忘變盲點）。

Mobile app：Grafana IRM mobile app（iOS / Android）支援 push notification + ack / resolve / 加 note、replace 部分電話需求。但 phone-call 不可完全廢除 — 手機靜音 / 深夜值班 push 不一定醒、P0 仍需 Twilio 多次呼叫升級。

自管部署：Helm chart 部署、依賴 PostgreSQL（state）+ Redis（cache / Celery broker）+ Celery worker（background job）+ Twilio account（phone / SMS）+ TLS domain。Production checklist：PostgreSQL 走 managed service（RDS / Cloud SQL）避免自管 DB on-call 平台兩層 chicken-and-egg、Redis 走 managed、Helm values 走 GitOps 版控、Twilio account 走獨立 sub-account 避免 quota 跟其他服務搶。

核心取捨表

取捨維度	Grafana OnCall	PagerDuty	Opsgenie	incident.io
計費模型	OSS 自管免費 / Cloud 含在 Grafana Cloud 套餐	Per-user / 月、advanced tier 加價	Per-user / 月（Atlassian 套餐）	Per-user / 月、Slack-native focus
部署模型	Self-hosted (Helm) / Grafana Cloud SaaS	SaaS only	SaaS only	SaaS only
License	AGPLv3 OSS	Commercial SaaS	Commercial SaaS	Commercial SaaS
Advanced workflow	基本 schedule + escalation、analytics 較弱	業界最強（global orchestration / RBA）	中等（Atlassian Jira / Confluence 整合）	Slack incident channel + post-incident
Integration ecosystem	Grafana / Alertmanager 強、第三方靠 webhook	700+ 原生 integration	Atlassian 生態深、Jira / Confluence 一線	Slack-native、深度有限但體驗好
Phone / SMS	Twilio（自配 account / OSS 路徑要自管）	內建、跨地區 carrier 覆蓋廣	內建、Atlassian 計費	內建、focus 在 Slack ack 多於電話
Slack 體驗	Slack integration 基本（notify / ack）	Slack integration 完整	Slack integration 中等	Slack-native、incident channel 自動建
跨平台 IR	Grafana IRM bundle（OnCall + Incident）2024+	PagerDuty Incident Workflows	Jira Service Management incident	incident.io Catalog + workflow
適合場景	Grafana-heavy / OSS-first / 預算敏感	Enterprise / 跨產品線 / 高 SLA	已用 Atlassian / Jira Service Management	Slack-first / startup-to-midsize
退場成本	低 — OSS 路徑可帶走 config、Cloud 也有 export	中-高 — escalation policy / workflow 量多	中 — Atlassian 套餐綁定	中 — Slack workflow 客製化深度

選 Grafana OnCall 的核心訴求：OSS-friendly / 預算敏感 / Grafana 生態已是觀測平台主力、能接受 advanced workflow 較弱（或預期不需要）、自管路徑能投入 PostgreSQL / Redis / Twilio account 維運。Enterprise + 高 SLA + 跨產品線 ecosystem 廣度需求仍走 PagerDuty。

進階主題

Grafana IRM bundle 的整合決策：OnCall（alert routing）+ Incident（incident channel / timeline / post-mortem）打包後、IR workflow 收在一個 console。決策點是 是否已用 Slack 做 incident channel、若團隊 Slack incident workflow 成熟、IRM Incident 的 channel 自動建可能跟現有 incident-communication 模式衝突；若還沒成熟、IRM bundle 是最短路徑。

OnCall webhook 整合 SIEM 的 alert 收斂模式：Splunk ES Notable Event / Elastic Security alert 不該直接打 OnCall — 噪音太大會造成 alert-fatigue-and-signal-quality 問題。實務做法：SIEM 端先走 correlation rule + risk-based threshold、只有 high-confidence finding 才 webhook 到 OnCall、低風險走 Slack notification channel 給 SOC analyst triage。

Maintenance mode 跟 deploy 流程的整合：deploy pipeline 在 production rollout 前 call OnCall API 開 maintenance window（mute 特定 integration / route）、deploy 完成或失敗 rollback 後關閉。避免 deploy 期間 false alert 把 on-call 叫醒、但要設 max maintenance duration（例 1hr 自動 expire）避免長 window 變盲點。

OSS 自管的 chicken-and-egg：自管 OnCall 部署本身的 monitoring 不能依賴 OnCall — OnCall 掛了 alert 進不來、on-call 不知道 OnCall 掛了。實務做法：OnCall infra 的 monitoring 走另一條 bootstrap alert（直接 Twilio API call + email-to-pager fallback）、或保留小規模 PagerDuty free tier 做 backstop。

排錯與失敗快速判讀

Webhook 沒觸發 / alert 沒進來：integration URL 錯（環境變數沒帶 base URL）、token / HMAC auth 設錯、source 端 webhook payload format 不對（沒走 integration template mapping）— 檢查 OnCall integration log + source webhook delivery log 對齊
Slack notification stuck / 不出現：Slack OAuth token 過期、Slack workspace permission 變更、OnCall Slack bot 沒被 invite 進 channel — 重 OAuth + 確認 bot membership
Twilio quota 用完 / phone-call 失敗：Twilio account balance 不足 / 沒升級 trial / 地區 carrier 限制 — 看 Twilio dashboard balance + delivery log、A2P 10DLC 註冊跟地區 toll-free 預先設定
Schedule overlap / on-call 漏排班：rotation override 配錯、calendar import 沒同步、時區誤判（UTC vs local）— 用 OnCall schedule preview 跑 7-day forward 檢查
Notification delay / 來得慢：provider latency（Twilio / Slack / FCM push）、Celery worker queue backlog（自管路徑）、escalation timeout 設太長 — 自管路徑檢查 Celery queue length + worker count
Self-hosted upgrade gotcha：Helm chart major upgrade 帶 DB schema migration、跳版升級失敗、PostgreSQL extension 缺 — 走 staging environment 跑 migration + 備 rollback DB snapshot、不直接 production helm upgrade
Maintenance mode 沒到期 / 變盲點：mute 沒設 expiry / reason、deploy 完成沒清 mute — maintenance window 強制設 max duration、weekly review mute 清單

何時改走其他服務

需求形狀	改走
進階 IR workflow / RBA	PagerDuty
Atlassian 生態 / Jira	Opsgenie
Slack-native incident	incident.io
商業 SLA / Enterprise	PagerDuty / Opsgenie
Post-incident learning	Jeli（PagerDuty 收購）
Status page (對外溝通)	Atlassian Statuspage / Instatus

不在本頁內的主題

Twilio account 申請 / A2P 10DLC 註冊 / 地區 carrier 設定細節
Helm chart values 完整 reference（看官方 docs）
Grafana Cloud OnCall pricing tier 對照
Grafana unified alerting 規則語法（屬 observability 範圍、見 Grafana Stack）
Grafana Incident 的 channel / timeline 細節（屬 IRM bundle 另一半、本頁聚焦 OnCall）

案例回寫

Grafana OnCall 在 08 案例庫沒有直接 vendor-level 事件、本案例庫的多數事故主角是 Slack / GitHub / Cloudflare / AWS 等基礎設施。Grafana OnCall 的對照位置在 OSS-first organization / Grafana-heavy 監控環境 的 IR routing 設計、相關 case 的啟示如下：

案例方向	跟 Grafana OnCall 的關係（對照啟示）
OSS-first / Grafana-heavy 觀測環境	Alertmanager / Mimir / Loki alert 進 OnCall 是最短整合路徑、escalation chain 走 Grafana SLO burn rate trigger
預算敏感的中型團隊	Self-hosted OnCall + Twilio account 是 PagerDuty 的 OSS 替代、要算 PostgreSQL / Redis 維運成本是否真的省
Slack-only IR workflow vs Grafana IRM	Grafana IRM bundle 把 incident channel 收進 console、跟 incident.io / Slack-native workflow 二選一
Vendor 依賴出事（vendor-dependency-incident）	OnCall 自身是 vendor、自管路徑要設 bootstrap alert、Cloud 路徑要評估 Grafana Labs SLA 跟 backup paging

下一步路由

上游：Drills and On-call Readiness、Incident Workflow Automation Boundary
平行：PagerDuty、Opsgenie、incident.io、FireHydrant、Rootly
下游：Grafana Stack（alert source）、Observability ↔ Reliability ↔ Incident Loop
跨模組：Splunk（SIEM webhook → OnCall）、Vendor Dependency Incident（OnCall 自身 vendor 風險）
官方：Grafana OnCall Documentation

Grafana Stack

Fri, 01 May 2026 00:00:00 +0000

Grafana Stack 是 Grafana Labs 提供的 OSS observability 全棧、承擔三個責任：跨 data source 統一視覺化（Grafana）、各訊號類型專屬 backend（Loki logs / Tempo traces / Mimir metrics / Pyroscope profiles）、可自管或用 Grafana Cloud（managed）。設計取捨偏向「OSS-first + signal-specific backend + 統一查詢介面」、是 Datadog 的 OSS 替代方案。

對「需要 OSS / 自管 observability、跨 data source 統一儀表板、不想 vendor lock-in」這條路徑、Grafana Stack 是首選。

本章目標

讀完本章後、你應該能：

部署 Grafana + Prometheus + Loki + Tempo 基本棧
用 LogQL 查詢 Loki、用 TraceQL 查詢 Tempo
設計 dashboard as code（Jsonnet / Terraform）
評估 Mimir vs Thanos 的長期 metrics 儲存選擇
評估 Grafana Cloud（managed）跟自管的取捨

最短路徑：5 分鐘把 Grafana Stack 跑起來

# 1. Start Grafana + Prometheus + Loki with docker-compose
# TODO: docker-compose.yml with grafana / prometheus / loki

# 2. Add data sources in Grafana
# TODO: datasource config for Prometheus and Loki

# 3. Build the first dashboard
# TODO: try PromQL + LogQL in Explore

最短路徑驗證 Grafana 起來、可訪 metrics + logs。實際 production 要評估 Mimir / Tempo + Grafana Cloud 取捨。

日常操作與決策形狀

Grafana 視覺化

子議題：

Data source 配置（Prometheus / Loki / Tempo / Postgres / MySQL / Elasticsearch）
Dashboard 設計：variable + template + panel
Dashboard as code：Jsonnet (Grafonnet) / Terraform Grafana provider
對應指令：HTTP API /api/dashboards

LogQL（Loki 查詢）

子議題：

LogQL syntax：log stream selector + filter + parser + aggregation
跟 PromQL 對齊的設計（同樣 label-based）
範例：{job="app"} |= "error" | json | line_format "..."
對應 metrics-from-logs（unwrap + rate）

TraceQL（Tempo 查詢）

子議題：

TraceQL syntax：span selector + attribute + aggregation
範例：{ span.http.status_code = 500 && duration > 1s }
Service graph：跨服務依賴自動分析
對應 trace-to-logs / trace-to-metrics 關聯查詢

Deep Article

LGTM Stack 組合運維：四個元件的責任分工、部署模式、常見故障與 dashboard provisioning
Loki 設計與操作限制：label-based index 設計、LogQL 查詢模式、cardinality 治理與 Elasticsearch 差異

進階主題（按需閱讀）

Loki 設計與限制

子議題：

Storage：S3 / GCS / 本地、按 stream 切 chunks
Label cardinality 跟 Prometheus 一樣敏感（不是 stream content）
LogQL 不適合 high-cardinality content search（用 Elastic）
對應 4.C3 Healthcare retention

Tempo trace 採集

子議題：

接受 OTLP / Jaeger / Zipkin protocol
Storage：S3 / GCS、cheap object storage
Trace ID lookup 為主、no full-text search（用 traces metrics 反向查）
對應 4.C4 X-Ray to OTel

Mimir 長期 metrics 儲存

子議題：

Prometheus remote write 接收 metric
Horizontally scalable（multi-tenant）
跟 Thanos / Cortex 的對照（Mimir 是 Cortex fork + improvements）
對應 4.C8 Airbnb K8s scale

Pyroscope continuous profiling

子議題：

CPU / memory / mutex / goroutine profiling
Flame graph 視覺化
跟 Tempo trace 關聯（trace-to-profile）
Grafana Pyroscope (after Grafana Labs acquired Pyroscope in 2023 and merged it with Phlare) vs the original Pyroscope

Grafana Cloud（managed）

子議題：

Free tier 額度 + paid tier
含所有 stack（Metrics / Logs / Traces / Profiles）
Grafana Cloud vs Datadog cost 對照
Hybrid 模式：self-host backend + Grafana Cloud Grafana

Unified Alerting

子議題：

Unified alerting (opt-in since Grafana 8, the default since Grafana 9) replaces the split between legacy dashboard alerts and Prometheus Alertmanager
跨 data source 寫 alert rule
Multi-dimensional alert（per-label）
對應 Alertmanager 兼容

排錯快速判讀

Dashboard 載入慢

操作原則：先看 query 範圍跟 panel 數、用 query inspector 看 query 時間分布。

Loki query 過慢 / 失敗

操作原則：Loki query 需要 label filter 先縮範圍、再 content match。

# TODO: LogQL: {namespace="prod", app="api"} |= "error" (label selector first, then line filter)

Tempo span gap

操作原則：trace 不完整、看 sampling 設定 + Collector buffer 是否 drop。

Mimir ingestion 失敗

操作原則：remote_write rate / size limit 撞到 Mimir quota。判讀：Mimir HTTP 429 / 413。

Grafana 跟 Prometheus disconnected

操作原則：data source 連不上、看 Grafana log + network。

何時改走其他服務

需求形狀	改走
Pure metrics	Prometheus 單獨用
SaaS turnkey APM	Datadog
Log full-text search 為主	Elastic Stack
High-cardinality debug	Honeycomb
AWS / GCP native	CloudWatch / Cloud Ops
Error tracking	Sentry
Profile only	Pyroscope OSS / Polar Signals

不在本頁內的主題

各 Grafana plugin 細節
Dashboard 美術 / UX 建議
Grafana / Loki / Tempo / Mimir 各自完整 admin 手冊
Grafana 商業版 (Enterprise) 功能

案例回寫

直接相關案例

案例	主討論議題
4.C2 Gaming peak cardinality	Loki / Mimir 高峰下的 ingestion lag 與標籤治理
4.C3 Healthcare retention	Loki retention / compliance
4.C8 Airbnb K8s scale	Mimir scale / Prometheus 長期儲存

跨 vendor 對照

案例	對 Grafana Stack 的對應
4.C4 X-Ray to OTel	從 X-Ray 遷出後 Tempo 是 OSS trace backend 候選
4.C7 Datadog OTel migration	從 Datadog 遷出可去 Grafana Cloud
4.C10 規模對照	小型 single Grafana / 中型加 Loki+Tempo / 大型 Grafana Cloud 或 Mimir

下一步路由

上游概念：Metrics Basics
平行 vendor：Prometheus、OpenTelemetry
下游能力：4.20 Observability Evidence Package

k6

Fri, 01 May 2026 00:00:00 +0000

k6 是 Grafana Labs 出品的 load test 工具、承擔三個責任：CLI-first load test（Go 寫成、JS 寫測試 script）、threshold-based CI gate（pass/fail 直接接 CI）、Grafana Cloud k6 / k6 Operator on K8s 分散式。設計取捨偏向「CI-first + JS DX + 整合 Grafana 生態」、是現代 load test 主流選擇。

本章目標

讀完本章後、你應該能：

寫 k6 test script（VU / iteration / stages）
設計 threshold + CI gate（pass/fail）
用 xk6 extension 擴展（gRPC / Kafka / SQL）
部署 k6 Operator 做 distributed load
評估 k6 vs Gatling / Locust / JMeter 的選用

最短路徑：5 分鐘把 k6 跑起來

# 1. Install
# TODO: brew install k6 / docker run grafana/k6

# 2. Write test.js
# TODO: import http from 'k6/http'; export default function(){ http.get(...) }

# 3. Run
# TODO: k6 run --vus 10 --duration 30s test.js

日常操作與決策形狀

Test script 結構

子議題：

export default function（per-VU iteration）
export const options（VU / duration / stages / thresholds）
Setup / teardown
對應指令範例：k6 run --vus 100 --duration 10m

Threshold + CI gate

子議題：

thresholds: http_req_duration: ['p(95)<500']
Exit code 非 0 → CI fail
Custom metric thresholds
對應 6.13 Performance Regression Gate
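
How a percentile threshold turns into a CI pass/fail signal can be sketched outside k6. The code below is an assumption-laden model, not k6's implementation: it only understands expressions shaped like p(95)<500, uses a nearest-rank percentile (real tools may interpolate differently), and the function names are hypothetical.

```python
import math
import re


def percentile(samples, pct):
    """Nearest-rank percentile over raw samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def passes(threshold, samples):
    """Evaluate an expression such as 'p(95)<500' against latency samples in ms."""
    match = re.fullmatch(r"p\((\d+(?:\.\d+)?)\)\s*(<=|<)\s*(\d+(?:\.\d+)?)",
                         threshold.strip())
    if match is None:
        raise ValueError(f"unsupported threshold: {threshold!r}")
    pct, op, limit = float(match[1]), match[2], float(match[3])
    value = percentile(samples, pct)
    return value < limit if op == "<" else value <= limit


def exit_code(thresholds, samples):
    """0 when every threshold holds, 1 otherwise, so CI can gate on it."""
    return 0 if all(passes(t, samples) for t in thresholds) else 1


if __name__ == "__main__":
    latencies_ms = list(range(1, 101))  # synthetic samples: 1..100 ms
    print(exit_code(["p(95)<500"], latencies_ms))
    print(exit_code(["p(95)<500", "p(99)<50"], latencies_ms))
```

Gating on several percentiles at once is the point: a run that passes p(95) can still fail the build on p(99).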

Test pattern

子議題：

Smoke / Load / Stress / Spike / Soak / Breakpoint
Stages（ramp-up / steady / ramp-down）
VU vs iteration vs RPS-based
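
A ramp-up / steady / ramp-down profile is easier to reason about when the planned VU count can be computed for any point in time. The sketch below assumes linear interpolation inside each stage and strictly positive stage durations; vus_at and the STAGES values are illustrative, not part of k6.

```python
def vus_at(stages, t, start_vus=0):
    """Planned VU count at time t (seconds) for (duration_s, target_vus) stages."""
    prev_target, stage_start = start_vus, 0.0
    for duration, target in stages:
        stage_end = stage_start + duration
        if t <= stage_end:
            progress = (t - stage_start) / duration
            return round(prev_target + (target - prev_target) * progress)
        prev_target, stage_start = target, stage_end
    return prev_target  # after the last stage, hold its target


# Ramp up to 100 VUs in 1 minute, hold for 5 minutes, ramp down in 1 minute.
STAGES = [(60, 100), (300, 100), (60, 0)]

if __name__ == "__main__":
    for second in (0, 30, 60, 200, 390, 420):
        print(second, vus_at(STAGES, second))
```

Plotting this curve next to the measured RPS makes it obvious whether a latency spike lines up with the ramp or with the steady phase.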

進階主題（按需閱讀）

xk6 extensions

子議題：

自訂 binary：xk6 build + import extension
內建：HTTP / WebSocket / gRPC
社群：Kafka / SQL / Redis / browser
對應 cross-protocol load test

k6 Operator on K8s

子議題：

TestRun CRD
Distributed load（多 pod 模擬高 VU）
Result aggregation
對應 Kubernetes vendor 頁

Grafana Cloud k6

子議題：

Managed runner（多 region load source）
跟 Grafana dashboard 整合
跟 Loki / Tempo trace 關聯（test → APM trace）

Browser testing

子議題：

k6 browser：Chromium-based browser testing
跟 Playwright 重疊但更聚焦 load
適合 frontend regression load test

CI integration

子議題：

GitHub Actions / GitLab CI / Jenkins 整合
Artifact + report upload
對應 6.8 Release Gate

k6 vs xk6 vs Cloud

子議題：

k6 OSS：CLI + local script
xk6：build custom binary with extensions
k6 Cloud / Grafana Cloud k6：managed + UI

排錯快速判讀

Test 結果差異大

操作原則：local network / VU saturation / target 處理能力。

Threshold 太鬆 / 太嚴

操作原則：baseline 不準 / production traffic pattern 沒模擬。

Distributed load 不均勻

操作原則：k6 Operator 分配 VU 不均 / pod 規格差異。

Browser testing 慢 / 不穩

操作原則：Chromium 啟動成本 / network condition / target 反應時間。

何時改走其他服務

需求形狀	改走
JVM 生態	Gatling
GUI / 老牌	JMeter
Python	Locust
純 browser flow	Playwright / Cypress
Cloud managed	Grafana Cloud k6 / BlazeMeter / k6 Cloud
Capacity planning（非 CI）	09 performance capacity 模組

不在本頁內的主題

JS 語言基礎
k6 完整 API
Grafana Cloud k6 pricing

案例回寫

案例方向	對應主題
Shopify：BFCM 容量治理與 Game Day	峰值前 load test 對齊 capacity model + CI gate
LinkedIn：Capacity 與 On-call 分層	automated load testing 變成日常流程的工程化做法

待補 k6 customer case：Grafana Labs / k6 customer engineering blog、企業遷移 JMeter → k6 案例。

下一步路由

上游概念：6.13 Performance Regression Gate
平行 vendor：Gatling、Locust、JMeter
下游能力：09 performance capacity load test 模組

Memcached

Fri, 01 May 2026 00:00:00 +0000

Memcached 是純粹的 in-memory key-value cache、承擔三個責任：簡單 string KV cache、多執行緒高吞吐、嚴格的 cache 邊界（無持久化 / 無 data types / 無 lock）。設計取捨偏向「越簡單越好」— 沒有 Redis 的 data types / Streams / Pub/Sub、也沒有持久化 / 複製 / cluster mode。極輕量、運維成本低、適合 strict cache 場景。

對「純 cache、避免誤用為 source-of-truth、需要多執行緒高 throughput、極簡運維」這條路徑、Memcached 是首選。從 LiveJournal 2003 年開源至今、是業界最久經考驗的 cache。

本章目標

讀完本章後、你應該能：

跑起 Memcached、用 telnet 或 memcached-tool 驗證
用 SET / GET / DELETE / INCR / DECR 操作、區分 Memcached 跟 Redis 的場景界限
設計 client-side consistent hashing 做 sharding
看懂 hit rate / slab fragmentation / eviction 訊號
評估 Memcached vs Redis 的選用判讀（何時純粹勝過豐富）

最短路徑：5 分鐘把 Memcached 跑起來

# 1. Start Memcached (-t 4 starts 4 worker threads, -m 64 gives it 64MB)
docker run -d --name memcached -p 11211:11211 memcached:1.6 memcached -t 4 -m 64

# 2. Verify reads and writes over the text protocol (there is no dedicated CLI like redis-cli; talk TCP directly)
#    set <key> <flags> <exptime> <bytes>, then the value on the next line
printf 'set foo 0 60 3\r\nbar\r\nget foo\r\nquit\r\n' | nc localhost 11211
# STORED
# VALUE foo 0 3
# bar
# END

# 3. Confirm the thread count and the memory limit
printf 'stats settings\r\nquit\r\n' | nc localhost 11211 | grep -E "num_threads|maxbytes"
# STAT maxbytes 67108864      ← 64MB
# STAT num_threads 4          ← -t 4 took effect

最短路徑驗證「Memcached 起來、能讀寫、多執行緒生效」。Memcached 沒有 redis-cli 這類專屬 CLI、實際 ops 走 client library（python-memcached / pylibmc / go memcache）+ stats 系列命令。實機驗證於 memcached:1.6（VERSION 1.6.42）、最後檢查日 2026-06-16。

日常操作與決策形狀

協議與 client library

子議題：

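ASCII (text) protocol vs binary protocol: both are supported, but the binary protocol is deprecated in 1.6 and the meta commands of the text protocol are the recommended path for new clients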
Client library：python-memcached、pylibmc（libmemcached 綁定）、go memcache、Java spymemcached
Connection management：connection pool / persistent connection

指令對照

子議題：

基本：SET / GET / ADD / REPLACE / DELETE / FLUSH_ALL
Counter：INCR / DECR（不能 < 0）
條件：CAS（compare-and-swap）做 optimistic lock
批次：GETS（批次 + CAS token）

Client-side sharding

Memcached server 本身無 cluster mode、靠 client library 做 sharding。子議題：

Consistent hashing（ketama）— 加減 node 時 minimum key 移動
Hash 演算法：md5 / SHA1 / ketama
對應 2.4 cache data shape
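
The property that matters in consistent hashing is that adding a node moves only a minority of keys, and only onto the new node. The following sketch builds a simple hash ring with virtual nodes. It follows the ketama idea but is not the ketama algorithm byte for byte; the node names, the vnode count, and the md5 truncation are assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:4], "big")

    def node_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


def moved_fraction(before, after, keys):
    """Share of keys whose owning node differs between two rings."""
    moved = sum(1 for k in keys if before.node_for(k) != after.node_for(k))
    return moved / len(keys)


keys = [f"user:{i}" for i in range(10_000)]
three = HashRing(["cache-a", "cache-b", "cache-c"])
four = HashRing(["cache-a", "cache-b", "cache-c", "cache-d"])
ring_moved = moved_fraction(three, four, keys)

if __name__ == "__main__":
    print(f"keys remapped after adding one node: {ring_moved:.1%}")
```

With plain modulo hashing (hash % node_count), going from three to four nodes would remap most keys, which translates directly into a cache-miss storm on the backing store.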

Memory model（slab allocator）

子議題：

Memcached 用 slab allocator 預分配記憶體 chunk
不同 size class（slab class）對應不同 chunk size
Fragmentation：當 value size 跟 slab 不對齊、memory 浪費
對應指令：stats slabs / stats items
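
Fragmentation is easier to see with numbers. The sketch below generates chunk sizes with a growth factor and computes the bytes left unused when an item lands in the smallest chunk that fits. The starting size of 96 bytes, the factor of 1.25, and the 8 byte alignment are assumptions modelled on common defaults; the authoritative values for a running server come from stats slabs.

```python
def slab_chunk_sizes(smallest=96, factor=1.25, largest=1024 * 1024, align=8):
    """Chunk sizes per slab class, growing by `factor` and rounded up to `align`."""
    sizes, size = [], smallest
    while size < largest:
        sizes.append(size)
        grown = int(size * factor)
        aligned = (grown + align - 1) // align * align
        size = max(size + align, aligned)  # always make progress
    sizes.append(largest)
    return sizes


def wasted_bytes(item_size, sizes):
    """Bytes left unused when an item is stored in the smallest chunk that fits."""
    for chunk in sizes:
        if item_size <= chunk:
            return chunk - item_size
    raise ValueError("item larger than the largest chunk")


if __name__ == "__main__":
    sizes = slab_chunk_sizes()
    print(sizes[:6])
    print(wasted_bytes(121, sizes))  # an item just over a class boundary wastes the most
```

Items that sit just above a class boundary waste the most, which is why a shift in typical value size can raise memory use without any change in item count.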

進階主題（按需閱讀）

Slab allocator 與 memory fragmentation

子議題：

Slab class 自動分配機制
Slab reassignment（Memcached 1.4.25+）— 把記憶體在 slab class 間搬移
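Monitor STAT total_malloced (from stats slabs) against STAT bytes (from stats); bytes_read counts network bytes received and says nothing about memory use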
對應指令：stats slabs、slabs reassign

Multi-threaded scaling

子議題：

Memcached 從早期就 multi-threaded（vs Redis 早期 single-thread）
-t 設 thread 數、預設 4、依 CPU core 調
Lock contention：高 thread 數可能 hit per-bucket lock
對比 Redis：Redis 6+ 加 I/O threads、但 main thread 仍單線

AWS ElastiCache for Memcached

子議題：

ElastiCache 提供 managed Memcached cluster
Auto Discovery：客戶端自動發現 cluster node 變化
ElastiCache config endpoint 取代 client-side sharding 配置
跟 Redis ElastiCache 的成本對照

CAS（compare-and-swap）

子議題：

GETS returns the value plus a CAS token; the CAS command (not SET) sends that token back for the conditional update
適合做 optimistic lock（vs Redis SETNX + lua）
CAS 失敗時的 retry 策略
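
The gets / cas cycle and its retry loop can be sketched against an in-memory stand-in. FakeCache below only mimics the token semantics and is not a real client; real client libraries expose the same idea under their own method names, and the bounded retry count is an assumption.

```python
class FakeCache:
    """In-memory stand-in that mimics gets/cas token semantics."""

    def __init__(self):
        self._data = {}  # key -> (value, cas_token)
        self._next_token = 1

    def set(self, key, value):
        # Every successful write gives the item a fresh token.
        self._data[key] = (value, self._next_token)
        self._next_token += 1

    def gets(self, key):
        return self._data.get(key, (None, None))

    def cas(self, key, value, token):
        current = self._data.get(key)
        if current is None or current[1] != token:
            return False  # someone else wrote in between
        self.set(key, value)
        return True


def update_with_cas(cache, key, fn, max_retries=5):
    """Optimistic read-modify-write: re-read and retry when the token is stale."""
    for _ in range(max_retries):
        value, token = cache.gets(key)
        if token is None:
            return False  # key missing; the caller decides whether to add it
        if cache.cas(key, fn(value), token):
            return True
    return False


if __name__ == "__main__":
    cache = FakeCache()
    cache.set("stock", 10)
    print(update_with_cas(cache, "stock", lambda v: v - 1), cache.gets("stock")[0])
```

Bounding the retries matters: under heavy contention on one key, an unbounded loop turns a hot key into a CPU and network problem on the client side.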

Memcached vs Redis 的場景區分

子議題：

純 cache 不需 data types → Memcached 更輕量
Session store / counter / hot key 兩者都行
Leaderboard / sorted set / Streams / Pub/Sub → 只 Redis
Distributed lock → Redis（Memcached CAS 不夠強）
持久化（cache warmup 後不想全失）→ Redis（RDB / AOF）

排錯快速判讀

Hit rate 下降

操作原則：先看 eviction 是否提高、再看 key naming 是否變動。

printf 'stats\r\nquit\r\n' | nc localhost 11211 | grep -E "get_hits|get_misses|evictions"
# Compute the hit rate from get_hits / get_misses; steadily growing evictions indicate memory pressure
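
Turning the raw counters into a hit rate is a one-liner once the STAT lines are parsed. The sketch below works on captured stats text; the SAMPLE string is made-up illustration data, not output from a real server.

```python
def parse_stats(raw):
    """Parse 'STAT <name> <value>' lines into a dict of strings."""
    stats = {}
    for line in raw.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def hit_rate(stats):
    """get_hits / (get_hits + get_misses); 0.0 when there were no reads yet."""
    hits, misses = int(stats["get_hits"]), int(stats["get_misses"])
    total = hits + misses
    return hits / total if total else 0.0


# Illustrative capture only; real values come from the stats command.
SAMPLE = "STAT get_hits 900\r\nSTAT get_misses 100\r\nSTAT evictions 42\r\nEND\r\n"

if __name__ == "__main__":
    stats = parse_stats(SAMPLE)
    print(f"hit rate {hit_rate(stats):.1%}, evictions {stats['evictions']}")
```

The counters are cumulative since process start, so trend analysis needs the difference between two captures rather than a single snapshot.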

Eviction 增加（memory pressure）

操作原則：超過 -m 設定的 memory limit、Memcached 用 LRU evict 老 key。看 stats slabs 哪些 slab class 最常 evict、可能要 slab reassign。

Slab fragmentation

操作原則：value size 跟 slab class 不對齊造成 wasted memory。判讀：stats slabs 看每個 slab class 的 used vs total chunks。

Client-side sharding 不平衡

操作原則：node 加減後、ketama 應 minimum 移動、但實際分布可能因 key 集中而偏斜。判讀：每個 node 的 stats 看 key count + memory usage 是否均衡。

Connection 耗盡

操作原則：每個 client 開太多 connection、Memcached 預設 max 1024 connection。看 stats curr_connections。

何時改走其他服務

需求形狀	改走
需要 data types（hash / list / set）	Redis / Valkey
需要持久化 / 半持久化	Redis with AOF / RDB
需要 distributed lock	Redis（Redlock 或 SETNX）
需要 Pub/Sub / Streams	Redis / Kafka / NATS
多核高 throughput	DragonflyDB
AWS managed	AWS ElastiCache for Memcached
Process-local cache	Caffeine / Guava Cache（JVM 內、無網路）

不在本頁內的主題

各語言 Memcached client 完整 API
Memcached internal data structure 細節
Custom binary protocol 實作
ASCII vs binary protocol 完整對照

案例回寫

直接相關案例

案例	對 Memcached 的對應
2.C2 Meta mcrouter	mcrouter 是 Memcached 專屬 protocol-aware routing proxy、處理跨叢集 / 跨區流量收斂與失效隔離
2.C6 Netflix EVCache	EVCache 基於 Memcached、Netflix 加上跨 AZ replication + client-side smart routing
2.C8 Meta TAO	TAO 底層用 Memcached 作為 graph 資料的快取層、上層加一致性 / 關聯查詢能力
2.C1 Meta cache consistency	Meta 大規模 Memcached 部署的 invalidation / shard move 一致性治理

跨 vendor 對照

案例	對 Memcached 的對應
2.C9 Cache Stampede	通用、Memcached 也需 TTL jitter / lease / probabilistic early expiration
2.C10 規模對照	小型 single instance / 中型 client-side ketama / 大型 mcrouter 路由 + 跨區 pool
2.C4 Meta CacheLib + Kangaroo	CacheLib 是 Memcached 之後 Meta 的分層 cache library、處理 DRAM 經濟極限後的議題
2.C3 Shopify serialization	Payload 編碼遷移在 Memcached 上一樣適用、雙軌策略不依賴 vendor
2.C5 Shopify write-through	Write-through 模式 Memcached 用 SET + CAS 實作、不像 Redis 有 Lua / transaction 可組合

下一步路由

上游概念：2.2 Cache Aside、2.3 TTL eviction
平行 vendor：Redis、AWS ElastiCache
下游能力：2.4 cache data shape

NATS

Fri, 01 May 2026 00:00:00 +0000

NATS 是 lightweight high-performance messaging system、承擔三個責任：subject-based routing（hierarchical wildcards）、low-latency messaging（Core NATS、fire-and-forget）、選擇性持久化（JetStream、streams + KV + Object Store）。設計取捨偏向「協議極簡、運維輕、必要時才開持久化」、適合微服務通訊跟 edge 場景。

對「微服務 messaging、IoT/edge、Request/Reply、需要 messaging + KV 一體」這條路徑、NATS 是輕量首選。本頁先給最短路徑、再展開日常 publish / subscribe 與 subject 設計、最後進階治理（JetStream、supercluster、leaf node）跟排錯。

本章目標

讀完本章後、你應該能：

用 nats-server 跑起 NATS（含 JetStream）、驗證 broker 健康
用 nats CLI publish / subscribe、看 subject hierarchy 匹配
區分 Core NATS（fire-and-forget）vs JetStream（durable）的選用判讀
看懂 stream 配置、consumer 配置、pending 訊號
評估 supercluster、leaf node、KV / Object Store 等延伸場景

最短路徑：5 分鐘把 NATS 跑起來

# 1. Start the NATS server (-js enables JetStream, -m 8222 opens the monitoring port)
docker run -d --name nats -p 4222:4222 -p 8222:8222 nats:latest -js -m 8222

# 2. Publish / subscribe with the nats CLI (the CLI is available in the natsio/nats-box container)
#    docker run --rm --network host natsio/nats-box nats 
nats --server nats://localhost:4222 pub demo.hello "world"
nats --server nats://localhost:4222 sub "demo.>"   # keep this subscription running in a second shell

# 3. Create a JetStream stream + pull consumer (persistence + ack)
nats --server nats://localhost:4222 stream add demo --subjects 'demo.>' \
  --storage file --retention limits --discard old --defaults
nats --server nats://localhost:4222 consumer add demo worker \
  --pull --deliver all --ack explicit --filter 'demo.>' --defaults

最短路徑驗證「Core NATS + JetStream 都可用」。實際寫程式用 nats client library、見日常操作。

日常操作與決策形狀

CLI 與 client API

子議題：

nats CLI 指令對照表（pub / sub / stream / consumer / kv）
監控 endpoint（/varz / /connz / /jsz HTTP）
Client library 配置：connection / reconnect / timeout / async / sync subscribe
對應指令範例：nats stream info 、nats consumer info

Subject hierarchy 與 wildcard

Subject 是 NATS 路由的核心、層級式設計：

層級用 . 分隔（例：orders.created.us-west）
單層 wildcard *（匹配一層）
多層 wildcard >（匹配剩餘所有層）
Subject 命名規範與 ownership
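The wildcard rules above can be sketched as a small matcher. This is an illustration of the matching semantics only, not the NATS server implementation; the function name `subject_matches` is made up for this example, and it does not validate subjects (empty tokens, illegal characters).

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a NATS-style subscription pattern matches a subject."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' is only valid as the last token and needs at least one
            # remaining subject token to match.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False  # pattern is longer than the subject
        if p != "*" and p != s_tokens[i]:
            return False  # literal token mismatch ('*' matches any one token)
    # Without '>', both sides must have the same number of tokens.
    return len(p_tokens) == len(s_tokens)
```

For example, `orders.*.us-west` matches `orders.created.us-west`, while `orders.*` does not, because `*` covers exactly one level.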

Core NATS vs JetStream

子議題：

Core NATS：fire-and-forget、無持久化、極低延遲、適合即時通知 / 控制信號
JetStream：append-only stream + durable consumer、適合需要 replay / 持久化的事件流
兩者並存設計（同一 NATS server 同時跑）

Request/Reply 與 Queue groups

子議題：

Request/Reply pattern（RPC over messaging）
Queue groups（load balancing、多 subscriber 分擔同 subject）
Pub/Sub vs Queue groups 的差異
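The difference between plain Pub/Sub and queue groups can be shown with a toy model: every plain subscriber receives each message, while each queue group delivers it to exactly one of its members. `ToyBroker` is a hypothetical in-memory class for a single subject, not the NATS client API, and the random member choice is an assumption made for the sketch.

```python
import random
from collections import defaultdict


class ToyBroker:
    """Toy single-subject broker that models delivery fan-out only."""

    def __init__(self, seed=None):
        self._plain = []                    # plain subscribers: all receive
        self._queues = defaultdict(list)    # queue group name -> members
        self._rng = random.Random(seed)

    def subscribe(self, name, queue=None):
        if queue is None:
            self._plain.append(name)
        else:
            self._queues[queue].append(name)

    def publish(self, msg):
        """Return the (subscriber, message) deliveries for one publish."""
        deliveries = [(name, msg) for name in self._plain]
        for members in self._queues.values():
            # One member per queue group gets the message.
            deliveries.append((self._rng.choice(members), msg))
        return deliveries
```

With two plain subscribers and a three-member `workers` group, one publish produces three deliveries: both plain subscribers plus one worker.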

進階主題（按需閱讀）

JetStream 已展開為兩篇 deep article：core 到 JetStream 邊界（採用決策入口）、JetStream 設計與 supercluster/leaf node（stream / consumer / 跨區拓樸 / 多租戶完整實作）。下列子議題段保留選題判讀入口。

JetStream stream 設計

子議題：

Stream 配置（subjects、retention policy、storage type）
File-based vs Memory-based storage
MaxMsgs / MaxBytes / MaxAge（保留策略）
Replicas（JetStream raft、跨節點一致性）

JetStream consumer 設計

子議題：

Durable vs ephemeral consumer
Push vs pull consumer
Ack 策略（explicit ack / all / none）
AckWait + MaxDeliver + DeliverPolicy（重試控制）

Cluster / Supercluster / Leaf node

子議題：

Cluster：單一 region 多 broker、JetStream raft 同步
Supercluster：跨 cluster gateway、跨區延展
Leaf node：邊緣節點、subject mapping、適合 IoT / edge 場景
對應 3.C8 Cloudflare Queues 全球交付的對照思路

JetStream KV / Object Store

子議題：

KV store（基於 JetStream、簡單 key-value）
Object Store（基於 JetStream、大 blob）
何時用 NATS KV vs 真的 KV 服務（Redis / etcd）

Subject-based ACL 與多租戶

子議題：

Account 隔離（multi-tenancy 主機制）
Subject-level permission（publish / subscribe）
Cross-account import / export

排錯快速判讀

Consumer pending 累積

操作原則：先看 pending 是 ack-pending 還是 stream backlog、再定位 consumer 慢 vs stream 寫入過快。

nats --server nats://localhost:4222 consumer info  
# Compare Unprocessed Messages (stream backlog) with Outstanding Acks and Redelivered Messages (ack-pending) to tell the two kinds of build-up apart

Stream 超 retention limit

操作原則：超 MaxBytes / MaxMsgs 時 stream 觸發 discard policy、看是 old discard 還是 new discard。
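The two discard policies behave differently once a limit is hit: discard old drops the oldest message to make room, discard new rejects the incoming publish. A minimal sketch, assuming a stream limited only by message count; `ToyStream` and `StreamFull` are hypothetical names, not JetStream API.

```python
from collections import deque


class StreamFull(Exception):
    """Raised when a discard-new stream rejects a publish."""


class ToyStream:
    def __init__(self, max_msgs: int, discard: str = "old"):
        if discard not in ("old", "new"):
            raise ValueError("discard must be 'old' or 'new'")
        self.max_msgs = max_msgs
        self.discard = discard
        self.msgs = deque()

    def append(self, msg):
        if len(self.msgs) >= self.max_msgs:
            if self.discard == "new":
                # The publisher sees an error; stored messages are untouched.
                raise StreamFull(f"stream holds {self.max_msgs} messages")
            # discard old: evict the oldest message to make room.
            self.msgs.popleft()
        self.msgs.append(msg)
```

With discard old, publishers never see an error but history is silently lost; with discard new, history is kept and the publisher has to handle the rejection.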

Leaf node 連線不穩

操作原則：邊緣節點到 hub 的網路品質決定 subject mapping 延遲、看 reconnect 次數與 latency。

Subject 路由錯誤

操作原則：wildcard 設計錯導致訂閱不到、或匹配過多。看 subject hierarchy 規範與實際 subject。

JetStream raft 不一致

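Operating principle: an R3 stream needs a Raft quorum of 2 of its 3 replicas. With two healthy nodes it keeps serving but has no fault tolerance left; with only one healthy node it loses quorum and stops accepting writes until a second replica returns. Check cluster info and raft state.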
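Raft-style replication needs a strict majority of replicas to make progress. The arithmetic is small enough to write down; the function names here are illustrative only.

```python
def raft_quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replicas // 2 + 1


def has_quorum(replicas: int, healthy: int) -> bool:
    """True if the healthy replicas can still form a majority."""
    return healthy >= raft_quorum(replicas)
```

For a replication factor of 3 the quorum is 2, so the stream tolerates one lost node; for 5 it is 3, tolerating two.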

何時改走其他服務

需求形狀	改走
高吞吐事件流（百萬 msg/sec）	Kafka
複雜 routing（exchange model）	RabbitMQ
Managed queue（AWS / GCP）	SQS / Pub/Sub
Redis 生態已存在	Redis Streams
大型企業生態整合	RabbitMQ / Kafka（社群更大）
Managed NATS	Synadia Cloud

不在本頁內的主題

各語言 client 完整 API（依官方文件）
NATS 跟 gRPC 的對比（在分散式通訊章節）
Synadia Cloud 商業功能

案例回寫

NATS 專屬案例（C34-C41）

案例	主討論議題
3.C34 Netlify data plane	全球 metrics / logs fan-out
3.C35 Form3 multi-cloud	JetStream Leaf Node 跨雲低延遲支付
3.C36 Intelecy IoT	工業 IoT / BoltDB → JetStream
3.C37 MachineMetrics edge	Leaf node + KV + Object Store + 多租戶 Auth
3.C38 Clarifai ML	NATS Streaming queue group / at-least-once
3.C39 Choria fleet	Request/Reply + Queue group / 50 萬 server
3.C40 Resgate API gateway	Subject hierarchy 即 schema / Core NATS
3.C41 i-flow OT/IT	多工廠 leaf node hub-and-spoke

跨 vendor 對照

案例	對 NATS 的對應
3.C8 Cloudflare Queues	全球交付對照：leaf node + supercluster
3.C10 規模對照	小型 messaging / 中型 JetStream / 大型 supercluster

下一步路由

上游概念：0.3 非同步選型、3.1 broker basics
平行 vendor：Kafka、RabbitMQ
下游能力：3.4 consumer 設計、3.6 processing recovery semantics

systemd

Fri, 01 May 2026 00:00:00 +0000

systemd 是 Linux 主流 init system、承擔三個責任：service unit lifecycle（start / stop / restart / reload）、signal + journald + cgroups 整合、socket activation + timer（cron 替代）。設計取捨偏向「OS-level 整合 + 單機資源管理 + dependency graph」、適合 VM / bare metal 上單機服務、不需要 cluster orchestration 的場景。

對「VM / bare metal 服務管理、邊緣 / appliance、單機 lifecycle + journal + cgroups」這條路徑、systemd 是 Linux 主流選擇。

本章目標

讀完本章後、你應該能：

寫 service unit file、配置 Type / Restart / ExecStart
設計 signal handling + graceful shutdown
用 journald + journalctl 查 logs
設定 cgroups v2 resource limit
用 socket activation / timer 替代 inetd / cron

最短路徑：5 分鐘把 systemd service 跑起來

# 1. Create the unit file (needs root or sudo)
cat > /etc/systemd/system/myapp.service <<'UNIT'
[Unit]
Description=My Application
After=network.target

[Service]
ExecStart=/usr/bin/myapp --config /etc/myapp/config.yaml
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
UNIT

# 2. Enable and start
systemctl daemon-reload
systemctl enable --now myapp

# 3. Verify
systemctl status myapp
journalctl -u myapp -f

日常操作與決策形狀

Unit file 設計

子議題：

Unit type：service / socket / timer / target / mount / path
Service Type：simple / forking / oneshot / notify / dbus
Restart：no / on-failure / on-abnormal / always
ExecStart / ExecStop / ExecReload
對應指令：systemctl cat myapp.service、systemctl edit

systemctl 指令

子議題：

Lifecycle：start / stop / restart / reload / enable / disable
Status：status / is-active / is-enabled / list-units
Reload after edit：daemon-reload
對應指令範例：systemctl status myapp、systemctl list-units --failed

journald 日誌

子議題：

結構化日誌（kv pairs）
journalctl filter（-u / --since / -p / -f）
對應 logging：persistent vs runtime journal
跟外部 log forwarder（Vector / Fluent Bit）對接

進階主題（按需閱讀）

Signal handling + graceful shutdown

子議題：

SIGTERM（default stop signal）/ SIGKILL（force kill after timeout）
TimeoutStopSec：grace period
應用程式要 trap SIGTERM 做 cleanup
對應 Platform lifecycle contract（concept 通用）
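On the application side, trapping SIGTERM can look like the sketch below: the handler only sets a flag, the main loop notices it, and cleanup runs before the process exits, ideally well inside TimeoutStopSec. `run`, `work_once`, and `cleanup` are hypothetical names, and the sketch assumes a single-threaded worker on Linux.

```python
import signal
import threading

# Set by the signal handler, polled by the main loop.
stop_requested = threading.Event()


def _handle_sigterm(signum, frame):
    # Keep the handler tiny: just record that a stop was requested.
    stop_requested.set()


def install_handlers():
    # Must be called from the main thread.
    signal.signal(signal.SIGTERM, _handle_sigterm)


def run(work_once, cleanup, poll_seconds=1.0):
    """Call work_once() until SIGTERM arrives, then run cleanup() once."""
    install_handlers()
    while not stop_requested.is_set():
        work_once()
        # Sleep, but wake up early if the stop flag gets set.
        stop_requested.wait(poll_seconds)
    cleanup()
```

If cleanup can take longer than the grace period, systemd follows up with SIGKILL, so the cleanup budget and TimeoutStopSec need to be sized together.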

cgroups v2 + resource limit

子議題：

CPUQuota / MemoryMax / IOWeight / TasksMax
Slice unit（樹狀 resource 限制）
跟 Kubernetes 的 resource limit 對比（K8s 用 cgroups 但抽象更高）
對應指令：systemd-cgls、systemd-cgtop

Socket activation

子議題：

用 .socket unit 持有 listening socket、service 啟動時繼承
啟動延遲：socket 一直在、service 按需起
替代 inetd
適合 occasional service / low-traffic
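From the service's point of view, socket activation means the listening sockets arrive as already-open file descriptors starting at fd 3, announced through the `LISTEN_PID` and `LISTEN_FDS` environment variables (the convention documented for `sd_listen_fds`). A minimal sketch without the systemd Python bindings; the helper names are made up for this example.

```python
import os
import socket

SD_LISTEN_FDS_START = 3  # first inherited descriptor under the sd_listen_fds convention


def listen_fd_numbers(environ=None, pid=None):
    """Return the inherited fd numbers, or [] if not socket-activated."""
    env = os.environ if environ is None else environ
    pid = os.getpid() if pid is None else pid
    # LISTEN_PID guards against the variables leaking into child processes.
    if env.get("LISTEN_PID") != str(pid):
        return []
    count = int(env.get("LISTEN_FDS", "0"))
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count))


def activated_sockets():
    """Wrap each inherited fd in a socket object (family and type auto-detected)."""
    return [socket.socket(fileno=fd) for fd in listen_fd_numbers()]
```

A service written this way can fall back to binding its own port when `activated_sockets()` returns an empty list, which keeps it runnable outside systemd.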

systemd timer

子議題：

.timer unit 替代 cron
OnCalendar / OnUnitActiveSec / RandomizedDelaySec
跟對應 .service unit 配對
比 cron 強：journal log / dependency / 失敗 restart

Portable services + systemd-run

子議題：

systemd-run：ad-hoc 跑 transient unit
Portable services：把 service + image 一起搬
systemd-nspawn 容器（systemd 自家輕量容器）

跟 container 整合

子議題：

跑 podman container 在 systemd（quadlet / generators）
Docker daemon 由 systemd 管
K8s kubelet 由 systemd 管（cluster node）
對應 single-node container management

排錯快速判讀

Service start failure

操作原則：先 systemctl status、再 journalctl -u 看 log。

systemctl status myapp                # Active state, Main PID, and the most recent log lines
journalctl -u myapp --since=-5m       # full log for the last 5 minutes

Restart loop

操作原則：Restart 配置不當 + StartLimit 觸發。判讀：systemctl status 看 restart count + RateLimit。
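Whether a crash loop ever trips the start limit depends on how RestartSec relates to the limit window. The sketch below is a simplified model, assuming the service crashes immediately on every start and that the systemd defaults apply (StartLimitBurst=5 within StartLimitIntervalSec=10s); it is not systemd's actual rate-limiter code, and it ignores the exact boundary behaviour.

```python
def hits_start_limit(restart_sec: float, burst: int = 5,
                     interval_sec: float = 10.0) -> bool:
    """True if an instantly-crashing service exceeds the start limit.

    Start number k (0-based) happens at roughly k * restart_sec, so the
    (burst + 1)-th start falls at burst * restart_sec. It is refused only
    if that moment is still inside the limit window.
    """
    return burst * restart_sec < interval_sec
```

With the default RestartSec of 100ms the sixth start arrives after about 0.5s and the unit enters the failed state. With RestartSec=5, as in the quick-start unit, starts are spaced too far apart to ever reach five inside ten seconds, so the service restarts indefinitely unless the limits are tuned.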

journald disk full

操作原則：journal storage 超 SystemMaxUse 設定。判讀：journalctl --disk-usage、/etc/systemd/journald.conf 設限。

cgroup OOM

操作原則：MemoryMax 超過、系統 OOM kill。判讀：journalctl -k 看 kernel oom 訊息。

Dependency 不對

操作原則：unit 依賴 network / db 但 After= 沒設。判讀：systemctl list-dependencies myapp。

何時改走其他服務

需求形狀	改走
多實例 cluster	Kubernetes
Container workflow 為主	Docker / Podman
Process supervisor（非 init）	supervisord / runit
Cron-only 場景	純 cron / systemd timer
Non-Linux（Windows / macOS）	Windows Service / launchd
邊緣 K8s	K3s（systemd 上跑 K3s）

不在本頁內的主題

完整 unit file directive reference
systemd internals（dbus / pid 1）
各 distro systemd 版本差異
systemd-resolved / systemd-networkd 等其他 component

案例回寫

跨 vendor 對照

案例	對 systemd 的對應
5.C9 cutover without drain	systemd 服務切換要靠 ExecStop / TimeoutStopSec / SIGTERM trap 等價 drain
5.C10 規模對照	小規模 VM 服務首選 systemd、跨規模升階到 K8s 時要保留 unit-level 回退腳本

待補 systemd 案例：大規模 fleet（HashiCorp Nomad 跟 systemd 整合）、IoT / edge appliance 案例、systemd portable services 落地案例。

下一步路由

上游概念：5.1 container runtime
平行 vendor：Kubernetes、Docker
下游能力：06 reliability（graceful shutdown）、4 observability（journald）

AWS IAM Identity Center

Mon, 18 May 2026 00:00:00 +0000

AWS IAM Identity Center 是 AWS 原生的 workforce SSO 控制面、前身為 AWS SSO（2022 改名）。它承擔三個責任：人類身份進 AWS 多帳號的 統一入口（Access Portal）、把使用者映射到各帳號 IAM role 的 Permission Set 模板、以及對少量已整合 SAML app 的 SSO gateway。它不是 AWS IAM 的替代品、是疊在 AWS IAM 之上的 人類入口層。

服務定位

IAM Identity Center 是 人類身份進 AWS 的 portal、不是 cloud resource permission engine。它跟 AWS IAM 的分工是兩層：Identity Center 管「人是誰、能登入哪些 account」、AWS IAM 管「進到 account 後對 resource 能做什麼」。實際機制是 Identity Center 透過 Permission Set 在每個目標 account 建一個 AWSReservedSSO_* 命名的 IAM role、使用者 assume 該 role 拿短期 STS token。

跟 Okta 相比、Identity Center 的核心優勢是 跟 AWS Organizations + Control Tower 原生整合、Permission Set 可以一次發佈到數百個 account、不必每個 account 各接 SAML。代價是 SaaS app integration 量級遠少於 Okta（Okta 7000+ 預建、Identity Center 僅中等規模）、跨雲 federation（GCP / Azure）也不在原生範圍。

許多大型組織採三層架構：Okta 是 HRIS 下游的 identity source of truth、SCIM push 進 Identity Center、Identity Center 再 map 到 AWS IAM Permission Set。Okta 管「人是誰」、Identity Center 管「AWS portal 入口」、AWS IAM 管「resource 能做什麼」。中小組織可以省略 Okta、直接用 Identity Center 內建 user store、但就失去跨 SaaS 統一 SSO。

本章目標

讀完本頁、讀者能判斷：

Identity Center 在 人類身份 / AWS portal / resource permission 三層裡的位置、何時該交回 AWS IAM 或上游 IdP
Identity Source 選擇（內建 / Active Directory / 外部 SAML）對 lifecycle 與 lock-in 的長期影響
Permission Set / Account Assignment / Access Portal 三個核心概念的稽核重點
何時 Identity Center 夠用、何時要疊 Okta 在前、何時 Identity Center 反而是錯選擇

最短判讀路徑

判斷 Identity Center 配置是否健康、最少看四件事：

誰能 assume 哪個 role：Permission Set 跟 Account Assignment 是否走最小權限、AdministratorAccess 範圍 Permission Set 是否限定 break-glass、是否強制 phishing-resistant 認證才能 assume 高權限
Permission Set 邊界：每個 Permission Set 的 session duration（預設 1 hour、可調 12 hour）、inline policy vs Customer Managed Policy reference、是否用 ABAC tag 收斂跨 account 散佈
External IdP federation 狀態：Identity Source 是內建 / AD / 外部 SAML、若走外部 IdP SCIM push 是否監控 sync 失敗、signing certificate 是否在 rotation 排程內
CloudTrail 是否完整：Identity Center 事件分布在 management account 跟 member account、是否有 organization trail 收齊、admin 變更 / Permission Set 變更 / failed assume 是否 alert

四件事任一缺失、就是 Audit Log 與 Authorization 邊界的待補項目。

日常操作與決策形狀

Identity Source 是根信任：Identity Center 支援三種 user/group 來源 — 內建 store、AWS Managed AD / on-prem AD via AD Connector、外部 SAML IdP（Okta / Entra ID 等、SCIM 推進來）。選了之後 user lifecycle 從哪來就鎖死、換 Identity Source 是大工程（要重建所有 Permission Set assignment、舊 user GUID 不通用）。早期決定錯比 Permission Set 設錯難救。

Permission Set 是 cross-account role template：定義一次、apply 到多 account、實際在每個 account 部署成一個 AWS-Reserved 命名的 IAM role。Permission Set 本身不是 role、是 role 的部署模板 — 改 Permission Set 會 push 到所有 account 上對應的 role。Customer Managed Policy reference 比 inline policy 好維護、但要先確保每個 target account 都有同名 policy、否則 assignment 會失敗。

Account Assignment：把 user/group 綁到 Permission Set + 特定 account 的三元組。這層用 group 而不是個別 user、跟著 Identity Source 的 group 變動自動同步。臨時權限（離職員工延長、incident 應變）走 access request workflow 或 IAM Access Analyzer + Just-in-Time、不要永久 assignment。

Access Portal URL 是 phishing 目標：custom URL（https://.awsapps.com/start）設定後變成員工每天用的入口、phishing 攻擊會 mimic。要強制 phishing-resistant MFA（WebAuthn / passkey）、純 push MFA 抗不過 fatigue。CLI 走 aws sso login 自帶 browser-based flow、不要叫員工複製貼 access key。

Application assignment：Identity Center 也能管 SAML app 的 SSO assignment、但 integration 數量遠少於 Okta。大量 SaaS app 的場景應該疊 Okta 在前、Identity Center 只管 AWS portal。

核心取捨表

取捨維度	IAM Identity Center	Okta + AWS IAM	直接用 AWS IAM Users（不推薦）
控制面責任	AWS 託管、限 AWS 帳號 + 中等 SAML app	Okta 管人類身份、AWS IAM 管 resource、兩層分工	每個 account 各自管 user、無跨帳號統一
多帳號統一入口	原生、Permission Set 一次發到全 Org	透過 SAML federation 到 IAM role	不存在 — 每個 account 各自 IAM Users
SaaS app 範圍	中等規模 integration	7000+ 預建 integration	無
Lifecycle	內建 / AD / 外部 SCIM 進來	Okta 走 HRIS SCIM 同步、Identity Center 接 Okta SCIM	手動管理、容易 stale
退場成本	中 — AWS 內部換	高 — Okta + Identity Center 都要拆	高 — 大量 IAM Users 散佈在 N 個 account
適合場景	AWS-heavy、員工數中等、SaaS app 少	多雲 + 大量 SaaS + AWS 帳號數十個以上	不存在合理場景（small lab 例外）

選 Identity Center 的核心訴求：AWS 是主要工作環境、員工 SaaS app 用量低、要統一多帳號入口而不要再付 Okta 訂閱。員工大量用 SaaS 的場景應該疊 Okta 在前。

進階主題

External IdP federation（Okta / Entra ID SCIM 進來）：Identity Center 接外部 IdP 是 push model — IdP 主動 SCIM push、Identity Center 不 pull。push provisioning 失敗會 silent（IdP 端有 log、Identity Center 端只看到 user 沒出現）、要在 IdP 端設 sync failure alert。SAML signing certificate rotation 兩邊都要排程、過期會整個 federation 斷。

Multi-account Permission Set 設計：避免每個 environment / team 各自一份 Permission Set — 用 ABAC（tag-based access control）把「Environment=Prod + Team=Payments」的條件寫進一個 Permission Set 的 policy、tag 跟著 user attribute 跑。Permission Set 數量爆炸是 Identity Center 老化最常見訊號。
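One way to express the ABAC idea is a single policy whose condition compares resource tags with the caller's principal tags, so one Permission Set serves every team. The sketch builds such a policy document as a Python dict; the tag keys `Team` and `Environment` and the EC2 actions are assumptions chosen for illustration, and the attributes must be configured as access control attributes in Identity Center for the principal tags to exist.

```python
import json


def abac_policy(tag_keys=("Team", "Environment")):
    """Build an IAM policy that allows actions only when tags match."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowWhenTagsMatch",
                "Effect": "Allow",
                "Action": ["ec2:StartInstances", "ec2:StopInstances"],
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        # The resource tag must equal the caller's session tag.
                        f"aws:ResourceTag/{key}": f"${{aws:PrincipalTag/{key}}}"
                        for key in tag_keys
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(abac_policy(), indent=2))
```

Adding a new team then means tagging its resources and users, not creating another Permission Set.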

Customer Managed Policy reference：Permission Set 可以 reference target account 裡的 customer managed policy（同名同 path）、policy 本身在每個 account 獨立維護。比 inline policy 適合大規模、但要靠 CI / Terraform 確保 policy 在所有 target account 同步存在、否則 assignment 失敗。

Session duration 是攻擊面：預設 1 hour、可調到 12 hour。長 session 對 dev 體驗友善、但不利於 credential rotation — 高權限 Permission Set（AdministratorAccess、production write）應該短 session（1-2 hour）、低風險 read-only 可放 8-12 hour。

IAM Identity Center API 不該當 workforce IdP 用：API 是給 admin 管 assignment 用、不是給 app 拿 user token。要 workforce app SSO 走 SAML / OIDC federation、不要叫 app 打 Identity Center API 查 user。

排錯與失敗快速判讀

Permission Set 數量爆炸：每個 team / environment 各一份、上百個 Permission Set 沒人敢動 — 改用 ABAC + user attribute 把條件寫進 policy、收斂到十位數
Identity Source 選錯難換：早期選內建 store、後來公司導入 Okta 要換成外部 SAML — 整個 user GUID 重新映射、Permission Set assignment 重綁、評估比建新 tenant 還久
External SCIM sync 失敗 silent：Okta 端 push 失敗、Identity Center 沒人 — 要在上游 IdP 設 SCIM provisioning failure alert、不要等使用者反映「我登不進去」
Access Portal URL 被 phishing：custom URL 員工記憶、phishing 站 mimic、無 phishing-resistant MFA 擋不住 — 強制 WebAuthn / passkey、員工教育只認 bookmark / SSO launcher
CloudTrail 不完整：只開 management account trail、member account 的 role assumption 看不到 — 開 organization trail 收齊、特別 alert Permission Set 變更與失敗 assume
Break-glass 缺席：Identity Center 控制面故障時 console 進不去 — 保留每個 account 的 root credential（離線存）跟少數 break-glass IAM User（hardware MFA、與 Identity Center 獨立 audit）、季度驗證

何時改走其他服務

需求形狀	改走
大量 SaaS app 統一 SSO	Okta vendor（疊在 Identity Center 前）
Customer / B2C identity	Auth0 vendor
自管 / 不接受 cloud-managed IdP	Keycloak vendor
AWS resource permission（policy / role / STS）	AWS IAM vendor
跨雲 federation（GCP / Azure workforce）	Google Cloud IAM / Azure RBAC
Secret / API key 治理	7.6 秘密管理與機器憑證治理

不在本頁內的主題

AWS IAM 的 policy / role / STS 機制細節（屬 AWS IAM vendor 頁）
Permission Set 的 JSON policy 撰寫教學
AWS Organizations / Control Tower 的完整架構
各 SaaS app SAML 接線教學

案例回寫

案例	跟 IAM Identity Center 的關係
Azure AD Identity Control Plane 2021	Identity Center 控制面故障會擋住 AWS console portal、降級路徑必須事先設計（emergency root credential、break-glass IAM User）
Failure: Credential Rotation Without Scope	Permission Set session duration 跟 external IdP signing key rotation 是不同域、要分開排程、不能混為一談
Okta Support System Incident 2023	Okta 作為 Identity Center 的 external IdP 時、上游事件會傳導下來、Identity Center 端要看 SCIM sync 異常與 federation token reuse
Cloudflare 2023 Okta Token Follow-Through	上游 IdP 出事後、Identity Center 端的 active session 是否要強制 reauth、不能等供應商公告

下一步路由

上游：7.2 身分與授權邊界、7.13 偵測覆蓋率與訊號治理
平行：Okta vendor（外部 IdP 疊在前）、Auth0 vendor、Keycloak vendor
下游：AWS IAM vendor（Permission Set 落地的 resource permission 層）、Google Cloud IAM / Azure RBAC（多雲對照）
跨模組：8 事故處理 vendor 清單（Identity Center 事件如何 routing 進 IR 流程）
官方：AWS IAM Identity Center Documentation

AWS KMS

Mon, 18 May 2026 00:00:00 +0000

AWS KMS 是 AWS 原生的 key management service、解決 對稱 / 非對稱金鑰生命週期管理 與 envelope encryption pattern：service 內部保管 master key（KMS Key）、應用層用 GenerateDataKey 取得短暫的 data key 對實際資料加密、master key 完全不離 KMS 服務邊界。整合面跟 AWS IAM / AWS Secrets Manager / S3 / EBS / RDS 都串好、是 AWS 上幾乎所有靜態資料加密的後端。

服務定位

AWS KMS 的核心定位是 AWS-only 的 multi-tenant managed key management，FIPS 140-2 Level 3 認證、跨服務 envelope encryption 的共同地基。跟 CloudHSM 比、KMS 是 managed + shared HSM 池、CloudHSM 是 single-tenant dedicated HSM；需要更高隔離 / 自管 cluster / FIPS Level 3 single-tenant 時走 CloudHSM、或用 KMS Custom Key Store 把 KMS 後端指向自己的 CloudHSM。跟 Google Cloud KMS / Azure Key Vault 比、設計概念相近、但 KMS 把 secret store 切出去（Secrets Manager）、Key Vault 則把兩者合一。

跟 Vault transit engine 比、行為相似（key 不離 service、app 拿 ciphertext）、但治理面完全不同：KMS 綁 AWS 控制面、IAM + Key Policy 雙層授權、CloudTrail 是稽核入口；Vault transit 是跨雲統一介面、token + policy 為主、需要自管 cluster。AWS-heavy 組織首選 KMS、跨雲組織才會把 KMS 當下游、上游用 Vault transit 抽象。

本章目標

讀完本頁、讀者能判斷：

哪些資料 / 場景該用 Customer Managed KMS Key、哪些 AWS Managed Key 已經夠用、什麼時候直接走 CloudHSM
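How Key Policy, IAM, and Grant divide the three authorization layers, and which CloudTrail coverage (KMS calls are management events, so the check is that no trail excludes them) and monitoring scope production must have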
Multi-Region Key、Custom Key Store、External Key Store、BYOK 等進階形態的取捨
KMS 出事（IAM 過寬、Key Policy 把自己鎖死、Schedule Deletion 誤觸發）時的判讀路徑跟回退選項

最短判讀路徑

判斷一個 AWS KMS deployment 是否健康、最少看四件事：

Key Policy 設計：是否含 root principal（不然 key 變孤兒）、是否走 least privilege（不是 kms:* 給整個 account）、admin / user / monitor 三類 principal 是否分開、policy 變更是否走 PR review
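Grant governance: which service-to-service delegated authorizations go through Grants (rotation Lambda / RDS / EBS), whether each grant has an owner and a retiring principal (KMS grants have no TTL and never expire on their own), and whether stale grants are retired regularly with RetireGrant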
Multi-Region 與 rotation 策略：是否啟用 annual automatic rotation（適用 symmetric encryption key）、Multi-Region Key 的 replica 是否跟 DR plan 對齊、asymmetric / signing key 的 manual rotation 流程是否有 runbook
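CloudTrail coverage of KMS events: Encrypt / Decrypt / GenerateDataKey are recorded as management events, but CloudTrail lets a trail exclude AWS KMS events because of their volume. Confirm that at least one organization-wide trail still records them and retains them long enough. Without this layer forensics has nothing to work with; measured against Storm-0558, you could not answer "who used which key to sign which token"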

四件事任一缺失、就回到 7.6 秘密管理與機器憑證治理跟 Audit Log 的補丁清單。

日常操作與決策形狀

Key Type 選擇：symmetric encryption key（AES-256-GCM、最常用、S3 / EBS / RDS / Secrets Manager 都走這個）；asymmetric key pair（RSA / ECC、用於 sign / verify 或 encrypt / decrypt、JWT 簽署、CodeSign、文件簽章）；HMAC key（generate / verify MAC、API request signing）。對應 Storm-0558 signing key chain — 自己 host signing key 出事的核心教訓是 key 不該離 HSM service、所以 JWT signing 用 asymmetric KMS key 是 baseline 設計、private key 永遠不離 KMS。

Key Origin（key material 來源）：AWS_KMS（KMS 內部生成、預設）；EXTERNAL（BYOK、組織自己生成 key material、import 進 KMS、可以隨時 reimport 或刪除）；AWS_CLOUDHSM（Custom Key Store、key material 存在自己的 CloudHSM cluster）；EXTERNAL_KEY_STORE（XKS、AWS 外的 HSM、控制面在 AWS、key material 在 on-prem）。多數場景用 AWS_KMS 就夠、合規 / 主權需求才走 EXTERNAL / Custom Key Store。

Key Policy 跟 IAM 的雙層：KMS 跟其他 AWS service 最大差異是 Key Policy 是主要授權機制、IAM policy 單獨不夠。Key Policy 必含 arn:aws:iam::ACCOUNT_ID:root 給 root principal（不是 root user、是讓 IAM 能參與授權的開關）— 沒這條 key 變孤兒、即使 IAM 開了 admin 也救不回來。production 通常分三類 statement：admin（Create / Delete / Schedule、走 break-glass）、user（Encrypt / Decrypt / GenerateDataKey、給 app）、monitor（Describe / List、給 SRE）。
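A pre-deployment check for the root principal statement might look like the sketch below. The statement shape follows the default key policy ("Enable IAM User Permissions"), which delegates authorization to IAM and does not by itself grant every user access; `111122223333` is a placeholder account ID, and the helper names are hypothetical. The check only inspects the principal, not which actions are delegated.

```python
def root_statement(account_id: str) -> dict:
    """Key policy statement that lets IAM policies in the account apply."""
    return {
        "Sid": "Enable IAM User Permissions",
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
        "Action": "kms:*",
        "Resource": "*",
    }


def delegates_to_iam(policy: dict, account_id: str) -> bool:
    """True if some Allow statement names the account root principal."""
    root = f"arn:aws:iam::{account_id}:root"
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal")
        # Principal may be "*" (a string) or a dict such as {"AWS": ...}.
        aws = principal.get("AWS", []) if isinstance(principal, dict) else []
        if isinstance(aws, str):
            aws = [aws]
        if root in aws:
            return True
    return False
```

Running a check like this in CI before a key policy is applied catches the orphaned-key case before it reaches production.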

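Grants are programmatic delegated authorization: service-to-service integrations (Secrets Manager rotation Lambda, RDS automatic encryption, EBS volume attach) usually go through a Grant rather than a Key Policy change. Each grant has its own grant token, can be narrowed with grant constraints on the encryption context, and can be withdrawn with RetireGrant / RevokeGrant, so it is not permanently bound to the key policy. Grants carry no TTL and do not expire on their own, so someone has to retire them. Without governance, grants pile up into the thousands with nobody retiring them; this is the same class of problem as Failure: Credential Rotation Without Scope, where having no scope map amounts to having no governance.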

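Decoupling alias and Key ID: an alias (alias/my-app-prod-key) is a mutable pointer to a key; the key ID / ARN is the immutable identifier. Production code should use the alias, so that switching keys only requires re-pointing the alias and no deployment change. For cross-account use, pass the key ARN or alias ARN; a bare alias name only resolves inside the caller's own account and region.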

Key Rotation 的真實語義：annual automatic rotation（symmetric encryption key 才支援）換的是 KMS 內部的 backing key material、key ARN / Alias / Key ID 都不變、app 完全不需要動。舊資料仍用舊 backing key 解密、KMS 自動處理、不是「資料全部重新加密」— 這是常見誤解。asymmetric / HMAC key 不支援 automatic rotation、必須 manual 建新 key + alias 切換 + app 端雙讀容忍窗口（跟 JWT signing key rotation 同套路）。

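Multi-Region Key: KMS keys replicated across regions share key material and Key ID (prefixed with mrk-) rather than being new keys. Ciphertext encrypted in one region can be decrypted directly in another, with no cross-region API call. This suits multi-region active-active apps and DR scenarios. The cost is that permissions in the replica region and the primary region are governed separately; Key Policy is not synchronized automatically.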

Encryption Context 是 authenticated data：encrypt 時帶的 key-value pair（例：{"app": "billing", "tenant": "acme"}）、decrypt 必須提供同一組 context — 否則失敗。用來防 ciphertext 被 replay 到別的 context（攻擊者拿到 billing 的 ciphertext 想當 payroll 的 ciphertext 用）、所有 context 都會進 CloudTrail、是 forensic 上的關鍵欄位。production 一律帶 context、單純加密不帶 context 等於少一層防護。
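A typed wrapper keeps the context identical on both sides of the call. The sketch assumes a boto3-style KMS client passed in as `kms` (its `encrypt` and `decrypt` calls take `EncryptionContext` as a string-to-string map); the `BillingContext` fields mirror the example above, and the helper names are made up for this example.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BillingContext:
    """Fixed schema for the encryption context of billing data."""
    app: str
    tenant: str

    def as_dict(self) -> dict:
        return asdict(self)


def encrypt_for(kms, key_id: str, plaintext: bytes, ctx: BillingContext) -> bytes:
    resp = kms.encrypt(
        KeyId=key_id,
        Plaintext=plaintext,
        EncryptionContext=ctx.as_dict(),
    )
    return resp["CiphertextBlob"]


def decrypt_for(kms, ciphertext: bytes, ctx: BillingContext) -> bytes:
    # KMS rejects the call unless the context equals the one used to encrypt.
    resp = kms.decrypt(
        CiphertextBlob=ciphertext,
        EncryptionContext=ctx.as_dict(),
    )
    return resp["Plaintext"]
```

Because callers must construct a `BillingContext`, a missing or misspelled key fails at construction time instead of surfacing later as a decrypt error. Context values are logged in plaintext by CloudTrail, so they should never contain secrets.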

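Customer Managed vs AWS Managed vs AWS Owned: three tiers of control. Customer Managed (CMK: you control the Key Policy and choose the rotation), AWS Managed (aws/secretsmanager, aws/s3: AWS manages the Key Policy, which you can view but not change), AWS Owned (completely invisible, used by AWS itself, no CloudTrail events in your account). High-sensitivity production data should use Customer Managed keys, the only tier where you control the policy, the grants, and the rotation period.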

核心取捨表

取捨維度	AWS KMS	Google Cloud KMS	Azure Key Vault	AWS CloudHSM	Vault transit engine
部署模型	AWS managed multi-tenant、FIPS 140-2 Level 3	GCP managed multi-tenant、FIPS 140-2 L3	Azure managed、Standard / Premium tier	AWS managed single-tenant HSM cluster	自管 Vault cluster
跨雲	弱 — AWS-only	弱 — GCP-only	弱 — Azure-only	弱 — AWS-only	強 — 跨雲統一介面
授權模型	Key Policy（強制） + IAM + Grant 三層	IAM 為主、Resource policy 輔	Access policy + RBAC 雙模式	CloudHSM user / role + Cluster IAM	path-based policy + token
Multi-Region	Multi-Region Key（共用 key material）	自動跨 region replication 較易	Geo-replication 透過 Premium tier	自管 cross-region replication	Replication（Enterprise）
Envelope encryption	一級 pattern（`GenerateDataKey`）	一級 pattern	一級 pattern	自己實作	內建（transit engine）
Asymmetric signing	支援（RSA / ECC、JWT / CodeSign 直用）	支援	支援	支援 + 完整 PKCS#11	支援（部分）
整合面	全 AWS service 原生（S3 / EBS / RDS / Lambda）	全 GCP service 原生	全 Azure service 原生	PKCS#11 / JCE / OpenSSL	應用層 SDK
適合場景	AWS-heavy + envelope encryption + JWT signing	GCP-heavy	Azure-heavy + 跟 AD 整合	合規 / FIPS L3 single-tenant / 自管 HSM	跨雲 + key 不離 service
不適合場景	跨雲統一 custody、需 FIPS L4、需自管 HSM cluster	同左	同左	純 envelope encryption 用 KMS 即可	AWS-only 簡單需求（KMS 更便宜）

KMS 是 AWS 上的 預設選擇、CloudHSM 是合規 / 自管要求才上的昇級、Vault transit 是跨雲統一介面、Google / Azure 對標品在各自雲一樣是預設選擇。

進階主題

KMS Custom Key Store + CloudHSM 整合：Custom Key Store 把 KMS 的 控制面（API、Key Policy、CloudTrail、IAM 整合）保留、但 key material 存在自己的 CloudHSM cluster。組織需要 FIPS 140-2 Level 3 single-tenant 但又不想放棄 KMS 的 service 整合（S3 SSE-KMS / EBS encryption）時用。代價是 CloudHSM cluster 的運維成本（cluster HA、user 管理、backup）。

External Key Store (XKS)：更激進的形態 — key material 完全在 AWS 之外（on-prem HSM 或第三方 HSM）、AWS 透過 XKS proxy 呼叫外部 HSM 做 cryptographic operation。用於 資料主權 場景（金融 / 政府 / 跨境合規要求 key 不出組織邊界）、代價是 latency 跟 availability 完全綁外部 HSM、AWS service 整合面要算清楚。

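Multi-Region Replica Key and DR: when the primary region has an incident, the replica region can still decrypt existing ciphertext without a cross-region API call. But primary and replica each have an independent Key Policy, and changes are not synchronized automatically. As with Audit Log governance, the replica region also has to be covered by a CloudTrail trail that records KMS events.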

BYOK（Bring Your Own Key）：Origin = EXTERNAL 的 KMS Key、key material 由組織自己生成、用 wrapping key 加密後 import 進 KMS。優點是組織保有 master copy（KMS 出事時仍能 re-import 到別處）、缺點是 automatic rotation 不支援（必須手動 import 新 key material）、且必須自己處理 wrapping key 的生命週期。

跟 Secrets Manager 的整合：Secrets Manager 的 secret 本身用 KMS key 加密（預設 AWS Managed aws/secretsmanager、production 應該指到 Customer Managed CMK）。rotation Lambda 透過 Grant 取得 Decrypt + Encrypt 能力、跟 Secrets Manager 一起構成 static secret rotation 的證據鏈 — 跟 credential rotation scoped evidence 對齊。

Asymmetric signing 的 use cases：JWT signing（KMS Sign API 直接簽 JWT header.payload、private key 不離 KMS、跟 Storm-0558 的設計對照鮮明）；CodeSign / S3 object signing（artifact integrity）；mTLS client cert 的 private key（搭配 cert-manager AWS issuer）。代價是 latency（每次 sign 一次 KMS API call、~10ms 級別、不適合超高 QPS）跟 cost（asymmetric operation 比 symmetric 貴 ~5x）。

排錯與失敗快速判讀

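Key Policy without a root principal: the statement was left out when the key was created, so IAM policies have no effect on the key and only principals named in the Key Policy can manage it. Once those principals are gone the key is orphaned and can only be recovered through AWS Support (a slow process). Make the creation pipeline enforce a template that contains the root principal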
IAM admin 改不動 KMS key：Key Policy 沒授權 IAM 介入、即使 admin policy 有 kms:* 也擋掉 — 加 Enable IAM User Permissions statement 給 root principal、IAM 才能參與授權
Schedule Key Deletion 誤觸發：min 7 天、max 30 天的等待期、期內可 cancel — production key 必含 alert（CloudWatch Alarm on ScheduleKeyDeletion event）+ 強制 4-eyes approval
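CloudTrail excludes KMS events: after an incident you want to know "who decrypted what" and find the trail was configured to exclude AWS KMS events to cut volume. Encrypt / Decrypt / GenerateDataKey are management events; production must keep at least one trail that records them, budget for the storage volume, and never exclude them for sensitive keys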
Encryption Context 不一致：encrypt 時帶 context、decrypt 時忘了帶（或帶錯）、InvalidCiphertextException — code review 強制 context schema、用 typed wrapper 避免人手帶錯
Grant 累積 + 沒 retire：每個 KMS key 有 50,000 grant 上限、rotation Lambda 跑久了 grant 累積 — 定期 ListGrants + RetireGrant 廢棄的、IaC 治理 grant lifecycle
Cross-region decrypt 失敗：以為 ciphertext 跨 region 通用、結果原本不是 Multi-Region Key — production 跨 region 場景一律建 Multi-Region Key、不要事後補
CMK rotation 後舊 ciphertext 還能 decrypt：annual rotation 不會 re-encrypt 舊資料、KMS 自動用對應 backing key — 這是設計、不是 bug；真要全量 re-encrypt 要走 application-level migration

何時改走其他服務

需求形狀	改走
FIPS 140-2 Level 3 single-tenant HSM	CloudHSM、或 KMS Custom Key Store 橋接
GCP-heavy 環境	Google Cloud KMS
Azure-heavy + 跟 AD / Managed Identity 整合	Azure Key Vault
跨雲統一 key custody	HashiCorp Vault transit engine
Static secret + rotation orchestration	AWS Secrets Manager（後端是 KMS）
K8s workload mTLS cert	cert-manager（可用 KMS asymmetric key）
Public TLS cert	AWS ACM / Let’s Encrypt
數據主權 / on-prem HSM required	KMS External Key Store (XKS) 或直接 CloudHSM

不在本頁內的主題

KMS 完整 API reference 跟 SDK 範例
各 AWS service（S3 SSE-KMS、EBS encryption、RDS encryption、DynamoDB encryption）的詳盡設定步驟
跟 AWS Organizations / SCPs 的 cross-account KMS sharing 完整治理流程
CloudHSM cluster 的完整運維（高可用、user 管理、backup）— 看 CloudHSM
各種 cryptographic algorithm 的數學原理跟選型細節

案例回寫

KMS 在 07 案例庫沒有直接 vendor-level 事件、以下案例採對照引用：

案例	跟 KMS 的關係（對照）
Microsoft Storm-0558 Signing Key 2023	KMS 設計核心對照 — signing key 必須 HSM-bound + 不可導出、KMS 預設 key 完全不離 service；自己 host private key 是 Storm-0558 級事件的根因
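Microsoft Storm-0558 Signing Key Chain (red-team)	Three things must be in place: an asymmetric KMS Key for JWT signing (the private key never leaves KMS), an enforced rotation process, and CloudTrail records of Sign calls (management events, provided no trail excludes KMS events) that show "who used the key to sign which token"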
Failure: Credential Rotation Without Scope	KMS Alias / Grant 的 rotation 跟 revocation 要分域 — 一次 Schedule Key Deletion 沒 scope map 等於潛在全停、Grant lifecycle 要納入治理

下一步路由

上游：7.6 秘密管理與機器憑證治理、7.5 傳輸信任與憑證生命週期（KMS 為 TLS / signing key 的 root custodian）、7.13 偵測覆蓋率與訊號治理
平行：Google Cloud KMS、Azure Key Vault、CloudHSM
下游：AWS Secrets Manager（後端用 KMS）、cert-manager（可用 KMS asymmetric key 當 issuer）
對照：HashiCorp Vault（transit engine / 跨雲統一介面）
跨模組：8 事故處理 vendor 清單（KMS 事件如何 routing 進 IR 流程）
官方：AWS KMS Documentation

GitHub Advanced Security

Mon, 18 May 2026 00:00:00 +0000

GitHub Advanced Security（GHAS）是 GitHub 內建的 application security platform、由四大模組組成：Code Scanning（CodeQL 為預設 SAST、可接受第三方 SARIF）、Secret Scanning（偵測 leaked credential、含 Push Protection 預防 push）、Dependency Review（PR 級依賴變更 gate）、Dependabot（自動化依賴 update + alert、細節見獨立 vendor 頁）。它跟 Snyk / Trivy 等獨立 SCA 工具的核心差異是 跟 GitHub workflow / PR / Security tab 深度整合 — security finding 直接出現在 PR review 跟 organization Security overview、不需另一個 dashboard。

服務定位

GHAS 的核心定位是 把 application security 控制面收斂回 GitHub 平台：SAST、Secret Scanning、Dependency Review、Dependabot 共用 GitHub 的 identity / permission / PR / branch protection / Actions / Security tab，讓 security finding 跟 code review 在同一個 surface 上決策。這跟 Snyk 走「跨 SCM、跨雲、自有 dashboard」是相反方向 — Snyk 把 security 抽到平台之上、GHAS 把 security 釘在 GitHub 之內。

跟 Trivy 比、定位差更遠。Trivy 主打 container image / IaC / SBOM scan、open-source 免費、適合塞進任何 CI；GHAS 主打 source code + secret + dependency、Enterprise 付費、container scan 有但偏弱。兩者通常並存 — Trivy 跑 container artifact、GHAS 跑 source repo。

跟 Dependabot 的關係是內含 — Dependabot 是 GHAS 四模組之一、跟 GHAS 同一個控制平面、跟 PR / Security tab 同一條 evidence chain。本頁聚焦 GHAS 整體 + Code Scanning / Secret Scanning / Dependency Review；Dependabot 的 update PR 政策、ecosystem 覆蓋、alert routing 細節留在該頁。

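Key tension: GHAS is billed per active committer on the repos where it is enabled, and since April 2025 it is sold as two separate products, Secret Protection and Code Security. Organizations with a large mono-repo or a growing committer count hit a cost ceiling and end up enabling repos selectively and buying the modules separately. At the same time, preventive controls such as Push Protection only work where they are enabled, so selective enablement is an implicit acceptance of risk.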

本章目標

讀完本頁、讀者能判斷：

GHAS 四大模組各自承擔哪段控制責任（SAST / Secret / PR-level dependency gate / 自動 update）、哪些跟 Snyk / Trivy 重疊或互補
CodeQL 跟 SARIF 標準的關係、為什麼第三方 SAST 工具的 finding 也能進 GHAS Security tab
Secret Scanning 的 Push Protection（預防 push）跟 Secret Scanning Alert（偵測 leaked）的職責差、partner pattern vs custom pattern 何時用
何時用 GHAS、何時改走 Snyk / Trivy / GitLab Ultimate（GitLab 自家相當品）

最短判讀路徑

判斷 GHAS 配置是否健康、最少看四件事：

誰能 enable / disable：Organization owner / Security manager role 配置、enable GHAS 的 audit log 是否同步、誰能改 Code Scanning workflow（branch protection 是否擋住 workflow file 直接 push）
哪些 repo 開啟：Org Security overview 看 Code Scanning / Secret Scanning / Dependency Review coverage、新建 repo 是否預設啟用（Organization-level default setting）、private / internal / public repo 是否一致開啟
Push Protection 狀態：Secret Scanning Push Protection 是否 organization-wide enable、bypass 權限給誰（developer 個人 bypass vs 必須走 Security team approval）、bypass 事件是否進 audit
Secret Scanning Coverage：partner pattern（AWS / GCP / Stripe / Slack 等預配）是否全開、custom pattern 是否涵蓋自家 internal token（service token、internal API key）、historical scan 是否跑過（不只新 commit、舊 commit 也要掃）

四件事任一缺失、就是 Secret Management 跟 Supply Chain Integrity 邊界的待補項目。

日常操作與決策形狀

Code Scanning 走 SARIF 標準：Code Scanning 不只是 CodeQL 的 UI、是 SAST aggregation layer。所有 SAST 結果（CodeQL 預設、或 Semgrep / Snyk Code / Brakeman / Bandit / SonarCloud / Checkmarx 等第三方）以 SARIF（Static Analysis Results Interchange Format）upload 到 Code Scanning、Security tab 統一展示、PR review 統一標註。意義是 組織可以用多個 SAST 工具但只看一個 dashboard — 不需要每個 vendor 各自登入。多工具 SARIF upload 用 GitHub Actions 的 github/codeql-action/upload-sarif step。
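To make the aggregation role concrete, the sketch below builds a minimal SARIF 2.1.0 document from a list of findings, the kind of file a third-party scanner would hand to the upload step. The input shape (`rule_id`, `message`, `path`, `line`, optional `level`) and the tool name `acme-lint` used in the example are assumptions for illustration; real tools emit many more fields.

```python
import json


def sarif_report(tool_name: str, findings: list) -> dict:
    """Build a minimal SARIF 2.1.0 log with one run."""
    rule_ids = sorted({f["rule_id"] for f in findings})
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [
            {
                "tool": {
                    "driver": {
                        "name": tool_name,
                        "rules": [{"id": rule_id} for rule_id in rule_ids],
                    }
                },
                "results": [
                    {
                        "ruleId": f["rule_id"],
                        "level": f.get("level", "warning"),
                        "message": {"text": f["message"]},
                        "locations": [
                            {
                                "physicalLocation": {
                                    # Path relative to the repository root.
                                    "artifactLocation": {"uri": f["path"]},
                                    "region": {"startLine": f["line"]},
                                }
                            }
                        ],
                    }
                    for f in findings
                ],
            }
        ],
    }


if __name__ == "__main__":
    demo = [{"rule_id": "no-eval", "message": "Avoid eval()", "path": "app/main.py", "line": 12}]
    print(json.dumps(sarif_report("acme-lint", demo), indent=2))
```

Each tool uploads its own file, and the `tool.driver.name` value is what lets findings be separated by tool afterwards.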

CodeQL 是 first-class query language：CodeQL 用 Datalog-like 語法寫 自定 query、可以檢測 organization-specific anti-pattern（例：禁用某內部 deprecated function、強制 input validation 在特定 trust boundary）。vendor-provided pack（GitHub 維護的 CodeQL pack）覆蓋 OWASP Top 10 / CWE Top 25、自定 query 補組織 idiomatic check。代價是 CodeQL 學習曲線陡 — 不是 regex / AST pattern、是完整的 graph query language。

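Secret Scanning has three layers of responsibility. Partner patterns: token patterns pre-registered by GitHub with vendors such as AWS / GCP / Stripe / Slack / npm; they give the widest default detection coverage, and for leaks found in public repositories GitHub also notifies the vendor, who may revoke the token. Push Protection: scans at push time and rejects the push when it finds a secret, so the developer has to remove it before pushing; this is prevention rather than detection, and avoids rotating after a leak. Custom patterns: the organization writes regex patterns for its own internal tokens (service-to-service API keys, legacy auth tokens), checks them with a dry run, and narrows them with additional match requirements to cut false positives.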
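Before registering a custom pattern it helps to prototype the regex locally against known positives and negatives. The token format below (`acme_svc_` followed by 32 lowercase hex characters) is entirely hypothetical, and Python's `re` is used only for prototyping: GitHub evaluates custom patterns with its own regex engine, so the final expression should be re-checked with a dry run there.

```python
import re

# Hypothetical internal token: fixed prefix + 32 lowercase hex characters.
INTERNAL_TOKEN = re.compile(r"\bacme_svc_[0-9a-f]{32}\b")


def find_tokens(text: str) -> list:
    """Return every substring that looks like an internal service token."""
    return INTERNAL_TOKEN.findall(text)
```

A distinctive prefix plus a fixed length keeps false positives low, which is the main reason to design internal token formats with scanning in mind.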

Dependency Review 是 PR-level gate：每個 PR 跑 新增 / 升級依賴的漏洞檢查 + license check、把 新引入 CVE 列在 PR review、可設 branch protection 強制 PR 過 Dependency Review 才能 merge。這跟 Dependabot 是互補關係：Dependabot 是 已 merge 依賴的 update PR（時間軸：merge 後 vuln 出現、自動發 update PR）、Dependency Review 是 PR 加新依賴時的 gate（時間軸：merge 前 vuln 已知、擋 PR）。兩條軸都要開。

Security overview 是 org-level dashboard：Organization Security tab 看 跨 repo 的 Code Scanning / Secret Scanning / Dependency / Dependabot alert 彙整、用 repo / severity / age filter 排序。對於 security team 不是 repo owner 的組織、Security manager role 給 security team 跨 repo read + triage 權限、不需要 admin。

Security Advisories（CVE 揭露 workflow）：自家 OSS / 商業 product 出 CVE 時、走 GitHub Security Advisory — 在 private fork 修補、coordinated disclosure 時間到公開 advisory、GitHub 自動向 CVE Numbering Authority 申請 CVE ID。這條 workflow 是 維護者視角、不是 使用者視角；使用者收到的是其他人發的 advisory 進 Dependabot alert。

SARIF integration 是 GHAS 的 aggregation 角色關鍵：GHAS 不強迫只用 CodeQL — Snyk Code / Semgrep / SonarCloud 等 SAST 工具跑完輸出 SARIF、CI 上傳到 GitHub、Security tab 集中展示。意義是 組織用 Snyk 做 SAST、但 finding 走 GHAS UI 是合法配置；GHAS 賣的不只是 CodeQL、是 SAST 統一視圖。

核心取捨表

取捨維度	GHAS	Snyk	Trivy	Dependabot（GHAS 子模組）
主要範圍	Source code + secret + dependency（PR-level）	SCA + Container + IaC + SAST（跨 SCM）	Container image + IaC + SBOM scan	依賴 update + alert（merged code）
SCM 綁定	緊綁 GitHub	跨 GitHub / GitLab / Bitbucket / Azure Repos	無 SCM 綁定、跑在 CI / artifact registry	緊綁 GitHub
SAST 引擎	CodeQL 預設 + 第三方 SARIF aggregation	Snyk Code（DeepCode）	無 SAST	無
Secret Scanning	Partner pattern + Push Protection + custom pattern	Snyk Secret Scanning（較弱）	有限（filesystem secret scan）	無
Container 強度	中（Code Scanning 可掃 Dockerfile）	強（Snyk Container 是主打）	強（Trivy 是 container scan 標準）	無
License / SBOM	有（Dependency Review 含 license）	強（SBOM 生成、license compliance dashboard）	強（SBOM 是 first-class）	無
PR 整合	深 — Security tab + PR review 直連	中 — GitHub Check + 跨 SCM PR comment	中 — 第三方 Action 整合	深 — 自動發 PR
計費	Per-active-committer + per-repo（Enterprise）	Per-developer + tier	Open source 免費（Aqua 商業版加值）	GHAS 一部分
適合	GitHub-heavy org、想統一 PR + security UI	多 SCM / 多雲、SCA + Container 一站、license 強需求	Container / IaC scan 為主、CI pluggable	GitHub repo 想要自動依賴 update
不適合	GitLab / Bitbucket / 自管 Git 為主	GitHub-only 又要省成本	需要 SAST + Secret Scanning	不想自動產生 PR（噪音）

選 GHAS 的核心訴求：GitHub 是 SCM + 想 PR review 跟 security finding 合一 + Enterprise 預算可吸收 per-committer cost。GitLab 主要的組織直接走 GitLab Ultimate 的對等功能；多 SCM 或 container 為主走 Snyk + Trivy 組合。

進階主題

CodeQL custom query 開發：寫自定 query 用 CodeQL CLI 本地開發、跑 codeql database analyze、SARIF output 上傳。常見場景：禁用 internal deprecated API、特定 framework 的 misuse pattern、組織 idiomatic security check。Query pack 可以 publish 到 GitHub Container Registry 或 internal registry、跨 repo 復用。代價是 維護成本 — CodeQL query language 學習曲線陡、組織需要至少 1-2 個 security engineer 專門養護。

Push Protection bypass workflow：Push Protection reject push 後、developer 可以 bypass（標記 false positive / test data / 風險已知）。Bypass 權限治理是關鍵 — 開放給 developer 個人 bypass 失去預防意義、強制 Security team approval 又拖慢 dev velocity。常見折中：低風險 pattern（test fixture token）developer 可 bypass、高風險 pattern（production credential）必須 Security team approve；所有 bypass 事件進 audit log。

跟 GitHub Actions 整合：Code Scanning 走 GitHub Actions workflow 跑 CodeQL — github/codeql-action/init + github/codeql-action/analyze。同 workflow 可以加 upload-sarif step 接第三方 SAST 結果。Actions 用 GitHub-hosted runner 跑 CodeQL 是預設、大型 repo 跑 CodeQL analyze 可能超時、需改 self-hosted runner（大 RAM / 多 CPU）— 但 self-hosted runner 自身是 supply chain 風險、需要 ephemeral runner + 限制 secret access。

SARIF 多工具整合：第三方 SAST / SCA / Container scan 工具（Snyk / Semgrep / Trivy / Brakeman / Bandit / Gosec）跑完輸出 SARIF、CI 上傳到 GHAS。實務上組織常用 CodeQL + Semgrep 雙軌 — CodeQL 跑深度 graph query、Semgrep 跑快速 pattern 規則；finding 在 Security tab 用 tool filter 分開看。
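The aggregation path above depends on every tool emitting the same SARIF shape. The following is a minimal sketch, assuming a hypothetical in-house scanner whose findings are plain dicts with the keys `rule_id`, `message`, `path` and `line`; those key names and the tool name `example-scanner` are illustrative, not part of any real tool. It wraps the findings in a single-run SARIF 2.1.0 log that could then be handed to an upload step.

```python
import json


def to_sarif(tool_name, findings):
    """Wrap scanner findings in a minimal SARIF 2.1.0 log with one run."""
    results = []
    for finding in findings:
        results.append({
            "ruleId": finding["rule_id"],
            # SARIF result levels are: none, note, warning, error.
            "level": finding.get("level", "warning"),
            "message": {"text": finding["message"]},
            "locations": [{
                "physicalLocation": {
                    "artifactLocation": {"uri": finding["path"]},
                    "region": {"startLine": finding["line"]},
                }
            }],
        })
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [{
            "tool": {"driver": {"name": tool_name}},
            "results": results,
        }],
    }


# Hypothetical finding used only for illustration.
findings = [{
    "rule_id": "hardcoded-password",
    "message": "Password literal assigned to a settings variable.",
    "path": "app/settings.py",
    "line": 12,
    "level": "error",
}]
sarif_log = to_sarif("example-scanner", findings)
sarif_text = json.dumps(sarif_log, indent=2)
```

Keeping one tool per run and a stable tool name is what lets the Security tab's tool filter separate findings from different scanners.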

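Secret Scanning partner patterns: GitHub maintains a partner pattern list that covers AWS, GCP, Azure, Stripe, Slack, npm, Docker Hub, GitHub PATs and others. When a leaked token that matches a partner pattern is detected in a public repository, GitHub notifies the issuing vendor, and the vendor can choose to revoke the token automatically. This shortens the exposure window, but it does not remove the need for rotation: the organization still has to confirm that the token was revoked, issue a replacement, and review what the token could reach while it was exposed. Custom patterns work differently. The organization defines them as regular expressions, no vendor is notified, and triage of every match falls on the organization.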

GHAS Cloud-hosted vs Self-hosted Runner 治理：CodeQL 跑在 GitHub-hosted runner 是預設、所有 source code 上傳到 GitHub 運算環境。對 source code 機密度高 的組織（金融 / 國防 / 法規限制 source 出境）、需走 self-hosted runner。Self-hosted runner 的供應鏈風險見 GitHub OAuth 2022 — runner token 是 supply chain entry、OIDC short-lived token 是建議方向。

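GHAS Enterprise pricing trap: billing is per active committer. Every user who has committed to a GHAS-enabled repository in the past 90 days counts as an active committer, even for a one-line commit, so large companies overspend easily. Since April 2025 GHAS has been sold as two separately billed products, GitHub Secret Protection and GitHub Code Security, so an organization can buy Secret Protection (lower unit price) for the whole org and Code Security only for critical repositories. Most of these features (Code Scanning, Secret Scanning, Dependency Review) are free on public repositories; only internal and private repositories fall into the billed scope. New organizations are commonly caught by enabling everything on every private repository.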
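The 90-day active committer rule is easy to estimate from commit history before enabling a repository. A minimal sketch, assuming commits are available as `(author, timestamp)` pairs; the function name and the sample data are hypothetical, and no real price is assumed.

```python
from datetime import datetime, timedelta, timezone


def active_committers(commits, now, window_days=90):
    """Return the set of authors with at least one commit inside the window."""
    cutoff = now - timedelta(days=window_days)
    return {author for author, committed_at in commits if committed_at >= cutoff}


utc = timezone.utc
now = datetime(2026, 6, 1, tzinfo=utc)
commits = [
    ("alice", datetime(2026, 5, 20, tzinfo=utc)),
    ("alice", datetime(2026, 4, 1, tzinfo=utc)),   # same author counts once
    ("bob", datetime(2026, 3, 10, tzinfo=utc)),    # 83 days ago, still active
    ("carol", datetime(2026, 1, 15, tzinfo=utc)),  # outside the 90-day window
]
billable = active_committers(commits, now)
```

Multiplying `len(billable)` by the contracted unit price gives a rough monthly figure; running it per repository shows which repositories add committers that no other enabled repository already counts.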

排錯與失敗快速判讀

新建 repo 沒自動開 GHAS：Organization-level default 沒設、新 repo 預設 disable — 開 Organization Security settings 的 Enable for new repositories、現有 repo 用 bulk enable
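Push Protection blocks many legitimate pushes: the custom pattern regex is too broad and ordinary strings match as secrets. Tighten the regex, dry-run the pattern before publishing it, and track the false positive rate from bypass statistics.
Secret Scanning seems to miss historical commits: the team assumes only commits pushed after enablement are scanned. Enabling Secret Scanning triggers a scan of the repository's existing Git history, which can take hours on a large repository, so check whether that historical scan has finished before concluding that old leaked secrets went undetected.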
Dependency Review 沒擋住 vuln PR：Branch protection 沒加 Dependency Review required check — 加進 required status check、新 PR 才強制過
Code Scanning workflow 跑很久 / 超時：repo 太大、GitHub-hosted runner RAM 不足 — 換 larger runner（GitHub Larger Runners）或 self-hosted、或只跑 changed file analysis
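Custom CodeQL query produces too many false positives: the query is too broad and nearly every commit raises an alert. Set an accurate @precision value in the query metadata, and narrow the source, sink and sanitizer definitions of the data-flow query so that fewer paths match.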
第三方 SAST SARIF 沒進 Security tab：upload-sarif step 沒設對 category 或 permissions — security-events: write permission 必須在 workflow 給；同 repo 多工具用不同 category 區分
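Bypass events missing from audit: Push Protection bypasses are not reaching the SIEM. Turn on Enterprise audit log streaming and include the push protection bypass event (secret_scanning_push_protection.bypass) in the event filter.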

何時改走其他服務

需求形狀	改走
多 SCM（GitHub + GitLab + Bitbucket）	Snyk
Container image scan 為主	Trivy 或 Snyk Container
SBOM 生成 + license compliance	Syft + Grype（SBOM-first OSS）/ Snyk + Trivy（SBOM 含在 scan）
GitLab 為主	GitLab Ultimate（SAST / Secret Detection / Dependency Scanning 內建）
Secret scan 但不在 GitHub	GitGuardian / Gitleaks
Runtime detection（不只 source code）	7.13 偵測覆蓋率與訊號治理系列工具

不在本頁內的主題

CodeQL 完整 query language reference
Dependabot 的 update PR 政策、ecosystem 覆蓋、grouped update（見 Dependabot vendor 頁）
GHAS Enterprise Server（自管 GitHub）跟 Cloud GHAS 的功能差異
各語言 / 框架的 CodeQL pack 完整覆蓋表
GHAS 跟 GitHub Copilot Autofix 整合的 AI-assisted remediation 細節

案例回寫

GHAS 在 07 案例庫沒有 直接 GHAS-level vendor 事件。對照引用展示 GHAS 在 supply chain / source-level 控制的能力邊界：

案例	跟 GHAS 的關係
Log4Shell CVE-2021-44228	Dependency Review + Code Scanning 應覆蓋 transitive 依賴、不只 direct import；Security Advisory 是維護者揭露 CVE 的 workflow
XZ Backdoor 2024	對照啟示 — GHAS Dependency Review 看 package version、看不到 maintainer takeover；需補 release-tarball vs git tag 差異跟 maintainer trust baseline
SolarWinds 2020 Sunburst	對照啟示 — Code Scanning 是 source-level、看不到 build-time 植入；需配合 artifact provenance（SLSA L2+）+ reproducible build
GitHub OAuth 2022 Token Supply Chain	對照啟示 — GHAS 自身 token / Actions 權限治理是 supply chain risk、Push Protection + OIDC trust（非長期 token）是 mitigation
7.12 供應鏈完整性與 Artifact 信任	GHAS 是 supply chain 治理工具集、章節原則對應四模組 workflow

下一步路由

上游：7.12 供應鏈完整性與 Artifact 信任
平行：Snyk、Trivy、Dependabot、Syft + Grype（SBOM 走 SARIF 進 GHAS Code Scanning 是常見組合）
下游：7.6 秘密管理與機器憑證治理（Secret Scanning 配 Vault rotation）
跨類：7.13 偵測覆蓋率與訊號治理（GHAS alert 進 SIEM 的 routing）
跨模組：8 事故處理 vendor 清單（leaked secret / SAST critical finding 進 IR 流程）
官方：GitHub Advanced Security Documentation

Google Security Operations

Mon, 18 May 2026 00:00:00 +0000

Google Security Operations 是 Google 雲端的 SOC 整合平台、2023 年起把前 Chronicle SIEM + 2022 收購的 Siemplify SOAR + 2022 收購的 Mandiant threat intel 三條產品線整合成單一品牌。它跟 Splunk / Elastic Security / Datadog Security 的差異在 資料規模假設 + 計費哲學 + threat intel 內建程度、偵測能力本身相近 — Google 的設計假設是 PB/day ingestion + Google 級基礎設施 + 固定費率 by data tier、跟 Splunk per-GB 累進的計費哲學完全相反。

服務定位

Google Security Operations 的核心定位是 為超大規模 SOC 設計的雲原生 SIEM + SOAR + threat intel 一體機、底層走 Google 自家 search infrastructure、上層由四個 first-class concept 撐起來：UDM（Unified Data Model、Google 自定 schema、所有 source 強制 normalize）、YARA-L（Google 自家 detection rule 語言）、Curated Detection（Google 維護的 detection rule 訂閱、客戶不需自己拉）、Mandiant Applied Threat Intel（事件期間自動 enrich + IoC push）。

跟 Splunk 比、Google 走 fixed-price by data tier + 強制 schema normalization — Splunk per-GB ingestion 計費在 PB-scale 會痛、Google 在 multi-PB 通常便宜 3-5 倍、但客戶要接受 UDM 強制 schema 跟 YARA-L 新語法。跟 Elastic Security 比、Google 是 SaaS-only + 大規模優化、Elastic 可自管 + OSS-friendly。跟 Datadog Security 比、Google 是 純 SOC 專用工具、Datadog 是 observability 平面上的 security view；Datadog 適合中等規模 + observability 已用 Datadog、Google 適合大規模 SOC + 不需要 observability 同 plane。

關鍵張力：fixed-price tier 在小規模反而不划算、PB-scale 才回本。組織要看清楚自己的 ingestion 量級 — TB/day 以下走 Datadog / Elastic 通常更便宜、TB-PB/day 之間是模糊地帶、PB/day 以上 Google 是少數能撐又便宜的選擇。Mandiant threat intel 跟 Gemini for Security 是 Google-only 的加值、但這兩個是 enhancement、不是選 Google 的主理由。

本章目標

讀完本頁、讀者能判斷：

Google Security Ops 在 SOC stack 承擔哪一段（log aggregation + SIEM + SOAR + threat intel 一體）、跟 Google Cloud IAM / Google Secret Manager 怎麼整合
UDM forced normalization 跟 YARA-L 對 detection 設計的影響（schema-first 而非 query-first）
Curated Detection + Mandiant Applied Threat Intel 在偵測 lifecycle 的位置（不是自己拉、是訂閱）
何時選 Google Security Ops、何時走 Splunk / Elastic / Datadog 的取捨

最短判讀路徑

判斷 Google Security Ops deployment 是否健康、最少看四件事：

Ingestion 邊界：哪些 source 進來（Forwarder / GCS bucket / Pub/Sub feed / Cloud-native API feed）、UDM normalization 是否覆蓋全部 source、自家 app log 的 parser 是否寫好
Detection 治理：誰能改 YARA-L rule、Curated Detection 開了哪些、自家 rule 是否走版控（Git → API push）、staging tenant 是否在 production 之前 sanity-check
Threat intel 流向：Mandiant Applied Threat Intel 是否啟用、Curated Detection 是否跟新 IoC 自動同步、IoC enrichment 是否回 alert 上下文
Response 流向：Siemplify SOAR 是否接 alert、playbook 是否進版控、跟 8 incident response 的 routing 是否定義

四件事任一缺失、就是 Detection Coverage and Signal Governance 的待補項目。

日常操作與決策形狀

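Ingestion paths: logs reach Google Security Ops through four main paths. The Chronicle Forwarder is agent-based and runs on-prem or on a VM, reading syslog or tailing files. A Cloud Storage feed lands logs in a GCS bucket that Google pulls from. A Pub/Sub feed is the push path for serverless and GCP-native sources. A direct API feed pulls from cloud SaaS such as Okta, Azure AD or AWS CloudTrail through a vendor connector. SaaS-heavy environments usually rely on direct API feeds; the Forwarder is needed mainly for on-prem sources.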

UDM (Unified Data Model)：UDM 是 Google 自定的統一 event schema、所有 source（CloudTrail / Azure AD / Okta / endpoint / DNS）在 ingestion 時 強制 normalize 到 UDM 欄位（principal.user、target.resource、security_result.action 等）。跟 Splunk CIM 同概念、但 Splunk CIM 是 選擇性 mapping、Google UDM 是 forced normalization — 不寫 parser 就不能 ingest custom source。設計取捨：schema-first 讓跨 source query 一致、但客製 source 的 onboarding 變重。
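Forced normalization can be pictured as a function that either produces a UDM-shaped event or rejects the record. The sketch below is conceptual only: real parsers are written in the platform's own parser configuration, not Python, and the raw field names (`user`, `src_ip`, `outcome`, `ts`) describe a hypothetical in-house app log. The output uses a simplified subset of UDM-style field paths.

```python
REQUIRED_RAW_FIELDS = ("user", "src_ip", "outcome", "ts")


def normalize_login(raw):
    """Map a hypothetical raw login record to a simplified UDM-style event."""
    missing = [key for key in REQUIRED_RAW_FIELDS if key not in raw]
    if missing:
        # Schema-first: a record that cannot be normalized is not ingested.
        raise ValueError(f"cannot normalize, missing fields: {missing}")
    return {
        "metadata": {
            "event_type": "USER_LOGIN",
            "event_timestamp": raw["ts"],
            "product_name": raw.get("product", "custom-app"),
        },
        "principal": {
            "user": {"userid": raw["user"]},
            "ip": [raw["src_ip"]],
        },
        "security_result": {
            "action": "ALLOW" if raw["outcome"] == "success" else "BLOCK",
        },
    }


raw_event = {
    "user": "alice",
    "src_ip": "203.0.113.7",
    "outcome": "failure",
    "ts": "2026-05-18T08:00:00Z",
}
udm_event = normalize_login(raw_event)
```

Once every source is mapped this way, a query on `principal.user.userid` behaves the same across Okta, CloudTrail and the custom app, which is the payoff that offsets the heavier onboarding.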

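YARA-L detection rules: YARA-L is Google's own detection rule language. It belongs to the same family as SPL and EQL but makes the rule structure explicit through labeled sections. The events section defines the event patterns and joins events across sources through shared variables, the match section defines the grouping keys and the time window, the condition section defines the threshold that must hold, and the outcome section computes values such as a risk score. Compared with SPL's pipe style it reads as a relational declaration, which suits time-bounded sequences and cross-source joins. A pattern like the Uber MFA case (50 MFA failures within 5 minutes, plus a new device, plus anomalous geography) is clearer as a YARA-L rule than as an SPL pipeline.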
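The time-bounded sequence described here can be emulated in a few lines to make the window and threshold semantics concrete. This is a Python sketch of the logic, not YARA-L syntax; the event keys (`user`, `type`, `ts`, `new_device`) and the type names `MFA_FAIL` and `LOGIN_SUCCESS` are hypothetical, and the geography signal is left out for brevity.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def detect_mfa_fatigue(events, threshold=50, window=timedelta(minutes=5)):
    """Flag a new-device login preceded by >= threshold MFA failures in window.

    `events` must be sorted by timestamp.
    """
    recent_failures = defaultdict(deque)  # user -> timestamps of recent failures
    alerts = []
    for event in events:
        failures = recent_failures[event["user"]]
        # Drop failures that have fallen out of the sliding window.
        while failures and event["ts"] - failures[0] > window:
            failures.popleft()
        if event["type"] == "MFA_FAIL":
            failures.append(event["ts"])
        elif (event["type"] == "LOGIN_SUCCESS"
              and event.get("new_device")
              and len(failures) >= threshold):
            alerts.append((event["user"], event["ts"], len(failures)))
    return alerts


t0 = datetime(2026, 5, 18, 8, 0, 0)
burst = [{"user": "alice", "type": "MFA_FAIL", "ts": t0 + timedelta(seconds=i)}
         for i in range(50)]
burst.append({"user": "alice", "type": "LOGIN_SUCCESS",
              "ts": t0 + timedelta(seconds=60), "new_device": True})
alerts = detect_mfa_fatigue(burst)
```

The grouping key (`user`) and the window play the role of the match section, and the count comparison plays the role of the condition section.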

Curated Detection：Google 維護的 detection rule 訂閱集合、跟 Splunk Security Content 同類但 Google 是 built-in subscription、客戶不需要自己拉 / merge — Google 自動跟 Mandiant threat intel 同步、新 IoC 發布後對應 rule 自動 enable。組織通常 先全部啟用 baseline、再選擇性 disable noisy 規則 + 補自家 custom YARA-L。

Applied Threat Intel (Mandiant)：事件發生時 Google 自動把 alert 裡的 IoC（IP / domain / hash）跟 Mandiant feed 對照、若命中已知 APT 活動就升級 risk score + 附上 Mandiant 報告。跟其他 SIEM 走第三方 threat intel feed 需要自己 maintain enrichment pipeline 不同、Google 走 vertical integration — 收購 Mandiant 後直接內建。

Siemplify SOAR：2022 收購 Siemplify 後整合進 Google Security Ops、playbook 處理 alert triage + 自動 response — 例如 leaked credential 自動 rotate（拉 Google Secret Manager API）、suspect user 自動 disable（拉 Okta / Google Workspace API）、suspect IP 自動加 firewall block（拉 Cloudflare WAF custom rule）。playbook 進版控、走 approval gate for high-impact action、不能黑箱 fire-and-forget。

Entity Graph：Google Security Ops 把 user / asset / IP / domain / hash 等實體做 graph、做 correlation + lateral movement detection。Snowflake 2024 那種「同一 credential / IP 跨多個 Snowflake account」的橫向擴散用 Entity Graph 直接視覺化關聯。

Google Cloud 整合：跟 Google Cloud IAM / Workload Identity Federation 整合度高 — GCP audit log 直接內建 connector、IAM policy change 直接 surface 成 alert 候選、跨 GCP project 的 federation 走 Google Cloud IAM 認證。非 GCP 環境（AWS / Azure / on-prem）一樣支援、但設定路徑比 Splunk add-on 略陡。

核心取捨表

取捨維度	Google Security Operations	Splunk	Elastic Security	Datadog Security
計費模型	Fixed price by data tier（PB-scale 划算）	Ingestion-based（GB/day、累進）	Resource-based（node / cluster size）	Per-host + per-event（events/month）
Schema 處理	UDM forced normalization	CIM optional mapping	ECS optional mapping	Tag-based、彈性高
Detection 語言	YARA-L（結構化 events / match / condition）	SPL（pipe-based、表達力強）	KQL / EQL	Datadog query
Detection content	Curated Detection 內建訂閱	Splunk Security Content（OSS、自拉）	Elastic Prebuilt + Sigma	Datadog Security Rules
Threat intel	Mandiant Applied Threat Intel 內建	需第三方 feed + 自家 pipeline	需第三方 feed	Datadog 內建 + 第三方
SOAR / Response	Siemplify SOAR 內建	Splunk SOAR（前 Phantom、業界先驅）	Cases + Elastic Defend	Workflow Automation（基本）
LLM-assisted	Gemini for Security 內建（2024+）	Splunk AI Assistant	Elastic AI Assistant	Bits AI
部署模型	SaaS only（Google Cloud）	Self-hosted / SaaS	Self-hosted / SaaS / Serverless	SaaS only
適合場景	PB-scale SOC、Google Cloud heavy、要 Mandiant	Enterprise + 跨 on-prem、預算允許	OSS-friendly、Elastic stack 已用	Cloud-native + observability 已用 Datadog
退場成本	中 — YARA-L 跟 UDM 是 Google-specific	高 — SPL / detection / dashboard 量多	中 — Sigma / Lucene 較可移植	中

選 Google Security Ops 的核心訴求：PB-scale ingestion + fixed-price 計費可預期 + Mandiant threat intel 內建 + Google Cloud 整合度。中等規模 / on-prem 為主 / 預算敏感 / 需要 observability 同 plane 的場景都更適合走 Splunk / Elastic / Datadog。

進階主題

Risk Score multi-signal aggregation：Google Security Ops 給每個 entity（user / asset）累積 risk score、跨多 rule 加總、超 threshold 才升級 alert。設計上跟 Splunk RBA 同類、但 Google 把 risk decay 跟 attribution 走 Entity Graph、跨 entity 關係的 risk 傳遞比較細。配對 Uber 2022 MFA Fatigue 的 lesson：MFA fail 累積 + 新裝置 login + 異常地理三個 signal 加總、單獨任一個都不該 alert。
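The aggregation idea can be shown with a small sketch: each rule contributes a score to an entity, and only the sum is compared with the alert threshold. The rule names, scores and threshold below are hypothetical values chosen so that no single signal alerts on its own; risk decay and cross-entity propagation are deliberately not modeled.

```python
from collections import defaultdict


def entities_over_threshold(signals, threshold=70):
    """Sum per-entity risk from (entity, rule, score) tuples.

    A rule is counted once per entity so that a noisy rule firing
    repeatedly cannot reach the threshold alone.
    """
    totals = defaultdict(int)
    seen = set()
    for entity, rule, score in signals:
        if (entity, rule) in seen:
            continue
        seen.add((entity, rule))
        totals[entity] += score
    return {entity: total for entity, total in totals.items() if total >= threshold}


signals = [
    ("alice", "mfa_fail_burst", 30),
    ("alice", "new_device_login", 25),
    ("alice", "geo_anomaly", 25),
    ("bob", "new_device_login", 25),
    ("bob", "new_device_login", 25),  # duplicate firing, counted once
]
escalated = entities_over_threshold(signals)
```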

Cross-tenant federated search：MSSP / 大型集團多 BU 可在 Google Security Ops 跨多個 tenant 做 federated search、單一 console 看跨組織 detection。權限走 Google Cloud IAM role assignment、跨 tenant admin 是高權限角色、走 break-glass + audit。

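Applied Threat Intel and Curated Detection sync: after Mandiant publishes a new APT campaign, the corresponding Curated Detection rules are enabled automatically and the Applied Threat Intel IoCs are pushed automatically, so the customer SOC does not onboard them by hand. SolarWinds 2020 shows why this matters: Mandiant (then part of FireEye) disclosed Sunburst, and SOCs that received its indicators directly were among the few able to enable matching detections right away.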

Siemplify playbook 工程化：playbook 走 graph-based workflow（不是 linear pipeline）、可以 branching / approval gate / human-in-the-loop。Production rule 走 containment-first（disable session、不 delete account）+ approval gate for irreversible action。

Gemini for Security (2024+)：LLM-assisted investigation — natural language 問「過去 24hr 哪些 user 有異常 GCP API 行為」直接生成 UDM query、alert 自動 summarize + 提供 next step 建議。不取代 SOC analyst、但縮短 triage time。

排錯與失敗快速判讀

Custom source ingest 失敗：UDM parser 沒寫 / 寫錯、source 進不來或欄位 NULL — 補 parser、staging tenant 跑 sanity check、看 UDM event count by source 確認 normalization 通過
Detection does not fire / misses: the time window in the YARA-L match section is too short, or the threshold in the condition section is too high. Backtest against historical data in a staging tenant, tune the window and threshold, then promote.
Alert volume 過多：Curated Detection 全開沒 tune、env-specific noise 沒 disable — 跟 Splunk 一樣走 staging 觀察 false positive curve、tune 或 disable 個別規則
Mandiant threat intel 沒命中：licensing tier 沒包 Mandiant Advantage、或 enrichment pipeline 沒啟用 — 檢查 tier、確認 Applied Threat Intel 開
Siemplify playbook 黑箱 fire-and-forget：自動 disable 結果誤殺合法 user — playbook 走 approval gate、預設 containment 不 deletion、定期 dry-run
Cross-tenant admin 太多：日常運維用 cross-tenant admin、blast radius 太大 — 收 admin、改 tenant-scoped role + 特定 capability、跨 tenant 走 break-glass
Cost 比預期高：data tier 選錯（買了 Enterprise Plus 卻只用 Enterprise feature）、retention 設太長 — 看實際 ingestion + retention 用量、tier 跟 retention 一起 review

何時改走其他服務

需求形狀	改走
Enterprise + 跨 on-prem + detection 成熟	Splunk
OSS-friendly / 自管 / 預算敏感	Elastic Security
Cloud-native + observability 已用 Datadog	Datadog Security
DLP / sensitive data discovery	Google DLP / Microsoft Purview
Endpoint detection 為主	CrowdStrike Falcon / Microsoft Defender for Endpoint
Incident routing	8 事故處理 vendor 清單

不在本頁內的主題

YARA-L 完整語法 reference、UDM 全欄位 schema
Chronicle / Siemplify / Mandiant 三條產品線整合前的歷史細節
Mandiant Advantage 平台（threat intel 訂閱、跟 SIEM 整合但獨立產品）
VirusTotal（Google 旗下、跟 Mandiant 互補但獨立服務）
Gemini for Security 的 prompt engineering 細節
Google Workspace security center（屬 Google Workspace、不在 Security Ops 範圍）

案例回寫

Google Security Ops 在 07 案例庫沒有直接 vendor-level 事件、但所有 detection-related case 都是 SIEM 偵測覆蓋率的對照：

案例	跟 Google Security Ops 的關係（對照啟示）
Microsoft Storm-0558 Signing Key Chain	UDM 強制 normalize 跨 Azure AD / GCP / Okta token validation 欄位、YARA-L 跨 source join 直接表達跨租戶 token forging pattern、Entity Graph 視覺化
Uber 2022 MFA Fatigue	YARA-L sequence pattern 直接表達「MFA fail count + 新裝置 login」、Risk Score 累積到 threshold 觸發 Siemplify playbook 自動 disable session
SolarWinds 2020 Sunburst	Mandiant 揭露 IoC 後 Applied Threat Intel 自動 push、Curated Detection 對應規則自動 enable、客戶不需要手動 onboard rule
Snowflake 2024 Credential Abuse	YARA-L 表達「query 體積 / 跨 schema scan / 來源 IP baseline」三軸 correlation rule；Entity Graph 聚合 credential / IP / data warehouse account 視覺化異常擴散（公開 UNC5537 跨客戶模式屬案例外延伸）
Detection Engineering Lifecycle (section)	Curated Detection + 自家 YARA-L rule 走 propose → staging → promote lifecycle、Google Security Ops 內建 rule versioning + Git → API push
Alert Fatigue and Signal Quality (section)	Risk Score multi-signal aggregation 是 alert fatigue 的工程化解法、跟 Splunk RBA 同類但 risk 傳遞走 Entity Graph、跨 entity 關係更細

下一步路由

上游：7.13 偵測覆蓋率與訊號治理、Detection Engineering Lifecycle
平行：Splunk、Elastic Security、Datadog Security
下游：Google DLP / Microsoft Purview（DLP signal 進 Google Security Ops）
跨類：Google Cloud IAM（GCP IAM log + Workload Identity Federation）、Google Secret Manager（SOAR playbook 拉 API）、Okta（IdP log source）、Cloudflare WAF（WAF log + auto-block）
跨模組：8 事故處理 vendor 清單（alert → IR routing）、4 observability（log pipeline 共用判斷）
官方：Google Security Operations Documentation

Locust

Fri, 15 May 2026 00:00:00 +0000

Locust 的核心責任是用 Python 表達高度自訂的使用者行為與 protocol client。它適合 Python 團隊、需要自訂 client、需要 distributed worker、或 scenario 邏輯比工具內建 sampler 更複雜的壓測流程。

服務定位

Locust 適合把壓測寫成一般 Python 程式。當 workload model 需要呼叫 internal SDK、特殊 protocol、複雜資料準備、狀態機、隨機行為或自訂 client、Locust 可以直接使用 Python 生態來表達。底層架構是 master + worker 分散式 swarm、worker 之間用 Gevent green-thread（非 OS thread）模擬大量並發 user、master 負責 spawn rate、aggregation 跟 Web UI。

這個定位讓 Locust 接到 9.2 Workload Modeling 與 9.5 瓶頸定位流程。它能把特殊 client 與下游 dependency 放進同一個 user behavior、但也要求團隊處理 runner、資料與可重現性。

跟 k6（JS / Go runtime）比、Locust 用 Python 換到 自訂能力與生態相容、但代價是單 worker capacity 低、CPU bound 容易先打到自己。跟 JMeter（GUI / XML）比、Locust 偏 code-first 工程團隊、scenario 直接走 Git review、不靠 GUI plugin 拼裝。跟 Gatling（Scala DSL）比、Locust 換到 Python team 友善 + 既有 domain library 重用、但失去 JVM injection profile 的精細度與報表內建。

關鍵張力：Python 表達力 ↔ runner 效能上限。Python team 想 reuse domain library、staging fixture、API client 寫壓測腳本時 Locust 是首選；但要心裡有數 單 worker RPS 上限不高、超過幾千 RPS 就要靠 worker scale-out、不是調 Locust 本身。

適用場景

Python 團隊適合用 Locust 長期維護壓測。既有 domain library、API client、fixture、資料產生器與驗證 helper 都可以被壓測腳本重用。

自訂 protocol 適合用 Locust。HTTP 之外、如果服務需要 gRPC、WebSocket、binary protocol、message broker client 或自家 SDK、Locust 可以直接接 Python library。

Distributed load 適合用 Locust worker 擴展。當單機 Python runner 遇到 CPU 或 connection bottleneck、可以用 master / worker 拆開負載產生能力。

本章目標

讀完本頁、讀者能判斷：

Locust 在壓測 stack 中承擔哪一段（user behavior modeling / load generation / distributed swarm）、哪些要外接（Prometheus / Grafana 觀測 worker 自身、APM 看目標 saturation）
User class / task weight / on_start lifecycle 的 ownership 設計（誰寫 locustfile、誰 review、誰調 spawn rate）
Distributed master-worker 部署的容量規劃（單 worker user 上限、worker 數量計算、target RPS 對應 worker count）
何時用 Locust、何時走 k6 / JMeter / Gatling 的取捨

最短判讀路徑

判斷 Locust 壓測是否健康、最少看四件事：

User class 設計：每個 HttpUser / User subclass 是不是一個明確的 persona（mobile user / API client / admin user）、wait_time 是否反映真實使用者間隔（不是 0 拼最大 RPS、是 between(1, 5) 模擬 think time）、user state 是否在 instance 內封閉
Task 比例：@task(weight) 數字是否對應 production traffic mix（80% read / 15% write / 5% admin、不是每個 endpoint 等比例）、weight 是否走版控 review
on_start lifecycle：login / token fetch / session bootstrap 是否寫在 on_start（每個 user 一次）、不是寫在 @task 裡（每個 request 都重做）— 寫錯位置會讓 auth endpoint 變成主要 traffic
Distributed master-worker：worker 數量是否夠（單 worker 跑幾千 user 後 CPU 會先打死、不是目標服務先死）、master 是否獨立機器（master 也跑 user 時 aggregation 跟 Web UI 會卡）、--expect-workers 是否設、worker sync drift 是否觀察

四件事任一缺失、就是壓測證據可信度的待補項目。

日常操作與決策形狀

locustfile 結構：locustfile.py 是 Python module、定義 User / HttpUser subclass、每個 user 有 wait_time、若干 @task(weight) method、on_start / on_stop lifecycle hook。執行用 locust -f locustfile.py --host=https://target 起 Web UI、或 locust --headless -u 1000 -r 100 -t 10m 在 CI 跑無 UI 模式。locustfile 應該走 Git review、不是 GUI 改完就跑。

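Task weight and wait_time design: a weight is a relative weight, not a percentage. @task(8) plus @task(2) yields an 80% / 20% split because each task is picked with probability weight / sum of weights. wait_time = between(1, 5) waits 1 to 5 seconds between tasks to model think time. constant(0) pushes for maximum RPS, but the run is then a throughput probe rather than a user behavior model.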

on_start vs @task 的邊界：on_start(self) 每個 user instance 啟動時跑一次、適合做 login、token fetch、cache warm、fixture lookup；@task 是 user 行為主迴圈、每次選一個 task 跑。把 login 寫在 @task 是常見錯誤、會讓 IdP 變成主壓力來源、不是目標 API。

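Gevent-based concurrency: Locust simulates many concurrent users with gevent green threads, not OS threads. A single worker can therefore run thousands of users, but CPU-bound work (JSON serialization, encryption, local computation) blocks the whole worker's event loop. Locust applies gevent monkey patching when the locust package is imported, so socket, time and ssl calls in pure-Python libraries yield cooperatively. Code that captured references to those functions before the patch was applied, or a C extension that does its own blocking I/O, bypasses the patch, and one blocking call then stalls every user on that worker.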

Distributed master-worker：單機到極限時開 distributed — locust --master 起 master、locust --worker --master-host=master.example.com 起 worker。Master 負責 Web UI、spawn rate 控制、result aggregation、stat 收集；worker 負責跑 user。Master 不該跑 user（會跟 aggregation 搶 CPU、stat 失真）。worker 數量計算：先單 worker 拉到 CPU 80% 看能撐多少 user、目標 user 數除這個值 + 20% buffer。
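The worker count rule above (target users divided by measured per-worker capacity, plus a 20% buffer) is simple arithmetic. A sketch with integer math so the rounding is exact; the function name and sample numbers are hypothetical, and `users_per_worker` is meant to be the value measured at about 80% worker CPU.

```python
def workers_needed(target_users, users_per_worker, buffer_percent=20):
    """Workers required for target_users, rounded up, plus a safety buffer."""
    if target_users < 0 or users_per_worker <= 0:
        raise ValueError("target_users must be >= 0 and users_per_worker > 0")
    # Ceiling division without floats: -(-a // b) == ceil(a / b).
    base = -(-target_users // users_per_worker)
    extra = -(-base * buffer_percent // 100)
    return base + extra


# 10,000 target users, one worker measured at 1,500 users near 80% CPU.
planned = workers_needed(10_000, 1_500)
```

Here the base is 7 workers and the buffer adds 2, so the plan is 9 workers, with the master on a separate machine that runs no users.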

Custom load shape：除了固定 -u 1000、Locust 支援 LoadTestShape subclass 寫 時間軸負載曲線 — spike test（瞬間 0 → 5000 user）、ramp test（線性爬升）、wave test（週期性高低交替）、step test（階梯式增加）。tick() method 每秒回傳 (user_count, spawn_rate)。用 custom shape 才能模擬 9.C16 SeatGeek waiting room 那種 ticket drop 瞬間衝擊。

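Prometheus exporter and observation: Locust's built-in stats (p50 / p95 / p99 / RPS) live in memory and are gone when the run ends. For long-term observation, attach a Prometheus exporter for Locust (a third-party component, not part of Locust itself) or write CSV files with --csv and a file prefix, then send the metrics to Prometheus and Grafana. Always watch the workers' own CPU, memory and network at the same time; otherwise target saturation cannot be told apart from a dead worker.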

Locust Cloud（managed SaaS）：2024 後 Locust 推官方 Locust Cloud、託管 master + worker + result storage、付費換 ops 成本。自管 master-worker 對 CI / staging 是合理的；production 等級的 scale test（10k+ concurrent user）跑一次要拉幾十台 worker、用 Cloud 省 infra ops 是合理 trade-off。

核心取捨表

取捨維度	Locust	k6	JMeter	Gatling
腳本語言	Python（generic）	JavaScript (k6 runtime)	XML / GUI / Groovy	Scala DSL（也支援 Java / Kotlin）
Runtime	Python + Gevent green-thread	Go-based、單 binary、低 overhead	JVM、heavy	JVM、async actor model
單 worker capacity	中低（Python overhead、千級 user）	高（Go runtime、萬級 VU 單機）	中（JVM tuning 後可用）	高（Akka actor、效能好）
Distributed mode	內建 master-worker	內建 k6 Cloud / k6 Operator	內建 master-slave	Gatling Enterprise（前 FrontLine）
User behavior 彈性	高 — 一般 Python、任意 library	中 — JS 但 k6 runtime 受限	中 — GUI 拼裝 + plugin	中高 — Scala DSL 表達 simulation
Custom protocol	強 — 接任何 Python library	強 — 有 gRPC / WS / Kafka extension	強但繁瑣 — plugin 生態廣	中 — 主要 HTTP / WS
CI / headless	`--headless` 支援	CI-first design	non-GUI mode 支援	內建支援
Report / UI	Web UI 即時 + CSV 匯出	k6 Cloud / Grafana / 簡 stdout	GUI listener / HTML report	HTML report 內建、視覺豐富
學習曲線	緩（Python team）/ 陡（非 Python）	中 — JS-style scripting	緩（GUI）/ 陡（深度 tuning）	陡 — Scala 語法
適合場景	Python team + 自訂 behavior / client	DevOps + CI / 標準 HTTP / 高 RPS 單機	非工程角色協作 / legacy enterprise	JVM team + 精細 injection profile
退場成本	低 — Python 腳本可移植	中 — k6 runtime 綁定	中 — XML jmx 不易他移	中 — Scala DSL 綁定

選 Locust 的核心訴求：Python team + custom user behavior + 既有 domain library 重用、且能投入 worker scale-out 預算（單 worker capacity 低、要靠分散式補）+ scenario 走 Git review 不靠 GUI。標準 HTTP 高 RPS 單機壓測直接走 k6 更快、非工程角色協作壓測走 JMeter、JVM team 精細模擬走 Gatling。

進階主題

Distributed Locust 的 master-worker swarm：production scale test 通常需要 10-100 個 worker。實作要點：worker 之間不要共享 state、shared resource 由 master 統一發（用 zeromq message bus）；worker 加入 / 離開時 user 會 redistribute、避免 user index 當 unique key；worker 跨 region 跑時 latency 來自 worker → target 不只是 target 內部、要在 worker 本身的 region 對齊。

Custom load shape（spike / wave / step）：LoadTestShape.tick(self) return (user_count, spawn_rate) tuple 每秒被叫一次。Spike test：前 60 秒 0 user、第 61 秒瞬間衝 5000、模擬 9.C16 SeatGeek waiting room 的 admission storm。Wave test：sine wave 在 1000-3000 user 之間振盪、測 autoscaling 反應速度。Step test：每 5 分鐘加 1000 user、觀察哪一階開始降級。custom shape 是 Locust 比 k6 強的點之一。
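The spike and step curves can be written as pure functions of elapsed run time, which keeps them testable without starting Locust. The sketch below assumes the durations and user counts from the paragraph above plus a hypothetical total run length; each function returns the `(user_count, spawn_rate)` tuple that `tick()` is expected to return, or `None` to stop the test.

```python
def spike_tick(run_time, quiet_seconds=60, peak_users=5000, total_seconds=360):
    """0 users during the quiet period, then jump straight to the peak."""
    if run_time >= total_seconds:
        return None  # returning None from tick() ends the test
    if run_time < quiet_seconds:
        return (0, 1)
    # Spawn rate equal to the peak asks for the whole jump in about a second.
    return (peak_users, peak_users)


def step_tick(run_time, step_seconds=300, step_users=1000, max_users=5000):
    """Add step_users every step_seconds until max_users, then stop."""
    steps_done = int(run_time // step_seconds) + 1
    users = steps_done * step_users
    if users > max_users:
        return None
    return (users, step_users)


# Wiring into Locust (not executed here), assuming the locust package:
#
#   from locust import LoadTestShape
#
#   class SpikeShape(LoadTestShape):
#       def tick(self):
#           return spike_tick(self.get_run_time())
```

Keeping the curve outside the LoadTestShape subclass also makes it easy to record the exact shape parameters in the evidence package.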

跟 Prometheus exporter 整合：locust-prometheus-exporter 把 Locust stat 推到 Prometheus / Grafana、做長期 baseline、跨 test 比較、p99 退化偵測。實務上要在 dashboard 同時放 Locust 內部 stat + worker host metric + 目標服務 APM、三層 stack 起來才能判讀是 runner 還是目標 saturation。

Locust Cloud（managed SaaS）：2024+ 官方 SaaS、託管 master + worker + result + dashboard。trade-off：自管適合 CI / staging / 內網壓測（target 跑在內網時 Cloud 連不到）；Cloud 適合大規模一次性 scale test（拉 50 worker 跑 2 小時、跑完即停、不想自己 infra ops）。

操作成本

Locust 的主要成本是 runner overhead 與分散式治理。Python runner 的效能上限要用 worker scale-out 解決；壓測結論要同時檢查目標服務 saturation 與 worker 本身 CPU、connection、network 是否已成瓶頸。

腳本工程成本來自自由度。Python 可以很快寫出複雜行為、也容易把測試資料、randomness、side effect、sleep 與 exception handling 寫散；團隊要維持 scenario structure、fixture、logging 與 artifact 標準。

自訂 client 成本來自校正。使用 SDK 或 custom protocol client 時、要確認 client retry、timeout、connection pool 與 serialization 行為是否接近 production、避免 runner 模擬出不存在的壓力形狀。

排錯與失敗快速判讀

Worker CPU 100% 但目標服務閒：Python runner 先死、不是 target saturation — 加 worker 數量、或檢查 task 裡有沒有 CPU bound 的本地計算（大 JSON parse、加密、本地 fixture 生成）擠掉 event loop
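Gevent blocking gotcha: a blocking call inside a third-party library (a C-based database driver such as psycopg2, or an in-house SDK) freezes the whole worker. Locust patches the standard library when it is imported, so make sure nothing captures socket or ssl functions before that import, and replace C extensions that cannot be patched (for example a native MySQL driver) with a gevent-friendly client.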
RPS 達不到目標 / 看起來像 target 慢：實際是 worker connection pool 耗盡、或 worker 本身網卡飽和 — 觀測 worker 本身的 TCP socket 數、netstat ESTABLISHED、network throughput；不要直接 blame target
Distributed sync drift：worker 之間 user count 不平均、aggregation 顯示 RPS 抖動 — --expect-workers=N 確認 master 等所有 worker join 才開測；worker 跨 region 時 message bus latency 也會影響 sync
Login written inside @task: the auth endpoint is hammered throughout the run, and the resulting IdP latency spike is mistaken for a target problem. Move login and token fetch into on_start so each user does them once.
wait_time = 0 拼最大 RPS 結果結論奇怪：這已經不是 user behavior 是 throughput probe、p99 跟 production 對不上 — 改成 between(1, 5) 模擬 think time 或寫 custom shape
Web UI 卡 / master CPU 100%：master 同時在跑 user + aggregation — locust --master 跟 worker 拆機器、master 不跑 user

何時改走其他服務

需求形狀	改走
標準 HTTP / 高 RPS 單機 / CI-first	k6
非工程角色協作 / GUI 拼裝	JMeter
JVM team / 精細 injection profile	Gatling
極簡 HTTP probe / 命令列 one-shot	Vegeta
Production traffic replay / shadow	GoReplay / Service Mesh Mirroring
壓測結果回寫到效能工程 lifecycle	9.5 瓶頸定位流程、9.3 壓測工具選型

不在本頁內的主題

locustfile 完整語法 reference、User 跟 HttpUser 的 attribute 細節
Locust Cloud 計費跟 quota 細節（看官方 docs）
gevent 跟 asyncio 的取捨（Locust 選了 gevent、不在本頁討論替代）
壓測證據怎麼歸檔（看 9.7 evidence package 通則）

Evidence Package

Locust 結果應回寫到 evidence package。最小欄位包括 locustfile version、user class、task weight、spawn rate、worker count、client library version、target environment、p95 / p99、error rate、throughput、target saturation metric、known gap 與 owner。

欄位	Locust 證據來源
Source	locustfile、CSV / JSON result、dashboard link
Time range	test start / end
Query link	APM / metrics / logs 查詢連結
Data quality	user behavior coverage、fixture freshness
Confidence	worker capacity、client realism
Known gap	worker bottleneck、custom client 偏差、資料偏差

Evidence package 的核心用途是區分目標瓶頸與 runner 瓶頸。Locust 分散式測試要同時保存 worker 數量、worker 資源、spawn rate 與 client behavior、讓 reviewer 知道壓力是否真的打到目標服務。

案例回寫

Locust 適合回寫需要高度自訂 user behavior 的案例。它可接 9.C28 FanDuel 雙峰 workload 的投注行為模型、9.C16 SeatGeek waiting room 的 admission / token flow、9.C26 PayPay mobile payment messaging 的外部推送與下游 quota 模擬、9.C8 Niantic Pokémon GO 50x surge 的玩家移動 + 互動混合行為、以及 9.C18 Zoom COVID 30x surge 的會議建立 / 加入 / 離開行為混合。

這些案例的重點是 domain behavior。Locust 頁引用案例時、要把 case 轉成 user class、task weight、custom client、downstream mock 與 worker capacity、再把總 RPS 放回這些行為條件下判讀 — 例如 Pokémon GO 玩家行為跟一般 web user 完全不同（持續 GPS 上報 + 偶發互動）、不能直接用 HTTP RPS 衡量；SeatGeek waiting room 要寫 LoadTestShape 模擬 ticket drop 瞬間衝擊、不是穩態 RPS。

下一步路由

上游：9.2 Workload Modeling
上游：9.3 壓測工具選型
上游：9.5 瓶頸定位流程
平行：k6、JMeter、Gatling、Vegeta
跨類：GoReplay（production traffic replay 替代 synthetic load）
跨模組：4 Observability（worker 自身 + 目標 APM 雙觀測）
官方：Locust documentation

CockroachDB

Wed, 13 May 2026 00:00:00 +0000

CockroachDB 是分散式 SQL、PostgreSQL wire protocol 相容、跨 region 強一致。設計理念接近 Spanner（線性化、跨 region quorum），但採 HLC + Raft 而非 TrueTime hardware，是 open source + 跨雲可用的全球 OLTP 選擇。

教學路線：Distributed SQL 與跨雲一致性

CockroachDB 服務頁的教學目標是把 PostgreSQL-like 介面背後的 range sharding、Raft replication、serializable transaction、leaseholder 與 region placement 說清楚。讀者讀完後要能判斷 distributed SQL 何時能取代自管 sharding，何時會把 latency 與 retry 壓力推回應用層。

學習段	核心問題	對應段落
Distributed SQL	SQL 介面如何藏住 range sharding 與 Raft replication	定位、容量特性
Serializable default	transaction retry、contention、latency 如何影響應用設計	容量規劃要點、Isolation Level
Region placement	multi-region table、leaseholder、survival goal 如何服務產品需求	適用場景、跟其他 vendor 的取捨
Migration pressure	從 PostgreSQL / MySQL 或自管 sharding 過來時要檢查哪些差異	預計實作話題、案例對照
替代路由	何時留 PostgreSQL、用 Spanner、Aurora DSQL 或 application sharding	不適用場景、下一步路由

定位：Spanner 的開源 / 跨雲替代

CockroachDB 跟 Spanner 解決同一個問題（跨 region 強一致 SQL）、但定位不同：

Spanner：GCP managed service、用 TrueTime hardware
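CockroachDB: source code is public, but licensing has changed across releases and recent versions ship under Cockroach Labs' own license rather than an OSI-approved open source license, so check the terms for the version you plan to run; can be self-managed or run on Cockroach Cloud; runs across AWS / GCP / Azure / on-prem; uses HLC + Raft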

選 CockroachDB 的核心訴求：需要跨 region 強一致 SQL + 想避免雲商 lock-in、想自管或跨雲部署。

詳見 1.11 全球分散式 OLTP 的 CockroachDB 段。

容量特性

節點即容量單位：

跟 Spanner 同樣設計、節點數量決定容量
每節點承擔 query + storage + replication
線性擴展（理論）、實際依 query pattern

跨 region 配置：

multi-region survival goal（zone-level / region-level）
跨 region quorum 必要、決定 latency
跟 Spanner 同樣的物理限制（跨洲 100ms+）

Replication：

Raft consensus per range
預設 3-replica
可配置每個 region 不同 replica count（Survival Goals）

適用場景

1. 需要跨 region 強一致 SQL + 跨雲：

multi-region active-active write
GCP-only（Spanner）或 AWS-only（Aurora DSQL）和部署策略不合
對應 1.11 全球分散式 OLTP 的選型決策

2. PostgreSQL wire protocol 相容路徑：

既有 PostgreSQL 應用想升級到分散式
應用層改動小（保留 PostgreSQL driver / ORM）
注意：PostgreSQL 相容要以實際 query、extension 與 migration test 驗證

3. 自管 on-prem / hybrid：

金融 / 受監管產業需要 on-prem
Spanner / Aurora DSQL 以 cloud service 為主
CockroachDB 可自管

4. 想避免單一 vendor 全球分散式 lock-in：

開源 + 跨雲、可遷移性高
但企業版功能要付費（CockroachDB Cloud 或 Enterprise license）

不適用場景

1. single-region OLTP 夠用：

90% 場景 PostgreSQL / Aurora 已夠
CockroachDB 有分散式 overhead（每個寫經 Raft）
替代：PostgreSQL、Aurora、MySQL

2. 極端高吞吐 single-query：

CockroachDB 寫入有 Raft 開銷、單機吞吐 < PostgreSQL
整體吞吐靠 scale-out 達成、單一 query latency 較高

3. 跨洲低延遲（< 50ms）：

跟 Spanner 同樣物理限制
跨洲 quorum 100ms+ 是物理成本

4. 預算極敏感的小 workload：

CockroachDB 至少 3 個節點（Raft quorum）
跟 single-instance PostgreSQL 比較貴

5. 需要 PostgreSQL 進階特性：

部分 PostgreSQL extension 或行為需要替代方案
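Exclusion constraints and some extension-dependent features may be missing; verify each feature you rely on against the CockroachDB version you plan to run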

跟其他 vendor 的取捨

vs Spanner（GCP）：

CockroachDB：開源、跨雲、可自管
Spanner：GCP-only、TrueTime hardware、Google 規模驗證
選 CockroachDB：跨雲 / on-prem 需求
選 Spanner：GCP 生態 + managed operation + Google 規模驗證的成熟度

vs Aurora DSQL（AWS 2024）：

CockroachDB：跨雲、生產驗證較久
Aurora DSQL：AWS-only、serverless、新（2024）
選 CockroachDB：跨雲、想避免 AWS lock-in
選 Aurora DSQL：AWS 生態 + 已用 PostgreSQL + serverless 訴求

vs TiDB：

CockroachDB：PostgreSQL wire、英語 / 歐美生態深
TiDB：MySQL wire、亞洲生態深、HTAP（OLTP + OLAP 同庫）
選 CockroachDB：PostgreSQL 應用、跨雲
選 TiDB：MySQL 應用、需要 OLAP 整合、亞洲市場

vs PostgreSQL（傳統）：

CockroachDB：分散式、跨 region 強一致
PostgreSQL：single-primary、跨 region 是 async replication
選 CockroachDB：需要跨 region 強一致
選 PostgreSQL：single-region 夠用（90% 場景）

vs Aurora（single-region scaling）：

CockroachDB：multi-region 強一致
Aurora：single-region scaling、跨 region 是 async Global Database
選 CockroachDB：需要 multi-region write
選 Aurora：single-region scaling + AWS 生態

vs MySQL + Vitess（self-managed distributed MySQL）：

CockroachDB：PostgreSQL wire、transparent sharding（range-based）、跨 region 強一致內建
MySQL + Vitess：MySQL wire、application 層配 keyspace + shard key、跨 region 靠 application + async replication
選 CockroachDB：PostgreSQL 應用 + transparent multi-region + 想避開 Vitess operation burden
選 MySQL + Vitess：MySQL 應用 + 有 DBA 養 Vitess + 已是 YouTube / Slack 規模

容量規劃要點

1. Node count + zone / region 配置：

至少 3 個節點（Raft quorum）
Multi-region clusters usually start at 9+ nodes (3 regions × 3 nodes per region)
Survival Goals 配置決定每 region 復原能力
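The node and replica counts above follow from Raft majority arithmetic. A small sketch of that arithmetic; the function names are hypothetical, and the placement lists are illustrative rather than a statement of how CockroachDB places replicas for a given survival goal.

```python
def raft_quorum(replicas):
    """Smallest majority of a replica group."""
    return replicas // 2 + 1


def tolerated_failures(replicas):
    """Replicas that can be lost while a majority remains."""
    return replicas - raft_quorum(replicas)


def survives_region_loss(replicas_per_region):
    """True if losing the region that holds the most replicas keeps a quorum."""
    total = sum(replicas_per_region)
    worst_case_remaining = total - max(replicas_per_region)
    return worst_case_remaining >= raft_quorum(total)


three_in_one_region = survives_region_loss([3, 0, 0])   # all replicas in one region
five_across_three = survives_region_loss([2, 2, 1])     # 5 replicas over 3 regions
```

With 3 replicas the quorum is 2 and one failure is tolerated; with 5 replicas the quorum is 3 and two failures are tolerated, which is why region-level survival costs more replicas and more cross-region write latency than zone-level survival.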

2. Range（CockroachDB 的 partition）：

跟 DynamoDB partition、Spanner split 同類
CockroachDB 自動 split 大 range
application 主要管理 query locality、transaction retry 與 region placement

3. Locality 配置：

跟 Spanner 一樣可以指定 voting region
寫入 locality 影響跨 region latency

4. Backup / restore：

CockroachDB 原生 backup 支援 cluster-level snapshot
增量 backup 支援
注意：incremental backup chain 可能很長、定期 full backup

5. Self-managed vs Cockroach Cloud：

Self-managed：需要 ops team、可跨雲 / on-prem
Cockroach Cloud：managed、跨 cloud（AWS / GCP / Azure）、可考慮 serverless tier

Deep article（已完成）

本批 deep article 覆蓋 CockroachDB 從 consensus 機制、multi-region 配置到 managed 形態選型的核心 production 議題：

主題	文章	對應 production 議題
HLC + per-range Raft、leaseholder、寫入 latency 結構	hlc-raft-consensus	DoorDash Aurora 撞牆訊號（1.636 M QPS）、Netflix 380+ artery of small DBs 容量規劃顆粒
SURVIVE ZONE / REGION FAILURE 倒推、業務 SLO 決定副本拓樸	survival-goals	Hard Rock RPO=0 倒推、Netflix Gaming 48-node 跨 4 region「為求 survival 而非 latency」反直覺
Serializable default、application 必須包 retry loop、SAVEPOINT 語法	transaction-retry-pattern	PG → CockroachDB application contract 重塑、5 種 retry failure mode（跨 case 合成 frame）
REGIONAL BY ROW / TABLE / GLOBAL、跨州合規 + 邏輯一個 cluster	locality-aware-schema	Hard Rock 跨 8 州 sportsbook + AWS Outposts、Outposts 是合規工具不是 latency 工具反直覺判讀
三種 table locality 的選擇與 latency / 一致性取捨、選錯重配代價	multi-region-table-config	Netflix multi-region 動機為 survival 非 latency、Hard Rock row-level 歸屬 + 單一邏輯 cluster
Cockroach Cloud serverless vs dedicated、RU 計費、冷啟動 / scale	cloud-serverless	Netflix 需 Platform Team 反向 = managed 入口、Hard Rock 可預測賽季擴縮 vs serverless 突發甜蜜區
Distributed SQL 三選一決策樹：撞牆訊號分型 + 七問題	aurora-dsql-spanner-decision-tree	DB4 cross-vendor entry：DoorDash / Netflix / Hard Rock driver path 識別 + sizing barrier

DB4 cross-vendor entry：先看 aurora-dsql-spanner-decision-tree 識別 driver path、再進個別 vendor 深度。

multi-region-table-config 與 locality-aware-schema 切分：前者主寫「三種 table locality 怎麼選 + 選錯重配代價」、後者主寫「schema 怎麼配合 locality 設計（合規 boundary、跨州業務邏輯、Outposts 拓樸）」、兩者互補、survival goal 機制以 survival-goals 為 SSoT。

後續擴充（仍待補）

PostgreSQL 相容性 audit（partial index / extension / SQL 行為 gap 清單）
Backup / restore 與 PITR 操作（incremental chain 管理、restore 演練）
Changefeed / CDC 配置（CockroachDB 原生 CDC 到 Kafka / sink）

「從 PostgreSQL 遷到 CockroachDB（playbook）」已由 PostgreSQL → CockroachDB migration 涵蓋、不再列為待補。

Anti-recommendation 與升級路由

CockroachDB 的 PostgreSQL-like 介面會降低導入門檻，但 distributed SQL 的成本會出現在 transaction retry、range lease、multi-region latency 與操作拓樸。這一段先說何時維持 PostgreSQL / Aurora，再說何時升級 CockroachDB、Cockroach Cloud、Spanner、Aurora DSQL 或 Vitess。

機制 / 路線	維持簡單設計的條件	升級訊號	主要引用路徑
PostgreSQL / Aurora	single-region primary、async DR、read replica 已滿足需求	multi-region write、region failure survival、跨雲部署是硬需求	PostgreSQL vendor、Aurora vendor
CockroachDB single-region	需要水平擴容或 future multi-region，但目前在單區運作	Raft overhead 讓成本高於 PostgreSQL，且沒有 region requirement	Distributed SQL
CockroachDB multi-region	跨雲 / on-prem、PostgreSQL wire、strong consistency 是主需求	跨洲 p99 目標過低、transaction retry 影響 user flow	Quorum、Latency Budget
Cockroach Cloud	團隊仍能自管 Raft、backup、upgrade、node failure	想把 operation transfer 給 vendor	RTO、RPO
Spanner	跨雲或自管是硬需求	GCP managed、TrueTime 成熟度、Google scale evidence 是主訴求	Spanner vendor
Aurora DSQL	跨雲 / on-prem 是硬需求	AWS-only、serverless、PostgreSQL 相容與 AWS operation model 是主訴求	PG → Aurora DSQL Migration
MySQL + Vitess	PostgreSQL-like SQL 與 strong consistency 是主需求	MySQL ecosystem、application sharding 與 Vitess ops 已成熟	MySQL Vitess Sharding、Database Sharding

CockroachDB 的簡單路徑是先證明 distributed SQL 的價值大於 retry 與 latency 成本。若 workload 仍是 single-region OLTP，PostgreSQL / Aurora 通常提供更低成本；若跨 region 寫入與一致性是產品承諾，CockroachDB 才成為主要候選。

Transaction retry 的升級路徑要進入 application contract。Serializable default 能保護一致性，但 retry 會把 idempotency、timeout、user-visible latency 與 workflow compensation 帶回應用層；這些條件要在 migration playbook 前先盤點。
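
A client-side retry loop of the kind this contract implies might look like the sketch below. It is an assumption-laden illustration, not a driver API: `SerializationFailure` stands in for whatever exception your database driver raises for SQLSTATE 40001, and `run_with_retry` is a hypothetical helper. The callable passed in must re-run the whole transaction and be safe to repeat.

```python
import random
import time


class SerializationFailure(Exception):
    """Stand-in for a driver error carrying SQLSTATE 40001 (hypothetical type)."""
    sqlstate = "40001"


def run_with_retry(txn, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    """Re-run the whole transaction callable on retryable errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except SerializationFailure:
            if attempt == max_attempts:
                raise  # surface the failure to the caller after the last attempt
            # Exponential backoff with jitter to avoid synchronized retries.
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            sleep(delay)


# Usage: a fake transaction that conflicts twice before committing.
attempts = {"n": 0}


def transfer():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SerializationFailure("restart transaction")
    return "committed"


result = run_with_retry(transfer, sleep=lambda seconds: None)
```

The bounded attempt count is what ties retry back to timeout and user-visible latency: each extra attempt adds delay that the application has to budget for.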

已知 limitation 與後續路由

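The CockroachDB overview currently completes the distributed SQL decision. HLC + Raft, range / leaseholder, multi-region table locality, transaction retry pattern, and Cockroach Cloud operation are now covered by the completed deep articles, and PostgreSQL → CockroachDB migration has its own playbook. What still needs a deep article or playbook is the PostgreSQL compatibility audit, backup / restore with PITR, and changefeed / CDC configuration.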

案例對照

CockroachDB 在 09 案例庫已有三條直接 case 軸線（OLTP 寫入擴展、polyglot 補位、合規邊界），另外兩條對比參考軸線（Spanner 設計理念、受監管金融）一併保留。

Direct case（CockroachDB 為主角）

案例	主要工程議題
9.C39 DoorDash	Aurora Postgres single-primary 1.6 M QPS 撞牆 → multi-primary 解寫入
9.C40 Netflix	380+ cluster 艦隊、Cassandra 不夠用的 transactional workload 補位
9.C41 Hard Rock Digital	AWS Outposts + 跨州單一邏輯 DB、Wire Act 合規 + 賽季型擴縮容

對比參考案例

案例（對比參考）	跟 CockroachDB 的關係
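9.C10 Spanner	Design-philosophy counterpart; CockroachDB is a Spanner-inspired, independently implemented database, not an open-source release of Spanner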
9.C14 Standard Chartered	受監管金融、CockroachDB 可作為 on-prem 替代候選

CockroachDB direct case 的讀法是「寫入擴展（DoorDash）→ polyglot 補位（Netflix）→ 合規邊界（Hard Rock Digital）」三條軸線；對比案例則提醒讀者：Spanner 提供 global consistency 的成熟對照，受監管金融類案例提醒部署位置、合規邊界與自管能力常和一致性需求同時決定 vendor。

反向 sibling 路由

CockroachDB 的反向 sibling 路由用來把 PostgreSQL 相容性和 distributed SQL 責任拆開。若讀者從 PostgreSQL 章節過來，先讀 PostgreSQL → CockroachDB migration；若只是要 managed SQL 與 storage autoscale，先回 Aurora vendor；若要 Google Cloud 原生 external consistency 與 fully managed control plane，再對照 Spanner vendor。

這條路由的判準是「應用是否能承擔 distributed transaction 的語意差異」。SQL dialect 相近只降低 migration entry cost，真正的交付風險在 transaction retry、hot range、survival goal、backup restore 與 locality design。

常見陷阱

single-region 用 CockroachDB：浪費分散式開銷、PostgreSQL 便宜很多
跨洲 active-active 期待低延遲：物理限制、跨洲 quorum 100ms+
PostgreSQL extension 假設：部分 extension 或 SQL 行為需要替代方案，應用要驗證
不規劃 Survival Goals：default 配置可能不符合 RTO / RPO 需求
backup chain 過長：incremental 不 full、recovery time 變長

下一步路由

完整 T1 對照：01-database vendors index
平行：Spanner vendor、Aurora vendor、PostgreSQL vendor
上游：1.11 全球分散式 OLTP — 完整選型對比
跨模組：9.6 容量規劃模型、9.12 SLO 與 Performance Budget
Last reviewed：2026-05-22（PostgreSQL compatibility / survival goal / managed offering 屬時間敏感 claim）
官方：CockroachDB Documentation

Datadog

Fri, 01 May 2026 00:00:00 +0000

Datadog 是 all-in-one SaaS observability 平台、承擔三個責任：覆蓋 APM / logs / metrics / RUM / synthetics / security / CI visibility 全訊號類型、auto-instrumentation 廣度業界第一、跟 600+ integrations 即插即用。設計取捨偏向「turnkey + 廣度 + integration」、成本是主要取捨點。

對「想要 turnkey 體驗、不想自管 observability、多訊號類型統一平台、團隊規模可承擔成本」這條路徑、Datadog 是首選。

本章目標

讀完本章後、你應該能：

安裝 Datadog Agent、配置 APM auto-instrumentation
用 Datadog Logs / Metrics / APM 三大查詢介面
控制 cost（log indexing / metric cardinality / APM trace sampling）
寫 Monitor as code（Terraform）
評估 OTLP ingestion 跟 Datadog SDK 的取捨

最短路徑：5 分鐘把 Datadog 跑起來

# 1. Install the Agent
# TODO: DD_API_KEY= DD_SITE="datadoghq.com" bash -c "$(curl -L ...)"

# 2. Enable APM
# TODO: add apm_config.enabled: true to the Agent config
# TODO: wrap the application with ddtrace-run / dd-trace-py

# 3. Verify that the Agent and APM are reporting
# TODO: check Host map + APM Service List in the Datadog UI

日常操作與決策形狀

Agent 安裝與配置

子議題：

安裝方式：package（apt/yum）/ container / K8s DaemonSet / Lambda extension
Agent config：core / APM / Logs / NetFlow / SNMP 各 sub-config
DogStatsD：應用層 custom metrics 入口
對應指令：datadog-agent status、/etc/datadog-agent/datadog.yaml

APM 自動 instrumentation

子議題：

各語言 tracer：dd-trace-java / dd-trace-py / dd-trace-js / dd-trace-go
Auto-instrumentation 廣度（業界最廣）
Service / Resource / Operation 三層 trace 結構
對應 4.C7 Datadog OTel migration

Logs 配置

子議題：

採集方式：Agent 採集 / Fluent Bit / Vector → Datadog
Indexing vs Archives：indexing 費錢但可查、archives 便宜但只能 rehydrate
Log Pipeline：parsing / enrichment / sensitive data scrubbing
對應 cost 控制：indexing rate / retention

Metrics

子議題:

Custom metrics（DogStatsD / Agent / API）
Metric Type：count / gauge / histogram / distribution
Cardinality 控制：每 metric 收 tags 數限制
對應 4.C2 Gaming cardinality

Deep Article

Datadog 成本治理與 Agent 配置：計價模型、custom metrics 成本控制、Agent 部署配置與常見故障
OTLP Ingestion 與 OTel 整合：Agent OTLP receiver 配置、OTel SDK feature parity、resource mapping 與故障判讀

進階主題（按需閱讀）

成本治理

子議題：

Hosts pricing（vs APM / Logs / Custom Metrics 各自獨立）
Log indexing rate 控制（Exclusion Filters）
Custom metrics billing (per unique combination of metric name and tag values; host is one of those tags)
APM trace sampling
對應 Datadog Usage Attribution

OTLP ingestion

子議題：

Datadog Agent 接受 OTLP（gRPC + HTTP）
對 OTel SDK 用戶的優勢（avoid Datadog SDK lock-in）
對應 4.C7 Datadog OTel migration
Datadog 自家 SDK vs OTel：feature parity 取捨

Monitor as code

子議題：

Terraform Datadog provider：dashboard / monitor / SLO / synthetic
跟 IaC pipeline 整合
多環境（dev / staging / prod）配置

APM Trace Sampling

子議題：

Head-based sampling（rate-based）
Tail-based（Datadog 新功能、需 Agent 支援）
Ingestion vs Indexing sampling 兩層
對應 cost 控制
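
Rate-based head sampling can be illustrated with a deterministic hash of the trace ID, so that every service holding the same trace ID reaches the same keep-or-drop decision. This is a generic sketch, not Datadog's sampling algorithm; `keep_trace` and the use of SHA-256 are assumptions made for the example.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Decide at the head of the trace whether to keep it.

    The decision depends only on the trace ID, so it is stable across services.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < sample_rate * 2**64


# Usage: roughly 10% of trace IDs are kept.
kept = sum(keep_trace(f"trace-{i}", 0.10) for i in range(10_000))
```

Head sampling of this kind decides before the outcome of the request is known, which is why a second layer (what gets indexed after ingestion) is still needed to control cost without losing error traces.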

RUM / Synthetics

子議題：

RUM（Real User Monitoring）：前端用戶體驗
Synthetics：browser test / API test 主動探測
Session Replay
跟 APM 關聯：frontend trace → backend trace

Security Monitoring

子議題：

Cloud SIEM
ASM (Application Security Management, WAF / RASP)
Cloud Security Posture Management
跟 07 security 模組對照

跟 Monitoring 模組的分工

本頁從 server-side APM 平台角度說明 Datadog — agent 部署、cost governance、OTel 遷移、跟 Grafana Stack 的對照。Client-side 的 RUM 體驗（RUM SDK 四種事件、session replay、全棧追蹤的 client 端視角）見 Monitoring 模組 Datadog RUM。

兩者的交叉點是 trace context — RUM SDK 注入的 trace header 讓 client action 跟 server span 串在同一個 trace。沒有 server-side APM 的團隊用 RUM 也有價值（client-side error + performance），但全棧追蹤需要兩邊都部署。

排錯快速判讀

Agent 連不上 Datadog

操作原則：先 datadog-agent status 看 connectivity、再看 API key + region。

APM trace 缺失

操作原則：trace context propagation 在跨 service / 跨 thread 邊界丟失。

# TODO: dd-trace-py debug mode / `DD_TRACE_DEBUG=true`

Log indexing cost 爆

操作原則：indexed log 量超預期、用 Exclusion Filter 過濾不必要 log。判讀：Datadog Usage page 看每 day indexed log。

Custom metrics 爆預算

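Operating principle: billing follows the number of unique metric name + tag value combinations (host counts as a tag), so high-cardinality tags (per-user / per-request labels) multiply the count and blow the budget. Diagnosis: check metric volume in Metrics Summary.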
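
The effect of tag cardinality on custom metric volume can be estimated with simple arithmetic. The sketch below assumes that one custom metric corresponds to one unique combination of metric name and tag values; confirm the exact billing definition against current Datadog pricing documentation. `worst_case_series` and `count_custom_metrics` are hypothetical helpers, and the tag names are made up.

```python
def worst_case_series(tag_cardinalities: dict) -> int:
    """Upper bound on distinct series for one metric name:
    the product of the number of values each tag can take."""
    total = 1
    for distinct_values in tag_cardinalities.values():
        total *= distinct_values
    return total


def count_custom_metrics(points) -> int:
    """Count distinct (metric name, tag set) pairs actually observed."""
    return len({(name, frozenset(tags.items())) for name, tags in points})


# Adding a per-user tag multiplies the series count by the number of users.
bounded = worst_case_series({"env": 3, "region": 4})
unbounded = worst_case_series({"env": 3, "region": 4, "user_id": 10_000})

observed = count_custom_metrics([
    ("checkout.latency", {"env": "prod", "region": "us"}),
    ("checkout.latency", {"env": "prod", "region": "us"}),  # same series
    ("checkout.latency", {"env": "prod", "region": "eu"}),
])
```

Here `bounded` is 12 while `unbounded` is 120000, which is the shape of the budget blow-up: the metric name did not change, only one tag did.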

Monitor noise

操作原則：alert 太多、低品質、用 Composite Monitor + Recovery / No data threshold。

何時改走其他服務

需求形狀	改走
預算敏感	Grafana Stack（OSS）/ Cloud（cheaper）
需要 OSS / self-host	Grafana Stack + Prometheus
High-cardinality debug 深度	Honeycomb
AWS-only + 成本	CloudWatch
純 error tracking	Sentry
多 vendor 標準化	OpenTelemetry + 任一 backend
Logs full-text 為主	Elastic

不在本頁內的主題

各語言 dd-trace SDK 完整 API
Datadog UI 操作詳細
Pricing 詳細計算（用 Datadog Usage page）
600+ integrations 各自設定

案例回寫

直接相關案例

案例	主討論議題
4.C7 Datadog OTel migration	OTLP ingestion + SDK 移轉

跨 vendor 對照

案例	對 Datadog 的對應
4.C1 Fintech audit	Datadog Logs Indexing / Archives 作為審計證據面
4.C2 Gaming cardinality	Custom metrics cardinality 治理
4.C9 OTel migration signal drift	（反例）Datadog SDK ↔ OTLP 雙軌語意漂移
4.C10 規模對照	中大型常選 Datadog turnkey

待補 Datadog 案例：客戶 cost optimization stories、large scale 部署（Shopify / Coinbase / Zoom 等）engineering blog。

下一步路由

上游概念：4.17 Telemetry Data Quality
平行 vendor：OpenTelemetry、Grafana Stack
下游能力：4.20 Observability Evidence Package

DragonflyDB

Fri, 01 May 2026 00:00:00 +0000

DragonflyDB 是 C++ 重寫的 in-memory store、承擔三個責任：Redis / Memcached protocol 相容（drop-in 替換）、shared-nothing 多核架構（充分利用 CPU）、高 memory efficiency。設計取捨偏向「協議相容但效能大幅提升」、宣稱比 Redis 高 25 倍 throughput。授權從 Apache 2.0 改 BSL（Business Source License）、商業使用有限制。

對「需要極高 single-instance throughput、多核機器希望充分利用 CPU、Redis drop-in 但要 scale up 而非 out」這條路徑、DragonflyDB 是值得評估的替代。

本章目標

讀完本章後、你應該能：

跑起 DragonflyDB、用 redis-cli 驗證 protocol 相容
評估從 Redis 遷移的相容性風險（unsupported commands）
看懂 shared-nothing 多核架構跟 Redis I/O thread 的差異
評估 BSL 授權對你的商業使用影響
區分 DragonflyDB 跟 Redis Cluster / Garnet / KeyDB 的選用判讀

最短路徑：5 分鐘把 DragonflyDB 跑起來

# 1. Start DragonflyDB (thread count defaults to the number of CPU cores; multi-core is automatic)
docker run -d --name dragonfly -p 6379:6379 \
  docker.dragonflydb.io/dragonflydb/dragonfly

# 2. Verify with redis-cli (wire-protocol compatible, so redis-cli works as-is)
redis-cli SET foo bar    # → OK
redis-cli GET foo        # → bar

# 3. Check version and threads: DragonflyDB reports a compatible redis_version, its own version, and the thread count
redis-cli INFO server | grep -E "redis_version|dragonfly_version|thread_count|multiplexing_api"
# redis_version:7.4.0          ← client libraries use this to judge compatibility
# dragonfly_version:df-v1.39.0 ← DragonflyDB's own version
# thread_count:8               ← aligned to the CPU core count automatically (shared-nothing multi-core)
# multiplexing_api:epoll

第三步是 DragonflyDB 跟 Redis 的核心差異證據：thread_count 自動對齊 CPU 核數、每個 thread 管自己的 partition（shared-nothing），這是它高吞吐的來源；redis_version:7.4.0 讓既有 Redis client 直接相容、無需改 code。實機驗證於 dragonfly df-v1.39.0、最後檢查日 2026-06-16。實際遷移評估見 Redis 相容邊界。

日常操作與決策形狀

CLI 與 client API

子議題：

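Use redis-cli directly (DragonflyDB speaks the Redis wire protocol, RESP)
Existing Redis client libraries connect unchanged, as long as the commands they issue fall inside the Redis compatibility boundary described in the next section
There is no dragonfly-cli; use the INFO command to confirm the server type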

Redis 相容邊界

DragonflyDB 相容大多數 Redis commands、但部分行為差異。子議題：

支援：Core data types / commands / persistence / pub-sub / transactions
注意：部分 Module 不支援（RedisJSON 有自家版、RedisSearch 沒有）
注意：Lua scripting 支援但效能取捨不同
限制：Cluster mode 採 single-instance scale-up、沒有 Redis Cluster mode（單 instance 已能處理 Redis Cluster 規模）

對應指令：INFO server 確認 dragonfly version + 配置。

配置與調優

子議題：

--threads：thread 數量、預設 CPU core 數
--maxmemory：memory limit、行為跟 Redis 類似
--cache_mode：傳統 cache 模式 vs DragonflyDB 預設模式
--snapshot_cron：snapshot 策略

進階主題（按需閱讀）

Shared-nothing 多核架構

子議題：

每個 thread 管自己的 partition、no shared state
VLL (Very Lightweight Locking) coordinates multi-key operations across threads, where Redis relies on its single-threaded model for atomicity
Hash 分到不同 thread、靠 epoll 跟 io_uring 做 I/O
跟 Redis I/O threads 的對比：Redis 仍 single main thread、只 I/O 多線；DragonflyDB 完全多線
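
The shared-nothing idea, where each key belongs to exactly one partition and no partition shares state with another, can be sketched in a few lines. This is a toy model, not DragonflyDB's implementation: `ShardedStore` is a hypothetical class, CRC32 is an arbitrary stand-in for the real hash function, and in the real system each shard would be owned by its own thread.

```python
import zlib


class ShardedStore:
    """Toy shared-nothing store: each shard owns a private dict."""

    def __init__(self, shard_count: int):
        if shard_count < 1:
            raise ValueError("shard_count must be >= 1")
        self.shards = [dict() for _ in range(shard_count)]

    def shard_for(self, key: str) -> int:
        # The same key always maps to the same shard.
        return zlib.crc32(key.encode("utf-8")) % len(self.shards)

    def set(self, key: str, value) -> None:
        self.shards[self.shard_for(key)][key] = value

    def get(self, key: str):
        return self.shards[self.shard_for(key)].get(key)


store = ShardedStore(shard_count=8)
store.set("user:42", "alice")
owner = store.shard_for("user:42")
```

Single-key commands touch only the owning shard, so they need no cross-thread lock; only multi-key operations spanning shards need coordination.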

Memory efficiency

子議題：

用 dashtable（DragonflyDB 自製 hash table）取代 Redis dict
Snapshot 用 fork-less 機制、避免大記憶體 fork 開銷
同樣 dataset 通常比 Redis 省 20-40% memory（依資料形狀）

BSL 授權影響

子議題：

BSL（Business Source License）：商業使用受限、4 年後轉 Apache 2.0
限制：不可作為 managed DragonflyDB service 對外提供
內部使用無限制（多數企業場景）
對 SaaS 供應商：要審慎評估

跟 KeyDB / Garnet 的對比

子議題：

KeyDB：Redis fork、multi-threaded、Snap 收購後相對停滯
Garnet（Microsoft）：研究用、極高 throughput、生態淺
DragonflyDB：商業化最積極、生態最活躍

Scale-up vs Scale-out

子議題：

DragonflyDB 哲學：single instance 撐到很大規模（廠商宣稱 1TB+ memory / 6.4M QPS）
Redis 哲學：single instance 有上限、靠 Cluster sharding
何時 scale-up 不夠：跨 region / 跨 AZ HA 需求 → 仍需 replica / sentinel

從 Redis 遷移

子議題：

評估 module 使用：列出當前 modules、確認 DragonflyDB 對應
評估 Cluster mode 使用：DragonflyDB 不支援 Cluster mode、要評估能否回到 single instance
遷移路徑：replica 模式雙寫 / 直接 cutover
對應 BSL 授權影響評估

排錯快速判讀

Performance 不如預期

操作原則：先確認 thread 數對齊 CPU core、再看 memory pressure。

redis-cli INFO server | grep -E "dragonfly_version|thread_count"
# dragonfly_version:df-v1.39.0
# thread_count:8                ← must match the CPU core count to use all cores
redis-cli INFO memory | grep -E "used_memory:|maxmemory:"

判讀：thread < core → 沒充分利用 CPU；memory > 50% maxmemory → 影響 throughput。

Command 不支援

操作原則：DragonflyDB 不支援全部 Redis commands、看 dragonflydb.io/docs/api/redis 確認。

判讀路徑：client error「unknown command」→ 確認 DragonflyDB 對應實作狀態。

Cluster mode client 連不上

操作原則：DragonflyDB 不支援 Redis Cluster mode、若 client 配置 cluster mode 會連不上。判讀：改回 standalone client config。

Module 不可用

對應 KeyDB / Garnet 的對照思路：DragonflyDB 自家 modules 偏少、Redis Stack modules 大多沒有 fork。

BSL 授權商業使用問題

操作原則：商業使用前審 license terms、若是 managed service 對外提供、需聯絡 DragonflyDB 取得商業 license。

何時改走其他服務

需求形狀	改走
需要 Redis Cluster mode	Redis / Valkey
需要 OSI 認可開源授權	Valkey
需要 Redis Stack 完整 modules	Redis
純 KV 不需 data types	Memcached
AWS managed	AWS ElastiCache（無 Dragonfly managed）
Multi-threaded Redis fork	KeyDB（停滯中）

不在本頁內的主題

DragonflyDB internal 架構細節（dashtable、VLL 等）
BSL 授權法律解讀（請諮詢律師）
各語言 client 完整對應表
詳細 benchmark methodology

案例回寫

直接相關案例（沿用 Redis-compatible 同源案例 + 待補 DragonflyDB-specific case）

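DragonflyDB released its source in 2022 under BSL, which is source-available rather than an OSI-approved open source license. Its wire protocol is compatible with Redis, so cache pattern cases built on Redis can serve as a reference frame. Production cases are still accumulating.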

案例	對 DragonflyDB 的對應
2.C5 Shopify write-through	Write-through 模式在 DragonflyDB 上行為一致、單 instance 多核可承接更大 throughput
2.C3 Shopify serialization	Payload 雙軌遷移 client-side 實作、DragonflyDB 跟 Redis 共用 API、遷移路徑相同

待補 DragonflyDB-specific 案例：早期採用者 benchmark 報告、從 Redis Cluster 收回 single-instance 的遷移案例、BSL 授權實際商業使用評估、multi-core 加速效果的 production 實測。

跨 vendor 對照

案例	對 DragonflyDB 的對應
2.C10 規模對照	DragonflyDB 擅長 scale-up、中大型 single instance 取代 Redis Cluster 是核心賣點
2.C9 Cache Stampede	TTL jitter 通用、DragonflyDB 行為跟 Redis 一致、多核擴展不會消除 stampede 風險
2.C4 Meta CacheLib + Kangaroo	分層 cache 議題對照、DragonflyDB 強調 memory efficiency 取代 flash tier 的部分需求
2.C1 Meta cache consistency	一致性治理框架通用、但 DragonflyDB 無 Cluster mode、shard move 議題不同（單 instance scope）

下一步路由

上游概念：2.2 Cache Aside、2.3 TTL eviction
平行 vendor：Redis、Valkey
下游能力：2.6 high concurrency
回退路徑：DragonflyDB → Redis/Valkey

Gatling

Fri, 01 May 2026 00:00:00 +0000

Gatling 是 JVM 生態的 load test 工具、承擔三個責任：code-first 強型別 scenario DSL（Scala / Java / Kotlin、編譯期就抓 script bug）、async / non-blocking 引擎（單機高 VU 不靠 thread-per-VU）、Gatling Enterprise 分散式負載與企業 dashboard。設計取捨偏向「強型別 + 高單機 throughput + JVM 既有資產」、跟 k6（JS DX）跟 JMeter（GUI + plugins）的取捨在 dev workflow 跟團隊既有技能。

本章目標

讀完本章後、你應該能：

用 Scala / Java / Kotlin DSL 寫 simulation（scenario + injection profile）
設計 assertion + threshold 接 CI
用 HAR-driven recording 從瀏覽器抓真實 user flow 起 script
評估 Gatling Enterprise 分散式 vs OSS 單機高 VU 的取捨
評估 Gatling vs k6 / JMeter / Locust 的選用條件

最短路徑：5 分鐘把 Gatling 跑起來

# 1. Install
# TODO: brew install gatling / download the bundle / Maven / sbt plugin

# 2. Write the simulation
# TODO: class MySim extends Simulation {
#         val httpProtocol = http.baseUrl("...")
#         val scn = scenario("...").exec(http("get").get("/"))
#         setUp(scn.inject(rampUsersPerSec(1).to(50).during(60))).protocols(httpProtocol)
#       }

# 3. Run
# TODO: gatling.sh -s MySim / mvn gatling:test / sbt Gatling/test

日常操作與決策形狀

Simulation 結構

子議題：

Simulation class（一個檔一個 simulation、整個 test 的根）
scenario(...).exec(...)（一條 user journey 的步驟序列）
httpProtocol（baseUrl / header / acceptedContent / proxy 共用配置）
feeder (feeds data from CSV / JSON / JDBC, combined with a strategy such as .random / .circular)

Injection profile（VU 注入節奏）

子議題：

atOnceUsers(n), rampUsers(n).during(t), constantUsersPerSec(rate).during(t), rampUsersPerSec(a).to(b).during(t), stressPeakUsers(n).during(t) (named heavisideUsers in older Gatling releases)
跟 k6 stages 對照：Gatling 用 injection step composition、k6 用 stages array — 概念近、語法不同
Closed model（固定 VU）vs Open model（固定 rate）— Gatling 兩者都支援、production 流量多半 open model 更貼近
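
The arithmetic behind an open-model injection profile can be worked out before running anything. The sketch below is plain arithmetic, not Gatling code: the function names are hypothetical, the ramp is assumed to be linear, and the concurrency estimate uses Little's law with an assumed mean session duration.

```python
def ramp_total_users(start_rate: float, end_rate: float, duration_s: float) -> float:
    """Users injected by a linear ramp: area under the rate curve."""
    return (start_rate + end_rate) / 2 * duration_s


def ramp_rate_at(start_rate: float, end_rate: float, duration_s: float, t: float) -> float:
    """Arrival rate (users/sec) at time t of a linear ramp."""
    if not 0 <= t <= duration_s:
        raise ValueError("t must fall inside the ramp")
    return start_rate + (end_rate - start_rate) * (t / duration_s)


def expected_concurrency(arrival_rate: float, mean_session_s: float) -> float:
    """Little's law: concurrent users = arrival rate x mean time in system."""
    return arrival_rate * mean_session_s


# A ramp from 1 to 50 users/sec over 60 seconds.
total = ramp_total_users(1, 50, 60)
midpoint_rate = ramp_rate_at(1, 50, 60, 30)
# Assumed 4-second journey at the peak rate of 50 users/sec.
peak_concurrency = expected_concurrency(50, 4)
```

In an open model the concurrency is an outcome (it rises when the target slows down), whereas a closed model fixes it up front; that difference is why open models tend to reproduce production overload more faithfully.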

Assertion + threshold + CI

子議題：

setUp(...).assertions(global.responseTime.percentile3.lt(500), global.successfulRequests.percent.gt(95))
Assertion 失敗時 process exit code 非 0、直接接 CI pass/fail gate
對應 6.13 Performance Regression Gate

進階主題（按需閱讀）

HAR-driven recording

子議題：

Chrome DevTools 匯出 HAR、gatling-recorder 從 HAR 產 simulation skeleton
適合：複雜 user flow（multi-step checkout / form / login redirect）懶得手寫 script
邊界：recording 出來是 baseline、需手動補 dynamic correlation（CSRF token / session id / form state）

Gatling Enterprise（前 FrontLine）

子議題：

分散式 load（多 injector node 模擬 100k+ VU）、跨 region traffic source
Web UI 跑 test、看 dashboard、開 trend analysis
接 Git repo 自動 build simulation、跟 CI / Jenkins / GitLab 整合
對應 Kubernetes vendor 頁的 on-K8s 部署

Async engine 跟單機高 VU

子議題：

引擎基於 Akka / Netty、non-blocking IO、單 thread 可驅動上千 VU
對比 JMeter thread-per-VU 模型、Gatling 單機 VU 上限可高 10x 起跳
邊界：target service 才是瓶頸時、單機更高 VU 也壓不出更多訊號、要走分散式

JVM tuning

子議題：

Heap size（-Xms / -Xmx）跟 GC 策略（G1 / ZGC）影響高 VU 穩定性
Connection pool / file descriptor ulimit 是常見卡關點
Container 跑 Gatling 要注意 CPU / memory request 給足

從 JMeter 遷移

子議題：

JMeter .jmx 沒官方 converter、要人工 port
適合切點：新 simulation 寫 Gatling、舊 .jmx 維護收斂後再評估
對應 JMeter 「既有 .jmx 資產治理」段

排錯快速判讀

單機 VU 上不去

操作原則：JVM heap / ulimit / connection pool 三層先排、再看是不是 target service 已是瓶頸（latency 漲、VU 卻沒滿）。

Response time p99 不穩

操作原則：GC pause（看 GC log）/ network jitter / target service warmup 沒做完。Steady-state 量測前要先 ramp-up + soak 5-10 分鐘。

Assertion 偶發 fail

操作原則：threshold 設在 noise level 附近、把 baseline 重跑 3 次抓 p95 區間、再設 threshold 留 buffer。
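
One way to turn the "rerun the baseline three times, then leave a buffer" advice into a number is sketched below. The nearest-rank percentile, the 20% buffer, and the helper names are all assumptions chosen for illustration; pick the buffer from your own run-to-run variance.

```python
import math


def percentile(samples, p: float) -> float:
    """Nearest-rank percentile of a non-empty sample list."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def suggest_threshold(baseline_runs, p: float = 95, buffer_ratio: float = 0.2) -> float:
    """Worst observed percentile across baseline runs, plus a buffer."""
    worst = max(percentile(run, p) for run in baseline_runs)
    return worst * (1 + buffer_ratio)


# Three baseline runs of response times in milliseconds (synthetic data).
runs = [
    list(range(1, 101)),
    list(range(11, 111)),
    list(range(6, 106)),
]
per_run_p95 = [percentile(run, 95) for run in runs]
threshold_ms = suggest_threshold(runs)
```

Using the worst of the baseline runs rather than their mean keeps ordinary run-to-run noise from tripping the assertion.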

Recording 出來的 script 跑不通

操作原則：HAR 沒抓到 dynamic value（CSRF / session）、要手動加 check(regex(...).saveAs(...)) 把 response 抓出來餵後續 request。

何時改走其他服務

需求形狀	改走
非 JVM 團隊 / JS DX	k6
Python + 動態 user behavior	Locust
GUI 設計 / 既有資產	JMeter
Browser flow load	k6 browser / Playwright + 自製 load harness
Cloud managed	Gatling Enterprise / BlazeMeter / k6 Cloud
Capacity planning（非 CI）	09 performance capacity

不在本頁內的主題

Scala / Kotlin 語言基礎
Gatling DSL 完整 API reference
Gatling Enterprise pricing 跟 deployment model 細節

案例回寫

案例方向	對應主題
LinkedIn：Capacity 與 On-call 分層	JVM 服務的 capacity headroom 與 automated load test
Shopify：BFCM 容量治理與 Game Day	峰值準備期 scenario-driven load test 的對照組

待補 Gatling customer case：金融 / e-commerce 重度 JVM 生態採用 Gatling Enterprise、HAR-driven scenario recording 在 multi-step checkout flow 的實踐。

下一步路由

上游概念：6.13 Performance Regression Gate
平行 vendor：k6、Locust、JMeter
下游能力：09 performance capacity load test 模組

incident.io

Fri, 01 May 2026 00:00:00 +0000

incident.io 是 Slack-native IR 平台、承擔三個責任：把 incident lifecycle 整合在 Slack 內（declare / respond / update / close / postmortem）、自動 timeline + action item tracking、後加 on-call 模組整合 paging。設計取捨偏向「Slack-first + lifecycle automation + 一站式」。

服務定位

incident.io 設計上把 Slack 當成 IR 工作台、不需要在事故中切換 dashboard：宣告、角色指派、status update、stakeholder comms、timeline、action item、postmortem 全部在 Slack channel 完成、PM / leadership / customer-facing team 看 Slack 就能跟上節奏。2023 年起加上 incident.io On-call（取代 PagerDuty 的 alerting / schedule / escalation layer），從純 response orchestration 變成完整 IR + on-call 平台、減少 PagerDuty + Slack bot 雙系統的 state drift。

跟 PagerDuty 比、incident.io 是 response-first、PagerDuty 是 paging-first；組合使用時 PagerDuty 觸發 → incident.io 開 channel 跑 response、現在 On-call 模組讓 incident.io 也能獨立扛 paging layer。跟 FireHydrant 比、兩者定位接近、差別在 incident.io 偏 opinionated workflow（流程預設嚴謹、custom 餘地小）、FireHydrant 偏 customizable + Microsoft Teams 友善。跟 Rootly 比、Rootly 強調 no-code workflow builder 跟 AI 補助、incident.io 強調 catalog-driven service ownership 跟 learning review 結構化。

本章目標

整合 incident.io 到 Slack workspace
配置 incident severity / role / status workflow
設計 catalog（service / team metadata）
用 post-incident flow 自動產 postmortem template
評估 incident.io vs FireHydrant / Rootly、判斷是否要走 On-call 模組合併 PagerDuty

最短判讀路徑

判斷 incident.io deployment 是否健康、最少看四件事：

Slack workflow 完整度：/incident declare 後是否自動開 channel、role bot prompt 是否觸發、status update reminder 是否進 Slack（不靠人記憶 cadence）、stakeholder 是否能在不進 incident channel 的前提下追進度（broadcast channel / status page mirror）
Incident type 設計：severity（SEV1-4）+ incident type（infra / security / customer-facing）+ role 三者是否清楚、severity 定義有沒有歧義（這條是大型 org 最常翻車的地方）
Role assignment 跟交接：commander / scribe / comms / SME 的角色定義、handoff 時 bot 是否 prompt、長 incident（>4hr）的 commander rotation 是否有 fallback
Post-incident learning：close 後是否自動產 postmortem skeleton、action item 是否 sync 到 Jira / Linear 並追完成率、learning review 是否在 N 天內走完（不是寫完 postmortem 就結案）

四件事任一缺失、就是 Drills and On-call Readiness 的待補項目。

最短路徑

# 1. Install the incident.io app in Slack
# 2. Run /incident declare to create the first incident
# 3. Configure severity / role
# 4. Close + retrospective

日常操作與決策形狀

Slack workflow

子議題：

/incident slash command
Auto-created channel（#inc-…）
Role assignment（commander / scribe / comms）
Bot prompts

Catalog + Post-incident flow

子議題：

Service / team / customer metadata
跟 5 deployment service ownership 對齊
Auto timeline from Slack
Action item sync 到 Jira / Linear
Postmortem template + learning review

核心取捨表

取捨維度	incident.io	PagerDuty	FireHydrant	Rootly
主要 surface	Slack-native	Web / mobile app + 通知	Slack + Microsoft Teams	Slack 為主
設計取向	Opinionated workflow、流程預設嚴謹	Paging-first、response 較淺	Customizable workflow、Teams 友善	No-code workflow builder + AI 補助
Paging layer	自家 On-call 模組（2023+）	業界 paging 標準	整合 PagerDuty / Opsgenie	整合 PagerDuty / Opsgenie
Catalog	First-class、service ownership 強	Service directory 較淺	Functionality + service catalog	Service catalog 中等
Learning review	Structured（內建 review cadence）	Postmortems by PagerDuty（需另外 enable）	Retrospectives 工作流	Retrospectives + AI summary
適合場景	Slack-heavy 中型 SaaS、流程要嚴謹	大型 enterprise、paging-critical	多 surface（Slack + Teams）、需要 custom 流程	Slack-heavy、想用 AI 加速 retro / comms 撰寫

選 incident.io 的核心訴求：團隊已 Slack-heavy、想要一套 opinionated workflow 把 IR 從「靠經驗」變成「靠流程」、且願意接受 catalog 維護成本換取 ownership clarity。

進階主題（按需閱讀）

Workflows（custom automation）

子議題：trigger → condition → action 的低代碼自動化、severity-based auto-page、approval gate、跟外部 API 串接（呼叫 Jira / Linear / Statuspage）。重點是 workflow 進 Git 版控、change review 走 PR、不在 console 直改。

Catalog (service ownership + dependency)

子議題：incident.io Catalog 把 service / team / customer / region 等實體建模、incident 宣告時自動帶出 owner team + on-call 名單 + dependent service。對應 5 deployment service ownership 的 service catalog 概念；catalog stale 是常見 anti-pattern、要設 sync source（Backstage / Terraform / IdP group）+ stale alert。

On-call layer integration（2023+）

子議題：incident.io On-call 取代 PagerDuty 的 schedule + escalation + paging。優勢是 single source of truth（不需要 PagerDuty incident ↔ Slack channel state sync）、缺點是 paging reliability 還在追 PagerDuty 的 multi-region failover 成熟度。遷移時走 parallel run（兩邊都 page）2-4 週再切。

Status Page integration

子議題：跟 Atlassian Statuspage / Instatus 整合、auto-sync incident status 到 public page、避免 SRE 手動雙寫造成 stakeholder 看到的狀態跟內部不一致。

AI investigation features（2024+）

子議題：AI summarizer（自動產 incident summary 給 leadership）、suggested actions、postmortem draft。要當 first draft 不是 source of truth、commander 仍負責最終敘事。

排錯快速判讀

Slack outage 時 fallback：incident.io 重度依賴 Slack、Slack 自身 outage 時 IR 工作台會跟著掛 — 要預先準備 out-of-band channel（Zoom war room / Google Meet / 手機群組）、commander handoff 流程要寫進 runbook、不能假設 Slack 永遠在
Slack app 沒回應：bot offline / permission scope 不足 / workspace admin 改了 app 權限 — 檢查 incident.io admin console 的 health status
Incident type 設計過細：SEV 1-5 + 10 種 type + 20 個 role 結果沒人記得選哪個、宣告時 friction 太高反而延遲 declare — 收斂到 3-4 種 type、severity 限 4 級、role 預設帶入
Incident type 設計過粗：所有事故都 SEV2、escalation criteria 不明 — 要寫 severity definition doc、附判讀範例（customer-facing impact / data loss risk / blast radius）
Severity 沒對齊：team severity definition 不一致、設 catalog default + 在 Slack 宣告時 bot 自動 quote 定義
Catalog stale：service owner 離職沒更新、dependency 改了沒同步 — 要從 IdP group / Terraform / Backstage sync、設 stale threshold（>90 天沒更新就 alert owner team）
Action item drift：sync to Jira 失敗 / ownership 不明 — 在 close incident 前 bot 強制要求每個 action item 都有 owner + due date + Jira ticket
Postmortem 沒做：close 後 prompt 沒觸發 / template 太複雜 — 把 template 縮到 5 個必填欄位、其餘 optional、用 AI draft 降低 friction

何時改走其他服務

需求形狀	改走
Microsoft Teams	FireHydrant
No-code workflow / AI	Rootly
Paging-first	PagerDuty
自建 Slack workflow	Slack workflow + GitHub Issues / Linear
Learning-focused	Jeli（PagerDuty 整合）

不在本頁內的主題

Slack app 完整 spec / Custom workflow 細節 / Pricing

案例回寫

incident.io 主打 Slack-native IR：本案例庫尚無直接揭露 incident.io 使用細節的事故；可參照的閱讀脈絡是「以 Slack 為主要協作通道、事故 channel + 公開 status 同步運作」的服務、典型客戶側 profile 是 Slack-heavy 中型 SaaS organization、IR 流程強調 collaboration 跟 learning 而非單純 paging。

案例	對應主題
Slack cases	通訊平台失效時 IR channel 的退路設計
Discord cases	即時通訊產品事故的多通道協作節奏（對照素材）

待補 candidate：Lightspeed / Linear / Etsy 等 incident.io 公開 customer story。

下一步路由

nginx

Fri, 01 May 2026 00:00:00 +0000

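nginx is one of the de facto standards for HTTP server / reverse proxy / load balancer work and carries three responsibilities: L7 HTTP handling (reverse proxy / TLS termination / static content), L4 / L7 load balancing, and the Kubernetes ingress controller (ingress-nginx). Its design leans toward simple configuration, stable performance, and a mature reload mechanism; compared with Envoy it is static and config-driven (no dynamic xDS). Since the F5 acquisition, nginx Plus is the commercial edition, and community forks include Freenginx and angie.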

對「HTTP reverse proxy / LB、TLS termination、K8s ingress、API gateway 入門」這條路徑、nginx 是穩定首選。

本章目標

讀完本章後、你應該能：

寫 nginx config（server / location / upstream）
配置 TLS / mTLS + SNI
設計 rate limiting + connection limit
部署 ingress-nginx 到 Kubernetes
評估 nginx vs nginx Plus / OSS fork（Freenginx / angie）

最短路徑：5 分鐘把 nginx 跑起來

# 1. Write the reverse proxy config first (the container mounts this file, so it must exist before docker run)
cat <<'CONF' > nginx.conf
events { worker_connections 1024; }
http {
  upstream backend {
    server app:8080;   # "app" must resolve at startup, for example a container on the same Docker network
  }
  server {
    listen 80;
    location / {
      proxy_pass http://backend;
      proxy_set_header Host $host;
      proxy_set_header X-Real-IP $remote_addr;
    }
  }
}
CONF

# 2. Start nginx (docker) with the config mounted read-only
docker run -d --name nginx-demo -p 80:80 \
  -v "$(pwd)/nginx.conf:/etc/nginx/nginx.conf:ro" nginx:stable-alpine

# 3. After editing nginx.conf: test, then reload inside the container
docker exec nginx-demo nginx -t            # test config syntax
docker exec nginx-demo nginx -s reload     # reload without restart (zero-downtime config update)

日常操作與決策形狀

nginx config 設計

子議題：

階層：events / http / server / location / upstream
變數：$host / $remote_addr / $http_
Include 拆分大 config
對應指令：nginx -T（dump full config）、nginx -t（test）、nginx -s reload

Reverse proxy 配置

子議題：

proxy_pass / proxy_set_header / proxy_http_version
proxy_buffering / proxy_request_buffering
upstream load balancing (round-robin is the default and has no directive / least_conn / ip_hash)
對應 5.3 LB contract

TLS termination

子議題：

ssl_certificate / ssl_certificate_key / ssl_protocols
SNI（server_name + listen 443 ssl）
mTLS：ssl_client_certificate + ssl_verify_client
對應 07 security TLS 章

進階主題（按需閱讀）

Rate limiting / connection limit

子議題：

limit_req_zone + limit_req（leaky bucket）
limit_conn_zone + limit_conn
跟 knowledge cards rate-limit 對照
對應威脅建模: 2.6 快取威脅建模
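
The leaky bucket behaviour behind limit_req can be modelled as a meter that accumulates "excess" per request and drains at the configured rate. The sketch below is a simplified model written for illustration, not nginx's source: `LeakyBucket` is a hypothetical class, it only decides accept or reject, and it ignores the delay that nginx applies to burst requests when nodelay is not set.

```python
class LeakyBucket:
    """Simplified accept/reject model of a rate limit with a burst allowance."""

    def __init__(self, rate_per_sec: float, burst: int = 0):
        self.rate = rate_per_sec
        self.burst = burst
        self.excess = 0.0   # requests currently "in the bucket"
        self.last = None    # timestamp of the last accepted request

    def allow(self, now: float) -> bool:
        if self.last is None:
            self.last = now
            return True  # first request starts with an empty bucket
        elapsed = now - self.last
        # Drain at `rate`, then add this request.
        excess = max(0.0, self.excess - self.rate * elapsed + 1.0)
        if excess > self.burst:
            return False  # rejected requests do not update the state
        self.excess = excess
        self.last = now
        return True


# rate=1 r/s, burst=2: three back-to-back requests pass, the fourth is rejected.
bucket = LeakyBucket(rate_per_sec=1, burst=2)
decisions = [bucket.allow(0.0) for _ in range(4)]
```

The burst value is therefore the number of extra requests tolerated above the steady rate, which is the knob to tune when legitimate clients send short spikes.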

ingress-nginx for Kubernetes

子議題：

Helm chart 部署
Ingress resource + Annotations 配置
ConfigMap + Snippets（power users）
跟 Traefik / Gateway API 對比

OpenResty / Lua extension

子議題：

OpenResty：nginx + LuaJIT、可寫 Lua handler
ngx_lua: access / content / log phase handler
適合：自訂 auth / dynamic routing
對應 envoy WASM extension 對比

nginx vs nginx Plus / Freenginx / angie

子議題：

nginx OSS（F5 維護）：basic feature
nginx Plus（商業）：active health check / dynamic config API / DNS upstream
Freenginx：2024 社群 fork（不滿 F5 治理）
angie：另一個 fork、多 commercial extension
選擇判讀：dynamic config 重要 → 看 Envoy / Plus；OSS 純社群 → Freenginx / angie

Performance tuning

子議題：

worker_processes / worker_connections
keepalive_timeout / keepalive_requests
sendfile / tcp_nopush / tcp_nodelay
跟 09 performance capacity 對照

排錯快速判讀

502 Bad Gateway

操作原則：upstream 不可達 / 回應錯。判讀：error.log + upstream health。

504 Gateway Timeout

操作原則：proxy_read_timeout / proxy_send_timeout 超過。判讀：upstream 處理時間 vs timeout 配置。

Connection limit / 502 under load

操作原則：worker_connections 不夠、ephemeral port 耗盡、upstream keepalive 不對。判讀：netstat + nginx stub_status。

SSL handshake failure

操作原則：cipher / protocol mismatch、cert chain incomplete、SNI 不對。判讀：openssl s_client -connect host:443 -servername host。

Reload 不生效

操作原則：nginx -t 先 test、新 worker 起來舊 worker drain。若行為怪、檢查是否拿到舊 listening socket。

何時改走其他服務

需求形狀	改走
Dynamic config / xDS	Envoy
Cloud-native auto-discovery	Traefik
AWS managed	AWS ELB（ALB / NLB）
L4 為主 / 高吞吐	HAProxy / NLB
Service mesh	Istio / Linkerd / Consul Connect
API Gateway 進階	Kong / Tyk / Apigee

不在本頁內的主題

完整 nginx directive reference
ngx_lua / OpenResty 完整教學
各 distro nginx 版本差異
nginx internal architecture

案例回寫

跨 vendor 對照

案例	對 nginx 的對應
5.C9 cutover without drain	切流時 nginx upstream / ingress-nginx 沒做 graceful drain、長連線跟 5xx 一起放大
5.C10 規模對照	小型直接 nginx reverse proxy、中型走 ingress-nginx、大型才考慮 envoy 或 service mesh

Pending nginx cases: why freenginx was forked (started by Maxim Dounin in 2024, not by Cloudflare), large-scale ingress-nginx customer cases, and OpenResty extension cases in production.

下一步路由

上游概念：5.3 LB Contract
平行 vendor：Envoy、Traefik、AWS ELB
下游能力：07 security（TLS / WAF）、09 performance

Redis Streams

Fri, 01 May 2026 00:00:00 +0000

Redis Streams 是 Redis 5.0 引入的 append-only log data type、承擔三個責任：輕量 event stream（XADD / XREAD）、consumer group 與 pending entries list（XREADGROUP / XACK）、Redis 生態內整合（避免額外引入 Kafka）。設計取捨偏向「跟 Redis 本體生命週期綁定、低延遲 + 記憶體成本、適合中等規模」。Redis vendor 細節見 02 redis。

對「已用 Redis、需要輕量 stream、不想引入額外基礎設施」這條路徑、Redis Streams 是務實選擇。本頁先給最短路徑、再展開日常 XADD/XREAD 操作與 consumer group 設計、最後進階治理（PEL、retention、Cluster 影響）跟排錯。

本章目標

讀完本章後、你應該能：

用 redis-cli XADD / XREAD 操作 stream
設計 consumer group + XCLAIM 處理 consumer 失敗的訊息接管
看懂 pending entries list（PEL）累積訊號、定位 consumer 健康
設計 MAXLEN / MINID retention 對齊記憶體預算
評估 Redis Cluster 對 Streams 的影響與限制

最短路徑：5 分鐘把 Redis Streams 跑起來

# 1. Start Redis (skip if you already have one)
docker run -d --name redis -p 6379:6379 redis:7

# 2. XADD writes to the stream ('*' lets Redis generate an increasing entry ID)
docker exec redis redis-cli XADD mystream '*' field1 value1

# 3. XREAD reads (from ID 0, at most 10 entries)
docker exec redis redis-cli XREAD COUNT 10 STREAMS mystream 0

# 4. Create a consumer group, then read in group mode ('>' fetches never-delivered entries, which enter the PEL until acked)
docker exec redis redis-cli XGROUP CREATE mystream mygroup 0
docker exec redis redis-cli XREADGROUP GROUP mygroup consumer1 COUNT 10 STREAMS mystream '>'

最短路徑驗證「Redis 起來、stream 能寫能讀」。實際用 consumer group 場景見日常操作。

日常操作與決策形狀

XADD / XREAD / XREADGROUP

子議題：

XADD：寫入 entry、* 自動 ID vs 手動 ID
XREAD：簡單讀取（無 consumer group、適合單 consumer）
XREADGROUP：consumer group 模式、配合 ACK
對應指令範例：XADD、XREAD、XREADGROUP、XACK、XPENDING

Consumer group 與 PEL

Consumer group 是 Streams 的核心抽象、配合 Pending Entries List（PEL）追蹤未 ack 訊息。子議題：

XGROUP CREATE / SETID / DESTROY
XACK：明確 ack
XPENDING：查 PEL 狀態
XCLAIM / XAUTOCLAIM：consumer 失敗時接管訊息
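
The relationship between delivery, the PEL, acknowledgement, and takeover can be modelled in memory. The sketch below is a toy model for reasoning, not a Redis client: `MiniGroup` and its methods are hypothetical names that loosely mirror XREADGROUP with '>', XACK, and XCLAIM with a min-idle-time, and timestamps are passed in explicitly.

```python
class MiniGroup:
    """Toy consumer group: tracks delivered-but-unacked entries in a PEL."""

    def __init__(self, entries):
        self.entries = list(entries)   # [(entry_id, payload), ...]
        self.next_index = 0            # position of the next never-delivered entry
        self.pel = {}                  # entry_id -> {"consumer", "delivered_at"}

    def read_group(self, consumer: str, count: int, now: float):
        """Deliver up to `count` never-delivered entries and record them in the PEL."""
        batch = self.entries[self.next_index:self.next_index + count]
        self.next_index += len(batch)
        for entry_id, _payload in batch:
            self.pel[entry_id] = {"consumer": consumer, "delivered_at": now}
        return batch

    def ack(self, entry_id: str) -> bool:
        """Remove an entry from the PEL; returns False if it was not pending."""
        return self.pel.pop(entry_id, None) is not None

    def claim(self, new_consumer: str, min_idle: float, now: float):
        """Transfer entries idle for at least `min_idle` to another consumer."""
        claimed = []
        for entry_id, info in self.pel.items():
            if now - info["delivered_at"] >= min_idle:
                info["consumer"] = new_consumer
                info["delivered_at"] = now
                claimed.append(entry_id)
        return claimed


group = MiniGroup([("1-0", "a"), ("2-0", "b")])
group.read_group("consumer1", count=2, now=0)
group.ack("1-0")                                      # consumer1 finishes one entry, then stalls
too_early = group.claim("consumer2", min_idle=30, now=10)
taken_over = group.claim("consumer2", min_idle=30, now=40)
```

The claimed entry may already have been partially processed by the stalled consumer, so the new owner still needs idempotent handling.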

Retention：MAXLEN / MINID

子議題：

MAXLEN：保留最近 N 個 entry（近似或精確）
MINID：保留 ID 大於某值的 entry
XADD 寫入時帶 MAXLEN（最常用）
XTRIM 手動修剪
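
Both retention knobs can be derived from a budget with simple arithmetic. The sketch below assumes you have measured an average per-entry size yourself (for example by sampling with MEMORY USAGE) and relies on stream entry IDs having the form `<milliseconds>-<sequence>`; the helper names and the numbers are illustrative.

```python
def maxlen_for_budget(budget_bytes: int, avg_entry_bytes: int) -> int:
    """Largest entry count that fits the memory budget for one stream."""
    if avg_entry_bytes <= 0:
        raise ValueError("avg_entry_bytes must be positive")
    return budget_bytes // avg_entry_bytes


def minid_for_window(now_ms: int, retention_ms: int) -> str:
    """Smallest entry ID to keep for a time-based retention window."""
    return f"{max(0, now_ms - retention_ms)}-0"


# 512 MiB budget with an assumed 256-byte average entry.
maxlen = maxlen_for_budget(512 * 1024 * 1024, 256)
# Keep one hour of entries relative to a given clock reading.
minid = minid_for_window(1_700_000_000_000, 60 * 60 * 1000)
```

MAXLEN bounds memory directly, while MINID bounds age; a stream with bursty traffic usually needs the count bound to protect memory even when a time window is the business requirement.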

進階主題（按需閱讀）

PEL 失敗接管、retention 與 cluster 影響已展開為 deep article：XCLAIM/PEL 失敗接管與 cluster 影響。下列子議題段保留選題判讀入口。

XCLAIM 與 consumer 失敗接管

子議題：

Idle time 判讀（min-idle-time 參數）
XAUTOCLAIM（Redis 6.2+、自動接管）
接管後的去重責任（仍需 idempotency）

Memory 與 retention 取捨

子議題：

Stream 佔用 Redis 記憶體、MAXLEN 是主要旋鈕
近似修剪（~ 標記）vs 精確修剪的性能差異
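Interaction with maxmemory-policy and eviction (eviction never trims entries inside a stream; under allkeys-* policies the whole stream key can still be evicted, so bound stream size with MAXLEN / MINID)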

Redis Cluster 對 Streams 的影響

子議題：

Stream key 只在單一 shard（無 partition 概念）
多 stream 跨 shard 的設計（用 hash tag 控制分布）
Cluster failover 對 PEL 一致性的影響

Stream + Functions（Redis 7+）

子議題：

Redis Functions (Redis 7.0+; still written in Lua, loaded and persisted on the server as an alternative to ad hoc EVAL scripts)
Stream 處理寫成 Redis-side function
適用 / 不適用場景

Redis Sentinel / Cluster 對可靠性的影響

子議題：

Replication lag 對 Streams 一致性的影響
AOF 與 RDB 對 Stream 持久化的差異
Failover 期間 PEL 是否完整

排錯快速判讀

PEL 累積（XPENDING 數字持續增長）

操作原則：先看是單一 consumer 還是整 group 都累積、再定位 consumer 失敗 vs ACK 漏寫。

redis-cli XPENDING mystream mygroup
# Returns the PEL total, the smallest and largest pending IDs, and the pending count per consumer, which shows where the backlog concentrates

判讀路徑：consumer crash 沒 ACK → consumer 慢 → ACK 程式碼漏寫。

Memory pressure（stream 佔用過大）

操作原則：MAXLEN 沒設或設太大、stream 持續增長。判讀：用 MEMORY USAGE 看 stream 佔用、調整 MAXLEN。

跨 shard stream 限制

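Operating principle: Streams have no partitions, so a single stream is bounded by the capacity of one shard. Design: split the workload across several stream keys (different keys hash to different slots), and use hash tags only for keys that must live on the same shard.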

Consumer 重平衡（無原生機制）

操作原則：consumer group 沒有自動 rebalance、要手動 XCLAIM 接管。看 idle time 與 XPENDING 判斷該接管哪些。

Failover 後 PEL 不一致

操作原則：Sentinel / Cluster failover 後、replica 升 primary、PEL 可能不完整。對應 3.C9 語義誤配的思路。

何時改走其他服務

需求形狀	改走
高吞吐 / 長期 retention	Kafka
複雜 routing	RabbitMQ
跨節點 stream（partition + replication）	Kafka / Pulsar
輕量 messaging（不需 Redis）	NATS
Managed queue	SQS / Pub/Sub
Redis Pub/Sub（fire-and-forget）	Redis Pub/Sub（同 Redis、不持久化）

不在本頁內的主題

Redis 本體運維（見 02 cache 模組 redis vendor）
各語言 Redis client 完整 API
Redis Pub/Sub 細節（不是 Streams、語意不同）

案例回寫

Redis Streams 專屬案例（C42-C47）

案例	主討論議題
3.C42 Bitso Reliable Streams	自建抽象層 + DLQ + idempotency
3.C43 Arcjet 取代 Kafka	Janitor 自寫 retention / 6 位數 → $1k
3.C44 Harness event-driven	XAUTOCLAIM head-of-line / 監控缺口
3.C45 Klaxit Rust + Logplex	High-throughput log ingestion / consumer group
3.C46 Learning.com 退場	（反例）長期事件儲存因成本與延遲退場
3.C47 PHP + S3 hybrid	Payload 大小限制 / hybrid storage

跨 vendor 對照

案例	對 Redis Streams 的對應
3.C5 Slack Kafka+Redis	多 broker 組合：Kafka 處理量、Redis 處理即時性
3.C10 規模對照	中等規模 / Redis 生態內 / 不跨 shard

Stream + Functions / Redis Cluster on Streams 缺直接 customer case：公開資料多在 single-instance / Sentinel 規模、Cluster 跟 Functions 案例稀薄、撰寫該段時要明示。

下一步路由

上游概念：0.3 非同步選型、3.1 broker basics
Redis 本體：02 cache 模組
平行 vendor：Kafka、NATS
下游能力：3.4 consumer 設計

AWS IAM

Mon, 18 May 2026 00:00:00 +0000

AWS IAM 是 AWS 的 cloud resource permission engine — 它回答的問題是「這個身份能對哪一個 AWS resource 做哪一個 API call」。它不是 workforce IdP、也不負責「這個人類是誰」的判定。所有 AWS API 流量（無論來自 console 操作、CI pipeline、Lambda、EC2、跨帳號 partner）最終都要經過 IAM 的 policy 評估、IAM 是 AWS 安全模型的根。

服務定位

AWS IAM 是 cloud resource permission engine、人類 workforce 的 SSO 與 lifecycle 應該走 AWS IAM Identity Center 或外部 IdP（Okta / Keycloak）。Identity Center 把人類映射到 Permission Set、Permission Set 在每個目標帳號裡實際上是 AWS-Reserved IAM Role — 也就是說：人類登入走 Identity Center、實際的 API 授權判斷一定回到 IAM。兩層責任分清楚、policy 才不會錯放在「誰是誰」的地方。

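AWS IAM differs sharply from Google Cloud IAM and Azure RBAC in its policy model. AWS is the most expressive: identity-based policy, resource-based policy, Service Control Policy (SCP), Permission Boundary, and Session Policy are five independent layers. The evaluation flow checks them in the order Explicit Deny, Organization SCP, Resource-based, Identity-based, Permission Boundary, Session Policy. This is a sequence of checks, not a ranking in which one layer's allow overrides another: an explicit deny in any layer ends evaluation, and SCP, Permission Boundary, and Session Policy only cap what the other policies grant. The price of that expressiveness is that AWS IAM is the easiest to misconfigure: a wrong S3 bucket policy means public data, a KMS key policy missing one condition lets another account decrypt, and a Trust Policy without ExternalID opens a confused deputy attack surface.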

本章目標

讀完本頁、讀者能判斷：

哪些 IAM first-class concept（User / Group / Role / Policy / STS）對應到自己的場景、哪些要避免（例如：給人類發 IAM User access key）
跨帳號信任、CI / 第三方 SaaS 連進 AWS、service-to-service 認證該走 Role assumption / OIDC trust 還是 Roles Anywhere
SCP、Permission Boundary、resource-based policy 三層上限的疊加方式、何時用哪一層
CloudTrail + Access Analyzer 的稽核 baseline、出事時的最短取證路徑

最短判讀路徑

判斷一個 AWS 帳號的 IAM 配置是否健康、最少看四件事：

誰能 assume 哪個 Role：所有 Role 的 Trust Policy（誰能呼叫 sts:AssumeRole）、有沒有跨帳號 trust、跨帳號 trust 是否帶 ExternalID、有沒有 * 在 Principal 裡
Resource-based policy 暴露面：S3 bucket policy、KMS key policy、Lambda function policy、SNS / SQS policy 是否有 Principal: * 或來自非預期帳號；用 IAM Access Analyzer 找 unintended external access
Permission Boundary 與 SCP 是否生效：開發者建的 Role 是否 attach Permission Boundary（防止 admin 自己給自己升權）、Organization 是否 attach SCP 做整個 OU 的上限
CloudTrail 是否完整、是否進 SIEM：management event 跟 data event 都開、跨 region、跨帳號、保留期符合稽核要求、特定事件（AssumeRole 失敗、root login、CreateAccessKey）接 alert runbook

四件事任一缺失、就是 Authorization 與 Audit Log 邊界的待補項目。

日常操作與決策形狀

Role 設計（cross-account / service / OIDC trust）：所有 持續性 的身份都應該是 Role、不是 IAM User。Service Role（給 EC2 / Lambda / ECS task）是 AWS 內部 service-to-service；Cross-account Role 給 partner 帳號或自家其他帳號用 sts:AssumeRole 進來；OIDC trust 是現代 CI 必備路徑（GitHub Actions / GitLab / 自管 K8s 用短期 OIDC token 換 AWS STS 短期憑證、不在 secret store 存 long-lived access key）。

Policy 種類分工：identity-based policy attach 在 User / Group / Role 上、回答「這個身份能做什麼」。Resource-based policy attach 在 resource 上（S3 bucket、KMS key、SNS topic、Lambda function）、回答「誰能對這個 resource 做什麼」— 同帳號內 identity-based 跟 resource-based 任一個 allow 就通過、跨帳號 兩邊都要 allow。SCP 是 Organization 層級的上限、不是 grant — SCP allow 不會給任何權限、SCP deny 會擋掉整個 OU 的所有 identity。Permission Boundary 是 user 角度的上限、給 admin 用來限制「我把 admin 權限委派給 developer 後、developer 自己建的 role 不能超過這條線」。
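The division of labour above can be condensed into a deliberately simplified decision function. It is a sketch, not the full AWS evaluation logic: session policies and several service-specific exceptions are left out, and the `Request` fields are hypothetical inputs that stand for "this layer has a matching statement".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    explicit_deny: bool      # any applicable policy contains a matching Deny
    scp_allows: bool         # Organization SCPs leave the action permitted
    identity_allows: bool    # identity-based policy has a matching Allow
    resource_allows: bool    # resource-based policy has a matching Allow
    boundary_allows: Optional[bool] = None  # None means no boundary attached
    cross_account: bool = False


def is_allowed(r: Request) -> bool:
    if r.explicit_deny:
        return False         # explicit deny always wins
    if not r.scp_allows:
        return False         # SCP is a ceiling, never a grant
    if r.boundary_allows is False:
        return False         # boundary caps what the identity can use
    if r.cross_account:
        return r.identity_allows and r.resource_allows
    return r.identity_allows or r.resource_allows


same_account = is_allowed(Request(False, True, True, False))
cross_account = is_allowed(Request(False, True, True, False, cross_account=True))
capped = is_allowed(Request(False, True, True, True, boundary_allows=False))
print(same_account, cross_account, capped)
```

The cross-account case is the one that most often surprises people: an Allow on only one side is enough inside an account but not across accounts.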

STS 與臨時憑證：所有 cross-account、service-to-service、人類 console federation 都應該走 STS — sts:AssumeRole（跨帳號 / 跨 role）、sts:AssumeRoleWithSAML（SAML IdP）、sts:AssumeRoleWithWebIdentity（OIDC）、sts:GetFederationToken（外部 broker）。Session 預設 1 小時、最長可設 12 小時（依 Role 設定）。Debug 起手式：aws sts get-caller-identity 確認當前 caller 是誰、是 User、Role 還是 federated session。

Access Key 治理：IAM User 的 long-lived access key 是 最後手段、用於 break-glass 或無法跑 IMDS / Roles Anywhere 的 legacy。所有 access key 走 Secret Management、定期 rotation、IAM Access Analyzer 的 unused access finding 找閒置 key。

CloudTrail / Access Analyzer baseline：CloudTrail organization trail 開到所有帳號、management event 必開、data event（S3 object level、Lambda invoke）依資料敏感度開。Access Analyzer 至少跑 external access（找 resource-based policy 把資源暴露給外部帳號）跟 unused access（找閒置 Role、user、permission）。

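Trust Policy / ExternalID: when a third-party SaaS (monitoring, CSPM, backup) needs access to your AWS account, the Trust Policy of the Role you create for it must require an ExternalID. Without it, another customer of the same SaaS who learns your Role ARN can get the SaaS to assume your Role on their behalf, which is the confused deputy problem (see the official AWS confused deputy documentation). ExternalID is optional for trust between your own accounts and mandatory for third parties.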
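A trust policy that enforces this might look like the sketch below. The account ID and ExternalID value are placeholders, and `third_party_trust_policy` is a hypothetical helper; the `sts:ExternalId` condition key and the overall document shape follow the standard IAM policy format.

```python
import json


def third_party_trust_policy(vendor_account_id: str, external_id: str) -> dict:
    """Build a Role trust policy that only the vendor can use, and only
    when it presents the ExternalID agreed for this customer."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{vendor_account_id}:root"},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


# Placeholder values for illustration only.
policy = third_party_trust_policy("111122223333", "example-external-id")
print(json.dumps(policy, indent=2))
```

The ExternalID should be unique per customer and chosen by the vendor, so that one customer cannot supply another customer's value.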

核心取捨表

取捨維度	AWS IAM	Google Cloud IAM	Azure RBAC
基本單位	Policy（attach 到 identity 或 resource）	Role Binding（principal + role + resource）	Role Assignment（scope + principal + role）
隔離邊界	Account（root）+ Organization SCP	Project / Folder / Org（階層 inherit）	Subscription / Management Group（階層 inherit）
Policy 表達力	高 — identity / resource / SCP / boundary / session 五層	中 — Conditional IAM + Organization Policy	中 — RBAC + Azure Policy 兩層
Resource-based	多 service 支援（S3 / KMS / SNS / SQS / Lambda…）	較少（GCS / Pub/Sub / KMS 等）	較少、多走 RBAC 統一
設定錯誤代價	高 — bucket / key policy 設錯就 public	中 — 較統一但精細度也較低	中 — 階層 inherit 容易誤放

AWS IAM 是 表達力最強、最容易設定錯 的雲端 IAM。Google Cloud IAM 設計較統一、policy model 易讀但精細度有限。Azure RBAC 走 inheritance + scope、靠 Management Group 結構治理。三家都不能直接互換、跨雲環境需要在每家自己的 IAM 模型裡建等價的 least-privilege baseline。

進階主題

Service Control Policy（SCP）：Organization 層級的上限、用來宣告「整個 OU 永遠不能做什麼」 — 例如禁止 root user 操作、禁止關閉 CloudTrail、禁止在非允許 region 建 resource。SCP 是 deny-list 防護網、不是日常授權；日常授權交給 identity-based policy。SCP 過嚴會擋住合法操作、過鬆等於沒設、設計時要對齊 organization 的安全政策骨幹。

Permission Boundary：用在 委派 admin 場景 — 公司想讓 platform team 自己建 IAM Role 給應用、但又不想讓他們建出 admin role。Admin 給 platform team 一個 Permission Boundary policy、platform team 建的所有 Role 都會被這個 boundary 限制上限、就算 attach 了 AdministratorAccess 也只能在 boundary 範圍內生效。

ABAC（attribute-based / tag-based access control）：大規模 multi-account 環境、每個 service 一個 Role 會 Role 爆炸。ABAC 用 tag（principal tag、resource tag、request tag）做 policy condition — 例如「Role 上有 team=payments tag 的人能操作 team=payments tag 的 resource」。設計成立的前提是 tag 來源可信、不能讓使用者自己改 principal tag。

IAM Roles Anywhere：給 AWS 之外的 workload（地端 K8s、其他雲、邊緣設備）用 X.509 憑證換 STS 短期憑證。前提是有一個可信的 PKI（自管 CA 或公開 CA）跟 trust anchor。比起把 IAM User access key 放在地端 secret store、Roles Anywhere 是更安全的設計。

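OIDC trust (GitHub Actions / GitLab CI / third-party CI): the standard way to connect CI/CD to AWS. Create an OIDC identity provider in AWS that points at the CI system's OIDC issuer, restrict the Role's Trust Policy with a condition on the token's sub claim (for GitHub Actions, a value such as repo:org/repo:ref:refs/heads/main), and have the CI workflow call aws sts assume-role-with-web-identity directly. No long-lived AWS access key has to sit in the CI secret store: the OIDC token is issued per job and is short-lived, and the STS credentials it is exchanged for expire when the session duration ends.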
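For GitHub Actions, the resulting trust policy can be sketched as below, together with a small audit check. The account ID and repository are placeholders and both helper functions are hypothetical; the issuer host `token.actions.githubusercontent.com` and the `aud` / `sub` condition keys are the ones GitHub's OIDC integration with AWS uses.

```python
ISSUER = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(account_id: str, repo: str, branch: str) -> dict:
    """Trust policy that lets one branch of one repo assume the Role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{ISSUER}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        f"{ISSUER}:aud": "sts.amazonaws.com",
                        f"{ISSUER}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
                    }
                },
            }
        ],
    }


def restricts_subject(policy: dict) -> bool:
    """Audit helper: every statement must pin the sub claim."""
    for stmt in policy["Statement"]:
        conditions = stmt.get("Condition", {})
        pinned = any(
            key.endswith(":sub")
            for operator in conditions.values()
            for key in operator
        )
        if not pinned:
            return False
    return True


policy = github_oidc_trust_policy("111122223333", "org/repo", "main")
print(restricts_subject(policy))
```

A trust policy that checks only the audience would let any repository on the CI platform assume the Role, which is why the audit helper looks specifically for the sub condition.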

Resource-based policy 跨帳號設計：S3 bucket policy、KMS key policy、SNS / SQS / Lambda policy 都支援跨帳號授權。設計時兩件事必查：Principal 是否包含預期的帳號 / Role ARN、condition 是否限制來源（aws:SourceAccount、aws:SourceArn、aws:PrincipalOrgID）。漏了 condition、就可能讓任何拿到「假裝是某個 service」身份的人都能呼叫 — Capital One 2019 事件本質就是 SSRF 取得 EC2 IMDS 的 Role credential、再用該 Role 的權限去 S3 列舉跟讀取資料、揭示 resource-based policy + identity-based policy 沒有最小化、就會在事故時最大化。

排錯與失敗快速判讀

AccessDenied 但 policy 看起來 allow：先用 IAM Policy Simulator 或 aws iam simulate-principal-policy 重算、確認是 SCP 擋、Permission Boundary 擋、resource-based policy 沒 allow、還是 condition key 不匹配。Explicit Deny 永遠贏。
跨帳號 sts:AssumeRole 失敗：兩邊都要設 — caller 帳號的 identity-based policy 要 allow sts:AssumeRole 到目標 Role ARN、目標 Role 的 Trust Policy 要 allow caller 的 Principal。漏其一就失敗。
S3 bucket 不小心 public：用 Access Analyzer 的 external access finding 找、用 Block Public Access 帳號級別開關擋掉（即使 bucket policy 寫了 public、Block Public Access 也會擋）。常見根因：bucket policy 寫 Principal: * 沒加 condition、或 ACL 殘留歷史設定。
Role / access key 殘留：用 Access Analyzer 的 unused access finding、或 IAM credential report 找超過 90 天沒用的 user / role、配 Failure: Credential Rotation Without Scope 的分域分批 rotation 流程清理
第三方 SaaS Role 缺 ExternalID：稽核第三方 vendor 的 onboarding 文件、若沒要求 ExternalID 是 vendor 自己安全模型有破口、自己這邊也要拒絕這種 onboarding
CloudTrail 落地不全：Organization trail 沒覆蓋新建帳號、data event 沒開、log 沒進 SIEM、保留期不足 — 這四件事都會讓事故發生時拿不到證據

何時改走其他服務

需求形狀	改走
人類員工 SSO 進 AWS	AWS IAM Identity Center
多雲 / SaaS app 統一 SSO	Okta / Keycloak
Customer / B2C identity	Auth0
Google Cloud resource 權限	Google Cloud IAM
Azure resource 權限	Azure RBAC
Secret / API key 治理	7.6 秘密管理與機器憑證治理
Key lifecycle / envelope encryption	AWS KMS vendor 頁（S2 批次撰寫中）+ 7.6 秘密管理與機器憑證治理
事件偵測（CloudTrail 以外）	04 SIEM / detection 工具與 07 SIEM 章節

不在本頁內的主題

IAM policy JSON 語法完整 reference 與所有 condition key 清單
每個 AWS service 的細部 IAM 動作對照
AWS Organization、Control Tower、Landing Zone 完整建置流程
KMS / Secrets Manager / Certificate Manager 的內部細節（見對應 vendor 頁）

案例回寫

案例	跟 AWS IAM 的關係
Microsoft Storm-0558 Signing Key 2023	雖是 Microsoft Entra / Exchange Online 事件、對 AWS cross-account role assumption signing chain 提供對照：ExternalID 設計、HSM-bound key、跨帳號 token 驗證一致性
Failure: Credential Rotation Without Scope	IAM User access key、STS session、Role trust 的 rotation 必須分域分批、不能單一指令打全部
Microsoft Storm-0558 Signing Key Chain (red-team)	對 IAM Roles Anywhere / OIDC trust 的 signing material 治理啟示：trust anchor、key custody、跨環境驗證

下一步路由

上游：7.2 身分與授權邊界、7.13 偵測覆蓋率與訊號治理
平行：AWS IAM Identity Center、Google Cloud IAM、Azure RBAC
下游：7.6 秘密管理與機器憑證治理（AWS KMS vendor 頁 S2 批次撰寫中）
跨模組：8 事故處理 vendor 清單（CloudTrail / Access Analyzer 訊號如何 routing 進 IR 流程）
官方：AWS IAM User Guide、AWS IAM Identity Center User Guide

Google Cloud KMS

Mon, 18 May 2026 00:00:00 +0000

Google Cloud KMS 是 GCP 原生的 key management service、把 envelope encryption、asymmetric signing 與 MAC 等密碼運算集中在受控的 key custodian 內、key material 不離保護邊界。應用端只持 KMS resource name + IAM 權限、用 Encrypt / Decrypt / AsymmetricSign API 把 plaintext 或 hash 送進 Cloud KMS、key 永遠在 Google 管理的 software 模組或 HSM 內運算完才把結果送回。整個 GCP 的 CMEK（Customer Managed Encryption Key）生態都以 Cloud KMS 為錨點 — GCS bucket、BigQuery dataset、Persistent Disk、Cloud SQL、GKE etcd 都可指定一把 Cloud KMS key 做加密、跟 cloud-native 預設加密（GCP 自管 key、客戶看不到）拉出邊界。

服務定位

Cloud KMS 的核心定位是 GCP-native envelope encryption + signing 控制面、用 KeyRing 作為 organizational + locational grouping、CryptoKey + CryptoKeyVersion 作為 key material 的版本軸。跟 AWS KMS 相比、最大差異是 沒有獨立的 Key Policy：權限完全走 GCP IAM（Role Binding 綁到 KeyRing 或 CryptoKey resource）、好處是跟 Google Cloud IAM 統一治理（同一份 IAM audit、同一套 conditional binding）、代價是少了 AWS KMS Key Policy 那種 key-level 的獨立 deny override。

跟 Azure Key Vault 相比、Cloud KMS 拆得更細：Azure 把 secret + key + certificate 合在同一個 Key Vault service、Google 拆成 Google Secret Manager（secret）+ Cloud KMS（key）+ Certificate Authority Service（PKI），各 service IAM、quota、audit 獨立。跟 CloudHSM 相比、Cloud KMS Protection Level=HSM 是 managed HSM（FIPS 140-2 Level 3、Google 顧 cluster）、CloudHSM 是 single-tenant 專屬 HSM（客戶顧 cluster、合規隔離更強）。跟 Vault transit 相比、Cloud KMS 綁 GCP、Vault transit 可跨雲；但 Vault 自己常用 Cloud KMS 當 auto-unseal master key custodian。

本章目標

讀完本頁、讀者能判斷：

KeyRing 該放哪個 location（global / regional / dual-regional / multi-regional）、為何一旦決定無法搬遷
CryptoKey Version + Primary 版本軸怎麼支撐 rotation、何時該 disable / destroy 舊 version
Protection Level（SOFTWARE / HSM / EXTERNAL）跟 Cloud HSM、External Key Manager 的取捨
CMEK 整合 GCS / BigQuery / Persistent Disk 跟 cloud-native default encryption 的邊界差異

最短判讀路徑

判斷一份 Cloud KMS 部署是否健康、最少看四件事：

KeyRing location 對不對：production sensitive key 用 region / multi-region、避免不必要的 global KeyRing；location 一旦設定 不能改、key 也搬不出原 KeyRing — 設錯只能建新 KeyRing + 重新加密所有 ciphertext
IAM Conditions 跟 least privilege：roles/cloudkms.cryptoKeyEncrypterDecrypter 不該綁到 KeyRing level（會放大爆炸半徑）、應綁到具體 CryptoKey；admin 跟 use 角色分離（roles/cloudkms.admin ≠ roles/cloudkms.signer）；敏感 key 加 IAM Condition（時間窗、resource attribute）
Cloud Audit Logs 開到對的層級：Admin Activity（建 key、改 IAM、destroy version）預設開、Data Access（每次 Encrypt / Decrypt / Sign）預設關 — production sensitive key 必須在 IAM audit config 把 Data Access 開、否則「誰用 key 做了什麼」查不到
Protection Level 對齊合規：production 跟 PII / 金融 / 醫療資料的 key 應走 HSM 或 EXTERNAL、SOFTWARE 只給 dev / 低敏感場景；EKM 對應 資料主權（key 物理上不在 GCP）

四件事任一缺失、就是 Audit Log 與 KMS 邊界的待補項目。

日常操作與決策形狀

KeyRing 設計：KeyRing 是 組織單位 + 位置鎖。建議切法：依 環境 + 用途 拆（prod-data-encryption-asia-east1、prod-signing-global、dev-data-encryption-asia-east1），不要全公司一個 KeyRing。Location 選擇：跟資料 colocate（GCS bucket 在 asia-east1 的 key 也放 asia-east1 KeyRing、避免跨區延遲與資料主權問題）；signing key 多半放 global 或 multi-region 提高可用性；CMEK 給 BigQuery 時 KeyRing location 必須跟 dataset location 一致、否則綁不上。一個原則：KeyRing location 是一次性決策、上線前確認跟 cloud resource location + 法規要求對齊。

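CryptoKey Version and Primary: a CryptoKey has multiple versions (projects/.../cryptoKeys/k/cryptoKeyVersions/1, then 2, 3), one of which is the Primary. Every Encrypt call uses the Primary version by default, and Decrypt finds the right version from the version ID embedded in the ciphertext. Rotation is not "swapping the key": it creates a new version and promotes it to Primary, and older versions can still decrypt existing ciphertext unless they are manually disabled or destroyed. Destroy is delayed: the version first sits in a scheduled-for-destruction state and can be restored during that window. The window is set per key at creation (destroyScheduledDuration); 24 hours is the minimum, and current Google documentation lists a longer default of 30 days for newly created keys, so check the value on each key rather than assuming 24 hours. Once the window passes, the ciphertext is permanently undecryptable, so before scheduling a destroy confirm that no remaining ciphertext still depends on that version.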
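The version axis can be illustrated with a toy model. `ToyCryptoKey` is hypothetical and its XOR "cipher" is not real cryptography; the point is only the bookkeeping: encrypt with the Primary, embed the version ID in the ciphertext, decrypt by looking that ID up, and lose the data once the version is destroyed.

```python
import secrets


class ToyCryptoKey:
    """Toy model of a CryptoKey with versions (XOR is NOT real encryption)."""

    def __init__(self) -> None:
        self.versions = {}   # version ID -> {"secret": int, "state": str}
        self.primary = 0
        self.rotate()

    def rotate(self) -> int:
        # Rotation = create a new version and promote it to Primary.
        version_id = len(self.versions) + 1
        self.versions[version_id] = {
            "secret": secrets.randbelow(255) + 1,
            "state": "ENABLED",
        }
        self.primary = version_id
        return version_id

    def encrypt(self, plaintext: bytes) -> bytes:
        secret = self.versions[self.primary]["secret"]
        # The first byte records which version produced the ciphertext.
        return bytes([self.primary]) + bytes(b ^ secret for b in plaintext)

    def decrypt(self, ciphertext: bytes) -> bytes:
        version = self.versions[ciphertext[0]]
        if version["state"] != "ENABLED":
            raise ValueError("key version is not usable")
        return bytes(b ^ version["secret"] for b in ciphertext[1:])

    def destroy(self, version_id: int) -> None:
        # Destroying a version discards its key material for good.
        self.versions[version_id] = {"secret": None, "state": "DESTROYED"}


key = ToyCryptoKey()
old = key.encrypt(b"hello")
key.rotate()
new = key.encrypt(b"hello")
recovered = key.decrypt(old)   # old version still decrypts after rotation

key.destroy(1)
try:
    key.decrypt(old)
    lost = False
except ValueError:
    lost = True
print(old[0], new[0], recovered, lost)
```

The model makes the inventory requirement concrete: rotation alone never invalidates old ciphertext, while destroying a version does so permanently.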

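Auto Rotation: a CryptoKey can be given a rotationPeriod (minimum 1 day; 90 days is a common choice, but a key created without a rotation schedule is never rotated automatically). When the period elapses, KMS creates a new version and promotes it to Primary, and the application needs no code change. Auto rotation applies only to symmetric encryption keys. Asymmetric keys (signing / decryption) do not support it: create a version manually and tell consumers to update the public key. Note that auto rotation changes the key version and does not re-encrypt existing data. Actual data re-encryption is a separate workflow (read the ciphertext back, re-encrypt with the new Primary, write it back) that has to be planned per CMEK-integrated resource.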

Protection Level：SOFTWARE（軟體運算、最便宜、FIPS 140-2 Level 1）/ HSM（Cloud HSM 後端、FIPS 140-2 Level 3、key 物理上在 Google 管理的 HSM cluster）/ EXTERNAL（External Key Manager、key 在客戶自管的外部 HSM、Cloud KMS 把運算委派出去）。Production sensitive key 應走 HSM、SOFTWARE 給 dev / 低敏感場景。Protection Level 是 CryptoKey 建立時決定、不能改 — 要升等只能建新 CryptoKey + 遷移 ciphertext。

CMEK 整合：CMEK 把 Cloud KMS key 綁到 GCS bucket / BigQuery dataset / Persistent Disk / Cloud SQL / GKE etcd / Pub/Sub topic / Dataflow job 等 resource。設定方式：cloud service 的 service account（如 service-PROJECT_NUMBER@gs-project-accounts.iam.gserviceaccount.com）取得該 CryptoKey 的 cryptoKeyEncrypterDecrypter 權限、resource 在加密時自動呼叫 KMS。跟 cloud-native default encryption（GCP 自己管 key）的差異：CMEK 下 客戶可隨時 disable key 讓整個 bucket / dataset 立刻無法解（compliance kill switch）、default encryption 沒這個能力。代價是 KMS 故障 = CMEK-integrated resource 全部讀寫卡住、所以 production KMS 自身 SLA 跟 monitoring 是 cluster-level dependency。

External Key Manager (EKM)：GCP 把 encryption / decryption operation 委派給客戶自管的外部 HSM（Thales、Equinix SmartKey、Fortanix 等）、key 物理上不在 GCP、Cloud KMS 只是個 proxy。適合 資料主權 嚴格的場景（歐盟金融、政府機密、跨境法規）— 客戶撤銷外部 HSM 的存取、GCP 立刻無法解密、達成「Google 看不到資料」的合規承諾。代價：每次 Encrypt / Decrypt 都打外部 HSM、延遲跟可用性受外部 HSM 影響、運維複雜度大幅上升。

IAM 整合：用 Role Binding 控制存取（綁在 KeyRing 或 CryptoKey resource）— roles/cloudkms.cryptoKeyEncrypterDecrypter（Encrypt + Decrypt）/ roles/cloudkms.signer（AsymmetricSign）/ roles/cloudkms.signerVerifier（含 public key 取得）/ roles/cloudkms.admin（建 key、改 IAM）。對應 Google Cloud IAM 的 conditional binding、可加時間窗、resource attribute、access level 條件。跟 AWS KMS 的關鍵差異：沒有 Key Policy — 所有授權都在 IAM、好處是統一治理、代價是少了 key-level 的獨立 deny override（AWS KMS Key Policy 可寫「即使 IAM 給了 admin、仍 deny destroy」、Cloud KMS 要用 Organization Policy 或 IAM Deny 達成類似效果）。

核心取捨表

取捨維度	Google Cloud KMS	AWS KMS	Azure Key Vault	Vault transit
部署模型	GCP managed	AWS managed	Azure managed	self-hosted 或 HCP
跨雲	弱 — 綁 GCP	弱 — 綁 AWS	弱 — 綁 Azure	強 — 同介面跨雲
Multi-region key	用 multi-region KeyRing（key material 在多 region 鏡像）	Multi-Region Key 較直接（單一 key ID、跨 region 自動同步）	支援 geo-replication	跨雲、需自行設計 replication
Key 權限模型	純 IAM Role Binding、無 Key Policy	IAM + 獨立 Key Policy（雙層授權）	RBAC + Access Policy 雙模式	Vault policy（path-based）
HSM 選項	Protection Level=HSM（managed、FIPS 140-2 L3）	AWS KMS HSM-backed（預設）+ CloudHSM（專屬）	Premium tier + Managed HSM	依賴後端 KMS / HSM
外部 key 託管	External Key Manager (EKM)	XKS (External Key Store)	BYOK + Managed HSM	自管 HSM unseal
Audit	Cloud Audit Logs（Data Access 需手動開）	CloudTrail（KMS event 自動進）	Azure Monitor / Activity Log	Vault audit device
CMEK 整合廣度	GCS / BQ / PD / Cloud SQL / GKE etcd / Pub/Sub / Dataflow	S3 / EBS / RDS / DynamoDB / Lambda env	Storage / SQL / Cosmos / Disk	不適用（app-level）
適合場景	GCP-heavy、需 CMEK 整合、Workload Identity Federation 已主導	AWS-heavy、需 Multi-Region Key + Key Policy 精細控制	Azure-heavy、需要 secret + key 統一治理	跨雲、需要 app-level encryption-as-a-service

選 Cloud KMS 的核心訴求：GCP 是主力雲 + 需要 CMEK 把 GCS / BigQuery / PD / Cloud SQL 的加密 key custody 拉回客戶手上 + 接受 IAM-only 授權模型。需要 跨雲統一 key custody 走 Vault transit 或 EKM；需要 單一專屬 HSM 隔離 走 CloudHSM 或 EKM 接 on-prem HSM。

進階主題

External Key Manager (EKM) 與資料主權：EKM 讓 key 物理上不在 GCP、Cloud KMS 變成 proxy 把 cryptographic operation 委派給客戶自管 HSM。常見部署：金融 / 政府用 EKM via VPC（外部 HSM 在客戶 VPC 內、Cloud KMS 走 PSC 連線、延遲較低）、跨境合規用 EKM via Internet（HSM 在第三方 KMS provider、延遲較高但治理邊界更乾淨）。代價：每次 Encrypt / Decrypt = 一次外部呼叫、CMEK-integrated resource 的讀寫吞吐量受外部 HSM 限制、外部 HSM 故障 = 整個 GCP 端讀寫卡住。

Cloud HSM（Protection Level=HSM）：把 CryptoKey 物理上鎖在 Google 託管的 FIPS 140-2 Level 3 HSM cluster 內、key 不可 export、所有 cryptographic operation 在 HSM 邊界內完成。對應 Microsoft Storm-0558 Signing Key 2023 的對照啟示：signing key 一旦能被 export 或從 memory crash dump 撈出、整個信任鏈崩 — HSM-bound key 從設計上斷掉這條路徑。代價：HSM 後端比 SOFTWARE 貴、operation 延遲略高（典型多 < 10ms）、quota 也獨立計算。

Asymmetric Key 做 JWT signing：CryptoKey purpose=ASYMMETRIC_SIGN 配 algorithm（RSA / EC）、app 透過 AsymmetricSign API 把 JWT header+payload 的 hash 送進 KMS、KMS 回 signature。Public key 走 GetPublicKey API 取得、給 JWKS endpoint 對外發布。優勢：private key 不離 KMS、即使 app server compromise 也無法搬走 signing key；劣勢：每次簽名都 round-trip 一次 KMS、高 QPS 場景要算 quota 跟延遲（典型 ~10-30ms / sign）。
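The signing flow can be sketched without calling KMS by injecting the signer as a function. `build_jwt` and `fake_sign` are hypothetical: `sign_digest` stands in for an AsymmetricSign call on an RSA PKCS#1 v1.5 SHA-256 key, which is assumed here because RS256 signatures can be placed in the token as returned (EC signatures would need a DER-to-raw conversion first).

```python
import base64
import hashlib
import json
from typing import Callable


def b64url(data: bytes) -> str:
    """Base64url without padding, as required by JWS."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def build_jwt(claims: dict, kid: str,
              sign_digest: Callable[[bytes], bytes]) -> str:
    header = {"alg": "RS256", "typ": "JWT", "kid": kid}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, claims)
    )
    # Only the SHA-256 digest leaves the application; the private key
    # stays behind the signer.
    digest = hashlib.sha256(signing_input.encode()).digest()
    signature = sign_digest(digest)
    return f"{signing_input}.{b64url(signature)}"


def fake_sign(digest: bytes) -> bytes:
    # Stand-in for the KMS call; a real signer returns an RSA signature.
    return b"not-a-real-signature:" + digest[:4]


token = build_jwt({"sub": "user-1"}, kid="v1", sign_digest=fake_sign)
print(token)
```

Putting the key version in `kid` is one way to let verifiers pick the matching public key from the JWKS endpoint after a manual rotation.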

跟 Google Secret Manager 的 CMEK 整合：Google Secret Manager 預設用 GCP 管的 key 加密 secret、若要 客戶管 key、可設 CMEK 把 GSM 的 secret 用客戶 Cloud KMS key 加密。意義：disable Cloud KMS key 立刻讓 GSM secret 不可讀（compliance kill switch）— 但代價是 KMS 故障 = GSM 也卡住、是強耦合 dependency。

Multi-region key：Cloud KMS 的 multi-region KeyRing（如 us、europe、asia）讓 key material 在多 region 鏡像、提高可用性但加密 / 解密延遲較高。AWS KMS 的 Multi-Region Key 設計不同（單一 key ID 跨 region 同步、有獨立的 primary / replica 角色）— 跨雲遷移 / 多雲 active-active 設計時要留意這個差異、Cloud KMS multi-region 比較像 單一邏輯 key 多 region 可用、不是 多 region 各自獨立可寫。

Import 自有 key material（BYOK）：Cloud KMS 可 import 客戶自產的 key material（透過 wrapping key 包覆後上傳）、適合需要 客戶端 key generation 證據鏈 的合規場景。代價：import 的 key 不能 auto rotate（rotation 必須客戶端重新產 key 再 import），且 SOFTWARE / HSM Protection Level 都支援、EXTERNAL 不適用（EXTERNAL 本來就在外部 HSM、不走 import 路徑）。

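Organization Policy and guardrails: Org Policy, which integrates with Google Cloud IAM, can enforce at organization level that only HSM / EXTERNAL keys are allowed, preventing engineers from creating SOFTWARE keys for sensitive data. The constraint that restricts protection levels is constraints/cloudkms.allowedProtectionLevels; constraints/gcp.restrictNonCmekServices is a different guardrail that requires CMEK for the listed services. A guardrail of this kind is more effective than relying on reviewer discipline and belongs to the same design family as Failure: Credential Rotation Without Scope: rules enforced by the system rather than by discipline.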

排錯與失敗快速判讀

KeyRing location 設錯：KeyRing 建在 global、要綁 asia-east1 的 BigQuery dataset CMEK — 綁不上、location 不能改、只能建新 KeyRing + 重新加密 — 上線前 review KeyRing location 跟 resource location 對齊
Data Access audit 沒開：production 用 Cloud KMS 做 signing、事故時要查 誰用 key 簽了什麼、發現只有 Admin Activity log、沒有 Decrypt / Sign 記錄 — IAM audit config 加 dataAccess log type、留意 audit log 自己會增加成本與 quota
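CMEK key disable stalls every dependent resource: a CryptoKey is disabled for a compliance drill, and reads and writes on the whole GCS bucket start failing. Disable is all-or-nothing, so a drill needs a maintenance window and a rollback plan (access recovers after re-enable)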
Auto rotation 設定 + asymmetric key：以為 asymmetric signing key 也會 auto rotate、上線數月後發現 version 1 還在用 — asymmetric key 不支援 auto rotation、要手動建 version + 通知 JWKS consumer
IAM Role 過寬：給整個 KeyRing cryptoKeyEncrypterDecrypter、單一 service account 可以解所有 key — 改綁到具體 CryptoKey、加 IAM Condition
EKM 外部 HSM 故障：外部 HSM 連線中斷、Cloud KMS 端 Encrypt / Decrypt 全 fail、所有 CMEK-integrated resource 讀寫卡住 — EKM 需要 dual HSM redundancy + Cloud KMS 端 monitoring alert
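Data undecryptable after destroy: once a CryptoKeyVersion's scheduled-for-destruction window has passed, a backup turns out to be encrypted with that version. Before destroy, run an inventory to confirm no ciphertext still depends on that version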

何時改走其他服務

需求形狀	改走
AWS-only 加密 + 需 Key Policy 精細控制	AWS KMS
Azure-only 加密 + 需 secret + key 同治理	Azure Key Vault
跨雲統一 encryption-as-a-service	HashiCorp Vault transit engine
單一專屬 HSM 隔離 / 跨雲合規	CloudHSM
GCP secret 管理（非 key）	Google Secret Manager
GCP IAM 治理基底	Google Cloud IAM
公開憑證 / PKI	Certificate Authority Service（GCP）或 Let’s Encrypt
Secret rotation 證據鏈	7.5 Credential Rotation Scoped Evidence

不在本頁內的主題

Cloud KMS 完整 API reference 跟 gcloud kms CLI 詳盡用法
Cloud HSM partition 內部架構、FIPS 140-2 Level 3 驗證細節
EKM 各 partner（Thales / Fortanix / Equinix）的整合步驟與 API 對照
BigQuery / GCS / Cloud SQL 各自 CMEK 設定的完整教學
Cloud KMS pricing 詳盡計算（key version 數、operation 次數、HSM 加成）

案例回寫

Cloud KMS 在 07 案例庫沒有直接 vendor-level 事件、以下案例採對照引用：

案例	跟 Cloud KMS 的關係（對照）
Microsoft Storm-0558 Signing Key 2023	Cloud KMS Protection Level=HSM 把 signing key 鎖在硬體、不可 export、跟 HSM-bound mindset 同源 — signing key 一旦能 export 整條信任鏈崩
Microsoft Storm-0558 Signing Key Chain (red-team)	Asymmetric Key + Cloud Audit Data Access 是誰用 key 簽什麼的稽核基礎、預設關閉的 Data Access log 在 production 必須開、否則事故時無證據
Failure: Credential Rotation Without Scope	Auto Rotation 是 vendor-controlled、但 CMEK 整合的 GCS bucket / BQ dataset 的 re-encryption schedule 還是要自己管、否則 rotation 只換 key version、舊資料還是用舊 version

下一步路由

上游：7.6 秘密管理與機器憑證治理、7.5 傳輸信任與憑證生命週期（KMS 為 TLS / signing key 的 root custodian）、7.13 偵測覆蓋率與訊號治理
平行：AWS KMS、Azure Key Vault、CloudHSM
平行（secret）：Google Secret Manager、HashiCorp Vault
上游（IAM）：Google Cloud IAM（Cloud KMS 權限完全走 IAM Role Binding）
跨模組：8 事故處理 vendor 清單（KMS 事件如何 routing 進 IR 流程）
官方：Cloud KMS Documentation

Google DLP

Mon, 18 May 2026 00:00:00 +0000

Google DLP（Data Loss Prevention、2023 重新命名為 Sensitive Data Protection / SDP）是 GCP 原生的敏感資料 discovery + classification + transformation 服務。它跟 Microsoft Purview / AWS Macie / Cloud-native data policy 的差異不在「能不能發現 PII」、而在 發現之後能做多少事 — Google DLP 的核心優勢是 transformation 層（masking / Format-Preserving Encryption / tokenization / k-anonymity / differential privacy），不只是 detection。

服務定位

Google DLP 的核心定位是 infrastructure-level 敏感資料治理、跨 GCS / BigQuery / Cloud SQL / 任意 Inspect API input 的 PII 發現與去識別化。三層能力堆疊：Discovery（背景 scan GCS bucket / BigQuery table / Cloud SQL instance 找 PII / payment / credential）、Classification（150+ 預定義 infoType + custom infoType 組合）、Transformation（redact / mask / replace / pseudonymize / Format-Preserving Encryption / k-anonymity / differential privacy）。

跟 Microsoft Purview 比、Purview 走 information protection（sensitivity label + Office docs + Microsoft 365）+ DLP、Google DLP 走 infrastructure-level data scan + transformation；兩者解不同層、企業若 Office docs / SharePoint 為主走 Purview、cloud data warehouse / object storage 為主走 Google DLP。跟 AWS Macie 比、Macie 限 S3 + EBS / RDS snapshot、Google DLP 跨 GCS + BigQuery + Cloud SQL + 任意 Inspect API content（含 streaming / on-prem 透過 API call）。跟 Cloud-native data policy 比、Google DLP 是 detection + transformation、Cloud-native policy 是 access control；production 常組合使用 — DLP 發現敏感欄位 → policy 限制誰能 access → 必要時 DLP transformation 在 query time 自動 redact。

關鍵張力：content scanned 計費 ↔ 偵測覆蓋率。DLP API 按 scanned bytes 計費、整 BigQuery dataset full scan 在 PB-scale 跟 SIEM ingestion 同類痛點。實務應該分 sample scan（每 dataset 抽 1% 找 infoType 分布）+ full scan（高敏感 dataset 才完整 scan）+ streaming scan（write path 即時擋）三層。

本章目標

讀完本頁、讀者能判斷：

Google DLP 在 GCP 資料保護 stack 中承擔哪一段（discovery / classification / transformation）、哪些要外接（Google Cloud IAM 管 DLP service account、BigQuery column-level security 補 access control）
infoType / Inspection Job / transformation 種類的選用判準（什麼場景 mask、什麼場景 FPE、什麼場景 k-anonymity）
計費 trap 的應對（sample scan + full scan 分層、Pub/Sub trigger 避免重複 scan）
何時用 Google DLP、何時走 Purview / Macie / Cloud-native policy 的取捨

最短判讀路徑

判斷 Google DLP deployment 是否健康、最少看四件事：

誰跑 Inspection Job：DLP service account 的 IAM role（roles/dlp.user / roles/dlp.jobsEditor）、能 scan 哪些 project / bucket / dataset、findings 寫進哪個 BigQuery table、誰能讀 findings
infoType coverage：是否覆蓋 organization-specific PII（員工 ID / 客戶 ID 用 custom infoType + dictionary）、預定義 infoType 是否 enable 對應業務的（PCI 場景需 CREDIT_CARD_NUMBER + Luhn check、HIPAA 場景需 healthcare infoType）
Transformation lifecycle：發現 PII 後做什麼（自動 quarantine bucket / 自動 redact view / Pub/Sub trigger Cloud Function）、transformation 是 one-way（mask / redact）還是 reversible（FPE / tokenization 需 key management 走 Cloud KMS）
Cost 治理：scan 頻率 vs scan scope 的策略、是否分 sample / full / streaming 三層、findings retention policy（findings table 本身也是敏感資料、不該無限保留）

四件事任一缺失、就是 Data Protection and Masking Governance 邊界的待補項目。

日常操作與決策形狀

使用模式：Inspect API vs Inspection Job：DLP 有兩種呼叫模式 — Inspect API 走同步單次 scan（小 payload、即時 mask、API 寫入前的 streaming gate）、Inspection Job 走非同步批次 scan（大 dataset、結果存 BigQuery findings table、Pub/Sub trigger 後續 workflow）。production 通常混用：write path（Cloud Function / API gateway）走 Inspect API 即時擋住敏感資料寫進儲存、背景 Inspection Job 對既有 dataset 跑覆盤。

infoType 是 first-class concept：infoType 不是 regex、是 PII 分類單位。預定義 150+ 種（CREDIT_CARD_NUMBER / EMAIL_ADDRESS / US_SOCIAL_SECURITY_NUMBER / IP_ADDRESS / GENERIC_ID / PERSON_NAME 等）、各帶內建驗證邏輯（CREDIT_CARD_NUMBER 內建 Luhn check 比純 regex 精準、減少 FP）。Custom infoType 三種：regex pattern（自訂 regex）、dictionary（明確 token list、例員工 ID 全集）、hotword rule（context-aware、附近出現特定字才認、例「身分證」附近的數字才認 ID）。FP rate 直接由 infoType 精度決定、production rule 應該優先用預定義 infoType + hotword 限縮。
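Why a built-in checksum reduces false positives can be shown with a minimal sketch. It is an illustration of the idea only, not how DLP implements CREDIT_CARD_NUMBER: the regex and helper names are assumptions, and the sample numbers are a well-known test card number and an arbitrary 16-digit string.

```python
import re

# Naive pattern: any run of 13 to 19 digits looks like a card number.
CARD_PATTERN = re.compile(r"\b\d{13,19}\b")


def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right."""
    total = 0
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_card_numbers(text: str) -> list:
    """Keep only regex matches that also pass the checksum."""
    return [m for m in CARD_PATTERN.findall(text) if luhn_valid(m)]


text = "order 1234567890123456 paid with card 4111111111111111"
regex_only = CARD_PATTERN.findall(text)
validated = find_card_numbers(text)
print(regex_only, validated)
```

The regex alone flags both numbers, while the checksum step drops the order number; a hotword rule would narrow the result further by requiring nearby context such as "card".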

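Transformations go well beyond masking: the transformation layer is what separates DLP from discovery-only tools. Redact removes the value entirely (the field does not appear in the query result). Mask keeps the length and replaces characters (****1234). Replace substitutes a fixed string ([REDACTED]). Pseudonymize / tokenization produces a consistent token (the same input always yields the same output, so joins still work); whether it is reversible depends on the method, with a keyed hash being one-way and deterministic encryption being reversible by key holders. Format-Preserving Encryption (FPE) is reversible encryption that preserves length and format (the key lives in Cloud KMS; analysts query the anonymized data and reverse it only through an authorization flow). k-anonymity / l-diversity require that data be aggregated until at least k records share the same quasi-identifiers before release (guarding against quasi-identifier re-identification); in DLP these are measured through risk analysis and then enforced with generalization or bucketing rather than applied as a single transform. Differential privacy adds noise to guarantee statistical privacy for aggregated analytics and is typically applied at the query layer rather than by a DLP transform. The last three matter most for production analytics: the goal is not to hide the data but to keep it usable while protected.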
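The difference between the simpler transformation types can be seen in a few lines of plain Python. These helpers are hypothetical stand-ins, not the DLP API: `pseudonymize` uses a keyed HMAC as one example of a consistent, one-way token, and the key shown is a placeholder that would normally come from a key management service.

```python
import hashlib
import hmac


def mask(value: str, keep_last: int = 4, char: str = "*") -> str:
    """Keep the length, hide everything except the last few characters."""
    visible = value[-keep_last:] if 0 < keep_last < len(value) else ""
    return char * (len(value) - len(visible)) + visible


def replace(_value: str, placeholder: str = "[REDACTED]") -> str:
    """Substitute a fixed string; length and content are both lost."""
    return placeholder


def pseudonymize(value: str, key: bytes) -> str:
    """Consistent one-way token: same input and key give the same output."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()[:16]


key = b"placeholder-key-for-illustration"
card = "4111111111111111"
masked = mask(card)
token_a = pseudonymize("alice@example.com", key)
token_b = pseudonymize("alice@example.com", key)
token_c = pseudonymize("bob@example.com", key)
print(masked, replace(card), token_a == token_b, token_a == token_c)
```

Because the token is stable, two datasets pseudonymized with the same key can still be joined on it, which masking and replacement do not allow.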

跟 BigQuery 深度整合：DLP 可 inline scan BigQuery column、findings 自動寫回 metadata。配合 BigQuery column-level security（policy tag）+ authorized view 做「敏感 column 只給特定 role + 自動 redact 給其他 role」。Production 模式：DLP Inspection Job 跑完後、自動 apply policy tag 到含 PII 的 column、無 tag access 的 query 自動失敗或 mask。

跟 Cloud Storage 整合：可 schedule 掃 bucket 整批檔案、發現後可自動 quarantine（移到隔離 bucket、不同 IAM、警告 owner）。對應 LastPass 2022 Backup Chain 的對照：backup bucket 應該獨立 DLP scan、含 credential 的 backup 走獨立 quarantine bucket + 不同 IAM 邊界、不是放在跟 dev backup 同一個 bucket。

Pub/Sub trigger workflow：Inspection Job 完成後可 publish 到 Pub/Sub topic、Cloud Function 訂閱後執行 — 自動 quarantine / 自動通知 owner / 自動寫進 SIEM findings index / 觸發 BigQuery policy tag update。這是 detection → response 自動化的 first-class pattern、不是後加的 webhook。

IAM 邊界：DLP service account 需要讀 source data（roles/storage.objectViewer / roles/bigquery.dataViewer）+ 寫 findings（roles/bigquery.dataEditor to findings dataset）+ 呼叫 DLP API（roles/dlp.user）。service account 本身是高敏感 — 它能讀整個 organization 的 PII、應該走 short-lived credential（Workload Identity Federation）+ 嚴格 audit。

核心取捨表

取捨維度	Google DLP	Microsoft Purview	AWS Macie	Cloud-native data policy
核心能力	Discovery + classification + transformation	Sensitivity label + DLP + Office docs	Discovery + classification（無 transform）	Access control + column-level security
Data source 範圍	GCS + BigQuery + Cloud SQL + 任意 Inspect API	Microsoft 365 + SharePoint + Azure data	S3 + EBS / RDS snapshot 限定	BigQuery / S3 / Snowflake 各自 native
Transformation	mask / FPE / tokenize / k-anonymity / DP（全套）	redact + Office sensitivity label	無 — 只 detection	無 — 只 access control
計費模型	按 content scanned（GB）	按 user / asset / 流量	按 storage scanned（GB） + bucket count	多半含在 cloud platform、policy 規模相關
Custom 分類能力	infoType (regex + dictionary + hotword)	sensitive info type + classifier (ML)	managed data identifier + custom	tag-based / column-level、無 content scan
Healthcare / PHI	Cloud DLP for Healthcare（FHIR / DICOM）	Purview Healthcare data + Microsoft 365 PHI	有限	無原生 PHI 認知
適合場景	GCP-first + BigQuery / GCS 為 PII 儲存層	Microsoft 365 / Office docs / SharePoint 為主	AWS-only + S3 為 PII 儲存層	已知敏感 column、想做 access control 不做 mask
退場成本	中 — transformation 邏輯耦合 DLP API	高 — sensitivity label 跟 Microsoft 365 深綁	低 — 只是 finding 跟 alert	低 — policy 是 metadata

選 Google DLP 的核心訴求：GCP 為主資料平台 + BigQuery / GCS 有大量 PII + 需要 transformation（不只 detection）+ 合規（GDPR / HIPAA / PCI）需要 column-level redaction / tokenization。on-prem 為主或 Office docs 為主走 Purview、AWS-only 走 Macie + S3 policy。

進階主題

Custom infoType 三層組合：production 自家業務的 PII（員工 ID / 客戶 ID / 內部 case ID）需要 custom infoType。三種組合：regex 抓 pattern（員工 ID 格式 EMP-\d{6}）、dictionary 抓明確 token list（內部 case ID 全集、月更新）、hotword 限縮 context（附近出現「員工」「ID」才認、避免一般 6 位數字誤判）。三者組合的 FP rate 比單獨 regex 低一個量級。

Format-Preserving Encryption (FPE) vs Tokenization：兩者都產生「外觀像原值但不是原值」的替換。FPE 是可逆加密、key 在 Cloud KMS、analyst 在 anonymized data 工作 + 必要時走授權流程 reverse（例：客服需要看完整信用卡號處理退款）。Tokenization 是 deterministic mapping、同樣 input 給同樣 output、可做 join 分析但 token table 不存（理論上不可逆、實務上看 implementation）。選擇判準：需要分析 join 同一 user 跨 dataset 用 tokenization、需要授權 reverse 用 FPE、只要遮蔽不需要還原 用 mask / redact。

k-anonymity / l-diversity / differential privacy：解決 quasi-identifier re-identification 問題 — 即使欄位不是直接 PII（如 ZIP + 性別 + 年齡）、組合起來能反推個人。k-anonymity 保證每個 record 在 quasi-identifier 上至少跟 k-1 個其他 record 一樣（典型 k=5）。l-diversity 進一步保證 sensitive attribute 在每組內至少 l 個不同值（防止 homogeneity attack）。Differential privacy 加 calibrated noise 到 aggregate query 結果、保證個別 record 加入或刪除對結果影響有 bound。Risk Analysis API 可估算 dataset 的 k-anonymity / l-diversity 風險、不需要先 transform 才知道風險。
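Both metrics reduce to grouping records by their quasi-identifiers, as the sketch below shows. The dataset and column names are made up for illustration, and the functions compute the metrics locally rather than through the Risk Analysis API.

```python
from collections import Counter, defaultdict


def k_anonymity(rows: list, quasi_identifiers: list) -> int:
    """Size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(groups.values())


def l_diversity(rows: list, quasi_identifiers: list, sensitive: str) -> int:
    """Smallest number of distinct sensitive values within any group."""
    groups = defaultdict(set)
    for r in rows:
        groups[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return min(len(values) for values in groups.values())


rows = [
    {"zip": "100", "gender": "F", "age": "30-39", "diagnosis": "flu"},
    {"zip": "100", "gender": "F", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "100", "gender": "F", "age": "30-39", "diagnosis": "flu"},
    {"zip": "104", "gender": "M", "age": "40-49", "diagnosis": "flu"},
    {"zip": "104", "gender": "M", "age": "40-49", "diagnosis": "flu"},
]
qi = ["zip", "gender", "age"]
k = k_anonymity(rows, qi)
l = l_diversity(rows, qi, "diagnosis")
print(k, l)
```

The second group is 2-anonymous yet every member has the same diagnosis, so knowing someone is in that group reveals the sensitive value. That is the homogeneity attack l-diversity is meant to catch.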

跟 Cloud DLP for Healthcare 整合：FHIR / DICOM 格式的 PHI 有專屬 transformation pipeline。FHIR resource 的特定欄位（patient name / MRN / birth date）按 HIPAA Safe Harbor 自動遮罩、DICOM image 的 metadata 跟 burned-in text 都可 redact。Healthcare 場景的 PHI 治理跟一般 PII 不同 — 不能直接 mask 全部、要保留 clinical utility（年齡轉年齡段、ZIP 保留前三碼）。

跟 BigQuery column-level encryption：BigQuery 原生支援 AEAD encryption function、可用 KMS-managed key 對 column 做 cell-level encryption。DLP 可在 ingestion 階段先 tokenize、BigQuery query 階段配合 column-level security 做 access-time decryption。是「detection（DLP）+ classification（policy tag）+ encryption（AEAD）+ access control（column-level security）」的完整 stack。

排錯與失敗快速判讀

DLP scan 找不到明顯 PII：infoType 沒 enable / 預定義 infoType 對 organization-specific 格式不認 — 加 custom infoType + hotword、跑 sample scan 驗證 coverage
FP rate 太高 / findings 淹沒：infoType 太寬 / hotword 沒設 — 加 likelihood threshold（VERY_LIKELY / LIKELY）、custom infoType 加 hotword 限縮 context
Scan cost 暴衝：每次都 full scan 整個 dataset / 沒分層 — 改 sample scan（每 dataset 1%）+ 高敏感 dataset 才 full scan + streaming scan 守 write path
Inspection Job 跑超久 / timeout：dataset 過大 / 沒 partition — 切 partition by date、Job concurrency 提高、避免單 Job 跨整個 organization
Transformation 後 analyst 無法工作：mask / redact 全部、保留不下 utility — 改 FPE / tokenization 保留 join 能力、k-anonymity 保留 statistical utility
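The findings table itself becomes a PII leak surface: findings carry the matched sample value when includeQuote is enabled, and the findings table has no IAM of its own. Keep includeQuote: false (verify it in every inspect template and job config), and put the findings table in a separate dataset with strict IAM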
DLP service account 權限太大 / 沒 audit：service account 能讀全 organization PII、用 long-lived key — 改 Workload Identity Federation + short-lived credential + Cloud Audit Log 監控 DLP API call

何時改走其他服務

需求形狀	改走
Microsoft 365 / Office docs 為主	Microsoft Purview
AWS-only + S3 為 PII 儲存層	AWS Macie
只要 access control 不要 transformation	Cloud-native data policy
Secret / credential scanning（非 PII）	GitGuardian / Gitleaks
Data lineage / catalog	Dataplex / Atlan / Collibra
KMS / key management for FPE	Google Cloud KMS
SIEM ingestion of DLP findings	Splunk / Chronicle

不在本頁內的主題

預定義 infoType 完整 list 跟各自 detection 邏輯（150+ 種、見官方 InfoType reference）
Cloud DLP for Healthcare 的 FHIR / DICOM 完整 pipeline 細節
BigQuery column-level security / policy tag 的 policy 設計（屬 Data Governance 章節）
GDPR / HIPAA / PCI 合規逐條對應（屬 7.8 資料駐留與刪除證據鏈跟 7.4 資料保護與遮罩治理章節）
Differential privacy 的數學定義跟 epsilon budget 設計

案例回寫

Google DLP 在 07 案例庫沒有直接 vendor-level 事件、但所有資料外洩 / 敏感資料治理 case 都是 DLP 控制覆蓋率的對照：

案例	跟 Google DLP 的關係（對照啟示）
Snowflake 2024 Credential Abuse	資料平台 export 流程應該有 DLP scan gate — query result 含批量 PII / 整 table dump 直接 alert 或自動 redact、不是事後審 audit log
Mailchimp 2023 Support Tool Abuse	客服工具的客戶資料 export 應走 DLP Inspect API、單次 export 超過 N 筆 PII 或含 credential 直接擋住 + 觸發 alert、不靠 rate limit 一招
LastPass 2022 Backup Chain	Backup bucket 應該獨立 DLP scan、含 credential / token 的 backup 自動 quarantine 到獨立 bucket + 不同 IAM、不是跟 dev backup 同 bucket 同 IAM
Data Protection and Masking Governance (section)	Google DLP 是 transformation 工具的代表、章節原則對應 mask / FPE / tokenization / k-anonymity 的選用判讀
Data Residency Deletion and Evidence Chain (section)	DLP findings 是 deletion 證據鏈的一部分 — 哪些 PII 在哪些 dataset、deletion 後是否 re-scan verified、findings history 是 GDPR right-to-erasure 的稽核證據

Next routes

Upstream: 7.4 Data Protection and Masking Governance, 7.11 Data Residency, Deletion and Evidence Chain
Parallel: Microsoft Purview, Cloud-native data policy
Adjacent IAM: Google Cloud IAM (DLP service account governance), Google Cloud KMS (FPE / tokenization keys)
SIEM route: Splunk (DLP findings into SIEM correlation)
Cross-module: 8 Incident response vendor list (DLP alert → IR handoff)
Official: Google Cloud Sensitive Data Protection Documentation

Snyk

Mon, 18 May 2026 00:00:00 +0000

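Snyk is a developer-first, cross-SCM, multi-module application security platform. It brings SCA, SAST, container scan, IaC scan, and CSPM into one dashboard, and the five modules share one Project / Issue / Fix model. Once connected to any of GitHub / GitLab / Bitbucket / Azure Repos, Snyk pulls the repo, creates Projects per manifest, and opens fix PRs for the Issues it finds. Compared with GitHub Advanced Security, Snyk spans SCMs and tech stacks; compared with Trivy, Snyk is commercial SaaS with wider coverage, but its annual fee is priced per Project.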

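Service positioning

Snyk's core positioning is one tool and one dashboard for SCA + SAST + IaC + Container + Cloud. The five modules are Snyk Open Source (SCA, dependency vulnerabilities), Snyk Code (SAST), Snyk Container (image scan), Snyk IaC (Terraform / CloudFormation / K8s manifest security), and Snyk Cloud (CSPM, cloud configuration drift). They share the Project / Target / Organization / Issue model, so Issues can be prioritized together across modules. For organizations with multiple SCMs and multiple tech stacks, Snyk is more integrated than assembling GHAS + Trivy + Dependabot.

The core difference from GitHub Advanced Security is the deployment model and SCM scope. GHAS is bound to GitHub, runs on GitHub Actions, and integrates more deeply with PRs (Code Scanning alerts appear directly in PR review). Snyk is SaaS and SCM-neutral, but needs an OAuth connection to each repo. For organizations on GitLab / Bitbucket / Azure Repos, or on several SCMs at once, Snyk is the natural choice.

Compared with Trivy: Trivy is OSS, focuses on container + IaC, and suits self-hosted use inside CI. Snyk is commercial SaaS with wider coverage (including SAST and Reachability) and suits organization-level governance with one dashboard across teams. With Trivy you run a tool; with Snyk you buy governance.

Key tension: the Snyk Project is the billing unit. Each manifest counts as one Project (a repo with package.json + requirements.txt + Dockerfile = 3 Projects). Large monorepos inflate the count easily and need project filter / archive governance, otherwise the annual fee gets out of control.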
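The one-manifest-one-Project rule above can be turned into a rough pre-import estimate. The sketch below is only an illustration: the manifest name set and the excluded directory names are assumptions chosen for the example, not the list Snyk actually detects, and `estimate_projects` is a hypothetical helper.

```python
from pathlib import Path

# Assumed manifest names for illustration; the set Snyk really detects is larger.
MANIFEST_NAMES = {"package.json", "requirements.txt", "pom.xml", "go.mod", "Dockerfile"}
# Directories a team would typically exclude from import (an assumption).
EXCLUDED_DIRS = frozenset({"node_modules", "vendor", "fixtures"})


def estimate_projects(repo_root, excluded_dirs=EXCLUDED_DIRS):
    """Return manifest paths that would each count as one Project."""
    root = Path(repo_root)
    found = []
    for path in root.rglob("*"):
        if not path.is_file() or path.name not in MANIFEST_NAMES:
            continue
        parent_parts = path.relative_to(root).parts[:-1]
        # Skip manifests that live under an excluded directory.
        if any(part in excluded_dirs for part in parent_parts):
            continue
        found.append(path.relative_to(root).as_posix())
    return sorted(found)
```

Running it once with the default exclusions and once with `excluded_dirs=frozenset()` gives a quick view of how many Projects a filter would save in a monorepo.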

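Goals of this chapter

After reading this page, the reader can judge:

Which segment of the application security stack each of Snyk's five modules covers, and which parts need other tools
The Project billing model, the Project explosion risk in monorepos and multi-manifest repos, and the governance path
The value and limits of Reachability analysis: when it reduces noise and when it misjudges
When to use Snyk and when to go with GHAS / Trivy / Dependabot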

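Shortest reading path

To judge whether a Snyk setup is healthy, check at least four things:

Who can enable Snyk: admin / collaborator role assignment in the Organization, Service Account token scope (do not run CI with a personal API token; use a Service Account with a scoped token), and whether the Audit Log is synced to the SIEM
Project import governance: which manifests each SCM target imports automatically, whether a project filter excludes test fixtures / vendored dependencies, whether archived projects really stop being billed, and whether the monorepo is controlled through a .snyk policy file
Whether Reachability analysis is enabled: Snyk Code + Open Source integration with call graph analysis of "does my code really call the vulnerable function". It greatly reduces noise from transitive dependencies that are present but unreachable, and should be enabled in production
Whether SBOM export runs in the release pipeline: whether CycloneDX / SPDX output is exported regularly, whether it enters the supply chain integrity flow, and whether compliance requirements (EO 14028 / NIS2) are covered

If any of the four is missing, it is an open item at the audit-log and supply chain governance boundary.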

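Daily operations and decision shapes

Project / Target / Organization model: the Organization is the billing and RBAC boundary and maps to one team or one BU. A Target is one SCM source (one GitHub repo / one container registry image / one Terraform stack). A Project is a single scan unit inside a Target (one manifest or one image tag). An Issue is a discovered vulnerability / license / misconfiguration with severity (Critical / High / Medium / Low), CVSS, exploit maturity, and fix availability. The usual root cause of Project explosion is that every nested manifest in a monorepo gets auto-imported; exclude them with .snyk or an import filter.

Division of the five modules: Snyk Open Source (SCA) scans package manifests (npm, pip, Maven, Go modules, Composer, NuGet, and 20+ ecosystems in total) against the Snyk Vulnerability DB (maintained in-house to make up for NVD delay). Snyk Code (SAST) scans source code with symbolic execution + ML and covers OWASP Top 10 and CWE. Snyk Container scans the image base layer and installed packages and supports Docker / OCI / ECR / GCR / Harbor. Snyk IaC scans Terraform / CloudFormation / K8s YAML / Helm charts against CIS Benchmark plus custom policy. Snyk Cloud (added after the 2022 acquisition of Fugue) is CSPM: it scans AWS / Azure / GCP runtime configuration and does IaC drift detection (the difference between actual cloud state and Terraform state).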

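Snyk Code (SAST) vs GHAS CodeQL: Snyk Code does fast inline scans (feedback in seconds, via cloud inference) and suits the dev loop. CodeQL runs deep dataflow queries (minutes, slower but more expressive) and suits the release gate. Using both is not a contradiction: Snyk Code gives quick signals in the IDE / PR, and CodeQL runs the deep check before release.

Reachability analysis: unlike matching a plain dependency list against CVEs, Snyk combines Snyk Code (SAST) and Snyk Open Source (SCA), builds a call graph, and judges whether "my code really calls the vulnerable function". In practice, most CVEs in transitive dependencies are not reachable in your app (the library you pulled in never calls that path), so after Reachability filtering the list can drop from hundreds of Critical / High findings to a handful that are truly exploitable. Limits: only some languages are supported (Java / JS / Python / Go are the most complete), and dynamic dispatch / reflection / runtime plugin loading can be marked reachable (false positive) or unreachable (false negative). Do not trust it blindly; it is a prioritization signal, not a binary verdict.

Fix advice / Auto PR: after finding a vulnerability, Snyk opens a PR that upgrades to the minimum fix version (including the root-cause upgrade for transitive dependencies). This overlaps with Dependabot. The differences are that Snyk works across SCMs (not only GitHub) and that its fix advice carries a Reachability label (PRs for reachable vulnerabilities rank higher). If you run both, turn one off, otherwise PR volume doubles.

CI integration: the snyk CLI (snyk test / snyk monitor / snyk container test / snyk iac test) reads the SNYK_TOKEN environment variable and runs in any CI. The official Snyk Action (GitHub Actions) and the Jenkins / GitLab CI / CircleCI plugins are wrappers. For the release gate, run snyk test --severity-threshold=high --fail-on=upgradable after the build, so that only upgradable high+ vulnerabilities block. Blocking a release on a vulnerability with no fix is pointless; handle those with a temporary ignore in the .snyk policy plus an alert.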
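The gate logic behind `--severity-threshold=high --fail-on=upgradable` can be sketched as a filter over a JSON report. This is a minimal illustration, not the CLI's implementation: the field names `vulnerabilities`, `severity`, and `isUpgradable` are assumptions about the shape of `snyk test --json` output, and `blocking_issues` is a hypothetical helper.

```python
# Severity ranking used to compare against the threshold.
SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def blocking_issues(report, threshold="high"):
    """Return issues at or above the threshold that have an upgrade available.

    `report` is assumed to look like {"vulnerabilities": [{"severity": ...,
    "isUpgradable": ...}, ...]}; verify the field names against real output.
    """
    floor = SEVERITY_ORDER[threshold]
    return [
        vuln
        for vuln in report.get("vulnerabilities", [])
        if SEVERITY_ORDER.get(vuln.get("severity"), -1) >= floor
        and vuln.get("isUpgradable")
    ]


if __name__ == "__main__":
    sample = {
        "vulnerabilities": [
            {"id": "A", "severity": "high", "isUpgradable": True},
            {"id": "B", "severity": "critical", "isUpgradable": False},
        ]
    }
    # Only "A" blocks: "B" is critical but has no upgrade path.
    print([v["id"] for v in blocking_issues(sample)])
```

A CI wrapper would exit non-zero when the returned list is non-empty and route the unfixable findings to an alert instead.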

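SBOM export: snyk sbom --format=cyclonedx1.4+json / --format=spdx2.3+json produces an SBOM, with support for Snyk attestation (signed SBOM). Recent supply chain compliance rules (US EO 14028, EU NIS2 / CRA) require an SBOM, and Snyk is one of the automated ways to produce it. The SBOM should be published next to the release artifact and go through the supply chain integrity flow.

License compliance: besides vulnerabilities, Snyk also scans dependency licenses (GPL / AGPL / LGPL / proprietary / unknown). A license policy (allow / disallow / require-review) can be set so that a PR introducing a violating license fails its check. For commercial products that must avoid copyleft licenses, license scanning matters as much as vulnerability scanning.

API token governance: CI and third-party integrations use a Service Account with a scoped token (limited to an Organization and to specific permissions), not a personal token (which stops working when the person leaves). Tokens go into HashiCorp Vault / AWS Secrets Manager / Google Secret Manager and are rotated regularly.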

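Core trade-off table

Trade-off dimension	Snyk	GitHub Advanced Security	Trivy
Deployment model	Commercial SaaS	SaaS integrated with GitHub	OSS, self-hosted CLI
SCM scope	Cross-SCM (GitHub / GitLab / Bitbucket / Azure Repos)	GitHub only	SCM-agnostic (runs in CI / locally)
SCA	Snyk Open Source (with Reachability)	Dependabot (plain manifest matching)	Yes, limited to OS packages + language packages
SAST	Snyk Code (fast inline)	CodeQL (dataflow query)	No
Container scan	Snyk Container	Via Dependabot + third parties	Trivy Container (main focus)
IaC scan	Snyk IaC	Via Code Scanning + KICS / Checkov	Trivy Config (main focus)
CSPM	Snyk Cloud	None	None
Reachability	Yes (some languages only)	Some CodeQL queries	None
Auto-fix PR	Snyk PR + fix advice	Dependabot PR	None
Billing model	Per Project (manifest) count	GitHub seat-based	Free
Learning curve	Medium: friendly UI, intuitive CLI	Low: one piece with GitHub	Low: single binary, CLI-centric
Best fit	Multiple SCMs + multiple stacks + one dashboard	GitHub-only + deep PR integration	Container / IaC only + OSS + budget-sensitive

Core reasons to pick Snyk: the organization uses multiple SCMs or multiple tech stacks (backend + frontend + container + Terraform + cloud), needs one dashboard and cross-team prioritization, and accepts the cost of per-Project billing. GitHub-only organizations are better integrated with GHAS, container-only CI gets Trivy for free, and very large monorepos need care with Snyk because the Project count can blow up.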

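Advanced topics

Snyk Cloud (CSPM) and IaC drift detection: Snyk Cloud connects to AWS / Azure / GCP through a read-only role and scans runtime configuration (public S3 buckets, over-permissioned IAM, security groups open to 0.0.0.0/0) against CIS Benchmark plus custom policy. Combined with Snyk IaC it does drift detection: Terraform defines a private bucket, the bucket is actually public in the cloud (someone changed it by hand in the console), and Snyk reports the drift. Against Wiz / Prisma Cloud / Lacework, Snyk Cloud's advantage is governance from the same source as Snyk IaC (IaC and runtime in one dashboard).

Custom Rule (Snyk IaC custom policy): the default Snyk IaC rule set covers CIS Benchmark plus AWS / GCP / Azure best practices, and can be extended with custom policies written in Rego. Examples: forbid RDS without encryption-at-rest, forbid S3 without versioning, forbid K8s pods running with hostNetwork. Custom policies go through version control (git) and PR review; avoid editing them directly in the console.

Reachability vs plain static SCA: plain SCA (such as Dependabot / Trivy) only checks whether the version declared in the manifest has a CVE and does not separate reachable from unreachable. The result is a flood of Critical / High alerts, alert fatigue, and developers ignoring them. Snyk Reachability integrates SAST + SCA to build a call graph and filters out cases where the vulnerable library is loaded but the vulnerable function is never called. Limits: dynamic dispatch / reflection / dynamically loaded plugins / native bindings all make the reachability judgment inaccurate, so it cannot be treated as binary truth.

Snyk Insights (risk prioritization): besides CVSS, Snyk combines exploit maturity (exploit in the wild / PoC / no known exploit), fix availability (whether a fix version exists), social trend (how much the CVE is being discussed), and Reachability into a Priority Score. In production, order the backlog by Priority Score rather than by CVSS alone: a vulnerability that is Critical but unreachable and has no fix should not block a release.

SBOM flow integration: hook snyk sbom into the CI release step, store the SBOM artifact with the release binary in the registry / object store, and run it through an in-toto attestation or SLSA provenance flow so it can be traced back for compliance. Difference from a Syft + Grype flow: Syft + Grype is OSS, local-first, and follows the Unix philosophy; Snyk is SaaS, and its SBOM carries Snyk Issue IDs and fix advice links.

License policy enforcement: besides vulnerabilities, license violations (GPL / AGPL pulled into a proprietary product, dependencies with unknown license) go through the same policy / PR fail-check mechanism. In production, license policy and vulnerability policy should stand side by side as release gates.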

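Troubleshooting and quick failure reading

Project explosion in billing: monorepo auto-import turns test fixtures and vendored node_modules into Projects. Fix: exclude them with .snyk and an import filter, and confirm that archived projects really are not billed
Reachability misses / misjudgments: dynamic dispatch / reflection / plugin loading distort the call graph, so a Critical vulnerability is marked unreachable while it is actually reachable. Fix: be conservative with framework-heavy code (Spring / Django middleware / Rails initializers) and do not rely on Reachability alone
PR noise: Snyk and Dependabot both enabled, dependency upgrade PRs double. Fix: pick one, or let Snyk handle vuln-driven upgrades and Dependabot handle routine version bumps
Wrong CI fail-on setting: --severity-threshold=low blocks every release, --severity-threshold=critical lets high slip through. Fix: production usually uses --severity-threshold=high --fail-on=upgradable, with exceptions managed in the .snyk policy file
License check false kills: a transitive dependency with LGPL is blocked as if it were GPL. Fix: refine the license policy (allow LGPL with dynamic linking, disallow GPL) and use a review workflow instead of fail-fast
Over-scoped API token: CI holds an admin-level Service Account token that can change every Project in the org. Fix: use a scoped token limited by Organization and permission, stored in Vault
SBOM not in the release pipeline: the SBOM exists only in the Snyk dashboard and is not attached to the release artifact. Fix: add snyk sbom to the CI release step and ship the SBOM with the binary
Nobody watches Snyk Cloud drift: CSPM alerts reach the dashboard but are not routed to on-call. Fix: connect SIEM / Slack / PagerDuty and have high-severity drift open a ticket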

When to switch to another service

Requirement shape	Switch to
GitHub-only + deep PR / Action integration	GitHub Advanced Security
Container / IaC only + OSS + budget-sensitive	Trivy
Dependency upgrades only (routine version bumps)	Dependabot
Secret scanning (leaked API key in repo)	GitGuardian / Gitleaks (not a Snyk focus)
Runtime container threat detection	Falco / Cilium Tetragon
Deep SAST (dataflow query / taint analysis)	CodeQL / Semgrep (Snyk Code leans fast inline; deep checks go to CodeQL)
CSPM across multi-cloud + asset inventory	Wiz / Prisma Cloud / Lacework (Snyk Cloud is newer and still catching up)

Topics outside this page

Full Snyk pricing tiers (Team / Business / Enterprise) and Project billing details
Coverage comparison between the Snyk Vulnerability DB and NVD / GHSA
Full Snyk Code SAST rule reference
Full list of built-in Snyk IaC policies with CIS Benchmark mapping
Snyk Cloud multi-cloud onboarding steps (AWS / Azure / GCP read-only role setup)

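Case write-back

The 07 case library has no direct vendor-level incident for Snyk, but several supply chain cases show the scope and the boundary of what Snyk can do:

Case	Relation to Snyk
Log4Shell CVE-2021-44228	Lesson by contrast: Reachability analysis can quickly answer "does my service really use the vulnerable JndiLookup" and cut noise in emergency triage
XZ Backdoor 2024 Open Source Supply Chain	Lesson by contrast: Snyk sees package version + CVE and cannot see a maintainer takeover; release-tarball comparison and maintainer trust signals must be added
3CX 2023 Desktop App Supply Chain	Lesson by contrast: Snyk Container sees package CVEs inside an image and cannot see a compromised update channel; pair it with artifact provenance / SLSA
7.12 Supply Chain Integrity and Artifact Trust	Section mapping: Snyk SBOM + License policy are supply chain governance tools and one of the standard ways to meet compliance thresholds (EO 14028 / NIS2)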

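Next routes

Upstream: 7.12 Supply Chain Integrity and Artifact Trust
Parallel: GitHub Advanced Security, Trivy, Dependabot
Downstream: 7.4 Data Protection and Masking Governance (when vulnerability blocking is incomplete, the data layer also needs masking)
Cross-category: HashiCorp Vault, AWS Secrets Manager (storage for the Snyk API token)
Cross-module: 8 Incident response vendor list (emergency triage routing when a Critical CVE is disclosed)
Official: Snyk Documentation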

Vegeta

Fri, 15 May 2026 00:00:00 +0000

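Vegeta's core job is to generate fixed-rate load against an HTTP endpoint from a simple CLI, to probe latency, throughput, error rate, and saturation quickly. It fits a single endpoint, few header / body variations, quick baselines, post-incident verification, and lightweight load tests on an engineer's machine or in CI.

Service positioning

Vegeta is an HTTP load testing CLI written in Go. Its core model is the constant rate attack: ask for "N requests per second" and it keeps sending N rps without slowing down when the server slows down. That is very different from "fire-and-wait" tools (hey / wrk are closed-loop by default). Constant rate is an open-loop model: it simulates real traffic, which does not shrink because the service is slow, and that is why the saturation point shows up clearly.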
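The difference between the two load models can be shown with simple arithmetic. The sketch below is an idealized model, not a description of how Vegeta or hey are implemented: a closed-loop sender with a fixed worker count can only send as fast as responses come back, while an open-loop sender keeps its rate and the slowdown appears as requests in flight (Little's law). Function names and numbers are illustrative assumptions.

```python
def closed_loop_rate(workers, latency_s):
    """Each worker waits for its response before sending again: rate = workers / latency."""
    return workers / latency_s


def open_loop_in_flight(rate_rps, latency_s):
    """Little's law: requests in flight = arrival rate x latency when the sender never slows."""
    return rate_rps * latency_s


if __name__ == "__main__":
    for latency in (0.25, 1.0):  # the server gets 4x slower
        print(
            f"latency={latency}s",
            f"closed-loop rate={closed_loop_rate(50, latency)} rps",
            f"open-loop in-flight at 200 rps={open_loop_in_flight(200, latency)}",
        )
```

With 50 workers, the closed-loop sender drops from 200 rps to 50 rps as latency goes from 0.25 s to 1 s, hiding the saturation; the open-loop sender stays at 200 rps and its in-flight count grows from 50 to 200.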

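Vegeta is a Unix-philosophy CLI: targets are read from stdin (so a complex generator can be piped in) and the binary report goes to stdout (so it can be piped into vegeta report / vegeta plot / vegeta encode). This design makes Vegeta easy to join with shell pipelines and CI scripts, and it also means Vegeta is not suited to expressing multi-step sessions.

Compared with k6: Vegeta is CLI-first with open-loop constant rate, while k6 offers JS scenarios + thresholds + CI artifacts. Vegeta suits a one-off test such as "hit this URL at 200 rps for 60 seconds"; k6 suits a maintainable scenario such as "three user journeys at 40/30/30% with a ramp-up profile". Compared with hey: Vegeta's constant rate is truly open-loop, while hey's -q is a per-worker rate (when workers slow down, the total rate drops), so Vegeta is more honest when probing saturation. Compared with wrk / wrk2: Vegeta lacks the extreme single-machine performance that LuaJIT gives them, but binary report + vegeta plot + piped targets fit an engineer's daily workflow better.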

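Goals of this chapter

After reading this page, the reader can judge:

When to use Vegeta and when to go with k6 / hey / wrk / Gatling / Locust
What the constant rate attack design implies (open-loop vs closed-loop, and why this matters for saturation discovery)
The baseline workflow built on target file / rate / duration / report, and how it maps to the evidence package
Common troubleshooting traps: TCP socket exhaustion on the runner, the open file limit, and a gap between the constant rate and what a rate-limiting target server accepts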

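Positioning

Vegeta is good at quickly answering "how does this endpoint behave at a given rate". When a team needs a rough knee point, wants to verify that a fix lowers latency, or wants a small performance smoke test in CI, the Vegeta CLI workflow is very direct.

This positioning connects Vegeta to 9.4 Saturation Discovery and the 9.5 bottleneck location flow. It provides a quick pressure probe; expressing a complex workload model later usually means moving to k6, Gatling, Locust, or JMeter.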

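Shortest reading path

To judge whether a Vegeta run is valid, check at least four things:

Target description completeness: whether the targets file includes method / URL / headers / body and reflects the real request shape (auth header, content-type, representative payload size). Missing any of these pulls the result away from production
Rate model design: whether the run uses a constant rate (-rate=200/s) or a ramp (several attack stages piped together). Constant rate suits a saturation probe; ramp-up needs a wrapper script that stages it, because Vegeta has no native ramp profile
Report reading: vegeta report gives mean / p50 / p95 / p99 / max latency + success rate + throughput. Focus on the distance between p99 and max, and on whether requested rate and actual throughput diverge; a gap means something on the server or the runner is limiting the rate
Duration vs warm-up: a short duration (< 30s) easily picks up JIT / cache / connection pool warm-up noise. Baseline runs should last at least 60s and discard the first segment of results, otherwise p99 is pulled up by the first 5s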
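The warm-up point in the last item can be illustrated with a small calculation. This is a sketch on made-up samples, not Vegeta's own report logic: samples are assumed to be `(seconds_since_start, latency_ms)` pairs, the percentile uses the nearest-rank method, and both helper names are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def p99_after_warmup(samples, warmup_s=5.0):
    """p99 latency after dropping samples taken during the warm-up window."""
    kept = [latency for offset, latency in samples if offset >= warmup_s]
    return percentile(kept, 99)


if __name__ == "__main__":
    # 5 slow warm-up requests (500 ms), then 100 steady requests (10 ms).
    samples = [(t, 500) for t in range(5)] + [(t, 10) for t in range(5, 105)]
    print("p99 with warm-up:", percentile([lat for _, lat in samples], 99))
    print("p99 without warm-up:", p99_after_warmup(samples))
```

Five slow warm-up requests out of 105 are enough to move p99 from 10 ms to 500 ms in this example, which is why the first segment is discarded.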

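Applicable scenarios

The single-endpoint saturation probe is Vegeta's main entry point. An engineer can apply a fixed rate to a login, search, read API, feature flag endpoint, or internal health-like endpoint and watch when p95 / p99 and error rate start to rise.

Regression smoke tests are a good fit. CI or pre-release can run a short fixed-rate test to confirm the hot path has not clearly regressed, and leave the fuller scenarios to k6, Gatling, or Locust.

Post-incident fix verification is a good fit. When the root cause was a query, cache miss, lock contention, or timeout on one endpoint, the same request set can be replayed after the fix to compare latency distributions quickly.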

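Selection criteria

Criterion	Value of Vegeta	Capability to add
Simple CLI	Easy to wire into local, CI, and shell workflows	Long-term reports and artifact standardization
Fixed rate	Clear view of the rate / latency relationship	Complex user behavior and arrival patterns
HTTP-oriented	Quick verification of API hot paths	Non-HTTP protocols and multi-step flows
Quick probe	Fits smoke tests and fix verification	Full workload model and data governance

The value of a simple CLI is low friction. While a problem is still being located, an engineer can quickly produce a rerunnable command and target file, get a baseline first, and then decide whether a full load testing platform is needed.

The value of a fixed rate is comparability. Rerunning with the same request set, rate, duration, and target environment gives a clear before / after comparison of latency distributions.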

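Trade-offs against other tools

The main difference between Vegeta and k6 is scenario depth. Vegeta suits fixed-rate HTTP probes; k6 suits multi-step scenarios, thresholds, CI artifacts, and browser-style flows.

The main difference between Vegeta and JMeter is tool weight. Vegeta suits a quick CLI; JMeter suits GUI, multiple protocols, plugins, and enterprise test assets.

The main difference between Vegeta and Gatling is the long-term maintenance model. Vegeta stays simple with a command and a target file; Gatling maintains complex flows and injection profiles as simulations.

The main difference between Vegeta and Locust is customization. Locust suits Python user behavior and custom clients; Vegeta suits direct pressure measurement of HTTP endpoints.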

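Operating cost

Vegeta's main cost is limited workload coverage. It tests an endpoint quickly, but multi-step sessions, data dependencies, payment mocks, queue side effects, and realistic user journeys need extra tools or scripts.

Artifact cost comes from command traceability. Every test must save rate, duration, targets, headers, body, environment, version, and result files; otherwise a quick probe easily becomes a one-off observation that cannot be compared.

Runner cost is usually low, but local bottlenecks still need checking. In high-rate tests the load-generating machine may hit CPU, network, file descriptor, or connection limits first.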

Evidence Package

Vegeta results should be written back to the evidence package. The minimum fields are command, target file hash, rate, duration, workers, target environment, p95 / p99, max latency, error rate, throughput, target saturation metric, known gap, and owner.

Field	Vegeta evidence source
Source	command, targets file, binary result, report
Time range	test start / end
Query link	links to APM / metrics / logs queries
Data quality	target set freshness, header / body correctness
Confidence	runner capacity, endpoint representativeness
Known gap	multi-step flows not covered, data skew, runner limit

The core purpose of the evidence package is to make quick tests comparable. Vegeta runs are usually short, which makes saving the command and target set even more important, so the next fix verification can run under the same conditions.

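Core trade-off table

Trade-off dimension	Vegeta	k6	hey	wrk / wrk2
Load model	Open-loop constant rate (rps does not drop with latency)	Closed-loop (VU executors, the default) / open-loop (arrival-rate executors)	Per-worker rate (leans closed-loop)	wrk closed-loop / wrk2 open-loop
Scenario depth	Piped targets for a single endpoint; multiple endpoints need a script	JS scripts, multi-step, staging / threshold / SLO built in	Single URL via CLI flags	Lua scripts can express complex logic, with a steeper idiom
Output	Binary stream + `vegeta report/plot/encode`	stdout summary + JSON + built-in dashboard	stdout text summary	stdout text summary, HdrHistogram
CI integration	Wrapped in shell, with a hand-written threshold gate	Built-in threshold / exit code, standardized CI artifacts	Simple smoke test, no threshold	Needs a hand-written wrapper
Learning cost	Low: a few flags to get started	Medium: requires writing JS scenarios	Very low: a one-line CLI	Medium: Lua plus HdrHistogram concepts
Best fit	Fix verification, CI smoke, saturation probe	Full load testing platform, SLO gate, multiple scenarios	One-off ad-hoc probing	Extreme single-machine performance, low-overhead measurement

Core reasons to pick Vegeta: local runs, CI smoke, fix verification, and saturation probes must all be quick and rerunnable, results must be saved for comparison, and neither a full scenario model nor GUI reports are needed. If the team needs full user journeys, threshold / SLO gates, or a long-term trend dashboard, go straight to k6 or Gatling.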

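Advanced topics

Reporting in multiple output formats: vegeta report defaults to a text summary; -type=hist[0,10ms,50ms,100ms,500ms] gives a latency bucket histogram; -type=json gives a machine-readable result; vegeta plot produces an HTML latency chart; vegeta encode -to=csv converts results to CSV for a spreadsheet / dashboard. The binary result file can be decoded into different formats repeatedly without rerunning the test. The standard practice for fix verification is to keep results.bin so the report can be re-rendered at any time.

Pipe attack workflow: Vegeta's stdin and stdout are both streams. A shell pipe with jq can generate targets dynamically (jq -r '.urls[] | "GET " + .'), and vegeta attack | tee results.bin | vegeta report writes the file while showing the summary. Note that cat results-old.bin results-new.bin | vegeta report merges both runs into one combined report; to compare two runs, render vegeta report for each file and diff the outputs. This design makes Vegeta easy to join with incident drill / chaos test scripts: run one attack after the fix is deployed and commit the result to git as evidence.

CI integration pattern: Vegeta has no built-in threshold like k6, so the gate must be hand-written. vegeta report -type=json results.bin | jq '.latencies["99th"]' prints p99 (in nanoseconds), bash compares it with the budget, and the script exits non-zero when the budget is exceeded. Commit targets.txt + attack.sh + expected-budget.json to the repo and upload results.bin + plot.html as CI artifacts so the next regression can be diffed.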
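The hand-written gate described above can also be a short script instead of bash. The sketch below assumes the JSON report exposes `latencies["99th"]` in nanoseconds and `success` as a ratio between 0 and 1; the budget values and the `check_budget` name are illustrative assumptions, and the sample payload is made up.

```python
import json


def check_budget(report_json, p99_budget_ms, min_success=0.99):
    """Return a list of budget violations; an empty list means the gate passes."""
    report = json.loads(report_json)
    p99_ms = report["latencies"]["99th"] / 1_000_000  # nanoseconds -> milliseconds
    failures = []
    if p99_ms > p99_budget_ms:
        failures.append(f"p99 {p99_ms:.1f}ms exceeds budget {p99_budget_ms}ms")
    if report["success"] < min_success:
        failures.append(f"success {report['success']:.3f} below {min_success}")
    return failures


if __name__ == "__main__":
    # Made-up report: p99 of 180 ms and 99.5% success.
    sample = json.dumps({"latencies": {"99th": 180_000_000}, "success": 0.995})
    problems = check_budget(sample, p99_budget_ms=150)
    for line in problems:
        print(line)
    raise SystemExit(1 if problems else 0)
```

In CI the JSON would come from `vegeta report -type=json results.bin`, and the non-zero exit code fails the job.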

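Troubleshooting and quick failure reading

Requested rate and actual throughput diverge (asked for 200 rps, got 80 rps): the runner saturated first, not the server. Check whether vegeta attack reports socket: too many open files on stderr and check ulimit -n (set at least 65535 on a production load runner). Alternatively, the server side has a rate limit or connection cap that rejects requests at the TCP layer, so Vegeta is stalled before it sees a full response
TCP socket exhaustion (runner side): under the constant rate model, slow server responses make connections pile up and TIME_WAIT sockets exhaust the ephemeral port range. Use -keepalive=true (the default) and set net.ipv4.tcp_tw_reuse=1, or add -max-connections=N to cap connections per target host and stop sockets from piling up without limit
p99 / max latency abnormally high but invisible in server-side metrics: GC pauses, CPU steal, or network jitter on the runner are polluting the measurement. Move the runner into the same placement group / AZ as the target, confirm no other process is taking the runner's CPU, and extend the duration to 5 min so outliers are diluted
Success rate is 100% while the server is supposedly overloaded: targets lack the auth header or hit the LB instead of the backend, every request returns 200 or a cache hit at the front, and the server never receives the load. Check that the request count in the target server's access log matches Vegeta's requested rate
Short runs give unstable results (the same command differs widely between two runs): the duration is too short (< 30s) and warm-up noise dominates. Run at least 60s, discard the first 5-10s, and if the endpoint has lazy initialization (cache / connection pool / JIT compile), run a warm-up attack before the measured one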

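Case write-back

Vegeta suits writing back single-endpoint hot path and fix verification cases. It can connect to the sub-millisecond latency distribution reading in 9.C3 Coinbase ultra-low latency, the p99 < 10ms lookup verification in 9.C25 Tubi feature store, the RDB bottleneck probing in 9.C29 Lemino connection limit, the sub-millisecond cache lookup verification in 9.C6 Tinder ElastiCache, and the hot partition probing in 9.C5 Amazon Ads DynamoDB.

The point of these cases is quick location and comparison. When the Vegeta page cites a case, it should turn the case into endpoint, rate, duration, latency budget, target saturation metric, and runner limit. For example, Coinbase's sub-ms target requires the Vegeta runner to sit in the same placement group as the target, otherwise the runner's own network jitter eats the measurement precision.

Next routes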

DynamoDB

Wed, 13 May 2026 00:00:00 +0000

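DynamoDB is AWS's managed key-value store. It uses partition-based scaling to provide predictable P99 latency and elastic capacity. Amazon's own Ads (90 million RPS), Disney+, Zoom (30x COVID surge), and Capcom (billions of requests / single-digit ms) all run core workloads on DynamoDB, which makes it the managed KV service with the most public cases and the most validation.

Learning path: access pattern and partition capacity

The teaching goal of the DynamoDB page is to turn access patterns into design decisions on partition key, sort key, GSI, capacity mode, and global tables. After reading, the reader should be able to derive the data model from the query paths and estimate hot partitions, cost, and consistency trade-offs.

Learning stage	Core question	Related sections
Access pattern	How the query shape comes before table design	Positioning, Applicable scenarios
Partition key	How hot partitions, single-digit latency, and GSI become the design core	Capacity planning essentials, Common pitfalls
Capacity mode	How on-demand, provisioned, and auto scaling map to peaks and cost	Capacity characteristics, Case comparison
Global tables	What multi-region availability and consistency cost	Applicable scenarios, Trade-offs against other vendors
Alternative routes	When to go back to SQL, MongoDB, Cosmos DB, or cache / queue	Non-applicable scenarios, Next routes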

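Positioning: partition-based KV scale

DynamoDB's core design is "transparent partitions, abstracted capacity". Unlike MongoDB, where you shard actively, Cassandra, where you manage the ring topology, or PostgreSQL, where you choose an instance type, DynamoDB hides all underlying scaling behind the RCU / WCU abstraction.

Capacity units:

1 RCU (Read Capacity Unit) = 1 strongly consistent read of up to 4KB per second, or 2 eventually consistent reads
1 WCU (Write Capacity Unit) = 1 write of up to 1KB per second
Per-partition limit: 3000 RCU / 1000 WCU
Total capacity = number of partitions × per-partition limit (the partition count is transparent and managed by the vendor)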
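The capacity units above translate into a small calculation. The sketch below applies the stated rules (4KB per read unit, 1KB per write unit, eventually consistent reads at half cost, 3000 RCU / 1000 WCU per partition); the function names and workload numbers are illustrative assumptions, and the partition estimate covers throughput only, not storage size.

```python
import math


def read_capacity_units(item_kb, reads_per_sec, strongly_consistent=True):
    """RCU needed: each read costs ceil(size / 4KB); eventual consistency halves it."""
    total = math.ceil(item_kb / 4) * reads_per_sec
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(item_kb, writes_per_sec):
    """WCU needed: each write costs ceil(size / 1KB)."""
    return math.ceil(item_kb) * writes_per_sec


def min_partitions_for_throughput(rcu, wcu):
    """Lower bound on partitions implied by the 3000 RCU / 1000 WCU per-partition limit."""
    return max(1, math.ceil(rcu / 3000), math.ceil(wcu / 1000))


if __name__ == "__main__":
    rcu = read_capacity_units(item_kb=6, reads_per_sec=1000)
    wcu = write_capacity_units(item_kb=2.5, writes_per_sec=500)
    print(rcu, wcu, min_partitions_for_throughput(rcu, wcu))
```

For 6KB items read 1000 times per second and 2.5KB items written 500 times per second, this gives 2000 RCU, 1500 WCU, and at least 2 partitions. The bound only holds if the partition key spreads traffic evenly.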

Latency characteristics:

single-digit millisecond p99 latency (read / write)
Cross-AZ replication within a region is built in; reads are eventually consistent by default
Strongly consistent reads rely on a quorum inside the region; cross-region reads and writes depend on Global Tables semantics

See 1.10 KV / Document DB Capacity Planning and the partition design section of 9.4 Saturation Discovery.

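Applicable scenarios

Typical scenarios distilled from public cases:

1. Queries built mainly on KV / single-table design:

Designed around partition key + sort key, with single-item / range queries
Query paths are fixed and JOIN / ad-hoc query demand is low
Case: 9.C5 Amazon Ads, 90 million reads/sec + 5 million writes/sec, 99.999% availability

2. Predictable sub-10ms p99 latency:

Game backends (player state, match records)
Content platform metadata (watchlist, playback progress)
Cases: 9.C19 Capcom (billions of requests / single-digit ms), 9.C27 Disney+ (billions of actions per day)

3. Spiky or surge traffic:

on-demand capacity absorbs bursts automatically
No connection pool needed (HTTP API, no stateful connection)
Cases: 9.C18 Zoom (COVID, 10 million → 300 million DAU), 9.C15 Tixcraft (IOPS 20 → 135K, ticket rush), 9.C29 Lemino (RDB connection limit → moved to DynamoDB)

4. Large-scale notification / messaging systems:

TTL cleans up expired records automatically
Partition keys on user_id / message_id are naturally uniform
Case: 9.C26 PayPay (mobile payments, 300 million messages per day)

5. Five-nines availability for B2B SaaS:

multi-region Global Tables active-active
Case: 9.C24 Genesys (99.999% across 15 regions)

6. High throughput with budget sensitivity:

on-demand suits bursts, provisioned suits sustained load
Case: 9.C20 Zomato, where TiDB over-provisioning pressure became DynamoDB on-demand pay-per-use and cost fell 50%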

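Non-applicable scenarios

1. Complex ad-hoc query / JOIN:

DynamoDB queries revolve around partition key + sort key; hand JOIN-heavy workloads to a SQL system
PartiQL offers SQL-like syntax but the engine underneath is still KV, and complex queries scan the whole table
Alternative: Aurora / PostgreSQL / Spanner

2. Strongly consistent multi-row transactions:

DynamoDB transactions provide ACID across up to 100 items (25 before the 2022 limit increase)
Transactions beyond the item limit or across regions need a workflow / SQL / distributed SQL design
Alternative: Spanner / Aurora DSQL / CockroachDB

3. Cross-cloud requirements:

DynamoDB runs only on AWS, with vendor lock-in
Alternative: Cosmos DB (Azure global NoSQL), self-managed ScyllaDB

4. Large objects / document storage:

A single item is at most 400KB
Put large objects in S3 and metadata in DynamoDB

5. Extreme budget sensitivity + stable traffic:

For a highly predictable sustained workload, self-managed PostgreSQL / MySQL may be cheaper
DynamoDB's managed and elastic qualities carry a premium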

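Trade-offs against other vendors

vs MongoDB (self-managed or Atlas):

DynamoDB: managed, transparent partitions, the application mainly manages the partition key, 99.999% SLA (Global Tables; single-region tables carry 99.99%)
MongoDB: highly flexible, can be self-managed, strong aggregation pipeline, available across clouds
Pick DynamoDB: AWS-only, want to offload operations, partition design is simple and predictable
Pick MongoDB: cross-cloud, complex queries, ad-hoc analysis

vs Aurora (same AWS):

DynamoDB: KV, partition scaling, no connection pool limit
Aurora: SQL (PostgreSQL / MySQL), transactions, ad-hoc queries
See 1.10 KV / Document DB Capacity Planning and the 9.C29 Lemino case: the connection limit is the key difference between RDB and DynamoDB

vs Redis (including ElastiCache) as a KV alternative:

DynamoDB: persistent, every item stays durably readable, has TTL but objects do not vanish on their own
Redis: pure memory, not persistent by default (MemoryDB is the exception), fast but volatile
Pick DynamoDB: the data is the source of truth and must be kept durably
Pick Redis: the data is a cache and can be recomputed if lost

vs Cosmos DB (cross-cloud):

DynamoDB: AWS-only, KV-centric, no multi-model
Cosmos DB: Azure-only, multi-model (SQL / Mongo / Cassandra / Gremlin / Table), 5 consistency levels
Pick DynamoDB: AWS ecosystem, pure KV
Pick Cosmos DB: Azure ecosystem, need multi-model, need multi-region active-active write

vs Cassandra / ScyllaDB (self-managed):

DynamoDB: managed, 99.999% SLA with Global Tables, no ops burden
Cassandra / ScyllaDB: can be self-managed, deeper tuning, available across clouds
Pick DynamoDB: the team wants to hand DBA / SRE operations to AWS
Pick Cassandra / ScyllaDB: DBAs are available, lower lock-in risk is wanted, extreme throughput tuning is needed

vs PostgreSQL (SQL baseline):

See the trade-off section of the PostgreSQL vendor page and the connection model comparison in 1.10 KV / Document DB Capacity Planning
Summary: DynamoDB is the option for fixed access patterns that must avoid being connection-bound; leave ad-hoc queries and complex transactions to PostgreSQL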

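Capacity planning essentials

DynamoDB capacity planning practices distilled from the 09 case library:

1. Partition key design is the lifeline:

Uneven partition key → hot partition → nominal capacity cannot be reached
A composite key (event_id + user_id_hash) forces distribution
See 9.C5 Amazon Ads, whose 90 million RPS depends on even partitions, and 9.C15 Tixcraft, which spreads ticketing traffic with a composite key
See the Hot Partition card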
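The composite key idea above (event_id + user_id_hash) can be sketched as a write-sharding helper. This is a minimal illustration under assumptions: the shard count of 10, the `#` separator, and the function name are all hypothetical choices, not a DynamoDB API.

```python
import hashlib


def sharded_partition_key(event_id, user_id, shard_count=10):
    """Spread one hot event across `shard_count` partition key values.

    The shard is derived from a stable hash of user_id, so the same user
    always lands on the same shard for a given event.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % shard_count
    return f"{event_id}#{shard}"


if __name__ == "__main__":
    keys = {sharded_partition_key("concert-2026", f"user-{i}") for i in range(1000)}
    print(len(keys), "distinct partition key values for one event")
```

The trade-off is on the read side: fetching everything for one event now means querying every shard value and merging the results.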

2. Choosing on-demand vs provisioned:

Traffic peak/avg > 5x → on-demand
Sustained and predictable → provisioned + auto-scaling
Known big events (Black Friday) → provisioned baseline + scheduled scale-up
See 9.C20 Zomato, where on-demand removed the over-provisioning
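The three rules above can be written as a small decision helper. This sketch only encodes the page's rule of thumb; the order in which the rules are checked and the function name are assumptions, and a real decision would also weigh the cost crossover.

```python
def suggest_capacity_mode(peak_rps, avg_rps, known_event=False):
    """Map the page's rule of thumb to a capacity mode suggestion."""
    if known_event:
        # Known big events are planned ahead instead of left to reactive scaling.
        return "provisioned baseline + scheduled scale-up"
    if peak_rps / avg_rps > 5:
        return "on-demand"
    return "provisioned + auto-scaling"


if __name__ == "__main__":
    print(suggest_capacity_mode(peak_rps=12000, avg_rps=1500))
    print(suggest_capacity_mode(peak_rps=3000, avg_rps=1500))
    print(suggest_capacity_mode(peak_rps=3000, avg_rps=1500, known_event=True))
```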

3. Global Tables (multi-region active-active):

Every region accepts writes; conflict resolution uses LWW
Capacity is configured per region, so the global total must be estimated region by region
See 9.C24 Genesys, with 99.999% availability across 15 regions
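Last-writer-wins (LWW) is easy to state and easy to underestimate. The sketch below is a simplified model of the idea, not DynamoDB's internal implementation: each version is assumed to be a dict with an `updated_at` timestamp and a `value`, and keeping the local version on a tie is an arbitrary choice made for the example.

```python
def last_writer_wins(local, remote):
    """Pick the version with the later timestamp; ties keep the local version."""
    return remote if remote["updated_at"] > local["updated_at"] else local


if __name__ == "__main__":
    # Two regions update the same key almost at the same time.
    tokyo = {"updated_at": 1_700_000_000.120, "value": {"seats_left": 4}}
    virginia = {"updated_at": 1_700_000_000.450, "value": {"seats_left": 7}}
    winner = last_writer_wins(tokyo, virginia)
    # The Tokyo write is silently discarded, not merged.
    print(winner["value"])
```

Because the losing write disappears without an error, writes that must not be lost need idempotency keys or conditional writes in the application contract.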

4. DAX (DynamoDB Accelerator):

An in-memory cache in front of DynamoDB
Brings reads from single-digit ms down to microseconds
Fits workloads with very high read repetition (many reads of the same key)
See 9.C29 Lemino, which accelerates with DAX

5. Streams + Lambda:

DynamoDB write → Stream event → Lambda processing
Fits CDC and event-driven workflows
See 9.C15 Tixcraft, which uses Streams to make DynamoDB a durable queue consumed by a legacy server

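Anti-recommendation and upgrade routes

DynamoDB's managed elasticity can lead teams to overlook the up-front cost of access patterns. This section first says when to stay with a plain table / index, then when to upgrade to Global Tables, DAX, or Streams, or to go back to SQL / document DB.

Mechanism / route	Conditions for keeping the simple design	Upgrade signal	Main reference path
Single table / few GSIs	Stable access patterns, even partition key, predictable query cost	Many new query paths, GSI cost exceeding the base table, hot partitions appearing	Hot Partition, Workload Model
On-demand capacity	Large peak/avg gap, event-driven surges	Stable sustained traffic, predictable cost curve	Peak Forecast, Cost Per Request
Provisioned + autoscaling	Stable baseline, the team can predict peaks	Known big events such as Black Friday, ticket sales, or live streams need pre-scaling	Scheduled Scaling
DAX	Low read repetition, single-digit ms is enough	Very high reads on the same key, microsecond reads needed	Cache Aside, Stale Data
Global Tables	Single-region availability is enough	RTO/RPO, region residency, or active-active write is a product requirement	RTO, RPO, Consistency Level
SQL / document DB	Access patterns can be listed in advance	ad-hoc query, JOIN, multi-row transaction, or document traversal becomes the main theme	Aurora vendor, MongoDB vendor

The simple path with DynamoDB is to write every query path as a contract first. Table, partition key, sort key, GSI, and TTL should all be derived from access patterns; if requirements are still exploratory, PostgreSQL or MongoDB may offer a lower cost of change.

The upgrade path to Global Tables must first deal with conflicts and read / write semantics. It provides multi-region availability, but LWW conflict resolution, region-local capacity, and cross-region reconciliation still have to be carried by the application contract.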

Deep articles (completed)

The existing deep articles for this vendor cover DynamoDB's core production topics, from deriving the model from access patterns to write consistency, read acceleration, event-driven design, and data lifecycle:

Topic	Article	Related production topic
Four-axis fit assessment + deriving PK/SK from access patterns + durable queue	single-table-design-pattern	Fit assessment + control plane vs data plane + 9.C15 Tixcraft Stream durable queue
1000 WCU partition limit + composite key / calculated shard fixes	partition-key-antipatterns	9.C15 Tixcraft 6750x scale-up, mode × partition behavior under provisioned / on-demand
Three GSI / LSI projection types, sparse indexes, DAX as a supplement	gsi-lsi-design	GSIs get hot partitions too; separating Capcom derived facts from Lemino case facts
Six-axis capacity mode decision + auto-scaling limits + cost crossover	on-demand-vs-provisioned	Zomato 50% cost reduction, Zoom 30x permanent surge, Amazon Ads sustained workload
Multi-region active-active + LWW conflict + cross-device sync	global-tables-conflict	Genesys 99.999% / 15 regions, Disney+ cross-device sync
Strongly / eventually consistent read trade-off	consistency-model-optimization	Cost choice of read consistency
Cross-item atomicity + conditional write + optimistic lock + idempotency	transactions-conditional-writes	Dual-write inconsistency, oversell race, the 2x cost boundary of transactions
DAX cluster + item/query cache + write-through + invalidation limits	dax-caching-strategy	p99 spikes at read peaks, query cache invalidated only by TTL, strong reads bypassing the cache
Streams CDC + shard ordering + Lambda consumption + failure handling	streams-lambda-event-driven	Real-time downstream reaction, at-least-once idempotency, poison-pill record isolation
TTL auto-expiry + 48h deletion delay + expired items still readable + storage cost	ttl-data-lifecycle	9.C26 PayPay storage cleanup for hundreds of millions of messages per day, the trap of reading expired but undeleted items

Migration playbook: migrating from RDS / MongoDB to DynamoDB (Type E paradigm shift, access-pattern-first remodeling + hybrid architecture + Zomato cost crossover).

Cross-vendor entry: read DB3 vendor selection first (three-way selection between MongoDB / DynamoDB / Cosmos DB + workload shape pre-assessment), then come to this vendor's deep articles.

Future expansion (still to be written)

DynamoDB Streams advanced lab: multi-consumer fan-out and long-retention replay with Kinesis Data Streams for DynamoDB (the Lambda vs Kinesis comparison is already covered in streams-lambda-event-driven; this refers to an in-depth hands-on lab)
Export to S3 / point-in-time export for offline analysis
DynamoDB → SQL / search / analytics split (a playbook for moving out)
Backup / PITR restore drill (hands-on lab)

Case comparison

Case	Scale	Teaching focus
9.C5 Amazon Ads	90 million RPS + 5 million WPS	Model example of even partition design
9.C15 Tixcraft	IOPS 20 → 135K (6750x scale-up)	Flash-sale buffering pattern
9.C18 Zoom	30x DAU surge (10 million → 300 million)	Recalibrating the SaaS surge baseline
9.C19 Capcom	billions of requests / single-digit ms	Game backend KV, platform shared across games
9.C20 Zomato	4x throughput, 90% lower latency, 50% lower cost	TiDB → DynamoDB cross-DB migration
9.C24 Genesys	99.999% / 15 regions / 8000+ orgs	Five-nines availability for B2B SaaS
9.C26 PayPay	300 million messages / day	Mobile payment notification system, TTL auto-cleanup
9.C27 Disney+	billions of actions per day	Streaming metadata layer + cross-device sync
9.C29 Lemino	tens of thousands req/sec, 5M MAU in 3 months	RDB connection limit → DynamoDB

Read DynamoDB cases by classifying the access pattern first and the capacity mode second. Amazon Ads / Capcom / Disney+ illustrate high-throughput KV; Zoom / Tixcraft / Lemino illustrate surge and connection-free scaling; Zomato shows how the on-demand cost model changes over-provisioning pressure.

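Reverse sibling routes

The reverse sibling routes of DynamoDB spell out the exit conditions from an RDBMS. Readers arriving from a PostgreSQL / MySQL connection bottleneck should first read the Lemino case and 1.10 KV / Document DB Capacity Planning. If the requirement still needs ad hoc SQL, joins, and transaction reports, go back to the Aurora vendor or PostgreSQL vendor page. If the requirement is a global document model in the Azure ecosystem, compare with the Cosmos DB vendor page.

The criterion for this route is whether access patterns are stable enough to design the key first. DynamoDB is good at fixed lookups, write spikes, connection-free scaling, and TTL-style lifecycles; data exploration, reporting joins, and multi-condition queries should stay in SQL / search / analytics services.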

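Common pitfalls

Distilled from public incidents and cases:

Concentrated partition key: one concert under a single event_id, or a bot user writing heavily under the same user_id → use a composite key or write sharding
A single partition reaches the 3000 RCU / 1000 WCU limit: throttling events appear even though overall capacity is not full
Full-table Scan: a scan eats up capacity; production read paths should go back to query / index design
Mixing DAX with direct DynamoDB access: writes go straight to DynamoDB while reads go through DAX → cache consistency problems
Global Tables conflict: the same key is written in several regions at once, LWW may drop a write, so idempotency must be designed in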

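Next routes

Full T1 comparison: 01-database vendors index
Parallel: Aurora vendor page (SQL comparison)
Upstream: 1.10 KV / Document DB Capacity Planning
Downstream: 1.12 Large-Scale DB Migration in Practice (cases of migrating from RDBMS to DynamoDB)
Cross-module: 9.4 Saturation Discovery, 9.6 Capacity Planning Model
Last reviewed: 2026-05-22 (capacity mode / Global Tables / best practices are time-sensitive claims)
Official: Amazon DynamoDB Customers, DynamoDB design best practices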

Apache JMeter

Fri, 01 May 2026 00:00:00 +0000

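JMeter is Apache's long-established load testing tool. It carries three responsibilities: GUI-driven test plan design, multi-protocol samplers (HTTP / JDBC / JMS / FTP / mail), and a broad plugin ecosystem with wide enterprise adoption. Its design leans toward "easy GUI onboarding + governance of existing test assets + multiple protocols"; against code-first tools (k6 / Gatling) the trade-off lies in dev workflow and version control friendliness.

Goals of this chapter

After reading this chapter, you should be able to:

Design a test plan in the GUI (thread group / sampler / listener / assertion)
Run non-GUI mode for CI
Scale VUs with distributed mode (master / slave, called controller / worker in current JMeter docs)
Add extensions with JMeter Plugins Manager
Evaluate JMeter against modern CLI-first tools (k6 / Gatling / Locust)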

Shortest path: get JMeter running in 5 minutes

# 1. Install
brew install jmeter   # or download and unpack the zip from the Apache JMeter site

# 2. Design the .jmx in the GUI
jmeter                # opens the GUI; add Thread Group / HTTP Sampler / Listener and save as test.jmx

# 3. Run non-GUI mode in CI
jmeter -n -t test.jmx -l result.jtl -e -o report/

Daily operations and decision shapes

Test plan structure

Subtopics:

Thread Group (VU + ramp-up + loop count)
Sampler (HTTP / JDBC / JMS / FTP / Java Request)
Listener (aggregate report / view tree / graph)
Assertion (response / duration / size)

Non-GUI mode for CI

Subtopics:

-n non-GUI
-t test file / -l log file
-e -o generate the HTML dashboard
Exit code 0 / 1 (combined with a backend listener / assertion)
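A CI gate usually needs more than the process exit code, so a common pattern is to summarize the result file written by `-l result.jtl`. The sketch below assumes the JTL is CSV with a header row containing `elapsed` and `success` columns (the usual default, but it depends on the save-service settings); the helper name and the sample rows are made up for illustration.

```python
import csv
import io


def summarize_jtl(csv_text):
    """Summarize a CSV-format JTL: sample count, error rate, slowest sample."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    failed = sum(1 for row in rows if row["success"].strip().lower() != "true")
    elapsed = [int(row["elapsed"]) for row in rows]
    return {
        "samples": len(rows),
        "error_rate": failed / len(rows),
        "max_elapsed_ms": max(elapsed),
    }


if __name__ == "__main__":
    sample = (
        "timeStamp,elapsed,label,responseCode,success\n"
        "1714550000000,120,GET /health,200,true\n"
        "1714550000100,340,GET /search,200,true\n"
        "1714550000200,900,GET /search,500,false\n"
        "1714550000300,150,GET /health,200,true\n"
    )
    summary = summarize_jtl(sample)
    print(summary)
    # Fail the CI job when more than 1% of samples failed.
    raise SystemExit(1 if summary["error_rate"] > 0.01 else 0)
```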

Distributed testing

Subtopics:

Master / slave configuration
RMI port settings
Result aggregation on the master

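Advanced topics (read as needed)

Plugins Manager

Subtopics:

jmeter-plugins.org plugins
Common ones: PerfMon / Dummy Sampler / Custom Thread Groups / WebSocket
Installation management: once Plugins Manager is installed, plugins can be managed from the UI

Recording controller

Subtopics:

HTTP(S) Test Script Recorder
Browser proxy settings
Good for: quickly recording a user flow

CSV data set / parameterization

Subtopics:

CSV Data Set Config
Each thread takes different data
Good for data-driven tests

CI / Jenkins integration

Subtopics:

Jenkins JMeter plugin
Performance plugin (trend analysis)
Maps to 6.13 Performance Regression Gate

Governance of existing .jmx assets

Subtopics:

XML is unfriendly to git diff
Large test plans are hard to read
Split into modules + Test Fragments
Maps to the enterprise evaluation of migrating to k6 / Gatling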

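Quick troubleshooting

High VU count will not start

Operating principle: the JVM heap is too small, or GUI mode hits its limits (always use non-GUI mode for production load).

Listeners slow things down

Operating principle: View Results Tree records too much → switch to Simple Data Writer / disable detail.

Distributed RMI cannot connect

Operating principle: firewall rules or a wrong RMI port.

Assertion noise

Operating principle: many assertions fail while the system is actually fine → the response time / size limits are too strict.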

何時改走其他服務

需求形狀	改走
Code-first / CI-first	k6 / Gatling
Python	Locust
Cloud managed	BlazeMeter / Octoperf / Tricentis NeoLoad
Browser flow	Playwright / Cypress / k6 browser
Capacity planning	09 performance capacity

不在本頁內的主題

完整 plugins 列表
BeanShell / Groovy scripting
JMeter internal architecture

案例回寫

案例方向	對應主題
LinkedIn：Capacity 與 On-call 分層	企業內部 load test pipeline + headroom 驗證
Shopify：BFCM 容量治理與 Game Day	峰值前 load test scenario 與 capacity baseline 的對照組

待補 JMeter customer case：企業內部 JMeter 大規模採用案例、JMeter → k6 遷移案例。

下一步路由

上游概念：6.13 Performance Regression Gate
平行 vendor：k6、Gatling
下游能力：09 performance capacity

AWS ElastiCache

Fri, 01 May 2026 00:00:00 +0000

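AWS ElastiCache is AWS's managed cache service. It covers three responsibilities: hosting the Redis / Valkey / Memcached engines (no self-managed cache servers), automatic failover plus cross-AZ replication, and native integration with the AWS ecosystem (IAM / VPC / CloudWatch / KMS). Its design trade-off is "hand operational responsibility to AWS and pay a managed premium for a predictable SLA", which makes it the default cache choice inside AWS. Since 2024 the default engine has moved from Redis to Valkey (roughly 20% cheaper).

For the path "a service in the AWS ecosystem needs a cache, does not want to run its own Redis cluster, and needs cross-AZ high availability", ElastiCache is the first choice. This page starts with the shortest path, then covers day-to-day cluster management and engine selection, and ends with advanced governance (Serverless, comparison with MemoryDB) and troubleshooting.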

本章目標

讀完本章後、你應該能：

用 AWS CLI 建立 ElastiCache cluster、選擇 engine（Redis / Valkey / Memcached）
區分 Cluster mode enabled vs disabled 的選用條件
配置 auto failover、cross-AZ replication、snapshot backup
評估 ElastiCache Serverless vs node-based 的成本取捨
區分 ElastiCache 跟 MemoryDB（durable）跟自管 Redis 的定位

最短路徑：5 分鐘把 ElastiCache 跑起來

# 1. Create a Valkey replication group (cluster mode disabled, one primary + 1 replica, Multi-AZ)
aws elasticache create-replication-group \
  --replication-group-id demo \
  --replication-group-description "demo cache" \
  --engine valkey \
  --cache-node-type cache.t4g.micro \
  --num-cache-clusters 2 \
  --automatic-failover-enabled \
  --multi-az-enabled

# 2. Get the primary endpoint (creation takes several minutes; the endpoint exists only once status is available)
aws elasticache describe-replication-groups \
  --replication-group-id demo \
  --query "ReplicationGroups[0].NodeGroups[0].PrimaryEndpoint.Address" --output text

# 3. Connect with redis-cli from inside the VPC (EC2 / Lambda); ElastiCache is reachable only inside the VPC
# Set PRIMARY_ENDPOINT to the address printed by step 2
redis-cli -h "$PRIMARY_ENDPOINT" -p 6379 PING   # → PONG

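Commands follow the official AWS ElastiCache CLI documentation, last checked 2026-06-16 (a managed service needs an AWS account and a VPC, so it cannot be verified locally with docker; treat the official reference as authoritative for arguments). ElastiCache endpoints are reachable only from inside the VPC and are not exposed to the public internet. A real production setup still needs to evaluate cluster mode, node size, replica count, and AZ distribution.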

日常操作與決策形狀

AWS CLI 與 console

子議題：

CLI 指令對照表（create-cache-cluster / create-replication-group / describe-* / modify-* / delete-*）
Console 操作流程（VPC subnet group / security group / parameter group）
Terraform / CloudFormation 範例
對應指令範例：aws elasticache describe-replication-groups --replication-group-id

Engine 選擇

子議題：

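Valkey (default since 2024): about 20% cheaper, open source under an OSI-approved license, forked from Redis 7.2.4
Redis OSS (legacy support): still selectable, but no longer the option AWS promotes
Memcached: pure cache scenarios, no cluster mode concept (client-side sharding)
Selection guide: new deployment → Valkey; migrating an existing Redis → Valkey (API compatible); pure cache → Memcached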

Cluster mode enabled vs disabled

子議題：

Disabled：1 primary + N replica（最多 5）、單 shard、上限 ~340GB
Enabled：多 shard（最多 500）、自動 sharding、橫向擴展
客戶端要求：Cluster mode enabled 需要 cluster-aware client
選擇判讀：< 300GB + 簡單 → disabled；> 300GB 或要 sharding → enabled

Snapshot 與 backup

子議題：

Automatic snapshot（保留 1-35 天）
Manual snapshot（保留永久、可跨 region 複製）
Restore：從 snapshot 建新 cluster
對應指令：aws elasticache create-snapshot

進階主題（按需閱讀）

Auto failover 機制

子議題：

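Multi-AZ deployment: when the primary fails, a replica is promoted automatically
Failover time: roughly 30 seconds to a few minutes (depending on client reconnect)
Client impact: DNS switches to the new primary, and the client has to reconnect
Cross-reference: 2.C6 Netflix EVCache cross-AZ comparison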
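Because the client has to reconnect after the endpoint moves to the new primary, the reconnect path is worth making explicit. The following is a minimal sketch of retry with exponential backoff; `call_with_reconnect`, its parameters, and the use of Python's built-in `ConnectionError` are assumptions for illustration (a real Redis / Valkey client library may raise its own exception type and often ships its own retry options).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_reconnect(
    operation: Callable[[], T],
    reconnect: Callable[[], None],
    attempts: int = 5,
    base_delay: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`; on ConnectionError wait, reconnect, and retry.

    The wait doubles on every failed attempt (base_delay, 2x, 4x, ...).
    The last failure is re-raised so the caller can surface it.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
            reconnect()  # e.g. re-resolve DNS and open a fresh connection
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the timing can be checked in tests without waiting; in production the default `time.sleep` applies.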

ElastiCache Serverless

子議題：

On-demand 模式：不選 node type、按 ECPU + storage 計費
自動 scale：流量增加自動擴
適合：流量不可預測、不想規劃容量
不適合：成本敏感（serverless premium）、極大 dataset

跨 region replication（Global Datastore）

子議題：

Global Datastore：1 primary region + 多個 secondary region read replica
跨 region replication lag < 1 second（業界宣稱）
適合 active-passive DR
不支援 active-active multi-master

MemoryDB 對照

子議題：

ElastiCache：cache、Multi-AZ replica 但仍是 cache 語意（資料可重建）
MemoryDB：Redis-compatible durable database、multi-AZ transaction log
MemoryDB cost 2-3x ElastiCache、但提供 source-of-truth 語意
選擇判讀：要 source-of-truth Redis API → MemoryDB；cache 用途 → ElastiCache

Parameter group 與配置

子議題：

Parameter group：custom maxmemory-policy、timeout、client-output-buffer-limit
Cluster vs parameter group 的應用範圍
對應指令：aws elasticache modify-cache-parameter-group

IAM authentication（Redis 7+）

子議題：

從 Redis AUTH password 升級到 IAM-based authentication
IAM role / user 連 ElastiCache、無需傳 password
對應 security 模組

Cost 模型

子議題：

Node type 成本（t4g.micro 到 r7g.16xlarge）
Reserved Instance（1/3 年承諾、折扣 30-60%）
Data transfer cost（同 AZ 免費、跨 AZ 收費）
Snapshot storage cost

排錯快速判讀

Endpoint 連不上

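Rule of thumb: first confirm that the VPC, security group, and subnet group are configured correctly.

# REPLICATION_GROUP_ID and PRIMARY_ENDPOINT stand for your own replication group id and primary endpoint address
aws elasticache describe-replication-groups --replication-group-id "$REPLICATION_GROUP_ID" \
  --query "ReplicationGroups[0].Status"
# Test connectivity from an EC2 instance inside the VPC
redis-cli -h "$PRIMARY_ENDPOINT" -p 6379 PING

Diagnosis path: security group does not allow 6379 → VPC peering is not routing → DNS resolution fails.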

Failover 過程中 client 持續 error

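Rule of thumb: clients need time to reconnect during a failover; confirm the client has reconnect logic.

# REPLICATION_GROUP_ID stands for your own replication group id
aws elasticache describe-events --source-identifier "$REPLICATION_GROUP_ID" --source-type replication-group
# Look for the failover start / complete events and line them up with the client reconnect timeline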

Replication lag 高

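Rule of thumb: cross-AZ replication is normally in the millisecond range; if it exceeds 1 second, check the CloudWatch ReplicationLag metric. Likely causes are write throughput that is too high or an undersized replica node.

Memory pressure / eviction

Rule of thumb: watch CloudWatch DatabaseMemoryUsagePercentage; above 80%, consider scaling up the node type or adjusting maxmemory-policy.

Snapshot failure

Rule of thumb: taking a snapshot can temporarily fork the engine process (Redis), which uses extra memory; if memory is already tight, the snapshot may fail. Check CloudWatch BytesUsedForCache.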

何時改走其他服務

需求形狀	改走
需要 source-of-truth Redis API	AWS MemoryDB（durable Redis-compatible）
跨雲	自管 Redis / Valkey
極端 throughput single instance	DragonflyDB self-host
Edge / HTTP cache	CloudFront / Cloudflare Cache（T4 候選）
不在 AWS 生態	GCP Memorystore / Azure Cache for Redis
完全 serverless 計費	ElastiCache Serverless（同模組內）/ Momento

不在本頁內的主題

AWS IAM / VPC / Security Group 完整配置（見 security 模組）
CloudFormation / Terraform 完整模板
AWS pricing 詳細計算
ElastiCache vs Memorystore vs Azure Cache 完整對照

案例回寫

直接相關案例

案例	對 ElastiCache 的對應
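2.C6 Netflix EVCache	EVCache is Netflix's self-managed, Memcached-based global cache; the closest in-region analog is ElastiCache for Memcached, while cross-region replication (Global Datastore) is offered on the Valkey / Redis OSS engines rather than Memcached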
2.C5 Shopify write-through	Write-through 在 managed cache 的實作、ElastiCache 提供同樣 Redis/Valkey API、無 self-host 維運負擔
2.C3 Shopify serialization	Payload 雙軌遷移 client-side 實作、ElastiCache 對應為 engine version upgrade + parameter group 滾動

待補 ElastiCache-specific 案例：Airbnb / Lyft / Pinterest 等公開的 ElastiCache 規模化案例、re:Invent talks（如 ElastiCache for Valkey 遷移、Serverless 採用、Global Datastore active-passive DR 實作）。

跨 vendor 對照

案例	對 ElastiCache 的對應
2.C9 Cache Stampede	Managed 也會 stampede、AWS 不會幫你做 client-side jitter / singleflight、需自行設計
2.C10 規模對照	小型 single primary / 中型 Multi-AZ replica / 大型 Cluster mode enabled + Global Datastore
2.C2 Meta mcrouter	ElastiCache 對應為 Cluster mode + Configuration Endpoint（client-side discovery）、無原生 protocol proxy
2.C1 Meta cache consistency	Failover / replica promotion 期間 ElastiCache 也會出現一致性議題、CloudWatch ReplicationLag 是主要訊號
2.C7 Cloudflare Cache Reserve	分層儲存對照、AWS 對應為 ElastiCache（hot）+ S3 / DynamoDB（cold）的應用層分層設計

下一步路由

上游概念：2.2 Cache Aside、0.6 成本取捨
平行 vendor：Redis、Valkey
下游能力：2.7 cache copy boundary

AWS SQS

Fri, 01 May 2026 00:00:00 +0000

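AWS SQS is AWS's managed queue service. It covers three responsibilities: message queuing and retry (visibility timeout + DLQ), decoupling producers from consumers (no broker operations), and native integration with the AWS ecosystem (Lambda / EventBridge / Step Functions). Its design trade-off favors "a minimal API with managed operations, visibility timeout in place of broker ACK, and no native ordering (standard queue)".

For the path "a task queue in the AWS ecosystem, no self-managed broker, paired with Lambda event processing", SQS is the first choice. This page starts with the shortest path, then covers day-to-day SendMessage / ReceiveMessage operations and visibility timeout design, and ends with advanced governance (FIFO, DLQ, IAM, VPC endpoint) and troubleshooting.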

本章目標

讀完本章後、你應該能：

用 AWS CLI 建 standard / FIFO queue、發送與接收訊息
設計 visibility timeout 對齊 consumer 處理時間
配置 DLQ（dead-letter queue）與 maxReceiveCount
區分 long polling vs short polling、配合 Lambda event source mapping
評估 IAM policy、VPC endpoint、cross-account 訪問等治理場景

最短路徑：5 分鐘把 SQS 跑起來

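# 1. Create the queue (returns a QueueUrl, which every later call uses)
aws sqs create-queue --queue-name demo-queue

# 2. Send a message (set QUEUE_URL to the QueueUrl returned by step 1)
aws sqs send-message --queue-url "$QUEUE_URL" --message-body "hello"

# 3. Receive the message (long polling, waits up to 20 seconds)
aws sqs receive-message --queue-url "$QUEUE_URL" --wait-time-seconds 20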

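The shortest path verifies that the queue can be created and that messages can be sent and received. Real applications use an SDK or Lambda; see day-to-day operations. Running these commands against real AWS requires credentials and a region; to verify locally first, add --endpoint-url pointing at an SQS-compatible local emulator and run the same commands.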

日常操作與決策形狀

AWS CLI 與 SDK

子議題：

AWS CLI 指令對照表（create-queue / send-message / receive-message / delete-message / set-queue-attributes）
SDK 配置：region / credentials / retry policy / timeout
Batch operation（SendMessageBatch、DeleteMessageBatch、最多 10 條）
對應指令範例：aws sqs get-queue-attributes --queue-url

Standard vs FIFO queue

子議題：

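Standard: high throughput, at-least-once delivery, no ordering guarantee; fits most task queues
FIFO: exactly-once-ish processing (5-minute deduplication window), ordering per MessageGroupId, limited throughput (about 300 msg/sec per API action without batching, 3,000 msg/sec with batching)
Selection guide: weigh the ordering requirement against throughput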

Visibility timeout 與 in-flight

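Visibility timeout is SQS's delivery control mechanism and takes the place of a broker ACK:

Once received, a message becomes in-flight and is hidden from other consumers
The consumer calls DeleteMessage after processing; otherwise the message returns to the queue when the timeout expires
ChangeMessageVisibility extends the timeout dynamically (for long-running tasks)
Default is 30 seconds, maximum is 12 hours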
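The receive / delete / extend cycle above can be made concrete with a small in-memory model. This is a teaching sketch, not the SQS API: `VisibilityQueue` and its methods are hypothetical names, time is passed in explicitly, and the handle stays the same across redeliveries (real SQS issues a new receipt handle on every receive).

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class _Message:
    body: str
    visible_at: float = 0.0   # message is hidden until this timestamp
    receive_count: int = 0


@dataclass
class VisibilityQueue:
    visibility_timeout: float = 30.0
    _messages: dict = field(default_factory=dict)
    _ids: itertools.count = field(default_factory=itertools.count)

    def send(self, body: str) -> None:
        self._messages[next(self._ids)] = _Message(body)

    def receive(self, now: float):
        """Return (handle, body) for the first visible message, or None."""
        for handle, msg in self._messages.items():
            if msg.visible_at <= now:
                msg.visible_at = now + self.visibility_timeout  # now in flight
                msg.receive_count += 1
                return handle, msg.body
        return None

    def change_visibility(self, handle: int, now: float, timeout: float) -> None:
        """Extend (or shorten) the in-flight window, counted from `now`."""
        self._messages[handle].visible_at = now + timeout

    def delete(self, handle: int) -> None:
        """Acknowledge successful processing; the message never returns."""
        self._messages.pop(handle, None)
```

A consumer that takes longer than `visibility_timeout` without calling `change_visibility` or `delete` will see the same message handed out again, which is the redelivery pattern described in the troubleshooting section.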

DLQ 設計（dead-letter queue）

子議題：

maxReceiveCount：訊息被接收 N 次後送 DLQ
DLQ 監控與 alarm（CloudWatch metric）
Redrive policy（從 DLQ 重新放回 main queue）
對應 poison message 處理思路
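As a sketch of how maxReceiveCount is attached to a source queue: SQS takes the redrive policy as a JSON string in the queue's `RedrivePolicy` attribute. The helper name, the default of 5, and the ARN in the usage note are illustrative assumptions.

```python
import json


def dlq_attributes(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """Build the queue attributes that route repeatedly failing messages to a DLQ.

    `RedrivePolicy` is a JSON-encoded string, not a nested object.
    """
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    redrive_policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    return {"RedrivePolicy": json.dumps(redrive_policy)}


if __name__ == "__main__":
    # Hypothetical account id and queue name.
    print(dlq_attributes("arn:aws:sqs:us-east-1:123456789012:demo-queue-dlq"))
```

The returned dict is shaped to be passed as the `Attributes` argument of a SetQueueAttributes call on the source queue. A low maxReceiveCount moves poison messages out quickly but also gives transient downstream failures fewer retries.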

進階主題（按需閱讀）

visibility timeout、polling、Lambda event source 與 cost 已展開為 deep article：visibility timeout / long polling / Lambda + cost。下列子議題段保留選題判讀入口。

Long polling vs Short polling

子議題：

Short polling（預設）：立即回應、可能空回（高 cost）
Long polling（WaitTimeSeconds 1-20）：等到有訊息或超時
對 cost 與 latency 的取捨

SQS + Lambda event source mapping

子議題：

Lambda 自動 poll SQS（managed event source）
Batch size / batch window 配置
Partial batch failure（ReportBatchItemFailures）
對應 3.C8 Cloudflare Queues 的全球交付對照
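Partial batch failure can be sketched as a Lambda handler that reports only the failed message ids, so the successful ones are deleted and the rest are retried. This assumes the event source mapping has ReportBatchItemFailures enabled; `process` and its `order_id` check are hypothetical business logic.

```python
import json


def process(body: dict) -> None:
    """Hypothetical business logic: reject payloads without an order_id."""
    if "order_id" not in body:
        raise ValueError("missing order_id")


def handler(event: dict, context: object) -> dict:
    """Process each SQS record and report the ones that failed."""
    failures = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only this message becomes visible again; the rest of the batch is deleted.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Without this response shape, one bad record causes the whole batch to be retried, which multiplies duplicate processing for the records that had already succeeded.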

IAM / Cross-account 訪問

子議題：

Queue policy（resource-based）vs IAM policy（identity-based）
Cross-account producer / consumer 設定
Encryption（SSE-SQS / SSE-KMS）

VPC endpoint（私網訪問）

子議題：

Interface endpoint（PrivateLink）
適合不想經 public internet 的場景
跟 NAT Gateway 的 cost 對照

CloudWatch metric 與 alarm

子議題：

ApproximateNumberOfMessagesVisible（queue depth）
ApproximateAgeOfOldestMessage（lag 訊號）
NumberOfMessagesSent / Received / Deleted
Alarm 設計（depth 暴增、age 超 SLO）

Cost 模型

子議題：

Request cost（每百萬 request）
Data transfer cost（跨 region 才有）
FIFO 比 standard 貴的判讀
對應 0.6 成本取捨

排錯快速判讀

Message 反覆 redelivery（看到同訊息多次）

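Rule of thumb: the visibility timeout is shorter than the consumer's processing time, so the message returns to the queue and is picked up by another consumer.

# QUEUE_URL stands for your own queue URL
aws sqs get-queue-attributes --queue-url "$QUEUE_URL" --attribute-names VisibilityTimeout
# A new queue defaults to a VisibilityTimeout of 30 seconds; processing that takes longer will show redelivery

Adjustment: lengthen VisibilityTimeout, or have the consumer call ChangeMessageVisibility while it works.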

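DLQ backlog

Rule of thumb: look at the DLQ message contents first and decide whether it is a poison message or a stuck downstream.

Diagnosis path: malformed message (fails every time) → downstream service down (temporary failure, can be redriven) → consumer bug.

Throttling (account quota)

Rule of thumb: the account-level SendMessage / ReceiveMessage TPS was exceeded; check CloudWatch ThrottledRequests. Handling: retry with exponential backoff and jitter, batch requests, and request a quota increase.

IAM permission errors

Rule of thumb: most access denied errors come from the interaction between the queue policy and the IAM policy. Diagnosis: use the IAM Policy Simulator or CloudTrail to find the reason for the deny.

Lambda event source failures

Rule of thumb: with an SQS event source mapping, a failed batch is not deleted, so its messages become visible again after the visibility timeout and are retried; once a message exceeds maxReceiveCount, the source queue's redrive policy moves it to the SQS DLQ. A DLQ configured on the Lambda function itself covers asynchronous invocations, so keep the two roles distinct.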

何時改走其他服務

需求形狀	改走
需要 streaming / replay	AWS Kinesis / Kafka / MSK
需要 pub/sub fan-out	AWS SNS（搭配 SQS 做 fan-out）/ EventBridge
需要複雜 routing	RabbitMQ on EC2
跨雲 / 跨平台	Kafka / NATS
嚴格低延遲（< 100ms）	NATS / Redis
Workflow + durable execution	AWS Step Functions / Temporal

不在本頁內的主題

SNS / EventBridge 細節（另開 cloud event routing 章節）
Step Functions / Lambda 完整功能
AWS SDK 各語言完整 API

案例回寫

SQS 專屬案例（C48-C59）

案例	主討論議題
3.C48 Airbnb Dynein	分散式延遲任務 / at-least-once + DLQ
3.C49 Airbnb Inspekt	Visibility timeout 當隱式 retry
3.C50 Capital One	Visibility timeout / Lambda event source
3.C51 Atlassian JiRT	Kinesis + per-consumer SQS
3.C52 Nielsen Spark on EKS	雙 SQS / queue depth autoscale
3.C53 FINRA Large File	S3 → SQS 合規 / IAM 多層稽核
3.C54 Twitch EventSub	SNS-SQS fan-out + Dispatcher
3.C55 SmugMug search	Workload generator / 平行 scan + replay
3.C56 PostNL EBE	完整 DLQ + redrive + 隔離 stack
3.C57 Lob sqs-consumer	Client library / SDK v3 / FIFO bug
3.C58 Twilio webhook	Webhook → SQS buffer / FIFO 300 TPS
3.C59 Rapid7 scale	100 億 msg/day 規模參考點

跨 vendor 對照

案例	對 SQS 的對應
3.C2 VMware → MSK	反面對照：何時 managed queue 不夠用、要升 streaming
3.C8 Cloudflare Queues	全球交付對照（SQS 是 region-scoped）
3.C10 規模對照	小型直接用 SQS / 中型補 idempotency / 大型補 streaming

下一步路由

Elastic Stack

Fri, 01 May 2026 00:00:00 +0000

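Elastic Stack (formerly ELK) is a logs-heavy observability stack. It covers three responsibilities: Elasticsearch for search and analytics (full-text + structured query), Beats / Logstash as the collection pipeline, and Kibana for visualization plus Elastic APM (traces). Its design trade-off favors "search at the core + a unified search interface + Elastic Security SIEM integration". After the 2021 license change, AWS forked OpenSearch as an Apache 2.0 alternative.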

本章目標

讀完本章後、你應該能：

部署 Elasticsearch + Kibana + Beats 基本棧
用 KQL / Lucene 查詢 logs、用 ES DSL 寫進階搜尋
設計 index lifecycle（hot / warm / cold / frozen）
評估 Beats / Logstash / Fluent Bit / Vector 的採集選擇
評估 Elastic License vs OpenSearch fork 的取捨

最短路徑：5 分鐘把 Elastic Stack 跑起來

# 1. Run ES + Kibana with docker-compose
# TODO: docker-compose.yml with elasticsearch + kibana

# 2. Collect host logs with Filebeat
# TODO: filebeat.yml with inputs + output.elasticsearch

# 3. Verify with a query in Kibana
# TODO: KQL: `@timestamp >= now-15m AND log.level: "error"`

日常操作與決策形狀

採集 pipeline

子議題：

Beats（Filebeat / Metricbeat / Packetbeat / Heartbeat / Auditbeat）：輕量、各自專屬
Logstash：重型 ETL（grok parsing / enrichment / 多 output）
Fluent Bit / Vector：替代採集 agent（更輕量、OSS）
對應 4.C6 ADOT EKS 對照

查詢語法

子議題：

KQL（Kibana Query Language）：直覺、適合日常查詢
Lucene query string：複雜搜尋、boolean operators
ES DSL（JSON）：API 級進階查詢
ES|QL (Elasticsearch Query Language, ES 8.11+): SQL-like piped syntax

Index 設計

子議題：

Index template（mapping / settings）
Data streams（time-series log / metrics）
Field types：keyword / text / date / numeric / object / nested
Dynamic mapping 風險：unbounded field 爆 index

Index Lifecycle Management（ILM）

子議題：

Hot phase：active write
Warm phase：read-only、查詢頻率低
Cold phase：searchable snapshot（S3 / object storage）
Frozen phase（ES 7.12+）：searchable snapshot + minimal cluster resource
Delete phase
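The phase sequence above maps onto the body of an ILM policy (sent with `PUT _ilm/policy/<name>`). The sketch below builds a hot → warm → delete policy as a Python dict; the function name and every threshold (50gb, 7d, 90d) are assumptions to adapt, and the cold / frozen phases are left out because their searchable snapshot action needs a snapshot repository that is not defined here.

```python
import json


def build_ilm_policy(
    rollover_size: str = "50gb",
    rollover_age: str = "7d",
    warm_after: str = "7d",
    delete_after: str = "90d",
) -> dict:
    """Return an ILM policy body with hot, warm, and delete phases."""
    return {
        "policy": {
            "phases": {
                # Hot: actively written; roll over on size or age, whichever comes first.
                "hot": {
                    "actions": {
                        "rollover": {
                            "max_primary_shard_size": rollover_size,
                            "max_age": rollover_age,
                        }
                    }
                },
                # Warm: no more writes, queried less often.
                "warm": {"min_age": warm_after, "actions": {"readonly": {}}},
                # Delete: retention limit reached.
                "delete": {"min_age": delete_after, "actions": {"delete": {}}},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_ilm_policy(), indent=2))
```

Note that `min_age` in later phases is measured from rollover, so "warm after 7d" means seven days after the index stopped being the write index.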

Deep Article

Index Lifecycle Management 與 Log Pipeline：ILM policy 設計、data stream / rollover、Beats vs Elastic Agent 採集選擇、ingest pipeline 與 shard sizing、cost governance

Migration Playbook

Elastic Cloud 遷移：自管 Elastic Stack 遷移到 Elastic Cloud

進階主題（按需閱讀）

Elastic APM

子議題：

APM Server 接收 trace data
各語言 APM agent（Java / Python / Node / .NET / Go / Ruby / PHP）
接受 OTLP（ES 7.16+）
Service map / dependency 視覺化

Elastic Security（SIEM）

子議題：

SIEM dashboard / detection rule
ECS（Elastic Common Schema）跨資料統一 field naming
Sigma rule import
跟 07 security 模組對照

Cluster scaling

子議題：

Node roles：master / data / ingest / coordinating / ML / transform
Hot-warm-cold architecture
Shard sizing (Elastic's general guidance is roughly 10-50GB per shard; 20-40GB is a comfortable middle)
Cross-cluster search / replication

Elastic License vs OpenSearch fork

子議題：

2021 Elastic 改 ELv2 / SSPL（非 OSI 認可）— AWS 不能提供「Elasticsearch as a Service」
AWS fork OpenSearch（Apache 2.0、基於 ES 7.10）
OpenSearch 持續演進、跟 ES 功能逐漸分歧
選擇判讀：合規 → OpenSearch；要最新 ES feature → Elastic

Searchable Snapshots

子議題：

把 cold/frozen index 存 S3 / GCS / Azure Blob
查詢時動態 hydrate、成本降 80%+
適合 logs retention 長但查詢頻率低
對應 4.C3 Healthcare retention

Vector / Fluent Bit 採集替代

子議題：

為何用 Vector / Fluent Bit：更輕、resource 用量低
Beats 在 K8s 跑起來資源耗較大
對應 cost 跟 maintainability 取捨

排錯快速判讀

Index mapping explosion

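Rule of thumb: dynamic mapping automatically adds a field mapping for every unknown field, so documents with unbounded keys inflate the mapping until the index hits its field limit (index.mapping.total_fields.limit).

# Inspect the current mapping to see how many fields exist (index is your index name)
GET /index/_mapping
# Lock the mapping so unknown fields are rejected instead of added
PUT /index/_mapping
{ "dynamic": "strict" }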
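To judge how close an index is to a mapping explosion, the fields in the mapping returned by the get-mapping API can be counted. The sketch below walks the `properties` tree; it is an approximation of how Elasticsearch counts fields against the limit (default 1000), and the sample mapping is made up for illustration.

```python
def count_mapped_fields(mapping: dict) -> int:
    """Count fields in a mapping: each property, its multi-fields, and nested sub-fields."""
    total = 0
    for definition in mapping.get("properties", {}).values():
        total += 1                                   # the field (or object) itself
        total += len(definition.get("fields", {}))   # multi-fields such as message.keyword
        total += count_mapped_fields(definition)     # object / nested children
    return total


# Hypothetical mapping body, i.e. the value under <index>.mappings in the API response.
SAMPLE_MAPPING = {
    "properties": {
        "message": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
        "user": {
            "properties": {
                "id": {"type": "keyword"},
                "name": {"type": "text"},
            }
        },
    }
}

if __name__ == "__main__":
    print(count_mapped_fields(SAMPLE_MAPPING))
```

Tracking this number per index over time gives an early signal that some producer is emitting unbounded keys, before writes start being rejected.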

Cluster yellow / red

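Rule of thumb: cluster status affects queries. Yellow means some replica shards are unassigned; red means at least one primary shard is unassigned, so part of the data is unavailable.

# Overall status and unassigned shard count
GET /_cluster/health
# Per-shard view to find the unassigned shards
GET /_cat/shards?v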

Query 過慢

Rule of thumb: for result sets beyond 10K hits, use search_after or scroll; for aggregations on a text field, aggregate on a keyword field instead.

Disk pressure

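Rule of thumb: disk watermarks act in stages. At the default low watermark (85%) Elasticsearch stops allocating new shards to the node, at the high watermark (90%) it tries to relocate shards away, and at the flood-stage watermark (95%) it marks the affected indices read-only (read_only_allow_delete). Diagnosis: check the cluster.routing.allocation.disk.watermark.* settings.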
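Elasticsearch's disk-based allocation uses three watermarks with default values of 85% (low), 90% (high), and 95% (flood stage). A small classifier makes the stages explicit for alerting; the function name and the returned labels are illustrative, and clusters that override the watermark settings should pass their own thresholds.

```python
def disk_watermark_state(
    used_percent: float,
    low: float = 85.0,
    high: float = 90.0,
    flood_stage: float = 95.0,
) -> str:
    """Classify a node's disk usage against the allocation watermarks.

    ok          -> below the low watermark
    low         -> no new shards are allocated to this node
    high        -> shards are relocated away from this node
    flood_stage -> indices with a shard on this node become read-only
    """
    if not 0 <= used_percent <= 100:
        raise ValueError("used_percent must be between 0 and 100")
    if used_percent >= flood_stage:
        return "flood_stage"
    if used_percent >= high:
        return "high"
    if used_percent >= low:
        return "low"
    return "ok"
```

Alerting on the `low` state leaves room to add capacity or delete old indices before writes are blocked at flood stage.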

Logstash backpressure

Rule of thumb: when the Logstash queue is full, backpressure builds up on the upstream Beats. Diagnosis: check the Logstash monitoring page.

何時改走其他服務

需求形狀	改走
Pure metrics	Prometheus / Mimir
純 logs 但 less search	Loki（Grafana Stack）— 更便宜
SaaS turnkey APM	Datadog
AWS-managed Elastic	OpenSearch on AWS（Apache 2.0）
Cloud-native logs	CloudWatch Logs / Cloud Logging
多 tier observability	Datadog / Grafana Stack
Enterprise SIEM	Splunk / Microsoft Sentinel

不在本頁內的主題

ES query DSL 完整 reference
Lucene scoring 演算法
Kibana dashboard 美術
Elastic ML / Anomaly Detection 細節

案例回寫

直接相關案例

案例	主討論議題
4.C1 Fintech audit	Logs 作為 audit evidence
4.C3 Healthcare retention	Index Lifecycle / retention

跨 vendor 對照

案例	對 Elastic Stack 的對應
4.C6 ADOT EKS pipeline	Beats / Logstash ↔ OTel Collector 採集 pipeline 對照
4.C10 規模對照	小型 single-node / 中型 hot-warm / 大型 hot-warm-cold-frozen

下一步路由

上游概念：4.17 Telemetry Data Quality
平行 vendor：Grafana Stack（Loki 對照）、OpenTelemetry
下游能力：4.20 Observability Evidence Package

Envoy

Fri, 01 May 2026 00:00:00 +0000

Envoy is a CNCF graduated service proxy. It covers three responsibilities: cloud-native L7 + L4 proxying (HTTP/1.1 + HTTP/2 + HTTP/3 + gRPC), xDS dynamic config (no reload needed), and built-in observability (access log / stats / tracing). Its design trade-off favors "dynamic config + advanced traffic management + filter chain extensibility", and it is the underlying data plane for Istio / AWS App Mesh / Envoy Gateway (Linkerd uses its own Rust proxy, Linkerd2-proxy, not Envoy).

For the path "service mesh data plane, API Gateway, advanced traffic management, gRPC / HTTP/2 / HTTP/3", Envoy is the first choice.

本章目標

讀完本章後、你應該能：

跑起 Envoy + 基本 reverse proxy config
用 xDS API 動態更新 config（不 reload）
配置 listener / route / cluster / filter chain
看懂 Envoy access log + stats + admin endpoint
評估 Envoy 直接用 vs 用 Istio / Envoy Gateway 抽象

最短路徑：5 分鐘把 Envoy 跑起來

# 1. Start Envoy
docker run -d --name envoy-demo \
  -p 9901:9901 -p 10000:10000 \
  -v "$(pwd)/envoy.yaml:/etc/envoy/envoy.yaml:ro" \
  envoyproxy/envoy:v1.31-latest

Static config example (envoy.yaml), which must exist in the current directory before step 1 runs:

static_resources:
  listeners:
  - name: listener_0
    address: { socket_address: { address: 0.0.0.0, port_value: 10000 } }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          route_config:
            virtual_hosts:
            - name: backend
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route: { cluster: service_backend }
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: service_backend
    connect_timeout: 5s
    type: STRICT_DNS
    load_assignment:
      cluster_name: service_backend
      endpoints:
      - lb_endpoints:
        # Assumes an upstream resolvable as hostname "app" on port 8080
        - endpoint: { address: { socket_address: { address: app, port_value: 8080 } } }
admin:
  address: { socket_address: { address: 0.0.0.0, port_value: 9901 } }

# 2. Verify the proxy and the admin endpoint
curl http://localhost:10000                    # proxy path
curl http://localhost:9901/stats               # metrics
curl http://localhost:9901/clusters            # upstream health
curl http://localhost:9901/config_dump         # running config

日常操作與決策形狀

Envoy config 結構

子議題：

Listener：listen address + filter chain
Route：path matching + cluster routing
Cluster：upstream endpoint discovery + load balancing
Endpoint：實際 backend
對應 5.3 LB Contract

Static vs Dynamic config

子議題：

Static：YAML 寫死、適合 dev / debug
Dynamic（xDS）：control plane push config
xDS protocol：LDS / RDS / CDS / EDS / SDS
對應 control plane：Istio / Gloo / 自寫

Admin endpoint

子議題：

/stats / /clusters / /config_dump / /listeners / /server_info
runtime config（/runtime_modify）
對應 observability 跟 debug
對應指令：curl admin:9901/clusters

進階主題（按需閱讀）

xDS API 細節

子議題：

LDS / RDS / CDS / EDS / SDS / RTDS / ECDS
ADS（Aggregated Discovery Service）統一通道
Delta xDS（incremental）vs SOTW（State of the World）
對應案例 5.C7 Airbnb Istio

Filter chain（HTTP / network filter）

子議題：

HTTP filters：router / cors / fault / rate_limit / ext_authz / jwt_authn
Network filters：tcp_proxy / mongo_proxy / redis_proxy
自訂 filter（C++ / WebAssembly）
對應 security 模組（ext_authz）

Observability 內建

子議題：

Access log（structured / configurable format）
Stats（envoy 內建 metrics）
Distributed tracing（Jaeger / Zipkin / Datadog / OpenTelemetry）
對應 04 observability

Envoy Gateway / Emissary / Gloo

子議題：

Envoy Gateway：Gateway API native（CNCF project）
Emissary（前 Ambassador）：K8s ingress + API Gateway
Gloo：Solo.io 商業 Envoy 整合
選型判讀：純 K8s ingress → Envoy Gateway；商業支援 → Gloo / Emissary

Service mesh data plane

子議題：

Istio：control plane + Envoy sidecar
Linkerd2：自家 Rust proxy（不是 Envoy）— Linkerd2-proxy
Cilium Service Mesh：eBPF + Envoy
對應 5.C7 Airbnb Istio governance

WebAssembly extension

子議題：

WASM filter：跨語言寫 Envoy extension（Rust / AssemblyScript / Go）
跟 Lua（OpenResty 模式）對比
適合：custom auth / rate limit / metric collection

Advanced traffic management

子議題：

Retry / Circuit breaker / Outlier detection
Timeout（connect / request / idle）
Traffic split（canary / blue-green / mirror）
Rate limit（local + global）

排錯快速判讀

Config sync 失敗

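Rule of thumb: the xDS control plane is unreachable, or the pushed config is malformed. Diagnosis: look for update_failure in admin /stats and inspect the current config with /config_dump.

Listener config error

Rule of thumb: malformed YAML, a port conflict, or a wrong bind address. Diagnosis: startup log plus admin /listeners.

All cluster endpoints unhealthy

Rule of thumb: health checks fail, SDS did not deliver a cert, or the network path is broken. Diagnosis: check endpoint state in admin /clusters.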

Circuit breaker trip

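Rule of thumb: Envoy circuit breakers trip on per-cluster resource limits (max connections, pending requests, requests, retries) rather than on an upstream failure rate; ejecting hosts because of errors is outlier detection. Diagnosis: in admin /stats, look at the cluster's overflow counters (for example upstream_rq_pending_overflow, upstream_cx_overflow) and the circuit_breakers.* gauges.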
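When checking whether a circuit breaker has tripped, the overflow counters in the admin `/stats` text output are the quickest signal. The sketch below parses the plain `name: value` lines and returns the non-zero overflow counters; the sample text is fabricated for illustration, and only three well-known counter suffixes are matched.

```python
OVERFLOW_SUFFIXES = (
    "upstream_rq_pending_overflow",  # pending-request limit hit
    "upstream_cx_overflow",          # connection limit hit
    "upstream_rq_retry_overflow",    # retry budget / limit hit
)


def overflow_counters(stats_text: str) -> dict:
    """Return {stat_name: count} for every overflow counter greater than zero."""
    result = {}
    for line in stats_text.splitlines():
        name, sep, value = line.partition(": ")
        if not sep or not name.endswith(OVERFLOW_SUFFIXES):
            continue
        try:
            count = int(value)
        except ValueError:
            continue  # skip anything that is not a plain counter
        if count > 0:
            result[name] = count
    return result


# Illustrative excerpt in the shape of `curl localhost:9901/stats` output.
SAMPLE_STATS = """cluster.service_backend.upstream_rq_total: 1200
cluster.service_backend.upstream_rq_pending_overflow: 37
cluster.service_backend.upstream_cx_overflow: 0
cluster.service_backend.upstream_rq_retry_overflow: 4
"""

if __name__ == "__main__":
    print(overflow_counters(SAMPLE_STATS))
```

A rising pending overflow with healthy upstreams usually points at limits that are too low for the traffic, while overflow alongside ejected hosts points at the upstream itself.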

Tracing missing spans

Rule of thumb: the tracer config or sampling rate is wrong, or context propagation is broken. Cross-reference: 04 observability OTel.

何時改走其他服務

需求形狀	改走
配置簡單 / 小場景	nginx
Cloud-native auto-discovery	Traefik
AWS managed	AWS ELB
K8s ingress only	Ingress-nginx / Envoy Gateway / Gateway API
Service mesh control plane	Istio / Linkerd / Consul Connect
Edge proxy / CDN	Cloudflare / Fastly / CloudFront

不在本頁內的主題

完整 Envoy YAML schema reference
xDS protocol binary format
各 Istio / Gloo / Emissary 細節（見各自 docs）
Envoy C++ filter 開發

案例回寫

直接相關案例

案例	主討論議題
5.C7 Airbnb Istio governance	Envoy-based service mesh 在大規模叢集的分批升級與可重播流程

跨 vendor 對照

案例	對 Envoy 的對應
5.C1 Tradeshift self-managed → EKS	Tradeshift 選 Linkerd（非 Envoy）做切流、對照 Envoy/Istio 的取捨
5.C9 cutover without drain	Envoy outlier detection / circuit breaker / draining listener 是回退面
5.C10 規模對照	大規模 / 複雜 traffic / 多 DC → Envoy mesh 才能撐住協同節奏

待補 Envoy 案例：Lyft 自家 Envoy production 案例、Stripe / Reddit 用 Envoy 邊緣案例、Envoy Gateway 早期 adopter。

下一步路由

上游概念：5.3 LB Contract
平行 vendor：nginx、Traefik
下游能力：04 observability OTel、07 security

FireHydrant

Fri, 01 May 2026 00:00:00 +0000

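FireHydrant is an incident response (IR) platform. It covers three responsibilities: the incident response lifecycle (declare / respond / update), retrospective workflow plus runbook automation, and cross-platform integration (both Slack and Microsoft Teams). A status page is built in, and an on-call module was added later. Its design trade-off favors "complete IR + retrospective + Teams support"; the main difference from incident.io is Teams friendliness.

Service positioning

FireHydrant's core positioning is an IR platform driven by the service catalog. It rests on three supports (service ownership, runbook automation, and retrospective workflow) rather than treating Slack as just a chat surface. The bottom layer is the service catalog (service / team / dependency / owner metadata): as soon as an incident is declared, the affected service and on-call team are linked automatically. The upper layer is the runbook engine (trigger + action DAG) and the retrospective workflow (template + facilitator + action item tracking). It sits at the same layer as incident.io; the difference is being Teams-native rather than Slack-only, so Microsoft 365 and Salesforce-heavy enterprises are FireHydrant's home ground. Compared with PagerDuty, it is an IR + retrospective platform rather than a paging platform: lifecycle coverage is broader, but the on-call module is relatively young. Compared with Rootly, it is catalog-first rather than AI / no-code first.

Key tension: service catalog completeness versus runbook automation as a black box is the biggest trade-off for FireHydrant customers. If the catalog is poorly maintained, runbooks page the wrong team and retrospective owners cannot be found; yet catalog maintenance cost tends to be seen as a burden on the platform team. Be clear about how much catalog governance you are willing to invest in exchange for how much IR automation.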

本章目標

整合 FireHydrant 到 Slack / Teams
配置 incident lifecycle + severity matrix
用 Runbook automation 自動化 standard response
用 Retrospective facilitator 跑復盤
評估 FireHydrant vs incident.io / Rootly

最短判讀路徑

判斷 FireHydrant deployment 是否健康、最少看四件事：

Runbook automation 範圍：runbook 是否走版控（API / Terraform Provider）、trigger 條件是否有 staging dry-run、high-impact action（自動 page exec / 自動發 customer notification）是否走 approval gate 而非 fire-and-forget
Service catalog 完整度：service / team / dependency / owner 是否齊全、stale entry 是否有 review cadence、incident declare 時 affected service dropdown 是否能立即定位、catalog 是否跟 ServiceNow CMDB / Backstage / Salesforce 同步
Retrospective workflow：incident close 後是否自動觸發 retrospective、facilitator 是否指定、action item 是否寫回 Jira / Linear 並 track close-rate、template 是否區分 sev1 / sev2 不同深度
SSO + audit：SCIM provisioning 是否跟 IdP 同步、admin / responder / viewer 三層角色是否區分、audit log 是否 export 到 Splunk 或 SIEM

四件事任一缺失、就是 Drills and On-call Readiness 邊界的待補項目。

最短路徑

# 1. Sign up and install the Slack / Teams app
# 2. Configure the severity matrix / roles
# 3. Declare a test incident
# 4. Run the retrospective workflow

日常操作與決策形狀

Incident lifecycle

子議題：

Severity matrix（impact × urgency）
Status workflow（detected → investigating → identified → monitoring → resolved）
Role：commander / scribe / SME

Runbook automation + Retrospective

子議題：

預定 runbook（auto page / 建 Jira / open Zoom）
Trigger condition
Retrospective template + facilitator role + action items

核心取捨表

取捨維度	FireHydrant	incident.io	PagerDuty	Rootly
Chat 主場	Slack + Teams 雙支援	Slack-first（Teams 後加）	Slack / Teams（chat 非核心）	Slack-first
核心抽象	Service catalog + runbook	Incident workflow + AI assist	On-call schedule + paging	No-code workflow + AI
Retrospective	內建 facilitator + template + action 追蹤	內建、AI assist 草稿	弱、靠 integration	內建、AI summary
Catalog	一級概念、service / team / dependency	有 catalog、深度較淺	Service 概念存在、不強調 ownership	有 catalog、強調 no-code 編輯
On-call	後加模組、相對年輕	內建、跟 incident workflow 整合	業界最成熟	內建
整合主場	ServiceNow / Salesforce / Microsoft	Linear / Notion / GitHub	廣泛、paging-centric	Jira / Slack
適合場景	Enterprise + Teams + service ownership-heavy	Slack-native + 高速 startup	Paging-first + 已有 IR tooling	No-code / AI-forward + 中型團隊

選 FireHydrant 的核心訴求：service ownership 是組織一級概念（platform team / SRE 已維護 catalog）、Microsoft 365 / Teams 是預設辦公 surface、retrospective + action item 追蹤要 first-class。Slack-only + startup 速度優先走 incident.io；paging 是核心走 PagerDuty。

進階主題（按需閱讀）

Status page 內建

子議題：不需另接 Statuspage / Instatus、Component / incident sync、Subscriber notification

Cross-platform（Slack + Teams）

子議題：同帳號跨兩平台、Microsoft Teams enterprise 需求

On-call 模組 + Service catalog

子議題：後加 module、service / team / dependency metadata 跟 incident 自動關聯

Runbook automation（trigger + action DAG）

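A runbook is a DAG of triggers (severity escalation / service label / time elapsed) and actions (page a team / create a Zoom / create a Jira ticket / send a customer notification / update the status page). A production design has to answer three questions: which actions can be fire-and-forget (create a Zoom, create a ticket), which need an approval gate (send a customer notification, automatically page an exec), and what the failure path is (when an action fails, does the commander get notified, or is it silently skipped). Runbooks are version-controlled through the API / Terraform Provider and are not edited directly in the console for production.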

Service catalog + dependency

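First-class catalog fields: service / owning team / on-call rotation / upstream dependency / downstream consumer / tier (critical / standard / experimental). The point is that once the affected service is selected at incident declaration, the owning team, on-call rotation, and notification scope are derived automatically. A stale catalog is the biggest failure mode: team reorganizations that were never synced, deprecated services that were never removed, ownership left with employees who have departed. Cross-reference: the CMDB / inventory governance principles in the 9 IT asset module.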

ServiceNow / Salesforce 整合

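FireHydrant's integration with the ServiceNow / Salesforce ecosystem is a differentiator: an incident automatically creates a ServiceNow ticket (linked to the CMDB CI), a Salesforce case escalation automatically declares an incident, and Customer Success sees the affected account list in Salesforce. This is a common deployment pattern for enterprise customers.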

Signals（alerting layer）

FireHydrant Signals 是 alerting / paging layer、跟 PagerDuty 直接對打 — alert source（Datadog / Prometheus / Sentry etc）→ Signals → on-call rotation。意義是 paging 不再需要外接 PagerDuty、FireHydrant 一站涵蓋 alert → incident → retrospective。但成熟度仍年輕、PagerDuty paging 細節（escalation policy / override / global event routing）仍有差距。

AI features

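FireHydrant later added AI assist: incident summary drafts, retrospective drafts, and similar incident suggestions. It is positioned as assistance and does not replace commander / facilitator judgment. In production, limit its use to drafts plus human review, and do not auto-publish external communication.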

排錯快速判讀

Severity matrix 不一致：跨 team 定義不同、用 catalog default + onboarding
Runbook 沒觸發：trigger 不滿足 / integration token 失效
Status page 不同步：自動 / 手動 sync 配置錯
Retrospective 沒人做：close 後沒 prompt / facilitator 沒指派
Service catalog stale：team 重組沒同步、ownership 落在離職員工身上 — 設 quarterly review cadence、catalog 走 PR + owner attestation、跟 IdP / HR system join 偵測 orphan ownership
Runbook action 黑箱 fire-and-forget：自動發 customer notification 結果發錯客群、自動 page exec 結果半夜誤叫 — high-impact action 走 approval gate、failure path 要顯式通知 commander、不能默默 skip
SSO sync drift：SCIM 沒同步離職 user、admin 角色沒回收 — SCIM provisioning 必開、admin 角色走 break-glass、audit log export 到 SIEM 對賬

何時改走其他服務

需求形狀	改走
Slack-first	incident.io
No-code / AI	Rootly
Paging-first	PagerDuty
Atlassian 套件	Opsgenie + JSM

不在本頁內的主題

各 integration 完整 setup / Pricing / Teams workflow 細節

案例回寫

FireHydrant 偏向 Microsoft Teams + Jira 生態的 IR 平台：本案例庫尚無直接揭露 FireHydrant 使用細節的事故；可參照的閱讀脈絡是「企業套件 + 跨產品 IR」與「service ownership-heavy enterprise 跨產品依賴」的事故。

案例	對應主題
Microsoft 365 cases	Teams + 套件級事故的 IR 協作對照、ServiceNow ticket join 場景
Azure AD cases	身份控制面事故的跨產品依賴對照、SSO drift 跟 service catalog ownership 失準對應
Atlassian cases	Jira / Confluence 生態事故、retrospective action item 寫回流程的失敗模式

待補 candidate：Snyk / Vercel / 大型 Microsoft 生態 customer 公開 story。

下一步路由

KeyDB

Tue, 16 Jun 2026 00:00:00 +0000

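KeyDB is a multi-threaded fork of Redis. It covers three responsibilities: moving Redis command execution from a single thread to multiple threads (not only I/O, command processing also uses multiple cores), providing active-active multi-master replication (two masters replicate to each other and both accept writes), and staying compatible with the Redis protocol (drop-in replacement). Its design trade-off favors "reuse the Redis ecosystem + squeeze multiple cores out of one instance + multi-master writes", which makes it a middle option when single-threaded Redis hits a wall but rewriting clients is not acceptable.

For the path "a single key is extremely hot, Redis Cluster cannot split it, and one instance needs multiple threads to carry a single partition", KeyDB is a fork worth evaluating. Snap running KeyDB on GCP is the largest public adopter of this approach, but note that the main driver in that case was cross-cloud latency governance in a multi-cloud architecture (placing the cache in the same cloud as the application); KeyDB's multi-threaded single-instance throughput was a side benefit, not Snap's primary reason for adopting it.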

本章目標

讀完本章後、你應該能：

跑起 KeyDB、用 redis-cli 驗證 protocol 相容
評估 multi-threaded 命令執行跟 Redis I/O threads 的差異
判斷 active-active 多主複製適用與衝突風險
評估 KeyDB on FLASH 對大 dataset 的成本意義
區分 KeyDB 跟 DragonflyDB / Redis Cluster 的選用判讀，並評估 Snap 收購後的治理風險

最短路徑：5 分鐘把 KeyDB 跑起來

# 1. Start KeyDB (--server-threads enables multiple threads; command execution also runs on multiple cores)
docker run -d --name keydb -p 6379:6379 \
  eqalpha/keydb keydb-server --server-threads 4

# 2. Verify with redis-cli (KeyDB keeps Redis protocol compatibility)
redis-cli SET foo bar    # → OK
redis-cli GET foo        # → bar

# 3. Check the version (KeyDB reports redis_version; clients use it to judge compatibility)
redis-cli INFO server | grep -E "redis_version|redis_mode"
# redis_version:6.3.4    ← KeyDB's versioning scheme; client libraries negotiate compatibility from it
# redis_mode:standalone

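Verified on a live run of the eqalpha/keydb image, last checked 2026-06-16; --server-threads is a startup parameter (it is not exposed through CONFIG GET, and changing it requires a restart). For multi-master replication, see the advanced topics.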

日常操作與決策形狀

CLI 與 client API

子議題：

直接用 redis-cli / 所有 Redis client library（KeyDB 維持 Redis protocol）
--server-threads N 設命令執行的執行緒數、對齊 CPU 核數
INFO server 確認 redis_version（KeyDB 的版本對應 Redis 哪個 base）

Multi-threaded 命令執行

KeyDB 跟 Redis I/O threads 的差異是核心賣點。子議題：

Redis 6+ 的 I/O threads 只分擔 socket 讀寫、命令仍在 main thread；KeyDB 連命令執行都多執行緒
--server-threads 對齊核數、單實例吞吐隨核數擴展
多執行緒下單 key 的並發保護由 KeyDB 內部處理、application 端語意不變

Active-active 多主複製

子議題：

兩個（含以上）KeyDB master 互相複製、都可接受寫入
衝突解決用 last-write-wins（依時間戳）、不是強一致
適合跨 AZ / 跨 region 的讀寫就近、但要接受最終一致與衝突覆蓋風險

進階主題（按需閱讀）

Active-active 多主複製

子議題：

replicaof + active-replica yes 開雙向複製
衝突語意：同 key 並發寫入、last-write-wins、可能丟其中一側的寫入
適用：跨區讀寫就近、可容忍最終一致的 cache；不適用：需要強一致的 counter / lock
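The last-write-wins semantics can be illustrated with a tiny merge function. This is a conceptual model, not KeyDB's replication code: the `Write` record, the millisecond timestamps, and the tie-break on origin name are assumptions chosen so that both sides converge on the same value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    timestamp_ms: int
    origin: str  # which master accepted the write


def lww_merge(local: Write, remote: Write) -> Write:
    """Keep the write with the larger timestamp.

    Ties are broken by origin name (an assumption of this sketch) so that
    both masters pick the same winner regardless of argument order.
    """
    return max(local, remote, key=lambda w: (w.timestamp_ms, w.origin))


if __name__ == "__main__":
    # Two masters accept a write to the same key one millisecond apart.
    a = Write("stock=9", timestamp_ms=1000, origin="master-a")
    b = Write("stock=8", timestamp_ms=1001, origin="master-b")
    # Both sides converge on b; the write accepted by master-a is discarded.
    print(lww_merge(a, b).value, lww_merge(b, a).value)
```

The example shows why counters and locks are a poor fit: both writes were acknowledged to their clients, yet only one survives the merge.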

KeyDB on FLASH

子議題：

把冷資料放 SSD、熱資料留記憶體、降低大 dataset 的記憶體成本
對應 Meta CacheLib + Kangaroo 的 DRAM + flash 分層思路
代價：FLASH 路徑延遲高於純記憶體、適合冷熱分明的 workload

跟 DragonflyDB / Garnet 的對比

子議題：

KeyDB：Redis fork（沿用 Redis code base、相容度高、base 版本較舊）
DragonflyDB：C++ 從零重寫（架構更激進、shared-nothing、相容核心但非 fork）
Garnet（Microsoft）：研究型高吞吐 store、生態淺
對應 DragonflyDB 多核架構 deep article 的 fork vs 重寫光譜

治理風險（Snap 收購後）

子議題：

KeyDB 公司 2022 年被 Snap 收購、開源版本的後續投入與 roadmap 不確定
評估採用前確認專案活躍度（commit 頻率、release cadence）
對長期依賴敏感的場景、Redis fork 光譜上的 Valkey（Linux Foundation 治理）治理更穩

排錯快速判讀

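Throughput does not improve with multiple threads

Rule of thumb: first confirm that --server-threads matches the number of CPU cores, then check whether the workload is actually CPU-bound. Diagnosis: fewer threads than cores means the cores are not fully used; an extremely hot single key is still bounded by a single partition.

Active-active conflicts lose data

Rule of thumb: under last-write-wins, concurrent writes to the same key overwrite each other. Diagnosis: frequent cross-region writes to the same key call for a design change (partition keys across different masters) or a strongly consistent store.

Protocol compatibility problems

Rule of thumb: KeyDB's base version is older (redis_version 6.x), so commands introduced in Redis 7+ are not supported. Diagnosis: confirm the base version with INFO server and compare it against the commands the application uses.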

何時改走其他服務

需求形狀	改走
要最新 Redis 功能 / 治理穩定	Valkey（Linux Foundation、跟上 Redis）
更激進的多核 / 記憶體效率	DragonflyDB（重寫、shared-nothing）
需要 Redis Cluster sharding	Redis / Valkey Cluster
純 KV、極簡運維	Memcached
AWS managed	AWS ElastiCache（無 managed KeyDB）
需要強一致 + durability	AWS MemoryDB

不在本頁內的主題

KeyDB 完整 command reference（沿用 Redis、查 redis.io/commands）
各語言 client API（用 Redis client 即可）
KeyDB on FLASH 詳細調參
Active-replication 內部複製協定細節

案例回寫

直接相關案例

案例	對 KeyDB 的對應
9.C35 Snap KeyDB cross-cloud	Snap 在 GCP 部署 KeyDB cluster、主因是 multi-cloud 的 cross-cloud latency 治理（cache 與 application 共置同 cloud）；9.C35 另記 KeyDB multi-threaded「單實例 throughput 提升 5-10x」（通則、依 workload）

待補 KeyDB-specific 案例：Snap 收購後的公開技術分享、KeyDB on FLASH 的 production 成本案例、active-active 多主複製的跨區衝突治理實例。

跨 vendor 對照

案例	對 KeyDB 的對應
2.C4 Meta CacheLib + Kangaroo	KeyDB on FLASH 對應 DRAM + flash 分層的成本決策
2.C9 Cache Stampede	TTL jitter / singleflight 通用、KeyDB 多執行緒不消除 stampede 風險
2.C10 規模對照	KeyDB 是「單實例多核撐大」的選項、介於 Redis Cluster 與 DragonflyDB 之間

下一步路由

deep article：KeyDB active-active 多主複製（last-write-wins 衝突與跨區寫入）
上游概念：2.6 high concurrency（單執行緒邊界的四個選項）
平行 vendor：DragonflyDB、Valkey、Redis
下游能力：2.7 cache copy boundary（跨區資料引力）
回退路徑：KeyDB → Redis/Valkey

Azure Key Vault

Mon, 18 May 2026 00:00:00 +0000

Azure Key Vault 是 Azure 平台把 secret、cryptographic key、X.509 certificate 三類資產 合進同一個 service 的設計。Vault instance 本身是 first-class ARM resource、有 FQDN endpoint（https://.vault.azure.net）、跟 Azure RBAC 跟 Entra ID Managed Identity 深度整合 — 每個 Vault 自己一個邊界、區別於 region-wide service 的模型。

服務定位

Azure Key Vault 的核心定位是 三合一 secret + key + cert service 加 Azure-native secret-less 取用。AWS 是 Secrets Manager + KMS + ACM 三個獨立 service、職責邊界清楚但要管三套權限；GCP 是 Google Secret Manager + Cloud KMS + Certificate Authority Service 三個獨立；Azure 把這三件事合在 Key Vault — 同一 RBAC role 可同時管 secret / key / cert、減少 IAM 維護成本、但治理上需要在 Vault 內用 naming convention + 多 Vault instance 自己劃分敏感度邊界（例：production secret / cert 分開不同 Vault、admin access 分人）。

跟 HashiCorp Vault 相比、Azure Key Vault 是 Azure-only 的 static-focused 服務 — 沒有 dynamic credential engine、沒有 transit encryption-as-a-service、沒有跨雲統一介面。優勢是 零運維 + Managed Identity 取用免 client secret + Premium tier 直接 HSM-backed。Azure-heavy + 一站式 secret/key/cert + secret-less workload 取用是 Key Vault 的甜蜜點。

本章目標

讀完本頁、讀者能判斷：

哪些 secret / key / cert 適合放 Key Vault、哪些該走 Managed HSM（FIPS 140-2 Level 3 需求）
Access Policy 跟 Azure RBAC 兩種授權模型的差異與 migration 路徑
Soft Delete + Purge Protection 的 防誤刪 與 防勒索 邊界
何時用 Key Vault、何時改走 HashiCorp Vault（跨雲 + dynamic credential）的取捨

最短判讀路徑

判斷 Azure Key Vault deployment 是否健康、最少看四件事：

誰能 access：Vault 用 Access Policy 還是 Azure RBAC、是否還有 legacy Access Policy 沒清掉、Managed Identity 的 role assignment 是否最小化（Key Vault Secrets User 而非 Key Vault Administrator）
RBAC vs Access Policy 模型：production 應該全走 Azure RBAC（跟 Azure RBAC vendor 同套）、舊 Access Policy 是 migration backlog、不可長期兩軌並存
Soft Delete + Purge Protection：兩個都應開、Soft Delete 90 天 retention、Purge Protection 開了之後連 owner 都不能立即 purge — 防誤刪 + 防 ransomware 一次性刪光
Diagnostic Logs：Key Vault 預設不記操作 log、必須手動配 Diagnostic Setting 推 Log Analytics / Event Hub / Storage — 沒這層 KeyVaultGet / SecretGet 都沒 audit trail

四件事任一缺失、就是 Audit Log 與 Secret Management 邊界的待補項目。
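The four checks can be expressed as a small audit function over a vault description. This is a sketch, assuming the vault has already been exported as a plain dict: the keys under `properties` follow the ARM property names (`enableRbacAuthorization`, `accessPolicies`, `enableSoftDelete`, `enablePurgeProtection`), while `role_assignments` and `diagnostic_destinations` are hypothetical fields that would have to be joined in from separate queries.

```python
ADMIN_ROLES = {"Key Vault Administrator", "Owner", "Contributor"}


def audit_vault(vault):
    """Return a list of findings for one vault description (a plain dict)."""
    props = vault.get("properties", {})
    findings = []

    # 1 + 2. Who can access, and under which authorization model.
    if not props.get("enableRbacAuthorization", False):
        findings.append("authorization model is still Access Policy")
    elif props.get("accessPolicies"):
        findings.append("leftover Access Policy entries")
    for assignment in vault.get("role_assignments", []):  # hypothetical field
        if (
            assignment["principal_type"] == "ManagedIdentity"
            and assignment["role"] in ADMIN_ROLES
        ):
            findings.append(f"managed identity holds {assignment['role']}")

    # 3. Soft Delete and Purge Protection should both be on.
    if not props.get("enableSoftDelete", False):
        findings.append("Soft Delete is off")
    if not props.get("enablePurgeProtection", False):
        findings.append("Purge Protection is off")

    # 4. Diagnostic logs must be routed to at least one destination.
    if not vault.get("diagnostic_destinations"):  # hypothetical field
        findings.append("no Diagnostic Setting destination")

    return findings


example_vault = {
    "properties": {
        "enableRbacAuthorization": True,
        "accessPolicies": [{"objectId": "00000000-0000-0000-0000-000000000000"}],
        "enableSoftDelete": True,
        "enablePurgeProtection": False,
    },
    "role_assignments": [
        {"principal_type": "ManagedIdentity", "role": "Key Vault Administrator"}
    ],
    "diagnostic_destinations": [],
}

for finding in audit_vault(example_vault):
    print(finding)
```

An empty result means none of the four gaps was detected in the exported data; it does not replace reviewing the actual role assignments.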

日常操作與決策形狀

Vault Standard vs Premium：Standard 用 software protection（key 存在 Microsoft-managed software boundary）、Premium 用 FIPS 140-2 Level 2 HSM-backed key、key material 在 HSM 內、不可 export。Premium 適合 signing key / wrapping key 等高敏 key、Standard 適合 application secret + 常規 envelope encryption key。要 FIPS 140-2 Level 3、Standard 跟 Premium 都不夠、必須用 Managed HSM。

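Access Policy vs Azure RBAC (two authorization models): Access Policy is the legacy Key Vault model. It attaches a capability table to the vault object (fine-grained permissions such as Get / List / Set / Delete / Encrypt / Sign) and is independent of the Azure RBAC system. Azure RBAC is the newer model: it uses Azure built-in roles (Key Vault Secrets User / Key Vault Crypto User / Key Vault Administrator) under Entra ID's unified identity governance. The model is a per-vault switch, so once RBAC authorization is enabled the vault's Access Policy entries are no longer evaluated. Production should run entirely on RBAC, and Access Policy on older vaults is migration backlog. The permission gap appears during a long dual-track period: vaults that have not been switched still grant whatever their Access Policy allows even though RBAC would deny it, and leftover entries take effect again if a vault is switched back.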

Managed Identity 取用（secret-less）：Azure VM / Function / App Service / AKS pod 走 Managed Identity 直接呼叫 Key Vault API — 不需要存 client secret 或 cert。Workload 拿 IMDS token、token 帶 Entra ID identity、Key Vault 端用 RBAC role assignment 驗證 — 這是 Azure-native 的 secret-less 取用模式、跟 AWS IAM Role for Service Account / GCP Workload Identity 同類設計。production 應該 只允許 Managed Identity 取用、禁用 service principal + client secret。

Secret rotation（手動 / event-driven）：Key Vault Secret 沒有像 AWS Secrets Manager 內建的 rotation Lambda。Rotation 走兩條路：手動更新 secret version（app 端拉新版）、或 Event Grid 通知 secret 過期 + Azure Function 觸發 rotation。後者需要自己寫 rotation logic、Key Vault 只提供 版本管理 跟 過期通知、不負責執行 rotation。

Key Rotation Policy：Key（不是 Secret）有 native Rotation Policy — Vault 在 key 到期前自動生成新版、舊版保留可解密但不再 encrypt。policy 設 rotationPeriod + notifyBeforeExpiry、Key Vault 自動跑、不需要外部觸發。Secret 沒這功能、Key 才有。

Certificate auto-renewal：Certificate object 可整合 Issuer（DigiCert / GlobalSign / 自簽）做 auto-issue + auto-renew — Key Vault 在到期前自動跑 CSR、向 Issuer 申請新 cert、寫回同一個 Certificate object（保留歷史版本）。比起手動跑 OpenSSL + 寫進 AWS ACM、Certificate object 的優勢是 Issuer 在 Vault 端統一治理 — 不過只支援整合過的 public CA。

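Soft Delete + Purge Protection: Soft Delete is on by default (mandatory for new vaults since 2020). A deleted object is kept for the retention period (90 days by default, configurable from 7 to 90) and can be restored with Recover. Purge Protection is a separate switch: once enabled, nobody (subscription owner included) can purge during retention, and the object is physically deleted only when retention expires. This is the key anti-ransomware control: without Purge Protection, an attacker who obtains an owner role can delete and purge everything in one pass.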
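The interaction between the two switches can be modeled as a small state function. This is a sketch of the rules described above, not a call into any Azure SDK; the function name and the returned keys are made up for the example.

```python
from datetime import datetime, timedelta, timezone


def soft_delete_state(deleted_at, now, retention_days=90, purge_protection=True):
    """Describe what can still happen to a soft-deleted object at time `now`."""
    expires_at = deleted_at + timedelta(days=retention_days)
    if now >= expires_at:
        # Retention is over: the object is gone and nothing can be recovered.
        return {"recoverable": False, "manual_purge_allowed": False, "expired": True}
    return {
        "recoverable": True,
        # During retention, a manual purge is possible only without Purge Protection.
        "manual_purge_allowed": not purge_protection,
        "expired": False,
    }


deleted_at = datetime(2026, 1, 1, tzinfo=timezone.utc)
ten_days_later = deleted_at + timedelta(days=10)

print(soft_delete_state(deleted_at, ten_days_later, purge_protection=True))
print(soft_delete_state(deleted_at, ten_days_later, purge_protection=False))
```

The second call is the ransomware case: the object is still inside retention, yet whoever holds the purge permission can remove it for good.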

Private Endpoint：Key Vault 預設是 public endpoint（FQDN 走 internet）。Private Endpoint 把 Vault 拉進 VNet、只走內網存取 — 高敏 Vault 應該關 public access、強制走 Private Endpoint + Firewall rule（IP 白名單）。

核心取捨表

取捨維度	Azure Key Vault	AWS（拆三個）	GCP（拆三個）	HashiCorp Vault
部署模型	Azure managed、三合一	AWS managed、Secrets Manager + KMS + ACM 各獨立	GCP managed、GSM + Cloud KMS + CAS 各獨立	自管或 HCP managed
服務邊界	一個 Vault 內 secret/key/cert 共用 ACL	三個 service 各自 IAM policy、邊界清楚	三個 service 各自 IAM policy	一個 cluster 內 path-based policy
Secret-less 取用	Managed Identity 原生	IAM Role for Service Account / IRSA	Workload Identity Federation	AppRole / K8s / cloud IAM auth
Dynamic credential	無 — 純 static	部分（RDS rotation Lambda）	較弱（依靠 IAM impersonation）	強 — database / cloud / SSH engine
HSM 等級	Standard 軟體 / Premium FIPS 140-2 Level 2 / Managed HSM Level 3	KMS Level 3 / CloudHSM Level 3	Cloud KMS HSM Level 3 / Cloud HSM Level 3	走後端 KMS（AWS / GCP / Azure）
Certificate auto-renew	內建（整合 DigiCert / GlobalSign）	ACM auto-renew、限 AWS-issued	CAS + Public CA 整合	PKI engine 自簽 + cert-manager
跨雲	弱 — Azure-only	弱 — AWS-only	弱 — GCP-only	強 — 跨雲統一介面
適合場景	Azure-heavy + 三合一一站式 + Managed Identity	AWS-heavy + 職責拆分 + RDS 自動 rotation	GCP-heavy + Workload Identity Federation	跨雲 + dynamic credential + 內部 PKI

選 Azure Key Vault 的核心訴求：Azure-only、需要 secret + key + cert 一站式、workload 走 Managed Identity secret-less 取用、可接受 無 dynamic credential。需要跨雲統一 secret 控制面、或要 dynamic database credential、走 HashiCorp Vault。

進階主題

Managed HSM（dedicated）：Managed HSM 是 dedicated single-tenant HSM cluster、FIPS 140-2 Level 3、跟 multi-tenant 的 Key Vault Premium 是不同 service。Managed HSM 適合 主權合規（key material 完全自有控制權、Microsoft 也不可存取）、金融 / 醫療 / 政府場景。代價是貴跟 初始化要走 ceremony（多人持有 activation key、Microsoft 不可單方面操作）— 不是 Premium 的簡單升級、是另一條 product line。

Premium tier HSM-backed Key：Premium tier 的 key 有 HSM-protected 屬性、key material 在 multi-tenant HSM 內、API call 還是走標準 Key Vault endpoint、但 cryptographic operation 在 HSM 跑。比 Standard 慢一點、價格高、適合 signing key / wrapping key / root encryption key — 一般 application secret 還是 Standard 即可。

Certificate Issuer 整合：Vault 內可註冊 Issuer（DigiCert / GlobalSign / Entrust）、提供 API credential、Vault 在 Certificate 到期前自動跑 CSR、向 Issuer 申請、Issuer 簽完寫回 Vault。Self-signed / Unknown Issuer 也支援、後者表示 Vault 產 CSR、人或 pipeline 拿去外部 CA 簽完再 import 回 Vault。

Cross-tenant key access（federated identity）：Key Vault 可允許跨 Entra ID tenant 的 service principal 取用 — 透過 Federated Identity Credential（Workload Identity Federation）、外部 tenant 的 identity（甚至 GitHub Actions OIDC、AWS workload）拿 token 來 Key Vault 驗證。這是 cross-cloud workload 拉 Azure secret 的方式、不需要存 Azure service principal credential。

跟 Entra ID Conditional Access 整合：Key Vault 用 Azure RBAC 模型時、可走 Conditional Access policy — 特定 IP、已 enrolled 裝置、MFA 已驗證 才能取用 secret / key。production 高敏 Vault 應該疊 Conditional Access、避免單純 RBAC 在 token leak 時就直接被存取。

排錯與失敗快速判讀

Diagnostic Setting 沒開：production Vault 啟用後忘了配 Diagnostic Setting 推 log、事故發生時無 SecretGet / KeyDecrypt 紀錄 — 啟動 checklist 必含「Diagnostic Setting → Log Analytics」、Azure Policy 強制全 subscription Vault 都配
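Access Policy and RBAC running in parallel: partway through a migration, some vaults are on RBAC while others still honor old Access Policy entries, so access that RBAC would deny is still allowed. Cut over each vault in one step: run az keyvault update --enable-rbac-authorization true, then clear all Access Policy entries so nothing is restored if the vault is switched back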
Soft Delete 沒開 / Purge Protection 沒開：誤刪 secret 救不回、或 attacker 拿到 owner role 一次 purge 清光 — 新 Vault 兩個都強制開、Azure Policy 阻擋 enablePurgeProtection: false 的 Vault 建立
Managed Identity role 過寬：給 workload identity Key Vault Administrator 而非 Key Vault Secrets User — workload 拿到 admin role 等於可改 ACL — role assignment 走 least privilege built-in role
Premium key 跑非 HSM operation：Premium key 配錯 attribute、key 變成 software-protected 而非 HSM-protected — 建 key 時明示 --protection hsm、CI 驗證 key attribute
Certificate auto-renew Issuer credential 過期：Vault 內 DigiCert API credential 過期、auto-renew 默默失敗、cert 到期前才發現 — Issuer credential 也要 rotation + monitor
Public access 開著：Vault 沒關 public endpoint、secret 暴露在 internet（雖然有 RBAC、但 attack surface 多一層）— 高敏 Vault 強制 Private Endpoint + Firewall rule

何時改走其他服務

需求形狀	改走
跨雲統一 secret 控制面	HashiCorp Vault
Dynamic database / cloud credential	HashiCorp Vault（database / cloud secret engine）
FIPS 140-2 Level 3 HSM	Managed HSM / CloudHSM
內部 PKI workload mTLS	cert-manager + Vault PKI / SPIRE
公開 web cert 自動更新（非 Azure-issued）	Let’s Encrypt + cert-manager
Entra ID 身份治理 / Conditional Access	Azure RBAC
Secret rotation 證據鏈	7.5 Credential Rotation Scoped Evidence

不在本頁內的主題

Key Vault REST API / Azure CLI 完整 reference
Managed HSM activation ceremony 完整步驟
Bicep / Terraform 配置 Key Vault 的完整 IaC 範例
Certificate Issuer（DigiCert / GlobalSign）的合約與計價細節
每個 Entra ID role 的細粒度 permission map

案例回寫

案例	跟 Azure Key Vault 的關係
Azure AD Identity Control Plane 2021	Key Vault 是身份控制面下游、Entra ID 出事時 Managed Identity 取 Vault 也失敗 — 需要 fallback access plan（emergency Access Policy + separate identity 走 break-glass）
Microsoft Storm-0558 Signing Key 2023	Key Vault Premium / Managed HSM 把 signing key 鎖硬體、key 不離保護邊界、跟 HSM-bound 同 mindset — signing key 必上 Premium 或 Managed HSM、不放 Standard
Microsoft Storm-0558 Signing Key Chain (red-team)	Asymmetric Key + Diagnostic Logs 是「誰用 key」的稽核基礎 — production Vault 必開 Diagnostic Setting 推 SIEM、不然 key 被誰用過完全沒紀錄
Failure: Credential Rotation Without Scope	Key Vault Secret 跨 service 共用時 rotation 要分域 — Vault 端用 Event Grid 通知 + app 端訂閱 rotation event、不能一次 push 全域更新

下一步路由

上游：7.6 秘密管理與機器憑證治理、7.5 傳輸信任與憑證生命週期（Key Vault Certificate + Managed HSM 為 TLS / signing key 的 root custodian）、7.2 身分與授權邊界
平行（secret store）：AWS Secrets Manager、Google Secret Manager、HashiCorp Vault
平行（KMS-class）：AWS KMS、Google Cloud KMS、CloudHSM（Key Vault 是跨類 vendor、同時是 secret store 跟 key management）
下游：Azure RBAC（Managed Identity + RBAC 取用模型）
下游：cert-manager（K8s workload cert 自動化、可整合 Key Vault Certificate）
跨模組：8 事故處理 vendor 清單（Key Vault 事件如何 routing 進 IR 流程）
官方：Azure Key Vault Documentation

Dependabot

Mon, 18 May 2026 00:00:00 +0000

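Dependabot is GitHub's built-in dependency update automation. It began as Dependabot Inc., was acquired by GitHub in 2019, and became a native GitHub feature. Alerts, Security Updates, and Version Updates are free for both public and private repos; advanced governance (Dependency Review, org-wide Security Overview) belongs to the GitHub Advanced Security package. It does three things: Dependabot version updates (scheduled PRs that upgrade dependencies to the latest compatible version), Dependabot security updates (CVE-triggered urgent PRs that upgrade to the fix version), and Dependabot alerts (vulnerabilities listed in the Security tab, not necessarily with an automatic PR). Its design goal is narrow and deep: it only automates dependency PRs for GitHub repos, does no container scanning, does no IaC scanning, and does not span SCMs.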

服務定位

Dependabot 的核心定位是 把依賴升級從人工 ritual 變成 PR review 工作流。它把「找新版」「跑 manifest update」「開 PR」「附 release note」自動化、剩下的 是否合併 留給人類 / CI 判斷。這跟 Snyk 看似重疊 — 兩者都會自動發升級 PR — 但 Snyk 是 跨 SCM + 多 stack（GitHub / GitLab / Bitbucket、SCA + 容器 + IaC + Code）、Dependabot 是 GitHub-only + 純依賴。多數組織選一個、混用兩者會在同一個 manifest 上各自開 PR、造成 noise。

The relationship with GHAS is subtler. Dependabot Alerts, Security Updates, and Version Updates are all free and do not require a GHAS license. What GHAS adds is Dependency Review (a PR-time gate that blocks PRs introducing newly vulnerable dependencies), Security Overview (an org-wide dashboard), and enterprise-level controls. Dependabot is the background PR factory and GHAS Dependency Review is the PR-time blocker; the two complement each other without overlapping.

跟 Renovate（Mend 維護的 OSS）的差異：Renovate 配置更彈性、跨 SCM、支援 ecosystem 數量多（含 Helm chart、Docker tag、ArgoCD 等）、Grouped Updates 規則更細；Dependabot 整合 GitHub 原生 UI（Security tab、Dependency graph、PR diff）更深、設定簡單。需要 跨 SCM 或 Helm / ArgoCD / 自訂 ecosystem 走 Renovate；單純 GitHub-only 加 npm / Maven / pip 等主流 ecosystem、Dependabot 配置成本更低。

本章目標

讀完本頁、讀者能判斷：

Dependabot 在 supply chain 防護裡承擔哪一段（背景 PR 升級）、哪些不在它責任內（容器掃描、IaC 掃描、PR-time gate）
dependabot.yml 的關鍵配置面：ecosystem、schedule、open-pull-requests-limit、groups、reviewers
Version Update vs Security Update vs Alerts 三個功能何時開、PR noise 怎麼控制
Auto-merge 政策的邊界：哪種更新可以全自動、哪種要保留 human approval

最短判讀路徑

判斷一個 repo 的 Dependabot 配置是否健康、最少看四件事：

dependabot.yml 配置：repo 是否有 .github/dependabot.yml、ecosystem 是否覆蓋所有 manifest（npm / Maven / pip / Docker / GitHub Actions / Terraform）、directory 路徑對不對（monorepo 各 sub-package 是否獨立配置）
Update Schedule：schedule.interval 是 daily / weekly / monthly、open-pull-requests-limit 是否合理（預設 5、太低會卡住 backlog、太高會 PR noise）、Grouped Updates 是否啟用（減少 minor / patch PR 數量）
Auto-merge 政策：branch protection 是否設「CI green + required reviewer」、auto-merge 是否限定 patch + minor 自動、major 強制 human review、production 跟 staging branch 是否有差異化規則
Token 治理：repo secrets 是否被 Dependabot PR 誤用、Dependabot secrets（私有 registry credential）是否獨立配置、PR 觸發的 Actions 是否假設 read-only token

四件事任一缺失、就是 Supply Chain Integrity 邊界的待補項目。

日常操作與決策形狀

dependabot.yml 是版控的配置檔：放在 .github/dependabot.yml、跟 manifest 同 repo、所有變更走 PR review。不在 GitHub UI 直接改 — UI 只能 啟用 / 停用 Dependabot 本身、細節必須 commit 進 repo。Monorepo 結構（例：/services/api、/services/web 各自 package.json）每個 sub-package 寫一個 entry、directory 指到 sub-package 根目錄、package-ecosystem 標 manifest 類型。schedule.interval 一般 weekly 開始、daily 適合高活躍度團隊但 PR noise 高、monthly 適合穩定 lib 但 CVE 延遲風險高。
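The one-entry-per-sub-package rule can be sketched as a generator that turns a list of manifest paths into the structure of dependabot.yml. The keys in the output (`version`, `updates`, `package-ecosystem`, `directory`, `schedule.interval`, `open-pull-requests-limit`) are the ones named on this page; the manifest-to-ecosystem table is a deliberately small assumption, and the result is printed as JSON here rather than written to .github/dependabot.yml.

```python
import json

# Manifest file name -> package-ecosystem (small illustrative subset).
MANIFEST_ECOSYSTEM = {
    "package.json": "npm",
    "pom.xml": "maven",
    "requirements.txt": "pip",
    "Dockerfile": "docker",
}


def build_dependabot_config(manifest_paths, interval="weekly", pr_limit=5):
    """Build one `updates` entry per (ecosystem, directory) pair."""
    entries = {}
    for path in manifest_paths:
        directory, _, filename = path.rpartition("/")
        if directory == "/.github/workflows" and filename.endswith((".yml", ".yaml")):
            # Workflow files are covered by a single entry rooted at "/".
            key = ("github-actions", "/")
        elif filename in MANIFEST_ECOSYSTEM:
            key = (MANIFEST_ECOSYSTEM[filename], directory or "/")
        else:
            continue  # not a manifest this sketch knows about
        entries[key] = {
            "package-ecosystem": key[0],
            "directory": key[1],
            "schedule": {"interval": interval},
            "open-pull-requests-limit": pr_limit,
        }
    return {"version": 2, "updates": [entries[k] for k in sorted(entries)]}


config = build_dependabot_config(
    [
        "/package.json",
        "/services/api/package.json",
        "/services/web/package.json",
        "/services/api/Dockerfile",
        "/.github/workflows/ci.yml",
        "/README.md",
    ]
)
print(json.dumps(config, indent=2))
```

Comparing the generated entries with the committed file is one way to catch a sub-package that was added without a matching entry.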

Version Update vs Security Update 分開：Version Update 是 定期掃 manifest 看有沒有 newer compatible 版本、不分 CVE、是 hygiene 工作；Security Update 是 Dependabot 偵測到 CVE 且 manifest 指到 vulnerable 範圍時自動發 PR 升級到 fix version、是 incident 工作。多數組織開 Security Update 全 repo + 選擇性開 Version Update（核心服務開、archived repo 不開）— 避免 PR noise 淹沒緊急 PR。Security Update 預設啟用、Version Update 要 explicit 在 dependabot.yml 寫 entry 才會跑。

Grouped Updates：2023 推出、單一 PR 含多個 minor / patch 升級（例：一個 PR 升 10 個 npm package）、PR 數量從 10 個降到 1 個。配置在 dependabot.yml 的 groups 區、可以按 dependency name pattern（例：@types/* 一組、eslint* 一組）或 update-type（patch / minor 分組）。Major version 仍分開 PR — 因 breaking change 風險、需要單獨 review。Grouped Updates 配 auto-merge 是 minor / patch 全自動 的標準配置。
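To see how grouping changes the PR count, the sketch below applies pattern and update-type rules to a list of pending updates. It models the idea only and is not Dependabot's matching engine: the `GROUPS` table uses Python-friendly key names, the group names are hypothetical, and glob matching is done with the standard `fnmatch` module.

```python
from fnmatch import fnmatchcase

# Hypothetical groups: dependency name patterns plus the update types they accept.
GROUPS = {
    "types": {"patterns": ["@types/*"], "update_types": {"minor", "patch"}},
    "eslint": {"patterns": ["eslint*"], "update_types": {"minor", "patch"}},
}


def plan_pull_requests(updates, groups=GROUPS):
    """Split (dependency, update_type) pairs into grouped and individual PRs."""
    grouped = {}
    individual = []
    for name, update_type in updates:
        for group_name, rule in groups.items():
            matches_name = any(fnmatchcase(name, p) for p in rule["patterns"])
            if matches_name and update_type in rule["update_types"]:
                grouped.setdefault(group_name, []).append(name)
                break
        else:
            # No group accepted this update, so it gets its own PR.
            individual.append(name)
    return grouped, individual


pending = [
    ("@types/node", "minor"),
    ("@types/react", "patch"),
    ("eslint", "minor"),
    ("eslint-plugin-import", "patch"),
    ("@types/jest", "major"),
    ("react", "major"),
]

grouped, individual = plan_pull_requests(pending)
print(grouped)
print(individual)
print(len(grouped) + len(individual), "PRs instead of", len(pending))
```

Major updates fall through every group in this configuration and stay as separate PRs, which matches the review policy described above.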

Auto-merge 是 PR 級、不是 commit 級：Dependabot 發 PR、搭配 GitHub branch protection 設「CI green + 1 approver」就 auto-merge — GitHub gh pr merge --auto 或 Actions workflow（peter-evans/enable-pull-request-automerge）都行。production 環境應該保留 human approval（至少對 major version）、staging / dev 可以全自動。常見模式：staging branch 全自動合（patch + minor）+ 自動 deploy；production branch 走 staging → cherry-pick / promote 流程、human approve。

Reviewer / Assignee / Label 自動標記：dependabot.yml 的 reviewers / assignees / labels 欄位讓 Dependabot 開 PR 時自動標 reviewer 跟 label。實務上配 labels: ["dependencies"] 讓 Dependabot PR 在 PR list 跟一般 feature PR 分開、CI workflow 可以針對 dependencies label 跑特化 lint（例：跑完整 e2e、不只 unit test）。

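Token governance: when a Dependabot PR runs GitHub Actions, secrets.GITHUB_TOKEN is read-only by default (a GitHub design limit against PR-triggered supply chain attacks), so jobs that need write permission (pushing an image, updating a status, commenting) fail unless the workflow grants the narrow scopes it needs through the permissions key. The alternative is the pull_request_target event (workflow taken from the base branch, with full secrets), but that is a high-risk supply chain attack surface and must run with the least permission possible. Private registry credentials (npm private registry token, Maven private repo password) are configured as Dependabot secrets (org / repo level). These live in a different namespace from GitHub Actions secrets, and neither side can read the other's.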

跟 GHAS Dependency Review 搭配：GHAS Dependency Review 在 PR-time 看 manifest diff 阻擋 引入新漏洞依賴、Dependabot Security Update 在 background 升級舊有漏洞依賴、兩個方向互補。production repo 標準配置：GHAS Dependency Review 設 high severity block + Dependabot Security Update 全開 + Dependabot Version Update 選擇性開。

核心取捨表

取捨維度	Dependabot	Snyk	Renovate
SCM 範圍	GitHub only	GitHub / GitLab / Bitbucket / Azure DevOps	GitHub / GitLab / Bitbucket / Azure DevOps / Gitea
涵蓋面	純依賴（SCA）	SCA + 容器 + IaC + Code	純依賴（SCA）+ Docker tag / Helm / 自訂
Ecosystem 數量	主流（npm / Maven / pip / Docker / Actions / Terraform 等 20+）	主流相近 + 商業資料庫優先	多（含 Helm / ArgoCD / preCommit / 自訂 regex）
Grouped Updates	有（2023+、按 pattern / update-type）	有（按 type）	有（規則最細、按 manager / depType / pattern）
Auto-merge	走 GitHub branch protection + auto-merge	Snyk 自家 PR + 走 SCM auto-merge	內建 `automerge` 配置、規則細
漏洞資料庫	GitHub Advisory Database（公開 + 私有）	Snyk Intel（商業、揭露快、加入專屬 advisory）	OSV / NVD / GitHub Advisory（聚合）
PR 整合深度	GitHub Security tab / Dependency graph 原生	Snyk UI 為主、SCM PR 是延伸	SCM PR 原生、Renovate dashboard issue 集中管理
設定方式	`dependabot.yml`（簡單）	UI + `.snyk` policy file（漏洞例外）	`renovate.json`（極彈性、配置複雜）
商業成本	GitHub 免費（Version Update / Security Update / Alerts 都免費）	商業授權（含免費 tier、規模上來付費）	OSS 免費、Mend 商業版加分析 dashboard
適合場景	GitHub-only + 純依賴 + 設定要簡單	跨 SCM、要容器 / IaC、商業 advisory 加值	跨 SCM 或要 Helm / ArgoCD / 自訂 ecosystem

選 Dependabot 的核心訴求：GitHub-only + 只要依賴 PR 自動化、不要容器 / IaC scan、配置成本要低、整合 GitHub Security tab。要跨 SCM 或多 stack 走 Snyk、要彈性 ecosystem / Helm chart / ArgoCD 走 Renovate。混用 Dependabot + Snyk 對同一 manifest 自動 PR 會 noise、二選一。

進階主題

Multi-ecosystem repo：一個 repo 同時有 npm + Docker + Terraform + GitHub Actions、dependabot.yml 寫四個 entry、各自 schedule。實務常見配置：application 依賴（npm / pip）weekly、base image（Docker）weekly、IaC（Terraform provider）monthly、GitHub Actions（CI workflow）weekly。Actions ecosystem 要特別注意 — Dependabot 升級 uses: 指向的 action version、可以同時 pin commit hash（防 tag re-publish 攻擊）、但 pin hash 後 release note 看不到 — 取捨 安全 vs 可讀性。

Private registry support：私有 npm registry（GitHub Packages / Artifactory / Nexus）、私有 Maven repo、私有 PyPI mirror、私有 container registry 都要在 dependabot.yml 配置 registries 區、credential 走 Dependabot secrets。Dependabot 從私有 registry 抓 package metadata 跟 release info、否則只能看 public registry、會誤判 internal lib 沒新版。Org-level Dependabot secrets 適合共用 credential、repo-level 適合特殊 credential 隔離。

Self-hosted runner 隔離：Dependabot PR 觸發的 Actions 預設跑在 GitHub-hosted runner、跟 Dependabot 本身的 sandbox 不同。如果 CI 跑在 self-hosted runner（內網資源 / 大 build cache）、Dependabot PR 也會跑在 self-hosted runner — 要確認 runner 不會被 PR 注入的惡意 manifest 攻擊（npm install 跑 postinstall script 是經典攻擊路徑）。Mitigation：Dependabot PR 用 ephemeral runner（每次新 VM）、隔離 build cache、不掛 sensitive volume。

Auto-merge 風險：auto-merge 加速合併、但也放寬 攻擊者升級 dep 攻擊我 的窗口。XZ Backdoor 2024 的攻擊路徑就是攻擊者花兩年取得 upstream maintainer 信任、發 release 帶 backdoor — 如果下游 auto-merge 升級、攻擊就直達 production。Mitigation：major version 永不 auto-merge、critical infra dep（auth / crypto / network 函式庫）pin commit hash + 手動 review、auto-merge 範圍縮到 patch + minor + low-criticality dep。
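The mitigation rules can be written down as an explicit decision function, which makes the policy reviewable instead of implicit in workflow conditions. This is a sketch: the update-type strings follow the `version-update:semver-*` format used elsewhere on this page, while the critical-dependency list and the branch name `production` are assumptions for the example.

```python
# Hypothetical list of dependencies that always need a human reviewer.
CRITICAL_DEPENDENCIES = {"openssl", "xz", "jsonwebtoken"}

AUTO_MERGE_TYPES = {
    "version-update:semver-patch",
    "version-update:semver-minor",
}


def auto_merge_decision(dependency, update_type, target_branch, ci_passed):
    """Return "auto-merge" or the reason a human has to look at the PR."""
    if not ci_passed:
        return "wait: CI not green"
    if update_type == "version-update:semver-major":
        return "human review: major version"
    if dependency in CRITICAL_DEPENDENCIES:
        return "human review: critical dependency"
    if target_branch == "production":
        return "human review: production branch"
    if update_type in AUTO_MERGE_TYPES:
        return "auto-merge"
    return "human review: unknown update type"


print(auto_merge_decision("lodash", "version-update:semver-patch", "staging", True))
print(auto_merge_decision("xz", "version-update:semver-patch", "staging", True))
print(auto_merge_decision("lodash", "version-update:semver-major", "staging", True))
```

The order of the checks matters: a patch release of a critical dependency still goes to a human, which is the case an XZ-style compromise would exploit.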

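Interaction with GitHub Actions: in workflows triggered by a Dependabot PR, GITHUB_TOKEN is read-only by default and Actions secrets are not available (secrets.* resolves only against Dependabot secrets), which stops a script injected through the PR from stealing secrets. To run a job that needs Actions secrets on a Dependabot PR, use the pull_request_target event (the workflow comes from the base branch and has full secrets). That event can end up running the PR's code inside a privileged workflow, so check out the base first and keep execution of PR code to a minimum (do not run the PR's install scripts; run only existing lint).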

排錯與失敗快速判讀

PR noise 淹沒緊急 PR：Version Update 全開 + 沒 Grouped Updates、一週 30+ PR — 啟用 groups 按 pattern 分組（@types/* / eslint* / dev-dependencies）、open-pull-requests-limit 設 5、archived repo 關 Version Update
Security Update 沒發 PR：CVE 公告了但 Dependabot 沒動 — 確認 manifest 真的指到 vulnerable 範圍、dependabot.yml 沒 ignore 該 dependency、Security Updates 在 repo settings 是啟用、Dependency graph 有抓到該 manifest
私有 registry 抓不到：Dependabot 在私有 npm / Maven repo 失敗 — dependabot.yml 配 registries 區、credential 進 Dependabot secrets（不是 Actions secrets）、URL 跟 token 範圍對齊
Auto-merge 不觸發：PR 開了 CI 也綠了但沒合 — 確認 branch protection required check 跟 CI workflow 名稱對齊、gh pr merge --auto 在 PR comment / workflow 有觸發、reviewer count 達標
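Dependabot PR fails in Actions: the PR's workflow reports permission denied. GITHUB_TOKEN is read-only in the Dependabot context; grant the minimal write scope with the workflow permissions key, switch to pull_request_target, or split the job (run the part that pushes or needs secrets on the main-branch event after merge)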
Major version 被 auto-merge：規則沒寫對、major 也自動合進 production — dependabot.yml 的 ignore 加 update-types: ["version-update:semver-major"] 或 auto-merge 條件改 ${{ steps.metadata.outputs.update-type == 'version-update:semver-minor' }}
Monorepo 漏掃：/services/api/package.json 沒掃 — dependabot.yml 每個 sub-package 寫一個 entry、directory 指到正確路徑、不是只在 root 一個 entry
GitHub Actions ecosystem 升級拿掉 commit hash pin：原本 uses: actions/checkout@a12b3c4 被升成 uses: actions/checkout@v5 — Dependabot 會 follow 既有 reference 風格、想要 hash pin 設 dependabot.yml 的 ecosystem-level config 但目前限制較多、實務常另用 pinact 或 Renovate 處理 Actions hash pinning

何時改走其他服務

需求形狀	改走
跨 SCM（GitLab / Bitbucket）	Snyk / Renovate
容器 / IaC scan	Snyk / Trivy
Helm / ArgoCD / 自訂 ecosystem	Renovate
PR-time block 引入新漏洞	GHAS Dependency Review
SAST / Code scanning	GHAS Code Scanning / Snyk Code
SBOM 生成 / 簽章	Syft / Grype（含 Sigstore cosign 整合段落）
Secret scanning	GHAS Secret Scanning / GitGuardian

不在本頁內的主題

dependabot.yml 完整欄位 reference（看 GitHub 官方文件）
GitHub Advisory Database 詳細運作（CVE 來源、curation 流程）
GHAS 其他模組（Code Scanning / Secret Scanning / Dependency Review）細節 — 看 GHAS 頁
Renovate / Snyk 完整配置 — 看各自 vendor 頁
Container base image 升級的 multi-stage Dockerfile 處理

案例回寫

Dependabot 沒有自身 vendor-level case、但在 supply chain case 中是 標準 mitigation 或 風險面：

案例	跟 Dependabot 的關係
Log4Shell CVE-2021-44228	對照啟示 — Dependabot Security Update 在 Log4Shell 期間自動發 log4j-core 升級 PR、auto-merge 必須有 functional + security 雙重 CI verify、不能單看 build pass
GitHub OAuth 2022 Token Supply Chain	對照啟示 — Dependabot 自己用 GitHub token、需確認 Dependabot PR 不能讀 production secrets（GitHub 設計上已 read-only / empty secrets）
CircleCI 2023 Secrets Rotation	對照啟示 — CI 出事時 Dependabot secrets（私有 registry credential）也要 rotate、不是只 rotate Actions secrets
XZ Backdoor 2024	對照啟示 — Dependabot auto-merge 隱含 maintainer trust、攻擊者控制 upstream 後升級 = 自動進 production；major 不 auto-merge + 重要 dep pin commit hash

下一步路由

上游：7.12 供應鏈完整性與 Artifact 信任
平行：GitHub Advanced Security、Snyk
下游：Trivy（容器 scan）、Syft / Grype（SBOM）
跨類：artifact 簽章（Sigstore cosign）見 Syft / Grype 頁的 SBOM attestation 段
跨模組：6 可靠性驗證流程（Dependabot PR 進 release flow 的 gate 設計）、8 事故處理 vendor 清單
官方：Dependabot Documentation

Google Cloud IAM

Mon, 18 May 2026 00:00:00 +0000

Google Cloud IAM 是 GCP 的 cloud resource permission engine、把 誰能對哪個 resource 做什麼 統一成一個模型：Principal + Role + Resource scope 三件事拼成一個 role binding。它跟 Okta 等 IdP 是兩層責任 — Okta 回答「這個人是誰」、Google IAM 回答「這個身份能對 GCP resource 做什麼」。設計上比 AWS IAM 統一、沒有 resource-based policy vs identity-based policy 雙軌、也沒有 SCP / Permission Boundary 多層覆蓋、policy 評估路徑短而可預測。

服務定位

Google Cloud IAM 的核心抽象是 role binding on a resource scope：把 role grant 給 principal、生效範圍是某個 Organization / Folder / Project / 個別 resource、沿 resource hierarchy 向下繼承。同一個 principal 在不同 scope 可以有不同 role、有效權限是所有 binding 的 union。這跟 AWS IAM 的「identity policy + resource policy + SCP + boundary 多層 intersect / union」相比、推理成本低、但也意味著 guardrail 必須走 Organization Policy 這另一個系統 — 不是 IAM grant 的一部分。
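The "union of bindings along the hierarchy" rule is easy to state in code. The sketch below walks from a resource up to the organization and collects every role bound to the principal or to one of its groups; the hierarchy, the resource names, and the principals are hypothetical, and IAM Conditions and Organization Policy are deliberately left out.

```python
# Child -> parent links of a hypothetical resource hierarchy.
HIERARCHY = {
    "projects/shop-prod": "folders/prod",
    "folders/prod": "organizations/example",
    "organizations/example": None,
}

# Role bindings attached at each scope: role -> set of members.
BINDINGS = {
    "organizations/example": {"roles/viewer": {"group:eng@example.com"}},
    "folders/prod": {"roles/logging.viewer": {"group:sre@example.com"}},
    "projects/shop-prod": {"roles/bigquery.dataViewer": {"user:ana@example.com"}},
}


def effective_roles(identities, resource):
    """Union of roles granted to any of `identities` at `resource` or above."""
    roles = set()
    scope = resource
    while scope is not None:
        for role, members in BINDINGS.get(scope, {}).items():
            if members & identities:
                roles.add(role)
        scope = HIERARCHY[scope]  # move one level up; None ends the walk
    return roles


# Ana is a direct member of the eng group, not of sre.
ana = {"user:ana@example.com", "group:eng@example.com"}
print(sorted(effective_roles(ana, "projects/shop-prod")))
print(sorted(effective_roles(ana, "folders/prod")))
```

Because bindings only add, answering "what can this principal do here" is a single upward walk. A guardrail cannot be expressed in this structure, which is why it lives in Organization Policy.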

跟 Azure RBAC 相比、兩者都是 scope-based、都靠 hierarchy 繼承。差異在 Service Account 是 GCP 的 first-class identity：有自己的 email、可被 impersonate、可以 grant role 給它也可以 grant iam.serviceAccountUser 讓人類 act-as 它。Azure 的對應是 Managed Identity、語義接近但 impersonation chain 的表達更隱晦。選 GCP（= 用 Google Cloud IAM）的核心訴求通常是：BigQuery / Vertex AI / GKE workload、想用 Workload Identity Federation 取代 long-lived key、團隊偏好較統一的 policy 模型。

本章目標

讀完本頁、讀者能判斷：

Google Cloud IAM 該承擔哪一段權限（resource access、service-to-service、cross-cloud federation）、哪一段該交給 Okta / IdP
Role 的選擇順序（Predefined > Custom > Basic）與 IAM Conditions 何時補上
Service Account / Workload Identity Federation 的信任邊界、何時不該再發 service account key
何時改走 AWS IAM / Azure RBAC / Organization Policy / VPC Service Controls

最短判讀路徑

判斷一個 GCP project 的 IAM 配置是否健康、最少看五件事：

Principal 級別：誰是 Owner / Editor / Viewer（Basic Role 應該幾乎為空）、Service Account 是否獨立列管、有沒有 user 直接 grant 沒走 group
Role 種類：Predefined Role 是 baseline、Custom Role 收斂 least privilege、Basic Role 視為待修；user-managed Service Account key 是否存在（理想是 0）
Impersonation chain 展平稽核：誰有 iam.serviceAccountTokenCreator / iam.serviceAccountUser 對哪個 SA、間接 chain（A → B → C）展平後 誰最終能 act as 高權限 SA。這是 GCP IAM 最容易漏稽核的一條 — 直接 binding 看 Role、但 lateral movement 走 impersonation chain
IAM Conditions：高敏 resource（prod bucket、KMS key、BigQuery dataset）是否用 condition expression 補 attribute-level 限制（resource name prefix、request time、IP）
Audit Logs: Admin Activity is on by default; check whether Data Access logs are manually enabled on sensitive resources, and whether System Event logs are synced to the SIEM with alerts on role changes and service account key creation

五件事任一缺失、就是 Audit Log 與 Authorization 邊界的待補項目。

日常操作與決策形狀

Role 選擇順序：Predefined Role 是 baseline、覆蓋 80% 場景；Custom Role 用於收斂 least privilege（例如只給 bigquery.dataViewer 的特定子集）；Basic Role（Owner / Editor / Viewer）幾乎不該再用 — Editor 預設帶寫權限到幾乎所有資源類型、Owner 還能改 IAM policy 本身、粒度過粗。Project 建立預設給的 Owner role 是 人類自己 grant 自己、不是無法避免的 baseline。

Principal type：人類用 Google Workspace user / external user，群組走 Google Group（grant 給 group 比 grant 給 user 更穩、離職 lifecycle 由 IdP / HRIS 推 group 變更即可）。Service Account 是 第一級身份、跟 user 同等、有自己的 email（name@project.iam.gserviceaccount.com）、可被 grant role 也可被 impersonate。Workload identity（K8s SA、外部 OIDC subject）是 federation 層、不在 IAM 內直接列管、但 最後仍 impersonate 一個 Service Account 來拿 GCP 權限。

IAM Conditions：在 role binding 上加 attribute-based 條件、補純 RBAC 不足。常見 expression：resource.name.startsWith("projects/_/buckets/prod-")、request.time < timestamp("2026-12-31T00:00:00Z")、resource.type == "storage.googleapis.com/Bucket"。適合 temporary access、resource name 範圍限定、環境隔離；不適合複雜 ABAC 規則（會難以稽核、且 condition 只能用在支援的 resource type 上）。

Service Account impersonation：人類或另一個 Service Account 透過 iam.serviceAccountTokenCreator role 借用目標 SA 的權限、不需要 SA key。impersonation chain 可以串（A 可 impersonate B、B 可 impersonate C）— 這條鏈是 lateral movement 風險、稽核時要展平看 誰最終能 act as 高權限 SA。對應 Failure: Credential Rotation Without Scope 的教訓：rotation 沒分域時、單點 SA compromise 會跨環境擴散。

Workload Identity Federation（WIF）：GCP 接受外部 OIDC / SAML issuer（GitHub Actions、AWS、Azure、自管 K8s OIDC、CircleCI 等）發的 token、在 Workload Identity Pool 設 attribute mapping 後、外部 token 換成 short-lived GCP credential、最後 impersonate 指定 Service Account。是 取代 SA JSON key 的 modern best practice、CI / 跨雲 / 邊緣 workload 都該優先用。Trust 條件要鎖 issuer + audience + subject（例：assertion.repository == "myorg/myrepo"）— 缺一個就可能被同 issuer 下其他 subject 借用，這是 Microsoft Storm-0558 Signing Key Chain 對 external OIDC 信任的提醒：發 token 的 issuer 一旦被攻破、所有信任它的 audience 都跟著受害。
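Why all three of issuer, audience, and subject have to be pinned can be shown by checking token claims against a trust rule. This is a sketch of the matching logic only: the real check happens inside GCP after signature verification, the audience value and the rule layout are hypothetical, and the claim names (`iss`, `aud`, `repository`, `ref`) are the ones a GitHub Actions OIDC token carries.

```python
GITHUB_ISSUER = "https://token.actions.githubusercontent.com"

# Hypothetical trust rule for one Service Account.
TRUST = {
    "issuer": GITHUB_ISSUER,
    "audience": "example-wif-audience",  # placeholder, not a real provider URL
    "repository": "myorg/myrepo",
    "ref": "refs/heads/main",
}


def token_is_trusted(claims, trust=TRUST):
    """True only if issuer, audience, and subject attributes all match."""
    return (
        claims.get("iss") == trust["issuer"]
        and claims.get("aud") == trust["audience"]
        and claims.get("repository") == trust["repository"]
        and claims.get("ref") == trust["ref"]
    )


def issuer_only_check(claims, trust=TRUST):
    """The too-wide variant: any token from the issuer is accepted."""
    return claims.get("iss") == trust["issuer"]


own_repo = {
    "iss": GITHUB_ISSUER,
    "aud": "example-wif-audience",
    "repository": "myorg/myrepo",
    "ref": "refs/heads/main",
}
# Same issuer and audience, but a different repository asks for the same SA.
other_repo = dict(own_repo, repository="someone-else/fork")

print(token_is_trusted(own_repo), token_is_trusted(other_repo))
print(issuer_only_check(other_repo))
```

The issuer-only variant accepts the foreign repository, which is the "borrowed by another subject under the same issuer" failure described above.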

Service Account key（避免）：user-managed JSON key 是 long-lived credential、無 TTL、無 IP 限制、外洩偵測難。應該以 Workload Identity Federation 或 Service Account Impersonation 取代；若必須用、走 Organization Policy iam.disableServiceAccountKeyCreation 預設禁用、例外申請走 ticket、key 進 Secret Management、季度盤點未使用 key 刪除。

Organization Policy（guardrail）：跟 IAM 完全不同層 — 不是 grant、是 限制可以做什麼設定。常用 constraint：iam.disableServiceAccountKeyCreation、iam.allowedPolicyMemberDomains（限制只能 grant 給特定 domain 的 principal）、compute.vmExternalIpAccess（限制 VM external IP）、storage.publicAccessPrevention。Org Policy 在 Organization / Folder / Project 層設定、IAM 即使想 grant 也擋得住。

Audit / handoff：Admin Activity Log 預設開、不能關、保留 400 天免費；Data Access Log 預設關、開了會大量 log（也大量計費）— 對 sensitive resource（KMS key access、BigQuery dataset read、Secret Manager access）應該手動開；System Event Log 補基礎設施事件。三類都接 Cloud Logging sink 推到 SIEM、特別 alert 三件事 — IAM policy 變更、Service Account key 建立 / 上傳、Workload Identity Pool / Provider 變更。

核心取捨表

取捨維度	Google Cloud IAM	AWS IAM	Azure RBAC
Policy 模型	Role binding on resource scope、單軌	Identity policy + resource policy + SCP + boundary	Scope-based、Management Group 階層
表達力	中等、IAM Conditions 補 attribute	最高、policy language 表達 ABAC / 條件 / 否決	中等、Azure Policy 補 ABAC
Guardrail 機制	Organization Policy（獨立系統、constraint）	SCP（policy 同語法、separate plane）	Azure Policy（獨立系統、constraint）
Machine identity	Service Account first-class + WIF	IAM Role + STS AssumeRole + OIDC trust	Managed Identity + Workload Identity Federation
Cross-cloud federation	WIF 接外部 OIDC 是 modern best practice	OIDC trust on IAM Role、表達力強	Federated credentials、近年補齊
學習曲線	較緩、模型統一	陡、policy 評估順序複雜	中等、scope inheritance 直覺
推理 / 稽核成本	低 — binding union、Org Policy 獨立看	高 — 多層 intersect / union、需 policy simulator	中 — scope 繼承明確、policy 分散

選 Google Cloud IAM 的核心訴求：已在 GCP 上、或想用 BigQuery / Vertex AI / GKE、團隊偏好較統一的 policy 模型、跨雲場景靠 WIF 對外發 trust 而不維護多套 key。

進階主題

Workload Identity Federation 的深層應用：除了 GitHub Actions、AWS、Azure 這類常見 issuer、WIF 也支援自管 K8s OIDC issuer（OSS K8s cluster 跑 GKE workload identity 等價物）、SaaS（Snowflake、Terraform Cloud）發的 OIDC token。trust 設定要鎖 issuer URL、audience、subject pattern 三件事 — 任何一個太寬都是同 issuer 下別人借用你 SA 的入口。

Organization Policy 的 dry-run / 例外：constraint 可以先設 dryRun 觀察會擋掉哪些操作再 enforce；例外用 exception folder（特定 folder 不繼承上層 constraint）或 condition（特定 resource pattern 不擋）。直接全 org 一次 enforce 通常會打掉既有 workload、要分階段。

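Limits of IAM Conditions: a condition only works on supported resource types, not across all of GCP. Complex expressions are hard to audit (CEL syntax, not easy to read). A condition cannot deny: it only narrows where a binding applies, unlike a Deny statement in an AWS policy (GCP expresses explicit deny through separate IAM deny policies, not through conditions). Complex ABAC scenarios belong in Organization Policy plus an application-layer authorization boundary, not in IAM Conditions.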

Service Account Impersonation chain 的稽核：列出 有 serviceAccountTokenCreator 的 principal 是基本；展平 chain（A → B → C）需要 graph walk 工具或 Policy Analyzer；高權限 SA（owner-equivalent custom role、跨 project 寫權限）的 impersonation 來源應該是 寫死的少數 admin SA + break-glass、不該開放給 CI / 一般 service。
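Flattening the chain is a reachability problem on a directed graph. The sketch below assumes the direct impersonation grants have already been collected into an edge map; the principals and the map itself are hypothetical, and in practice the input would come from an export of who holds serviceAccountTokenCreator on which SA.

```python
from collections import deque

# principal -> service accounts it can impersonate directly (hypothetical data).
CAN_IMPERSONATE = {
    "user:dev@example.com": {"sa:ci"},
    "sa:ci": {"sa:deployer"},
    "sa:deployer": {"sa:prod-admin"},
    "user:admin@example.com": {"sa:prod-admin"},
}


def reachable_identities(start, edges):
    """Every identity `start` can act as, directly or through a chain (BFS)."""
    seen = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for target in edges.get(current, ()):
            if target not in seen and target != start:
                seen.add(target)
                queue.append(target)
    return seen


def who_can_act_as(target, edges):
    """All principals whose flattened chain ends at `target`."""
    return sorted(p for p in edges if target in reachable_identities(p, edges))


print(who_can_act_as("sa:prod-admin", CAN_IMPERSONATE))
```

A direct-binding review of sa:prod-admin would list only sa:deployer and the admin user. The flattened view also surfaces the developer and the CI account, which are the lateral-movement paths the audit needs to close.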

VPC Service Controls（資料邊界、跟 IAM 互補）：在 IAM 之外加 資料 perimeter — 即使 principal 有 IAM 權限、如果請求不是來自 perimeter 內（VPC、特定 IP、特定 service account），仍然會被擋。適合 BigQuery / GCS / Secret Manager 這類存資料的 service、防 合法 credential 從外部 exfiltrate 資料（Azure AD Identity Control Plane 2021 場景的下游補位：identity 控制面失守時、資料層仍有獨立 perimeter）。

排錯與失敗快速判讀

Basic Role 還在用：Project Owner / Editor 散落、新人 onboard 直接 Editor — 改 group + Predefined Role、Basic Role 改成 break-glass 限定
Service Account key 散落：CI 用 JSON key、key 進 git 或環境變數、無 rotation — 改 WIF（GitHub Actions / GitLab CI 都支援）、Org Policy 禁用 SA key 建立
WIF trust 太寬：只鎖 issuer 沒鎖 subject、同 GitHub org 任何 repo 都能借用 SA — trust 要含 assertion.repository、assertion.ref（main branch only）等 condition
IAM Conditions 越寫越多：condition expression 過度複雜、稽核時沒人讀得懂 — 簡化條件、把複雜規則上移到應用層或 Org Policy
Data Access Logs 沒開：sensitive resource 出事時只有 Admin Activity、看不到 誰讀了什麼 — KMS key、Secret Manager、BigQuery 高敏 dataset 必開 Data Access Log
Impersonation chain 失控：太多人有 serviceAccountTokenCreator 到高權限 SA — 用 Policy Analyzer 展平、收斂到必要 admin + break-glass
Org Policy 沒設：root org 沒有 baseline constraint、新建 project 預設可建 SA key / public IP / public bucket — 至少設 disableServiceAccountKeyCreation + publicAccessPrevention + allowedPolicyMemberDomains

何時改走其他服務

需求形狀	改走
人類身份的 SSO / MFA / lifecycle	Okta / IdP
AWS resource permission	AWS IAM
Azure resource permission	Azure RBAC
跨雲 unified IAM	沒有單一答案 — 各雲 IAM + Workload Identity Federation 對接、或外部 PAM（Teleport / Boundary）
Secret / Service Account key 治理	7.6 秘密管理與機器憑證治理
資料分類 / DLP / 匯出控制	7.4 資料保護與遮罩治理
Workload runtime detection（容器、syscall）	04 + Falco / Cilium Tetragon 類工具

不在本頁內的主題

各 Predefined Role 的完整權限清單與細部 permission 差異
IAM Conditions CEL 語法的完整 spec
Workload Identity Federation 跟特定 issuer（GitHub / AWS / Azure）的逐步設定教學
BigQuery / GCS / KMS 等服務的 service-specific IAM 行為細節
GCP 計費 / SKU 對 Audit Log 開關的影響

案例回寫

案例	跟 Google Cloud IAM 的關係
Azure AD Identity Control Plane 2021	Identity 控制面故障不直接打到 Google IAM、但設計啟示是 IAM evaluation 路徑必須 HA、且 VPC Service Controls 等資料 perimeter 是 identity 失守時的下游補位
Failure: Credential Rotation Without Scope	Service Account key、WIF provider 的 rotation 必須分域 — 跨 project / 跨環境的 SA 共用是 blast radius 放大器
Microsoft Storm-0558 Signing Key Chain	對 WIF 的提醒 — 信任 external OIDC issuer 時、issuer 自己被攻破會打到所有 audience；trust condition 必須鎖 issuer + audience + subject 三件事

下一步路由

上游：7.2 身分與授權邊界、7.6 秘密管理與機器憑證治理
平行：AWS IAM、Azure RBAC、Okta、AWS IAM Identity Center
下游：7.6 秘密管理與機器憑證治理（Google Secret Manager / Google Cloud KMS 個別 vendor 頁 S2 批次撰寫中）
跨模組：8 事故處理 vendor 清單（GCP IAM 事件如何 routing 進 IR 流程）
官方：Google Cloud IAM Documentation

Microsoft Purview

Mon, 18 May 2026 00:00:00 +0000

Microsoft Purview 是 Microsoft 在 2022 年把原 Microsoft Information Protection (MIP)、Azure Purview data catalog、Microsoft 365 Compliance Center 合併後的統合品牌、定位是 跨 M365 / Azure / endpoint / 跨平台 的 data governance + information protection + DLP + audit + insider risk 平台。它跟 Google DLP 的本質差異在 控制層級、功能列表反而看起來相似 — Purview 走 information protection（document / email / collaboration tool 的 sensitivity label + endpoint inline 攔截）、Google DLP 走 infrastructure-level discovery + transformation（GCS / BigQuery 的 content scan + de-identification）— 兩者層級不同、典型大型 Microsoft + GCP 混合環境會並存而非互斥。

服務定位

Purview 的核心 first-class concept 是 sensitivity label — 一個 label 帶動 encryption、access restriction、watermarking、DLP policy 多個控制、可由 user 手動標也可由 trainable classifier 自動標、跨 Office docs / SharePoint / Teams / Power BI / endpoint 繼承。其上的模組包含：Data Loss Prevention (DLP) — 跨 Exchange / SharePoint / Teams / Endpoint / Microsoft Defender for Cloud Apps (MDA) 的 policy 引擎；Data Map / Data Catalog — Azure / 多雲資料源 discovery + lineage；Unified Audit Log — M365 + Azure AD + Defender 統一 audit；Insider Risk Management — 行為 risk score 偵測內部威脅；Communication Compliance — Teams / email 內容 review。

跟 Google DLP 比、Purview 走 information protection 層 + label-driven + endpoint inline、Google DLP 走 infrastructure 層 + content-based + transformation pipeline。跟 Splunk 比、Purview 不是 SIEM — Unified Audit Log 是 event source、Splunk 或 Microsoft Sentinel 才是 aggregation 平面；Purview audit 進 SIEM 是常見組合。跟雲端原生 data policy（BigQuery Column-Level Security / S3 Block Public Access）比、Purview 跨平台 + label 統一、雲端原生只覆蓋單一雲、不同責任邊界。

關鍵張力：label 設計簡單度 ↔ 自動分類精準度 ↔ 使用者教育成本 是 Purview 導入時最常踩的三角。label 太細（10+ 層 hierarchical）使用者選不出來、label 太粗（只有 Public / Internal / Confidential）DLP policy 觸發精度不夠。Trainable classifier + auto-labeling 是補救、但要投入訓練樣本維運。

本章目標

讀完本頁、讀者能判斷：

Purview 在 information protection stack 中承擔哪一段（label / DLP / audit / insider risk）、跟 Azure RBAC + Entra ID / SIEM / cloud-native policy 怎麼分工
Sensitivity label 的層級設計（粗細、auto-label 條件、跨 Office / endpoint / Power BI 一致性）
DLP policy 的 location + condition + action 三軸如何配置、跟 endpoint DLP / MDA 怎麼覆蓋 SaaS shadow IT
Purview 計費分 SKU 的 trap、E3 + add-on vs E5 license 的決策

最短判讀路徑

判斷 Purview deployment 是否健康、最少看四件事：

Label 層級設計：sensitivity label 幾層、是否 hierarchical（parent / sublabel）、是否定義 auto-labeling 條件（含某 SIT、來自某 SharePoint site、某 user group 建立）、跨 Office / endpoint / Power BI / Teams 是否一致繼承
DLP policy coverage：location 是否涵蓋 Exchange + SharePoint + Teams + Endpoint + MDA、condition 是否用 SIT + label 雙軸（而非只看 SIT）、action 是否依風險分層（block / warn / encrypt / audit-only）
Audit + Insider Risk 證據鏈：Unified Audit Log retention 是否足夠（預設 180 天、E5 可到 1 年、長期要 archive）、Insider Risk policy 是否定義「離職前 30 天 mass download」「異常時段 access」等 organization-specific pattern、是否 export 進 SIEM
License 跟模組對應：Information Protection / DLP / Insider Risk / Communication Compliance 屬不同 SKU、是否買到所需模組、E3 + add-on 還是 E5、避免「policy 寫好但 license 沒解鎖功能」

四件事任一缺失、就是 Data Protection and Masking Governance 邊界的待補項目。

日常操作與決策形狀

Sensitivity label 是 first-class control：label 不只是 metadata、而是 單一 identifier 帶動多個控制 — 標到 document 後同時觸發 AES encryption（透過 Azure Rights Management）、access restriction（誰能開 / 列印 / 轉寄）、watermarking、DLP policy condition、Power BI dataset 繼承。Hierarchical label（Confidential → Confidential\Finance、Confidential\Legal）讓子部門客製、但層級超過 3 層使用者選擇困難。Label 設計要先決定 跨 BU 共用 base set + 每 BU 自家 sublabel 的拓撲、不是一次列 20 個。

Trainable classifier 補 SIT 不足：預定義 SIT（Sensitive Information Type、如 credit card / SSN / passport）涵蓋通用 PII / PCI、但 organization-specific 敏感資料（內部 product spec、合約模板、未公開財報草稿）SIT 抓不到。Trainable classifier 用 ML 訓練 — 提供 50-500 個正例 + 反例、Purview 訓 classifier、跑 staging 驗證 precision / recall 達標再 promote。維運成本是樣本要定期 refresh、business 變動時 classifier 會 drift。

DLP policy = location + condition + action：location（Exchange email / SharePoint site / Teams chat / OneDrive / Endpoint / MDA-managed SaaS）決定 在哪攔、condition（含某 SIT N 次 / 標 Confidential / 來自外部 user / 含某 trainable classifier 命中）決定 何時觸發、action（block + notify / encrypt / quarantine / audit-only / require justification）決定 怎麼處理。production 不該一上來就 block — 先 audit-only 跑 2 週收集 baseline、tune false positive、再 promote 到 warn、最後選擇性 block 高風險 condition。
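The three axes and the staged rollout can be sketched as a small data model. This is only an illustration of the decision shape, not the Purview API: the `Policy`, `Event`, and `Mode` names and the `evaluate` function are hypothetical, and the condition here assumes a policy that requires both a label match and a minimum SIT hit count.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    # Rollout stages described above: start at audit-only, then warn, then block.
    AUDIT_ONLY = "audit-only"
    WARN = "warn"
    BLOCK = "block"


@dataclass(frozen=True)
class Event:
    location: str             # hypothetical workload name, e.g. "exchange" or "endpoint"
    label: str                # sensitivity label on the item
    sit_hits: int             # number of SIT matches found in the content
    external_recipient: bool  # whether the item is leaving the organization


@dataclass(frozen=True)
class Policy:
    locations: frozenset      # location axis: where the policy applies
    label: str                # condition axis: required label
    min_sit_hits: int         # condition axis: required SIT match count
    mode: Mode                # action axis: what to do on a match


def evaluate(policy, event):
    """Return the action name for an event, or None when the policy does not apply."""
    if event.location not in policy.locations:
        return None  # the policy does not cover this workload
    matched = (
        event.label == policy.label
        and event.sit_hits >= policy.min_sit_hits
        and event.external_recipient
    )
    return policy.mode.value if matched else None
```

Promoting a policy from audit-only to warn or block then means changing only `mode`, while the location and condition axes stay as tuned during the baseline period.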

Endpoint DLP（Windows / macOS）：透過 Microsoft Defender for Endpoint agent 在端點 inline 攔截 — copy to USB / upload to non-corp cloud（Dropbox / Google Drive personal）/ print / paste to browser、針對標 Confidential 的 document 自動 block 或 warn。跟 Datadog Security 的 Sensitive Data Scanner 不同層 — 後者 scan log / APM payload 事後發現、Endpoint DLP 事前在 user action 攔截。Endpoint DLP 要 Defender for Endpoint license + Purview Endpoint DLP add-on 雙重 license、容易踩計費 trap。

Microsoft Defender for Cloud Apps (MDA) 整合：MDA 是 Microsoft 的 CASB（Cloud Access Security Broker）、把 Purview DLP policy 延伸到非 Microsoft 的 SaaS（Salesforce / Box / Slack / Google Workspace）。MDA 透過 API connector 或 reverse proxy 攔截 SaaS 上的 sensitive document、套 Purview label / DLP action。覆蓋 shadow IT 跟 third-party SaaS 是 MDA 的價值、但每個 connector 都要單獨配置 + 維運。

Data Map / Data Catalog discovery + lineage：Purview Data Map 自動掃描 Azure Storage / Synapse / SQL DB / Power BI / 部分 AWS / GCP 資料源、產 metadata + classification + lineage。跟 information protection 模組是不同 surface — Data Map 偏 data governance（誰擁有什麼資料、資料流向哪）、information protection 偏 control（誰能存取、能否 export）。中大型組織通常分開 onboard、不要一次全推。

Unified Audit Log 是 SIEM source：M365 + Azure AD + Defender + Purview 自身的 audit event 統一進 Unified Audit Log、可透過 Compliance Center search、或 Office 365 Management Activity API export 到 Splunk / Sentinel / Elastic Security。Purview 自己不做 correlation / alerting、要做跨來源 detection 必須接 SIEM。Retention 預設 180 天、E5 license 1 年、長期合規要走 Audit Premium 或 archive 到 long-term storage。

Insider Risk Management 跟 SIEM 互補：SIEM 主軸是 external threat + cross-source correlation、Insider Risk 主軸是 single-user 行為 risk score over time — 離職前 30 天 mass download、異常時段存取 sensitive folder、跨 sensitivity tier 大量 access。Risk score 累積到 threshold 觸發 case、進 Compliance officer review queue。預定義 policy template（departing employee、disgruntled employee、data leak）可快速 onboard、organization-specific pattern 要自己定。

跟 Azure RBAC + Entra ID 整合：Purview policy 的 user / group 引用直接吃 Entra ID identity、sensitivity label 的 access restriction 也走 Entra ID group。Compliance / Information Protection admin 是 Entra ID role、應該收緊到少數人 + 走 PIM (Privileged Identity Management) just-in-time elevation。Break-glass account 要單獨設計、不能跟日常運維混。

核心取捨表

取捨維度	Microsoft Purview	Google DLP	Splunk	雲端原生 data policy（BigQuery / S3）
控制層級	Information protection（document / label）	Infrastructure（content scan + transform）	Detection / aggregation	Resource policy（column / object 級別）
核心抽象	Sensitivity label + DLP policy	InfoType + de-identification	SPL + correlation rule	IAM policy + column tag
覆蓋面	M365 + Endpoint + MDA-managed SaaS + Azure	GCS / BigQuery / Pub/Sub / 任意 API content	任意 log source	單一雲服務內
計費模型	Per-user license（E3 + add-on / E5、模組分 SKU）	Per-GB scan + per-API call	Per-GB ingestion	多半免費 / 服務內計費
自動分類	Trainable classifier + 預定義 SIT	InfoType detector（150+ 預定義 + custom）	不做分類	Column tag 手動 / catalog 工具自動
Endpoint inline	強 — Endpoint DLP（Win/macOS）	無（基礎設施層）	無（觀測層）	無
Shadow IT 覆蓋	強 — 透過 MDA CASB	弱 — 限 GCP / API 整合	無	無
退場成本	高 — label 嵌入 document、跨 M365 黏著	中 — InfoType pattern 可移植	高 — SPL / detection content	低 — IAM policy 較通用
適合場景	M365 / Office / collaboration 為主、insider risk	Infrastructure data + multi-cloud + GCP	SIEM / SOC	單一雲服務內 fine-grained access

選 Purview 的核心訴求：M365 / Office / collaboration 為主、需要 label 統一控制跨 document / email / Teams / endpoint、insider risk 是主要威脅、且能買到 E5 或對應 add-on。Non-Microsoft 環境或 infrastructure data 為主（BigQuery / S3）走 Google DLP / cloud-native policy 更直接、不要硬塞 Purview。

進階主題

Trainable classifier 的 lifecycle：classifier 不是 train 一次永久用、business context 變化（產品線改、合約模板更新、合規詞彙變）會讓 precision / recall 下降。Production 應定期 review classifier hit / miss、補新樣本 retrain、跟 SIT 互補不是替代 — 通用 PII 走 SIT 穩定、organization-specific 走 trainable classifier。Staging 跑 2 週驗證 false positive < threshold 才 promote。

Endpoint DLP 跟 Datadog Security Sensitive Data Scanner 的不同層：Endpoint DLP 在 user action 當下攔截（copy / upload / print）、Datadog Sensitive Data Scanner 在 log / APM ingestion 時 scrub。兩者不互斥 — Endpoint DLP 防 資料離開端點、Datadog Scanner 防 PII 寫進觀測 log、典型 Microsoft + Datadog 環境會並存。

Data Loss Prevention for Power BI：Power BI dataset / report 可繼承 Purview sensitivity label、export to Excel / PDF 時 label 跟著走、DLP policy 可條件 標 Highly Confidential 的 dataset 不能 export。是 Microsoft analytics stack 比 Tableau / Looker 在 information protection 上的關鍵優勢。

Information Barriers（內部 walled garden）：合規場景（投行 research vs trading desk、law firm 對手客戶）需 organization 內部某 group 不能 Teams 對話 / 不能 share 檔案、Purview Information Barriers 設定 segment + policy 阻擋。是 compliance-specific feature、非合規環境用不到、但金融 / 法律 / 顧問業是 must-have。

E3 + add-on vs E5 的計費決策：Purview 完整功能（trainable classifier、Endpoint DLP、Insider Risk、Communication Compliance、Audit Premium）要 E5 license、單價約 E3 的 1.5 倍。中小組織從 E3 + 個別 add-on（Information Protection and Governance E5、Insider Risk Management E5）起步、避免一次 E5 全推；大組織直接 E5 反而簡化計費跟 license 管理。

排錯與失敗快速判讀

DLP policy 寫好但沒觸發：condition 或 location 設錯（policy 只覆蓋 Exchange 沒包 SharePoint）、或 license 沒解鎖該模組（Endpoint DLP 要額外 add-on）— 在 Compliance Center 看 policy match 統計、確認 license 對應
使用者抱怨 label 選不出來 / 選錯：label 層級太細 + 沒有預設 label、user 不知該選哪個 — 簡化到 3-5 個 base label、用 auto-labeling 補自動分類、加 label tooltip
Trainable classifier false positive 多：訓練樣本不足 / 正反例失衡 — 補樣本到 50+ per class、retrain、staging 跑 2 週驗證再 promote
Audit log retention 不夠 / 合規查不到：預設 180 天、合規要 1 年以上 — 升 E5 或 Audit Premium、或 export 到 SIEM / long-term storage
Insider Risk policy 太敏感 / 太多 case：預設 template 沒 tune organization baseline — 跑 audit-only 模式 30 天統計、調 threshold、加 user group 排除（VIP / legitimate bulk download role）
Endpoint DLP 攔到合法業務操作：policy 沒區分 corp managed device vs BYOD、或沒給 user override + justification — 加 device compliance condition、設 warn + justification 而非直接 block
MDA connector 落後 SaaS 新功能：API connector 有 lag、新功能未涵蓋 — 對高風險 SaaS 補 reverse proxy 模式、或在 SaaS 側設原生 DLP
License 模組混亂：policy 寫好但功能沒解鎖、admin 不知道哪些要 E5 — 維護 license-to-feature 對照表、Compliance Center 警示「需要 license」要直接修

何時改走其他服務

需求形狀	改走
Infrastructure data（GCS / BigQuery）	Google DLP
SIEM / cross-source correlation	Splunk / Microsoft Sentinel
Observability log PII scrubbing	Datadog Security
單一雲 column / object 級別權限	BigQuery Column-Level Security / S3 Block Public Access
AWS-centric data protection	AWS Macie / AWS KMS
Endpoint detection 為主（不只 DLP）	CrowdStrike Falcon / Microsoft Defender for Endpoint
Incident routing	8 事故處理 vendor 清單

不在本頁內的主題

Microsoft 365 / Azure AD 完整管理（屬 Azure RBAC + Entra ID）
eDiscovery 跟法律 hold 流程細節
Microsoft Sentinel SIEM 完整配置（屬 SIEM 群、跟 Purview 是互補不是同一頁）
Purview Data Map 對非 Azure 資料源（AWS / GCP / on-prem）的完整 connector 矩陣
Compliance Manager 的法規對照與 scoring 細節
Azure Information Protection (AIP) 舊版 client 的 migration 流程

案例回寫

Purview 在 07 案例庫沒有直接 vendor-level 事件、但 information protection + insider risk 角度跟多個案例對照：

案例	跟 Purview 的關係（對照啟示）
Mailchimp 2023 Support Tool Abuse	客服系統客戶資料應標「Customer Confidential」label、DLP policy 自動阻擋大量匯出、Insider Risk Management 偵測異常 operator 行為
Snowflake 2024 Credential Abuse	Endpoint DLP 在 Microsoft 端點攔截從 Snowflake 下載到 USB / personal cloud 的大量資料；對照啟示是「資料平台外洩仍可在 endpoint 端補位攔截」、不是依賴 Snowflake 自身控制
Okta Support System 2023	Unified Audit Log 紀錄 support tool 高風險操作、Insider Risk 偵測異常 pattern、跟 SIEM 串接做 cross-source correlation
Data Protection and Masking Governance (section)	Sensitivity label + DLP policy 是 information protection 的工具、跟 Google DLP transformation 不同層、可並存
Audit Trail and Accountability Boundary (section)	Unified Audit Log 是 accountability evidence chain、retention 跟 export 設計是合規證據可用性的關鍵

下一步路由

上游：7.4 資料保護與遮罩治理、7.8 稽核軌跡與責任邊界
平行：Google DLP（infrastructure 層 DLP、跟 Purview 並存）、Cloud-native Data Policy (BigQuery + S3)（resource-bound access control、跟 Purview label-driven 互補）
下游：Splunk / Elastic Security（Unified Audit Log export 進 SIEM）
跨類：Azure RBAC + Entra ID（identity 基底）、Datadog Security（log PII scrubbing、不同層互補）
跨模組：8 事故處理 vendor 清單（Insider Risk case → IR routing）
官方：Microsoft Purview Documentation

SQLite

Wed, 13 May 2026 00:00:00 +0000

SQLite 是世界上部署最多的 DB（手機、瀏覽器、car、IoT 都有）。傳統定位是 embedded、單檔案與低操作成本資料庫；multi-tenant 網路服務通常會先看 PostgreSQL、MySQL 或 managed SQL。但近年因 Cloudflare D1（serverless SQLite）、Turso（distributed SQLite）、Litestream（SQLite replication）等服務興起，出現「SQLite as production DB」的新場景。

教學路線：單檔正式狀態與 local-first

SQLite 服務頁的教學目標是把單機、單檔案、edge、desktop、test fixture 的正式狀態責任說清楚。讀者讀完後要能判斷 SQLite 何時是 production state，何時要轉向 server database、edge KV 或分散式 SQLite 變體。

學習段	核心問題	對應段落
Embedded state	單檔案資料庫如何成為 source of truth	定位、適用場景
Local-first	device、edge、desktop、test fixture 的責任形狀	適用場景、案例對照
Writer boundary	single writer、file lock、WAL 如何決定服務上限	容量特性、容量規劃要點
Distributed variants	Turso、LiteFS、rqlite、D1 解決哪類同步或 edge 問題	跟其他 vendor 的取捨、章節群結構
替代路由	何時升級 PostgreSQL、MySQL、DynamoDB 或 edge KV	不適用場景、下一步路由

定位：單檔案 embedded + 新興分散式 SQLite 生態

SQLite 跟 PostgreSQL / MySQL 承擔不同層級的資料責任：

以 function-call API 使用，省掉 server process
單一檔案（含 schema、data、index、metadata）
No user / role concept and no network connection layer
Concurrent reads and writes, whether from one process or several, are arbitrated by file locks

傳統定位：test fixture、CLI tool data store、mobile app（iOS / Android 內建）、edge device。

新興定位：edge serverless（Cloudflare D1）、distributed SQLite（Turso、rqlite）、replicated SQLite（Litestream）。

容量特性

單檔案上限：

DB 最大 281 TB（理論）
實務上單表 > 100 GB 開始有 vacuum / index 問題

並發寫：

WAL mode：可同時多 reader + 1 writer
寫入仍由 single writer boundary 控制
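Write throughput is bounded by disk fsync per transaction (typically < 1K write transactions per second); batching many rows into one transaction raises row throughput well beyond that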

並發讀：

WAL mode 多 reader 可同時跑
read-only workload 可以撐高吞吐

Cross-process / cross-instance：

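Multiple processes on one host can write the same file, but file locks serialize them and contending writers receive SQLITE_BUSY; instances on different hosts sharing the file over a network filesystem break the single writer boundary because file locking there is unreliable
When the data must be distributed, use Litestream (replication) or Turso (distributed)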
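The writer boundary can be observed with Python's standard `sqlite3` module. The sketch below is a minimal demonstration under stated assumptions: the `events` table and the `demo_single_writer` function are made up for illustration, and `timeout=0` is chosen so that a blocked writer fails immediately instead of waiting.

```python
import os
import sqlite3
import tempfile


def demo_single_writer(db_path):
    """Show that WAL allows a concurrent reader but only one writer at a time."""
    # isolation_level=None puts the connection in autocommit mode,
    # so BEGIN / COMMIT are issued explicitly below.
    writer = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    other = sqlite3.connect(db_path, timeout=0, isolation_level=None)

    writer.execute("PRAGMA journal_mode = WAL")
    writer.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, body TEXT)")
    writer.execute("INSERT INTO events (body) VALUES ('committed')")

    writer.execute("BEGIN IMMEDIATE")  # take the write lock
    writer.execute("INSERT INTO events (body) VALUES ('in-flight')")

    # A reader is not blocked; it sees the last committed snapshot.
    visible_during_write = other.execute("SELECT COUNT(*) FROM events").fetchone()[0]

    # A second writer cannot take the write lock while the first holds it.
    try:
        other.execute("BEGIN IMMEDIATE")
        other.execute("ROLLBACK")
        second_writer_blocked = False
    except sqlite3.OperationalError:  # "database is locked" (SQLITE_BUSY)
        second_writer_blocked = True

    writer.execute("COMMIT")
    visible_after_commit = other.execute("SELECT COUNT(*) FROM events").fetchone()[0]

    writer.close()
    other.close()
    return visible_during_write, second_writer_blocked, visible_after_commit


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(demo_single_writer(os.path.join(tmp, "demo.db")))
```

In a real service a non-zero `timeout` (busy timeout) lets the second writer wait briefly instead of failing, which trades latency for fewer busy errors.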

適用場景

1. Test fixture / CI 用 DB：

整合測試需要的 fixed DB
比 spin up PostgreSQL container 快
對應 1.4 Repository Adapter 的 contract test 模式

2. CLI tool / desktop app 內建 store：

Chrome / Firefox（cookies、history、bookmark）、Fossil SCM、iOS app
省掉 server、單檔案攜帶

3. Mobile app（iOS / Android）：

iOS Core Data 底層用 SQLite
Android 自帶 SQLite API
offline-first app 的標準

4. Single-instance backend（特殊場景）：

流量小 + HA 由備份 / restore / redeploy 流程承擔
例：Sidekick / 個人 SaaS / family-scale app
配合 Litestream 做 backup / DR

5. Edge / serverless（新興）：

Cloudflare D1：edge SQLite、跟 Workers 整合
Turso：distributed SQLite、跨 region replication
跟傳統 SQLite 不同等級、是 新的 product

6. Embedded device / IoT：

沒網路或要降低 server 依賴
SQLite 內建、無 external dependency

不適用場景

1. 多 instance / 多 region web service：

SQLite 的單檔模型以單 instance writer 為主要邊界
替代：PostgreSQL、Aurora、Spanner、CockroachDB

2. 高寫入吞吐（> 1K WPS）：

fsync 限制
替代：任何 server-based RDBMS

3. Multi-user 權限管理：

無 user / role 概念
替代：PostgreSQL / MySQL

4. 跨機器 transaction：

SQLite 是 single-machine
替代：分散式 SQL

5. 大規模 production OLTP：

大規模 production OLTP 需要 server database 的 HA、replica、權限與操作邊界
替代：MySQL / PostgreSQL / Aurora

跟其他 vendor 的取捨

vs PostgreSQL（作為 test DB）：

SQLite：快 spin up、SQL dialect 接近但有差異
PostgreSQL：跟 production 一致、發現的 bug 真實
選 SQLite：speed of iteration、簡單 query
選 PostgreSQL：catch production-like bug、PostgreSQL-specific 特性測試

vs Cloudflare D1：

SQLite（local）：單機、自管
D1：edge serverless、跟 Workers 整合
選 SQLite：embedded / CLI / app 場景
選 D1：edge web service、跟 Cloudflare 生態整合

vs Turso（distributed SQLite）：

SQLite：單機、單檔案
Turso：distributed、跨 region replication、SQLite-compatible
選 SQLite：simple use case
選 Turso：需要 SQLite simplicity + 全球分散

vs Litestream（replicated SQLite）：

SQLite：單檔案
Litestream：把 SQLite 變成 streaming replicated 到 S3
選 Litestream：想要 SQLite simplicity + DR

vs Firebase / Firestore（mobile app）：

SQLite：embedded、offline-first、無 sync
Firestore：realtime、自動 sync、雲端 store
選 SQLite：offline-first、單機
選 Firestore：multi-device sync、realtime

容量規劃要點

1. WAL mode 是 production baseline：

default journal mode 是 rollback journal（每寫都 lock）
WAL（Write-Ahead Log）讓多 reader 可同時跑
PRAGMA journal_mode = WAL

2. fsync 配置：

PRAGMA synchronous = FULL（durable、慢）
PRAGMA synchronous = NORMAL（faster、少數情況可能掉資料）
PRAGMA synchronous = OFF（最快、不安全）

3. mmap 加速 read：

PRAGMA mmap_size = 268435456（256 MB）
把 DB 部分內容 mmap 進 RAM、加速 read

4. Cache size：

PRAGMA cache_size = -64000 (a negative value is a size in KiB, so roughly 64 MB of page cache)
大 cache 對 read-heavy workload 有幫助

5. Auto-vacuum：

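Off by default: the file does not shrink after deletes
PRAGMA auto_vacuum = INCREMENTAL plus a periodic PRAGMA incremental_vacuum; on an existing database the mode change only takes effect after running VACUUM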
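The five settings above can be applied together when a connection is opened. The following is a sketch using Python's standard `sqlite3` module; the `open_tuned` and `effective_settings` helpers are hypothetical names, and the values are simply the ones listed in this section, not tuned recommendations.

```python
import os
import sqlite3
import tempfile


def open_tuned(db_path):
    """Open a connection and apply the PRAGMA baseline discussed in this section."""
    conn = sqlite3.connect(db_path)
    # auto_vacuum is set first: on a new database it must be chosen before
    # any table exists (an existing database would need VACUUM afterwards).
    conn.execute("PRAGMA auto_vacuum = INCREMENTAL")
    conn.execute("PRAGMA journal_mode = WAL")     # persistent, stored in the file
    conn.execute("PRAGMA synchronous = NORMAL")   # per-connection setting
    conn.execute("PRAGMA mmap_size = 268435456")  # 256 MB, per-connection
    conn.execute("PRAGMA cache_size = -64000")    # negative = KiB, per-connection
    return conn


def effective_settings(conn):
    """Read back what the connection is actually using."""
    names = ("journal_mode", "synchronous", "cache_size", "auto_vacuum")
    return {name: conn.execute(f"PRAGMA {name}").fetchone()[0] for name in names}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        conn = open_tuned(os.path.join(tmp, "app.db"))
        print(effective_settings(conn))
        conn.close()
```

Because `synchronous`, `mmap_size`, and `cache_size` are per-connection, every process that opens the file has to apply them; only `journal_mode = WAL` and the auto-vacuum mode travel with the database file.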

章節群結構

SQLite 章節群的責任是把單檔正式狀態、embedded process、writer boundary、backup / restore、test fixture、local-first 與 edge SQLite 變體拆成可教學路線。完整結構見 SQLite Teaching Structure；下表列出目前已建立的 deep article、hands-on 與 migration route。

層級	文件	狀態	教學責任
結構總覽	Teaching Structure	已有正文	對齊 PG / MySQL 與 LLM 架構，固定 SQLite 後續讀法
Core deep	File lifecycle / backup boundary	已有正文	WAL sidecar、backup API、restore drill、corruption route
Hands-on	Hands-on 操作路線	已有正文	local file、backup restore、WAL busy、migration fixture
Concurrency	WAL concurrency / locking	已有正文	single writer、file lock、`SQLITE_BUSY`、checkpoint
Performance	PRAGMA tuning / performance	已有正文	journal、sync、cache、mmap、vacuum 的取捨
Migration	Schema migration / versioning	已有正文	app release、schema version、rollback、migration evidence
Testing	Test fixture best practice	已有正文	SQLite 測試便利性與 production dialect gap
Embedded app	Mobile / desktop embedded store	已有正文	device local state、privacy、backup、app version
Sync	Local-first sync boundary	已有正文	多裝置同步、conflict、server authority
Edge variant	D1 / Turso / libSQL comparison	已有正文	edge SQLite 產品與 local SQLite 的責任差異
Replication	Litestream / LiteFS replication	已有正文	continuous backup、read replica、failover boundary
SQL compatibility	SQL dialect and index limits	已有正文	type affinity、index、constraint、PostgreSQL / MySQL gap
Operations	Observability / runbook	已有正文	busy errors、WAL growth、backup evidence、incident route
Migration route	SQLite to PostgreSQL	已有正文	多 tenant、權限、HA、audit 出現時的升級路線
Migration route	SQLite to D1 / Turso	已有正文	edge / serverless 化路線
Migration route	PostgreSQL to SQLite simplification	已有正文	single-user / embedded 工具的反向簡化路線

章節群的讀法是先讀 file lifecycle，再按壓力選 deep article。若問題是 write contention，讀 WAL locking；若問題是測試，讀 test fixture；若問題是 edge / serverless，讀 D1 / Turso comparison；若問題是服務長大，讀 SQLite to PostgreSQL migration。

Anti-recommendation 與升級路由

SQLite 的低操作成本容易讓團隊忽略它的 writer boundary。這一段先說何時維持 SQLite，再說何時升級到 server SQL、edge SQLite 變體或 managed KV。

機制 / 路線	維持簡單設計的條件	升級訊號	主要引用路徑
Local SQLite	單 process、單 writer、資料可用檔案備份保護	多 instance 寫入、需要 HA、需要資料層權限	Database、Source of Truth
WAL + file backup	read-heavy、寫入量低、RPO 可接受定期 snapshot	restore 演練失敗、WAL growth 失控、RPO / RTO 變嚴格	RPO、RTO
Litestream / LiteFS	單 primary 寫入清楚、主要需求是 backup 或 read replica	需要多地 active write、跨 region transaction	Replication Lag、Stale Read
Cloudflare D1 / Turso	edge / serverless 生態已是主平台	SQL 特性、migration、observability 或 vendor 限制卡住	1.11 全球分散式 OLTP
PostgreSQL / MySQL	application 已進入多服務、多 tenant、權限與備份治理需求	schema migration、connection、audit 與 failover 成主題	PostgreSQL vendor、MySQL vendor

SQLite 的簡單路徑是讓檔案生命週期成為正式操作流程。只要單一 writer、備份、restore、migration 與 file ownership 都能被 runbook 控制，SQLite 可以是正式狀態，而非臨時 cache。

升級到 server SQL 的訊號是操作責任超過檔案邊界。當團隊需要資料庫帳號、權限分層、read replica、線上 schema migration、集中 audit 或跨 instance failover 時，PostgreSQL / MySQL / Aurora 會比繼續包裝 SQLite 更清楚。

已知 limitation 與後續路由

SQLite overview 目前已完成服務判斷與章節群正文路由。File lifecycle、WAL locking、PRAGMA tuning、schema migration、test fixture、local-first sync、edge product 差異、observability、hands-on 與 migration route 都已有對應正文；下一輪審查可集中在案例補強、引用精度與跨章重複整理。

案例對照

SQLite 不在 09 case 庫的「規模化 vendor」類別、但作為 embedded 跟 test 廣泛使用：

iOS Core Data: its default persistent store is SQLite, so many iOS apps use SQLite indirectly
Chrome / Firefox：cookie、history、bookmark
Fossil SCM：repository metadata 與 application-file use case
Cloudflare D1：edge serverless（新興 production 場景）
Turso：distributed SQLite（新興 production 場景）

常見陷阱

default journal mode 不改 WAL：read 跟 write 互相 block、performance 差
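Multiple processes or instances writing the same file: on a single host the writers are serialized and the surplus ones get SQLITE_BUSY; over a network filesystem or across hosts, file locking is unreliable and corruption becomes possible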
delete 後檔案沒縮小：忘了 vacuum
synchronous=OFF 給 production：power loss 可能掉資料
SQLite 跟 PostgreSQL 行為差異測試不足：SQLite test 過、PostgreSQL production 出 bug（特別是 date / time、NULL 處理、type coercion）
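The type coercion gap in the last item can be reproduced in a few lines. This sketch uses Python's standard `sqlite3` module with a made-up `orders` table: SQLite's type affinity converts the text '12' to an integer and stores 'twelve' as text in an INTEGER column, whereas PostgreSQL would reject the second insert.

```python
import sqlite3


def sqlite_type_coercion():
    """Insert text values into an INTEGER column and report what SQLite stored."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (qty INTEGER)")
    conn.executemany(
        "INSERT INTO orders (qty) VALUES (?)",
        [("12",), ("twelve",)],  # both are bound as text
    )
    rows = conn.execute(
        "SELECT qty, typeof(qty) FROM orders ORDER BY rowid"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    # Neither insert raises, so a test that passes here says nothing
    # about whether the same statements pass on PostgreSQL.
    print(sqlite_type_coercion())
```

Declaring the table as `STRICT` (available in newer SQLite versions) narrows this particular gap, but date / time and NULL handling differences still need tests against the production engine.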

下一步路由

完整 T1 對照：01-database vendors index
平行：PostgreSQL vendor / MySQL vendor（production server-based RDBMS）
上游：1.4 Repository Adapter（test fixture 模式）
結構：SQLite Teaching Structure（完整章節群與寫作順序）
操作：SQLite Hands-on（local file、backup restore、WAL busy reproduction、migration fixture、D1 / Turso preview）
深入：SQLite file lifecycle 與 backup boundary（WAL、backup、restore、file ownership）
官方：SQLite Documentation、Litestream、Turso、Cloudflare D1

AWS ELB（ALB / NLB / CLB）

Fri, 01 May 2026 00:00:00 +0000

AWS ELB 是 AWS managed load balancer 系列、承擔三個責任：流量入口（HTTP/HTTPS for ALB、TCP/UDP for NLB）、health check + draining、跟 AWS 生態整合（ACM TLS / Target Group / WAF / Lambda）。包含 ALB（L7、HTTP/HTTPS）、NLB（L4、極低延遲）、CLB（legacy、不要選）。設計取捨偏向「managed + AWS-native + integrate with ECS/EKS/Lambda」、跨雲 / 進階 traffic management 是限制。

本章目標

讀完本章後、你應該能：

建立 ALB / NLB、配置 listener + target group
設計 health check + connection draining
用 ACM 自動憑證 + SNI
用 ALB Ingress Controller / AWS Load Balancer Controller for K8s
評估 ALB vs NLB vs CloudFront vs API Gateway

最短路徑：5 分鐘把 AWS ELB 跑起來

# 1. Create the ALB
aws elbv2 create-load-balancer \
  --name demo-alb \
  --subnets subnet-aaa subnet-bbb \
  --security-groups sg-xxx \
  --scheme internet-facing \
  --type application

# 2. Create the target group and register targets
aws elbv2 create-target-group \
  --name demo-tg \
  --protocol HTTP --port 8080 \
  --vpc-id vpc-xxx \
  --target-type instance \
  --health-check-path /health \
  --health-check-interval-seconds 15

aws elbv2 register-targets \
  --target-group-arn arn:aws:elasticloadbalancing:...:targetgroup/demo-tg/... \
  --targets Id=i-0abc123 Id=i-0def456

# 3. Create the listener and verify
aws elbv2 create-listener \
  --load-balancer-arn arn:aws:elasticloadbalancing:...:loadbalancer/app/demo-alb/... \
  --protocol HTTP --port 80 \
  --default-actions Type=forward,TargetGroupArn=arn:aws:...

ALB_DNS=$(aws elbv2 describe-load-balancers --names demo-alb \
  --query 'LoadBalancers[0].DNSName' --output text)
curl "http://${ALB_DNS}"

日常操作與決策形狀

ALB vs NLB vs CLB

子議題：

ALB：L7、path/host routing、WebSocket、gRPC、Lambda target
NLB：L4、static IP、preserve client IP、極低延遲、TCP/UDP
CLB：legacy、不要新用
選擇判讀：HTTP/HTTPS → ALB；TCP/UDP / 高吞吐 → NLB

Target group / listener rule

子議題：

Target type：instance / IP / Lambda
Listener rule：path-based / host-based / header-based routing
Priority 排序
對應指令：aws elbv2 modify-rule

Health check 與 draining

子議題：

Health check：HTTP path / interval / threshold
Connection draining（deregistration delay）：deregister 後等到 in-flight requests 完成
對應 5.C9 反例 cutover without drain

進階主題（按需閱讀）

TLS termination + SNI

子議題：

ACM 自動憑證 + 續期
SNI：單 ALB 多 domain（最多 25 certificates）
TLS policy（min TLS version）
Mutual TLS（ALB 2023+）

ALB Ingress Controller / AWS Load Balancer Controller

子議題：

在 EKS 內配置 ALB / NLB（Ingress / Service of type LoadBalancer）
IngressClass / annotations
Pod readiness gate（pod 到 ALB target group healthy 才接流量）
對應 Kubernetes vendor 頁

Cross-zone load balancing

子議題：

ALB default enabled、NLB default disabled
Cross-zone 跨 AZ data transfer cost
跟 AZ failover 對應

WAF integration

子議題：

AWS WAF on ALB
Rate-based rule / managed rule group
對應 07 security WAF

Idle timeout

子議題：

ALB default 60s、可調 1-4000s
跟 keep-alive / WebSocket 長連線對應
跟 backend（K8s pod / EC2）的 timeout 對齊

Cost 模型

子議題：

LB-hour（per ALB / NLB）
LCU（Load Balancer Capacity Unit）— 多維度計算
Data processing charge
跨 AZ data transfer
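The multi-dimensional LCU calculation can be sketched as "take the most expensive dimension". The per-LCU capacities below are assumptions taken from the ALB pricing description as commonly published by AWS (EC2 / IP targets) and may change, so check the current pricing page before relying on them; the `alb_lcus` function name is hypothetical.

```python
# Assumed capacity covered by one ALB LCU per dimension (verify against AWS pricing).
NEW_CONNECTIONS_PER_SECOND = 25
ACTIVE_CONNECTIONS_PER_MINUTE = 3000
PROCESSED_GB_PER_HOUR = 1.0
RULE_EVALUATIONS_PER_SECOND = 1000


def alb_lcus(new_conn_per_s, active_conn_per_min, gb_per_hour, rule_evals_per_s):
    """Return the LCUs billed for one hour: the largest of the four dimensions."""
    dimensions = (
        new_conn_per_s / NEW_CONNECTIONS_PER_SECOND,
        active_conn_per_min / ACTIVE_CONNECTIONS_PER_MINUTE,
        gb_per_hour / PROCESSED_GB_PER_HOUR,
        rule_evals_per_s / RULE_EVALUATIONS_PER_SECOND,
    )
    return max(dimensions)


if __name__ == "__main__":
    # 50 new connections/s dominates here, so the hour is billed at 2 LCUs.
    print(alb_lcus(new_conn_per_s=50, active_conn_per_min=3000,
                   gb_per_hour=0.5, rule_evals_per_s=100))
```

The practical reading is that the dominant dimension decides the bill: a connection-heavy workload and a bandwidth-heavy workload with the same request count can cost very differently.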

排錯快速判讀

Target unhealthy

操作原則：health check path 不對 / security group 沒開 / backend 反應慢。

aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:...:targetgroup/demo-tg/...
# HealthState: unhealthy → check Reason (Target.Timeout / Elb.InternalError / Target.ResponseCodeMismatch)
# Common root causes: the security group does not allow the health check port, the health check path returns 404, or the backend responds slower than the timeout

504 Gateway Timeout

Principle: the backend did not respond before the ALB idle timeout (default 60 seconds) elapsed. Reading: compare the backend log with the ALB access log.

Cross-zone imbalance

操作原則：cross-zone disabled、流量集中單 AZ。修法：enable cross-zone（注意 cost）。

Draining 卡住

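Maps to the 5.C9 anti-pattern. Reading: if deregistration takes too long, the deregistration delay (default 300 seconds) is longer than in-flight requests need; if connections are cut before they finish, the delay is too short.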

ACM cert renew 失敗

操作原則：DNS validation 失敗 / domain ownership 變動。判讀：ACM console 看 cert state。

何時改走其他服務

需求形狀	改走
跨雲 / 自管	nginx / Envoy
Service mesh	Envoy + Istio
Cloud-native auto-discovery	Traefik
CDN / edge	CloudFront / Cloudflare / Fastly
API Gateway	AWS API Gateway / Kong
極低成本	自管 nginx on EC2

不在本頁內的主題

AWS WAF rule 完整 reference
Network Firewall 配置
各 AWS region 限制差異
ELB classic（CLB）細節

案例回寫

直接相關案例

案例	主討論議題
5.C1 Tradeshift self-managed → EKS	遷 EKS 時 ALB / NLB 是入口、切流批次跟 target group 權重連動
5.C2 Condé Nast EKS	多集群整併 EKS、AWS Load Balancer Controller 統一 ingress 入口
5.C4 Mobileye EKS	大規模 workload 遷 EKS、ALB target group health check 是切流驗證點
5.C5 Miro EKS	Managed EKS 後 ALB / NLB 治理回到平台團隊

跨 vendor 對照

案例	對 AWS ELB 的對應
5.C9 cutover without drain	ALB deregistration delay / NLB connection draining 是切流的關鍵回退面
5.C10 規模對照	AWS 生態小型 ALB + EC2 / 中型 ALB + EKS / 大型 NLB + 多 region + WAF

待補 AWS ELB 案例：大規模 AWS Load Balancer Controller 客戶案例、NLB static IP 場景、AWS WAF + ALB 安全整合。

下一步路由

上游概念：5.3 LB Contract
平行 vendor：nginx、Envoy
下游能力：07 security WAF、6 reliability release gate

Google Cloud Pub/Sub

Fri, 01 May 2026 00:00:00 +0000

Google Cloud Pub/Sub 是 GCP managed pub/sub 服務、承擔三個責任：全球 topic 路由（無 region 概念）、彈性 delivery（push 跟 pull 並存）、GCP 生態整合（BigQuery / Dataflow / Cloud Run）。設計取捨偏向「topic 是 first-class、subscription 各自進度、ack deadline 控制重試」、跟 Kafka 的 partition / consumer group 思路不同。

對「GCP 生態事件分發、跨 region 全球路由、push HTTP endpoint 接收事件、Dataflow streaming」這條路徑、Pub/Sub 是首選。本頁先給最短路徑、再展開日常 topic / subscription 操作與 ack deadline 設計、最後進階治理（ordering、DLT、push endpoint、IAM）跟排錯。

本章目標

讀完本章後、你應該能：

用 gcloud CLI 建 topic / subscription、publish / pull 訊息
區分 push vs pull subscription、選擇對應的 delivery 模型
設計 ack deadline 與 ackExtension、處理長任務
配置 dead-letter topic 與 retry policy
評估 ordering key、Pub/Sub Lite、BigQuery subscription 等延伸場景

最短路徑：5 分鐘把 Pub/Sub 跑起來

# 1. Create the topic
gcloud pubsub topics create demo-topic

# 2. Create the subscription (pull mode, bound to the topic)
gcloud pubsub subscriptions create demo-sub --topic=demo-topic

# 3. Publish and pull to verify
gcloud pubsub topics publish demo-topic --message="hello"
gcloud pubsub subscriptions pull demo-sub --auto-ack

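The shortest path verifies that the topic and subscription can be created and that messages can be published and received; see the daily operations section for real usage. Against real GCP, the commands need a project and credentials. To verify locally first, start the Pub/Sub emulator and use gcloud config set api_endpoint_overrides/pubsub to point the same CLI commands at the emulator.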

日常操作與決策形狀

gcloud CLI 與 client library

子議題：

gcloud CLI 指令對照表（topics / subscriptions / publish / pull / ack）
Client library 配置：credentials / flow control / async vs sync
Batch publish（提高吞吐、增加延遲的取捨）
對應指令範例：gcloud pubsub subscriptions describe

Topic / Subscription 設計

Topic 是 first-class entity、跟 Kafka 不同的是 subscription 才是 consumer 抽象：

1 topic ↔ N subscription（fan-out 內建）
Subscription 各自進度（無 consumer group 概念）
Subscription expiration policy (an inactive subscription is deleted after the configured period, 31 days by default)

Push vs Pull subscription

子議題：

Push：Pub/Sub 主動 POST 到 HTTP endpoint、適合無狀態 worker / Cloud Run
Pull：consumer 主動拉取、適合長 worker / 需要 flow control
Push endpoint 要求（HTTPS、認證）
兩者的可靠性 / latency / cost 對照

Ack deadline 與 ack extension

子議題：

Ack deadline：subscription 等待 ack 的時間（預設 10 秒、上限 600 秒）
Modify ack deadline（長任務動態延長）
Client library 的自動 ack extension
跟 SQS visibility timeout 的對照（語意類似、機制不同）
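For a long task, the worker has to keep pushing the deadline forward before it expires. The sketch below only computes when those extensions should happen; it does not call Pub/Sub. The `extension_schedule` function and the `margin` parameter are hypothetical, and the 10 / 600 second bounds are the default and upper limit listed above.

```python
MIN_ACK_DEADLINE = 10   # seconds, the default listed above
MAX_ACK_DEADLINE = 600  # seconds, the upper limit listed above


def extension_schedule(processing_seconds, ack_deadline=MIN_ACK_DEADLINE, margin=2):
    """Return offsets (seconds since delivery) at which to extend the ack deadline.

    Each extension is issued `margin` seconds before the current deadline
    would expire and is assumed to grant another `ack_deadline` seconds.
    """
    if not MIN_ACK_DEADLINE <= ack_deadline <= MAX_ACK_DEADLINE:
        raise ValueError("ack_deadline must be between 10 and 600 seconds")
    if not 0 < margin < ack_deadline:
        raise ValueError("margin must be positive and smaller than ack_deadline")

    step = ack_deadline - margin
    schedule = []
    offset = step
    while offset < processing_seconds:
        schedule.append(offset)
        offset += step
    return schedule


if __name__ == "__main__":
    # A 25-second task with the default 10-second deadline needs three extensions.
    print(extension_schedule(25))
```

Client libraries that extend the deadline automatically do the equivalent of this loop internally; the schedule is mainly useful for reasoning about how many modify-ack-deadline calls a long job generates.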

進階主題（按需閱讀）

ordering key、dead-letter topic 與 schema enforcement 已展開為 deep article：ordering key / DLT / schema enforcement、push / pull / ack flow control。下列子議題段保留選題判讀入口。

Ordering key

子議題：

啟用 ordering 的限制（subscription 設定 enableMessageOrdering）
Ordering 在 push 跟 pull 的差異
跟 Kafka partition + key 的對照
性能影響（throughput 受限）

Dead-letter topic

子議題：

Set the maximum delivery attempts (5 to 100); messages that exceed it are forwarded to the DLT
DLT 是另一個 topic、可以再訂閱重處理
跟 SQS DLQ 的差異（DLT 是 topic、不是 queue）
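The dead-letter flow can be simulated locally to see why a poison message stops blocking the subscription. This is a behavioral sketch, not the Pub/Sub client API: `deliver_with_dead_letter` and the handlers are hypothetical, and real delivery counting is done by the service rather than by the consumer.

```python
def deliver_with_dead_letter(message, handler, max_delivery_attempts=5):
    """Redeliver until the handler succeeds or the attempt limit is reached."""
    for attempt in range(1, max_delivery_attempts + 1):
        try:
            handler(message)
            return ("acked", attempt)
        except Exception:
            # A failure acts like a nack: the message is eligible for redelivery.
            continue
    # Attempts exhausted: the message would be forwarded to the dead-letter topic.
    return ("dead-lettered", max_delivery_attempts)


def make_flaky_handler(failures_before_success):
    """Build a handler that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def handler(message):
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RuntimeError("transient failure")

    return handler


def poison_handler(message):
    raise ValueError("cannot parse payload")


if __name__ == "__main__":
    print(deliver_with_dead_letter("ok-after-retry", make_flaky_handler(2)))
    print(deliver_with_dead_letter("poison", poison_handler))
```

Because the DLT is itself a topic, the dead-lettered message only becomes visible for reprocessing once a subscription is attached to that topic.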

Pub/Sub Lite

子議題：

Pub/Sub Lite vs Pub/Sub（partition-based、zonal、cost 低）
何時用 Lite（高吞吐、確定 region）
何時用 standard（global routing 內建）

BigQuery subscription / Cloud Storage subscription

子議題：

BigQuery subscription：訊息直接寫入 BQ table（無需 Dataflow）
Cloud Storage subscription：訊息批次寫入 GCS object
適合 streaming analytics / data lake 場景

Schema enforcement

子議題：

Topic 綁定 schema（Avro / Protobuf）
Schema evolution
跟 Kafka Schema Registry 的對照

IAM / Service Account

子議題：

Pub/Sub IAM role（publisher / subscriber / viewer）
Service Account 認證（push endpoint 用）
VPC Service Controls

排錯快速判讀

Subscriber backlog（unacked messages 累積）

操作原則：先看是 push 還是 pull、再定位 endpoint 失敗 vs flow control 限制。

gcloud pubsub subscriptions describe demo-sub
# Check whether ackDeadlineSeconds (default 10s) and messageRetentionDuration (default 604800s / 7 days) match the processing time and replay requirements

判讀：Cloud Monitoring metric 的 num_undelivered_messages 與 oldest_unacked_message_age。

Push endpoint 500（retry storm）

操作原則：push endpoint 持續 500、Pub/Sub 會 backoff retry、看 retry policy 設定。判讀：endpoint 健康 vs 訊息毒性。

Ordering key 限制誤用

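Principle: with ordering enabled, messages that share one ordering key are delivered sequentially, so per-key throughput is capped (publish throughput is limited to 1 MBps per ordering key). Reading: check whether throughput is bound by a hot ordering key; if so, split it into finer-grained keys.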

IAM 權限錯

操作原則：publish / pull / ack 各自需要不同 IAM role。判讀：用 Cloud Logging 看 deny 原因。

Subscription expired

操作原則：閒置太久 subscription 被 GC。判讀：subscription expiration policy 設定 + 監控 lastReceiveTime。

何時改走其他服務

需求形狀	改走
需要 streaming + replay long window	Kafka / Confluent Cloud
需要 partition + consumer group	Kafka / Pub/Sub Lite
需要複雜 routing	RabbitMQ on GKE
跨雲 / 跨平台	Kafka / NATS
AWS 生態	AWS SQS / SNS
Workflow + durable execution	Google Workflows / Temporal

不在本頁內的主題

Dataflow / BigQuery 完整功能（另開 streaming analytics 章節）
Cloud Run / Functions 整合細節
各語言 client 完整 API

案例回寫

Pub/Sub 專屬案例（C60-C69）

案例	主討論議題
3.C60 Spotify Event Delivery	從 Kafka 遷入 / 自建 dedup
3.C61 Spotify autoscaling	Backlog ≠ healthy / autoscale 反效果
3.C62 Spotify GCS export	Ack = end-to-end commit
3.C63 Mercari Actionable History	Ack deadline 是 batch-level（陷阱）
3.C64 Mercari Item Feed DLT	DLT 防 poison message 阻塞
3.C65 Mercari LINE flow control	Pull subscription 對齊外部 RPS
3.C66 Mercari B2C gRPC pusher	自建 push / 長 job + 動態 RPS
3.C67 Niantic Pokémon GO	Elastic buffer / BQ streaming
3.C68 Wix clickstream	Pub/Sub + Dataflow + BQ 教科書組合
3.C69 Twitter Ad Engagement	多 topic 切分取代 partition

跨 vendor 對照

案例	對 Pub/Sub 的對應
3.C8 Cloudflare Queues	全球交付對照：Pub/Sub global routing 內建
3.C10 規模對照	中小型直接用 / 大型考慮 Pub/Sub Lite / 超大跨雲走 Kafka
3.C20 Spotify 遷出 Kafka	Pub/Sub 遷入的源頭（為何遷出 Kafka）

IAM + Service Account 缺直接 customer engineering case：customer engineering blog 著墨少、建議撰寫該段時依 GCP 官方 IAM 文件 + 通用安全原則。

下一步路由

上游概念：0.3 非同步選型、3.1 broker basics
平行 vendor：AWS SQS、Kafka
下游能力：3.4 consumer 設計、6.12 idempotency / replay

Honeycomb

Fri, 01 May 2026 00:00:00 +0000

Honeycomb 是 high-cardinality observability SaaS、承擔三個責任：events-based 資料模型（不是 metrics aggregation）、unknown-unknowns 偵錯能力（BubbleUp / Heatmap）、observability-driven SRE 文化代表平台。設計取捨偏向「深度優於廣度」、不追求 Datadog 的 integration 廣度、專注於 high-cardinality + distributed system debugging。

本章目標

讀完本章後、你應該能：

用 Honeycomb SDK 或 OTel 送 events 到 Honeycomb
用 BubbleUp 找 outlier 模式（unknown-unknowns）
設計 SLO + burn rate alert
配置 Refinery（tail-based sampling）
評估 Honeycomb vs Datadog 的選用判讀

最短路徑：5 分鐘把 Honeycomb 跑起來

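# 1. Instrument the application (Honeycomb SDK or OTel SDK)
# OTel SDK + OTLP exporter: point the exporter at Honeycomb (US endpoint shown) and pass the API key as a header
export OTEL_EXPORTER_OTLP_ENDPOINT="https://api.honeycomb.io"
export OTEL_EXPORTER_OTLP_HEADERS="x-honeycomb-team=${HONEYCOMB_API_KEY}"
export OTEL_SERVICE_NAME="demo-service"  # placeholder name; also decides the target dataset
# TODO: Beeline SDK alternative (API key + dataset settings)

# 2. Send sample events
# TODO: confirm the trace appears in the Honeycomb UI

# 3. Query in the query interface
# TODO: visualize COUNT, group by service.name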

日常操作與決策形狀

Events vs metrics 心智模型

Honeycomb 跟 metrics-aggregation 平台不同。子議題：

Event = 一個 trace span（包含 dozens of attributes）
不預先 aggregate、查詢時 group by 任意 attribute
High-cardinality 不是問題、是設計目標
對應 4.C2 Gaming peak cardinality
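The difference from pre-aggregated metrics is that the grouping dimension is chosen at query time. The sketch below mimics that on an in-memory list of wide events; the attribute names and the `group_count` helper are made up for illustration and say nothing about how Honeycomb stores or scans data.

```python
from collections import defaultdict


def group_count(events, attribute, where=None):
    """Count events per value of an arbitrary attribute, decided at query time."""
    counts = defaultdict(int)
    for event in events:
        if where is not None and not where(event):
            continue
        counts[event.get(attribute, "<missing>")] += 1
    return dict(counts)


# Each event is one span carrying many attributes, including high-cardinality ones.
EVENTS = [
    {"service.name": "checkout", "user_id": "u-1", "region": "tw", "http.status_code": 200},
    {"service.name": "checkout", "user_id": "u-2", "region": "jp", "http.status_code": 500},
    {"service.name": "checkout", "user_id": "u-2", "region": "jp", "http.status_code": 500},
    {"service.name": "search", "user_id": "u-3", "region": "tw", "http.status_code": 200},
]


if __name__ == "__main__":
    def is_error(event):
        return event["http.status_code"] >= 500

    # The same raw events answer two different questions without re-instrumenting.
    print(group_count(EVENTS, "region", where=is_error))
    print(group_count(EVENTS, "user_id", where=is_error))
```

With pre-aggregated metrics, `user_id` would have had to be declared as a label up front, which is exactly where cardinality limits usually bite.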

Instrumentation

子議題：

Honeycomb SDK（Beeline）：簡單、Honeycomb-specific、auto-instrumentation 部分
OTel SDK + OTLP：標準、vendor-neutral、推薦新部署用
Manual attribute：對 business / domain context attribute 不省略
Refinery：tail-based sampling proxy

Query 介面

子議題：

Visualize：count / count_distinct / heatmap / p50 / p95 / p99
Group by：任意 attribute（user_id / region / version 等）
Filter：WHERE clause
對應 SLO query：heatmap(duration_ms) GROUP BY service.name WHERE http.status_code = 500

Deep Article

High-Cardinality Query Model 與 BubbleUp：event-based 資料模型、high-cardinality 查詢設計、BubbleUp 異常偵測、SLO / burn rate、derived columns、dataset 設計與 OTLP ingestion

Migration Playbook

Sentry 遷移到 Honeycomb：error tracking 轉 event-based observability

進階主題（按需閱讀）

BubbleUp 分析

子議題：

給定 heatmap 異常區、自動找區隔 outlier 跟 baseline 的 attribute
適合「我看到 latency spike、但不知道哪個維度造成」
Unknown-unknowns 偵錯模式
跟 Datadog APM 的 service map 對照

SLO 與 burn rate alert

子議題：

SLO 配置（service + indicator + objective + window）
Burn rate calculation：multi-window multi-burn-rate alert
跟 knowledge cards burn-rate 對照
對應 4.C9 OTel migration signal drift
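The burn rate arithmetic behind a multi-window alert is short enough to write out. This sketch assumes the common definition (observed error ratio divided by the error budget) and uses 14.4 as the paging threshold, a value often paired with a 1-hour long window and a 5-minute short window for a 30-day SLO; the function names are hypothetical and this is not Honeycomb's alert configuration format.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' the error budget is burning."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_error_ratio, short_window_error_ratio,
                slo_target, threshold=14.4):
    """Page only when both windows burn fast.

    The long window shows the burn is significant; the short window shows
    it is still happening, which filters out spikes that already ended.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


if __name__ == "__main__":
    slo = 0.999  # 99.9% target, so the budget is 0.1% of requests
    print(burn_rate(0.02, slo))            # 2% errors burns the budget about 20x too fast
    print(should_page(0.02, 0.02, slo))    # sustained burn in both windows
    print(should_page(0.02, 0.0005, slo))  # the spike has already ended
```

A second, lower threshold over longer windows can be layered on the same functions to catch slow burns with a ticket instead of a page.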

Refinery（tail-based sampling）

子議題：

為什麼需要 tail-based：保留有錯 / 高延遲 trace、丟正常 trace
Refinery 部署模式（gateway in front of Honeycomb）
Sampling rule：error / latency / per-service / dynamic
對應成本：100% ingestion 太貴、tail-based 平衡
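The key property of tail-based sampling is that the decision is made after the whole trace is known. The sketch below shows that decision shape only; it is not Refinery's rule syntax or algorithm, and the `keep_trace` function, the span fields, and the default thresholds are assumptions for illustration.

```python
import zlib


def keep_trace(trace_id, spans, latency_threshold_ms=1000, sample_one_in=10):
    """Decide whether to keep a completed trace.

    Error and slow traces are always kept; healthy traces are kept at
    roughly 1 in `sample_one_in`, chosen deterministically from the trace id
    so that every span of a trace gets the same decision.
    """
    if any(span.get("error") for span in spans):
        return True
    slowest = max((span["duration_ms"] for span in spans), default=0)
    if slowest >= latency_threshold_ms:
        return True
    return zlib.crc32(trace_id.encode("utf-8")) % sample_one_in == 0


if __name__ == "__main__":
    healthy = [{"duration_ms": 40}, {"duration_ms": 12}]
    failed = [{"duration_ms": 40}, {"duration_ms": 12, "error": True}]
    slow = [{"duration_ms": 40}, {"duration_ms": 2300}]

    print(keep_trace("trace-a", failed))   # kept: contains an error span
    print(keep_trace("trace-b", slow))     # kept: exceeds the latency threshold
    print(keep_trace("trace-c", healthy))  # kept or dropped depending on the id hash
```

Head-based sampling cannot express the first two rules, because at the time the first span is emitted nobody knows yet whether the trace will fail or be slow.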

OTLP integration

子議題：

Honeycomb 接受 OTLP（gRPC / HTTP）
應用層用 OTel SDK、傳給 Honeycomb 不用改 SDK
Multi-backend 支援：同一份 OTel data 送 Honeycomb + 其他
對應 4.C7 Datadog OTel migration

結構化 events 設計

子議題：

哪些 attribute 應加（user_id / request_id / business 維度）
哪些 attribute 不該加（PII / secrets）
Wide events 哲學：一個 event 帶 dozens of attributes、不分散到多 metric
對應 PII redaction strategy
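A minimal sketch of assembling one wide event while keeping PII and secrets out. The denylist contents, the `build_event` helper, and the attribute names are hypothetical; a real redaction strategy would live in shared instrumentation rather than in each call site.

```python
# Hypothetical denylist; keys are compared case-insensitively.
BLOCKED_KEYS = {"password", "authorization", "credit_card", "email"}


def build_event(base: dict, **context) -> dict:
    """Merge technical and business attributes into one event, dropping blocked keys."""
    event = {}
    for key, value in {**base, **context}.items():
        if key.lower() in BLOCKED_KEYS:
            continue  # never ship PII / secrets to the backend
        event[key] = value
    return event


event = build_event(
    {"service.name": "checkout", "duration_ms": 212},
    user_id="u-123", request_id="r-789", plan="enterprise",
    email="a@example.com",
)
```

The result is a single event carrying both request and business dimensions, instead of separate metrics per dimension.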

Observability-driven development

子議題：

Charity Majors 提的 SDLC 模式：production debug 是常態
TDD + observability：寫 code 同時思考可觀測性
跟 SRE 文化整合

排錯快速判讀

Events 沒到 Honeycomb

操作原則：先看 SDK 配置（API key + dataset）、再看 network、最後看 Honeycomb status page。

Query timeout

操作原則：query window 過大或 attribute cardinality 過高造成 backend slow。判讀：縮 time window、簡化 group by。

Sampling 過頭 vs 不足

Operating principle: traces missing while debugging means sampling is too aggressive; a cost blow-up means sampling is too light. Refinery's dynamic sampling addresses what a static rate cannot.

Burn rate alert noise

操作原則：multi-window 設計避免「短暫 spike 觸發 alert」、低 burn rate window 給長期趨勢。

跟其他 backend dual ship 不一致

對應 4.C9 OTel migration signal drift。判讀：兩個 backend 數據不對齊、看 SDK 是否 dual export、attribute mapping 是否一致。

何時改走其他服務

需求形狀	改走
廣度大、要 600+ integrations	Datadog
預算敏感	Grafana Stack（OSS）
Pure metrics	Prometheus
Logs full-text	Elastic Stack
Error tracking 為主	Sentry
Cloud-native (AWS / GCP)	CloudWatch / Cloud Ops
Self-hosted	OSS observability（Honeycomb 是 SaaS only）

不在本頁內的主題

Honeycomb SDK 完整 API
BubbleUp 內部演算法
Refinery 詳細配置
Honeycomb pricing 詳細

案例回寫

直接相關案例

案例	主討論議題
4.C2 Gaming peak cardinality	High-cardinality debug pattern
4.C9 OTel signal drift	（反例）Refinery / dual ship 對齊驗證

跨 vendor 對照

案例	對 Honeycomb 的對應
4.C7 Datadog OTel migration	從 Datadog APM 遷出時 Honeycomb 是 events 替代
4.C8 Airbnb K8s scale signals	動態叢集下 wide events 補 metrics 維度不足
4.C10 規模對照	Honeycomb 適合中大型 + observability-driven team

待補 Honeycomb 案例：Charity Majors 的 production talks、Honeycomb customer engineering blog、Refinery scale-up case。

下一步路由

上游概念：4.17 Telemetry Data Quality
平行 vendor：OpenTelemetry、Datadog
下游能力：06 reliability 模組（SLO / burn rate）、4.20 Evidence Package

Locust

Fri, 01 May 2026 00:00:00 +0000

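Locust is a Python-based load testing tool with three responsibilities: Python class-based test design (expressive user behavior), a built-in distributed mode (master / worker), and a Web UI for live observation. Its design trade-off favors Python developer experience, highly customized logic, and access to any Python library, which suits Python teams and scenarios that need heavy custom logic.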

本章目標

讀完本章後、你應該能：

寫 Locust user class + task
跑 standalone + distributed mode
自訂 client（非 HTTP、如 gRPC / WebSocket）
設計 task weight + on_start / on_stop hook
評估 Locust vs k6 / Gatling 的選用

最短路徑：5 分鐘把 Locust 跑起來

# 1. Install
pip install locust

# 2. Write locustfile.py
# TODO: class User(HttpUser): wait_time = ..., @task def hello(self): ...

# 3. Run
locust -f locustfile.py --host=http://target
# Then operate the test from the browser at http://localhost:8089

日常操作與決策形狀

User class + task

子議題：

HttpUser / FastHttpUser（FastHttpUser 用 geventhttpclient、效能高）
@task decorator + weight
on_start / on_stop（per-VU setup / teardown）
對應 Python class inheritance
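How task weights turn into a traffic mix can be modelled without Locust itself. The sketch below is a conceptual model, not Locust's internal code; the task names and the `expand_by_weight` helper are hypothetical. It shows that a weight of 3 against a weight of 1 yields roughly a 75/25 split.

```python
import random


def expand_by_weight(weighted_tasks: dict) -> list:
    """Repeat each task name by its weight so a uniform pick honours the ratio."""
    expanded = []
    for name, weight in weighted_tasks.items():
        expanded.extend([name] * weight)
    return expanded


rng = random.Random(42)  # fixed seed so the illustration is repeatable
pool = expand_by_weight({"browse": 3, "checkout": 1})
picks = [rng.choice(pool) for _ in range(10_000)]
browse_share = picks.count("browse") / len(picks)
```

When designing a user class, this is the ratio to compare against production traffic: weights express relative frequency, not absolute request rate.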

Distributed mode

子議題：

master：協調 + 收集 metric
worker：實際發送 request
locust --master / locust --worker --master-host=...
多 worker 突破 Python GIL 限制

Web UI vs headless

子議題：

Web UI（dev / interactive）
Headless（--headless --users N --spawn-rate N --run-time T）
對應 CI 整合：CSV report
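For CI, a headless run's CSV report can be turned into a pass/fail exit code. The sketch below assumes a stats CSV with `Name`, `Request Count`, `Failure Count`, and `95%` columns plus an `Aggregated` row; the sample data, thresholds, and the `gate` function are hypothetical, so check the column names against the file your Locust version actually writes.

```python
import csv
import io

# Hypothetical sample of a stats CSV; real files carry more columns.
SAMPLE_STATS_CSV = """Type,Name,Request Count,Failure Count,95%
GET,/,900,9,180
POST,/checkout,100,6,420
,Aggregated,1000,15,300
"""


def gate(stats_csv: str, max_failure_ratio: float, max_p95_ms: float) -> int:
    """Return a process exit code: 0 when the Aggregated row is within limits, else 1."""
    for row in csv.DictReader(io.StringIO(stats_csv)):
        if row["Name"] == "Aggregated":
            requests = int(row["Request Count"])
            failures = int(row["Failure Count"])
            ratio = failures / requests if requests else 1.0
            ok = ratio <= max_failure_ratio and float(row["95%"]) <= max_p95_ms
            return 0 if ok else 1
    return 1  # no aggregated row means the run produced no usable result
```

In a pipeline, the return value would be passed to `sys.exit` so that the release gate fails when the failure ratio or p95 latency regresses.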

進階主題（按需閱讀）

自訂 client（非 HTTP）

子議題：

任何 Python lib 都可包成 user
gRPC / WebSocket / database / queue 都行
request event 手動 fire

Custom request

子議題：

self.client.get/post（HTTP）
自訂 event emission
Custom statistics

locust-plugins 生態

子議題：

locust-plugins：第三方 plugin（CSV report enhanced / Postgres / Kafka / etc）
Custom shape（dynamic load profile）
TaskSet / SequentialTaskSet

CI integration

子議題：

Headless mode + exit code
CSV / JSON report
對應 6.8 Release Gate

Distributed scaling

子議題：

Kubernetes 部署
多 region load source
Result aggregation

排錯快速判讀

High VU 跑不上去

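Operating principle: a single Locust process is bound by the Python GIL and uses roughly one CPU core, so switch to distributed mode with one worker per core. Diagnosis: check whether the load generator's CPU or the network is the bottleneck.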

Worker disconnect

操作原則：master / worker network 不通、heartbeat timeout。判讀：log + master UI。

Custom protocol 報告不正確

操作原則：手動 event fire 缺 / metric name 不對。

Memory leak

操作原則：long run test、user state accumulate。判讀：on_stop cleanup。

何時改走其他服務

需求形狀	改走
編譯後分發 / 高 VU 單機	k6
JVM 生態	Gatling
GUI / 老牌	JMeter
Cloud managed	k6 Cloud / BlazeMeter / Locust 自管 K8s
Capacity planning	09 performance capacity 模組

不在本頁內的主題

Python 語言基礎
gevent / asyncio 內部
locust-plugins 完整列表

案例回寫

案例方向	對應主題
LinkedIn：Capacity 與 On-call 分層	automated load testing 對齊 headroom 預測（Python 場景）

Case 庫稀薄：本 cases/ 目錄目前沒有以 Locust 為主軸的案例。可參考候選方向：

待補 Locust customer case：Python-heavy 團隊 load test 採用案例、distributed Locust 大規模部署案例
候選 case：Pinterest（ML serving / 推薦系統壓測場景）、Spotify（squad-based 各團隊自管壓測）— 若未來收錄需先在 cases/ 補正文，本欄再寫實際 link

下一步路由

上游概念：6.13 Performance Regression Gate
平行 vendor：k6、Gatling
下游能力：09 performance capacity

Rootly

Fri, 01 May 2026 00:00:00 +0000

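Rootly is an incident response (IR) platform with three responsibilities: a no-code workflow builder (drag-and-drop automation), AI-assisted retrospectives and timeline assembly, and Slack / Teams integration on both platforms with the broadest integration count (200+). The product iterates quickly and, together with incident.io and FireHydrant, is one of the three main modern IR platform options. From 2023 onward, the Rootly AI module added incident enrichment and retrospective auto-drafting, moving the platform from workflow automation toward AI-assisted investigation.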

服務定位

Rootly 的核心定位是 Slack-native IR platform + no-code automation engine、目標客戶是「想最大化降低 incident response toil」的 AI-first / engineering-led 組織。產品主軸：no-code workflow builder（IFTTT-style condition / action 鏈、不需工程 deploy）+ Rootly AI（incident summarization / enrichment / retrospective auto-draft）+ Slack / Teams 雙平台對等支援。

跟 PagerDuty 比、PagerDuty 是 alerting-first（on-call schedule + escalation 為核心）、Rootly 是 IR-process-first（incident workflow + retro 為核心）、兩家常一起用（PagerDuty 負責 page、Rootly 接 declare 後的 process）。跟 incident.io 比、incident.io 走 opinionated minimal（流程固定、學習快）、Rootly 走 configurable maximal（workflow 可深度客製、學習曲線稍陡）。跟 FireHydrant 比、FireHydrant 在 service catalog / runbook 結構更剛、Rootly 在 AI + integration 廣度更領先。

關鍵張力：no-code 客製深度 ↔ 配置複雜度 是 Rootly 客戶最大的 trade-off — workflow 可以做得很深，但配多了會出現 workflow loop / 通知爆量 / AI summary 失準，需要有人定期 review workflow inventory。

本章目標

讀完本頁、讀者能判斷：

用 no-code builder 設計 incident workflow（trigger / condition / action）
配置 severity matrix + role assignment
用 Rootly AI 輔助 timeline + retrospective、了解 AI 失準的邊界
整合 200+ tool（觀測 / cloud / collaboration / ticket / paging）
評估 Rootly vs incident.io / FireHydrant / PagerDuty 的取捨

最短判讀路徑

判斷 Rootly deployment 是否健康、最少看四件事：

Slack workflow 入口統一：/rootly declare 是否唯一 declare 入口、severity / service / role 是否在 declare 時就 bind、Slack channel naming convention（inc-YYYY-MM-DD-slug）跟 retention 是否設定
No-code automation 治理：workflow 數量 / owner / 上次 review 時間是否有 inventory、有沒有 staging tenant 跑新 workflow、production workflow change 是否走 PR-like review
AI integration 邊界：Rootly AI 用在哪些環節（incident summary / timeline enrichment / retrospective draft）、AI 輸出是否標記為 draft 而非 finalized、AI hallucination 的 human review gate 是否定義
SSO + audit + integration health：SSO（Okta / Azure AD）+ audit log（誰改 workflow / 誰 close incident）是否開、Integration token 是否定期 rotate、Jira / Linear / GitHub PR / PagerDuty / Opsgenie 對接是否雙向同步

四件事任一缺失、就是 Drills and On-call Readiness 邊界的待補項目。

最短路徑

# 1. Install the Rootly app in Slack / Teams
# 2. Run /rootly declare to create a test incident
# 3. Build a workflow by drag and drop (severity → action)
# 4. Close the incident and generate the AI retrospective

日常操作與決策形狀

No-code workflow builder

子議題：

Trigger（severity / status / time）→ Action（page / message / ticket）
Branch / condition / parallel
Custom field bind

IFTTT-style 邏輯：workflow 是 trigger → condition → action 的 DAG、可以 branch / parallel / loop（loop 要小心、見排錯）。典型 production workflow：「severity SEV1 declared → page on-call via PagerDuty + create Jira ticket + post status page draft + invite security lead to Slack channel」。複雜度上限是「能 express 在 UI 拖拉上」、超過這個複雜度應該寫 webhook 接外部 orchestrator。

AI retrospective + Slack/Teams workflow

子議題：

自動 timeline from Slack messages
AI summary（what happened / contributing factor）
同 incident.io / FireHydrant Slack workflow
Teams 平等支援
Mobile app

Rootly AI 的能力邊界：AI 從 Slack channel 訊息抽 timeline、產生 contributing factor draft、列 action item candidate。產出是 draft、不是 finalized retrospective — IR lead 應該逐項驗證再 publish、AI hallucination 在 contributing factor / blame attribution 段最常出現（見排錯段）。

核心取捨表

取捨維度	Rootly	incident.io	FireHydrant	PagerDuty
核心定位	No-code workflow + AI investigation	Opinionated Slack-native IR	Service catalog + runbook 結構	Alerting + on-call schedule
客製化深度	高 — workflow builder + custom field	中 — 流程相對固定	中高 — runbook + catalog 模型清晰	中 — escalation 配置強、流程較輕
AI 能力	Rootly AI（summary / enrich / retro）	AI 摘要（較新、範圍較窄）	較少強調 AI	AIOps（alert grouping）
平台支援	Slack + Teams 對等	Slack-first（Teams 較弱）	Slack + Teams	Slack / Teams / Mobile / Email
Integration breadth	200+ (broadest among IR platforms)	Medium (mainly the Slack ecosystem)	Medium-high	Broadest within the paging ecosystem
學習曲線	中陡 — 配置選項多	緩 — 流程少	中 — service model 要先想清楚	中 — escalation policy 要先設計
適合場景	AI-first / 想自動化 toil / Slack-heavy	小到中型、想快上手 + 流程一致	中大型、service ownership 清楚	任何需要強 paging 的團隊
退場成本	中 — workflow / custom field 量會綁	低 — 流程相對標準	中 — service catalog 綁定深	高 — schedule + integration 量大

選 Rootly 的核心訴求：Slack-native IR + 想用 no-code + AI 把 incident process toil 自動化最大化、且能投入時間維護 workflow inventory（避免 workflow sprawl）。需要重 paging 的團隊通常 Rootly + PagerDuty 並用（Rootly 不取代 PagerDuty 的 schedule + escalation）。

進階主題（按需閱讀）

Rootly AI 深入

子議題：incident summary（給 stakeholder broadcast 用）、enrichment（自動補 service owner / recent deploy / related incident）、retrospective auto-draft（timeline + contributing factor + action item）。AI 輸出是 draft、需要 human review gate 才 publish。對 Incident Evidence Write-back 的影響是「快、但要驗」、不能把 AI draft 直接當成 source of truth。

No-code workflow 進階

子議題：condition expression（field / value / operator）、parallel branch、wait / delay、custom webhook action 接外部 orchestrator。複雜 workflow 應該 先在 staging tenant 跑、production workflow change 走 review。Workflow loop（A workflow 觸發 B、B 觸發 A）會在 misconfig 時出現、見排錯段。
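The workflow loop described above (A triggers B, B triggers A) is a cycle in the trigger graph, so it can be caught before a change reaches production. The sketch below is a generic depth-first cycle check over a hypothetical export of "workflow → workflows it can trigger"; it is not a Rootly API, and the `find_cycle` name and input shape are assumptions.

```python
def find_cycle(triggers: dict):
    """Return one cycle as a list of workflow names, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited / on current path / fully explored
    colour = {}
    stack = []

    def visit(node):
        colour[node] = GREY
        stack.append(node)
        for nxt in triggers.get(node, []):
            state = colour.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path: that closes a loop.
                return stack[stack.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        colour[node] = BLACK
        return None

    for start in list(triggers):
        if colour.get(start, WHITE) == WHITE:
            found = visit(start)
            if found:
                return found
    return None
```

Running a check like this as part of workflow change review gives the "loop detection" safeguard a concrete form, alongside rate limits on notifications.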

Ticket / PR / paging integration

子議題：Jira / Linear 雙向同步（incident close 同步 ticket、ticket update 帶回 Slack）、GitHub PR 自動連 incident（commit message 含 incident ID）、PagerDuty / Opsgenie alerting layer 對接（page 從 PagerDuty 來、process 在 Rootly 跑）。Integration token 失效是常見 silent failure、需要 monitoring。

Integration 廣度

子議題：觀測（Datadog / Grafana / New Relic / Honeycomb）/ Cloud（AWS / GCP / Azure）/ Collaboration（Slack / Teams / Zoom）/ Ticket（Jira / Linear / GitHub）/ Status page

Service catalog + Custom field

子議題：service / team / customer metadata、custom field 帶業務 context、workflow trigger by field

On-call 模組

子議題：Rootly OnCall（schedule + escalation）、跟 IR workflow 同 app

排錯快速判讀

Workflow 行為不符：trigger / condition 邏輯錯、看 workflow run log
AI summary / retrospective 失準：Slack noise 多、AI 對 contributing factor / blame attribution hallucinate — 手動補 timeline、AI 輸出標記為 draft、由 IR lead 逐項驗證才 publish
Workflow loop / 通知爆量：A workflow 觸發 B、B 又觸發 A、Slack 訊息或 ticket 暴衝 — 在 staging tenant pre-test、production workflow change 走 review、加 rate limit / loop detection
Slack notification overload：每個 severity 都 broadcast 全公司 channel — 設 severity threshold、SEV3 以下走 team channel、SEV1/2 才 broadcast
Integration token 失效：rotate / OAuth re-auth、加 integration health monitoring（token expiry alert）
Slack channel 亂：naming convention（inc-YYYY-MM-DD-slug）/ retention 沒設、舊 incident channel 累積成千

何時改走其他服務

需求形狀	改走
Slack-only / 簡潔	incident.io
Microsoft Teams	FireHydrant
Paging-first	PagerDuty
Learning-focused	Jeli
自建 Slack workflow	Slack + GitHub Issues / Linear

不在本頁內的主題

AI model / training detail / Pricing / 200+ integration 個別 setup

案例回寫

Rootly 主打 Slack-native + AI-assisted IR：本案例庫尚無直接揭露 Rootly 使用細節的事故；可參照的閱讀脈絡是「Slack-centric 協作 + 自動化 retro + AI-first 組織想 minimize IR toil」的服務事故。

案例	對應主題
Slack cases	Slack-native IR 平台在通訊平台自身事故下的回退
Reddit cases	mid-size 平台升級事故的 retro 結構（對照素材）

待補 candidate：NVIDIA / Figma / Canva 等 Rootly 公開 customer story。

下一步路由

Momento

Tue, 16 Jun 2026 00:00:00 +0000

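Momento is a serverless cache service with three responsibilities: it turns the cache into a usage-billed API (no nodes, no clusters, no capacity planning), scales automatically with traffic (expands at peak, no fixed fee when idle), and offers native SDKs plus Redis / Memcached compatible interfaces (existing clients can migrate). The design trade-off is to remove cache capacity planning and operations entirely and replace sizing with billing, which suits teams that want elasticity without running a cache cluster.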

對「流量不可預測、不想規劃容量與 sizing、團隊沒有 cache 運維資源」這條路徑、Momento 是 serverless 方向的代表。它跟自管 Redis、managed cache 的上層取捨（自管 vs managed vs serverless vs BaaS bundle）見 0.22 能力級買 vs 建。

本頁的計費、limit 與功能宣稱以 Momento 官方文件與 Momento 定價為準、最後檢查日 2026-06-16。Momento 是 SaaS、需帳號與 API key、無法本機 docker 驗證、指令為依官方文件的範例。

本章目標

讀完本章後、你應該能：

理解 serverless cache 跟 node-based / managed cache 的計費與維運差異
評估按用量計費（per request + data transfer）對你的流量形狀划不划算
判斷 Momento 原生 SDK vs Redis 相容介面的遷移路徑
區分 Momento 跟 ElastiCache Serverless 的定位差異
判斷哪些 cache 場景適合 serverless、哪些該回 node-based

最短路徑：用 SDK 連 Momento

# 1. Create a cache in the Momento Console and obtain an API key (no node / cluster setup)
# 2. Use a language SDK (pseudo-code for illustration; the real API is defined by the official SDK docs)

client = CacheClient(api_key, default_ttl=60s)
client.set("my-cache", "foo", "bar")     # write; valid within the TTL
client.get("my-cache", "foo")            # → "bar"

最短路徑的重點是「沒有 endpoint / node / sizing 要配」——建 cache 是一個 API 動作、不是 provision 一台機器。實際 SDK 介面以 Momento SDK 文件為準。

日常操作與決策形狀

SDK 與相容介面

子議題：

原生 SDK（多語言）：gRPC-based、Momento 自有 API
Redis / Memcached 相容介面：既有 Redis / Memcached client 可遷（相容範圍以官方為準、要驗證）
沒有 redis-cli 等價的 server 操作（serverless 無 server 可登入）

計費模型（核心決策）

子議題：

按用量計費：data transfer（傳輸量）+ 可能的 request / storage 維度（以官方定價為準）
無固定 node 費用：閒置時段不付 idle node 的錢
流量尖峰自動 scale：不需預留容量、但尖峰量直接反映在帳單

沒有容量規劃

子議題：

不選 node type、不設 maxmemory、不規劃 shard
scaling 由 Momento 處理、application 端不感知
代價：失去對底層的控制（無法調 eviction policy 等 server 參數）

進階主題（按需閱讀）

Serverless 計費的甜蜜點與陷阱

子議題：

甜蜜點：流量不可預測、有大量閒置時段、不想為峰值預留容量
陷阱：穩態高流量下、按用量可能比 node-based + Reserved Instance 貴
跟 ElastiCache Serverless 的計費踩坑同類議題、access pattern 低效會推高帳單
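The sweet spot versus trap question is a break-even calculation. The sketch below uses entirely hypothetical prices and a simplified model (usage cost driven only by data transfer, node cost as a flat hourly rate); real numbers must come from the official pricing pages.

```python
def monthly_usage_cost(gb_transferred: float, price_per_gb: float) -> float:
    """Usage-based bill: pay only for what moved through the cache."""
    return gb_transferred * price_per_gb


def monthly_node_cost(node_count: int, hourly_price: float,
                      hours: float = 730.0) -> float:
    """Node-based bill: fixed cost whether the nodes are busy or idle."""
    return node_count * hourly_price * hours


def break_even_gb(node_count: int, hourly_price: float,
                  price_per_gb: float) -> float:
    """Monthly transfer volume above which the node-based option is cheaper."""
    return monthly_node_cost(node_count, hourly_price) / price_per_gb
```

With two nodes at a hypothetical 0.10 per hour and 0.50 per GB, the break-even is 292 GB per month: spiky workloads below that favor usage billing, while steady traffic above it is the trap described above.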

Momento vs ElastiCache Serverless

子議題：

Momento：cache-as-API、完全 serverless、跨雲（不綁單一 cloud）
ElastiCache Serverless：AWS 生態內的 node 抽象、仍是 ElastiCache engine、綁 AWS
選擇：要完全擺脫容量規劃 + 跨雲 → Momento；已在 AWS 生態 + 要 engine 控制 → ElastiCache

遷移與相容性驗證

子議題：

從 Redis / Memcached 遷 Momento：用相容介面或改用原生 SDK
相容範圍要逐項驗證（serverless 不支援 server-side 操作如 SCAN 全庫、Lua 等、以官方為準）
失去的能力：server 參數調校、自管 persistence、module

排錯快速判讀

帳單超出預期

操作原則：serverless 帳單反映實際用量、先看 data transfer 與 request 量。判讀：access pattern 低效（大量小請求、大 value）會推高、用批次 / 合併降量；穩態高流量重新評估 node-based。

延遲比自管高

操作原則：serverless cache 多一層 API gateway / 跨網路、延遲可能高於同 VPC 的自管 Redis。判讀：latency-sensitive 且穩態高流量的場景、評估自管或 managed node-based。

相容介面行為差異

操作原則：Redis 相容介面不等於 100% Redis、server-side 操作可能不支援。判讀：對照官方相容清單、用到的命令逐一驗證。

何時改走其他服務

需求形狀	改走
穩態高流量、成本敏感	node-based Redis / Valkey + Reserved Instance
需要 server 參數 / eviction 控制	自管 Redis / ElastiCache
已在 AWS 生態	ElastiCache Serverless（同生態）
需要 Redis data types / module	Redis（完整 data types）
process-local 極低延遲	Caffeine（JVM 內、無網路）

不在本頁內的主題

Momento 完整 SDK API（各語言、以官方文件為準）
詳細計費計算（以官方定價為準）
Redis / Memcached 相容介面的完整相容矩陣
Momento Topics（pub/sub）等 cache 以外的產品線

案例回寫

跨 vendor 對照（本模組 case 庫暫無 Momento-specific case）

Momento 是較新的 serverless cache、本 blog 的 cache case 庫（Meta / Shopify / Netflix / Cloudflare / Tinder / Tubi / Snap）暫無 Momento production case。以下用 serverless 的角度對照既有 case 提供判讀。

案例	對 Momento 的對應
2.C9 Cache Stampede	serverless 也會 stampede、client-side jitter / singleflight 仍要自己做
9.C25 Tubi feature store	「feature 可重算才選 cache」的判斷對 serverless 一樣適用、不可重建走 durable
2.C10 規模對照	serverless 適合早期 / 不可預測流量、規模穩定後評估 node-based 成本
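The client-side jitter mentioned in the stampede row can be as small as randomizing each TTL. The sketch below is independent of any cache SDK; `jittered_ttl` and the 10% spread are hypothetical choices.

```python
import random


def jittered_ttl(base_ttl_s: float, spread: float = 0.1, rng=random) -> float:
    """Spread expiry over [base*(1-spread), base*(1+spread)] so keys do not expire together."""
    return base_ttl_s * (1.0 + rng.uniform(-spread, spread))


# Keys written in the same burst now expire across a 12-second window
# instead of all at the 60-second mark.
ttls = [jittered_ttl(60.0, rng=random.Random(i)) for i in range(100)]
```

Jitter only desynchronizes expiry; collapsing concurrent misses on one hot key still needs a singleflight-style guard in the application.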

待補 Momento-specific 案例：serverless cache 的成本與彈性 production 個案、從 ElastiCache 遷 Momento 的成本對照、不可預測流量場景的採用分享。

下一步路由

上游能力：0.22 能力級買 vs 建（自管 vs managed vs serverless）、0.6 成本取捨
平行 vendor：AWS ElastiCache（Serverless 選項）、Caffeine（另一端：process-local）
上游概念：2.2 Cache Aside

AWS CloudHSM

Mon, 18 May 2026 00:00:00 +0000

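AWS CloudHSM is a single-tenant dedicated HSM service (FIPS 140-2 Level 3). The customer has exclusive use of an HSM cluster: AWS provides the hardware, network, and provisioning, while the customer manages crypto users, partitions, key custody, and backups. Its trust model differs from AWS KMS: KMS is multi-tenant and managed, with AWS holding key custody and the API plane; on CloudHSM, AWS cannot see the keys and cannot reset a Crypto User password, so a customer who loses the credential loses the key permanently.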

服務定位

CloudHSM 的核心定位是 把 cryptographic root of trust 放回客戶手上 — 適合金融、政府、醫療這類有資料主權、FIPS 140-2 Level 3、PCI HSM、HIPAA 合規壓力的場景。跟 AWS KMS 比、KMS 也滿足 FIPS 140-2 Level 3、但 HSM cluster 是 AWS 多租戶共用、key material 由 AWS-controlled HSM 持有、控制面 API 也是 AWS。CloudHSM 把 HSM cluster 物理隔離給單一客戶、PKCS#11 / JCE / OpenSSL Dynamic Engine 直接打 HSM、AWS 在資料平面 沒有讀 key 的能力。

跟 自管 on-prem HSM（SafeNet / Thales 自架）比、CloudHSM 把硬體採購、機房、network、firmware patch 交還 AWS、客戶只管 key custody 跟 Crypto User policy；代價是不能完全脫離 AWS region。跟 Vault auto-unseal 整合場景中、CloudHSM 是 Vault master key 的 root custodian — Vault unseal key 用 CloudHSM 加密、CloudHSM 出事整個 Vault cluster 沒法 unseal、所以可用性設計（cross-AZ cluster、cross-region backup）很關鍵。多數一般 web app / SaaS 用 KMS 即可、不需要 CloudHSM 的物理隔離。

本章目標

讀完本頁、讀者能判斷：

何時需要 CloudHSM 的 dedicated 模型、何時 AWS KMS 已足夠
CloudHSM cluster 的最低安全 / 可用性需求（cross-AZ、Crypto Officer 分離、Quorum、backup）
Crypto User credential 出事的降級路徑（AWS 不能幫忙、靠 backup + Quorum）
跟 KMS Custom Key Store / Vault auto-unseal 整合的取捨

最短判讀路徑

判斷 CloudHSM deployment 是否健康、最少看四件事：

Cluster 拓樸：production cluster 是否至少 2 個 HSM instance 跨 AZ、cluster 內自動 replicate、單一 AZ 故障時 key 是否仍可用
Crypto User 管理：Crypto Officer（CO）跟 Crypto User（CU）是否分離、CO password 是否走 break-glass 保管、CU credential 是否走 short-lived 取得 + audit
Quorum-based policy：高敏 operation（建 CU、改 policy、key export wrapped）是否設 M-of-N approval、避免單一 admin compromise 後 silent abuse
Backup 治理：automatic 24h backup 跟 manual backup 是否都開、cross-region backup 是否走 explicit copy、restore 流程是否定期演練

四件事任一缺失、就是 CloudHSM deployment 待補項目 — 跟 secret management 的 evidence 邊界同類。

日常操作與決策形狀

Cluster + HSM Instance 拓樸：CloudHSM 的部署單位是 cluster、cluster 內可以有 1-N 個 HSM instance。production 場景至少 2 個 HSM instance 跨 AZ、cluster 自動把 key material replicate 在所有 instance 上、單一 AZ 失效不影響 cryptographic operation。跨 region 不自動 replicate — 跨 region DR 要靠 backup copy。

Crypto Officer (CO) vs Crypto User (CU)：CO 是 cluster 管理員、能建 / 刪 CU、設 policy、做 backup；CU 是真的做 cryptographic operation 的 identity（encrypt / decrypt / sign / verify）。production 必須分離 — CO credential 走 break-glass 保管、CU credential 給 application 使用、application compromise 只影響 CU 邊界、不能改 CO policy。

Quorum-based policy（M-of-N approval）：CloudHSM 支援把高敏操作（建 CU、改 policy、key export wrapped）綁定 M-of-N CO approval。例如 3-of-5 quorum、單一 CO 即使 credential 外洩也不能單獨建後門 CU、必須拿到另外 2 個 CO 的 signed token。對應 Storm-0558 signing key chain 啟示：高價值 key custodian 的 admin operation 不該是 單人單 token、必須有第二人簽核才能改變信任根。
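The M-of-N rule itself is simple to state in code. The sketch below models only the policy check (distinct, known officers meeting a threshold); it does not reproduce CloudHSM's signed-token mechanism, and the officer names and `quorum_met` function are hypothetical.

```python
def quorum_met(approvals: list, officers: set, required: int) -> bool:
    """True when at least `required` distinct, known officers approved."""
    distinct = {approver for approver in approvals if approver in officers}
    return len(distinct) >= required


OFFICERS = {"co1", "co2", "co3", "co4", "co5"}  # hypothetical 5 Crypto Officers
```

Under a 3-of-5 policy, one compromised officer approving repeatedly still counts once, and an approval from an identity outside the officer set counts for nothing.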

Backup 治理：CloudHSM 每 24 小時自動 backup 整個 cluster state（含 key material）、backup 是 AWS-managed encrypted blob、AWS 自己也不能解密、restore 必須在 CloudHSM cluster context 內進行。可手動 backup、可 copy 到其他 region 做 DR。Backup retention 預設 90 天、可延長。Backup 不是 export — 不能把 key material 從 HSM 拿出來看 plaintext。

Key Replication 跨 region：CloudHSM cluster 綁定單一 AWS region、跨 region 走 backup → copy → restore 流程、不是 active replication。設計 DR 時要算 RTO：restore 一個 cluster 從 backup 大約小時級、不適合 hot failover、應該 primary region 跑、DR region 備好空 cluster + backup copy。

PKCS#11 / JCE / OpenSSL Dynamic Engine 整合：application 不用 AWS SDK 講 CloudHSM、而是透過 標準 cryptographic API library（PKCS#11 for C/C++、JCE Provider for Java、OpenSSL Dynamic Engine 走 TLS termination）。好處是 application code 用業界標準介面、未來換 HSM 廠也只需要換 library。代價是 client SDK 要裝在 application host、CU credential 要 deploy 到 host、host security baseline 變成 cryptographic boundary 的一部分。

跟 KMS Custom Key Store 整合：KMS Custom Key Store 把 KMS Key 的 backing material 放在 CloudHSM、API 仍透過 KMS（kms:Encrypt / kms:Decrypt）、application code 不需要改。這是 KMS 易用 + HSM dedicated 雙重：保留 KMS 的 IAM policy / key rotation / audit log（CloudTrail）、又得到 single-tenant HSM 的合規屬性。代價是 CloudHSM 失效時、Custom Key Store backing 的 KMS Key 全部不可用、需要監控 cluster health。

核心取捨表

取捨維度	AWS CloudHSM	AWS KMS	Azure Managed HSM	Google Cloud HSM
部署模型	Single-tenant dedicated cluster	Multi-tenant managed	Single-tenant pool	HSM-backed Cloud KMS（Protection Level=HSM）
FIPS 140-2	Level 3（dedicated）	Level 3（shared cluster）	Level 3	Level 3
AWS / 雲廠持 key？	不持（CU credential 客戶獨有）	持（managed key custody）	不持（HSM admin 客戶獨有）	不持 plaintext key material
整合介面	PKCS#11 / JCE / OpenSSL	AWS SDK / CLI / KMS API	Key Vault SDK / REST	Cloud KMS API
Quorum 多人簽核	內建（M-of-N）	透過 IAM policy + organization SCP	RBAC + Privileged Identity Management	IAM Condition + organization policy
運維成本	高 — 自管 CU credential / patch / topology	低	中	低
合規憑證	FIPS 140-2 L3 + PCI HSM + Common Criteria	FIPS 140-2 L3 + PCI DSS	FIPS 140-2 L3 + Common Criteria	FIPS 140-2 L3
適合場景	金融 / 政府 / 醫療、需要物理隔離 + AWS 不持 key	一般 AWS-heavy workload、需要 IAM 整合	Azure-heavy + 合規壓力	GCP-heavy + 合規壓力
退場成本	中 — backup 跨廠不可移植、key 不能 export	中	中	中

選 CloudHSM 的核心訴求：合規明文要求 dedicated HSM（PCI HSM、某些國家資料主權法規）、或 trust model 上不接受 AWS 持 key。多數 AWS-heavy workload 用 KMS 即可、加 CloudHSM 反而引入 Crypto User credential 的單點失誤（丟了 = key 永久遺失）。需要 KMS API 但又要 dedicated HSM、走 Custom Key Store 是折衷路徑。

進階主題

Quorum Auth 設計：production 把 Quorum threshold 設為 3-of-5 或 2-of-3、五位 CO 由不同部門 / 不同地理位置持有、避免單一辦公室 / 單一網路同時被攻陷。Quorum token 有 TTL、單次 operation 用完就失效、防止 replay。建議 quarterly 演練：模擬一個 CO 不在、用剩餘 quorum 完成 emergency operation、驗證流程在事故時跑得通。

KMS Custom Key Store 整合決策：用 Custom Key Store 的關鍵問題是 availability blast radius — KMS Key 出事影響範圍是 使用該 Key 的 AWS service（S3、EBS、RDS encryption）、Custom Key Store backing 失效會讓這些 service 同步斷。設計時做 分層 key strategy：mass volume 的 S3 / EBS 用 AWS-managed KMS Key、高合規敏感的 database / secret 才用 Custom Key Store backing 的 KMS Key、降低單一 cluster 失效的影響面。

Cross-Region Backup：DR 要把 backup copy 到第二個 region、走 CopyBackupToRegion API、restore 時建空 cluster + 套 backup。整個 RTO 通常數小時、不適合熱備、設計上是 容忍小時級 outage 換到 BCDR 環境、不是 秒級 failover。對應 Azure AD Identity Control Plane 2021 對照啟示：身份 / 加密控制面的單點 outage 影響整個 platform、availability 的 topology 設計跟 confidentiality 同等重要。

跟 Vault auto-unseal 整合：Vault auto-unseal 可用 CloudHSM 作 master key custodian、走 PKCS#11 plugin、Vault unseal 時呼叫 CloudHSM Unwrap master key。比起 AWS KMS auto-unseal 多一層 dedicated HSM 保證、適合監管特別嚴的場景。代價是 CloudHSM cluster 失效 → Vault 不能 unseal → 下游所有 secret 拿不到、要設計 break-glass 流程。

合規憑證：CloudHSM 同時持有 FIPS 140-2 Level 3、PCI HSM、Common Criteria EAL4+ 多個認證、可作金融 PIN block 處理、payment 業者的 HSM 上鏈、政府機敏資料加密的 直接合規承諾、不需要客戶端再做 HSM 認證 audit。

排錯與失敗快速判讀

Crypto User credential 丟失：CU password 全公司只有一份、保管人離職 → AWS 不能 reset、key material 永久不可用 — CU credential 要走 password manager + 多人持有、CO 有能力 revoke 舊 CU 建新 CU
Cluster 只有單一 HSM instance：成本省了、單一 instance 故障 cluster 整個失效 — production 強制至少 2 個 instance、跨 AZ
Backup 沒測過 restore：每天 automatic backup 跑、從未 restore 演練、DR 真要用時發現流程不通 — quarterly 演練 restore 到測試 cluster、驗證 key material 可用
Custom Key Store 沒監控 CloudHSM health：CloudHSM cluster degraded 時、KMS Custom Key Store 跟著失效、application 看到 KMS 5xx — CloudWatch metric 監 HsmsActive / HsmTemperature、cluster health degrade 立即 alert
PKCS#11 library 版本漂移：application host 的 client SDK 版本跟 cluster firmware 不相容、cryptographic operation 失敗 — version compatibility matrix 進 deployment pipeline、firmware upgrade 前先測 staging
Quorum CO 全部同地點：5 個 CO 全在同一個辦公室、辦公室斷網 = quorum 不能組 — CO 跨 region / 跨組織分散
Audit log 沒接 SIEM：CloudHSM activity 透過 CloudTrail + cluster audit log、沒接 SIEM 就無 forensic — CloudTrail 跟 cluster audit 都 push 到 SIEM（見 7.13 偵測覆蓋率與訊號治理）

何時改走其他服務

需求形狀	改走
一般 AWS workload 加密、無 dedicated 合規	AWS KMS
Azure-heavy + dedicated HSM 合規需求	Azure Managed HSM（見上方對照表）
GCP-heavy + dedicated HSM 合規需求	Google Cloud HSM（Cloud KMS Protection Level=HSM）
Secret storage + dynamic credential	HashiCorp Vault / AWS Secrets Manager
Certificate / PKI（不是 key custody）	AWS ACM / cert-manager
跨雲 unified key custody	HashiCorp Vault transit engine（雲廠中立）
Key rotation 證據鏈	7.5 Credential Rotation Scoped Evidence

不在本頁內的主題

CloudHSM 完整 PKCS#11 / JCE API reference
CloudHSM Classic（舊版、已 EOL）的差異
每種合規法規（PCI HSM、HIPAA、FedRAMP）的逐條對應
CloudHSM CLI 跟 cloudhsm_mgmt_util 詳細指令
應用層使用 HSM-bound key 做 TLS termination 的 nginx / Apache 配置細節

案例回寫

CloudHSM 在 07 案例庫沒有直接 vendor-level 事件、以下案例採對照引用：

案例	跟 CloudHSM 的關係（對照）
Microsoft Storm-0558 Signing Key Chain	核心對照 — CloudHSM 設計 AWS 不持 key + key 不能 export 是 Storm-0558 反設計、攻擊者進 cluster 也搬不走 key material、Quorum policy 阻單一 admin compromise
Failure: Credential Rotation Without Scope	CloudHSM key rotation 需要應用層配合 key alias 切換、不像 KMS 自動 rotation；scope map 跟雙軌驗證窗口更明顯、PKCS#11 client 散落 host 群時 rotation 要分批
Azure AD Identity Control Plane 2021	對照啟示 — HSM cluster 是 single point of compromise、cross-AZ topology + cross-region backup 是 availability 的設計依據、不是 confidentiality

下一步路由

上游：7.6 秘密管理與機器憑證治理、7.5 傳輸信任與憑證生命週期（HSM 為 CA / signing key 的 FIPS-grade root custodian）、7.13 偵測覆蓋率與訊號治理
平行：AWS KMS、Google Cloud KMS、Azure Key Vault
整合：HashiCorp Vault（CloudHSM 作為 Vault auto-unseal master key custodian）
整合：KMS Custom Key Store（KMS API + CloudHSM backing 雙重）
跨模組：8 事故處理 vendor 清單（HSM 失效如何 routing 進 IR 流程）
官方：AWS CloudHSM Documentation

Azure RBAC + Entra ID

Mon, 18 May 2026 00:00:00 +0000

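Azure's identity and permission system has two layers. Entra ID (formerly Azure AD) is the IdP: it is the identity source for humans and workloads and provides SSO, MFA, and Conditional Access. Azure RBAC is the permission engine for cloud resources: it assigns roles to principals at a scope (Management Group / Subscription / Resource Group / Resource). The two layers have different responsibilities, different configuration surfaces, and different symptoms during an incident; treating them as one thing is the most common source of confusion in Azure governance.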

服務定位

Entra ID 是 Microsoft 自有的 workforce IdP、跟 Okta 是直接競爭者。M365 / Azure-heavy 的組織通常直接用 Entra ID 當主 IdP；Okta-first 的組織可以把 Entra ID 當下游 SP（federation）、也可以雙 IdP 並存、但雙 IdP 的 break-glass 跟 lifecycle 路徑要重新設計。Entra ID 同時承擔 consumer-side 跟 partner-side 的 multi-tenant app 信任、跟 Auth0 在 B2C 場景有交集。

Azure RBAC 是 cloud resource permission engine、跟 AWS IAM / Google Cloud IAM 同層 — 都在解「身份對 cloud resource 能做什麼」。差異在 scope hierarchy — Azure 用 Management Group → Subscription → Resource Group → Resource 四層繼承、AWS 用 account + organization、Google 用 organization → folder → project。Azure RBAC 預期 role assignment 沿 scope 向下繼承、這跟 AWS 在每個 account 重新指派的習慣不一樣、跨雲團隊轉過來常踩到。

本章目標

讀完本頁、讀者能判斷：

哪一段控制屬於 Entra ID（身份）、哪一段屬於 Azure RBAC（resource permission）、不要把兩層當同一件事
Entra ID tenant 的最低稽核需求（Global Admin、App Registration、Conditional Access、Managed Identity）
Azure RBAC 的 scope 設計、Custom Role 跟 PIM 何時必要
Entra ID 控制面事故的降級路徑、跟 Azure RBAC 出事的徵兆差異

最短判讀路徑

判斷 Azure 雙層體系是否健康、要分兩層各看兩件事、跟「日常操作與決策形狀」段的兩層結構對齊。

Entra ID 層（身份控制面）：

誰能做什麼：Global Admin / Privileged Role Administrator 的人數、是否走 PIM just-in-time、Conditional Access 是否強制 phishing-resistant 認證、break-glass 帳號是否 exclude 自所有 CA policy 又單獨監控
入口如何暴露：App Registration 是否限定 single-tenant、multi-tenant app 的 admin consent 流程是否經審查、Managed Identity 是否取代 service principal client secret

Azure RBAC 層（resource permission）：

誰能對 resource 做什麼：Owner / Contributor 在哪個 scope（Management Group 還是 Subscription）、production 環境是否用 Custom Role 收緊權限、有沒有 standing assignment 該改 PIM
證據是否可回查：Entra ID Sign-in Log / Audit Log 是否同步到 SIEM、Azure Activity Log 是否設保留與 alert、admin consent / role assignment 變更是否觸發 alert runbook

兩層任一邊任一條缺失、就是 Audit Log 與 Authorization 邊界的待補項目。

日常操作與決策形狀

Entra ID 層

User / Group / lifecycle：HRIS 推 SCIM 進 Entra ID、Entra ID 同步到下游 SaaS 跟 Azure RBAC group。決策點是 source of truth — 多數組織把 HRIS 設為人員來源、Entra ID 當分發層、避免雙寫造成 stale account。

Conditional Access 是 MFA 主要強制機制：MFA 不是設在 user 屬性上、是 Conditional Access policy 在登入時判斷 user / device / location / app / risk 後觸發。常見設定錯誤包含 exclude legacy auth 沒做、break-glass 規則太寬、emergency access 帳號沒獨立監控。Conditional Access 規則設計錯、就是高權限 bypass 的入口。

App Registration vs Enterprise Application：開發者註冊 multi-tenant app 走 App Registration（app 的定義）、組織 admin 為某 app 設定 SAML SSO / admin consent 走 Enterprise Application（該 tenant 對 app 的信任）。兩者常被混講、但安全意義不同 — App Registration 是「我們做了一個 app」、Enterprise Application 是「我們信任這個 app 用我們的身份」。Consent phishing 攻擊就是針對後者。

Managed Identity：Azure resource（VM、Function、AKS pod）自帶身份、不需要 service principal client secret、跟 Google Workload Identity Federation 同概念但 Azure-internal。System-assigned 跟 resource 生命週期綁定、resource 刪掉 identity 跟著刪；User-assigned 獨立、可跨 resource 共用。production 環境的服務存取 Key Vault / Storage 應走 Managed Identity、不該用 client secret。

Workload Identity Federation：Entra ID 可以 trust 外部 OIDC issuer（GitHub Actions、AWS、Google）、讓外部 workload 直接拿 Entra ID token、不用儲存 client secret。CI/CD 的 OIDC 整合是這層的主用例、比把 client secret 塞進 CI variable 安全很多。

Signing key 是 control plane 託管：Entra ID 不暴露 signing key、客戶沒有 rotate 它的能力。這層信任邊界一旦失守、客戶側 直接修不了、要等供應商發 patch 或公告 — Storm-0558 揭示了這條依賴的代價。客戶側能做的補強是 下游檢查 而非 上游修復：

訂閱 Microsoft Security Advisory（MSRC）+ tenant-specific notification、讓事件公告第一時間進 IR pipeline、不要靠新聞才知道
SIEM alert anomalous token issuance pattern（跨租戶 token 在 Exchange / Graph API 出現異常存取序列）、不能只信 token signature valid
高敏 app 的 token validation 不只看 Entra ID 標準驗證、加 issuer + tenant + audience + nonce 多層比對、攻擊者偽造跨租戶 token 時可能漏掉某層
Conditional Access 配 token protection（token binding to device）、降低 stolen token replay 的命中率
IR playbook 預設 signing key 事件 一條 — 一旦供應商公告、強制 sign-out 高權限 user、token TTL 收短、回頭看 90 天 sign-in log 找異常
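The multi-layer claim comparison for high-sensitivity apps can be sketched as below. It assumes the token signature has already been verified by a proper library and only shows the additional checks on the `iss`, `tid`, `aud`, and `nonce` claims; the `validate_claims` function and all example values are hypothetical.

```python
def validate_claims(claims: dict, *, expected_issuer: str, expected_tenant: str,
                    expected_audience: str, expected_nonce: str) -> list:
    """Return the list of failed checks; an empty list means every layer matched."""
    failures = []
    if claims.get("iss") != expected_issuer:
        failures.append("issuer")
    if claims.get("tid") != expected_tenant:
        failures.append("tenant")
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    if expected_audience not in audiences:
        failures.append("audience")
    if claims.get("nonce") != expected_nonce:
        failures.append("nonce")
    return failures
```

Returning the names of the failed layers, rather than a single boolean, gives the SIEM something to alert on: a token with a valid signature but a mismatched tenant is exactly the anomaly worth escalating.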

Azure RBAC 層

Scope 設計：role assignment 沿 Management Group → Subscription → Resource Group → Resource 向下繼承。在 Management Group 給 Contributor、底下所有 subscription / RG / resource 都繼承 — 這既是優點（統一治理）也是風險（誤指派擴散範圍大）。設計原則是 指派盡量低、不要對全 Management Group 給 Contributor。
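Downward inheritance can be illustrated with resource ID prefixes. The sketch below covers only the Subscription → Resource Group → Resource part of the hierarchy, where a child's ID starts with its parent's ID; Management Group membership is not a path prefix and would need a separate lookup. The `applies_to` helper and the IDs are hypothetical.

```python
def applies_to(assignment_scope: str, resource_id: str) -> bool:
    """A role assigned at a scope applies to that scope and everything beneath it."""
    scope = assignment_scope.rstrip("/").lower()
    target = resource_id.rstrip("/").lower()
    # The trailing "/" prevents rg1 from matching rg10.
    return target == scope or target.startswith(scope + "/")


RG_SCOPE = "/subscriptions/s1/resourceGroups/rg1"
STORAGE = RG_SCOPE + "/providers/Microsoft.Storage/storageAccounts/sa"
```

The same check explains the risk: the higher the assignment scope, the more resource IDs it prefixes, so a mistaken Contributor assignment spreads to all of them.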

Built-in role vs Custom Role：Owner（含 user access admin）/ Contributor（不含權限管理）/ Reader 是 built-in、通常太粗。production 環境需要 Custom Role 把 Microsoft.Storage/storageAccounts/listKeys/action 之類的高風險 action 收掉、只留 read。Custom Role 是 least privilege 在 Azure 的落實工具、不做就是用 Contributor 當預設、權限過寬。

Privileged Identity Management（PIM）：高權限角色（Global Admin、Subscription Owner、User Access Administrator）應走 just-in-time activation、需要 MFA 跟 approval、不該 permanent assignment。沒上 PIM 的組織通常會發現 standing Global Admin 超過 10 個、那是 phishing / token theft 的高價值靶。

Service principal vs Managed Identity：service principal 是 app 在 Entra ID 的代表、可以用 client secret 或 certificate 認證；Managed Identity 是 service principal 的特殊形式、由 Azure 自動管 credential。能用 Managed Identity 就不用 service principal client secret — 後者要自己 rotate、要存 secret management、容易 stale。

Azure Policy 是 RBAC 的補位：RBAC 管 principal 能不能對 resource 做這個 action、Azure Policy 管 允不允許這樣設定 resource（例如 storage account 強制加密、VM 只能用認可的 image）。RBAC 給 Contributor 的人可以建 storage account、但 Azure Policy 可以拒絕未加密的 storage account 建立 — 兩層互補、缺一不可。

核心取捨表

Azure 雙層體系的取捨要分開看 — 一張表回答 cloud resource permission 該選哪家（Azure RBAC vs AWS IAM vs Google IAM）、一張表回答 workforce IdP 該選哪家（Entra ID vs Okta）。兩個決策獨立、可以混搭（例如：Okta 當 workforce IdP + federate 到 Entra ID + 走 Azure RBAC 管 Azure resource）。

Azure RBAC vs AWS IAM vs Google Cloud IAM

維度	Azure RBAC	AWS IAM	Google Cloud IAM
Scope	Management Group → Subscription → RG → Resource	Account + Organization、policy attach	Organization → Folder → Project
繼承模型	scope 向下繼承	account boundary 強、跨 account 用 assume role	scope 向下繼承、condition 強
自訂角色	Custom Role（JSON）	Custom managed policy（JSON）	Custom Role（YAML / API）
JIT 機制	Privileged Identity Management（PIM）內建	無原生 JIT、要靠 IAM Identity Center / 第三方	無原生 JIT、要靠 third-party / 自建
Workload	Managed Identity（內部）+ Workload Identity Fed	IAM role + OIDC trust	Workload Identity Federation
適合場景	Azure-heavy、M365 整合	AWS-heavy、account isolation 模型成熟	GCP-heavy、resource hierarchy 治理

Entra ID vs Okta（workforce IdP）

維度	Entra ID	Okta
主場	M365 / Azure 原生、跟 RBAC 共生	多雲 + SaaS、跨平台 SSO
MFA 機制	Conditional Access 觸發、Authenticator app / FIDO2	Sign-On / Authentication Policy、多 factor 選擇
Lifecycle	SCIM + cross-tenant sync	SCIM + Lifecycle Management、整合更廣
Workload	Managed Identity / Workload Identity Federation	較弱、CI 通常 federate 到雲 IAM
整合廣度	M365 / Azure / Office app 深、外部 SaaS 比 Okta 少	7000+ SaaS app 預建
第三方風險	Microsoft 控制面（Storm-0558、Midnight Blizzard）	Okta 控制面（2022 / 2023 多起）

選 Entra ID 的核心訴求：M365 / Azure 重度使用、要跟 RBAC + Managed Identity 直接整合、能接受 Microsoft 控制面風險；選 Okta 的核心訴求看 Okta vendor 頁。

進階主題

Conditional Access 進階規則：除了 user / device / location 基本條件、進階場景包含 risk-based（Identity Protection 給的 user risk / sign-in risk）、token protection（token binding 到 device、防止 token replay）、authentication strength（強制 phishing-resistant factor）。production tenant 至少要有「Global Admin 必須走 phishing-resistant + compliant device」這條規則。

Privileged Identity Management（PIM）的設計細節：activation 要求 MFA、approval（高權限角色）、justification、時限（預設 8 小時、最長 24）。Access Review 是 PIM 的配套 — 季度檢視 standing assignment 是否還需要、不需要的撤掉。沒做 Access Review 的 PIM 等於只把問題從 standing 推到 誰申請就給 — 不是 least privilege。

Workload Identity Federation 跨雲：Entra ID 可以 trust GitHub Actions / GitLab / AWS / Google 的 OIDC issuer、讓 CI 直接拿 Azure token。同向也成立 — Azure workload 可以拿 Google ID token federate 進 GCP。多雲 CI 不該存任何 client secret、走 federation 比較安全。

Custom Role 設計實務：用 Microsoft.Authorization/roleDefinitions API 或 portal 定義、actions / notActions / dataActions 各自獨立 — actions 是 control plane、dataActions 是 data plane（讀寫 blob、key vault secret 內容）。常見錯誤是只收 actions 沒收 dataActions、結果 storage account 設定改不了但 blob 內容隨便讀。
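The control plane versus data plane split can be checked with a small evaluator. This is a simplified model of how `actions` / `notActions` and `dataActions` / `notDataActions` combine within one role, using wildcard matching; it is not Azure's authorization engine, and the `allowed` helper and the example role are hypothetical.

```python
from fnmatch import fnmatchcase


def allowed(operation: str, allow_patterns: list, deny_patterns: list) -> bool:
    """Effective permission = matches an allow pattern and matches no deny pattern."""
    op = operation.lower()
    hit = any(fnmatchcase(op, p.lower()) for p in allow_patterns)
    blocked = any(fnmatchcase(op, p.lower()) for p in deny_patterns)
    return hit and not blocked


ROLE = {
    "actions": ["Microsoft.Storage/storageAccounts/*"],
    "notActions": ["Microsoft.Storage/storageAccounts/listKeys/action"],
    "dataActions": [],  # no data-plane access granted
    "notDataActions": [],
}

BLOB_READ = "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read"
can_read_config = allowed("Microsoft.Storage/storageAccounts/read",
                          ROLE["actions"], ROLE["notActions"])
can_list_keys = allowed("Microsoft.Storage/storageAccounts/listKeys/action",
                        ROLE["actions"], ROLE["notActions"])
can_read_blob = allowed(BLOB_READ, ROLE["dataActions"], ROLE["notDataActions"])
```

Evaluating the two planes separately makes the common mistake visible: a role review that inspects only `actions` says nothing about who can read blob contents.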

Azure Policy 跟 Initiative：Policy 是單一規則、Initiative 是 policy 的集合（用來組 baseline、例如 CIS、ISO 27001）。Policy effect 有 audit / deny / deployIfNotExists、後者可以自動補洞（例如自動加 diagnostic setting）。RBAC + Policy 一起設計才是完整的 Authorization 邊界。

排錯與失敗快速判讀

Global Admin 過多：standing Global Admin 超過 5 個就要警惕 — 上 PIM、把日常運維改用 Privileged Role Administrator + 特定 admin role group
Conditional Access 規則漏 legacy auth：規則只 cover modern auth、IMAP / POP / SMTP 等 legacy protocol 不走 CA — 用「Block legacy authentication」baseline policy 補
App Registration / Enterprise Application admin consent 沒審查：使用者自己 consent 把 mail.read 給三方 app、變 consent phishing 入口 — 關閉 user consent、改 admin consent workflow
Service principal client secret 散落：CI / 服務裡有大量 client secret、rotate 沒節奏 — 改 Managed Identity（內部）或 Workload Identity Federation（跨雲 CI）
Subscription Owner 太多：subscription 級 Owner 是高風險、應該收到 Management Group 級 Reader + 必要時 PIM activate Owner
Azure Activity Log 沒進 SIEM：role assignment 變更、Key Vault access policy 變更只在 Azure portal 看得到、沒 alert — 用 Diagnostic Setting 推 Event Hub / Log Analytics、再進 SIEM
Break-glass 帳號 exclude 自所有 CA policy、但沒監控：emergency access 帳號不能被 CA 鎖、但 任何登入都該 alert — 配對 Sign-in Log alert + 季度驗證可用

何時改走其他服務

需求形狀	改走
AWS-only 環境	AWS IAM
GCP-only 環境	Google Cloud IAM
多雲 + 大量 SaaS、IdP 中心化	Okta
Customer / B2C identity	Auth0
自管 IdP / 不接受 SaaS	Keycloak
Secret / Key 管理	7.6 秘密管理與機器憑證治理（Azure Key Vault vendor 頁 S2 批次撰寫中）
偵測訊號（不只 Entra ID 內部）	07 SIEM 章節、04 observability

不在本頁內的主題

Entra ID 完整 SAML / OIDC / SCIM 規格細節
Azure RBAC built-in role 完整清單與 action 對照
Conditional Access policy template 細節
Azure Policy 內建 initiative 完整清單
Microsoft 365 / Defender for Identity 等周邊產品

案例回寫

案例	跟 Entra ID / Azure RBAC 的關係
Azure AD Identity Control Plane 2021	Entra ID 控制面故障外溢到 Teams / SharePoint / Exchange、業務必須有降級與切換策略、不能完全依賴單一 IdP 可用性
Microsoft Storm-0558 Signing Key 2023	signing key 治理失效會跨租戶影響 token 驗證信任、客戶側只能等供應商修復（MSRC / CSRB 公開報告補充了 crash dump / Exchange Online 等具體外洩路徑、屬 case 檔之外的歷史 reference）
Microsoft Storm-0558 Signing Key Chain (red-team)	HSM-bound key 是 control plane 必要前提、跨租戶 token 異常要立即升級、不能等供應商先公告
Failure: Credential Rotation Without Scope	Entra ID app secret 跟 Managed Identity 的 rotation 分域、不該把 service principal client secret 跟 user password 混在同一個 rotation policy

下一步路由

上游：7.2 身分與授權邊界、7.13 偵測覆蓋率與訊號治理
平行：AWS IAM、Google Cloud IAM、Okta
Downstream: 7.6 秘密管理與機器憑證治理 (the secret / key layer that follows Entra ID / Managed Identity; the dedicated Azure Key Vault vendor page is being written in batch S2)
跨模組：8 事故處理 vendor 清單（Entra ID / Azure 事件如何 routing 進 IR 流程）
官方：Microsoft Entra Documentation、Azure RBAC Documentation

Cloud-native Data Policy (BigQuery + S3)

Mon, 18 May 2026 00:00:00 +0000

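The core responsibility of cloud-native data policy is to bind data-layer access control to the storage resource itself and enforce it through the cloud's existing IAM system, without depending on an additional data security platform. This page covers both BigQuery policy tooling (Authorized View / Column-level security / Row-level security / Dynamic Data Masking) and AWS S3 policy tooling (Bucket policy / Access Points / Object Lambda / Macie / Block Public Access). The two sister stacks are each cloud's representative data access control design. They share one page so readers can compare GCP's SQL-native fine-grained approach with AWS's storage-resource-bound approach, not because they are the same kind of tool.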

服務定位

Cloud-native data policy 是 resource-bound access control — 控制邏輯掛在 BigQuery dataset / column / row 或 S3 bucket / object 上、用 Google Cloud IAM / AWS IAM 的 principal 體系做 evaluation。跟 Google DLP 比、DLP 是 content-based discovery + transformation（掃 PII、做 de-id）、本頁工具是 access boundary；典型組合是 DLP 發現 sensitive column → BigQuery policy tag 控制誰能讀 → S3 Object Lambda redact at read time。跟 Microsoft Purview 比、Purview 走 label-driven + 跨 platform（同一個 sensitivity label 跨 SharePoint / Fabric / Azure SQL）、雲端原生 policy 走 resource-bound + 限該雲；雲端原生更貼近 storage、跨雲統一靠商業 platform。跟通用 Cloud IAM 比、IAM 是 resource-level read/write 二分、本頁是 column / row / object-level 細粒度、補 IAM 解不掉的「同一張表只能看自家行」場景。

關鍵張力：資料細粒度 ↔ 跨雲 portability。BigQuery RLS 跟 S3 Access Points 的 policy 語法都是該雲專屬、換雲要重寫；換來的是 free（無額外授權）+ 平台原生效能（不過代理）。多雲 enterprise 若要統一 policy DSL、走 Immuta / Privacera / Snowflake Horizon。

本章目標

讀完本頁、讀者能判斷：

BigQuery 跟 S3 policy 各自能做到什麼層級的細粒度（column / row / object / cross-region）、不能做到什麼
Cloud-native policy 跟 Google DLP / Microsoft Purview 的責任分界、何時要組合使用
Multi-tenant SaaS 在共用 dataset / bucket 場景的 access boundary 設計（BigQuery RLS / S3 Access Points）
何時用雲端原生 policy、何時改走 Immuta / Privacera / Snowflake 跨雲 data security platform

最短判讀路徑

判斷 cloud-native data policy 是否健康、最少看四件事：

BigQuery 側 — RLS / column policy coverage：multi-tenant dataset 是否有 CREATE ROW ACCESS POLICY、sensitive column 是否綁 policy tag、policy tag 上的 IAM 是否走 group 而非 individual user、view-only access 是否走 Authorized View 而非 dataset grant
S3 側 — bucket policy 結構：Block Public Access 是否 account-level 開啟、ACL 是否 disabled（Object Ownership = BucketOwnerEnforced）、共用 bucket 是否走 Access Points 分租戶、跨帳號是否經 AP policy + bucket policy 雙重驗證
Sensitive data discovery 接口：BigQuery 是否接 Google DLP inspection job、Dataplex 是否跑 data classification、S3 是否開 Macie scan、findings 是否進 EventBridge / Security Hub 而非僅 console 看
Audit trail completeness：BigQuery audit log（dataAccess）是否進 Cloud Logging + 進 SIEM、S3 是否開 server access logging + CloudTrail data event（GetObject / PutObject）、跟 Detection Coverage 對齊

四件事任一缺失、就是 Data Residency, Deletion and Evidence Chain 邊界的待補項目。

日常操作與決策形狀

BigQuery 側

Authorized View / Authorized Routine：view 的 SQL definition 可以讀 source dataset、grantee 只要被 grant view 自身就能查、不需要 grant source dataset access。經典「給 analyst 看 aggregate 數據但不給原始 PII row」模式 — analyst 看 SELECT region, count(*) FROM customer 沒問題、但 underlying customer table 從不出現在 analyst IAM。Authorized Routine 是同邏輯延伸到 stored procedure / UDF、適合 logic 比 SELECT 複雜的轉換場景。

Column-level security（policy tag）：在 Data Catalog 建 taxonomy + policy tag、把 BigQuery column schema 綁 tag、policy tag 上設 fine-grained reader role。沒這個 role 的 user 即使有 dataset access、SELECT * 時該 column 會 raise error 或 被 omit。HIPAA / PCI-DSS 對「即使 DBA 也不能 default 看到 PHI / cardholder data」的硬要求、走 policy tag 是技術性 enforcement、不是 procedural control。

Row-level security (RLS)：CREATE ROW ACCESS POLICY tenant_filter ON dataset.table GRANT TO ('group:analysts@org.com') FILTER USING (tenant_id = SESSION_USER())。每個 query 自動 append filter、user 看到的 row 由 policy expression 決定。Multi-tenant SaaS（共用 dataset、每行帶 tenant_id）必用 — 否則 query 必須在 application layer 帶 WHERE、漏一處就是跨 tenant data leak。對應 Snowflake 2024 Credential Abuse 的對照啟示。
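The effect of a row access policy can be emulated locally to make the semantics concrete. The sketch below uses an in-memory SQLite table rather than BigQuery, so it is only an analogy: the table, the rows, and the grantee set are hypothetical, and the filter mirrors the `tenant_id = SESSION_USER()` expression from the policy above. The point it shows is that the filter is applied by the data layer, not by the caller's SQL.

```python
import sqlite3


def build_demo_db():
    """Create a shared multi-tenant table with a tenant_id on every row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (tenant_id TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("alice@org.com", 10), ("alice@org.com", 20), ("bob@org.com", 99)],
    )
    return conn


def select_orders(conn, session_user, grantees):
    """Emulate a row access policy: the filter is always appended.

    A user that no policy grants sees zero rows; a granted user sees only
    rows whose tenant_id equals their own identity.
    """
    if session_user not in grantees:
        return []
    cur = conn.execute(
        "SELECT tenant_id, amount FROM orders WHERE tenant_id = ? ORDER BY amount",
        (session_user,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    print(select_orders(db, "alice@org.com", {"alice@org.com"}))
    print(select_orders(db, "bob@org.com", {"alice@org.com"}))
```

Because the caller never writes the WHERE clause, there is no query in the application that can forget it.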

Dynamic Data Masking：column 上設 masking rule（hash / nullify / partial mask / regex replace）、不同 IAM 角色看不同 mask 程度 — email_address 在 admin 看到原值、在 analyst 看到 ***@example.com、在 external partner 看到 NULL。補 RLS 不足之處：RLS 過濾 哪些 row 看得到、Masking 過濾 看到的 row 內容怎麼呈現；兩者組合解大多數 multi-tenant + multi-role 場景。
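The role-dependent masking described above can be sketched as a plain function. This is an illustration of the behaviour, not BigQuery's masking implementation; the role names (`admin`, `analyst`, anything else treated as an external partner) follow the example in the paragraph, and the function name is hypothetical.

```python
def mask_email(value, role):
    """Return the view of an email address that a given role may see."""
    if role == "admin":
        return value  # original value
    if role == "analyst":
        # Keep the domain, hide the local part.
        _, _, domain = value.partition("@")
        return f"***@{domain}"
    return None  # external partner and unknown roles see NULL


if __name__ == "__main__":
    for role in ("admin", "analyst", "partner"):
        print(role, mask_email("jane@example.com", role))
```

Defaulting unknown roles to the most restrictive result means a new role sees nothing until someone deliberately grants it a mask level.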

Dataplex Data Classification + DLP 整合：Dataplex 走 lake-wide 治理（dataset metadata + lineage + quality）、自動觸發 Google DLP inspection、發現 sensitive column 自動建議 / 套用 policy tag。是 GCP 內部把 discovery → access control 自動化的標準路徑。

S3 側

Block Public Access account-level：2018 推出、2023 起新建 bucket 預設開啟。account-level setting 強制 override 所有 bucket policy / ACL — 即使有 bucket policy 寫 "Principal": "*"、Block Public Access 開啟時也禁止對外暴露。Production AWS 帳號必須 account-level 開、bucket-level 額外加固。是 LastPass 2022 Backup Chain 類事故的 last-line defense。

Bucket policy / IAM policy / ACL (legacy): three evaluation layers. Bucket policy is resource-based and written on the bucket; IAM policy is identity-based and written on the principal; ACL is the legacy object-level mechanism and should be disabled on new buckets. Since April 2023, new buckets default to Object Ownership = BucketOwnerEnforced, which disables ACLs so that all access is decided by bucket policy + IAM. Older buckets should migrate from ACL to bucket policy.

S3 Access Points：每個 bucket 可開多個 Access Point、各有獨立 name + policy + VPC restriction。Multi-tenant 場景（一個 bucket 服務多個 tenant）走「每個 tenant 一個 AP + AP policy 限定 prefix + 限定 VPC」、取代過去「shared bucket + prefix-based IAM」的脆弱模式。對應 Mailchimp 2023 Support Tool Abuse 的對照啟示 — 共用入口需 per-tenant policy boundary、不是 application-layer filtering。

Multi-Region Access Points (MRAP)：跨 region replicated bucket 的單一 global endpoint、自動 route 到最近 region。資料駐留要求高的場景（GDPR / 中國資料法）反而要慎用、因為 read 來源不可預測；對 latency-sensitive 全球分發是 first-class 解法。

Object Lambda Access Points：在 GetObject response path 插 Lambda、做 read-time transformation（redact PII / format conversion / image resize / decrypt + re-encrypt）。同一份 raw object、不同 caller 透過不同 Object Lambda AP 看到不同版本 — 等同 BigQuery Dynamic Data Masking 在 S3 的對應物。但 Lambda 有 cold start + 6MB response limit、不是所有場景都合適。

Macie sensitive data discovery：S3 專屬、scan bucket 找 PII / credential / payment data、findings 進 EventBridge + AWS Security Hub。跟 Google DLP 同層但限 S3、不能掃 RDS / DynamoDB。findings 應自動 route 到 SIEM、不是只在 Macie console 等人看。對應 Progress WS_FTP 2023 File Service Breach 的對照 — 對外檔案服務必有 audit + 異常量 baseline + Macie sensitive content scan。

S3 Object Ownership / ACL disabled：2023+ 預設 ACL disabled、所有新 bucket 應 keep this default、舊 bucket 走 audit + migration（先掃 ACL grant、確認沒人靠 ACL 拿 access、再切換）。混用 ACL + bucket policy 的 bucket 是 access control 漂移最常見的源頭。

核心取捨表

取捨維度	BigQuery policy tooling	S3 policy tooling	Immuta / Privacera	Snowflake Horizon
細粒度層級	Column / Row / cell-level（policy tag + RLS + DDM）	Object-level（prefix-based）+ Object Lambda 內容轉換	Column / Row / cell + 跨平台統一 DSL	Column / Row + Snowflake 平台限定
計費	Free（included in BigQuery）	Free（bucket policy）+ Macie / Object Lambda 用量計費	商業授權、per-user 或 per-data-source	Snowflake 平台費內含
跨雲 portable	GCP only	AWS only	跨 BigQuery / Snowflake / Databricks / S3	Snowflake only
Policy DSL	SQL-native（CREATE ROW ACCESS POLICY、masking SQL）	JSON policy + Lambda 程式碼	統一 attribute-based DSL	SQL-native
Sensitive discovery	DLP / Dataplex 自動整合	Macie（限 S3）	內建 + 跨平台 scan	跨 schema metadata + classification
Audit	Cloud Audit Log dataAccess 細到 column	CloudTrail data event + server access log	跨平台統一 audit trail	Snowflake QUERY_HISTORY
適合場景	GCP-first、BigQuery 為主 data warehouse	AWS-first、S3 為 data lake / 檔案分發	多雲 enterprise、跨平台統一 policy	Snowflake-centric data platform
退場成本	中 — RLS / policy tag 重寫到目標平台	中 — bucket policy / AP 重寫	低 — DSL 抽象可遷移	中 — 限 Snowflake

選雲端原生 policy 的核心訴求：單一雲 + 預算敏感 + 不想引入新 vendor。多雲 enterprise + 統一治理需求高、走 Immuta / Privacera 才能避免兩套 policy 漂移。

進階主題

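BigQuery Authorized View vs RLS trade-off: Authorized View suits shape-based filtering (the grantee sees only aggregates or a specific column subset); RLS suits value-based filtering (the grantee sees only rows where tenant_id matches their own). In practice the two are often combined: the view limits columns, and RLS is added to limit rows. The cost of views is maintenance (schema changes must be mirrored in the view). The risk of RLS is that a wrong policy expression hides data from an entire group of users, so run the policy against a staging tenant before promoting it.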

S3 Access Points + VPC-only restriction：AP policy 可加 "Condition": {"StringEquals": {"aws:SourceVpc": "vpc-xxx"}}、強制只能從特定 VPC access — 跨帳號場景（partner 帳號 access 自家 bucket）必加、避免 partner credential 外洩後可從任意網路位置存取。對應 LastPass 2022 Backup Chain 對照、backup bucket 不該跟 prod bucket 共用 IAM role + 不該允許 internet-wide access。
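A per-tenant Access Point policy with the VPC condition might look like the sketch below. It only builds the JSON document; nothing is sent to AWS. The region, account IDs, access point name, prefix, role ARN, and VPC ID are all hypothetical placeholders, and `s3:GetObject` is chosen here as the single allowed action for a read-only partner.

```python
import json


def access_point_policy(region, account_id, ap_name, tenant_prefix, principal_arn, vpc_id):
    """Build an Access Point policy limited to one prefix and one VPC."""
    ap_arn = f"arn:aws:s3:{region}:{account_id}:accesspoint/{ap_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:GetObject",
                # Objects behind an access point are addressed as <ap-arn>/object/<key>.
                "Resource": f"{ap_arn}/object/{tenant_prefix}/*",
                # Requests from outside this VPC do not match the Allow.
                "Condition": {"StringEquals": {"aws:SourceVpc": vpc_id}},
            }
        ],
    }


if __name__ == "__main__":
    policy = access_point_policy(
        "us-east-1",
        "111122223333",
        "tenant-a-ap",
        "tenant-a",
        "arn:aws:iam::444455556666:role/partner-reader",
        "vpc-0example",
    )
    print(json.dumps(policy, indent=2))
```

Generating the policy from a function, rather than hand-editing JSON per tenant, keeps the prefix and the VPC condition from drifting between tenants.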

Object Lambda redact PII at read time：適合 raw data 已寫入、但不同 consumer 需要不同 view 的場景 — 例如客服查 user record 看到 mask 過的 SSN、合規 audit 帳號看到完整 SSN。Lambda 內部呼叫 Google DLP deid template / Comprehend PII detection / 自家 regex；要注意 cold start 對 latency 的影響、不適合 high-throughput 場景。

Macie automated discovery → SIEM：Macie findings 走 EventBridge rule → Security Hub → 推 Splunk / Elastic Security / Datadog Security — 不該只在 Macie console 看 findings。發現 unencrypted S3 bucket 有 cardholder data 必須觸發 incident response runbook、進 8 事故處理。

跨 region 跟 data residency：BigQuery dataset region + S3 bucket region 是 資料駐留 enforcement 的硬邊界、policy tooling 不能 override。GDPR / 中國資料法場景必須 region pinning + 禁止 Multi-Region replication、policy tag / RLS 無法解決資料離境問題。對應 Data Residency Deletion and Evidence Chain 章節原則。

排錯與失敗快速判讀

BigQuery RLS 設了但 user 還是看到全部 row：policy GRANT TO 沒包該 user 的 group、或 user 有 bigquery.dataOwner role（owner override RLS）— check group membership + 降權到 dataViewer
Column policy tag 沒生效：column 沒 attach tag、或 tag taxonomy 沒在該 project / region — check Data Catalog taxonomy location 跟 dataset region 對齊
S3 bucket 意外 public：Block Public Access account-level 沒開 + bucket policy 寫 "Principal": "*"、或 ACL 殘留 AllUsers grant — 立即開 BPA + audit ACL（aws s3api get-bucket-acl）
Access Point policy 跟 bucket policy 衝突：AP 允許但 bucket policy 拒絕、最後是拒絕（explicit deny 永遠勝）— 兩層都要明確 allow、bucket policy 加 "Principal": {"AWS": "*"} + condition 限定 AP ARN
Macie scan 跑很久 / cost 暴衝：scan 整個 bucket、含 archive prefix、沒設 sampling — 用 sensitive data discovery job with prefix filter + sampling rate、不要 default 全 bucket scan
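Authorized View grantee sees no data: the source dataset read by the view definition has not authorized the view, or the view itself was changed and not re-authorized. Re-add the view to the source dataset's authorized views (for example by editing the dataset's access entries and applying them with bq update --source)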
Object Lambda 慢 / timeout：Lambda cold start + 6MB response limit、大檔案不該走 Object Lambda — 改在寫入時 transform、或用 pre-signed URL 繞過 Object Lambda

何時改走其他服務

需求形狀	改走
跨雲統一 data policy DSL	Immuta / Privacera
Content-based discovery + de-id	Google DLP / Microsoft Purview
Label-driven + Microsoft 365 跨 platform	Microsoft Purview
Application-layer access control	應用層 RBAC / ABAC（Casbin / OPA / Cerbos）
Snowflake-centric data platform	Snowflake Horizon（row access policy / masking policy 平台內建）
通用 cloud resource permission	AWS IAM / Google Cloud IAM
SIEM / detection	Splunk / Elastic Security / Datadog Security

不在本頁內的主題

BigQuery / S3 自身的完整 admin guide（pricing / region / quota）
Encryption-at-rest 細節（KMS 整合走 AWS KMS / Google Cloud KMS 頁）
Azure Data Lake / Azure SQL policy（屬 Azure stack、本頁不涵蓋）
應用層 RBAC framework（Casbin / Cerbos / OPA Rego）
資料庫層 RLS（PostgreSQL RLS / SQL Server Row-Level Security）— 跟雲端原生 storage policy 是不同層

案例回寫

Cloud-native data policy 在 07 案例庫沒有直接 vendor-level 事件、所有 data exfiltration case 都是 access boundary 的對照：

案例	跟 cloud-native data policy 的關係（對照啟示）
Snowflake 2024 Credential Abuse	Multi-tenant SaaS 共用 dataset / schema 必須有 BigQuery RLS / Snowflake row access policy 等技術邊界、即使 credential 外洩攻擊者也只能看授權 row、不能只靠 application-layer WHERE
LastPass 2022 Backup Chain	S3 backup bucket 跟 prod bucket 必須獨立 Access Point + 獨立 IAM role + VPC restriction、同帳號 prefix-based 區隔不夠、Block Public Access 是 last-line
Progress WS_FTP 2023 File Service Breach	對外檔案服務必須有 S3 server access log + CloudTrail data event + Macie sensitive content scan、批量下載靠 GetObject 速率 baseline alert、不是事後檢視
Mailchimp 2023 Support Tool Abuse	共用 bucket 服務多 tenant 必走 S3 Access Points 拆 per-tenant policy、取代 prefix-based ACL 跟 application-layer filtering 的脆弱模式
Data Residency Deletion and Evidence Chain (section)	Cloud-native policy 是 deletion + residency 治理的技術 enforcement 層、region pinning + 禁止 Multi-Region replication + audit log retention 對應章節原則

下一步路由

上游：7.7 資料駐留刪除與證據鏈、Detection Coverage and Signal Governance
平行：Google DLP（discovery + de-id 互補）、Microsoft Purview（label-driven 對照）
下游：Splunk / Elastic Security / Datadog Security（audit log + Macie findings → SIEM）
跨類：AWS IAM / Google Cloud IAM（principal 體系基底）、AWS KMS / Google Cloud KMS（encryption-at-rest）
跨模組：8 事故處理 vendor 清單（data exfiltration incident routing）、1 資料庫模組（database-layer RLS / column policy 對照）
官方：BigQuery column-level security、BigQuery row-level security、Amazon S3 Access Points、Amazon Macie

Trivy

Mon, 18 May 2026 00:00:00 +0000

Trivy 是 Aqua Security 維護的 open-source all-in-one security scanner、Apache 2.0、單一 CLI 涵蓋 container image / filesystem / git repo / Kubernetes / IaC 五種 scan target、額外做 secret / license / SBOM scan。設計目標跟 Snyk 不同 — Snyk 是 SaaS-first、用 server-side dashboard 跨 SCM / 跨 repo 聚合；Trivy 是 CLI-first、零 server、CI runner 自己就能完成所有工作、air-gapped 環境也能跑。商業版 Aqua Platform 加 dashboard / RBAC / policy / runtime defense、但 Trivy 本身免費覆蓋大部分團隊需求。

服務定位

Trivy 的核心定位是 把 supply chain scan 收斂成一個 CLI。同一個 binary 處理 container image、source tree、K8s cluster live state、Terraform / Dockerfile / CloudFormation 配置、secret / license / SBOM — 不需要拼裝多個工具、不需要 SaaS account、不需要 server。跟 Snyk 商業 SaaS 的差異是 資料治理權 在自己這邊（scan 結果不上 vendor cloud）、代價是 跨 repo 集中報表 需要自己拼（用 Trivy Operator 或 Aqua Platform）。

跟 Syft + Grype 的差異是 工具邊界劃法。Anchore Syft 專做 SBOM 生成、Grype 專做 vuln scan、兩個工具靠 SBOM 標準（CycloneDX / SPDX）串接；Trivy 一個 CLI 全包、SBOM 也同樣輸出標準格式。多 vendor 並存環境（例：build pipeline 用 Syft 生 SBOM、release gate 用 Grype scan、跟 SBOM repository 互通）Syft+Grype 模組化較適合；單一團隊單一 pipeline 想 一次裝完 用 Trivy 更直接。

跟 GitHub Advanced Security 的差異是 偵測類型 + 部署面。GHAS 綁 GitHub、SAST（CodeQL）覆蓋深、但容器掃跟 IaC scan 較弱；Trivy 跨 SCM、容器跟 IaC 掃強、但沒 SAST 深度。跟 Clair（RedHat / Quay 內建）或 Anchore Enterprise 比、Trivy 用戶基數大（CNCF Sandbox）、社群更新快、整合面廣（GitLab CI / GitHub Actions / Jenkins / CircleCI 都有官方 step）。

本章目標

讀完本頁、讀者能判斷：

Trivy 的五種 scan target（image / fs / repo / k8s / config）各承擔哪段 supply chain 責任、什麼時候用哪個
Trivy DB 的更新模型（OCI artifact、6 小時 cadence、air-gapped mirror）跟 CI runner 信任邊界
.trivyignore 跟 severity gate 在 CI 怎麼接、exception 治理要設哪些 tripwire
何時用 Trivy、何時改走 Snyk / Syft + Grype / GHAS 的取捨

最短判讀路徑

判斷 Trivy 配置是否健康、最少看四件事：

scan target 覆蓋面：是否 image / fs / config / secret 四類都跑（不是只 scan image）、CI 是否把 dev container / base image / runtime image 全納入 — 漏掉 base image 等於信任 upstream registry
Trivy DB 更新 cadence：CI runner 是否每次都 pull 最新 DB（OCI artifact、預設 6 小時 TTL）、air-gapped 環境是否有內部 mirror（--db-repository 指到內部 registry）、trivy --skip-db-update 是否被誤用
severity gate 是否真的 fail build：Trivy 預設 scan 完 exit 0、CI 不會 fail；需要 --exit-code 1 --severity HIGH,CRITICAL 才會把 PR build 擋下來、否則 scan 結果只在 log、沒人看
.trivyignore 治理：ignore 的 CVE 有 reason + expiration 嗎、quarterly review 流程在嗎、.trivyignore.yaml 有用嗎 — 沒治理的 ignore list 會無限膨脹、最後等於沒 scan

四件事任一缺失、就是 supply chain integrity 邊界的待補項目。

日常操作與決策形狀

The five CLI scan targets: trivy image scans a container image's OS packages and language dependencies; trivy fs scans a source tree (including lockfiles, Dockerfiles, IaC manifests, and secrets); trivy repo scans a git repo directly without a local clone; trivy k8s --report summary cluster scans all workloads in a K8s cluster (images + manifest configuration); trivy config scans IaC configuration only (Terraform / CloudFormation / K8s YAML / Dockerfile / Helm). Local development most often uses trivy fs ., CI most often uses trivy image $IMAGE, and K8s environments run in-cluster scans with Trivy Operator.

Trivy DB（OCI artifact）：Trivy 自己維護 vulnerability DB、以 OCI artifact 形式存在 ghcr.io/aquasecurity/trivy-db、每 6 小時更新一次。CI runner 第一次 scan 自動 pull、後續用 cache。air-gapped 環境（金融 / 政府 / 工控）需要把 DB mirror 到內部 OCI registry、--db-repository internal.registry/trivy-db 指過去。DB 內容是 aggregated source — NVD、GHSA、各 Linux distro security advisory、language ecosystem advisory（npm / PyPI / Maven / RubyGems / crates.io / Go / etc.）合在一起、所以單一查詢就能跨多生態。

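.trivyignore and .trivyignore.yaml: when a CVE found by a scan has been assessed as no risk (no reachable code path, mitigation already in place, or upstream has not patched but the business is unaffected), record it in .trivyignore (a plain list of CVE IDs) or .trivyignore.yaml (entries can carry expired_at, a statement explaining the reason, and paths, which suits governance better). With the YAML form, the team can require every ignore to have an expiration (quarterly is a reasonable default) and a reason; expired entries stop applying automatically, so the ignore list does not turn into forgotten dead entries. CI should run Trivy with --ignorefile .trivyignore.yaml and, at least quarterly, alert on entries that are about to expire.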
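The "alert on entries about to expire" step could be a small audit script. The sketch below assumes the YAML file has already been parsed into a list of dicts (for example with a YAML library, which is not shown) and that each entry uses the keys `id`, `statement`, and `expired_at`, with `expired_at` parsed into a `date`. The 30-day warning window and the CVE IDs in the test data are hypothetical.

```python
from datetime import date, timedelta


def audit_ignores(entries, today, warn_days=30):
    """Return (cve_id, problem) pairs for ignore entries that need attention."""
    problems = []
    for entry in entries:
        cve = entry.get("id", "<missing id>")
        if not entry.get("statement"):
            problems.append((cve, "missing statement"))
        expires = entry.get("expired_at")
        if expires is None:
            problems.append((cve, "missing expired_at"))
        elif expires < today:
            problems.append((cve, "already expired"))
        elif expires - today <= timedelta(days=warn_days):
            problems.append((cve, "expires soon"))
    return problems


if __name__ == "__main__":
    sample = [
        {"id": "CVE-2024-0001", "statement": "no reachable path", "expired_at": date(2026, 8, 1)},
        {"id": "CVE-2024-0002", "expired_at": date(2026, 6, 1)},
        {"id": "CVE-2024-0003", "statement": "mitigated at ingress"},
    ]
    for cve, problem in audit_ignores(sample, today=date(2026, 5, 18)):
        print(f"{cve}: {problem}")
```

Running this in CI and failing on "missing statement" or "missing expired_at" turns the reason-plus-expiration convention into an enforced rule instead of a review habit.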

Severity gate 是 CI 必設：Trivy 預設 scan 完 print 結果但 exit 0、CI build 不會 fail。要在 CI 真正擋下高風險 PR、必須 trivy image --exit-code 1 --severity HIGH,CRITICAL $IMAGE。Severity 級別（UNKNOWN / LOW / MEDIUM / HIGH / CRITICAL）對應 CVSS score、團隊需要決定 什麼 severity 算 release blocker。常見 baseline：CRITICAL fail PR build、HIGH fail nightly build（給 24 小時修補窗口）、MEDIUM 進 backlog ticket。
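The baseline above (CRITICAL blocks PR builds, HIGH additionally blocks nightly builds) can also be applied to a JSON report after the scan, which lets one scan feed several pipelines. The sketch below assumes a report produced with `--format json` and already loaded into a dict, with findings under `Results[].Vulnerabilities[].Severity`; the pipeline names and the sample report are hypothetical.

```python
from collections import Counter

# Which severities block which pipeline (the baseline described above).
BLOCKING = {
    "pr": {"CRITICAL"},
    "nightly": {"CRITICAL", "HIGH"},
}


def count_severities(report):
    """Count findings per severity across all scan results."""
    counts = Counter()
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            counts[vuln.get("Severity", "UNKNOWN")] += 1
    return counts


def gate_exit_code(report, pipeline):
    """Return 1 if the pipeline's blocking severities are present, else 0."""
    counts = count_severities(report)
    return 1 if any(counts[sev] > 0 for sev in BLOCKING[pipeline]) else 0


if __name__ == "__main__":
    sample = {
        "Results": [
            {"Target": "app", "Vulnerabilities": [{"VulnerabilityID": "CVE-0000-0000", "Severity": "HIGH"}]},
            {"Target": "base-os"},  # a result without findings has no Vulnerabilities key
        ]
    }
    print("pr:", gate_exit_code(sample, "pr"))
    print("nightly:", gate_exit_code(sample, "nightly"))
```

For a single pipeline, the `--exit-code 1 --severity ...` flags on the scan itself are simpler; a post-processing gate like this is mainly useful when different pipelines need different thresholds over the same report.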

SBOM 生成與 scan：trivy image --format cyclonedx --output sbom.json $IMAGE 生 CycloneDX 格式 SBOM、--format spdx-json 生 SPDX。也可以反向 — 拿別人生的 SBOM 餵給 Trivy：trivy sbom sbom.json 跑 vuln scan、不重新解析 image。這個 workflow 跟 Syft + Grype 重疊（Syft 生 SBOM + Grype scan SBOM）、差別是 Trivy 一站完成、Syft+Grype 拆兩階段更模組化。SBOM artifact 進 OCI registry（用 cosign attach）或 SBOM repository（如 Dependency-Track）做長期追蹤。

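Misconfig + Secret + License in one scan: Trivy has four scanner types: vuln (package CVEs), misconfig (IaC configuration errors), secret (hardcoded credentials), and license (license compliance). trivy fs . enables vuln and secret by default; add the others explicitly with --scanners vuln,misconfig,secret,license. Misconfig ships hundreds of built-in policies written in Rego that cover common K8s / Terraform / Docker / CloudFormation mistakes (privileged container / open S3 bucket / 0.0.0.0/0 ingress). The secret scanner uses regex patterns to find common formats such as AWS access keys, GCP service accounts, and Stripe keys. It is not a catch-all, but it is very practical for catching a secret in a developer's pre-commit hook before it leaks.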

Trivy Operator（K8s in-cluster scanner）：K8s 場景的標準配置。Operator 在 cluster 跑、定期 scan 所有 namespace 的 workload、產 CRD reports：VulnerabilityReport（image CVE）、ConfigAuditReport（manifest 配置）、SbomReport、ClusterComplianceReport（CIS Kubernetes Benchmark / NSA Kubernetes Hardening Guide）。Operator 可選配 ValidatingAdmissionWebhook、admission 階段拒絕高風險 image（CVE severity 超門檻）。Reports 是 CRD、可以走 kubectl get vulnerabilityreport 看、也可以 prometheus exporter 出 metric 進 Grafana。

Aqua Platform 整合：Trivy CLI / Operator 結果可以推到 Aqua Platform（商業版）做集中 dashboard、跨 cluster RBAC、policy engine、compliance report、runtime defense（runtime container 監控）。純 CLI 用戶不需要、但企業有多 cluster + 跨團隊 governance 需求時、Aqua Platform 補 server-side aggregation 那塊（對應 Snyk dashboard 的功能）。

核心取捨表

取捨維度	Trivy	Snyk	Syft + Grype	GitHub Advanced Security
部署模型	CLI-only、零 server	SaaS-first、需要 Snyk account	CLI-only、兩個 binary	綁 GitHub、整合在 PR / Code Scanning
授權	Apache 2.0、完全免費	商業 SaaS（Free tier + 付費 plan）	Apache 2.0、完全免費	GitHub Enterprise add-on
Scan target	image / fs / repo / k8s / config	image / SCA / IaC / Code (SAST) / Container	image / fs（SBOM-first）	SAST (CodeQL) + Dependabot + Secret scanning
Vulnerability DB	Trivy DB（OCI artifact、6h cadence、可 mirror）	Snyk Intel（私有、含 reachability data）	Grype DB（GitHub-hosted、可 mirror）	GitHub Advisory DB
Reachability	無	有（Snyk Code reachability）	無	部分（CodeQL data flow）
SBOM 支援	生 + scan（CycloneDX / SPDX）	生（Snyk SBOM）	Syft 生、Grype scan、最完整 SBOM workflow	部分（Dependency Graph）
K8s in-cluster	Trivy Operator（CRD reports + admission）	Snyk Kubernetes（agent-based）	無原生、靠外部 wrapper	無
跨 repo 報表	Trivy 本身無、Aqua Platform 補	Snyk dashboard（強項）	無原生、靠外部	GitHub Security tab（綁 GitHub）
Air-gapped 支援	強 — DB 可 mirror 到內部 registry	弱 — 需要 Snyk SaaS（Snyk On-Prem 商業版另算）	強 — DB 可 mirror	弱 — 綁 GitHub.com
學習曲線	低 — 一個 CLI + 通用 flag	低 — UI 友善、CLI 也順	中 — 兩個工具拼、SBOM 概念要懂	中 — CodeQL query 寫 / 調有門檻
適合場景	CI image scan、K8s scan、air-gapped、OSS-only 預算	跨 SCM 跨 repo 集中治理、SaaS 預算 OK、需 reachability	SBOM 為主軸的 supply chain、多 vendor 互通	GitHub-only + 需要 SAST 深度

選 Trivy 的核心訴求：零 server / OSS-only 預算 / air-gapped 友善 / 一個 CLI 涵蓋 container + IaC + secret。需要跨 SCM 集中 dashboard 跟 reachability 走 Snyk；純 SBOM workflow + 多工具互通走 Syft+Grype；GitHub-only + 重 SAST 走 GHAS。

進階主題

Trivy Operator + admission control：Operator 跑 ValidatingAdmissionWebhook、admission 階段對 Pod spec 的 image 跑 vuln check、超門檻就拒絕創建。對應 supply chain integrity 的 artifact gate at deploy time。組態要小心 — webhook timeout / Trivy DB 不可用 / Operator 自己 down 都會擋住 deploy、production 通常 fail-open（DB 不可用時放行 + alert）而非 fail-close。

Custom check（Rego policy）：Trivy misconfig scanner 用 Rego 寫 policy、可以自己加 custom check（例：禁止特定 namespace 用 hostPath volume、禁止特定 IAM action）。policy 走 --policy ./custom-policies/ 載入、跟內建 policy 一起跑。比 OPA Gatekeeper 簡單（不需要部署 admission webhook、scan-time 就執行）、但 runtime enforcement 還是要靠 Gatekeeper / Kyverno。

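Air-gapped DB sync: in finance / government / industrial-control environments, CI runners cannot reach the internet. The flow: a staging machine with internet access runs trivy image --download-db-only to pull the OCI artifact, pushes it to the internal OCI registry with skopeo copy, and CI runners use --db-repository internal.registry/trivy-db, adding --skip-db-update when the cache is pre-seeded (or pulling from the internal mirror on a schedule). The DB refresh must be scheduled (daily or every 6 hours); otherwise an air-gapped DB that lags by a few days will miss newly published CVEs.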
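A staleness check on the mirrored DB helps catch a sync job that has silently stopped. The sketch below assumes the DB's metadata has been loaded into a dict and carries an `UpdatedAt` timestamp in RFC 3339 UTC form; that field name and the 24-hour threshold are assumptions to verify against the Trivy version in use.

```python
from datetime import datetime, timedelta, timezone


def parse_rfc3339_utc(ts):
    """Parse the 'YYYY-MM-DDTHH:MM:SS' prefix of a timestamp as UTC.

    Fractional seconds and the trailing 'Z' are ignored, so this assumes
    the timestamp is already expressed in UTC.
    """
    return datetime.strptime(ts[:19], "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)


def db_is_stale(metadata, now, max_age=timedelta(hours=24)):
    """True if the DB was last updated longer ago than max_age."""
    updated = parse_rfc3339_utc(metadata["UpdatedAt"])
    return now - updated > max_age


if __name__ == "__main__":
    # Hypothetical metadata, for illustration only.
    meta = {"UpdatedAt": "2026-05-18T00:00:00.123456789Z"}
    print(db_is_stale(meta, datetime.now(timezone.utc)))
```

Wiring this into the CI job, and alerting rather than failing the build, makes mirror lag visible without blocking every pipeline when the sync breaks.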

Cosign + SLSA + Trivy 三件事：Trivy 看的是 known CVE、看不到 build-time backdoor。配套需要 Sigstore cosign 做 image signature verify（確認 image 真的是自家 CI 出的）+ SLSA provenance（build pipeline 不可篡改紀錄）+ Trivy scan（known CVE）三件事一起、才是完整 supply chain trust chain。對應 Cert-manager 在 TLS 的角色、Trivy 在 supply chain 的角色是 已知漏洞檢測、不是 trust establishment。

排錯與失敗快速判讀

CI 顯示 scan 完但 build 沒 fail：忘了 --exit-code 1 --severity HIGH,CRITICAL、scan 結果只在 log、PR 一直 merge 進高風險 image — 補 severity gate flag、設 baseline
Trivy DB 拉不下來 / 過期：CI runner 沒對外網 / GitHub Container Registry 被擋 / DB cache 太舊 — 設內部 OCI mirror、CI runner --db-repository 指過去、排程 update
.trivyignore 無限膨脹：用純 list 沒 expiration、團隊找不到誰加的 / 為什麼加 — 改 .trivyignore.yaml 強制 reason + expiration、quarterly review 排進 sprint
false positive 多到 alert fatigue：base image 自帶大量未修補 OS package、scan 出 50+ HIGH — 換 distroless / Chainguard / Wolfi 等 minimal base image、或 multi-stage build 只保留必要 binary、不是調高門檻當沒看到
secret scanner 漏報：hardcoded credential 是非標準格式（內部 token、特殊 vendor key）— 加 custom secret pattern、或配合 dedicated tool（Gitleaks / GitGuardian）做第二道
Trivy Operator 報表沒人看：reports 是 CRD、kubectl get 才看到、PR / Slack 沒通知 — 接 prometheus exporter + Grafana alert、或 webhook 推 Slack
K8s admission webhook fail 擋住 deploy：Operator down / DB 不可用、所有 Pod 創建被拒 — webhook 配 failurePolicy: Ignore、production 通常 fail-open + alert、不是 fail-close

何時改走其他服務

需求形狀	改走
需 reachability / 跨 SCM dashboard	Snyk
SBOM-first / 多工具互通	Syft + Grype
SAST 深度 / GitHub-only	GitHub Advanced Security（CodeQL）
純依賴升級自動化	Dependabot
Runtime container monitoring	Falco / Cilium Tetragon / Aqua Runtime（商業版）
TLS / mTLS cert lifecycle	cert-manager
Image signing / provenance	Sigstore cosign + SLSA framework

不在本頁內的主題

Trivy CLI 所有 flag 跟 output format 完整 reference
Rego policy language 完整語法（OPA / Rego 自有體系）
Aqua Platform 商業版完整功能矩陣（dashboard / RBAC / runtime defense）
各 PCI DSS / SOC 2 / FedRAMP 合規 mapping
跟其他 scanner（Clair / Anchore Enterprise / Twistlock）的逐項比較

案例回寫

Trivy 在 07 案例庫沒有 直接 vendor-level 事件（Trivy 本身 OSS、無 vendor-side 控制面風險）、但 supply chain 案例都對應 Trivy 的能力與邊界：

案例	跟 Trivy 的關係
Log4Shell CVE-2021-44228	對照啟示 — CVE 公開後 Trivy DB 幾小時內更新、scan container image 找受影響 service 是緊急 response 主軸；air-gapped 環境 DB mirror 更新節奏直接決定窗口期長度
SolarWinds 2020 Sunburst	對照啟示 — Trivy scan known CVE、看不到 build-time backdoor 植入；必須配合 image signing（cosign）+ SLSA provenance 才完整
3CX 2023 Desktop App Supply Chain	對照啟示 — container scan 看 image layer 內 known CVE、看不到 runtime callback / dynamic load；需配合 runtime monitoring（Falco / Tetragon）
XZ Backdoor 2024	對照啟示 — Trivy 比對 package name + version 對應 CVE、看不到 maintainer takeover；mitigation 走 SBOM provenance + maintainer trust baseline
7.12 供應鏈完整性與 Artifact 信任	章節原則 — Trivy 是 known CVE 檢測、SBOM + signing + provenance 三件事一起才形成完整 trust chain

下一步路由

上游：7.12 供應鏈完整性與 Artifact 信任
平行：Snyk、Syft + Grype、GitHub Advanced Security、Dependabot
下游：7.3 入口治理與伺服器防護（image 漏洞最終影響的是 origin server 風險面）
跨類：cert-manager（TLS lifecycle）、HashiCorp Vault（secret rotation 對應 Trivy secret scan 找到的 hardcoded credential）
跨模組：8 事故處理 vendor 清單（CVE 緊急 response 流程 / 高風險 image rollback）
官方：Trivy Documentation、Trivy Operator

AWS Aurora

Wed, 13 May 2026 00:00:00 +0000

Aurora 是 AWS managed PostgreSQL / MySQL、把 storage layer 重寫成跨 AZ 分散式 log service、保留 wire protocol 相容。Netflix 把多套 RDBMS 統一到 Aurora（+75% 效能、-28% 成本）、DraftKings 撐每分鐘 100 萬 ops 體育博彩、Standard Chartered 跨 7 個受監管市場、FanDuel 處理 Super Bowl 5-10 倍峰值 — 是 SQL OLTP managed 服務的代表。

教學路線：Managed SQL 與平台責任轉移

Aurora 服務頁的教學目標是把 PostgreSQL / MySQL 語意延伸到 AWS managed storage / compute 分離模型。讀者讀完後要能判斷哪些責任交給 Aurora，哪些責任仍留在 schema、query、maintenance window、region 與成本治理。

學習段	核心問題	對應段落
Managed SQL	Aurora 如何保留 PostgreSQL / MySQL 語意並改變操作責任	定位、適用場景
Storage / compute	分離 storage layer 如何影響 replica、failover、backup	容量規劃要點、案例對照
AWS operation model	parameter group、maintenance、region、cost 如何成為平台責任	跟其他 vendor 的取捨、RTO / RPO
Peak workload	金融、串流、Super Bowl、banking case 如何提供容量判準	適用場景、案例對照
替代路由	何時留 RDS、自管 PostgreSQL / MySQL、轉 Spanner 或 DynamoDB	不適用場景、下一步路由

定位：storage / compute 分離的 SQL

Aurora 跟傳統 PostgreSQL / MySQL primary 最大差異是 storage layer 重寫。傳統 SQL primary 把 storage 跟 CPU / RAM 綁定、storage 擴容要換 instance、replication lag 受 compute 影響。Aurora 把 storage 拉到分散式 log service、跨 6 個 storage node（3 AZ × 2 node）、storage 跟 compute 獨立擴。
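The 6-node layout pairs with the 4-of-6 write / 3-of-6 read quorum cited in this page's deep-article table. The arithmetic behind those numbers can be checked with a short sketch; the function and key names are hypothetical, and the properties tested are the standard quorum conditions rather than anything taken from Aurora's implementation.

```python
def quorum_properties(copies, write_quorum, read_quorum, copies_per_az):
    """Check the standard quorum conditions for a replicated log."""
    return {
        # Any read quorum overlaps any write quorum, so a read sees the latest write.
        "read_sees_latest_write": read_quorum + write_quorum > copies,
        # Any two write quorums overlap, so two conflicting writes cannot both succeed.
        "no_conflicting_writes": 2 * write_quorum > copies,
        # Losing a whole AZ still leaves enough copies to write.
        "writes_survive_az_loss": copies - copies_per_az >= write_quorum,
        # Losing an AZ plus one more node still leaves enough copies to read.
        "reads_survive_az_plus_one": copies - copies_per_az - 1 >= read_quorum,
    }


if __name__ == "__main__":
    # 3 AZ x 2 nodes, 4-of-6 write, 3-of-6 read.
    for name, holds in quorum_properties(6, 4, 3, 2).items():
        print(f"{name}: {holds}")
```

With 6 copies, 4 is the smallest write quorum that satisfies both overlap conditions together with a read quorum of 3, which is why the layout tolerates an AZ failure for writes and an AZ plus one node for reads.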

容量特性：

單一 cluster 最高 storage：128 TB
最多 15 個 read replica（單 region 內）
read replica replication lag：10-30ms（vs 傳統 PostgreSQL 跨 AZ 可能秒級）
跨 AZ failover：< 30 秒（promote read replica）
Aurora Global Database 跨 region replication：< 1 秒典型 lag

為什麼這個分離很重要：

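Read replicas of a traditional PostgreSQL primary rely on WAL streaming replication and must replay every change themselves, so they slow down along with the primary's write load
Aurora writes directly to 6 storage nodes, and read replicas read from that shared storage instead of receiving data from the primary
→ read replica lag drops sharply, so replicas can carry more OLTP read traffic
This is the key reason behind the +75% performance gain in 9.C23 Netflix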

適用場景

按公開 case 提煉的典型適用場景：

1. 既有 PostgreSQL / MySQL 應用想要 managed：

wire protocol 相容，應用層改動通常集中在連線、參數與操作流程
ORM / driver / SQL 多數可保留，但 migration plan 仍要驗證 dialect 與 extension
對應案例：9.C23 Netflix — 多套 RDBMS（PostgreSQL、MySQL、Oracle）統一到 Aurora、+75% 效能、-28% 成本

2. 金融交易 / 體育博彩 OLTP：

強 ACID transaction
多 read replica 處理 query traffic、不影響寫
對應案例：9.C4 DraftKings — 每分鐘 100 萬 ops、200 個獨立資料庫、Super Bowl 流量 +50% 無影響

3. 受監管產業跨市場部署：

每個市場一個獨立 cluster、合規分割
對應案例：9.C14 Standard Chartered — 7 個受監管市場、各自獨立 Aurora、總吞吐 4000 TPS、10x 提升

4. 高峰流量 + 多 read replica 擴容：

read 高峰用 read replica 接、write 走 primary
對應案例：9.C28 FanDuel — 5-10x Super Bowl 峰值、直播 + 投注雙工作負載

5. Aurora Serverless v2 適用場景：

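Unpredictable or bursty traffic without a sustained high baseline
CPU / RAM scale automatically, which reduces the burden of managing instance classes
Good fit: dev / test environments and multi-tenant SaaS with sparse traffic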

6. Aurora Global Database：

跨 region async replication（< 1 秒 typical）
DR + 跨地理 read（write 在 primary region、read 可從 secondary region）
Global Database 是跨 region DR / read route，multi-region active-active write 要改看 Aurora DSQL

不適用場景

1. 跨雲需求：

Aurora 是 AWS-only、wire protocol 相容但 storage 是 AWS 專屬
替代：自管 PostgreSQL / MySQL on Kubernetes

2. 需要最新 upstream PostgreSQL / MySQL 特性：

Aurora 通常落後 upstream 1-2 個 major version
替代：RDS PostgreSQL（更接近 upstream）

3. 極端寫入吞吐：

單一 primary 寫入受 storage 設計限制（雖然比 PostgreSQL 快）
100K WPS 級別、考慮 sharding、CockroachDB、或 DynamoDB
對應 9.C29 Lemino — RDB connection limit 是 bottleneck、改 DynamoDB

4. 全球 multi-region active-active write：

Aurora Global Database 是 async、有 lag，write 仍集中在 primary region
替代：Aurora DSQL（2024 推出）、Spanner、Cosmos DB

5. 預算敏感的小 workload：

Aurora 比 self-managed PostgreSQL 貴 20-30%
小流量場景、自管 PostgreSQL on EC2 或 RDS 更便宜

跟其他 vendor 的取捨

vs RDS PostgreSQL / MySQL（同 AWS）：

Aurora：storage / compute 分離、更多 read replica、更快 failover、跨 AZ 自動 replication
RDS：純 managed PostgreSQL / MySQL、不重寫 storage、更接近 upstream
選 Aurora：需要 scale read replica 或 cross-AZ failover < 30 秒
選 RDS：需要最新 upstream 特性、預算更敏感

vs 自管 PostgreSQL / MySQL：

Aurora：託管、自動 backup / failover，降低日常 database operation
自管：彈性高、可自己 tuning、跨雲可用、預算可控
選 Aurora：團隊想把 DBA / SRE 操作責任轉交 AWS、AWS 生態深
選自管：跨雲需求、需要客製化、預算極敏感

vs CockroachDB：

Aurora：single-region scaling（一個 region 內擴）、AWS-only
CockroachDB：multi-region 強一致、跨雲可用、PostgreSQL wire protocol
選 Aurora：AWS-only + single-region OLTP
選 CockroachDB：需要 multi-region 強一致 + 跨雲 / on-prem 彈性

vs Aurora DSQL（2024-12 preview / 2025-05 GA）：

Aurora：single-region scaling、傳統 OLTP
Aurora DSQL：multi-region active-active write、serverless、強一致
選 Aurora：流量集中在一個 region
選 Aurora DSQL：需要全球 active-active
從 PG / Aurora PG 遷 DSQL 的完整 playbook 見 PG → Aurora DSQL Migration

vs DynamoDB：

詳見 DynamoDB vendor page 對比段。Aurora 是 SQL、DynamoDB 是 KV、適用場景不同。

vs Azure SQL Hyperscale：

設計理念類似（storage / compute 分離）
Aurora 在 AWS、Hyperscale 在 Azure
對應案例：9.C32 Clearent — Azure 生態的同類設計、5 億 payment txn / 年

容量規劃要點

從 09 案例庫提煉的 Aurora 容量規劃實踐：

1. read replica 是擴 read traffic 的主要工具：

最多 15 個 read replica、replication lag 10-30ms
read replica autoscaler 按 CPU / connection 自動加減
對應 9.C4 DraftKings 用多個 read replica 處理「比賽期間用戶查 balance」流量

2. 200 個獨立 cluster 模式：

In practice, Aurora deployments are usually split by business domain into many small, bounded clusters (200 at 9.C4 DraftKings) to limit blast radius
This matches the microservice private-store pattern (the same thinking as 9.C23 Netflix)

3. Aurora I/O-Optimized：

2023-05 推出的 storage 配置
適合 I/O-heavy workload（write 多、scan 多）
比 standard storage 貴、但少 I/O 收費
對應 9.C4 DraftKings 用 I/O-Optimized 加速

4. Aurora Serverless v2：

ACU（Aurora Capacity Unit）為單位、自動 scale 0.5-128 ACU
適合 dev / test、稀疏 workload、unpredictable burst
不適合：sustained predictable high workload（provisioned 便宜）

5. Cross-region Global Database：

< 1 秒 typical replication lag、但是 async
secondary region 可 read，write 仍回 primary region
DR 切換通常 1-2 分鐘
對應 9.C14 Standard Chartered — 跨市場各自獨立 Aurora，合規邊界優先於 Global Database

6. Connection pool 仍是隱性限制：

Like traditional PostgreSQL, Aurora has an upper limit on the number of backend connections
應用層 + Aurora 之間建議用 RDS Proxy 做 pool 共享
對應 9.C29 Lemino — RDB connection limit 是 surge 場景的 bottleneck；Lemino 案例發生在 RDS，但 connection-bound 機制同樣適用 Aurora
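A back-of-the-envelope check shows when per-instance pools start to collide with the database connection limit. Every number in the sketch is hypothetical, including the 80% headroom fraction; the real limit depends on the instance class and parameter group.

```python
def backend_connections(app_instances, pool_size_per_instance):
    """Worst-case connections opened when every pool is fully used."""
    return app_instances * pool_size_per_instance


def needs_shared_pool(app_instances, pool_size_per_instance, max_connections, headroom=0.8):
    """True if worst-case demand exceeds the share of the limit kept for apps.

    The remaining share (20% by default) is left for admin sessions,
    migrations, and failover reconnects.
    """
    demand = backend_connections(app_instances, pool_size_per_instance)
    return demand > max_connections * headroom


if __name__ == "__main__":
    # Hypothetical fleet sizes against a hypothetical limit of 2000 connections.
    for instances in (50, 200):
        print(instances, "instances ->", needs_shared_pool(instances, 20, 2000))
```

The demand grows with the number of application instances, not with query volume, which is why surge autoscaling (or Lambda concurrency) hits the limit before CPU does and why a shared pool such as RDS Proxy is the usual answer.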

Deep article（已完成）

本 vendor 現有 deep article 覆蓋 Aurora 從 storage architecture、fleet 治理到容量彈性、連線管理與 distributed 升級門檻的核心 production 議題：

主題	文章	對應 production 議題
quorum-based 分散式 log、韌性即性能、6-way replication	storage-architecture	4-of-6 write / 3-of-6 read、DraftKings 6ms 寫 / <1ms 讀 production reference
Cross-AZ failover lifecycle、< 30 秒 RTO、endpoint routing	cross-az-failover-rto	application DNS cache + connection pool 對齊、Standard Chartered 受監管獨立 cluster 而非 Global Database failover
15 replica 上限、lag profile、headroom 預留、fleet 治理 3 條 driver	read-replica-scaling	Aurora fleet 治理 SSoT、DraftKings headroom 預留、FanDuel 雙 SLO 並行
跨 region async replication、< 1 秒 lag、合規 anti-recommendation	global-database-multi-region	planned vs unplanned failover RTO、Standard Chartered 合規禁止跨境複製反指標
從自管 PostgreSQL / MySQL 遷到 Aurora（Type C operational redesign）	migrate-from-self-managed-pg-mysql	Standard Chartered 合規 lead time、Netflix 非 all-purpose store 邊界
ACU 自動擴縮、min/max 設定、混合 cluster、成本 crossover	serverless-v2-scaling	離峰浪費 vs 尖峰不足、穩定高負載 serverless 反而更貴
多 cluster 業務切分、blast radius 隔離、fleet 治理	multi-cluster-business-split	Netflix 微服務私有 store + DB 種類 consolidation 雙重成立
RDS Proxy connection multiplexing、pinning 陷阱、failover 加速	rds-proxy-connection-pooling	Lambda 連線風暴、pinning 讓 multiplexing 失效
standard Aurora vs Aurora DSQL 升級門檻取捨	aurora-vs-dsql-tradeoff	single-writer 上限 vs active-active distributed、何時跨 paradigm
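
The 4-of-6 write / 3-of-6 read quorum in the storage-architecture row can be checked with a few lines of arithmetic. The following is a minimal Python sketch, assuming the two standard quorum rules (a read set must overlap the latest write set, and any two write sets must overlap). The function name `quorum_properties` is illustrative and is not part of any AWS API.

```python
def quorum_properties(copies: int, write_quorum: int, read_quorum: int) -> dict:
    """Check the two standard quorum rules and the tolerated copy losses."""
    return {
        # A read set always intersects the most recent write set.
        "read_sees_latest_write": read_quorum + write_quorum > copies,
        # Any two write sets intersect, so writes cannot diverge.
        "writes_cannot_conflict": write_quorum > copies / 2,
        # Copies that may be lost while the operation still reaches quorum.
        "write_tolerates_lost_copies": copies - write_quorum,
        "read_tolerates_lost_copies": copies - read_quorum,
    }


aurora = quorum_properties(copies=6, write_quorum=4, read_quorum=3)
print(aurora)
```

With six copies, writes survive the loss of two copies and reads survive the loss of three, which is why the quorum sizes differ.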

I/O-Optimized vs Standard 成本對比由 Aurora PostgreSQL I/O-Optimized Cost 主寫（storage I/O 成本模型 SSoT），本 vendor 各篇提到 storage 成本時 cross-link 它、不重複展開。

跨 vendor entry：先看 CockroachDB vs Aurora DSQL vs Spanner 決策樹（distributed SQL 三選一 + 撞牆訊號分型），再決定是否進 Aurora overview。

後續擴充（仍待補）

Aurora Global Database write forwarding 深入
Babelfish（SQL Server 相容層）適用判斷
Blue/Green deployment 做 major version 升級
Backup / PITR restore drill（hands-on lab）

Anti-recommendation 與升級路由

Aurora 的 managed SQL 能把大量操作責任交給 AWS，但它仍保留 single-primary SQL 的資料模型與交易邊界。這一段先說何時維持 RDS / Aurora，再說何時升級 Global Database、Serverless v2、RDS Proxy、Aurora DSQL 或 DynamoDB。

機制 / 路線	維持簡單設計的條件	升級訊號	主要引用路徑
RDS PostgreSQL / MySQL	upstream 相容、成本、版本節奏比 storage 分離更重要	read replica lag、backup / failover、storage growth 成主題	PostgreSQL vendor、MySQL vendor
Aurora provisioned	workload sustained、容量可預測、團隊能管理 instance class	read replica、fast failover、storage autoscale 是主要需求	Replication Lag、Failover
Aurora Serverless v2	sustained workload 已穩定且 provisioned 成本較低	稀疏 tenant、dev/test、不可預測 burst	Cost Per Request、Scheduled Scaling
RDS Proxy	application pool 已能控制 backend connection	Lambda / surge / connection storm 造成 pool 壓力	Connection Pool
Global Database	single-region DR 已符合 RTO/RPO	跨 region read、regional DR、低 RPO 是產品需求	RTO、RPO、Stale Read
Aurora DSQL / Spanner / CockroachDB	single-primary write 仍足夠	multi-region active-active write、global strong consistency	1.11 全球分散式 OLTP
DynamoDB	SQL query 與 transaction 仍是主要價值	access pattern 固定、connection-free surge、KV latency 成主題	DynamoDB vendor

Aurora 的簡單路徑是先把 operation transfer 寫清楚。Backup、minor upgrade、storage growth、failover 與 read replica lag 交給平台後，schema design、query shape、transaction boundary、connection pool 與 cost guardrail 仍由 application / SRE 共同承擔。

Global Database 的升級路徑要先定義讀寫方向。它適合 DR 與跨地理 read，若業務需要多 region 同時寫入並保持強一致，應直接進入 Aurora DSQL、Spanner 或 CockroachDB 的 distributed SQL 比較。

已知 limitation 與後續路由

Aurora overview 目前完成 managed SQL 判斷。下一輪 deep article / playbook 應補 storage architecture、RDS Proxy、Global Database、Serverless v2、I/O-Optimized cost、PostgreSQL / MySQL → Aurora migration 與 Aurora → Aurora DSQL 的分歧路徑。

案例對照

案例	規模	教學重點
9.C4 DraftKings	1M ops/min、<1ms reads、6ms writes、200 個 DB	體育博彩金融帳本、按業務切 cluster
9.C14 Standard Chartered	4000 TPS、7 個受監管市場、10x 提升	受監管金融跨市場部署
9.C23 Netflix	+75% 效能、-28% 成本	多套 RDBMS 統一到 Aurora
9.C28 FanDuel	Super Bowl 5-10x peak	直播 + 投注雙工作負載

Aurora case 的讀法是看 operation transfer 如何變成容量與成本結果。DraftKings 與 FanDuel 提供 peak OLTP 訊號，Standard Chartered 提供合規分區訊號，Netflix 則提供多套 RDBMS 整併到 managed SQL 的組織與成本訊號。

反向 sibling 路由

Aurora 的反向 sibling 路由用來避免把 managed SQL 誤讀成唯一升級方向。若讀者從 PostgreSQL / MySQL 章節過來，先對照 PostgreSQL → Aurora 與 MySQL → Aurora；若核心需求是 connection surge，補讀 DynamoDB vendor 與 Lemino case；若核心需求是 multi-region active-active write，轉到 Spanner vendor 或 CockroachDB vendor。

這條路由的判準是先問「保留 SQL + 轉移 operation」是否足夠。答案成立時，Aurora 是 RDS / 自管 MySQL / 自管 PostgreSQL 的 managed endpoint；答案需要改成 global quorum、partition-key access pattern 或 document API 時，Aurora 應退到對照組，而非成為最後選項。

常見陷阱

誤以為 Aurora 等於無限擴：寫吞吐仍受 primary 限制，容量曲線和 distributed SQL 不同
忽略 read replica：把所有 query 打 primary，會浪費 read replica scaling 能力
跨 region 強一致誤解：Global Database 是 async 複製，multi-region active-active 要看 Aurora DSQL / Spanner / CockroachDB
connection pool 忽略：Aurora 仍是 PostgreSQL / MySQL、connection 上限有效
單一巨大 cluster：把所有業務塞進一個 cluster 會放大 blast radius，通常要按業務切

下一步路由

完整 T1 對照：01-database vendors index
平行：DynamoDB vendor page（NoSQL 對比）
上游：1.3 Transaction Boundary / 1.11 全球分散式 OLTP
下游：1.12 大規模 DB 遷移實戰（從 RDS / 自管遷到 Aurora）
跨模組：9.5 瓶頸定位流程、9.6 容量規劃模型
Last reviewed：2026-05-22（Aurora storage / Serverless / Global Database / I/O-Optimized 屬時間敏感 claim）
官方：Amazon Aurora、Aurora storage architecture

Atlassian Statuspage

Fri, 01 May 2026 00:00:00 +0000

Statuspage 是 Atlassian 收購整合的公開狀態頁 SaaS、承擔三個責任：對外公開服務狀態揭露（component / incident / maintenance）、subscriber notification（email / SMS / Slack / Microsoft Teams / webhook / RSS）、自有 domain + branding。是公開狀態頁的事實標準、跟 Opsgenie 同屬 Atlassian 事故處理生態（搭配 Jira Service Management、Confluence post-mortem template）、也跟 PagerDuty / incident.io 等第三方 IR 平台廣泛整合。

服務定位

Statuspage 的定位是 對外狀態頁領導品牌、責任邊界是 把內部 incident state 翻譯成對外可讀的公告、不是 IR workflow 本身。功能涵蓋 component status（operational / degraded / partial outage / major outage / under maintenance）、incident update（lifecycle + template）、scheduled maintenance（pre-announce + auto-publish + auto-resolve）、metrics chart（uptime / latency 公開圖表、來源 Datadog / Pingdom / New Relic / Library）、audience targeting（public / private / partner / per-customer 分軌）。

跟 Opsgenie / Confluence / Jira Service Management 是同生態 — Statuspage 接 Opsgenie alert 自動 create incident draft、incident resolve 自動 publish post-mortem 到 Confluence、JSM ticket 連結 Statuspage incident URL。enterprise polish（custom CSS / 自有 domain / multi-language / SSO admin）是賣點、defaults 也夠用、是大型 SaaS public-facing 的主流選擇。

本章目標

建 Statuspage + 設 component / group
寫第一個 incident update（template-driven）
配置 subscriber notification channels
API 自動化（從 IR 平台 push update）
設定 custom domain + 品牌一致 UI

最短路徑

# 1. Sign up for Statuspage and choose a plan
# 2. Create components (split by service)
# 3. Write a test incident
# 4. Let subscribers self-service subscribe

最短判讀路徑

判斷 Statuspage deployment 是否健康、最少看四件事：

誰能 publish update：admin / page admin / incident manager 的權限分層、incident publish 是否走 template + reviewer、API token 是否分 human ops 跟 machine push 兩條
Component dependency 設計：component 是否對應 使用者可感知的服務面（不是內部 microservice）、group 是否拆得太細導致 status update 散落、dependency map 是否誇大內部架構讓對外公告失焦
Metrics integration：uptime / latency chart 來源是否跟內部 SLO 對齊（Datadog / Pingdom / 自家 API push）、metrics 是否跟 incident state 同步（incident 開了 metrics 還綠燈 = 對外公信力下降）
Audience targeting：public / private / partner page 是否清楚分軌、subscriber list 是否定期清理（離職者 / 失效 email / SMS bounce）、per-customer audience 是否走 SSO 控管

四件事任一缺失、就是 Incident Communication 邊界的待補項目。

日常操作與決策形狀

Component / group 設計

子議題：

Component 對應服務 / API endpoint（粒度跟使用者可感知一致、不是內部服務拓樸）
Group 組織多 component（按產品線 / 區域 / 客戶層）
Status：operational / degraded / partial outage / major outage / under maintenance
Component dependency：parent component 自動匯總 child status（過細會造成內部架構洩漏）

Incident lifecycle + Subscriber

子議題：

Investigating → Identified → Monitoring → Resolved 四段、每段都該推 update
Template（標準措辭、降低 incident commander 寫稿壓力、避免揭露過多內部細節）
Email / SMS / Slack / Microsoft Teams / webhook / RSS subscriber
Subscribe by component（部分訂閱、避免 noise）
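
The four-stage lifecycle above can be enforced on machine-pushed updates before they reach the status page. The following is a minimal Python sketch, assuming updates may repeat a stage or move forward but never move backward; `LIFECYCLE`, `next_allowed` and `validate_update` are hypothetical helper names and are not part of the Statuspage API.

```python
LIFECYCLE = ["investigating", "identified", "monitoring", "resolved"]


def next_allowed(current: str) -> list:
    """Statuses a machine-pushed update may move to from `current`."""
    if current == "resolved":
        return []  # resolved is terminal in this sketch
    idx = LIFECYCLE.index(current)
    # Repeating the current stage (a follow-up update) or moving forward is allowed.
    return LIFECYCLE[idx:]


def validate_update(current: str, proposed: str) -> bool:
    """Reject updates that would move the incident backward."""
    return proposed in next_allowed(current)


print(validate_update("investigating", "identified"))  # forward move
print(validate_update("monitoring", "investigating"))  # backward move
```

A guard like this sits naturally in the integration that pushes updates from the IR platform, so a replayed or out-of-order event cannot reopen a stage that was already passed.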

進階主題（按需閱讀）

Audience-specific page

子議題：public（所有人）/ private（authenticated、內部員工 / 特定客戶）/ partner（B2B 獨立 view）、per-customer / per-region status（大型 SaaS 用、避免單一 region 事故影響全球公信力）

Scheduled maintenance

子議題：提前公告 maintenance window、auto-publish + auto-resolve、跟 change management 流程串接、recurring maintenance 用 template

Subscription management

子議題：email / SMS / Slack / Microsoft Teams / webhook 多通道、bounce 清理、SMS provider 限額（高峰 incident 可能塞車）、subscriber list growth 變廣告管理目標時需 GDPR / CAN-SPAM 治理

Templates

子議題：incident template（standard outage / degraded performance / scheduled maintenance）、避免每次 incident commander 重新寫稿、降低措辭風險

IR 平台整合

子議題：PagerDuty Status Pages integration、incident.io Statuspage sync、Opsgenie incident-to-Statuspage workflow、FireHydrant auto-publish

API automation

子議題：從 IR 平台 push update、跟 Opsgenie alert sync、custom field、API token 分軌（human ops vs machine push）、retry / idempotency

Custom domain + branding

子議題：status.example.com vs example.statuspage.io、custom CSS / logo、多語言、SSO trap（admin SSO 設錯導致 lock-out）

Metrics 公開

子議題：uptime / response time 圖表、來源（Datadog / Pingdom / New Relic / 自家 API push）、metrics 跟 incident state 同步、避免 metrics 綠燈但 incident open

排錯快速判讀

Incident update 沒發：API token 失效 / IR 沒 trigger / template variable 漏帶
Stale status（incident 過了還掛 active）：auto-resolve 規則沒設 / IR 平台 close 沒 sync / oncall 手動忘記 resolve
Subscriber 沒收到：email bounce / SMS provider 限額 / Slack workspace token expired
Component dependency map 過細：把內部 microservice 都拉成 component、對外公告失焦、攻擊面間接洩漏架構
Subscriber list growth 變廣告管理：上萬 subscriber 後接近 marketing list、需 GDPR / CAN-SPAM 治理、定期清離職 + bounce
Component status 跟實際不符：自動 sync 規則錯 / 手動沒更新 / metrics 來源延遲
Custom domain 失效：DNS / SSL cert 過期、Statuspage cert auto-renew 沒 enable
SSO trap：admin SSO 切過去後 IdP 出事、Statuspage admin 進不去、break-glass token 沒留

何時改走其他服務

需求形狀	改走
預算敏感 / 小型團隊	Instatus / Better Stack
OSS / 自管 / 完全 control	Cachet
IR 平台內建 status	FireHydrant
IR workflow + Status 一體	incident.io
內部 only	內部 dashboard（Grafana / Datadog）

選 Statuspage 的核心訴求：enterprise polish + Atlassian 生態整合（Opsgenie / JSM / Confluence）+ subscriber scale（百萬級 email/SMS）+ audience targeting 需求（partner / per-customer page）。中小團隊 / 預算敏感走 Instatus / Better Stack 更划算；IR workflow + status 想一體化走 incident.io。

不在本頁內的主題

完整 API reference / Custom CSS / Statuspage Connect
Atlassian SSO 設定細節（屬 IdP 範疇）
SLA 計算 / SLO dashboard（屬 observability、不屬對外狀態頁）

案例回寫

Statuspage 廣泛使用：GitHub / Cloudflare / Atlassian / Slack / Discord / Datadog / Fastly / Heroku / Reddit / Roblox 等大型 SaaS 的 public-facing status communication 多為 Statuspage 託管、是 對外揭露節奏跟措辭 的事實標準。

案例	對應主題
GitHub cases	Statuspage update 與長尾事故時序
Cloudflare cases	控制面事故的公開揭露節奏
Atlassian cases	自家 Statuspage、14 天長尾事故對外通訊
Slack cases	通訊平台失效時的 status 訊息分軌
Discord cases	Gateway 事故的 component 拆分
Datadog cases	觀測平台失效時的 status 自我宣告
Fastly cases	全球邊緣事故的單頁公開時程
Heroku cases	平台型 Routing 事故的 incident 分層
Reddit cases	Kubernetes 升級事故的對外揭露策略
Roblox cases	長時間核心基礎設施事故的 incident lifecycle

下一步路由

AWS CloudWatch

Fri, 01 May 2026 00:00:00 +0000

CloudWatch 是 AWS 原生 observability 服務、承擔三個責任：AWS 服務內建 metrics / logs / alarms（無需配置）、跨 AWS 服務統一觀測平面、X-Ray + Container Insights + Lambda Insights 等專用擴展。設計取捨偏向「AWS 生態深度整合 + 不用第三方 vendor + 預設 turnkey」、跨雲跟成本是主要限制。

本章目標

讀完本章後、你應該能：

用 AWS CLI / Console 查 CloudWatch metrics / logs / alarms
用 CloudWatch Logs Insights 查詢結構化 logs
配置 alarm + composite alarm + EventBridge integration
用 X-Ray 追蹤 distributed tracing
控制 CloudWatch cost（log ingestion / metric / API call）

最短路徑：5 分鐘把 CloudWatch 跑起來

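# 1. Collect EC2 metrics + logs with the CloudWatch Agent
# TODO: aws-cli + cloudwatch-agent.json config

# 2. Query a metric (the instance ID and time window are placeholders)
aws cloudwatch get-metric-statistics --namespace AWS/EC2 --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2026-05-01T00:00:00Z --end-time 2026-05-01T01:00:00Z \
  --period 300 --statistics Average

# 3. Query with Logs Insights (paste the query into the Logs Insights editor)
# fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc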

日常操作與決策形狀

Metrics / Logs / Alarms 整合

子議題：

Namespace + Dimension + Metric 三層
Custom metric（CLI / SDK / Agent）
Logs group + Log stream + Log event
Alarm + Composite alarm + EventBridge rule

Logs Insights query

子議題：

Query syntax：fields / filter / parse / stats / sort
跟 KQL / LogQL 對照（CloudWatch 自家 syntax）
對應指令：aws logs start-query、aws logs get-query-results

Metrics Math

子議題：

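Arithmetic across metrics (rate / sum / avg)
Suited to calculations that a dashboard or alarm cannot express with a single metric directly
Compared with PromQL: CloudWatch metric math is weaker and has no label join capability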

X-Ray tracing

子議題：

各語言 X-Ray SDK
Sampling rule（rate-based / reservoir）
Service map 自動 build
對應 4.C4 X-Ray to OpenTelemetry 遷移案例
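
The reservoir plus rate sampling rule above can be sketched as follows: a fixed number of requests per second is always traced (the reservoir), and requests beyond that are traced at a fixed rate. This is a minimal Python sketch under that assumption; the class `SamplingRule` is illustrative and is not the X-Ray SDK.

```python
import random


class SamplingRule:
    """Reservoir first, then a fixed rate for the overflow, per one-second window."""

    def __init__(self, reservoir_per_second: int, fixed_rate: float, rng=None):
        self.reservoir_per_second = reservoir_per_second
        self.fixed_rate = fixed_rate
        self.rng = rng or random.Random()
        self._window = None  # the second currently being counted
        self._used = 0       # reservoir slots taken in that second

    def should_sample(self, now_seconds: float) -> bool:
        window = int(now_seconds)
        if window != self._window:
            # A new second started: the reservoir refills.
            self._window, self._used = window, 0
        if self._used < self.reservoir_per_second:
            self._used += 1
            return True
        # Reservoir exhausted: fall back to the percentage rate.
        return self.rng.random() < self.fixed_rate


rule = SamplingRule(reservoir_per_second=1, fixed_rate=0.05, rng=random.Random(0))
decisions = [rule.should_sample(100.0 + i / 1000) for i in range(1000)]
print(sum(decisions), "of", len(decisions), "requests sampled")
```

The reservoir guarantees that low-traffic services still produce traces, while the rate keeps trace volume proportional on busy services.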

Deep Article

Logs Insights 查詢與日誌治理：log group 設計、query syntax、retention policy、cross-account aggregation、subscription filter 與 cost governance
Alarms 與 Composite Alarms 操作實務：Metric Alarm、Anomaly Detection、Composite Alarm 設計、alarm actions、missing data 處理與 cost

進階主題（按需閱讀）

Container Insights / Lambda Insights

子議題：

Container Insights：EKS / ECS metrics + logs 自動採集
Lambda Insights：Lambda runtime metrics + cold start visibility
跟 Prometheus + Grafana 的 K8s 模式對照

CloudWatch Synthetics / RUM

子議題：

Synthetics：canary script 定期 probe
RUM：前端用戶體驗
跟 Datadog Synthetics / RUM 對照

Logs lifecycle

子議題：

Retention（1 day to never expire）
Subscription filter：把 logs 送到 Lambda / Kinesis / S3
Logs to S3 archive
對應 cost 控制

Cost 控制

子議題：

Logs ingestion charge（per GB）
Metrics storage charge（custom metrics + high-resolution）
API call charge（GetMetricData / Logs Insights query）
對應 4.C1 Fintech audit

Amazon Managed Service for Prometheus (AMP)

Subtopics:

AMP: AWS-managed, Prometheus-compatible service for metrics scraped from EKS / ECS
Complements CloudWatch (CloudWatch is AWS-native, AMP follows the OSS standard)
See 4.C6 ADOT EKS

AWS Distro for OpenTelemetry（ADOT）

子議題：

AWS-supported OTel distribution
跟 X-Ray / AMP / CloudWatch 都整合
推薦的 OTel adoption 路徑
對應 4.C6 ADOT EKS

排錯快速判讀

Logs Insights query 過慢

操作原則：query 範圍 + 結果集大時、用 sample 縮範圍。

# TODO: fields @timestamp, @message | limit 100 (test the logic first)

Metric not found

操作原則：metric namespace / dimension 對應錯。判讀：用 aws cloudwatch list-metrics --namespace ... 確認。

Alarm 沒觸發

操作原則：alarm period / evaluation period / datapoints 配置造成延遲或忽略。
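
How evaluation periods and datapoints-to-alarm interact can be sketched as an "M out of N" check. The following is a simplified Python sketch, assuming a greater-than-threshold alarm and treating missing datapoints as not breaching; real CloudWatch missing-data handling is configurable and more involved. The function `alarm_state` is illustrative.

```python
def alarm_state(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Return "ALARM", "OK" or "INSUFFICIENT_DATA" for a greater-than-threshold alarm.

    `datapoints` is ordered oldest to newest, one value per period. None marks a
    missing datapoint, treated here as not breaching (an assumption of this sketch).
    """
    window = datapoints[-evaluation_periods:]
    if len(window) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for v in window if v is not None and v > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


cpu = [40, 95, 50, 92, 60]
print(alarm_state(cpu, threshold=90, evaluation_periods=5, datapoints_to_alarm=2))
print(alarm_state(cpu, threshold=90, evaluation_periods=3, datapoints_to_alarm=3))
```

The same series fires with 2-of-5 and stays quiet with 3-of-3, which is the usual reason an alarm appears to ignore a visible spike.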

X-Ray trace incomplete

Operating principle: the sampling rule is too aggressive, or subsegment context propagation fails. Diagnosis: check the trace timeline in the X-Ray console.

Cost 爆

操作原則：log ingestion 多、custom metric 多、Logs Insights query 量大都會貢獻。判讀：Cost Explorer 看 CloudWatch service breakdown。

何時改走其他服務

需求形狀	改走
多雲 / 跨雲統一	Datadog / Grafana Stack / OTel
進階 APM 體驗	Datadog / Honeycomb
高頻 query / 大量 log	Grafana Stack（Loki）/ Elastic
OTel standard	OTel + ADOT / AMP
GCP / Azure 生態	Cloud Operations / Azure Monitor

不在本頁內的主題

各 AWS 服務的 CloudWatch metric 名稱列表
CloudWatch Synthetics canary script 語法
Logs Insights 完整 query syntax reference
AWS IAM 跟 CloudWatch 的細部權限

案例回寫

直接相關案例

案例	主討論議題
4.C4 X-Ray to OTel	X-Ray 遷出到 OTel
4.C6 ADOT EKS pipeline	AWS Distro + EKS 觀測

跨 vendor 對照

案例	對 CloudWatch 的對應
4.C1 Fintech audit	CloudWatch Logs / S3 archive 作為 audit evidence
4.C3 Healthcare retention	Logs lifecycle / retention 對應資料主權限制
4.C10 規模對照	AWS-only 場景優先 CloudWatch

下一步路由

上游概念：4.17 Telemetry Data Quality
平行 vendor：OpenTelemetry、Cloud Operations
下游能力：4.20 Observability Evidence Package

Chaos Mesh

Fri, 01 May 2026 00:00:00 +0000

Chaos Mesh 是 PingCAP 開源、CNCF incubating 的 Kubernetes-native chaos engineering 平台、承擔三個責任：CRD-driven fault injection（PodChaos / NetworkChaos / IOChaos / StressChaos）、Chaos Workflow（多步驟編排）、Chaos Dashboard 視覺化 + experiment scope 控制。設計取捨偏向「K8s-native + GitOps-friendly + multi-fault types」、適合 K8s 為主的 chaos engineering。

本章目標

讀完本章後、你應該能：

部署 Chaos Mesh 到 K8s cluster
設計 PodChaos / NetworkChaos / IOChaos experiment
用 Chaos Workflow 編排多步驟實驗 + steady state probe
控制 blast radius（namespace / labelSelector / mode）
跟 6.20 Experiment Safety Boundary 對齊 chaos 實驗審批

最短路徑：5 分鐘把 Chaos Mesh 跑起來

# 1. Install
curl -sSL https://mirrors.chaos-mesh.org/v2.7.0/install.sh | bash

# 2. Run the first PodChaos
# TODO: write podchaos.yaml, then kubectl apply -f podchaos.yaml
# TODO: action: pod-kill / selector / mode

# 3. Dashboard (assumes the chaos-mesh namespace; adjust if installed elsewhere)
kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333

日常操作與決策形狀

CRD 設計

子議題：

PodChaos：pod-kill / pod-failure / container-kill
NetworkChaos：delay / loss / duplicate / corrupt / partition
IOChaos：delay / errno / mistake / attrOverride
StressChaos：CPU / memory pressure
對應 GitOps：Helm / Kustomize 管 experiment

Chaos Workflow

子議題：

多步驟 chaos 編排（serial / parallel）
Suspend / resume 控制
Probe（steady state validation）
對應 6.20 Experiment Safety Boundary
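
The serial orchestration with steady state probes can be sketched as a loop in which a failed check stops every later injection. This is a minimal Python sketch of the control flow only; real Chaos Workflows are declared as YAML custom resources, and `run_serial` and the step names are hypothetical.

```python
def run_serial(steps):
    """Run (name, kind, action) steps in order; a failed check aborts the rest.

    kind is "inject" or "check"; action() returns True when the step succeeded.
    """
    completed = []
    for name, kind, action in steps:
        ok = action()
        if kind == "check" and not ok:
            # Steady state is broken: do not widen the blast radius.
            return {"completed": completed, "aborted_at": name}
        completed.append(name)
    return {"completed": completed, "aborted_at": None}


steps = [
    ("network-delay", "inject", lambda: True),
    ("api-latency-ok", "check", lambda: False),  # the probe fails here
    ("pod-kill", "inject", lambda: True),
]
result = run_serial(steps)
print(result)
```

Because the probe fails, the pod-kill step never runs, which is the behavior a steady state gate is meant to guarantee.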

Chaos Dashboard

子議題：

視覺化 experiment timeline
Experiment archive
Event log
RBAC

進階主題（按需閱讀）

Blast radius 控制

子議題：

namespace 限制
labelSelector / value mode（one / all / fixed / fixed-percent / random-max-percent）
annotationSelector
Pause / resume 緊急中止
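
How the mode narrows the pods matched by the selector can be sketched as below. This is a minimal Python sketch of the five modes, assuming percentages round down and a fixed count larger than the match set is capped; `select_targets` is an illustrative function, not Chaos Mesh code.

```python
import math
import random


def select_targets(pods, mode, value=None, rng=None):
    """Pick experiment targets from the pods matched by the selector."""
    rng = rng or random.Random()
    pods = list(pods)
    if mode == "all":
        return pods
    if mode == "one":
        return [rng.choice(pods)] if pods else []
    if mode == "fixed":
        count = int(value)
    elif mode == "fixed-percent":
        count = math.floor(len(pods) * int(value) / 100)
    elif mode == "random-max-percent":
        # A random percentage between 0 and the configured maximum.
        percent = rng.randint(0, int(value))
        count = math.floor(len(pods) * percent / 100)
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Never select more pods than the selector matched.
    return rng.sample(pods, min(count, len(pods)))


pods = [f"pod-{i}" for i in range(10)]
print(len(select_targets(pods, "fixed-percent", 50)))
```

Starting with `one` and moving to `fixed` or `fixed-percent` only after the probes pass keeps each widening of the blast radius an explicit decision.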

Schedule 與 GitOps

子議題：

Schedule CRD 定期 chaos
ArgoCD / Flux 整合
Experiment as code review

跟 LitmusChaos / Gremlin 對比

子議題：

Chaos Mesh：CRD-driven、PingCAP 主導
LitmusChaos: ChaosHub experiments / CNCF incubating
Gremlin：商業 SaaS、跨平台
選擇判讀：K8s OSS first → Chaos Mesh / Litmus；商業跨平台 → Gremlin

Steady state 驗證

子議題：

HTTP / TCP / Pod / podHTTPChaos
Probe success threshold
跟 9.13 SLO 對應 burn rate

排錯快速判讀

Experiment 沒生效

操作原則：先 kubectl describe podchaos 看 status、再看 webhook + RBAC。

Blast radius 過大

操作原則：mode 設 all 或 percent 設太高、影響超出預期。預防：先 dry-run / staging 測試。

Pause 不及時

操作原則：experiment running 中要 pause、不是 delete（delete 不會 cleanup state）。

Dashboard 連不上

操作原則：service 沒暴露、RBAC 不對。

何時改走其他服務

需求形狀	改走
非 K8s 環境	Gremlin / Toxiproxy
AWS-native chaos	AWS Fault Injection Service
K8s + ChaosHub experiment	LitmusChaos
Integration test 模擬故障	Toxiproxy
商業 + GameDay 設計	Gremlin

不在本頁內的主題

完整 CRD spec
Chaos Mesh internal architecture
各 fault type 詳細 parameter

案例回寫

案例方向	對應主題
Netflix：Steady State、Chaos 與 FIT	steady state hypothesis 對應 Chaos Workflow Probe
Netflix：Business-Hours Guardrails	blast radius / pause / mode 控制對應時段策略
Pinterest：快取可靠性與容量驚奇	NetworkChaos / StressChaos 模擬熱點與 cache failure mode
Google：Error Budget 與 Release Gating	chaos finding 對應 SLO burn rate 的回寫

待補 Chaos Mesh customer case：PingCAP / TiDB 客戶 Chaos Mesh 案例、CNCF Chaos Mesh adopters。

下一步路由

上游概念：6.20 Experiment Safety Boundary
平行 vendor：LitmusChaos、Gremlin
下游能力：8 incident response（chaos finding 進 IR 流程）

Terraform / OpenTofu

Fri, 01 May 2026 00:00:00 +0000

Terraform 是 HashiCorp 出品的 IaC 工具、承擔三個責任：declarative infrastructure 配置（HCL）、state-based reconciliation（plan → apply）、跨 provider 抽象（AWS / GCP / Azure / K8s / SaaS）。設計取捨偏向「state-driven + declarative + multi-cloud」、provider 生態最廣。2023 改 BSL 授權、社群 fork OpenTofu（Linux Foundation 託管、MPL 2.0）。

對「跨雲基礎設施管理、團隊協作 IaC、需要 state + plan workflow」這條路徑、Terraform / OpenTofu 是首選。

本章目標

讀完本章後、你應該能：

寫 HCL config（resource / variable / output / module）
設定 remote state（S3 + DynamoDB lock / Terraform Cloud）
設計 module + workspace 結構
跑 plan / apply / destroy 工作流 + GitOps
評估 Terraform vs OpenTofu vs Pulumi vs Crossplane

最短路徑：5 分鐘把 Terraform 跑起來

# 1. Install
brew install hashicorp/tap/terraform   # or: brew install opentofu

# 2. Write main.tf
terraform {
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
}
provider "aws" { region = "us-east-1" }
resource "aws_s3_bucket" "demo" { bucket = "my-tf-demo-bucket" }

# 3. init + plan + apply
terraform init
terraform plan -out=plan.tfplan
terraform apply plan.tfplan

日常操作與決策形狀

HCL config 結構

子議題：

provider / resource / data source / variable / output / locals
terraform block（required_version / required_providers / backend）
Module（reusable group of resources）
對應指令：terraform fmt、terraform validate

State 管理

子議題：

Local state（terraform.tfstate）：dev / 學習用
Remote state（S3 + DynamoDB lock / GCS / Terraform Cloud / Spacelift）
State migration（terraform state mv / rm / import）
State sensitive data 不入 git

Plan / apply workflow

子議題：

terraform plan -out=plan.tfplan（凍結結果）
terraform apply plan.tfplan
Auto-approve（CI / CD）vs manual approve（critical）
對應 GitOps：Atlantis / Terraform Cloud / Spacelift

進階主題（按需閱讀）

Module 設計

子議題：

Module input / output
Module composition（root module → child module）
Public module registry（Terraform Registry / OpenTofu Registry）
Version pinning
對應 Terraform best practice

Workspaces vs directory layout

子議題：

Workspaces：同 module 多 instance（dev / staging / prod）
Directory：每 env 一個 directory
Workspaces 的局限（state 同 backend、env 共享 config）
選擇判讀：強隔離 → directory；快切換 → workspace

Drift detection

子議題：

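Drift = actual infra ≠ Terraform state
Detection: terraform plan shows a diff (terraform plan -refresh-only isolates changes made outside Terraform)
Fixes: terraform import for resources created by hand, terraform apply -refresh-only to accept a real-world change into state, or a normal apply to revert the change back to the config
Automated drift detection (Atlantis / Driftctl)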

Terraform vs OpenTofu

子議題：

2023 Terraform 改 BSL：Linux Foundation fork OpenTofu
OpenTofu 跟 Terraform 1.5 API 相容
之後分歧：OpenTofu 加 state encryption、provider iteration
遷移路徑：替換 binary、import 既有 state

Provider 生態

子議題：

AWS / Azure / GCP（cloud provider）
Kubernetes / Helm（K8s provider）
SaaS: Datadog / PagerDuty / Cloudflare / GitHub
Community provider vs official provider 品質差距

跟 Crossplane / Pulumi 對比

子議題：

Crossplane：K8s-native IaC（用 K8s CRD 管 cloud resource）
Pulumi：用通用語言（TS / Python / Go / C#）寫 IaC
選擇判讀：純 cloud infra → Terraform / OpenTofu；K8s-heavy → Crossplane；developer-first → Pulumi

Terraform Cloud / Spacelift / Atlantis

子議題：

Terraform Cloud（HashiCorp managed）：remote state + run + policy
Spacelift / env0：商業替代
Atlantis：OSS Pull Request automation
對應 GitOps for IaC

排錯快速判讀

State lock stuck

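Operating principle: the DynamoDB lock was not released (process killed). Diagnosis and fix: terraform force-unlock LOCK_ID (use with care; confirm that no other run still holds the lock).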

Plan diff 過大

操作原則：drift 累積 / provider 升級 / config 改太多。判讀：先看 plan output、再決定要不要 apply。

Provider auth fail

操作原則：AWS / GCP credentials 沒設、過期、權限不夠。判讀：AWS_PROFILE / IAM role / GCP ADC 配置。

Module version 衝突

操作原則：root module 跟 child module 用不同 provider version。判讀：terraform providers 看 version constraint。

Apply partial failure

操作原則：apply 中某 resource 失敗、state 一致性問題。判讀：state pull 看當前、可能要 import / state rm 修。

何時改走其他服務

需求形狀	改走
OSI-licensed Terraform	OpenTofu（同模組）
Imperative API	Pulumi
Cloud-specific（單一 cloud）	CloudFormation / Azure Bicep / GCP Deployment Manager
K8s-native IaC	Crossplane
Application config（不是 infra）	Helm / Kustomize / cdk8s
極小場景	CLI / Cloud Shell（不用 IaC）

不在本頁內的主題

完整 HCL syntax reference
各 provider 完整 resource list
Terraform Cloud / Spacelift 商業 feature
Drift detection 工具細節

案例回寫

跨 vendor 對照

案例	對 Terraform 的對應
5.C1 Tradeshift self-managed → EKS	平台遷移期間舊 / 新叢集共通配置基線靠 IaC 表達、批次切流時 module 版本要凍結
5.C2 Condé Nast EKS	多團隊異質集群盤點後、用 module + workspace 把平台基線變成統一可審計的 IaC
5.C5 Miro EKS	Managed EKS 後平台團隊把手動操作改成 IaC + GitOps、自動化取代手動操作
5.C10 規模對照	小型 CLI / 中型單 workspace / 大型 multi-workspace + Atlantis / Spacelift 治理

待補 Terraform 案例：HashiCorp Cloud 大客戶案例、OpenTofu fork 後企業遷移案例、Drift detection 治理案例。

下一步路由

上游概念：5 deployment platform
平行 vendor：Kubernetes（K8s provider）
下游能力：06 reliability（IaC GitOps + release gate）

Caffeine

Tue, 16 Jun 2026 00:00:00 +0000

Caffeine 是 JVM 上的 high-performance process-local cache library、承擔三個責任：在 application 進程內（on-heap）提供奈秒到微秒級的 cache（沒有網路往返）、用 Window TinyLFU 淘汰演算法逼近最佳命中率（優於傳統 LRU）、提供 expire / refresh / size-based eviction 等完整 cache 語意。設計取捨偏向「最低延遲 + 最高命中率 + 嵌進 application」、是 Redis 之外的另一層 cache，不是 Redis 的替代。

對「每個請求重複讀同一份小資料、Redis 的網路往返都嫌慢、資料可在每個實例各存一份」這條路徑、Caffeine 是 process-local 層的標準選擇。它常跟 Redis 組成兩層 cache（Caffeine L1 + Redis L2）、不是二選一。Caffeine 是 Guava Cache 的後繼、由同作者重寫、Spring Boot 等框架的預設 local cache。

本章目標

讀完本章後、你應該能：

用 Maven / Gradle 引入 Caffeine、寫出基本 cache
理解 Window TinyLFU 為何命中率優於 LRU
設計 expire-after-write / refresh-after-write / 容量上限
判斷 process-local cache 跟 Redis 的兩層 cache 分工
評估跨實例 invalidation 的限制與 GC 壓力

最短路徑：引入 Caffeine 寫一個 cache

<dependency>
  <groupId>com.github.ben-manes.caffeine</groupId>
  <artifactId>caffeine</artifactId>
  <version>3.2.4</version>
</dependency>

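import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.time.Duration;

// Basic cache: capacity limit of 10,000 entries, expires 5 minutes after write
Cache<String, User> cache = Caffeine.newBuilder()
    .maximumSize(10_000)
    .expireAfterWrite(Duration.ofMinutes(5))
    .build();

cache.put("user:123", user);
User u = cache.getIfPresent("user:123");

// Loading cache: loads from the source on a miss (replaces hand-written cache-aside)
LoadingCache<String, User> loading = Caffeine.newBuilder()
    .maximumSize(10_000)
    .refreshAfterWrite(Duration.ofMinutes(1))   // asynchronous background refresh, reads are not blocked
    .build(key -> userRepository.findById(key)); // called on miss / refresh
User u2 = loading.get("user:123");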

Caffeine 是 library 不是 server、跑在 application JVM 內、無法 docker 獨立驗證；上面是依官方 API 的範例（API 以 Caffeine wiki 為準）。

日常操作與決策形狀

淘汰與過期策略

Caffeine 把 cache 行為拆成幾個正交的旋鈕。子議題：

maximumSize / maximumWeight：容量上限（筆數或加權大小）、超過用 W-TinyLFU 淘汰
expireAfterWrite：寫入後固定時間過期（資料新鮮度上限）
expireAfterAccess：最後存取後過期（淘汰冷資料）
refreshAfterWrite：到期後背景 refresh、舊值先服務、不阻塞（跟 expire 不同）

Window TinyLFU 淘汰

子議題：

W-TinyLFU 結合 recency（window）+ frequency（TinyLFU sketch）、命中率逼近最佳
比 LRU 更抗一次性掃描污染（scan resistance）、跟 Redis LFU 的動機類似但演算法更先進
frequency 用 count-min sketch 近似、記憶體開銷小
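
The frequency side of W-TinyLFU can be illustrated with a small count-min sketch and an admission check. This is a simplified Python sketch, assuming plain integer counters; it omits the admission window and the periodic counter aging that a full W-TinyLFU uses, and the names `CountMinSketch` and `admit` are illustrative, not Caffeine's internals.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter: never under-counts, may over-count."""

    def __init__(self, width: int = 256, depth: int = 4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _slots(self, key: str):
        # One independent hash per row, derived by salting the same hash function.
        for row in range(self.depth):
            digest = hashlib.blake2b(key.encode(), salt=bytes([row])).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key: str) -> None:
        for row, col in self._slots(key):
            self.rows[row][col] += 1

    def estimate(self, key: str) -> int:
        # The smallest counter is the one least inflated by collisions.
        return min(self.rows[row][col] for row, col in self._slots(key))


def admit(sketch: CountMinSketch, candidate: str, victim: str) -> bool:
    """TinyLFU-style admission: the candidate must be more frequent than the victim."""
    return sketch.estimate(candidate) > sketch.estimate(victim)


sketch = CountMinSketch()
for _ in range(50):
    sketch.add("hot")
for i in range(20):
    sketch.add(f"scan-{i}")  # a one-off scan touches each key once
print(admit(sketch, "scan-1", "hot"))
```

A key seen once during a scan cannot displace a frequently read key, which is the scan resistance that plain LRU lacks.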

兩層 cache（L1 Caffeine + L2 Redis）

子議題：

L1 Caffeine（process-local、奈秒級、每實例一份）擋掉大部分讀
L2 Redis（共享、毫秒級、跨實例一致）擋掉 L1 miss
對應 2.6 high concurrency 的 hot key 兩層解法

進階主題（按需閱讀）

跨實例 invalidation 的根本限制

子議題：

每個 JVM 實例有自己的 Caffeine 副本、一個實例更新不會通知其他實例
解法：短 TTL 容忍 stale、或用 Redis pub/sub 廣播 invalidation 訊息給各實例
這是 process-local cache 的固有取捨：最低延遲換來最弱的跨實例一致性
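
The broadcast invalidation described above can be simulated in a single process. This is a minimal Python sketch, assuming a synchronous in-memory bus as a stand-in for a pub/sub channel such as Redis pub/sub; `InvalidationBus` and `Instance` are hypothetical names, and a real channel delivers asynchronously, so a short TTL remains useful as a backstop.

```python
class InvalidationBus:
    """In-memory stand-in for a pub/sub channel."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, key):
        for callback in self.subscribers:
            callback(key)


class Instance:
    """One application instance with its own process-local cache."""

    def __init__(self, store, bus):
        self.local = {}
        self.store = store  # shared source of truth (a dict in this sketch)
        self.bus = bus
        bus.subscribe(self.invalidate_local)

    def invalidate_local(self, key):
        self.local.pop(key, None)

    def get(self, key):
        if key not in self.local:
            self.local[key] = self.store[key]  # miss: load from the shared store
        return self.local[key]

    def update(self, key, value):
        self.store[key] = value
        self.bus.publish(key)  # notify every instance, including this one


store = {"user:123": "v1"}
bus = InvalidationBus()
a, b = Instance(store, bus), Instance(store, bus)
b.get("user:123")           # instance b caches v1
a.update("user:123", "v2")  # the broadcast drops b's local copy
print(b.get("user:123"))
```

Without the `publish` call, instance b would keep serving v1 until its entry expired, which is the stale-read window the TTL controls.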

GC 壓力與 on-heap vs off-heap

子議題：

Caffeine 預設 on-heap、大 cache 會增加 JVM heap 與 GC 壓力
容量上限要對齊 heap 預算、避免 cache 把 heap 撐爆觸發 full GC
極大 local cache 考慮 off-heap 方案（如 Ehcache 的 off-heap tier），但 Caffeine 本身專注 on-heap

async 與 refresh 語意

子議題：

AsyncCache / AsyncLoadingCache：回傳 CompletableFuture、不阻塞 caller
refreshAfterWrite：到期後第一個讀觸發背景 refresh、舊值立即回、避免 stampede
refresh vs expire 的差異是「舊值能不能先服務」

排錯快速判讀

跨實例讀到舊值

操作原則：process-local cache 各實例獨立、更新不傳播。判讀：縮短 TTL 容忍 stale、或加 Redis pub/sub 廣播 invalidation；強一致需求不該用 process-local cache。

命中率低 / cache 沒效果

操作原則：先看 maximumSize 是否太小（working set 放不下）、再看 TTL 是否太短。判讀：用 recordStats() 看 hit rate / eviction count、對齊 working set。

Full GC 頻繁

操作原則：on-heap cache 太大撐爆 heap。判讀：降 maximumSize 或用 maximumWeight 控制實際記憶體、對齊 JVM heap 預算。

何時改走其他服務

需求形狀	改走
需要跨實例共享 / 一致	Redis / Valkey（共享 cache 層）
非 JVM 語言	該語言的 process-local cache（Go ristretto、Python cachetools 等）
需要持久化 / durable	Redis with AOF / AWS MemoryDB
極大 cache 超過 heap	off-heap cache（Ehcache off-heap）或外部 cache（Redis）
不想管容量 / serverless	Momento（serverless、但有網路延遲）

不在本頁內的主題

Caffeine 完整 API（以官方 wiki 為準）
各 JVM 框架（Spring Cache abstraction）的整合細節
Guava Cache 到 Caffeine 的完整 API 對照
off-heap cache 方案比較

案例回寫

跨 vendor 對照（本模組 case 庫暫無 Caffeine-specific case）

Caffeine 是 library 層元件、本 blog cache case 庫（Meta / Shopify / Netflix / Cloudflare / Tinder / Tubi / Snap）暫無 Caffeine-specific case。以下用 process-local 的角度對照。

案例	對 Caffeine 的對應
2.C4 Meta CacheLib + Kangaroo	CacheLib 是 C++ 的 process-local + flash 分層 library、Caffeine 是 JVM 的 on-heap 對應
2.C8 Meta TAO	TAO 有 application-tier local cache、process-local 擋掉大部分讀的思路一致
9.C6 Tinder	每次互動查多個 cache、process-local L1 可擋掉重複讀、降低 L2（Redis）的 RTT 壓力

待補 Caffeine-specific 案例：L1 Caffeine + L2 Redis 兩層 cache 的 production 命中率分層數據、跨實例 invalidation 的 Redis pub/sub 廣播實作、W-TinyLFU vs LRU 的實測命中率對照。

下一步路由

deep article：Caffeine + Redis 兩層 cache 與跨實例失效（L1+L2 + pub/sub 廣播失效）
上游概念：2.6 high concurrency（hot key 兩層解法）、2.3 TTL eviction
平行 vendor：Redis（兩層 cache 的 L2）、Momento（另一端：serverless）
下游能力：2.7 cache copy boundary（跨實例一致性窗口）

cert-manager

Mon, 18 May 2026 00:00:00 +0000

cert-manager 是 K8s 原生的 certificate lifecycle automation — 把「拿 cert、放 cert、定期 renew」這條從以前需要 cron + certbot + 手動 reload 的鏈、轉成 declarative + controller pattern。使用者在 cluster 內 apply 一個 Certificate resource、cert-manager controller 自動跟 issuer 對話、把 cert 存進 Secret、在 lifetime 2/3 點觸發 renew。它把 cert 這件事接進 K8s 控制循環、跟 Pod / Service / Ingress 同等地位的 first-class resource、層級高於 certbot 的 K8s 移植。

服務定位

cert-manager 的核心責任是 K8s cluster 內所有 cert 的生命週期治理。從 Ingress / Gateway 對外 TLS、internal service mTLS、到 workload-level 短期 cert、都用同一套 declarative model 表達。Issuer 抽象讓底層 cert 來源可換 — 公開 cert 走 Let’s Encrypt ACME、內部 cert 走 Vault PKI engine 或 self-signed CA、企業環境走 Venafi 或 AWS PCA — 上層 Certificate spec 不變。

跟 AWS ACM 的差異是 cert 的部署面：ACM 是 AWS-managed cert、只能掛在 AWS service（ELB / CloudFront / API Gateway）、私鑰永不離 AWS；cert-manager 是 K8s-native client、cert 放在 cluster 內的 Secret、可以掛任何 ingress controller 或 workload mTLS。跟 Let’s Encrypt 的關係是 client vs issuer — cert-manager 是 ACME client、Let’s Encrypt 是 ACME server、不是替代關係。跟 SPIRE 的差異是 身份模型 — cert-manager 給 DNS-named cert（CN / SAN 是 hostname）、SPIRE 給 SPIFFE ID-based workload identity（spiffe://trust-domain/workload）、兩者互補不衝突。

本章目標

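After reading this page, the reader can decide:

Whether to use an Issuer or a ClusterIssuer with cert-manager, and which issuer backend to pair it with (Let’s Encrypt / Vault PKI / self-signed / company CA)
Whether the challenge solver should be HTTP01 or DNS01, and why a wildcard cert must use DNS01
When auto-renewal triggers, when a failed renewal should alert, and which annotations integrate with Ingress / Gateway API
When to use cert-manager, and when to switch to ACM (cloud-native services) or SPIRE (workload identity)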

最短判讀路徑

判斷 cert-manager 部署是否健康、最少看四件事：

Issuer 配置：是 ClusterIssuer（cluster-wide）還是 Issuer（namespace-scoped）、backend 是哪一種（acme / vault / ca / venafi）、credential（ACME private key、Vault token、CA cert）放哪、RBAC 限制誰能參考這個 issuer
Certificate spec：dnsNames / ipAddresses 跟實際 service 一致、duration 跟 renewBefore 比例合理（renewBefore >= duration / 3）、secretName 指向的 Secret 是不是 ingress 真的會讀的那個
Renewal 觸發：controller log 有沒有按時觸發 renew、kubectl describe certificate 的 Renewal Time 接近沒、Challenge resource 沒有卡在 pending
Challenge solver：HTTP01 的 ingress / Gateway 80 port 真的能被 Let’s Encrypt 從 Internet 打到、DNS01 用的 cloud provider credential 還有效、wildcard cert 沒誤用 HTTP01

四件事任一缺失、cert 就會在不知不覺中過期、production 看到 x509: certificate has expired 才驚覺、是 Transport Trust and Certificate Lifecycle 的典型缺口。

日常操作與決策形狀

Issuer vs ClusterIssuer 的選擇：Issuer 是 namespace-scoped、只能 issue 該 namespace 的 cert、適合 單 team 自管 issuer credential 的場景；ClusterIssuer 是 cluster-wide、所有 namespace 都可以參考、適合 平台 team 統一管理 issuer。production 通常用 ClusterIssuer 配特定 issuer backend + RBAC 收 Certificate 建立權（讓 application team 只能在自己 namespace 建 Certificate、不能改 ClusterIssuer）。

Certificate spec 設計：dnsNames 列出該 cert 涵蓋的 hostname（支援 wildcard *.example.com）、ipAddresses 加 IP SAN（mTLS 跨 service 常用）、duration 是 cert 有效期、renewBefore 是提前多久 renew（預設 duration 的 1/3）。短期 cert（hours-level、Vault PKI 常用）配 renewBefore 短、長期 cert（90 天、Let’s Encrypt）配 renewBefore 30 天。secretName 指向 cert-manager 會寫入的 Secret、Ingress 跟 workload 從這個 Secret 讀。

Challenge solver 的選擇：ACME issuer（Let’s Encrypt）需要證明 你控制這個 domain、有兩個方法：HTTP01（在 http://yourdomain/.well-known/acme-challenge/ 放檔案、Let’s Encrypt 從 Internet 來抓）跟 DNS01（在 DNS zone 加 _acme-challenge.yourdomain TXT record、Let’s Encrypt 查 DNS）。wildcard cert（*.example.com）必須用 DNS01、HTTP01 不支援 wildcard 因為 Let’s Encrypt 不知道要打哪個 subdomain。HTTP01 要求 ingress controller 80 port 對 Internet 開放、DNS01 要求 cluster 有 cloud DNS API credential。
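
The solver decision above can be written as a small rule: wildcard names force DNS01, otherwise HTTP01 works when port 80 is reachable from the Internet. This is a minimal Python sketch of that rule only; `required_solver` is a hypothetical helper and not part of cert-manager.

```python
def required_solver(dns_names, port_80_reachable: bool, has_dns_api_credential: bool) -> str:
    """Pick an ACME challenge type from the constraints described above."""
    wildcard = any(name.startswith("*.") for name in dns_names)
    if wildcard:
        # HTTP01 cannot validate a wildcard name.
        if not has_dns_api_credential:
            raise ValueError("wildcard names need DNS01, which needs a DNS API credential")
        return "dns01"
    if port_80_reachable:
        return "http01"
    if has_dns_api_credential:
        return "dns01"
    raise ValueError("no usable solver: open port 80 or provide a DNS API credential")


print(required_solver(["*.example.com"], port_80_reachable=True, has_dns_api_credential=True))
print(required_solver(["api.example.com"], port_80_reachable=True, has_dns_api_credential=False))
```

Clusters with no inbound Internet path on port 80 end up on DNS01 even for single hostnames, so the DNS credential scope matters in both cases.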

Auto-renewal 機制：cert-manager 在 cert lifetime 達到 (duration - renewBefore) 時間時觸發 renew、預設約 lifetime 2/3 點。Let’s Encrypt cert 90 天 = 60 天時開始嘗試 renew、留 30 天緩衝給 renew 失敗的重試。renew 失敗會持續重試（exponential backoff、最長 8 小時間隔）、剩下 ~7 天時 controller log 開始 ERROR 級別 alert — 監控要 hook 進這個 log 訊號、否則 cert 真的過期才知道就太晚。
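
The renewal trigger is simple date arithmetic: issue time plus (duration - renewBefore). This is a minimal Python sketch of that formula, using the 90-day / 30-day figures from the paragraph above and a one-third default when renewBefore is unset; `renewal_time` is an illustrative function, and the issue date is a made-up example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def renewal_time(not_before: datetime, duration: timedelta,
                 renew_before: Optional[timedelta] = None) -> datetime:
    """Moment at which renewal is first attempted."""
    if renew_before is None:
        renew_before = duration / 3  # default: renew at two thirds of the lifetime
    return not_before + (duration - renew_before)


issued = datetime(2026, 1, 1, tzinfo=timezone.utc)
renew_at = renewal_time(issued, timedelta(days=90), timedelta(days=30))
print(renew_at.isoformat(), (renew_at - issued).days, "days after issuance")
```

The gap between `renew_at` and expiry is the retry budget, so shortening renewBefore directly shortens the time available to notice and fix a failing renewal.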

跟 Ingress 整合：Ingress resource 加 annotation cert-manager.io/cluster-issuer: letsencrypt-prod（或 cert-manager.io/issuer:）、cert-manager 看到 Ingress 的 tls.hosts 自動建立對應 Certificate、issue 完寫進 tls.secretName 指定的 Secret、ingress controller 自動 reload 用新 cert。Gateway API 的整合機制類似、用 cert-manager.io/issuer annotation 在 Gateway resource。

CertificateRequest Approval Policy（v1.4+）：每個 Certificate 建立會產生 CertificateRequest、由 Approver 決定要不要送給 issuer。預設 cert-manager 內建 approver 自動 approve、但可以加 admission policy（Kyverno / OPA / 自寫 webhook）限制「誰能在哪個 namespace 建什麼 SAN 的 cert」— 防 internal compromise 任意 issue cert 對外冒名。production 環境通常會在 platform-level 鎖 wildcard cert、防 application team 誤建涵蓋整個 zone 的 cert。

核心取捨表

取捨維度	cert-manager	AWS ACM	手動 certbot / OpenSSL
部署模型	K8s controller、declarative `Certificate` resource	AWS managed、Console / API request	手動跑 CLI、cron 跑 renew
Cert 部署面	K8s Secret、任何 ingress controller / workload	只能掛 ELB / CloudFront / API Gateway	任何地方、但 deploy 要自己做
Issuer 彈性	多 issuer（ACME / Vault / Venafi / CA / AWS PCA）	只能 Amazon CA	任何 ACME provider、但要手寫 hook
Auto-renewal	內建 controller、預設 2/3 lifetime 點 renew	AWS 自動 renew（DNS-validated only）	自己寫 cron + reload script
Wildcard 支援	走 DNS01 challenge	支援、需 DNS 驗證	走 DNS01 hook
私鑰位置	K8s Secret（cluster 內、需 RBAC + etcd encryption）	AWS 內、不可 export	Local filesystem、要自己管
適合場景	K8s cluster 內所有 cert、跨 issuer、internal mTLS	AWS-only serving cert（ELB / CDN）	非 K8s 的 server、舊系統
退場成本	中 — 改其他 ACME client 或回手動	高 — 私鑰拿不出來、要重新 issue	低 — 完全自管

選 cert-manager 的核心訴求：cluster 內 cert 跨 issuer 統一管理 + 自動 renew + 跟 Ingress / Gateway declarative 整合。如果 cert 完全給 AWS service 用、不進 K8s workload、ACM 更簡單（不用裝 controller、AWS 自動處理）。如果是非 K8s 環境（VM、bare-metal Nginx）、certbot + cron 仍是合理選擇、不需要為了 cert 跑 K8s controller。

進階主題

DNS01 challenge 跟 cloud DNS 整合：cert-manager 支援多家 cloud DNS provider 作為 DNS01 solver — Route53、Cloud DNS（GCP）、Azure DNS、Cloudflare、ACMEDNS（自管 DNS proxy）。每個 provider 需要 DNS zone 寫入 credential（IAM role、service account key、API token）— 這份 credential 等於 任意改該 zone DNS record 的權力、blast radius 大、要走 least privilege 限定到 specific zone + 只給 TXT record write、不要全 zone 全 record type。

跟 Vault PKI engine 整合：cert-manager 可用 Vault PKI engine 作為 issuer backend — 在 cluster 內建 Issuer / ClusterIssuer type 為 vault、指向 Vault address + PKI mount path + auth method（Kubernetes auth / AppRole）。每張 cert 的 issue / revoke 都進 Vault audit log、跟 secret rotation 用同一套 evidence chain（呼應 Credential Rotation Scoped Evidence）。typical 用法：short-lived workload mTLS cert（hours-level duration、minutes-level renewBefore）、靠 Vault PKI 短期 cert + cert-manager 自動換。

跟 SPIRE 的互補：cert-manager 自動更新 cert、但 cert 是給人讀的 DNS name；SPIRE 自動建立 workload identity、identity 是 SPIFFE ID。兩者解不同問題 — cert-manager 解「Ingress / external API 的 TLS」、SPIRE 解「service A 要怎麼證明自己是 A 給 service B 看」。production 環境常並存：edge cert 跟 user-facing TLS 用 cert-manager + Let’s Encrypt、internal service mesh 用 SPIRE + SPIFFE。

Trust bundle 管理（trust-manager）：trust-manager 是 cert-manager 姐妹專案、解決 trust anchor（root CA bundle）跨 namespace 同步 問題。傳統做法是每個 pod ConfigMap 各自塞 CA bundle、更新時要逐個改；trust-manager 提供 Bundle resource 一處定義、自動 distribute 到指定 namespace 的 ConfigMap。對應 cert rotation 跟 CA rotation 是兩條獨立 chain、後者是 trust-manager 的領域。

排錯與失敗快速判讀

Challenge 卡在 pending：HTTP01 卡 = ingress 80 port 沒對 Internet、firewall / NLB 沒開、redirect 80→443 把 challenge 也轉了；DNS01 卡 = DNS provider credential 過期、IAM 沒 zone write 權、_acme-challenge record 沒寫進去 — kubectl describe challenge 看 reason
Wildcard cert 用 HTTP01：申請失敗 + log 寫 “wildcard not supported with HTTP-01” — 改 DNS01 solver
renewBefore 太短：renew 失敗只剩幾天才 alert、實際過期前來不及處理 — renewBefore 至少 duration / 3、production cert 給 30 天
Secret 沒被 ingress 讀到：Certificate 已 Ready 但 ingress 還用舊 cert — ingress tls.secretName 拼錯、ingress controller 沒 reload、TLS handshake 用的 SNI 沒匹配
ACME rate limit 撞牆：Let’s Encrypt rate limit 每週同 domain 50 cert / 同 account 300 pending — 反覆建錯 Certificate 重 issue 會撞、staging environment 用 letsencrypt-staging issuer 測過再上 prod
ClusterIssuer 被 application team 誤改：沒設 RBAC、任何 namespace 都能 patch ClusterIssuer — 用 admission policy 鎖 ClusterIssuer 變更權給 platform team
Approval Policy 缺失：任何 namespace 能建 wildcard cert、internal compromise 拿到 K8s API token 就能 issue 假冒 cert — 上 CertificateRequest Approval Policy + Kyverno / OPA rule
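The renewBefore rule of thumb in the list above (at least duration / 3, with renewal defaulting to the 2/3-of-lifetime point) can be checked mechanically before a Certificate is merged. This is a minimal sketch using plain timedelta arithmetic; the function names are hypothetical helpers, not part of cert-manager.

```python
from datetime import datetime, timedelta, timezone


def renewal_time(not_before, duration, renew_before=None):
    """Return the moment renewal should start.

    Without an explicit renew_before, fall back to the 2/3-of-lifetime
    point described in the trade-off table.
    """
    if renew_before is None:
        return not_before + duration * 2 / 3
    return not_before + duration - renew_before


def renew_before_too_short(duration, renew_before):
    """Flag a renewBefore shorter than one third of the certificate lifetime."""
    return renew_before < duration / 3


issued = datetime(2026, 1, 1, tzinfo=timezone.utc)
lifetime = timedelta(days=90)

print(renewal_time(issued, lifetime))                          # default 2/3 point
print(renew_before_too_short(lifetime, timedelta(days=7)))     # too little margin
print(renew_before_too_short(lifetime, timedelta(days=30)))    # meets the rule
```

A check like this fits a pre-merge lint step, so a 7-day renewBefore on a 90-day certificate is rejected before it reaches the cluster.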

何時改走其他服務

需求形狀	改走
AWS-only serving cert（ELB / CloudFront）	AWS ACM
非 K8s 環境（VM、bare-metal）的 ACME cert	certbot / acme.sh / Let’s Encrypt 直接用
Workload identity（不是 DNS-named cert）	SPIRE（SPIFFE-based）
大量短期 internal cert + 完整 PKI 治理	Vault PKI engine（可配 cert-manager 為 client）
公司既有 enterprise CA（Venafi / DigiCert）	cert-manager + Venafi issuer / 商用 issuer plugin
全公司 cert rotation 證據鏈	7.5 Credential Rotation Scoped Evidence

不在本頁內的主題

cert-manager Helm chart 的所有 value 細節跟版本相容性矩陣
每個 issuer backend 的完整 schema（acme / vault / venafi / ca / selfSigned）
Gateway API 跟 Ingress API 的 cert-manager annotation 完整對照
ACME RFC 8555 protocol 細節（HTTP01 / DNS01 / TLS-ALPN-01 challenge mechanism）
trust-manager 的 Bundle source 種類（inMemory / secret / configMap / defaultPackage）

案例回寫

cert-manager 在 07 案例庫沒有直接 vendor-level 事件、以下案例採對照引用：

案例	跟 cert-manager 的關係（對照）
Transport Trust and Certificate Lifecycle (section)	cert-manager 是 cert lifecycle automation 的具體實作 — auto-renewal + Challenge solver + Approval Policy 是 lifecycle 治理三層機制
Credential Rotation Scoped Evidence (section)	cert-manager 的 renewal 自動但 revocation 流程不自動 — 舊 cert 失效後 fleet 層級 trust bundle update 是另一條 chain、走 trust-manager
Citrix Bleed 2023 Session Hijack	對照啟示 — cert 更新後 session 仍可能延續、cert-manager 只管 cert lifecycle、session invalidation 是另一層責任、不要把 cert rotation 當 session 失效手段

下一步路由

上游：7.6 秘密管理與機器憑證治理、Transport Trust and Certificate Lifecycle
平行：Let’s Encrypt（ACME issuer）、AWS ACM（AWS-managed cert）、SPIRE（workload identity）
下游：HashiCorp Vault（PKI engine 作為 issuer backend）
跨模組：8 事故處理 vendor 清單（cert 過期 / mis-issue 事件如何 routing）
官方：cert-manager Documentation

Syft + Grype

Mon, 18 May 2026 00:00:00 +0000

Syft 跟 Grype 是 Anchore 開源的 姐妹工具（Apache 2.0、免費）、各做一件事、用 pipe 串接成 SBOM-first 的 supply chain scan 鏈：Syft 掃 container image / 檔案系統 / 目錄、產出標準 SBOM（CycloneDX 1.5+ / SPDX 2.3 / SyftJSON）；Grype 吃 SBOM 或直接 scan target、比對 Grype-DB 回報 CVE。設計哲學是 Unix philosophy — syft image:tag -o cyclonedx-json | grype 等價於 grype image:tag、但中間的 SBOM 是 正式 artifact、可以單獨簽章、單獨保存、單獨給下游消費。跟 Trivy 全包式設計不同、跟 Snyk 商業 SaaS 路線也不同。

服務定位

Syft + Grype 的核心定位是 SBOM-first 的 OSS supply chain scan tool chain。SBOM 不是中間產物、是 正式可簽章 artifact：Syft 產 SBOM 後通常用 Sigstore cosign attest --predicate sbom.cdx.json 把 SBOM 簽進 image OCI metadata、跟 image 一起發布；下游團隊 / 客戶 / scan pipeline 拿 trusted SBOM 跑 Grype、不需要重新 scan image。對 air-gapped 環境、multi-team handoff、合規場景（EO 14028 / FedRAMP 要求交付 CycloneDX 或 SPDX）特別合適。

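The difference from Trivy is division of labor versus all-in-one: Trivy ships SBOM generation, vuln scanning, IaC, secret, and license checks in one binary, while Syft + Grype split the work across two tools. Teams that need an SBOM interchange workflow or prefer the Unix philosophy choose the latter. Trivy covers slightly more ground (IaC / secret scan); Syft's strength is SBOM format interoperability, where it serves as an OSS reference implementation. The difference from Snyk is more direct: Snyk is a commercial SaaS with broad coverage (SAST / IaC / CSPM / Reachability), a dashboard, and fix PRs; Syft + Grype are CLI-only, free OSS, focused on SBOM + vuln, with no server and no dashboard. A dashboard requires either the commercial Anchore Enterprise or shipping the JSON output to Elasticsearch / Grafana yourself.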

關鍵 first-class concept：Source（OCI image / OCI archive / Docker daemon / dir / file / 既有 SBOM）、Catalog（Syft 內部 package inventory 結構）、Package、Vulnerability、Match（Grype 的 package ↔ CVE 配對）、Match Configuration（grype.yaml 設 severity gate / 比對策略）、Vulnerability DB（Grype-DB、Anchore 聚合 NVD + GHSA + 各 distro secdb）、Ignore Rule（CVE 例外、強制帶 expiration）。

本章目標

讀完本頁、讀者能判斷：

Syft 跟 Grype 各自的責任邊界、為什麼拆兩個工具比合一個工具好（SBOM 互通、attestation、air-gapped）
SBOM 格式（CycloneDX / SPDX / SyftJSON）的選擇、跟合規要求對應
Grype Match Configuration 跟 Ignore Rule 怎麼設、CI fail 條件怎麼定
何時改走 Trivy 全包式、何時走 Snyk 商業 SaaS

最短判讀路徑

判斷 Syft + Grype 配置是否健康、最少看四件事：

SBOM 格式跟保存：產出格式是否符合合規（多數 EO 14028 / FedRAMP 場景要 CycloneDX 或 SPDX、不是 SyftJSON）、SBOM 是否簽章（cosign attest）、是否集中保存（OCI registry 旁邊 / artifact store）、是否有 baseline diff（image 升級前後依賴變化）
Grype DB 更新：DB 是否每日同步、air-gapped 場景是否 mirror 到內部 registry（Grype DB 是 OCI artifact、可 oras pull 鏡像）、DB version 是否進 SBOM scan record（重現性）
Match Configuration：grype.yaml 的 severity gate（CI fail 條件、通常 high / critical fail）、only-fixed: true 是否開（只報有 patch 的 CVE）、add-cpes-if-none: true 對 binary-only package 行為
Ignore Rule 治理：例外清單是否帶 expiration、reason 欄位是否填 ticket / decision 連結、quarterly review 機制、過期自動回到 fail 狀態

四件事任一缺失、就是 Supply Chain Integrity 邊界的待補項目。

日常操作與決策形狀

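Syft usage and Source types: syft <source> -o <format> is the core invocation. The source can be an OCI image (registry/image:tag), an OCI archive (oci-archive:image.tar), the Docker daemon (docker:image:tag), a directory (dir:./), a single file, or an existing SBOM (sbom:./prev.cdx.json, used for format conversion). Formats include cyclonedx-json / cyclonedx-xml / spdx-json / spdx-tag-value / syft-json / table. Production pipelines usually emit cyclonedx-json (the most common compliance requirement) and also keep syft-json (Syft's own most complete format, useful for later round-trips).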

Package detector 廣度：Syft 自動偵測 OS package（apk / dpkg / rpm）+ 語言 dependency（npm / pip / gem / go module / cargo / maven / gradle / nuget / composer / hex / conan / swift / dart 等）+ binary analysis（Go binary 內 embedded module、Rust binary metadata、Java jar / war / ear nested）。對 static binary / FAT image 的支援是 Syft 的強項、比多數 SBOM tool 廣。但 runtime-only dependency（dlopen / dynamic load）SBOM 看不到、要靠 runtime workload protection（Falco / Cilium Tetragon 類工具、見 7 後續候選 vendor 清單）補。

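Grype usage: grype <image> or grype sbom:./image.cdx.json. Output formats are table / json / cyclonedx-json (CycloneDX VEX format) / sarif (GitHub code scanning) / template (custom Go template). Production CI commonly uploads --output sarif to GitHub code scanning and sends --output json to an internal SIEM. The grype sbom:./prev.cdx.json mode is an SBOM-only scan that never touches the image, which suits downstream teams that keep monitoring from the SBOM after the original image is frozen or unreachable.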

Match Configuration（grype.yaml）：核心欄位包括 fail-on-severity: high（CI gate）、only-fixed: true（只回報有 fix 可用的 CVE、避免 noise）、ignore list（個別 CVE 例外）、match strategy（如何把 package CPE / PURL 對應到 CVE、預設策略對 90% 場景夠用、特殊 binary 場景才調）。所有設定走版控、grype.yaml 跟程式碼一起 review、避免 console 改。
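The severity gate and only-fixed filter described above can also be applied after the fact to a saved JSON report, for example when the same report feeds both CI and a SIEM. This is a minimal sketch that assumes the report has a top-level matches list whose entries carry vulnerability.id, vulnerability.severity, and vulnerability.fix.state; treat those field names as assumptions to verify against the Grype version in use. The CVE identifiers are placeholders.

```python
# Severity ladder used for the gate; anything outside it (e.g. "unknown") is skipped.
SEVERITY_ORDER = ["negligible", "low", "medium", "high", "critical"]


def failing_matches(report, fail_on="high", only_fixed=True):
    """Return vulnerability IDs at or above fail_on.

    With only_fixed=True, matches without an available fix are ignored,
    mirroring the only-fixed behaviour described in the text.
    """
    threshold = SEVERITY_ORDER.index(fail_on)
    failing = []
    for match in report.get("matches", []):
        vuln = match.get("vulnerability", {})
        severity = str(vuln.get("severity", "")).lower()
        if severity not in SEVERITY_ORDER:
            continue
        if only_fixed and vuln.get("fix", {}).get("state") != "fixed":
            continue
        if SEVERITY_ORDER.index(severity) >= threshold:
            failing.append(vuln.get("id", "<no-id>"))
    return failing


# Hypothetical report fragment with placeholder identifiers.
sample_report = {
    "matches": [
        {"vulnerability": {"id": "CVE-0000-0001", "severity": "Critical", "fix": {"state": "fixed"}}},
        {"vulnerability": {"id": "CVE-0000-0002", "severity": "High", "fix": {"state": "not-fixed"}}},
        {"vulnerability": {"id": "CVE-0000-0003", "severity": "Low", "fix": {"state": "fixed"}}},
    ]
}

blocking = failing_matches(sample_report)
print(blocking)
```

A CI wrapper would exit non-zero when the returned list is non-empty.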

Ignore Rule 治理：grype.yaml 的 ignore entry 結構：vulnerability + reason + expiration（YYYY-MM-DD）+ optional package.name / fix-state。Anchore 設計 沒有「永久 ignore」、必須帶 expiration — 強制 quarterly review、避免「五年前 ignore 的 CVE 早被 fix 了還在清單裡」。reason 欄位填 ticket 編號或 ADR link、給未來的人 context。
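The governance rule above (every exception carries a reason and an expiration, and expired exceptions fall back to failing) can be enforced by a small audit step run alongside the scan. This sketch assumes the ignore entries have already been loaded into dictionaries with the vulnerability / reason / expiration keys named in the text. The loader, the ticket IDs, and the CVE identifiers are hypothetical.

```python
from datetime import date


def audit_ignores(entries, today):
    """Split ignore entries into active, expired, and invalid.

    invalid = missing reason or expiration, which the governance rule forbids.
    """
    active, expired, invalid = [], [], []
    for entry in entries:
        vuln_id = entry.get("vulnerability", "<no-id>")
        if not entry.get("reason") or not entry.get("expiration"):
            invalid.append(vuln_id)
            continue
        if date.fromisoformat(entry["expiration"]) < today:
            expired.append(vuln_id)
        else:
            active.append(vuln_id)
    return active, expired, invalid


# Placeholder entries for illustration only.
ignore_entries = [
    {"vulnerability": "CVE-0000-0001", "reason": "TICKET-101", "expiration": "2026-09-30"},
    {"vulnerability": "CVE-0000-0002", "reason": "TICKET-102", "expiration": "2026-03-31"},
    {"vulnerability": "CVE-0000-0003", "expiration": "2026-12-31"},
]

active, expired, invalid = audit_ignores(ignore_entries, today=date(2026, 6, 1))
print("active:", active)
print("expired:", expired)
print("invalid:", invalid)
```

Running the audit with a date one week ahead of today gives owners the advance warning mentioned in the troubleshooting list.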

Cosign attest SBOM：syft image:tag -o cyclonedx-json > sbom.cdx.json && cosign attest --predicate sbom.cdx.json --type cyclonedx --key cosign.key image:tag — SBOM 被簽進 image 的 OCI signature manifest、下游 cosign verify-attestation --type cyclonedx ... 拿到 cryptographically signed SBOM。這把 SBOM 從「可被竄改的 JSON 檔」升級到 trusted artifact、是 SLSA L3+ provenance 的基礎。

SLSA / SPDX 流程整合：Syft SBOM 是 build 階段產物、跟 SLSA provenance（誰 build 的、用什麼 builder、source commit 是什麼）併存、不互斥 — SBOM 答「裡面有什麼」、provenance 答「怎麼 build 的」。完整 supply chain trust 需要兩者 + cosign signature。

核心取捨表

取捨維度	Syft + Grype	Trivy	Snyk
工具拆分	兩個（Unix philosophy）	一個（all-in-one binary）	SaaS + CLI（多模組）
授權	OSS Apache 2.0	OSS Apache 2.0	商業（freemium、付費才解鎖完整）
部署模型	CLI、無 server	CLI、無 server	SaaS dashboard + CLI
SBOM 格式	CycloneDX 1.5+ / SPDX 2.3 / SyftJSON（reference 實作）	CycloneDX / SPDX	CycloneDX / SPDX（次要、scan 為主）
Vuln 資料源	Grype-DB（NVD + GHSA + 各 distro secdb 聚合）	Trivy-DB（類似來源 + Aqua 加值）	Snyk Intel（自家 research、含 reachability）
Additional scanning	None (focused on SBOM + vuln)	IaC / secret / license / k8s misconfig	SAST (Code) / IaC / container / Open Source
Dashboard	無（Anchore Enterprise 商業才有）	無（Aqua 商業才有）	內建 SaaS dashboard
Air-gapped	強 — Grype DB 是 OCI artifact、可 mirror	強 — Trivy DB OCI artifact	弱 — SaaS-only 為主（自管 server 是 Enterprise）
Reachability	無	無	有（Java / JS）
Fix PR 自動化	無	無	有（auto PR、Renovate-like）
適合場景	OSS 偏好、SBOM 互通流程、air-gapped、Unix tool chain	OSS 偏好、單一工具想包多事、k8s misconfig 也要	商業 SaaS、需 dashboard / fix workflow / reachability

選 Syft + Grype 的核心訴求：要正式 SBOM 作為交付 artifact（合規 / 多 team handoff）+ 偏好 OSS Unix philosophy（兩個工具各做一件事、容易整合自家 pipeline）+ 不需要 SaaS dashboard（自家 SIEM / Grafana 已經有）。需要 IaC scan 一起做、看一下 Trivy 是不是更省整合成本；需要 fix workflow 跟 reachability、商業預算足、走 Snyk。

進階主題

SBOM attestation 完整鏈：build pipeline 順序通常是 — build image → syft image -o cyclonedx-json > sbom.cdx.json → cosign sign image → cosign attest --predicate sbom.cdx.json --type cyclonedx image → push。下游 admission controller（Kyverno / Gatekeeper / Sigstore policy-controller）verify-attestation 拿 trusted SBOM、再 Grype scan、policy 決定是否允許 deploy。這條鏈把 SBOM 從文件升級成 deploy gate。

Grype DB air-gapped sync：Grype DB 是 OCI artifact（ghcr.io/anchore/grype/listing.json + db.tar.gz）、oras pull 或 grype db update 取得。air-gapped 場景：DMZ 跑 grype db update --skip-listing-content-check、把 ~/.cache/grype/db/ 整個 sync 到內部 mirror registry、內部 grype 透過 GRYPE_DB_UPDATE_URL 指到內部 listing。DB 版本進 scan record、確保 相同 SBOM + 相同 DB = 相同結果（可重現）。

Custom matcher / Ignore Rule 細部：Grype 預設 matcher 對 90% 場景夠、但 Go binary、static-linked binary、custom C++ build 可能需要 add-cpes-if-none: true 強制配對 CPE。Ignore Rule 支援 vex-status 欄位（accepted / under-investigation / fixed / not-affected）對齊 CycloneDX VEX 標準、輸出 VEX-enriched SBOM 給下游 / 客戶。

Anchore Enterprise 商業整合：OSS Syft + Grype 不夠時、Anchore Enterprise 加：policy engine（GraphQL 寫複雜 policy）、dashboard、RBAC、SLA-backed support、跟 Kubernetes admission integration、跟 Jira / ServiceNow ticket 自動建單。OSS 是 90% 場景的起點、Enterprise 解的是 policy + workflow 而非 scan ability。

SBOM diff（baseline 比對）：syft 自己沒內建 diff、但 cyclonedx-cli diff 或自家 script 可以比對 image v1 SBOM vs image v2 SBOM、找出新增 / 移除 / 升級的 package。用途：XZ backdoor 之類「相同 version 但被植入後門」事件、單靠 SBOM 看不出來、但 baseline + behavior anomaly 雙軌可以提早警示。
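A baseline diff of this kind needs very little code once both SBOMs are in CycloneDX JSON. The sketch below compares the components arrays of two documents by name and version. Keying on name alone is a simplifying assumption; a real pipeline would more likely key on purl. The package names and versions are illustrative.

```python
def index_components(sbom):
    """Map component name -> version for a CycloneDX JSON document."""
    return {c["name"]: c.get("version", "") for c in sbom.get("components", [])}


def diff_sboms(old, new):
    """Return added / removed / changed components between two SBOMs."""
    before, after = index_components(old), index_components(new)
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "changed": sorted(
            (name, before[name], after[name])
            for name in set(before) & set(after)
            if before[name] != after[name]
        ),
    }


# Illustrative SBOM fragments for image v1 and image v2.
sbom_v1 = {"components": [
    {"name": "openssl", "version": "3.0.13"},
    {"name": "zlib", "version": "1.3"},
    {"name": "libfoo", "version": "1.0.0"},
]}
sbom_v2 = {"components": [
    {"name": "openssl", "version": "3.0.14"},
    {"name": "zlib", "version": "1.3"},
    {"name": "libbar", "version": "0.2.0"},
]}

result = diff_sboms(sbom_v1, sbom_v2)
print(result)
```

As the paragraph notes, a diff like this only surfaces inventory changes; a tampered package that keeps the same name and version produces an empty diff.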

排錯與失敗快速判讀

Syft scan 找不到 package：image 是 FROM scratch 或 distroless、Syft 偵測不到 OS package metadata — 改 scan source 為 build 階段的 dir:./ 或保留 builder image 的 SBOM
Grype 報一堆 unfixed CVE：base image 老、有 CVE 但 upstream 還沒 patch — 設 only-fixed: true 過濾 noise、focus 在 actionable item；同時排程 base image 升級
CI 突然 fail 變多：Grype DB 更新後新 CVE 揭露 — 看 DB version diff、評估是 真新風險 還是 舊 package 被重新分類、必要時用 Ignore Rule + expiration 過渡
SBOM format rejected downstream: compliance requires SPDX but the pipeline produced SyftJSON. Convert with syft convert ./sbom.json -o spdx-json (Syft doubles as an SBOM format conversion tool)
Air-gapped 環境 Grype 跑不動：DB 沒同步、scan 直接報 0 vulnerability（假陰性）— grype db status 看 DB age、mirror sync 機制檢查、加 staleness alarm
Ignore Rule 過期回到 fail：CI 突然 fail、查 expiration 已過 — 預期行為、強制 quarterly review；補 rotation 機制（cronjob 提前一週 alert owner）
Binary 偵測不到 module：Go binary stripped、-trimpath 後 module path 沒了 — build 改加 -buildvcs=true 保留 VCS info、或 build 階段 SBOM scan source code、不是 binary
cosign verify-attestation 失敗：image 被 re-tag / re-push 後 attestation manifest 不對 — 用 image digest（@sha256:...）而非 tag 做 attest、tag 不可信
Grype misses an ecosystem: for example a newly emerged package manager. If Syft has no detector for it, Grype cannot see it either; submit an issue or contribute a cataloger yourself

何時改走其他服務

需求形狀	改走
一個工具想包 IaC / secret / k8s misconfig	Trivy
需要 SAST / Reachability / Fix PR workflow	Snyk
綁 GitHub 的 SAST + Dependabot	GitHub Advanced Security
Container runtime detection	Falco / Cilium Tetragon（見 7 後續候選 vendor 清單）
Image signing / attestation	Sigstore cosign
Policy at admission	Kyverno / OPA Gatekeeper（見 7 後續候選 vendor 清單）
SBOM dashboard / enterprise policy / RBAC	Anchore Enterprise（商業）

不在本頁內的主題

CycloneDX / SPDX 完整 schema 規格逐欄位解讀
Sigstore cosign / Rekor / Fulcio 完整架構（attest 鏈的 OIDC / transparency log）
SLSA framework 各 level 對應的 builder 要求
Anchore Enterprise policy DSL 完整語法
VEX（Vulnerability Exploitability eXchange）跟 CSAF 標準對照細節

案例回寫

07 案例庫沒有直接 Syft / Grype-level 事件、但供應鏈案例都是 SBOM-first 思維的對照：

案例	跟 Syft + Grype 的關係
Log4Shell CVE-2021-44228	對照啟示 — 預先用 Syft 產 SBOM 集中保存後、Log4Shell 公開時拿歷史 SBOM 跑 Grype 在分鐘級回答「我們哪些服務有用、含 transitive」
SolarWinds 2020 Sunburst	對照啟示 — Syft 看 package layer、看不到 build-time backdoor 注入；需配 cosign attest + SLSA provenance 才完整
XZ Backdoor 2024	對照啟示 — 相同 version 被植入後 SBOM 一樣、純比對 SBOM 看不出來；mitigation 是 SBOM diff 對 baseline + release tarball verify
Kaseya VSA 2021	對照啟示 — 多服務 SBOM 集中 inventory（哪 service 用哪 component）、緊急時可 affected-services-by-package 反查、不是逐 image scan
7.12 供應鏈完整性與 Artifact 信任	Syft 是 SBOM reference implementation、章節原則對應 SBOM + signing + provenance 的 trust chain
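The Log4Shell and Kaseya rows both depend on a reverse lookup from package to affected services over stored SBOMs. A minimal sketch of that lookup follows, assuming a hypothetical in-memory inventory keyed by service name and holding CycloneDX-style component lists. The numeric version parser and the fixed_in threshold passed in are illustrative simplifications, not a complete advisory model.

```python
def parse_version(version):
    """Parse a purely numeric dotted version; an assumption that real data may violate."""
    return tuple(int(part) for part in version.split("."))


def affected_services(inventory, package, fixed_in):
    """Return {service: version} for services shipping package below fixed_in."""
    threshold = parse_version(fixed_in)
    hits = {}
    for service, sbom in inventory.items():
        for component in sbom.get("components", []):
            if component.get("name") != package:
                continue
            if parse_version(component["version"]) < threshold:
                hits[service] = component["version"]
    return hits


# Hypothetical central SBOM inventory.
inventory = {
    "checkout": {"components": [{"name": "log4j-core", "version": "2.14.1"}]},
    "billing": {"components": [{"name": "log4j-core", "version": "2.17.1"}]},
    "search": {"components": [{"name": "jackson-databind", "version": "2.13.0"}]},
}

hits = affected_services(inventory, "log4j-core", fixed_in="2.15.0")
print(hits)
```

Because the SBOM already lists transitive dependencies, the same query answers the "including transitive" question without rescanning any image.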

下一步路由

上游：7.12 供應鏈完整性與 Artifact 信任
平行：Trivy（一站式替代）、Snyk（商業 SaaS）、GitHub Advanced Security（GitHub 內建）
下游：Sigstore cosign（SBOM attestation）、admission policy（Kyverno / OPA Gatekeeper、見 7 後續候選 vendor 清單）
跨類：runtime workload protection（Falco / Cilium Tetragon、見 7 後續候選 vendor 清單）、HashiCorp Vault（cosign signing key 保存）
跨模組：8 事故處理 vendor 清單（新 CVE 揭露時的 SBOM-based fan-out 查詢）
官方：Syft Documentation / Grype Documentation

Google Cloud Spanner

Wed, 13 May 2026 00:00:00 +0000

Cloud Spanner 是 Google 內部 2007 年起跑、2017 年開放為 GCP 服務的 全球分散式 SQL OLTP。內部撐 Google Ads / Play / Search 計費、外部支援 Blockchain.com、Sharechat、ZEE5 等。它的公開案例重點是每秒 10 億請求等級、線性擴展、強一致與 global distribution 可以同時成為 OLTP 設計目標。

教學路線：全球強一致與 TrueTime 成本

Spanner 服務頁的教學目標是把 global strong consistency、TrueTime、Paxos、region layout 與 processing unit 連成一條產品決策線。讀者讀完後要能判斷何時需要全球一致 SQL，並理解這種能力的 latency、成本與雲平台邊界。

學習段	核心問題	對應段落
Global consistency	強一致 SQL 為什麼需要時間邊界與 consensus	定位、適用場景、Linearizability
Region layout	instance config、leader region、replica 如何影響 latency	容量規劃要點、常見陷阱
Capacity unit	node / processing unit 如何取代傳統 shard 心智模型	容量特性、案例對照
Use-case pressure	billing、subscription、ticketing、金融交易何時需要 Spanner	適用場景、案例對照
替代路由	何時用 PostgreSQL、CockroachDB、Aurora DSQL、DynamoDB	不適用場景、跟其他 vendor 的取捨

定位：TrueTime + Paxos 的全球線性 SQL

Spanner 解決的是跨地理位置同時追求 strong consistency、linear scalability 與 global availability 的 OLTP 問題。

關鍵設計：

TrueTime API：用 GPS + 原子鐘提供「全球 unambiguous 時間戳」、誤差 < 7ms
External consistency（線性化）：跨節點交易順序跟 wall clock 一致
Paxos-based replication：跨 zone / region quorum
線性擴展：2 nodes → 45K reads/sec、4 nodes → 90K reads/sec、依此類推

容量特性（引自 9.C10 Spanner 案例）：

內部峰值：> 10 億 requests / sec
線性擴展（不像 USL 系統會在某點 plateau）
跨 region quorum 延遲：50-200ms（視 region 距離）
最小容量單位：100 processing units（PU）≈ 1/10 node、適合小負載
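The linear-scaling figures above translate directly into a first-pass sizing calculation. The sketch below derives a per-node read rate from "2 nodes → 45K reads/sec" and converts it to processing units using "100 PU ≈ 1/10 node". The rounding steps (multiples of 100 up to 1000 PU, multiples of 1000 beyond) are an assumption to confirm against current Spanner documentation, and real sizing also depends on write mix, row size, and CPU headroom.

```python
READS_PER_NODE = 22_500   # derived from "2 nodes -> 45K reads/sec"
PU_PER_NODE = 1_000       # derived from "100 PU ~ 1/10 node"
MIN_PU = 100              # smallest capacity unit named in the text


def processing_units_for(reads_per_sec):
    """Estimate processing units for a target read rate under linear scaling."""
    # Ceiling division without floats.
    raw_pu = -(-reads_per_sec * PU_PER_NODE // READS_PER_NODE)
    # Assumed granularity: steps of 100 PU up to one node, whole nodes above that.
    step = 100 if raw_pu <= PU_PER_NODE else 1_000
    return max(MIN_PU, -(-raw_pu // step) * step)


for target in (5_000, 30_000, 90_000):
    print(target, "reads/sec ->", processing_units_for(target), "PU")
```

The estimate is a planning floor rather than a provisioning target, since peak-event headroom still has to be added on top.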

適用場景

1. 金融交易、ticketing inventory、payment ledger：

需要強一致，避免 double-spend、oversell 或帳務順序錯亂
全球用戶但需要原子性
對應案例：9.C10 Spanner — Google Ads 計費與 Google Play 訂閱都需要把每次計費事件放進可驗證順序

2. 全球用戶的 OLTP（不只 read replica）：

跨 region 寫入、各地用戶寫入本地 region 仍維持全球強一致
它承擔的是 multi-region write path，而非 single primary + 跨 region read replica
對應案例：Blockchain.com（高頻 crypto 交易、強一致）

3. 想擺脫 sharding 複雜度：

傳統大規模 SQL 常走應用層 sharding（管 shard key、跨 shard query、resharding）
Spanner 自動 partition，application 主要管理 schema、query shape 與 region layout
對應案例：9.C10 Spanner 案例 — 「節點數量是容量單位」，shard placement 由 Spanner 管理

4. PostgreSQL 相容路徑：

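Spanner offers a PostgreSQL dialect interface (generally available since 2022)
It lowers the barrier for migrating PostgreSQL applications, but only a compatible subset of PostgreSQL is supported, so audit compatibility first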
跟 CockroachDB / Aurora DSQL 類似的策略

不適用場景

1. 跨洲低延遲（< 50ms）需求：

跨洲 quorum 物理上 100ms+ 不可壓縮
替代：single-region OLTP（Aurora、Cloud SQL）+ eventual consistency 跨 region 同步

2. 高 throughput 但容忍 eventual consistency：

Spanner 強一致有溢價，eventual consistency workload 通常有更低成本選項
替代：Bigtable（wide-column、eventual）、DynamoDB Global Tables（KV、eventual）

3. 小規模 OLTP：

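Capacity starts at 100 PU, roughly $65 per month, which is more than a small Cloud SQL instance
For traffic below 1000 RPS, Cloud SQL is the more economical choice
Spanner primarily targets medium-to-large and global workloads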

4. 跨雲需求：

Spanner 是 GCP managed service，cross-cloud / on-prem 需求要看 CockroachDB、TiDB 或其他自管路線
替代：CockroachDB、TiDB（自管、可跨雲）

5. 需要 OLAP 分析能力：

Spanner 定位在 OLTP，analytics workload 交給 BigQuery 或其他 OLAP 系統
替代：跟 BigQuery 整合做 ETL、或用 Spanner Graph（2024 推出）

跟其他 vendor 的取捨

vs Aurora DSQL（AWS 2024 推出、概念對標 Spanner）：

Spanner: TrueTime hardware; in production inside Google since 2007 and publicly available on GCP since 2017
Aurora DSQL：新（2024）、PostgreSQL 相容、serverless
選 Spanner：GCP 生態、需要極致成熟度
選 Aurora DSQL：AWS 生態、需要 PostgreSQL ORM 相容

vs CockroachDB：

Spanner：managed、TrueTime hardware、GCP 限定
CockroachDB：自管、HLC + Raft（不靠 TrueTime）、跨雲
選 Spanner：想把 operation 交給 GCP managed service，並需要 Google 規模驗證
選 CockroachDB：跨雲 / on-prem、PostgreSQL 相容、自管彈性

vs TiDB：

Spanner：GCP-only、PostgreSQL-like
TiDB：可自管 + Cloud、MySQL 相容、中國 / 亞洲生態深
選 Spanner：英語 / 歐美生態
選 TiDB：MySQL 應用、亞洲市場

vs Aurora（traditional single-region scaling）：

Spanner：全球分散式
Aurora：single-region scaling
選 Spanner：流量明確跨 region + 需要強一致
選 Aurora：流量集中一個 region（多數情況）

vs Cosmos DB（multi-region write）：

Spanner：strong consistency 跨 region
Cosmos DB：5 個 consistency levels、AP 系統（含 strong 但語義不同）
選 Spanner：需要 linearizable（金融、ticketing）
選 Cosmos DB：可接受 session / eventual、Azure 生態、需要 multi-model

vs Bigtable：

Spanner：SQL、強一致、OLTP
Bigtable：wide-column、eventual replication、時序 / IoT / 大資料
兩者互補：Bigtable 承擔大資料 / wide-column，Spanner 承擔強一致 OLTP

vs PostgreSQL（baseline）：

PostgreSQL：single-primary、跨 region async replication、90% 場景夠用
Spanner：全球線性化、強一致跨 region、需要 GCP + 接受 latency / 成本
從 PostgreSQL 升級 Spanner 的判準：流量明確跨 region，且跨 region 一致性是 product requirement
詳見 PostgreSQL vendor page 取捨段 + 1.11 全球分散式 OLTP

容量規劃要點

從 09 案例庫 + Spanner 文件提煉：

1. 節點數量 = 容量單位：

節點配置通常用較長週期 review，並在事件高峰前預先調整
線性擴展讓 forecast 簡單（2x 流量 → 2x 節點）
對應 9.6 容量規劃模型的「不可水平擴容服務」反向 — Spanner 是 可水平擴容 但需要 提前 provision

2. 跨 region quorum 配置：

multi-region instance 可選擇哪些 region 是 voting member
voting region 數量決定 failure domain
Voting across continents adds high latency; voting among regions within one continent is usually acceptable

3. 100 PU 起跳的 granular sizing：

早期 Spanner 最小單位 1 node（約 $1000+/month）、中小負載難用
後來推出 100 PU（1/10 node、約 $65/month）、讓小負載也能 evaluate

4. 跨環境與新產品能力要查官方文件：

Spanner 的跨環境、graph、PostgreSQL dialect 與 change streams 能力持續演進
實作前要用官方文件確認可用 region、版本、限制與 pricing

5. TrueTime 是 Spanner 價值之一：

Spanner 還有 schema migration without downtime、change streams、interleaved tables
評估 Spanner 要同時看跨 region 強一致與整體 SQL 工程能力

Deep article（已完成）

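All 8 deep articles in this batch are complete. They cover Spanner's core production topics, from TrueTime and consistency models through schema migration, migration from Cloud SQL, change streams, the PostgreSQL dialect, Spanner Graph, and BigQuery federation: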

主題	文章	對應 production 議題
TrueTime 是手段、line-rate scaling 才是設計目的、commit wait 數學	truetime-api-depth	9.C10 Google internal dogfood 線性擴展模式、ε 暴衝失敗模式、cross-region voting latency 影響
external consistency / serializability / linearizability 精確定義差異	consistency-models-comparison	PG SSI / CockroachDB / Spanner / Aurora DSQL line-rate scaling 對照、9.C10 cross-region quorum 100-200ms
Schema migration without downtime + interleaved tables 物理 layout	schema-migration-interleaved-tables	TrueTime version timestamp、5 production 踩雷、跟 PostgreSQL online schema change 對照
Cloud SQL for PostgreSQL → Spanner（Type E paradigm shift）playbook	migrate-from-cloud-sql-pg	sizing barrier（100 pu 起跳）+ < 50ms write latency no-go、cost crossover 報告、9.C10 dogfood 邊界
Change Streams (CDC)：data change record、watch partition、下游整合	change-streams-cdc	OLTP 變更餵搜尋 / 快取 / 分析、child partition 接力、retention 失敗、跟 DynamoDB Streams 對照
PostgreSQL dialect vs GoogleSQL、相容子集邊界、dialect 不可逆	postgresql-dialect	PostgreSQL 生態遷入、相容性 audit、dialect 鎖定的高代價回退、何時選 PG dialect
Spanner Graph (2024)：property graph、跟 relational 共存、GQL	spanner-graph	多跳關係查詢、edge table layout 不可逆設計代價、super node 扇出、何時用專用 graph DB
Spanner ↔ BigQuery federation：OLTP/OLAP 分工、Data Boost	bigquery-federation	分析查詢拖垮 OLTP、Data Boost workload 隔離、federation vs change-stream 落地、何時分出去

DB4 cross-vendor entry：先看 CockroachDB / Aurora DSQL / Spanner 決策樹識別 driver path、再進本 vendor 深度。

後續擴充（仍待補）

Spanner Graph 進階查詢 lab（GQL pattern、super node 處理、遍歷效能調校）
Data Boost 容量規劃與成本模型 deep dive
Change Streams → Dataflow hands-on lab（建 stream、部署 pipeline、驗證 end-to-end）
Spanner regional → multi-region topology 升級 playbook

Anti-recommendation 與升級路由

Spanner 的 global strong consistency 是高價值能力，也會把 latency、region layout 與 GCP lock-in 帶進核心架構。這一段先說何時維持 Cloud SQL / Aurora，再說何時升級 Spanner、CockroachDB、Aurora DSQL 或 Bigtable / DynamoDB。

機制 / 路線	維持簡單設計的條件	升級訊號	主要引用路徑
Cloud SQL / Aurora	single-region primary 足夠、跨 region 只需 async DR / read	跨 region 寫入順序是產品契約、double-spend / oversell 代價高	Aurora vendor、RPO
Spanner regional	單 region 強一致與水平擴容已足夠	需要 multi-region availability、regional failure survival	Quorum、External Consistency
Spanner multi-region	GCP 生態、SQL workload、global consistency 是核心需求	跨洲 p99 目標過低、成本或 GCP lock-in 成為主要風險	Latency Budget、Global OLTP
CockroachDB	GCP-only managed 服務可接受	跨雲、on-prem、自管或 PostgreSQL wire 相容是硬需求	CockroachDB vendor
Aurora DSQL	團隊已在 GCP 或需要 Spanner 成熟度	AWS 生態、serverless distributed SQL、PostgreSQL 相容是主訴求	PG → Aurora DSQL Migration
Bigtable / DynamoDB	workload 可接受 eventual consistency 或 KV / wide-column	強一致 SQL 的協調成本高於產品收益	DynamoDB vendor

Spanner 的簡單路徑是先證明跨 region 一致性是產品需求。若只是想要全球 read latency，read replica、cache、edge KV 或 eventual consistency pipeline 可能更划算；Spanner 適合把「全球寫入順序正確」視為產品承諾的資料。

Region layout 的升級路徑要先定義 leader、voting replica 與使用者地理分布。跨洲 quorum 會把物理延遲放進 transaction path，因此 latency budget、降級策略與 read staleness policy 要一起寫進設計。

已知 limitation 與後續路由

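The Spanner overview currently covers the global SQL decision. TrueTime, external consistency, the PostgreSQL dialect, interleaved tables, change streams, Cloud SQL / PostgreSQL → Spanner migration, and Spanner / BigQuery federation are covered by the deep articles listed earlier on this page; the remaining gaps are the labs and playbooks listed under 後續擴充（仍待補）.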

案例對照

案例	規模	教學重點
9.C10 Cloud Spanner	> 10 億 req/sec、線性擴展	全球強一致 OLTP 標竿

Spanner case 的讀法是先看一致性需求，再看容量數字。10 億 req/sec 證明它能水平擴展，但讀者真正要回收的是「計費、訂閱、庫存、交易順序」這類需要 global external consistency 的產品壓力。

反向 sibling 路由

Spanner 的反向 sibling 路由用來把 global strong consistency 和雲端代管責任一起判讀。若讀者從 PostgreSQL / MySQL 過來，先確認是否具產品契約等級的 external consistency 需求；若只是 managed SQL 與 replica scaling，回 Aurora vendor；若要 PostgreSQL-like distributed SQL 且需要自管或多雲彈性，對照 CockroachDB vendor；若 access pattern 是固定 KV / document，先看 DynamoDB vendor 或 Cosmos DB vendor。

這條路由的判準是交易順序是否跨 region 影響產品正確性。Spanner 的價值在 external consistency、schema 與 SQL 能力、全球 deployment 與 Google Cloud operation model 的組合；若產品只需要 eventual / session consistency，較輕的 NoSQL 或 managed SQL 常有更低成本。

常見陷阱

誤以為跨 region 強一致沒有延遲代價：跨洲 quorum 100-200ms 是物理成本
設計 schema 像傳統 PostgreSQL：Spanner 有 interleaved tables、適當用能加速查詢
所有讀取都用強一致：read-only transaction 可選 bounded staleness，reporting 類路徑常能用 stale read 換較低成本
Using Spanner for a small single-region workload: wasteful when a single primary is enough; Cloud SQL / Aurora is cheaper
不評估 100 PU 起跳：早年 1 node minimum、現在 100 PU 起、small workload 也可以 POC

下一步路由

完整 T1 對照：01-database vendors index
平行：Aurora vendor、DynamoDB vendor、CockroachDB vendor
上游：1.11 全球分散式 OLTP
跨模組：9.6 容量規劃模型 — 全球 OLTP 的容量規劃特殊性
Last reviewed：2026-05-22（processing units / PostgreSQL interface / TrueTime 文件屬時間敏感 claim）
官方：Cloud Spanner、TrueTime: Time Distributed in Spanner

GCP Cloud Operations

Fri, 01 May 2026 00:00:00 +0000

GCP Cloud Operations（前 Stackdriver）是 GCP 原生 observability 套件、承擔三個責任：GCP 服務內建 Cloud Logging / Monitoring / Trace（無需配置）、跟 GCP 資源 model 深度整合（project / folder / org）、BigQuery 匯出長期 logs 跟分析。設計取捨偏向「GCP 生態 turnkey + BigQuery 整合 + Cloud Profiler 持續 profiling」、跨雲跟進階 distributed tracing 是限制。

本章目標

讀完本章後、你應該能：

用 gcloud / Console 查 Cloud Logging / Monitoring
設計 structured logging + log-based metrics
用 Cloud Monitoring uptime checks + SLO + alerting policy
用 Cloud Trace + Cloud Profiler 做 application performance
配置 BigQuery 匯出長期 logs 跟分析

最短路徑：5 分鐘把 Cloud Operations 跑起來

# 1. Cloud Logging / Monitoring are enabled by default on GCP (within free tier quota)
# TODO: GKE / Cloud Run / Cloud Functions emit logs + metrics automatically

# 2. Query logs
gcloud logging read 'resource.type="gae_app" AND severity>=ERROR' --limit=10

# 3. Query visually in Logs Explorer
# TODO: Console → Logging → Logs Explorer

日常操作與決策形狀

Cloud Logging 結構化 logs

子議題：

jsonPayload：結構化 log（推薦）
Severity levels (LogSeverity): DEFAULT / DEBUG / INFO / NOTICE / WARNING / ERROR / CRITICAL / ALERT / EMERGENCY
Resource type / Resource labels：自動帶入
對應 4.C5 Cloud Trace OTLP
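On managed runtimes such as Cloud Run or GKE, a common way to get jsonPayload entries is to write one JSON object per line to stdout, with severity and message as top-level keys. The sketch below shows that pattern with the standard library only; the helper name and the extra fields (order_id, latency_ms) are hypothetical, and the exact set of specially treated keys should be confirmed in the Cloud Logging documentation.

```python
import json

# Severity names accepted by this helper; DEFAULT is left out because it carries no signal.
ALLOWED_SEVERITIES = {
    "DEBUG", "INFO", "NOTICE", "WARNING", "ERROR", "CRITICAL", "ALERT", "EMERGENCY",
}


def log_entry(severity, message, **fields):
    """Build one structured log line as a JSON string."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    entry = {"severity": severity, "message": message, **fields}
    return json.dumps(entry, ensure_ascii=False)


# One JSON object per line on stdout; the logging agent is expected to parse it.
print(log_entry("ERROR", "payment failed", order_id="o-123", latency_ms=812))
print(log_entry("INFO", "payment retried", order_id="o-123", attempt=2))
```

Keeping numeric fields such as latency_ms as numbers rather than strings is what later allows a distribution log-based metric to be built on them.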

Log-based metrics

子議題：

Counter metric：log 出現次數
Distribution metric：log field 數值分布
適合：把 application log 轉成 metric trigger alert
對應指令：gcloud logging metrics create

Cloud Monitoring uptime checks / SLO

子議題：

Uptime check：HTTP / HTTPS / TCP / ICMP 多地點 probe
SLO：service indicator + objective + window + burn rate alert
Multi-window SLO alert（類 Honeycomb burn rate）
對應 knowledge cards burn-rate

Cloud Trace

子議題：

接受 OTLP（Cloud Trace 2.0+）
自動採集 GCP service（Cloud Run / GKE / App Engine）
對應 4.C5 Cloud Trace OTLP adoption
跟 X-Ray 比、distributed tracing 較基礎

Deep Article

Cloud Monitoring Metrics Model 與 MQL：GCP metrics model、MQL vs PromQL、custom metrics 設計、alerting policy 與 Managed Prometheus 整合
Cloud Logging 查詢、匯出與合規：查詢語言、log router / sink 匯出、retention 設計、organization-level 聚合、audit log 與 PII / CMEK 合規治理

進階主題（按需閱讀）

Cloud Profiler

子議題：

持續 profiling（CPU / Heap / Wall time / Mutex）
支援 Go / Java / Python / Node
Flame graph 視覺化
跟 Pyroscope / Datadog Profiler 對照

BigQuery 匯出長期儲存

子議題：

Log Router：定義 sink 把 logs 匯出 BigQuery / GCS / Pub/Sub
BigQuery 適合長期 + 分析查詢（SQL）
對應 4.C3 Healthcare retention
Cost：BigQuery storage 比 Cloud Logging cheaper

Error Reporting

子議題：

自動聚合 application error
各語言 client library（Python / Java / Node / Go）
跟 Sentry 對照（Sentry 更深 / 更廣）

Cloud Monitoring agent

子議題：

Ops Agent（取代 Stackdriver agent）：統一 logs + metrics 採集
支援 GCE / Bare metal / AWS / on-prem
配置：YAML config + receivers / processors / exporters（類 OTel Collector）

Multi-project / Multi-region 治理

子議題：

Aggregated logging sink：跨 project 集中 logs
Cross-project SLO
Workspace（前 Stackdriver workspace）已 deprecated、改用 Metrics Scope

OTLP integration

子議題：

Cloud Trace 接受 OTLP（2024 GA）
Cloud Monitoring 接受 OTel metrics（via OTel Collector + GCP exporter）
Logs in OTel 跟 Cloud Logging 整合（成熟中）
對應 4.C5 Cloud Trace OTLP

排錯快速判讀

Logs 沒出現

操作原則：先看 resource type / project 是否對、再看 IAM 權限。

# TODO: gcloud logging read --project= --resource-type=...

Monitoring 查不到 metric

操作原則：metric name + project + filter 是否對。對應 Metrics Explorer 確認 metric 存在。

SLO alert noise

操作原則：multi-window burn rate 設計避免噪音。
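The multi-window idea can be stated in a few lines: burn rate is the observed error ratio divided by the error budget (1 − SLO), and a page fires only when both a long window and a short window exceed the threshold. The sketch below uses 14.4 as the default threshold, the rate at which 2% of a 30-day budget is consumed in one hour. The window choice and function names are illustrative assumptions, not Cloud Monitoring's alerting API.

```python
def burn_rate(error_ratio, slo):
    """Error budget consumption speed: 1.0 means the budget lasts exactly the SLO window."""
    return error_ratio / (1.0 - slo)


def should_page(long_window_errors, short_window_errors, slo, threshold=14.4):
    """Page only if both windows burn faster than the threshold.

    The long window proves the problem is significant; the short window
    proves it is still happening, which suppresses alerts for recovered blips.
    """
    return (
        burn_rate(long_window_errors, slo) >= threshold
        and burn_rate(short_window_errors, slo) >= threshold
    )


SLO = 0.999  # 99.9% availability objective

# Sustained incident: 2% errors over 1h and 1.6% over the last 5m.
print(should_page(0.02, 0.016, SLO))
# Recovered blip: the 1h window is still hot, the last 5m is back to normal.
print(should_page(0.02, 0.0005, SLO))
```

The second case is the noise the single-window design produces: the long window alone would keep paging for close to an hour after recovery.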

Cloud Trace 太空

操作原則：sampling 不足或 SDK 沒配置。判讀：Cloud Trace 看 span count + 確認 SDK Cloud Trace exporter 設定。

BigQuery 匯出 cost 爆

操作原則：sink filter 沒收斂、所有 logs 都匯。判讀：Cloud Logging usage 看 export volume。

何時改走其他服務

需求形狀	改走
多雲統一觀測	Datadog / Grafana Stack / OTel
進階 APM 廣度	Datadog
High-cardinality debug	Honeycomb
Logs full-text 進階	Elastic / Loki
AWS / Azure 生態	CloudWatch / Azure Monitor
Error tracking 進階	Sentry

不在本頁內的主題

gcloud / Cloud Console UI 操作詳細
各 GCP 服務的內建 metric 完整列表
Cloud Trace span structure 細節
BigQuery SQL syntax

案例回寫

直接相關案例

案例	主討論議題
4.C5 Cloud Trace OTLP	OTLP 在 GCP 的採用路徑

跨 vendor 對照

案例	對 Cloud Operations 的對應
4.C1 Fintech audit	Cloud Logging + BigQuery 作為審計證據與長期分析
4.C3 Healthcare retention	BigQuery 匯出長期 retention
4.C9 OTel migration signal drift	（反例）Cloud Trace ↔ OTLP 雙軌語意對齊
4.C10 規模對照	GCP-only 場景優先 Cloud Operations

下一步路由

上游概念：4.17 Telemetry Data Quality
平行 vendor：OpenTelemetry、CloudWatch
下游能力：4.20 Observability Evidence Package

Instatus

Fri, 01 May 2026 00:00:00 +0000

Instatus 是輕量 status page SaaS、承擔三個責任：簡潔現代 UI 的 status page、component + incident management、跟 IR 工具整合（incident.io / Rootly / FireHydrant）。設計取捨偏向「價格親民 + UI 現代 + 中小團隊適用」、是 Atlassian Statuspage 的 budget-friendly 替代。

服務定位

Instatus 主打 fast + cheap + custom domain、產品形狀直接對標 Atlassian Statuspage 的核心功能（component / incident / subscriber / custom domain），但價格約 1/3-1/5、free tier 就包含 custom domain SSL。typical 客戶是中小 SaaS、indie hacker / 個人 project、不需要 enterprise SLA 但要對外呈現專業感的團隊；不適合需要 audit log、SAML SSO、複雜 access role、SLA 報表的大企業 — 那是 Statuspage / FireHydrant status 模組的場域。

Instatus 的取捨設計：UI 走 modern + minimal、頁面 load 快（自稱 ~50ms）、subscriber notification provider 多元（Email / SMS / Slack / Discord / Teams / Telegram / RSS / Webhook），用 generous free tier 拉初期用戶、進階功能（更多 component、更多 subscriber、white-label、SLA report）走分層 pricing。

關鍵張力：cheap + custom domain from free tier ↔ enterprise governance（SAML / audit / role）。Instatus 故意把 enterprise governance 砍掉以壓 pricing、所以團隊規模成長到需要區分多角色 / 留 audit trail 時、會撞到產品天花板、要評估遷移。提早估算 什麼時候撞到天花板 比事故當下才發現省事很多。

本章目標

建 Instatus + 設 component
寫 incident template + update
配置 subscriber notification
API 從 IR 平台 push
評估 Instatus vs Statuspage / Cachet

最短判讀路徑

判斷 Instatus 是否健康承載對外狀態揭露、最少看四件事：

誰能 publish update：team member 角色設計（admin / member / read-only）、incident update 是否走 PR / approval、誤發 update 的回收路徑（edit / delete + email correction）
Component 數量 vs pricing tier：current tier 的 component limit、現有 / 規劃中的 component 數、跨 tier 切換的成本影響（升 tier 還是合併 component）
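Custom domain SSL: whether the CNAME for status.example.com resolves, whether automatic SSL renewal is healthy (Instatus issues certificates automatically through Let’s Encrypt; if the zone already publishes CAA records, they must authorize letsencrypt.org), and the migration process for future domain changes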
Subscriber notification 健康度：subscriber 數量是否逼近 tier 限制、Email / SMS provider quota / bounce rate、Slack / Discord webhook 是否還有效

四件事任一缺失、就是事故揭露通道有風險、應該優先補完。

日常操作與決策形狀

Component / incident + Subscriber

Component 是對外揭露單位、status（operational / degraded / partial outage / major outage / maintenance）的抽象顆粒度影響事故揭露的 精準度 — 拆太細用戶看不懂、太粗反而失真。實務上跟內部 service map 對齊但 外部可理解語言、例如「Web App」「API」「Login」「Webhooks」、而不是內部 microservice 名稱。

子議題：

Component status（跟 Statuspage 相似、操作 surface 簡潔）
Incident template + maintenance window（pre-defined template 讓事故 update 走標準格式、避免臨場寫錯）
Email / SMS / Slack / RSS / Discord / Teams / Telegram / Webhook subscriber、各 channel 的 quota / 失敗模式不同

API + IR 整合

REST API 用 token 認證、可程式化 create incident / update / resolve / 改 component status。典型整合：incident.io / Rootly / FireHydrant 觸發事故後同步推 Instatus、避免 SOC / on-call 還要手動雙寫。webhook 也支援反向通知、Instatus 上的 incident 變更通知到 IR 平台。

token 是高權限資源（任何持有 token 的 caller 可對外發布 incident）、應該存在 secrets manager、不放程式碼 / 環境變數明文、定期 rotate；CI / IR 平台用獨立 token、出事可單獨 revoke 不影響其他整合。

核心取捨表

取捨維度	Instatus	Atlassian Statuspage	Better Stack Status	Cachet (OSS)
計費模型	分層 SaaS、free tier 含 custom domain	分層 SaaS、custom domain 需付費 tier	分層 SaaS、跟 monitoring 綁	OSS 自管、零 license 成本
UI / 速度	現代 + 快（~50ms load）	成熟但偏重	現代、跟 monitoring 整合	基本、視自管 stack
Custom domain	free tier 即支援、auto SSL	付費 tier、auto SSL	付費 tier	自架 + 自管 cert
Subscriber	Email / SMS / Slack / Discord / Teams / Telegram / RSS / Webhook	同類但部分需高 tier	Email / Slack 為主	自實作
適合場景	中小 SaaS / indie hacker / 個人 project	Enterprise + 跨團隊治理	已用 Better Stack monitoring	嚴格資料自管、零外部 SaaS
退場成本	低 — 標準 component / incident 結構	中	中	高 — 自管 ops

選 Instatus 的核心訴求：cheap + fast UI + custom domain 從 free tier 就有、且不需要 enterprise SLA / SAML / audit 報表。組織成長到要 SAML SSO / multi-team approval / SLA report 時、再評估遷移到 Statuspage 或 IR 平台內建 status。

遷移成本：標準 component / incident 結構讓 Instatus → Statuspage 的搬遷相對單純（資料模型一致、subscriber 列表可匯出）、但 subscriber 重新確認 opt-in 通常是最大痛點 — 切換 domain / provider 時、許多 email subscriber 不會自動轉移、要走再次訂閱流程。

進階主題（按需閱讀）

Custom CSS + branding + Multi-language

status.example.com 走 CNAME 指到 Instatus 配發的 host、SSL 由 Instatus 透過 Let’s Encrypt 自動簽發 + renew、不用自己管 cert。custom CSS / logo 在中高 tier 開放、可改色票 / 字型 / layout、適合需要跟主站視覺一致的 SaaS；不要為了美觀過度客製、status page 第一順位是 清楚揭露事故、視覺只是輔助。

multi-language 支援同一 incident 用多語 update、適合對外服務跨地區用戶。注意 誰負責翻譯 — 事故當下沒人有空一條條翻、實務上 incident update 寫英文 + 主要語言、其餘語言用 fallback 或事後補。

IR 平台 auto-create incident

Instatus 提供 REST API + webhook、典型整合是 IR 平台偵測事故後 自動 create + update status page incident、收尾時 自動 resolve。常見 pattern：PagerDuty / Opsgenie 觸發 high-severity alert → webhook → Instatus API create incident → resolve 時同步收尾。

要點是 誰是 SSoT：incident timeline 由 IR 平台維護、Instatus 是對外揭露 view、不能讓 status page 變第二份 timeline 否則兩邊會漂移。實務上對外揭露的 update 是 IR timeline 的 過濾子集（去掉內部 root cause / 人名 / 攻擊細節）、不是原文同步。
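The "filtered subset" rule can be made explicit in the integration code, so that the status page never receives internal-only entries by accident. The sketch below assumes a hypothetical IR timeline in which each entry carries an audience marker plus free-form internal fields. It forwards only public entries and only an allowlist of keys, and it is not tied to the Instatus API or to any particular IR platform.

```python
# Only these keys ever leave the IR platform; everything else stays internal.
PUBLIC_FIELDS = ("timestamp", "status", "message")


def public_updates(timeline):
    """Return the externally publishable subset of an IR timeline."""
    return [
        {key: entry[key] for key in PUBLIC_FIELDS}
        for entry in timeline
        if entry.get("audience") == "public"
    ]


# Hypothetical IR timeline; field names are illustrative.
ir_timeline = [
    {"timestamp": "2026-05-01T10:02:00Z", "status": "investigating",
     "message": "We are investigating elevated API errors.",
     "audience": "public", "author": "oncall-a"},
    {"timestamp": "2026-05-01T10:10:00Z", "status": "investigating",
     "message": "Root cause: bad deploy of internal service X.",
     "audience": "internal", "author": "oncall-b"},
    {"timestamp": "2026-05-01T10:40:00Z", "status": "resolved",
     "message": "API error rates are back to normal.",
     "audience": "public", "author": "oncall-a"},
]

updates = public_updates(ir_timeline)
for update in updates:
    print(update)
```

Using an allowlist rather than a blocklist means a newly added internal field, such as an author or root-cause note, is excluded by default.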

Metrics 公開

子議題：uptime / response time、從 monitor source（如外部 uptime monitor、或自家 metrics）拉資料、決定哪些 metric 對外揭露。揭露太細（例：每個 endpoint p99）會讓潛在攻擊者 reverse-engineer attack surface 跟容量上限；只揭露用戶感受得到的 SLI（前台 availability / API success rate）通常足夠、敏感內部指標留在內部 dashboard。

排錯快速判讀

Subscriber 沒收到：跟 Statuspage 類似、provider quota / bounce / spam filter；SMS 在某些地區需要區號白名單；事故當下若大量 subscriber 同時收到 alert、Email provider 可能短時間 throttle、要留 buffer
Custom domain 失效：DNS CNAME 設定錯 / Let’s Encrypt 簽發失敗（CAA record 衝突、需在 DNS 加 letsencrypt.org 授權）/ SSL renew 卡住 — 事故發生時才發現 cert 過期是最常見的二次事故
API 失敗：rate limit / token 失效 / webhook signature 驗證錯誤；高 severity 事故時 IR 平台可能短時間發大量 update、要確認 rate limit 不會把 update 卡住
Pricing tier 切換成本：升 tier 取得更多 component / subscriber、但降 tier 可能要先刪 component 或 subscriber 才生效、規劃要先估好成長曲線
Subscriber list 上限：tier 有 subscriber 上限、逼近時要嘛升 tier、要嘛清理 inactive subscriber（長期 bounce / unsubscribe）；不要等到滿了才處理、新 subscriber 註冊失敗會直接傷品牌信任

何時改走其他服務

需求形狀	改走
Enterprise SLA / SAML SSO / audit	Atlassian Statuspage
OSS 自管 / 嚴格資料留在自家環境	Cachet
IR 平台內建 status	FireHydrant
Alert / on-call SSoT	PagerDuty / Opsgenie

不在本頁內的主題

完整 API reference / Pricing 細節 / Custom CSS 範本
SLA report 設計（Instatus 提供基本 uptime 計算、複雜 SLA 報表走 Statuspage 或 IR 平台）
Status page 對外揭露的法務 / 合約義務（合約 SLA、credit 計算）— 屬法務 / 商務、不在本頁
IR timeline 設計本身（誰寫、誰簽 — 屬 8.19 Incident Decision Log 的範圍）

案例回寫

Instatus 主打輕量、低成本公開狀態頁：本案例庫的案例多為大型平台、以 Atlassian Statuspage 揭露事故；Instatus 缺乏直接 vendor-level case、可參照的閱讀脈絡是「事故對外揭露的最小可行樣式」、特別適合中小 SaaS 跟 indie 開發者拿來對照自家 status page 的最低門檻。

案例	對應主題	對 Instatus 用戶的啟示
Heroku cases	平台型服務的 component 拆分與訂閱範例	component 拆分顆粒度可借鏡（Web / API / Build / Dyno）、中小 SaaS 不需要拆到 region 等級、但要分前後台
Discord cases	事件導向產品的最小事故時序揭露對照	incident update 節奏 — 第一則確認、後續更新、resolve 收尾、indie 級服務也至少跑這三段、不能 silent recovery

待補 candidate：從 Statuspage 遷移至 Instatus 的中小型 SaaS cost-saving story、indie hacker 個人 project 從零搭 status page 的最小配置（含 custom domain + 一個 component + 一個 incident template）。

下一步路由

上游：8.19 Incident Decision Log（決定哪些 timeline event 該對外揭露）
平行：Atlassian Statuspage、FireHydrant、PagerDuty、Opsgenie
下游：8.22 Incident Evidence Write-back（事故結束後對外揭露的 timeline / post-mortem 整理）
跨類：8 事故處理 vendor 清單（一次看完 IR / status / on-call vendor map）

LitmusChaos

Fri, 01 May 2026 00:00:00 +0000

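LitmusChaos is a CNCF-hosted chaos engineering platform for Kubernetes (CNCF lists it as an incubating project rather than a graduated one) with three responsibilities: the ChaosHub experiment marketplace (ready-made experiments you can use directly), ChaosWorkflow for orchestrating multi-step experiments, and probe-based steady state validation. Its design trade-offs lean toward a ready-made experiment library, a workflow-centric model, and CNCF governance. It is a close competitor to Chaos Mesh; the commercial edition comes from Harness, which acquired ChaosNative.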

本章目標

讀完本章後、你應該能：

部署 Litmus 到 K8s
從 ChaosHub 引用現成 experiment
寫 ChaosWorkflow（多步驟 + probe）
設計 Probe（HTTP / Cmd / K8s / Prometheus）做 steady state
評估 Litmus vs Chaos Mesh vs Gremlin 的選用

最短路徑：5 分鐘把 Litmus 跑起來

# 1. Install
# TODO: helm install litmus litmus/litmus -n litmus --create-namespace

# 2. Pull an experiment from ChaosHub
# TODO: kubectl apply -f https://hub.litmuschaos.io/...

# 3. Run the experiment and inspect the ChaosResult
# TODO: kubectl apply -f chaosengine.yaml
# TODO: kubectl describe chaosresult

日常操作與決策形狀

CRD 設計

子議題：

ChaosExperiment：experiment 定義
ChaosEngine：bind experiment 到 target
ChaosResult：執行結果

ChaosHub experiment

子議題：

現成 experiment marketplace
Generic / Kafka / Cassandra / GCP / AWS / VMware experiments
自訂 experiment 上傳 Hub

ChaosWorkflow

子議題：

Argo Workflow-based
多步驟 chaos 編排
Schedule trigger

進階主題（按需閱讀）

Probe-based steady state

Subtopics:

HTTP probe / Cmd probe / K8s probe / Prometheus probe
Running probes concurrently with the chaos injection vs sequentially before and after it
Success threshold design
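A success threshold can be expressed as a small predicate over probe samples. The sketch below is not Litmus's probe implementation: the sample shape `(succeeded, latency_ms)` and the default thresholds are assumptions chosen for illustration, and a real probe would take its criteria from the experiment definition.

```python
import math


def steady_state_ok(samples, min_success_rate=0.99, max_p95_latency_ms=300.0):
    """Evaluate a steady state from probe samples.

    samples: list of (succeeded: bool, latency_ms: float) tuples.
    The thresholds are hypothetical defaults, not Litmus settings.
    """
    if not samples:
        return False  # no evidence is treated as a failed steady state

    successes = sum(1 for ok, _ in samples if ok)
    success_rate = successes / len(samples)

    latencies = sorted(latency for _, latency in samples)
    # Nearest-rank p95: the smallest value with at least 95% of samples at or below it.
    rank = max(1, math.ceil(0.95 * len(latencies)))
    p95 = latencies[rank - 1]

    return success_rate >= min_success_rate and p95 <= max_p95_latency_ms
```

Treating an empty sample set as a failure is a deliberate choice: a probe that could not collect data should stop the experiment instead of letting it continue.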

ChaosCenter（control plane）

子議題：

跨 cluster chaos 管理
ChaosResult dashboard
RBAC 控制

Harness ChaosNative（商業）

子議題：

商業支援版本
跟 Harness CD 整合
Enterprise governance

跟 Chaos Mesh 對照

子議題：

Litmus：workflow-centric、ChaosHub
Chaos Mesh：CRD-driven、Dashboard 友善
選擇判讀：現成 experiment 庫 → Litmus；fault types 多樣 → Chaos Mesh

Chaos as Code

子議題：

ChaosWorkflow YAML version control
GitOps integration
PR-based chaos review

排錯快速判讀

Experiment fail to start

操作原則：ServiceAccount + RBAC 不對、experiment image pull 失敗。判讀：kubectl describe chaosengine。

Probe 失敗

操作原則：probe 條件設錯 / target 沒準備好。判讀：ChaosResult 看 probe verdict。

Hub experiment 引用版本不對

操作原則：experiment.yaml 跟 Litmus version 不對齊。判讀：Litmus version + experiment compatibility。

Workflow 卡住

操作原則：Argo Workflow 卡 → 看 Argo pod log。

何時改走其他服務

需求形狀	改走
多 fault types / Dashboard	Chaos Mesh
非 K8s / 商業	Gremlin
Integration test	Toxiproxy
AWS-native	AWS Fault Injection Service

不在本頁內的主題

ChaosHub 各 experiment 詳細 parameter
Argo Workflow 內部
Litmus 商業版本 detail

案例回寫

案例方向	對應主題
Netflix：Steady State、Chaos 與 FIT	hypothesis-driven experiment 對應 ChaosHub workflow
Spotify：平台工程與可靠性契約	squad-based 採用 chaos 的平台化路徑

Case 庫稀薄：本 cases/ 目錄目前沒有以 LitmusChaos 為主軸的案例。

LitmusChaos customer cases to add: adoption stories since the project joined CNCF, and Harness ChaosNative customers
候選 case：Meta（K8s-native region failover chaos）、Microsoft（Chaos Studio 對照組）— 若未來收錄需先在 cases/ 補正文

下一步路由

上游概念：6.20 Experiment Safety Boundary
平行 vendor：Chaos Mesh、Gremlin
下游能力：8 incident response

Traefik

Fri, 01 May 2026 00:00:00 +0000

Traefik 是 cloud-native reverse proxy / ingress、承擔三個責任：auto-discovery（從 Docker / K8s / Consul / file 自動發現 backend）、dynamic config（不 reload、即時更新）、ACME 自動 TLS（Let’s Encrypt 整合）。設計取捨偏向「cloud-native 簡潔 + auto-discovery 為核心 + middleware chain extensibility」、適合 Docker / K8s 中小規模、大規模 / 複雜 traffic management 跟 nginx / envoy 比相對弱。

對「Docker / K8s ingress、需要 auto-discovery、ACME 自動 TLS、配置簡潔」這條路徑、Traefik 是 cloud-native first 選擇。

本章目標

讀完本章後、你應該能：

部署 Traefik 到 Docker / K8s
配置 dynamic provider（labels / annotations / CRD / file）
配置 ACME 自動 TLS
設計 middleware chain（auth / rate limit / circuit breaker）
評估 Traefik vs nginx vs Envoy 的選用

最短路徑：5 分鐘把 Traefik 跑起來

# 1. Run Traefik with the dashboard in Docker
docker run -d -p 80:80 -p 8080:8080 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  traefik:v3 --api.insecure=true --providers.docker

# 2. Configure routing with a Docker label
docker run -d --label "traefik.http.routers.demo.rule=Host(\`demo.local\`)" nginx

# 3. Query the API to verify the router was registered
curl -s http://localhost:8080/api/http/routers | jq '.[].rule'

日常操作與決策形狀

Provider auto-discovery

子議題：

Docker provider：從 container labels 讀 config
Kubernetes Ingress provider：從 Ingress resource
Kubernetes CRD provider：Traefik IngressRoute CRD
Consul / Etcd provider：從 KV store
File provider：YAML / TOML 靜態 file

IngressRoute（K8s CRD）

子議題：

Traefik CRD：IngressRoute / Middleware / TLSOption / ServersTransport
比 Ingress 表達力強（middleware chain / TLS option / multi-protocol）
跟 Gateway API 對比

Middleware chain

子議題：

Built-in middleware: headers / rate limit / basic auth / forward auth / retry / circuit breaker / compress / IP allow list (`IPAllowList` in v3, which replaces v2's `IPWhiteList`)
自訂 middleware：plugin（Yaegi-based）
順序：定義 middleware → 在 router 引用

進階主題（按需閱讀）

ACME 自動 TLS

子議題：

Let’s Encrypt 整合（自動憑證 + 續期）
DNS challenge（適合 wildcard）vs HTTP challenge（適合單 domain）
多 resolver 配置（staging / production / 不同 CA）
對應 ACME storage（local / KV / Traefik Hub）

Provider weight / priority

子議題：

多 provider 同時跑、config 來源衝突處理
Provider 優先順序
對應 dynamic config debug

Traefik Hub（managed）

子議題：

Traefik Hub：商業 managed control plane
適合：跨 cluster 統一管理 / API Gateway portal
跟 self-host Traefik 對比

跟 nginx / Envoy 對比

子議題：

Traefik 強：cloud-native auto-discovery、配置簡潔
nginx 強：穩定 + 配置控制力 + 大量 community recipe
Envoy 強：xDS dynamic config、advanced traffic management
選型判讀：Docker / K8s 小中規模 → Traefik；複雜 traffic → Envoy；標準 HTTP → nginx

Plugin 機制（Yaegi）

子議題：

Traefik plugins 用 Yaegi（Go interpreter）跑、不需 recompile
Plugin catalog（社群 + 官方）
適合：客戶 auth / metric / transformation 小邏輯
對應 Envoy WASM extension 對比

Multi-protocol

子議題：

HTTP / HTTPS / TCP / UDP
gRPC（HTTP/2）原生支援
WebSocket sticky session

排錯快速判讀

Service 沒被發現

操作原則：先看 provider 是否啟用、再看 label / annotation / CRD 配置。

curl -s http://localhost:8080/api/http/services | jq '.[].name'

Route 衝突

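Operating principle: when two routers match the same request, the one with the higher priority wins. If no priority is set, Traefik uses the length of the rule as the priority, so the longer rule is evaluated first. Diagnosis: check the router list on the dashboard.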
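The ordering can be modelled locally when you want to reason about a conflict before touching the cluster. This is a simplified model, not Traefik code: it assumes every router in the list already matches the request, represents routers as plain dicts with hypothetical `name`, `rule`, and `priority` keys, and treats a missing or zero priority as "use the rule length".

```python
def effective_priority(router):
    """Explicit priority if set, otherwise the length of the rule string."""
    explicit = router.get("priority")
    return explicit if explicit else len(router["rule"])


def pick_router(routers):
    """Return the name of the router that would be evaluated first.

    Assumes all routers in the list match the incoming request.
    """
    return max(routers, key=effective_priority)["name"]
```

For example, `Host(...) && PathPrefix(...)` outranks a bare `Host(...)` rule by default, and an explicit priority on the shorter rule reverses that.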

ACME rate limit

操作原則：Let’s Encrypt 有 rate limit、staging environment 先測再切 production。

Middleware chain 順序錯

Operating principle: middleware order changes behavior (for example, auth before rate limit versus after it). Diagnosis: check the middleware order on the dashboard.
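Why order matters is easiest to see with a toy chain. The sketch below is a generic middleware model, not Traefik's implementation; `make_auth`, `make_rate_limit`, and `build_chain` are hypothetical helpers, and requests are plain dicts.

```python
def make_auth(valid_tokens):
    """Middleware that rejects requests without a known token."""
    def auth(request, next_handler):
        if request.get("token") not in valid_tokens:
            return 401
        return next_handler(request)
    return auth


def make_rate_limit(limit):
    """Middleware that allows `limit` requests, then answers 429."""
    state = {"seen": 0}

    def rate_limit(request, next_handler):
        state["seen"] += 1
        if state["seen"] > limit:
            return 429
        return next_handler(request)
    return rate_limit, state


def _wrap(middleware, next_handler):
    def wrapped(request):
        return middleware(request, next_handler)
    return wrapped


def build_chain(middlewares, backend):
    """Compose middlewares so the first one listed runs first."""
    handler = backend
    for middleware in reversed(middlewares):
        handler = _wrap(middleware, handler)
    return handler
```

With auth first, rejected requests never reach the rate limiter and do not consume its budget. With the rate limiter first, unauthenticated traffic can exhaust the budget and cause legitimate requests to receive 429.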

Dashboard 連不上

操作原則：dashboard 預設 8080、需要 entrypoint 配置。判讀：traefik.yml + entrypoints 設定。

何時改走其他服務

需求形狀	改走
配置控制力 / 大量 community 模板	nginx
Advanced traffic / xDS	Envoy
AWS managed	AWS ELB
Service mesh	Istio / Linkerd / Consul Connect
Gateway API standard	Envoy Gateway / Contour
純 dev / local	Docker Compose + direct port mapping

不在本頁內的主題

Traefik plugin 開發
Yaegi Go interpreter 細節
Traefik Hub 商業細節
各 cloud provider 整合差異

案例回寫

跨 vendor 對照

案例	對 Traefik 的對應
5.C9 cutover without drain	Traefik auto-discovery 在 service 下線時、要靠 health check + grace period 等價 drain
5.C10 規模對照	Docker / K8s 中小規模選 Traefik 簡潔、大規模通常升階到 Envoy / ingress-nginx 或 mesh

待補 Traefik 案例：Traefik Labs customer story、IngressRoute CRD 大規模採用、Traefik Hub 早期 adopter。

下一步路由

上游概念：5.3 LB Contract
平行 vendor：nginx、Envoy
下游能力：Kubernetes vendor 頁

AWS ACM

Mon, 18 May 2026 00:00:00 +0000

AWS Certificate Manager (ACM) 是 AWS-managed 的 certificate provisioning 服務、解決兩件事：public TLS cert 全自動化（Amazon Trust Services 簽發、DNS validation 通過後 60 天前自動 renew）跟 AWS-managed service 的 cert 整合（ELB / CloudFront / API Gateway / App Runner 直接 attach、不需要客戶持有私鑰）。內部 mTLS / 自管 endpoint 的 private cert 走另一個產品 ACM Private CA（PCA）— ACM 是 frontend、PCA 是 自管 CA hierarchy backend。

服務定位

ACM 的核心定位是 AWS 平台內 cert 的全託管 lifecycle。客戶不持私鑰、不跑 ACME client、不手動 renew — 但代價是 ACM public cert 只能 attach 到 AWS-managed service（ELB / CloudFront / API Gateway / App Runner / Nitro Enclaves）、不能 export 給自管 Nginx / EC2 應用。Private cert 必須有 ACM Private CA (PCA) 後端、ACM 自己不是 CA。

跟其他 cert 工具的場景重疊度低、定位是分工互補：cert-manager 走 cluster 內 K8s workload cert（Ingress / service mesh）、Let’s Encrypt 走跨平台公共 ACME cert（可 export 任何地方使用）、ACM Private CA 走自管 CA hierarchy（root + intermediate、客戶控制 policy）。常見組合：AWS-native endpoint 用 ACM、K8s workload + 自管伺服器走 cert-manager + Let’s Encrypt、內部 mTLS root 走 PCA。詳細差異見「核心取捨表」。

本章目標

讀完本頁、讀者能判斷：

ACM public cert vs private cert vs imported cert 各自的使用邊界（能 attach 哪些 service、能不能 export）
DNS validation vs Email validation 的差異、跟 auto-renewal 條件的關聯
跨 region 跟 CloudFront 的 us-east-1 限制如何處理
何時 ACM 不夠用、要改走 cert-manager / Let’s Encrypt / ACM Private CA

最短判讀路徑

判斷 ACM cert 部署是否健康、最少看四件事：

Cert 跟 service 整合：cert ARN 是否真的 attach 到 ELB / CloudFront / API Gateway listener、DescribeCertificate 的 InUseBy 有沒有資源、有 cert 但沒 attach 等於 issue 失敗
DNS validation 設定：cert 是 DNS 還是 Email validation、DNS 的 CNAME record 是否還留在 DNS（auto-renewal 需要這條 record 持續存在）、Route53 vs 外部 DNS 的責任分界
Renewal status：DescribeCertificate 的 RenewalSummary.RenewalStatus 是 SUCCESS / PENDING_AUTO_RENEWAL / FAILED、失敗時 RenewalStatusReason 是什麼（多半是 DNS record 被刪、CNAME 不再回應）
CloudTrail 證據：RequestCertificate / ImportCertificate / DeleteCertificate 的 caller identity、是否有非預期的 cert 建立或刪除（防誤刪 / 惡意刪）

四件事任一缺失、就是 Transport Trust and Certificate Lifecycle 的覆蓋缺口。

日常操作與決策形狀

Request public cert：對 internet-facing endpoint（網站、API）issue public cert、走 RequestCertificate API、選 DNS validation。ACM 給一組 CNAME record、放進 DNS（Route53 可一鍵 create）、ACM 自動驗證 + issue。Cert 生效後 attach 到 ELB / CloudFront / API Gateway listener。Issuer 是 Amazon Trust Services、所有主流瀏覽器 / OS trust store 都認。

Request private cert（需 PCA 後端）：內部 service mTLS root、走 RequestCertificate 但指定 PCA ARN。ACM 透過 PCA 簽 cert、cert chain 是組織內部 CA hierarchy。Trust store 必須在各 workload 手動建立（不像 public cert 自動 trust）。

DNS validation vs Email validation：DNS validation 是預設 + 推薦 — CNAME record 放進 DNS 後、ACM 持續驗證 domain ownership、auto-renewal 全自動。Email validation 是 legacy、ACM 寄信到 domain 的 WHOIS / 預設 admin email、人工點連結驗證；auto-renewal 不會自動完成、cert 到期前必須手動 re-validate。Production 一律用 DNS validation。

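Auto-renewal conditions: ACM attempts renewal starting 60 days before the certificate expires, and the conditions are strict: (1) the cert is ACM-issued (not imported); (2) the DNS validation CNAME record still exists and resolves; (3) the cert is attached to at least one AWS service. If any of the three conditions fails, renewal does not complete automatically and the cert will expire. Imported certs are never renewed automatically and must be re-imported manually before expiry.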
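The three conditions can be checked against the certificate detail returned by DescribeCertificate. The sketch below takes an already-fetched `Certificate` dict and uses the response fields `Type`, `DomainValidationOptions[].ValidationMethod`, and `InUseBy`; the function name and the wording of the reasons are hypothetical. It cannot confirm that the CNAME record still resolves, which needs a separate DNS lookup.

```python
def auto_renew_blockers(cert):
    """Return reasons why managed renewal would not run.

    cert: the `Certificate` dict from a DescribeCertificate response.
    An empty list means no blocker was found in the API data.
    """
    if cert.get("Type") == "IMPORTED":
        return ["imported certificate: re-import manually before expiry"]

    blockers = []
    methods = {
        option.get("ValidationMethod")
        for option in cert.get("DomainValidationOptions", [])
    }
    if methods != {"DNS"}:
        blockers.append("not DNS-validated on every domain")
    if not cert.get("InUseBy"):
        blockers.append("not attached to any AWS resource")
    return blockers
```

Running this across every certificate in an account gives a quick list of certs that will need manual attention before they expire.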

跟 ELB / CloudFront / API Gateway 整合：ELB / API Gateway 用所在 region 的 ACM cert、CloudFront 例外 — 只認 us-east-1 region 的 ACM cert（CloudFront edge 是 global、cert metadata 統一從 us-east-1 拉）。Multi-region app 要在每個 region 各 request 一份 cert、CloudFront 那份固定放 us-east-1。

Imported certificate：自管 cert（外部 CA 簽的、舊系統遷移過來的）可以 import 進 ACM、拿到 ARN 後一樣 attach 到 AWS service。代價是 ACM 不會 renew、expiry 前必須手動 re-import 新版。常見事故源：imported cert 過期、AWS service 突然 serve expired cert、Browser 顯示警告。建議 imported cert 都設 CloudWatch alarm 監 DaysToExpiry。

跟 AWS IAM 整合：誰能 issue / delete cert 走 IAM policy 控制 — acm:RequestCertificate / acm:DeleteCertificate / acm:ImportCertificate。Tag-based access control 可以限定「只有帶 team=platform tag 的 cert 才能被 platform team IAM role 改」、防誤刪 production cert。Cert 是 region-scoped resource、IAM policy 可指定 Resource ARN 限定 region / cert ID。

核心取捨表

取捨維度	ACM (public)	ACM Private CA (PCA)	cert-manager + Let’s Encrypt	手動 OpenSSL CA
部署模型	AWS managed	AWS managed CA hierarchy	K8s cluster 內 self-hosted controller	手動腳本
私鑰持有	AWS 持有、客戶不能 export	AWS 持有 CA key、subordinate 可 export	cluster 內 Secret、可 export	自己持有
Issuer	Amazon Trust Services（public trust store）	客戶自管 CA（內部 trust）	Let’s Encrypt / 任何 ACME CA	自簽
適用 endpoint	AWS-managed service（ELB / CloudFront / API GW）	內部 mTLS、AWS service 也可用	K8s workload、Ingress、任何持有 PEM 的服務	實驗 / 內部小規模
Auto-renewal	DNS validation 全自動	透過 ACM 自動	cert-manager 自動	自己寫 cron
跨雲 / 跨平台	弱 — AWS 內	弱 — AWS 內	強 — K8s 在哪都可	強
計費	public cert 免費	per CA + per cert（PCA 較貴）	免費（Let’s Encrypt）	免費
適合場景	AWS-heavy + edge endpoint	內部 mTLS root + AWS 整合	K8s workload + 跨雲	實驗、極小規模
退場成本	中 — cert 重 issue 但 service 配置要改	高 — CA hierarchy 遷移痛苦	低 — PEM 在手、換 issuer 容易	低

選 ACM 的核心訴求：cert 主要 attach 到 AWS-managed service、希望 cert 完全 hands-off、不需要 export 私鑰、能接受 AWS lock-in。需要 export PEM 或跨雲 / 自管 endpoint、改走 cert-manager + Let’s Encrypt。需要內部 mTLS root + CA hierarchy 控制、走 ACM Private CA。

進階主題

ACM Private CA hierarchy：PCA 支援 root CA + 多層 intermediate CA、生產建議 root CA 離線（CA 簽完 intermediate 後 disable）、日常簽發走 subordinate CA。Subordinate CA compromise 時 revoke 該層、root 不受影響。Cert policy（path length、key usage、name constraint）在 CA 建立時設定、之後無法改、設計時要算對。

Cross-region cert（CloudFront 的 us-east-1 限制）：CloudFront 是 global service、但 attach 的 ACM cert 必須在 us-east-1。Multi-region 部署：每個 region 各 issue 一份 cert 給該 region 的 ELB / API Gateway、CloudFront 的那份單獨在 us-east-1 issue。Terraform / CloudFormation 要顯式宣告 provider region。

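Imported certs and the auto-renewal boundary: ACM knows an imported cert (signed by an external CA) exists and lets you attach it, but does not renew it. A common incident: the team imports a cert and forgets about it; months later it expires; CloudFront / ELB serves the expired cert; customers see browser warnings. Countermeasure: set a CloudWatch alarm on DaysToExpiry < 30 for every imported cert, and route the ACM Certificate Approaching Expiration event through EventBridge to PagerDuty. The long-term strategy is to migrate imported certs to ACM-issued certs (where domain ownership can be validated).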
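The same 30-day rule can be applied locally to certificate details, for example in a scheduled audit script that backs up the CloudWatch alarm. This is a sketch under assumptions: it reads the `Type` and `NotAfter` fields of an already-fetched `Certificate` dict (with `NotAfter` as a timezone-aware `datetime`), and the function names and default threshold are illustrative.

```python
from datetime import datetime, timezone


def days_to_expiry(not_after, now):
    """Whole days remaining until the certificate's NotAfter timestamp."""
    return (not_after - now).days


def imported_cert_needs_alert(cert, now, threshold_days=30):
    """True when an imported certificate is inside the alert window."""
    if cert.get("Type") != "IMPORTED":
        return False  # ACM-issued certs are covered by managed renewal checks
    return days_to_expiry(cert["NotAfter"], now) < threshold_days
```

Passing `now` explicitly keeps the function deterministic; a caller would normally supply `datetime.now(timezone.utc)`.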

Tag-based access control：cert 加 tag（team=platform、env=prod）後、IAM policy 用 Condition 限定：只有同 tag 的 role 才能 update / delete。防誤刪 production cert（dev IAM role 跑 cleanup script 不會誤刪 prod）。配合 AWS IAM 的 ABAC 模型運作。

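Wildcard and SAN certs: ACM supports wildcard names (*.example.com covers exactly one subdomain level) and SAN (multiple domains on one cert, up to 100). A wildcard simplifies deployment but has a large blast radius: one compromised cert affects the whole subdomain tree. SAN certs are finer-grained but cost more to manage. For production, split along service boundaries: one cert per service and no shared wildcard, unless you really do have many short-lived subdomains.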
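The "one subdomain level" rule is a frequent source of surprise, so here is a minimal matcher. It is an illustrative sketch of the matching rule only (leftmost-label wildcard, case-insensitive), not a full hostname verification routine; `wildcard_covers` is a hypothetical helper name.

```python
def wildcard_covers(pattern, hostname):
    """Return True if a certificate name covers the hostname.

    A leading "*." matches exactly one DNS label; anything else must match exactly.
    """
    pattern = pattern.lower().rstrip(".")
    hostname = hostname.lower().rstrip(".")
    if not pattern.startswith("*."):
        return pattern == hostname

    suffix = pattern[1:]  # e.g. ".example.com"
    if not hostname.endswith(suffix):
        return False
    label = hostname[: -len(suffix)]
    # The wildcard must stand for one non-empty label with no further dots.
    return label != "" and "." not in label
```

So `*.example.com` covers `api.example.com` but neither `a.b.example.com` nor the apex `example.com`, which is why the apex usually has to be added as a separate name.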

排錯與失敗快速判讀

Cert PENDING_VALIDATION 一直卡住：DNS validation CNAME record 沒放對、或 DNS provider 緩存太久 — 用 dig 直接查 CNAME 是否生效、Route53 + ACM 整合通常幾分鐘、外部 DNS 可能 30 分鐘以上
Cert renewal FAILED：RenewalStatusReason 多半是 DOMAIN_VALIDATION_DENIED（CNAME record 被刪了）或 cert 沒 attach 到任何 service — 補回 CNAME record、或把 cert attach 到至少一個 resource
CloudFront 找不到 cert：cert 在 us-east-1 以外的 region issue — 在 us-east-1 重 issue、或用 Terraform 顯式跨 provider 設定
Imported cert expired：忘了 manual renewal、AWS service serve expired cert — CloudWatch alarm + EventBridge 推 alert、長期遷成 ACM-issued
ACM cert 無法用在 EC2 自管 Nginx：public cert 私鑰不能 export 是設計限制 — 改用 ACM Private CA 或 Let’s Encrypt + cert-manager
誤刪 production cert：沒設 tag-based protection、admin script bug — 開 deletion protection（暫時無內建、用 IAM Condition 限定 delete operation + 24h cooldown via Lambda）+ CloudTrail alert 上 acm:DeleteCertificate
Cross-account cert 共用：ACM cert 不支援 RAM 共用 — 跨 account 要在每個 account 各 issue（或用 PCA + RAM 共用 PCA、各 account 從 PCA issue）

何時改走其他服務

需求形狀	改走
K8s workload mTLS / Ingress TLS	cert-manager + Let’s Encrypt / 內部 issuer
自管 Nginx / EC2 / 跨雲 endpoint	Let’s Encrypt + 自管 ACME client
內部 mTLS root + CA hierarchy 控制	ACM Private CA（PCA）或 HashiCorp Vault PKI engine
Workload identity（SPIFFE）跨平台	SPIRE
Cert renewal 證據鏈（rotation evidence）	7.5 Credential Rotation Scoped Evidence
Cert + session invalidation 邊界	7.3 入口治理、cert renew 跟 session token 是兩條獨立 lifecycle

不在本頁內的主題

ACM Private CA 完整 hierarchy 設計（root CA 離線儲存、HSM-backed CA key、CRL / OCSP responder 部署）
ACM API 完整 CLI reference 跟 Terraform resource 詳盡欄位
TLS protocol 本身（TLS 1.2 vs 1.3、cipher suite、handshake 流程）
Certificate Transparency log 跟 SCT embedding 內部機制
各 browser / OS trust store 的更新週期

案例回寫

ACM 在 07 案例庫沒有直接 vendor-level 事件、以下採對照引用：

案例	跟 ACM 的關係（對照）
Transport Trust and Certificate Lifecycle (section)	ACM 是 AWS 平台 cert lifecycle 自動化的具體落地 — DNS validation + auto-renewal 是自動化覆蓋率的指標、imported cert 是覆蓋缺口、要單獨設 alarm 兜底
Citrix Bleed 2023 Session Hijack	對照啟示 — cert 自動 renew 不等於 session 自動 invalidate、舊 session token 在新 cert 下仍可重放、session lifecycle 是另一層責任、不在 ACM 範圍
Credential Rotation Scoped Evidence (section)	ACM renewal 自動、但 Certificate Transparency log 比對 + fleet-wide trust bundle update 是另一條 evidence chain、要跟 SBOM / CMDB 對齊

下一步路由

上游：7.4 傳輸信任與憑證生命週期、7.3 入口治理與伺服器防護
平行：cert-manager、Let’s Encrypt、SPIRE
下游：AWS IAM（誰能 issue / delete cert）、AWS KMS（PCA CA key 後端）
跨模組：8 事故處理 vendor 清單（cert expiry / mis-issuance 進 IR 流程）
官方：AWS Certificate Manager Documentation

Azure Cosmos DB

Wed, 13 May 2026 00:00:00 +0000

Azure Cosmos DB 是 Microsoft 全球分散式 multi-model database、提供 SQL / MongoDB / Cassandra / Gremlin / Table 五種 API、五個 consistency levels、自動 multi-region write。Microsoft 自家 Microsoft 365 用它做 analytics、ASOS 在 Black Friday 撐 1.67 億請求 24 小時、Minecraft Earth 測試 1M RU/s — 是 Azure 上 NoSQL / Document 工作負載的旗艦。

教學路線：Multi-model API 與全球寫入

Cosmos DB 服務頁的教學目標是把 API model、consistency level、RU/s、logical partition 與 multi-region write 放在同一個 Azure 服務決策中。讀者讀完後要能判斷 Cosmos DB 是遷移相容層、全球 NoSQL 平台，還是特定 Azure workload 的容量抽象。

學習段	核心問題	對應段落
API model	SQL API、MongoDB API、Cassandra API 各自服務哪種遷移或資料形狀	定位、跟其他 vendor 的取捨
Consistency level	session、bounded staleness、strong consistency 如何改變產品語意	容量規劃要點、Consistency Level
RU/s capacity	request unit 如何把 query、index、payload 轉成成本與節流	容量特性、案例對照
Global write	multi-region write 何時值得承擔衝突與一致性成本	適用場景、案例對照
替代路由	何時用 MongoDB、DynamoDB、Spanner、PostgreSQL 或 analytics	不適用場景、下一步路由

定位：multi-model + multi-region write

Cosmos DB 跟其他 DB 最大差異是 multi-model。一個服務同時支援 5 種 API、每個 API 對應不同資料模型。應用層選擇用哪個 API、底層是同一個分散式 KV store。

5 個 API：

SQL API：document（JSON）+ SQL-like query、Cosmos DB native
MongoDB API：wire-protocol 相容 MongoDB
Cassandra API：wire-protocol 相容 Cassandra
Gremlin API：graph database
Table API：簡單 KV（Azure Table Storage 升級版）

5 個 consistency levels（從強到弱）：

Strong：在支援的 account / region 配置內提供最強一致性，通常帶來最高 latency
Bounded staleness：訂版本 / 時間差異上限
Session：同 session 內強一致（最常用）
Consistent prefix：保證寫入順序
Eventual：最便宜、最終一致

容量特性：

容量單位：RU/s（Request Unit per second）— 把 read / write / query 統一抽象
1 RU = point read (lookup by id and partition key) of a 1 KB item
配置擴容延遲：99 百分位 5 秒內生效
每個 logical partition 上限：10,000 RU/s
測試最高：1,000,000 RU/s（Minecraft Earth 案例）

適用場景

1. Azure 生態的 multi-model 需求：

同一服務多種 use case（document、graph、KV 共存）
想把多個 NoSQL 資料模型集中在 Azure 服務邊界內治理
對應案例：9.C30 Microsoft 365 — Microsoft 自家用 Cosmos DB 撐分析平台

2. 全球零售 + 季節性高峰：

multi-region write 讓全球用戶寫入本地 region
對應案例：9.C21 ASOS — Black Friday 24 小時 1.67 億請求、3500 RPS 峰值、48ms 平均延遲

3. 全球分散式遊戲後端：

AR / 即時遊戲跨地區同步
session consistency 對遊戲足夠、不需 strong
對應案例：9.C11 Minecraft Earth — AR 遊戲玩家位置、跨 region 寫入

4. MongoDB 應用想要 managed + 全球分散：

Cosmos DB MongoDB API wire protocol compatible
應用層主要驗證相容差異，底層改成分散式架構
對應案例：9.C30 Microsoft 365 — MongoDB → Cosmos DB MongoDB API、planet-scale 分析

5. Multi-region active-active writes:

Unlike Spanner / Aurora DSQL, which are CP systems, Cosmos DB is an AP system
Conflicts are resolved with LWW (Last-Writer-Wins) or a stored procedure
Suited to multi-region write workloads that can accept eventual / session consistency; when global SQL linearizability is required, move to Spanner / Aurora DSQL
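What LWW gives up is easiest to see on two conflicting versions of the same document. The sketch below is a local model, not Cosmos DB code: it assumes each version is a dict carrying the `_ts` system timestamp as the conflict-resolution path, and `merge_cart_items` is a hypothetical example of what a custom merge could do instead.

```python
def resolve_lww(versions, path="_ts"):
    """Last-Writer-Wins: keep the version with the highest value at `path`."""
    return max(versions, key=lambda doc: doc[path])


def merge_cart_items(versions):
    """Hypothetical custom merge: keep the LWW winner but union the item lists."""
    merged = dict(resolve_lww(versions))
    merged["items"] = sorted({item for doc in versions for item in doc["items"]})
    return merged
```

With LWW the earlier write is silently discarded, which is acceptable for data like "last known position" but loses information for additive data such as cart contents; that difference is the main input when choosing between LWW and a custom merge.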

不適用場景

1. 跨雲需求：

Cosmos DB only on Azure
替代：MongoDB Atlas（cross-cloud）、CockroachDB（自管）

2. Linearizable 全球 OLTP：

Cosmos DB Strong consistency 的適用範圍要按 account / region 配置判讀；全球 linearizable SQL 需求通常轉 Spanner / Aurora DSQL
替代：Spanner / Aurora DSQL（真正全球 linearizable）

3. 預算極敏感的小 workload：

最低 400 RU/s（約 $25/month）
小流量場景、Azure SQL Database 更便宜

4. 純 OLAP 分析：

Cosmos DB 定位在 OLTP / document，analytics workload 交給 Synapse、BigQuery 或 Snowflake
替代：Azure Synapse、BigQuery、Snowflake

5. 嚴格 ACID 跨 partition transaction：

Cosmos DB Transaction 限 same logical partition
跨 partition 的 multi-row transaction 要改用 workflow、stored procedure 邊界或 distributed SQL
替代：Spanner / Aurora DSQL

跟其他 vendor 的取捨

vs DynamoDB（AWS）：

Cosmos DB：multi-model（5 API）、5 consistency levels、multi-region write
DynamoDB：KV 為主、strong / eventual consistency、Global Tables 以 LWW 處理 multi-region conflict
選 Cosmos DB：Azure 生態、需要 multi-model、需要 consistency 細粒度控制
選 DynamoDB：AWS 生態、純 KV、AWS-native 整合（Lambda、Streams）

vs Spanner（GCP）：

Cosmos DB：AP 系統、5 consistency levels、multi-model
Spanner：CP 系統、external consistency、SQL only
選 Cosmos DB：可接受 eventual / session、需要 multi-model
選 Spanner：需要 linearizability 與 SQL workload

vs MongoDB Atlas：

Cosmos DB MongoDB API：Azure-only、managed、global 強
MongoDB Atlas：跨雲（AWS / GCP / Azure）、原生 MongoDB 行為
選 Cosmos DB：已在 Azure、想要更好 global distribution
選 MongoDB Atlas：跨雲、需要 MongoDB 完整功能（aggregation pipeline 等 native 行為）

vs Cassandra / ScyllaDB：

Cosmos DB Cassandra API：managed Azure
Cassandra / ScyllaDB：自管、跨雲
選 Cosmos DB：Azure 生態、想把 operation 交給 managed service
選 Cassandra：跨雲、自管、極限 throughput tuning

vs Azure SQL Hyperscale：

Cosmos DB：NoSQL / document、global 分散
Azure SQL Hyperscale：傳統 SQL OLTP、storage / compute 分離、AWS Aurora 對應
選 Cosmos DB：document model、global 分散
選 Azure SQL：SQL workload、應用已用 SQL Server
對應 9.C32 Clearent Azure SQL Hyperscale — SQL 工作負載選 Hyperscale，document / NoSQL workload 才進 Cosmos DB

vs PostgreSQL（SQL baseline）：

PostgreSQL：SQL、強一致、single-primary、跨雲可用
Cosmos DB：NoSQL / multi-model、AP 系統、Azure-only、global 分散
選 PostgreSQL：SQL workload、跨雲、需要進階 SQL 特性
選 Cosmos DB：Azure 生態、document / KV / multi-model、需要 global distribution

vs Aurora（AWS managed SQL）：

Aurora：AWS、SQL（PostgreSQL / MySQL）、single-region scaling
Cosmos DB：Azure、NoSQL / multi-model、global write
兩者分別站在 cloud provider 與 data model 兩個維度；同需求下通常先看既有雲平台（AWS → Aurora、Azure → Cosmos / Azure SQL）

vs CockroachDB（cross-cloud distributed SQL）：

CockroachDB：跨雲、PostgreSQL wire、distributed SQL、強一致
Cosmos DB：Azure-only、multi-model、5 consistency levels、AP 系統
選 CockroachDB：要 SQL + 跨雲 + 強一致
選 Cosmos DB：要 NoSQL + Azure 生態 + 細粒度 consistency 選擇

容量規劃要點

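1. The RU/s abstraction unifies read / write / query:

Unlike DynamoDB, which splits RCU / WCU, Cosmos DB uses a single RU
This simplifies capacity planning, but you still have to work out how many RU each operation consumes
Rule of thumb: a 1 KB point read costs 1 RU, a 1 KB write about 5 RU, and a complex query can cost hundreds of RU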
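A back-of-envelope estimate from the rule-of-thumb figures above (about 1 RU per 1 KB read, about 5 RU per 1 KB write) might look like this. The linear scaling with item size and the 2x multiplier for strong reads are simplifying assumptions; real charges depend on indexing policy and query shape and should be taken from the request charge reported by the service.

```python
import math


def estimate_ru_per_second(reads_per_s, writes_per_s, item_kb=1.0,
                           read_ru_per_kb=1.0, write_ru_per_kb=5.0,
                           strong_reads=False):
    """Rough RU/s estimate for a point-read / point-write workload."""
    size_factor = max(1.0, item_kb)  # assume cost grows linearly above 1 KB
    read_cost = read_ru_per_kb * size_factor
    if strong_reads:
        read_cost *= 2  # assumption: strong reads cost about twice as much
    write_cost = write_ru_per_kb * size_factor
    return math.ceil(reads_per_s * read_cost + writes_per_s * write_cost)
```

For 1,000 reads/s and 200 writes/s of 1 KB items this gives 2,000 RU/s, and 3,000 RU/s if the reads are strongly consistent.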

2. Partition key design is as critical as in DynamoDB:

Each logical partition is capped at 10,000 RU/s
An uneven partition key leads to hot partitions
See 9.C11 Minecraft Earth: a synthetic partition key forces distribution
Details: see the Hot Partition card
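One common way to build a synthetic partition key is to append a hash-derived bucket suffix to the natural key. The sketch below is an assumption-laden illustration, not the scheme used in the Minecraft Earth case: the bucket count, the key format, and the helper names are all hypothetical.

```python
import hashlib


def synthetic_partition_key(base_key, item_id, buckets=16):
    """Spread one hot logical key across `buckets` partition key values."""
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % buckets
    return f"{base_key}-{bucket:02d}"


def all_partition_keys(base_key, buckets=16):
    """Every key value a reader must query to see all items for base_key."""
    return [f"{base_key}-{bucket:02d}" for bucket in range(buckets)]
```

The trade-off is on the read side: writes for a hot key spread across the buckets, but reading everything for that key becomes a fan-out over all values returned by `all_partition_keys`.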

3. multi-region 配置：

開啟跨 region 後、容量在每個 region 都 mirror、成本乘以 region 數
對應 9.C24 Genesys — 跟 DynamoDB Global Tables 同類思維、各 region 獨立容量

4. Consistency level 影響成本：

Strong consistency：跨 region quorum、單個 read 約 2x RU
Session：cost 跟 eventual 接近、但提供同 session 一致
Eventual：最便宜

5. Autoscale provisioned throughput:

Set a max RU/s and pay for what is actually used, with a floor of 10% of the max
Fits: unpredictable traffic, or teams that want less cost-governance overhead
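The 10% floor means an idle autoscale container is never free. A minimal model, assuming billing is per hour on the highest RU/s reached in that hour and clamped between 10% of the max and the max (the function name and the input shape are hypothetical):

```python
def autoscale_billed_ru(max_ru, hourly_peak_ru):
    """RU/s billed for each hour under the assumed autoscale model.

    max_ru: configured autoscale maximum (assumed to be a multiple of 10).
    hourly_peak_ru: highest RU/s demanded in each hour.
    """
    floor = max_ru // 10  # the 10% minimum
    return [min(max_ru, max(floor, peak)) for peak in hourly_peak_ru]
```

With a 10,000 RU/s maximum, an hour peaking at 200 RU/s is still billed at 1,000 RU/s, and demand above 10,000 RU/s is capped (and throttled) at the maximum.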

6. Serverless mode：

按 request 計費，適合稀疏與小流量 workload
適合：dev / test、小流量、稀疏 workload

Deep article（已完成）

本批 5 篇 deep article 已完成、覆蓋 Cosmos DB 從 consistency level 選擇到 multi-region write conflict 的核心 production 議題：

主題	文章	對應 production 議題
Session 預設、Bounded staleness、Strong 邊界跟跨 collection 分流策略	consistency-levels-engineering	Session 為何是 production 預設、per-request override、Strong + multi-region 互斥 cross-link
Synthetic / composite / hierarchical partition key + 不可逆性硬約束	partition-key-design	10000 RU/s 上限、不可改、跟 DynamoDB / MongoDB 可逆性對比
RU/s 思維、payload、index、provisioned vs autoscale vs serverless	ru-cost-model-sizing	ASOS Black Friday + Minecraft Earth 1M RU/s 壓測、autoscale reactive 限制
MongoDB API vs SQL API：三型遷移、dogfood、multi-model、跨雲 hedging	mongodb-api-vs-sql-api	Microsoft 365 dogfood 邊界、document model 遷移三型 SSoT
Multi-region active-active + LWW / custom merge / Strong 互斥	multi-region-write-conflict	Strong + multi-region 互斥的 AP 取捨 SSoT、廣告 SLA vs 實測可用性鏈路

第二批 deep article 把 Cosmos DB 從核心容量 / 一致性議題推進到 server-side 邏輯、CDC、不同產品釐清與 OLTP / OLAP federation：

主題	文章	對應 production 議題
Change Feed (CDC)：persistent change log、Azure Functions trigger	change-feed-cdc	latest-version vs all-versions-and-deletes、lease container、DynamoDB Streams 對照
Stored procedure / trigger（JavaScript）：partition-scoped 交易	stored-procedure-trigger	single-partition atomicity、bounded execution、多數邏輯應在 application 層
Cosmos DB for PostgreSQL（Citus-based 分散式 PG、不同產品）	cosmos-for-postgresql	定位釐清、distribution column、何時選它而非核心 Cosmos / single-node PG
Cosmos DB ↔ Azure Synapse Link：OLTP / OLAP federation	synapse-link-federation	analytical store、HTAP、RU 隔離、何時 federate 到專用 OLAP

Migration playbook：

主題	文章	對應遷移議題
從 MongoDB / Cassandra 遷入 Cosmos DB	migrate-from-mongodb-cassandra	protocol-compat API drop-in（Type B）vs native API paradigm shift（Type E）、相容性邊界、dual-write cutover

跨 vendor entry：先看 DB3 vendor selection（MongoDB / DynamoDB / Cosmos DB 三方選型 + workload shape 前置判讀），再進本 vendor 的 deep article。

後續擴充（仍待補）

Hierarchical partition key 與 partition split / merge 運維
Autoscale vs serverless 的成本切換決策樹
Hands-on lab 入口（對齊 PostgreSQL / MySQL / SQLite hands-on 形態）
Backup / PITR 與 continuous backup tier 選擇
Gremlin / Table API 的適用邊界與遷入

Anti-recommendation 與升級路由

Cosmos DB 的 multi-model 能把遷移阻力降到很低，也會讓 API compatibility、RU/s、partition key 與 consistency level 同時變成設計責任。這一段先說何時維持單一 API model，再說何時升級 multi-region write、Synapse Link、MongoDB Atlas、Spanner 或 Azure SQL。

機制 / 路線	維持簡單設計的條件	升級訊號	主要引用路徑
單一 API model	document / MongoDB / Cassandra / Table 語意清楚分工	多 API 共用同一資料語意、相容層行為差異開始影響 production	MongoDB vendor、Database
Session consistency	user session 內讀寫一致已滿足產品需求	金融 / 庫存 / 票務需要更強順序承諾	Consistency Level、Linearizability
Provisioned RU/s	流量可預測、partition key 均勻	Black Friday、遊戲上線、全球事件帶來突發尖峰	Hot Partition、Peak Forecast
Multi-region write	single-region write + global read 已足夠	regional write latency、region residency、active-active 是產品需求	RPO、RTO、Stale Read
MongoDB Atlas	Azure global distribution 是主訴求	跨雲、原生 MongoDB 行為、Atlas ecosystem 是主訴求	MongoDB vendor
Spanner / CockroachDB	session / eventual consistency 可接受	global SQL、strong transaction、cross-partition ACID 是核心需求	Spanner vendor、CockroachDB vendor
Azure SQL Hyperscale	document / NoSQL 是主要資料形狀	JOIN-heavy、transaction-heavy、SQL Server 生態是主需求	Aurora vendor

Cosmos DB 的簡單路徑是先固定 API model 與 consistency level。每個 API 的相容範圍、index 行為與 query cost 都不同；單純因為「同一服務支援多模型」而混用 API，後續 migration、debug 與容量估算會變複雜。

RU/s 的升級路徑要把 partition key 與 query shape 放在同一張圖。單純提高 RU/s 只能提高名義容量；logical partition 熱點、跨 partition query、index policy 與 payload size 仍會決定真實成本。

已知 limitation 與後續路由

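The Cosmos DB overview covers the Azure global NoSQL decision. The topics originally queued for follow-up (consistency level selection, RU/s cost model, partition key design, multi-region conflict, Change Feed, MongoDB API and Cassandra API migration, and Synapse Link) are now covered by the deep articles and migration playbook listed above; the gaps that remain are the items under 後續擴充（仍待補）.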

案例對照

Case	Scale	Teaching focus
9.C11 Minecraft Earth	1M RU/s test, turnkey global distribution	Globally distributed AR game
9.C21 ASOS	167 million requests / 24h, 48 ms average latency	Global retail on Black Friday
9.C30 Microsoft 365	planet-scale analytics	MongoDB → Cosmos DB API-compatible migration, Microsoft's own dogfooding

Cosmos DB case 的讀法是分開看三種壓力：Minecraft Earth 提供 global partition 與 RU/s 訊號，ASOS 提供季節性零售尖峰訊號，Microsoft 365 提供 MongoDB API 相容遷移與 Azure dogfood 訊號。

反向 sibling 路由

Cosmos DB 的反向 sibling 路由用來把 Azure global NoSQL、DynamoDB 與 document migration 分開。若讀者從 DynamoDB 過來，先比較 RU/s、partition key、multi-region conflict 與 API model；若讀者從 MongoDB 過來，先把 API compatibility 當 migration hypothesis，再用 aggregation、index、change stream / Change Feed 行為驗證；若需求其實是 SQL strong consistency，轉到 Spanner vendor 或 CockroachDB vendor。

這條路由的判準是 API model 是否已固定。Cosmos DB 的 multi-model 是產品入口，不代表同一套資料可以在多個 API 之間自由切換；partition key、index policy、RU/s 與 consistency level 一旦進 production，就會成為 migration 與成本邊界。

常見陷阱

Strong consistency 用太多：多數互動式業務用 session consistency 就能滿足讀寫體驗
partition key 只用 user_id：某些業務 user 集中（VIP、bot）會 hot
忽略 Change Feed：寫入後通知、投影與同步流程適合先評估 Change Feed
MongoDB API behavior 假設：API compat 仍要驗證 aggregation pipeline / index 行為
忽略 multi-region 成本乘數：開 3 region active-active = 3 倍 RU 成本

下一步路由

完整 T1 對照：01-database vendors index
平行：DynamoDB vendor、Spanner vendor、MongoDB vendor
上游：1.10 KV / Document DB 容量規劃 / 1.11 全球分散式 OLTP
下游：1.12 大規模 DB 遷移實戰（MongoDB → Cosmos 範例）
跨模組：9.6 容量規劃模型、9.4 Saturation Discovery
Last reviewed：2026-05-22（API compatibility / consistency / RU model 屬時間敏感 claim）
官方：Azure Cosmos DB、Cosmos DB consistency levels