Detection on Tarragon

Splunk

Mon, 18 May 2026 00:00:00 +0000

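Splunk is the de facto standard for SIEM (Security Information and Event Management) and the mainstream SOC choice for large enterprises, financial institutions, and governments. Cisco acquired it in 2024, and the product line continues to develop independently. What separates it from Elastic Security, Datadog Security, and Google Security Operations is the billing model, ecosystem maturity, and depth of detection content; the detection capabilities themselves are comparable. Splunk's ingestion-based pricing is among the most expensive SIEM billing models in the industry, but its detection content and SOC tooling ecosystem are also the most mature.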

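Service positioning

Splunk's core positioning is a unified query platform for any log source, with SIEM as an application layer on top (the Splunk Enterprise Security app). The base layer is Splunk Enterprise (self-managed) or Splunk Cloud Platform (SaaS). The top-layer products include: Enterprise Security (ES), the premium SIEM app with correlation rules, Risk-Based Alerting, and ITSI integration; SOAR (formerly Phantom) for security orchestration and automated response; and UBA (User Behavior Analytics) for ML-based anomaly detection.

Compared with Elastic Security, Splunk is deeper but more expensive: SPL is more expressive than KQL / EQL, detection content (Splunk Security Content, a public library of YAML rules) has broad coverage, and the ES app's Risk-Based Alerting was an industry pioneer; but ingestion-based pricing hurts at the TB/day level. Compared with Datadog Security, Splunk is security-first, whereas Datadog Cloud SIEM is an observability platform with a security view added; Datadog suits cloud-native, mid-sized environments, and Splunk suits enterprises that span on-prem. Compared with Google Security Operations (formerly Chronicle), Google Security Ops offers fixed pricing by data tier at massive scale, while Splunk's cost grows progressively per GB, so at very large scale Google tends to be the better deal.

Key tension: ingestion-based billing versus detection coverage is the biggest trade-off for Splunk customers. Teams ingest logs selectively to save money (Windows Event Log but not Linux auth logs, prod but not dev), and as a result cross-source correlations of the kind needed for Storm-0558 or the Uber MFA incident go undetected. Be clear about how many detection blind spots you are willing to tolerate in exchange for how much budget.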

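Goals of this chapter

After reading this page, the reader can judge:

Which segment of the SOC stack Splunk covers (log aggregation / SIEM / SOAR / UBA) and which parts must be external (Vault for service tokens, governance of IdP log sources)
Ownership design for SPL / correlation rules / detection content (who writes, who reviews, who tunes false positives)
How to handle the ingestion pricing trap (log priority tiering, Cribl / Cribl Stream as a pre-filter, Splunk SmartStore to push cold data to S3)
When to use Splunk and when to go with Elastic / Datadog / Google Security Ops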

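Shortest assessment path

To judge whether a Splunk deployment is healthy, look at four things at minimum:

Who can change correlation rules: the headcount of Splunk admin / ES admin / KV store admin roles, whether SPL searches and saved searches are under version control (Git → git-fusion / Splunk Cloud Versioned Configs), and whether rule changes go through PR review
Ingestion governance: which sources enter Splunk (IdP audit log / cloud control plane log / endpoint log / network log / app log), whether there is a log priority tier (critical / standard / archive), and whether Cribl Stream sits in front for pre-filtering / routing
Detection content coverage: how much of Splunk Security Content (the public YAML rule library) is enabled, whether it is mapped to MITRE ATT&CK, and whether in-house custom rules cover organization-specific anti-patterns
Alert quality / SOC handoff: alert volume per day, SOC analyst triage time, false positive rate, whether alerts enter SOAR playbooks so that low-risk ones are handled automatically, and whether routing to module 8 (incident response) is defined

If any of the four is missing, it is an open item at the Detection Coverage and Signal Governance boundary.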

Daily operations and decision shapes

Ingestion architecture: logs enter Splunk through three paths: Universal Forwarder / Heavy Forwarder (agent-based, on self-managed hosts), HTTP Event Collector (HEC) (logs pushed to an HTTP endpoint, the default for SaaS / serverless workloads), and Splunk Add-ons for individual clouds and SaaS products (cloud-native log pull). Production usually mixes them: Universal Forwarder for endpoints, Add-ons for cloud control planes (AWS / GCP / Azure / Okta), and HEC for in-house apps. Placing Cribl Stream in front for routing / filtering / sampling is the standard complement in large deployments.

SPL (Search Processing Language): chains commands with a Unix-pipe-like | (index=ids sourcetype=auth | stats count by user | where count > 100); it is expressive but has a steep learning curve. SPL is a first-class concept, not just a query tool: a saved search becomes a correlation rule, a scheduled search becomes an alert, and an accelerated search becomes data model acceleration. How well the SPL is written directly determines detection rule quality and query cost.

Correlation rule / Notable Event: the ES app turns high-confidence findings into Notable Events, which enter the Incident Review queue. The anti-pattern for a correlation rule is the single-event alert (alerting on every SSH brute-force attempt, so a SOC analyst looks at 10,000 meaningless alerts a day). A production rule should combine time-bounded aggregation (100 brute-force attempts from the same IP in the past 5 minutes) with cross-source correlation (the brute-force IP also appears in cloud control plane access).
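The rule shape described above (a time-bounded count per source IP, confirmed against a second log source) can be sketched in Python. This is a minimal illustration, not Splunk's implementation: the function name `correlate`, the tuple layout of the events, and the two constants are assumptions chosen to mirror the 5-minute / 100-attempt example in the text.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 300   # 5-minute window, as in the example above
FAIL_THRESHOLD = 100   # failures from one IP before the rule may fire


def correlate(auth_failures, control_plane_ips):
    """Return IPs that reach FAIL_THRESHOLD failures inside the window
    AND also appear in cloud control plane access.

    auth_failures: iterable of (timestamp_seconds, source_ip), sorted by time.
    control_plane_ips: set of IPs seen in control plane logs (hypothetical input).
    """
    windows = defaultdict(deque)   # per-IP timestamps still inside the window
    hits = set()
    for ts, ip in auth_failures:
        q = windows[ip]
        q.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while q and ts - q[0] > WINDOW_SECONDS:
            q.popleft()
        # Time-bounded aggregation plus cross-source confirmation.
        if len(q) >= FAIL_THRESHOLD and ip in control_plane_ips:
            hits.add(ip)
    return hits
```

A single failed login never fires here; only a burst from one IP that also touched the control plane does, which is the difference between a single-event alert and a correlation rule.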

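Detection content lifecycle: Splunk Security Content is the OSS detection rule library maintained by Splunk, in YAML format and mapped to MITRE ATT&CK. Organizations usually import the whole baseline first, then selectively disable noisy rules and add organization-specific ones. Rule changes go through PR review and run in a staging tenant for 24-48 hours to observe the false positive curve before promotion to production. This follows the principles in the Detection Engineering Lifecycle chapter.

Risk-Based Alerting (RBA): introduced in ES app 7.0+, it accumulates a risk score per user / asset (replacing per-finding alerts) and alerts only when the score reaches a threshold. It is the engineering approach to alert fatigue: five low-confidence signals that together exceed a threshold are closer to a real attack pattern than a single high-confidence alert. See Alert Fatigue and Signal Quality.

SOAR integration: Splunk SOAR (formerly Phantom) receives alerts and runs playbooks automatically, for example rotating a leaked credential (via the Vault API), adding a firewall block for a suspect IP (via a Cloudflare WAF custom rule), or forcing MFA re-enrollment for a suspect user (via the Okta API). Playbooks go under version control and get regular dry-runs; they must not run as black-box fire-and-forget in production.

Ingestion pricing governance: Splunk bills by ingestion volume (GB/day), and a TB-scale deployment costs on the order of ten million US dollars a year. Practical governance: tier 1 logs (IdP / cloud control plane / payment processor / DB audit) go into the Splunk hot index; tier 2 logs (app log / web access log) enter Splunk after sampling / filtering; tier 3 logs (debug / verbose) go through SmartStore to S3 / GCS cold storage, or bypass Splunk and go straight to Elastic or a data lake. Pre-filtering with Cribl Stream ahead of Splunk is standard industry practice and can save 30-50% of ingestion cost.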
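The three-tier routing decision above can be sketched as a small Python function. This is a hedged illustration rather than a Cribl or Splunk configuration: the source names in `TIER_BY_SOURCE`, the 20% sample rate, and the choice to archive (rather than drop) unsampled tier 2 events are all assumptions.

```python
import random

# Hypothetical source-to-tier mapping; a real deployment defines its own.
TIER_BY_SOURCE = {
    "idp_audit": 1, "cloud_control_plane": 1, "payment": 1, "db_audit": 1,
    "app_log": 2, "web_access": 2,
    "debug": 3,
}
TIER2_SAMPLE_RATE = 0.2  # assumed: index 20% of tier 2 events


def route(event, rng=random.random):
    """Return 'index' or 'archive' for one log event (a dict with a 'source' key)."""
    tier = TIER_BY_SOURCE.get(event["source"], 2)  # unknown sources default to tier 2
    if tier == 1:
        return "index"        # security-critical: always indexed
    if tier == 2:
        # Sampled into the index; the remainder is archived, not discarded.
        return "index" if rng() < TIER2_SAMPLE_RATE else "archive"
    return "archive"          # tier 3: cold storage only
```

Passing `rng` as a parameter keeps the sampling decision testable; archiving the unsampled remainder reflects the advice elsewhere on this page to sample rather than skip ingestion wholesale.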

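SmartStore and hot/cold separation: SmartStore places the indexers' warm and cold buckets in S3 / Azure Blob / GCS, leaving only hot data and a cache on the indexers. The point is that retention can extend from months to years without cost growing linearly. Almost every production deployment should enable it; without it you pay for EBS volumes every year.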

Core trade-off table

Trade-off dimension	Splunk	Elastic Security	Datadog Security	Google Security Operations
Billing model	Ingestion-based (GB/day, progressive)	Resource-based (node / cluster size)	Per-host + per-event (events/month)	Fixed price by data tier (cost-effective at PB scale)
Learning curve	Steep: SPL is expressive but has many idioms	Moderate: KQL / EQL are more intuitive	Gentle: reuses Datadog observability syntax	Moderate: YARA-L is a new syntax but clearly structured
Deployment model	Self-hosted (Splunk Enterprise) / SaaS (Cloud)	Self-hosted / Elastic Cloud / Serverless	SaaS only	SaaS only (Google Cloud)
Detection content	Splunk Security Content (richest, active community)	Elastic Prebuilt rules + Sigma support	Datadog Security Rules (moderate)	Built-in Google YARA-L rules + Google threat intel
SOAR / Response	Splunk SOAR (formerly Phantom, industry pioneer)	Built-in Cases + Endpoint response (Elastic Defend)	Workflow Automation (basic)	Built-in SOAR (formerly Siemplify)
Cross-source correlation	Strong: backed by data models + SPL	Strong: EQL sequence + Lucene	Moderate: log + metrics + trace on the same plane	Strong: UDM normalization + cross-tenant
Multi-cloud	Strong: Add-ons cover the three major clouds	Strong: Beats / Agent across clouds	Strong: Datadog Agent across clouds	GCP-first, cross-cloud via Forwarder
Best fit	Enterprise + cross on-prem / multi-cloud, budget permitting	OSS-friendly, mid-to-large, Elastic stack already in use	Cloud-native, Datadog already used for observability	Very large-scale ingestion, Google Cloud + multi-cloud SOC
Exit cost	High: large volume of SPL / detection content / dashboards	Medium: Sigma / Lucene are more portable	Medium	Medium

The core case for choosing Splunk: enterprise scale, cross on-prem, a mature detection content and SOC tooling ecosystem, the budget to invest (a license on the order of ten million US dollars, plus Cribl pre-filtering and SmartStore cold-storage governance), and a SOC team to maintain correlation rules and SOAR playbooks. A mid-sized cloud-native organization is better off going straight to Datadog or Google Security Ops.

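Advanced topics

Risk-Based Alerting in the Enterprise Security app: RBA changes "event → alert" into "event → risk score → accumulation → alert", an engineering solution to alert fatigue. Implementation has to decide the risk decay window (how long before a risk score decays), risk attribution (how to split risk among multiple users on the same EC2 instance), and per-asset versus per-user thresholds. It pairs with the lesson of the Uber 2022 MFA Fatigue incident: a single MFA failure should not alert, but 50 failures within 5 minutes plus a new device plus an unusual location is high risk.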
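The "event → risk score → accumulation → alert" flow can be sketched in Python. This is not how the ES app stores risk; it is a minimal model under stated assumptions: the class name `RiskLedger` is hypothetical, decay is modelled as hard expiry after a fixed window (real deployments may decay gradually), and the default window and threshold are placeholders.

```python
from collections import defaultdict


class RiskLedger:
    """Accumulate risk per entity and signal when the live total crosses a threshold."""

    def __init__(self, decay_window=24 * 3600, threshold=100):
        self.decay_window = decay_window    # seconds a risk event stays "live" (assumed)
        self.threshold = threshold          # per-entity alert threshold (assumed)
        self.events = defaultdict(list)     # entity -> [(timestamp, score)]

    def add(self, entity, score, ts):
        """Record one risk event at time ts; return True if the entity should alert."""
        # Expire contributions older than the decay window.
        live = [(t, s) for t, s in self.events[entity] if ts - t <= self.decay_window]
        live.append((ts, score))
        self.events[entity] = live
        return sum(s for _, s in live) >= self.threshold
```

With `threshold=100`, three signals of 30, 30, and 40 for the same user inside the window raise one alert on the third, while two signals of 60 separated by more than the decay window never do.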

Common Information Model (CIM) + Data Model: Splunk CIM normalizes fields from different sources into a unified schema (data models such as authentication / network_traffic / web). The point is that cross-source SPL is written once, rather than separately for Okta logs, Azure AD logs, and CrowdStrike logs. CIM mapping is automatic with Add-ons; for custom sources an organization has to define the CIM mapping itself.
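The idea of normalizing once and querying once can be illustrated in Python. The vendor field names in `FIELD_MAPS` are simplified and hypothetical, not the real Okta or Azure AD schemas, and the common names (`user`, `src`, `action`) are chosen for illustration rather than taken from the CIM specification.

```python
# Hypothetical vendor-field -> common-field mappings (simplified for illustration).
FIELD_MAPS = {
    "okta": {"actor": "user", "client_ip": "src", "result": "action"},
    "azure_ad": {"userPrincipalName": "user", "ipAddress": "src", "status": "action"},
}


def normalize(source, raw):
    """Rename vendor-specific fields of one raw event to the common schema."""
    mapping = FIELD_MAPS[source]
    return {common: raw[vendor] for vendor, common in mapping.items() if vendor in raw}


def failed_logins_by_user(events):
    """One query over normalized events, regardless of which source produced them."""
    counts = {}
    for event in events:
        if event.get("action") == "failure":
            counts[event["user"]] = counts.get(event["user"], 0) + 1
    return counts
```

Adding a custom source then means adding one entry to the mapping, which is the in-code analogue of defining a CIM mapping for that source; the query function stays unchanged.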

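Multi-tenant deployment: an MSSP or a large group with multiple BUs shares one Splunk deployment, using three layers: index (data isolation), role / capability (access isolation), and App (dashboard / search isolation). Note that Splunk admin is a high-privilege role in cross-tenant scenarios and should go through a break-glass process with auditing.

Cisco integration (2024+): after the Cisco acquisition, integration between Splunk and Cisco XDR / Talos threat intel / Cisco Secure Endpoint has accelerated. For Cisco-heavy environments this increases ecosystem consistency; for non-Cisco environments the impact is limited for now, though the long-term roadmap will include Cisco-specific value-adds.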

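Troubleshooting and quick failure diagnosis

Alert volume explodes / SOC cannot keep up: correlation rules are written as single-event alerts, or the false positive baseline was never tuned. Switch to risk-based alerting with RBA, and run in a staging tenant for 48 hours before promoting
Detection coverage gap discovered only during an incident: a critical log source was never ingested into Splunk (to save money). Restore tier 1 log priority, and use Cribl Stream to sample tier 2 / 3 rather than skipping ingestion wholesale
Ingestion cost spikes: new sources were added without review, or debug logs went straight into Splunk. Put Cribl Stream in front, alert from the license usage dashboard, and set indexer ingestion quotas
SPL searches slow / search head stalls: full-fidelity search over 1 TB of raw events without data model acceleration. Switch to accelerated data models, limit the time range, and use tstats instead of stats
Too many correlation rule false positives: rules are too broad and env-specific noise was not tuned. Run in a staging tenant for 1 week to measure FPs, tune thresholds, and add a lookup table to exclude known legitimate sources
SOAR playbook as black-box fire-and-forget: an automatic account disable takes out the CEO by mistake. Put an approval gate on high-impact actions, and default to containment, not deletion
Too many Splunk admins / no break-glass: daily operations use admin tokens, so the blast radius of an admin compromise is too large. Reduce admin roles, move to power user plus specific capabilities, and run break-glass through Vault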
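The approval gate for high-impact playbook actions mentioned in the list above can be sketched in Python. This is not the Splunk SOAR API: `run_action`, the `HIGH_IMPACT` set, and the `execute` callback are hypothetical names used only to show the control flow.

```python
# Hypothetical set of actions that must never run without a human approver.
HIGH_IMPACT = {"disable_account", "delete_resource", "rotate_prod_credential"}


def run_action(action, target, approved_by=None, execute=lambda a, t: f"{a}:{t}"):
    """Run a playbook action, holding high-impact ones until someone approves.

    execute is a stand-in for the real integration call (Vault, Okta, WAF, ...).
    """
    if action in HIGH_IMPACT and approved_by is None:
        # Do nothing destructive; queue for review instead.
        return {"status": "pending_approval", "action": action, "target": target}
    return {
        "status": "executed",
        "result": execute(action, target),
        "approved_by": approved_by,
    }
```

Low-impact containment such as `block_ip` runs immediately, while `disable_account` returns `pending_approval` until an approver is supplied, which keeps automation from becoming fire-and-forget.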

When to choose another service

Requirement shape	Go with
OSS-friendly / budget-sensitive	Elastic Security
Cloud-native + already used for observability	Datadog Security
Very large-scale ingestion + Google Cloud	Google Security Operations
DLP / sensitive data discovery	Google DLP / Microsoft Purview
Endpoint detection as the main need	CrowdStrike Falcon / Microsoft Defender for Endpoint
Pre-filter / log routing	Cribl Stream (a front-end forwarder, not a SIEM replacement)
Incident routing	Module 8 incident response vendor list

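Topics not covered on this page

Full SPL syntax reference, advanced use of saved searches and macros
Detailed feature comparison of Splunk Cloud Platform vs Splunk Enterprise
Splunk Observability Cloud (from the SignalFx acquisition, competes directly with Datadog; observability, not security)
ITSI (IT Service Intelligence): belongs to ITSM / observability, outside the security scope
Concrete SOAR playbook implementation (Phantom Python SDK)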

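Case write-back

The module 07 case library has no vendor-level incident directly involving Splunk, but every detection-related case serves as a reference for SIEM detection coverage:

Case	Relationship to Splunk (lesson by comparison)
Uber 2022 MFA Fatigue	MFA request density should be a first-class signal in Splunk correlation rules: alert directly when the 5-minute window count exceeds N, and use RBA to raise the score of the high-risk user
Microsoft Storm-0558 Signing Key Chain	Detecting anomalous cross-tenant token validation requires ingesting both Splunk Add-on for Azure AD data and cloud control plane logs; only cross-source correlation enables detection within seconds
Snowflake 2024 Credential Abuse	A composite correlation rule over data platform query volume, cross-schema scans, and anomalous source IPs; correlate query metrics as well as audit logs
SolarWinds 2020 Sunburst	Signature verification passes but runtime behavior is anomalous, which requires endpoint log + network log correlation rather than IoC-only rules
Detection Engineering Lifecycle (section)	Splunk Security Content plus in-house custom rules follow the engineering lifecycle of propose → staging tune → promote → review, not direct edits in the console
Alert Fatigue and Signal Quality (section)	RBA is an engineered answer to alert fatigue, not "ignore low risk"; it needs a risk decay and threshold tuning lifecycle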

Next-step routing

Upstream: 7.13 Detection Coverage and Signal Governance, Detection Engineering Lifecycle
Parallel: Elastic Security, Datadog Security, Google Security Operations
Downstream: Google DLP / Microsoft Purview (DLP signals into Splunk)
Cross-category: Okta (IdP log source), HashiCorp Vault (API called by SOAR playbooks), Cloudflare WAF (WAF log + auto-block)
Cross-module: module 8 incident response vendor list (Notable Event → IR routing), module 4 observability (shared log pipeline)
Official: Splunk Documentation

Elastic Security

Mon, 18 May 2026 00:00:00 +0000

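Elastic Security is the SIEM + EDR + Cloud Security suite on the Elastic Stack (Elasticsearch + Kibana + Beats / Agent). It has OSS origins and is now a Solution in Elastic's commercial offering. What separates it from Splunk, Datadog Security, and Google Security Operations is the billing model, the query language model, and ecosystem openness; the detection capabilities themselves are comparable. Elastic uses resource-based pricing (by cluster size rather than ingestion volume) and offers four complementary query languages: KQL, EQL, Lucene, and ES|QL.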

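Service positioning

Elastic Security's core positioning is a security solution on the Elastic Stack. The base layers are Elasticsearch (data layer), Kibana (query and UI layer), and Fleet / Elastic Agent (collection layer). The top-layer products fall into three lines: Elastic SIEM (log aggregation + detection rules + Cases + Timeline), Elastic Defend (from the Endgame acquisition; EDR + endpoint protection, in the same tier as CrowdStrike / SentinelOne), and Elastic Cloud Security (CSPM + CWP, covering cloud resource misconfiguration and workload protection).

Compared with Splunk, Elastic is OSS-friendly with resource-based pricing: TB-scale ingestion does not directly raise the bill (nodes must scale, but the marginal cost is far lower than Splunk's progressive per-GB pricing), and more than 5,000 Sigma community rules can be imported directly; but Splunk Security Content and SOC tooling such as SOAR and RBA remain an order of magnitude more mature. Compared with Datadog Security, Elastic spans on-prem and multi-cloud and can be self-managed or run as Elastic Cloud SaaS; Datadog is SaaS-only and suits purely cloud-native setups. Compared with Google Security Operations, Elastic has multiple query languages (KQL / EQL / Lucene / ES|QL) while Google uses YARA-L as a single unified language, and at very large ingestion scale Google tends to be the better deal.

Key tension: the multi-language query model is both Elastic's advantage and its burden. EQL expresses an attack chain sequence more directly than SPL correlation, KQL filters quickly, ES|QL writes aggregations in an intuitive SQL-like way, and Lucene handles full-text search; but the SOC team has to decide which language each rule uses rather than letting every analyst write in their own.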

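Goals of this chapter

After reading this page, the reader can judge:

Which segment of the SOC stack Elastic Security covers (log aggregation / SIEM / EDR / CSPM) and which parts must be external (Okta IdP logs, Vault secret rotation)
The division of duties among the four query languages KQL / EQL / Lucene / ES|QL (which is used for which kind of rule, and who trains the SOC)
Governance of resource-based pricing (cluster sizing, hot-warm-cold tiers, Searchable Snapshots, Elastic Cloud Serverless)
When to use Elastic and when to go with Splunk / Datadog / Google Security Ops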

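Shortest assessment path

To judge whether an Elastic Security deployment is healthy, look at four things at minimum:

Who can change detection rules: rule editor permissions in the Elastic Security app, whether the detection-rules repo (Elastic's official OSS rule library) is forked into the organization's version control, and whether rule changes go through PR review plus staging space validation
Collection governance: whether Fleet manages Elastic Agent policies centrally or Beats (filebeat / metricbeat / auditbeat / winlogbeat) are scattered with individual configs, whether log sources are split into hot / warm / cold tiers, and whether Searchable Snapshots are enabled
Detection content coverage: how many Elastic Prebuilt rules and imported Sigma community rules are enabled, whether they are mapped to MITRE ATT&CK, and how many attack chain patterns the EQL sequence rules cover
Alert quality / SOC handoff: alert volume per day, whether Cases and Timeline are part of the daily SOC workflow, whether ML anomaly jobs are running with tuned thresholds, and whether routing to module 8 (incident response) is defined

If any of the four is missing, it is an open item at the Detection Coverage and Signal Governance boundary.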

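Daily operations and decision shapes

Ingestion architecture: logs enter Elastic through three main paths: Elastic Agent + Fleet (the default for modern deployments; a single agent collects system / endpoint / cloud / app logs and a central Fleet server manages policies), Beats (purpose-built agents such as filebeat / metricbeat / auditbeat / winlogbeat; the traditional approach before Fleet, still supported, though migration to Elastic Agent is recommended), and Logstash (pipeline-style ETL for complex enrich / filter / route scenarios). Production usually centers on Elastic Agent + Fleet, with Logstash filling ETL gaps.

Division of duties among KQL / EQL / Lucene / ES|QL: each of the four query languages has its first-class scenario. KQL (Kibana Query Language) is Kibana's default filter syntax, for example user.name : "alice" and event.action : "logon-failed"; it is simple and intuitive and suits dashboard / Discover filtering. EQL (Event Query Language) does sequence pattern matching, for example sequence by user.name [authentication where event.outcome=="failure"] [authentication where event.outcome=="success" and source.geo.country != "TW"], and expresses an attack chain more directly than SPL correlation. Lucene is the underlying full-text query syntax, written directly when a special need arises. ES|QL (Elasticsearch Query Language, 2024+) is the newer SQL-like piped language, for example FROM logs-* | WHERE event.category == "authentication" | STATS count = COUNT(*) BY user.name, and makes aggregations intuitive; as a new language, its production adoption is still catching up.

Detection rule types: Elastic Security has six first-class rule types, not just the "query rule": Query rule (triggered by KQL / Lucene), EQL rule (sequence patterns), Threshold rule (aggregation exceeds a threshold, for example more than 100 login failures from the same IP within 5 minutes), ML rule (bound to an Elastic ML anomaly job; triggers when the anomaly score exceeds a threshold), New terms rule (first appearance of an entity, for example a user logging in from a country for the first time), and Indicator match rule (events are matched against a threat intel feed; an IoC hit triggers). Production detection often combines several: query rules for coarse filtering, EQL rules for sequences, and threshold + ML rules for baseline anomalies.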
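Of the rule types above, the new-terms idea is the least obvious, so here is a minimal Python sketch of its logic. It is an illustration of the concept, not Elastic's implementation: the function name `new_terms`, the event dictionaries, and the in-memory `history` set are assumptions (a real rule evaluates against a history window in the index).

```python
def new_terms(events, history, fields=("user", "country")):
    """Return events whose field combination has never been seen before.

    events:  iterable of dicts containing the given fields.
    history: set of previously seen field-value tuples; updated in place.
    """
    alerts = []
    for event in events:
        key = tuple(event[f] for f in fields)
        if key not in history:
            alerts.append(event)   # first appearance of this combination
            history.add(key)       # later repeats are no longer "new"
    return alerts
```

A user with a login history from one country produces exactly one alert the first time they log in from another, and none on the repeat, which is what separates a new-terms rule from a plain query rule.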

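Sigma rule import: Sigma is a generic OSS detection rule format (YAML, portable across SIEMs) with 5,000+ community-maintained rules. Elastic supports importing Sigma rules by converting them into Elastic detection rules, which is the OSS lever that sets Elastic apart from commercial SIEMs. In practice: import the Sigma baseline and run all of it in a staging space to observe false positives, then enable in production. Do not enable everything directly; because Sigma rules are generic across SIEMs, environment-specific tuning has to be done in-house.

Case + Timeline: a Case is the incident container that aggregates alerts, comments, assignment, and status; Timeline is the SOC analyst's investigation workspace, where events can be pinned, annotated, and linked to related alerts to produce an investigation narrative. Together they are a first-class part of Elastic's SOC workflow, not a bolt-on. They correspond to Notable Event + Incident Review in Splunk ES, but Elastic takes the open route: a Case can be exported as markdown into a ticketing system.

Elastic Defend (EDR): integrated from the Endgame acquisition, it provides endpoint detection and prevention (malware blocking / ransomware protection / behavior detection), in the same tier as CrowdStrike Falcon / SentinelOne. Elastic Defend runs inside Elastic Agent, with policies pushed from Fleet. In practice most SIEM customers do not use the built-in EDR and instead feed a dedicated EDR into Elastic SIEM; but OSS-friendly, budget-sensitive mid-sized customers can consolidate into a single stack.

Cross-cluster search: unified queries across multiple Elastic clusters (remote_cluster:index-name), suited to multi-region / multi-tenant SOCs without moving all logs into a single cluster. It corresponds to Splunk Cloud federated search. A practical scenario: GDPR data stays in the EU cluster, and the US cluster queries it for incident investigation without copying the data.

ML jobs (anomaly detection): Elastic ML has built-in unsupervised anomaly detection, and the pre-built ML job library covers common SOC scenarios (user behavior baseline, host login pattern, port scan detection, rare process). An ML rule is bound to an ML job and triggers a detection when the anomaly score exceeds a threshold. It corresponds to Splunk UBA, but Elastic ML is built into the stack rather than an add-on app.

Resource-based pricing governance: Elastic Cloud bills by cluster size (node count × node size), not by ingestion volume. Ingesting more logs does not directly raise the bill, but nodes must scale to maintain query performance. Practical governance: hot tier (the most recent 7-30 days, high-performance SSD nodes), warm tier (30-90 days, low-IO nodes), and cold / frozen tier (beyond 90 days, Searchable Snapshots on S3 / GCS, slow to query but very cheap). It corresponds to Splunk SmartStore: the frozen tier extends retention from months to years without cost growing linearly.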

Core trade-off table

Trade-off dimension	Elastic Security	Splunk	Datadog Security	Google Security Operations
Billing model	Resource-based (node / cluster size)	Ingestion-based (GB/day, progressive)	Per-host + per-event (events/month)	Fixed price by data tier (cost-effective at PB scale)
Query language	KQL / EQL / Lucene / ES|QL, four complementary languages	SPL (single, highly expressive)	Datadog Query (reuses observability syntax)	YARA-L (unified, clearly structured)
Sequence expression	EQL `sequence by` expresses an attack chain directly	SPL transaction / streamstats	log + metrics + trace on the same plane	UDM + YARA-L multi-event rules
Deployment model	Self-hosted / Elastic Cloud / Serverless	Self-hosted (Enterprise) / SaaS (Cloud)	SaaS only	SaaS only (Google Cloud)
Detection content	Elastic Prebuilt rules + 5,000+ Sigma community rules	Splunk Security Content (richest, active community)	Datadog Security Rules (moderate)	Google YARA-L + Google threat intel
EDR integration	Elastic Defend built in (formerly Endgame)	External CrowdStrike / Defender	Workload Security (container focus)	External (via forwarder)
SOAR / Response	Cases + Endpoint response (Elastic Defend)	Splunk SOAR (formerly Phantom, industry pioneer)	Workflow Automation (basic)	Built-in SOAR (formerly Siemplify)
Best fit	OSS-friendly, mid-to-large, Elastic stack already in use	Enterprise + cross on-prem, budget permitting	Cloud-native + Datadog already used for observability	Very large-scale ingestion, Google Cloud + multi-cloud SOC
Exit cost	Medium: Sigma / Lucene / EQL partly portable	High: large volume of SPL / detection content / dashboards	Medium	Medium

The core case for choosing Elastic: an OSS-friendly culture, a preference for resource-based pricing, the Elastic Stack already in use for observability, a team able to work across four query languages (or at least to divide EQL and KQL duties clearly), and acceptance of the trade-off in detection content and SOAR maturity. At TB-scale ingestion, saving 60-80% of license cost compared with Splunk is the biggest draw, but the hidden costs of cluster sizing and SRE operations must be factored in.

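Advanced topics

EQL sequence pattern (time-ordered attack chain): EQL's sequence by is Elastic's first-class tool for expressing an attack chain, and it is more direct than SPL correlation. For example, MFA fatigue can be written as sequence by user.name with maxspan=5m [authentication where event.outcome=="failure"] [authentication where event.outcome=="failure"] [authentication where event.outcome=="success" and source.ip != known_ip], which states the sequence logic directly (known_ip here is a placeholder for a per-user allowlist, not EQL syntax). It pairs with the Uber 2022 MFA Fatigue lesson: a run of MFA failures followed by a success from a new device triggers directly.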
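To make the sequence semantics concrete, here is a small Python sketch of the same pattern: two failures followed by a success from an unknown IP, all for one user and within a maximum span. It is a simplified model, not the EQL engine; `mfa_fatigue_matches`, the event dictionaries, and the `known_ips` lookup are hypothetical.

```python
from collections import defaultdict

MAXSPAN = 300  # seconds, mirroring maxspan=5m in the example above


def mfa_fatigue_matches(events, known_ips, maxspan=MAXSPAN):
    """Return (user, ts) for each success from an unknown IP that follows
    at least two failures by the same user within maxspan.

    events:    time-sorted dicts with keys ts, user, outcome, ip.
    known_ips: dict mapping user -> set of IPs considered normal (assumed input).
    """
    failures = defaultdict(list)   # user -> timestamps of earlier failures
    matches = []
    for e in events:
        user = e["user"]
        if e["outcome"] == "failure":
            failures[user].append(e["ts"])
        elif e["outcome"] == "success" and e["ip"] not in known_ips.get(user, set()):
            # Count this user's failures that fall inside the span ending now.
            recent = [t for t in failures[user] if e["ts"] - t <= maxspan]
            if len(recent) >= 2:
                matches.append((user, e["ts"]))
    return matches
```

Grouping by user corresponds to `sequence by user.name`, and the span check corresponds to `maxspan`; failures that are too old no longer count toward a match.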

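Elastic Defend endpoint response: beyond detection, Defend supports response actions such as host isolation (isolating an infected endpoint while keeping the SOC connection), process kill, and file quarantine, triggered directly from the Kibana Security app. It corresponds to CrowdStrike Real Time Response. Set an approval gate before production adoption so that a SOC analyst does not act on a production server by mistake.

CSPM / CWP (Elastic Cloud Security): CSPM (Cloud Security Posture Management) scans AWS / GCP / Azure accounts for misconfigurations (public S3 buckets, IAM over-permission, security groups open to 0.0.0.0/0) against the CIS Benchmark; CWP (Cloud Workload Protection) runs runtime detection on Kubernetes workloads. These are newer features, and coverage is still catching up with dedicated CNAPPs such as Wiz / Lacework.

Cross-cluster search as federated query across environments: a first-class tool for multi-region SOCs. A query is written as FROM logs-auth-*, eu-cluster:logs-auth-* and Elastic routes it across clusters automatically. Practical notes: cross-cluster queries have higher latency, so set a timeout; for data compliance (GDPR), check whether query results contain cross-border data. The data is not moved, but whether returning query results counts as a transfer needs confirmation from legal.

Sigma rule community: Sigma is the generic OSS detection rule format, and Elastic is a major Sigma user (a built-in importer, and Elastic engineers contribute to Sigma upstream). In practice: fork the SigmaHQ repo into the organization's version control, let a CI pipeline convert Sigma → Elastic detection rules automatically, observe the false positive curve in a staging space, then promote to production; do not import manually every time.

Elastic Cloud Serverless (2024+): a new model billed by workload type (search / observability / security) rather than cluster size, which reduces sizing decisions and leaves autoscaling to Elastic. As a new model, its production adoption is still catching up; it suits greenfield deployments or PoCs, and the migration roadmap for existing clusters is still evolving.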

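Troubleshooting and quick failure diagnosis

Alert volume explodes / SOC cannot keep up: all Sigma rules were enabled without tuning, or threshold rule thresholds are too low. Run in a staging space for 1 week to measure FPs, tune thresholds, add exception lists to exclude known legitimate sources, and add ML rules for user-specific baselines
EQL sequence rule will not run / times out: the sequence span is too long (24h) or the by field has very high cardinality, so query cost explodes. Shorten maxspan, restrict the index pattern, and add pre-filter conditions
Cluster queries slow / Kibana stalls: the hot tier holds too much old data and no hot-warm-cold tiering is in place. Enable an ILM (Index Lifecycle Management) policy for automatic rollover, use cheaper nodes for the warm tier, and use Searchable Snapshots for cold / frozen
Fleet agent enrollment fails: network / certificate / token problems between Fleet server and Elasticsearch. Check Fleet server health, confirm the enrollment token has not expired, and read the agent log for the specific error
Many FPs after Sigma rule import: Sigma rules are generic across SIEMs and carry no environment-specific exclusions. Do not enable everything; tune in staging before promoting, and add exception lists (known scanner IPs / internal test accounts)
Resource-based pricing over budget: nodes are over-scaled or the hot tier retains too much. Enable hot-warm-cold ILM, push indices older than 30 days to the frozen tier on S3, and treat Searchable Snapshots as something that should be on by default
ML job anomaly scores inaccurate: training data includes a period of compromise, so the baseline is contaminated. Confirm the training window falls in a clean period, retrain regularly, and pair detection rules with anomaly_score > 75 rather than > 50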

When to choose another service

Requirement shape	Go with
Enterprise + richest detection content	Splunk
Cloud-native + Datadog already used for observability	Datadog Security
Very large-scale ingestion + Google Cloud	Google Security Operations
DLP / sensitive data discovery	Google DLP / Microsoft Purview
Endpoint detection as the main need, without the full stack	CrowdStrike Falcon / Microsoft Defender for Endpoint / SentinelOne
CNAPP as the main need (cloud posture + workload)	Wiz / Lacework / Prisma Cloud (Elastic Cloud Security is newer)
Incident routing	Module 8 incident response vendor list

Topics not covered on this page

Full KQL / EQL / ES|QL syntax reference, advanced Lucene query DSL usage
Details of Elasticsearch index sharding / replica / ILM tuning (observability / data engineering scope)
Elastic Observability (APM / logs / metrics): observability, not security
Detailed sizing and pricing of Elastic Cloud Serverless (a new model since 2024, still changing)
Operating a self-managed Elastic Stack (cluster upgrades, Kibana plugin development)

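Case write-back

The module 07 case library has no vendor-level incident directly involving Elastic Security, but every detection-related case serves as a reference for SIEM detection coverage:

Case	Relationship to Elastic Security (lesson by comparison)
Uber 2022 MFA Fatigue	Elastic EQL `sequence by user.name [auth fail count > 50 in 5min] [auth success from new device]` (pseudocode) expresses the MFA fatigue pattern directly; the Sigma community has ready-made rules to import as a starting point
Microsoft Storm-0558 Signing Key Chain	Detecting anomalous cross-tenant token validation requires Elastic cross-cluster search to query Azure AD logs, GCP audit logs, and in-house app logs together, without moving the data first
3CX 2023 Desktop App Supply Chain	Elastic Defend sees the desktop app's process spawn and anomalous network callbacks directly, with no external EDR feed; EQL `sequence` catches the process → DNS → C2 behavior chain
Detection Engineering Lifecycle (section)	Elastic rules follow an engineered lifecycle of the `detection-rules` repo (OSS, maintained by Elastic) + Sigma fork + staging space + promote, not direct edits in the Kibana UI
Alert Fatigue and Signal Quality (section)	Elastic has no direct counterpart to Splunk RBA; it reduces noise in three layers with ML anomaly rules, threshold rule severity, and Case grouping, and needs an ML job retraining lifecycle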

Next-step routing

Upstream: 7.13 Detection Coverage and Signal Governance, Detection Engineering Lifecycle
Parallel: Splunk, Datadog Security, Google Security Operations
Downstream: Google DLP / Microsoft Purview (DLP signals into Elastic SIEM)
Cross-category: Okta (IdP log source), HashiCorp Vault (secret rotation API), Cloudflare WAF (WAF log + Sigma rule integration)
Cross-module: module 8 incident response vendor list (Case → IR routing), module 4 observability (log pipeline shared with the Elastic Stack)
Official: Elastic Security Documentation, detection-rules repo

Datadog Security

Mon, 18 May 2026 00:00:00 +0000

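Datadog Security is the security suite on the Datadog observability platform, sharing the same control plane and data plane as Datadog logs / metrics / APM / infrastructure. Its design starting point is not SIEM but treating security signals as one dimension of observability: an alert is not limited to logs and can pivot to APM traces, infra metrics, and host context at the same time. This positioning determines its strengths (cloud-native, mixed-incident detection) and its limits (SaaS-only, billing that grows linearly with host count, a poor fit for on-prem-heavy or budget-sensitive scenarios).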

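Service positioning

Datadog Security consists of four products that share the Datadog Agent and backend: Cloud SIEM (log-based detection, the same category as Splunk Enterprise Security); Cloud Security Management (CSM), which covers CSPM (cloud config posture) and Cloud Workload Security (CWS) (container / Linux runtime via eBPF); App and API Protection (AAP, formerly ASM), which collects attack signals RASP-style in the app runtime; and Sensitive Data Scanner, which scans logs for PII / credentials and redacts them.

Compared with Splunk, Datadog is observability-first with security as a view, while Splunk is security-first. Splunk is more mature in the depth of enterprise SOC tooling (SOAR playbooks, RBA, CIM data models) and in cross on-prem deployment; Datadog is SaaS-only but shares a plane with APM / Infra, so the path to interpreting a mixed incident (is the latency anomaly an attack or a capacity problem?) is shorter. Compared with Elastic Security, Elastic can span on-prem and is OSS, while Datadog is SaaS only; Elastic requires integrating observability signals yourself, while Datadog ships with them. Compared with Google Security Operations, Google uses fixed pricing by data tier, cost-effective at PB scale, while Datadog grows linearly with hosts: friendly at mid scale, but the cost curve steepens past a thousand hosts.

Key tension: having observability and security on the same plane is Datadog's biggest selling point and also the source of cost risk. Host count and events/month are the billing basis for both observability and security, so the bill does not stay separate once security is added. Budget has to be viewed across the whole Datadog bill, not as a separate security line.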

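Goals of this chapter

After reading this page, the reader can judge:

Which segment of the SOC stack Datadog Security covers (log SIEM / CSPM / container runtime / WAF-runtime / log DLP) and which parts must be external (Vault, Okta IdP logs, edge WAF)
When the advantage of observability + security on the same plane holds, and when it becomes a vendor lock-in risk
Cost governance of Cloud SIEM billing (events/month + indexed) and the Standard / Flex Logs retention tiers
When to use Datadog and when to go with Splunk / Elastic / Google Security Ops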

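Shortest assessment path

To judge whether a Datadog Security deployment is healthy, look at four things at minimum:

Datadog Agent coverage: whether the agent is installed on every host / container / serverless wrapper, whether log forwarders cover the cloud control plane (AWS CloudTrail / GCP Audit Log / Azure Activity Log), and whether IdP (Okta) audit logs come in. Any one missing is a detection blind spot
Detection rule ownership: whether Cloud SIEM rules are built-in or custom, whether custom rules are version-controlled in Git (Terraform datadog_security_monitoring_rule), and whether a staging environment dry-runs them for 24-48 hours before promotion to production
CSPM compliance check governance: which CIS / NIST / PCI baselines are enabled, whether findings enter a ticket workflow, and whether a misconfiguration remediation SLA is defined (critical 24 hours, high 7 days, medium 30 days)
Events/month + Indexed Log budget: Cloud SIEM bills by events/month plus indexed events; check whether ingestion impact is estimated before adding a source, and whether the Standard / Flex Logs retention tiers are split by log priority

If any of the four is missing, it is an open item at the Detection Coverage and Signal Governance boundary.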

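Daily operations and decision shapes

Datadog Agent collection: logs / metrics / traces / security events go through the same Agent, using integrations (150+) to pull from cloud / SaaS / database / queue sources. Security events and observability events are linked in the backend by attribute tags (env, service, host, trace_id), so during an incident you can pivot from a log alert to the APM trace with the same trace_id and see the application context in which the attack happened.

Cloud SIEM detection rule: a rule resembles an SPL query, for example source:okta @evt.name:user.authentication.auth_via_mfa @outcome:failure, plus signal aggregation (rolling window count, new value, anomaly detection, impossible travel). Built-in rules are mapped to MITRE ATT&CK and are the same category as Splunk Security Content, though fewer in number; custom rules go into version control through the Terraform provider and are not edited directly in the production UI.

CSPM compliance check: scans AWS / GCP / Azure configuration against CIS / NIST 800-53 / PCI / SOC 2 baselines and finds misconfigurations (public S3 buckets, overly permissive IAM, insecure SG rules). It is the same category as Wiz / Prisma Cloud but shares a dashboard with Datadog Infra, so findings show the metrics / logs of the affected resource directly. The strength is that a security finding can be read directly against business impact; the limit is that graph-based attack paths (a Wiz strength) fall short of dedicated CNAPPs.

Cloud Workload Security (CWS): uses Linux eBPF probes to observe container / process behavior at the kernel layer, detecting cryptominers, privilege escalation, anomalous syscalls, and file integrity changes. It is the same category as Falco but on the same plane as Datadog Infra, so a CWS alert can pivot directly to that container's CPU / memory / traces. Linux eBPF is sensitive to kernel version and some features are unavailable on older kernels; confirm the fleet kernel matrix before production.

App and API Protection (AAP): RASP-style protection in which the Datadog APM library collects attack signals in the application runtime (SQLi / XSS / SSRF / anomalous traffic patterns). It sits at a different layer from Cloudflare WAF / AWS WAF: a WAF is at the edge / CDN, while AAP in the app runtime sees the actual request handler and DB query. The two are complementary, not mutually exclusive: the edge WAF blocks volumetric attacks and known patterns, and AAP covers app-specific business logic abuse.

Sensitive Data Scanner: scans ingested logs with built-in or custom patterns to detect PII / credentials / payment cards / API keys, and can then redact, quarantine, or alert. It is DLP-lite: no match for the full sensitive data discovery / classification / lineage of Google DLP / Microsoft Purview, but sufficient for secrets leaked into logs by mistake. It serves both as a detection signal source and as a DLP complement.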
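The scan-then-redact step can be sketched with Python's `re` module. This is not Datadog's scanner or its pattern library: the two patterns below (an AWS-style access key ID and a loose email matcher) and the `[REDACTED:...]` placeholder format are assumptions for illustration.

```python
import re

# Illustrative patterns only; a production scanner needs a far larger, tested set.
PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def redact(line):
    """Return (redacted_line, names_of_patterns_found) for one log line."""
    found = []
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            found.append(name)                               # usable as a detection signal
            line = pattern.sub(f"[REDACTED:{name}]", line)   # usable as DLP redaction
    return line, found
```

The function returns both the redacted line and the list of matched pattern names, reflecting the dual role described above: the match is a detection signal, and the substitution is the DLP step.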

Notebooks + Workflow Automation: Notebooks are query workbooks for incident investigation that mix log queries, metric charts, APM traces, and annotations; compared with Splunk Search they are closer to a SOC version of a Jupyter notebook. Workflow Automation is a lightweight SOAR that connects to PagerDuty / Slack / Jira / Webhook / Vault API, with playbooks built in a visual builder plus Python. Its SOAR depth falls short of Splunk SOAR, but it is enough for the common response actions (rotate credential / block IP / open ticket) of a mid-sized SOC (10-50 people).

Standard Logs / Flex Logs + retention tier: once in Datadog, logs fall into three tiers: Indexed (hot, full-text searchable, expensive), Flex Logs (warm, long retention, higher query latency, 1/3 to 1/5 of the cost), and Archive (cold, sent to S3 / GCS, storage only). Cloud SIEM detection runs on indexed logs, so which logs are indexed directly determines detection coverage and the bill. Tier 1 sources (IdP / cloud control plane / payment) must be indexed, tier 2 sources (app logs) are sampled, and tier 3 (debug) goes to Flex or Archive.

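Core trade-off table

Trade-off dimension	Datadog Security	Splunk	Elastic Security	Google Security Operations
Design starting point	Observability + security on the same plane	Security-first, unified log query platform	Search-first, extension of the ELK stack	Massive-scale ingestion, Google threat intel
Billing model	Per-host + per-event (events/month)	Ingestion-based (GB/day, progressive)	Resource-based (node / cluster)	Fixed price by data tier (cost-effective at PB scale)
Deployment model	SaaS only	Self-hosted / SaaS	Self-hosted / Cloud / Serverless	SaaS only (Google Cloud)
Observability integration	Native: log + APM + metrics + infra in the same query	Self-integrated (Splunk Observability billed separately)	Self-integrated (Elastic Observability enabled separately)	Weak: cross-product federation
Cloud posture (CSPM)	Built in (CSM)	Third-party add-on / Cisco integration	Elastic Cloud Security (newer) / third party / Wazuh	Third party / Mandiant integration
Container runtime	Built-in CWS (eBPF)	Needs Falco / third party	Elastic Defend	Needs Falco / third party
App runtime (RASP)	Built-in AAP	Needs third party	Third party	Third party
SOAR / Response	Workflow Automation (lightweight)	Splunk SOAR (industry pioneer)	Cases + Endpoint response	Built-in SOAR (formerly Siemplify)
Best fit	Cloud-native + Datadog already in use + mid-sized SOC	Enterprise + cross on-prem, budget permitting	OSS-friendly, Elastic stack already in use	Very large-scale ingestion, Google Cloud

The core case for choosing Datadog: Datadog observability already in use, a mainly cloud-native footprint, a mid-sized SOC (10-50 people), and a need for the incident interpretation path that observability + security on one plane provides. If the environment is mainly on-prem, budget-sensitive (1,000+ hosts), or needs enterprise SOAR / RBA depth, go with Splunk; if OSS-friendly and cross on-prem, go with Elastic.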

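Advanced topics

Cross-product correlation (log + APM + metrics on the same trace_id): Datadog's most distinctive detection shape. A security alert is not just a log line but an integrated incident view bound to a trace_id. For example, when a SQLi attempt hits an API endpoint, Cloud SIEM opens a signal, APM shows the DB query and latency of that request, and infra shows the CPU of that host. This gives a structural advantage in detecting mixed incidents such as "is this query latency anomaly an attack?", and maps directly to the investigation path of the Snowflake 2024 Credential Abuse case.

CWS Linux eBPF behavior detection: the eBPF probes run at the kernel layer, need no kernel module, and have little effect on process performance (< 1% overhead). Detectable behaviors include file integrity (/etc/passwd modified), process tree (an anomalous bash → curl → /tmp/payload chain), network connections (a container connecting out to a cryptominer pool), and syscall patterns (ptrace used for process injection). Like Falco it uses eBPF; the difference is that Datadog CWS needs no separate deployment and shares a plane with other Datadog signals.

Datadog Threat Intelligence: built-in threat feeds (malicious IP / domain / file hash) automatically flag logs / network events that hit an IoC. In-house STIX/TAXII feeds can be added, though the depth does not match Mandiant / Recorded Future / dedicated TI platforms; it is enough for a mid-sized SOC, while serious APT defense calls for an external dedicated TI source.

Integration with Datadog Incident Management: a security signal can open a Datadog Incident directly (built-in incident channel + timeline + post-mortem template), the same category as PagerDuty but on the same plane as observability. When a security event escalates into a company-wide incident (on the scale of Change Healthcare 2024 Operations Impact), teams can share one incident commander view instead of stitching two timelines together.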

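Troubleshooting and quick failure diagnosis

Cloud SIEM detection lags / no alert: events never reached indexed logs (they went to Flex), or the retention tier is set wrong. Check whether log pipeline rules mark security-critical sources as indexed
Events/month spikes: debug / verbose logs enter the Cloud SIEM index, or CWS event volume explodes. Filter ahead of the log pipeline (Datadog Observability Pipeline or Cribl), and narrow CWS rules for noisy behaviors
100+ CSPM findings with nobody fixing them: findings do not enter a ticket workflow and are not prioritized. Integrate Jira / ServiceNow, map severity to SLAs, and escalate findings older than 30 days
No CWS data on hosts with old kernels: eBPF features are sensitive to kernel version (some are unsupported below 4.18). Upgrade the kernel, or mark the host as CWS-incompatible and cover it with a host-based agent
AAP false positives block users: RASP blocks directly in the app runtime and kills normal requests. Run AAP in monitor mode for 1-2 weeks to collect a baseline, tune, then switch to protect mode
Sensitive Data Scanner misses PII: the custom pattern is wrong, or the log format is nested (JSON inside JSON). Dry-run against sample logs; note that the scanner runs at ingest time and is not retroactive
Workflow Automation playbook as a black box: an automatic credential rotation takes out a prod service account by mistake. Put high-impact playbook actions behind an approval gate, and default to containment rather than deletion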

When to choose another service

Requirement shape	Go with
Enterprise + cross on-prem, budget permitting	Splunk
OSS-friendly / Elastic stack already in use	Elastic Security
Very large-scale ingestion + Google Cloud	Google Security Operations
Strict DLP / data classification	Google DLP / Microsoft Purview
Cloud posture graph / attack path	Wiz / Prisma Cloud / Lacework
Edge WAF / volumetric attack	Cloudflare WAF / AWS WAF
Endpoint EDR	CrowdStrike Falcon / Microsoft Defender for Endpoint
Incident routing	Module 8 incident response vendor list

Topics not covered on this page

Full Datadog Agent configuration reference and custom check authoring
Datadog observability details (APM / RUM / Synthetics / DBM), which belong to module 4 observability
Full Cloud SIEM rule syntax reference
Details of authoring CWS custom detections (custom rules via Agent Expression Language)
Datadog Incident Management workflow (belongs to module 8 IR)

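Case write-back

The module 07 case library has no vendor-level incident that involves Datadog Security directly, but its detection shape (observability and security on one plane) shortens the investigation path of several cases, so they are worth comparing:

Case	Relation to Datadog Security (lesson by comparison)
Snowflake 2024 Credential Abuse	Anomalies in query volume, connection count, and CPU load are a strength of Datadog's single plane; Cloud SIEM rules and DBM metrics are queried together, with no stitching between a SIEM and a monitoring tool
Change Healthcare 2024 Operations Impact	Impact assessment for a business-hub incident: APM + Infra can tell within seconds whether a latency anomaly comes from security or capacity, and Datadog Incident gives a shared IC view
Mailchimp 2023 Support Tool Abuse	APM span correlation exposes the trace pattern of one operator accessing many tenants in a short time; a log-only SIEM cannot see application-level tenant switching
Uber 2022 MFA Fatigue	A Cloud SIEM detection rule correlates Okta MFA logs with APM error rate, so detection does not depend on a single log source
Detection Coverage and Signal Governance (section)	Standard / Flex Logs plus retention tiers are the tools for governing detection coverage: tier 1 sources must be indexed, tier 2 / 3 go to Flex / Archive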

Next-step routing

Upstream: 7.13 Detection Coverage and Signal Governance, Detection Engineering Lifecycle
Parallel: Splunk, Elastic Security, Google Security Operations
Downstream: Google DLP / Microsoft Purview (DLP signals into Datadog)
Cross-category: Okta (IdP log source), HashiCorp Vault (Workflow Automation calls its API), Cloudflare WAF / AWS WAF (edge WAF logs into Cloud SIEM, with AAP covering the app layer)
Cross-module: 4 observability (same Agent, same plane), module 8 incident-response vendor list (Datadog Incident → IR routing)
Official: Datadog Security Documentation

Splunk Risk-Based Alerting: from alert per rule to score-aggregated notable

Mon, 18 May 2026 00:00:00 +0000

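This article is the implementation-layer deep dive for the Splunk overview. The overview covers where Splunk Enterprise Security sits in the SIEM / detection landscape. This article focuses on implementing Risk-Based Alerting (RBA): the methodological shift from per-rule alerts to accumulated scores that trigger a notable at a threshold, plus the concrete practices for tuning, scaling, and integration.

Why RBA: alert fatigue is the ceiling of detection engineering

The maturity ceiling of detection engineering is not how many correlation rules you can write; it is how many alerts per day SOC analysts can handle without going numb. Most SOCs reach their processing limit in the 200-500 alerts/day range. Adding rules beyond that only pushes up false positives, and analysts start silently ignoring low and medium severity alerts.

The core turn in RBA is to split alert logic from "rule fires" into "score accumulates": a detection rule does not produce an alert directly. It adds risk score to a user / asset / process, and a notable (a high-priority case) is produced only when several low-severity signals accumulate past a threshold. The SOC no longer sees "rule X fired"; it sees "user Y accumulated 70 points today, versus 12 last week".

RBA does not replace writing detection rules. It is a new layer for aggregation and prioritization. Where 100 rules used to produce alerts independently, 100 rules now contribute to a shared score, and score → notable becomes the new alert boundary.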

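The three-layer RBA model: modifier, score, notable

The risk flow has three first-class objects:

Object	Responsibility	Example
Risk modifier	Produced by one detection rule; states who gets how many points, why, and in which category	user `alice@corp` +25 points, reason `unusual_login_geo`
Risk index	Accumulates all modifiers and decays them over time; queried for the current score of a user / asset	`index=risk earliest=-7d`
Risk notable	Fires when the accumulated score crosses the threshold and enters SOC case management	user accumulates 50 points → open incident

The key design choices all sit in the modifier layer:

Scoring dimension: per user / per asset / per process tree / per IP. The finer the dimension, the better the score maps to an individual entity, but the higher the query cost
Scoring weight: the simple approach maps severity directly (low=5 / med=15 / high=30 / critical=60); a finer approach also considers signal precision (the rule's historical FP rate)
MITRE ATT&CK mapping: each modifier is tagged with a tactic / technique so that you can judge whether the kill chain is complete (a full reconnaissance → exfiltration sequence versus one tactic repeating)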
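
The weighting choice above can be sketched in a few lines. This is a minimal illustration, not ES behavior: the `RiskModifier` class, the `weighted_score` helper, and the linear scaling by `1 - fp_rate` are assumptions made for this example; only the severity table and the sample values (`alice@corp`, `unusual_login_geo`, `T1078`) come from the text.

```python
from dataclasses import dataclass

# Severity-to-weight table taken from the text above.
SEVERITY_WEIGHT = {"low": 5, "med": 15, "high": 30, "critical": 60}


@dataclass(frozen=True)
class RiskModifier:
    risk_object: str        # who receives the score, e.g. a user name
    risk_object_type: str   # "user", "system" or "other"
    risk_score: float
    reason: str
    mitre_technique: str


def weighted_score(severity: str, fp_rate: float) -> float:
    """Scale the severity weight by signal precision (1 - historical FP rate).

    Linear scaling is an assumption of this sketch, not an ES default.
    """
    if not 0.0 <= fp_rate <= 1.0:
        raise ValueError("fp_rate must be between 0 and 1")
    return SEVERITY_WEIGHT[severity] * (1.0 - fp_rate)


# A high-severity rule with a 20% historical FP rate contributes 24 points.
modifier = RiskModifier(
    risk_object="alice@corp",
    risk_object_type="user",
    risk_score=weighted_score("high", fp_rate=0.2),
    reason="unusual_login_geo",
    mitre_technique="T1078",
)
```

Keeping the weight a pure function of severity and precision makes it easy to recalibrate every rule at once when the score distribution shifts.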

ES configuration step by step

Producing risk modifiers from a correlation search

| search index=auth user=* unusual_geo=true
| stats count by user, src_ip, _time
| eval risk_score=25
| eval risk_object_type="user"
| eval risk_object=user
| eval risk_message="Unusual login geography"
| eval threat_object=src_ip
| eval threat_object_type="ip_address"
| eval mitre_technique="T1078"
| collect index=risk

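Key fields:

risk_object + risk_object_type: who receives the score; the built-in types are user / system / other
risk_score: how many points to add, taking signal precision into account
threat_object: the matching attacker artifact (IP / hash / domain), used to correlate across modifiers
mitre_technique: the matching ATT&CK ID, used for kill chain analysis

Tuning note: on first deployment, do not run collect index=risk right away. Use | table to inspect the output and estimate how many modifiers per day the search will produce. If that exceeds planned indexer capacity, sample first (| where random()/2147483647<0.1 keeps about 10%).

Risk notable: threshold aggregation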

| tstats summariesonly=t count, sum(All_Risk.calculated_risk_score) as total_risk
  from datamodel=Risk.All_Risk
  where earliest=-24h
  by All_Risk.risk_object, All_Risk.risk_object_type
| where total_risk > 80
| `risk_score_format`

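total_risk > 80 is the threshold that triggers a notable. Tuning points:

Time window: -24h is the default, but adjust it to the average duration of the attack pattern; use a 7-14 day window for APT and 4-12h for commodity attacks
Threshold value: 80 is an illustrative number, not a universal one, and depends on the distribution of modifier weights; ES 7.0+ suggests 100 by default, and in practice values fall mostly in the 60-150 range
Aggregation dimension: by user is the default, but lateral movement scenarios need by asset and credential abuse needs by service account

Tuning note: run the first week in shadow mode, where notables fire but do not page and the SOC reviews them afterwards to adjust threshold and weights. Enable production paging only after 1-2 weeks of shadow mode.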
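
To make the aggregation logic concrete outside SPL, here is a minimal Python sketch of the same idea: sum scores per risk object inside a time window and keep the objects above the threshold. The `notables` function, the event tuple layout, and the sample data are hypothetical; the 24h window and the threshold of 80 come from the text.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def notables(events, now, window=timedelta(hours=24), threshold=80):
    """Return {risk_object: total} for objects whose summed score exceeds threshold.

    events: iterable of (timestamp, risk_object, score) tuples.
    """
    totals = defaultdict(float)
    for ts, obj, score in events:
        if now - window <= ts <= now:
            totals[obj] += score
    return {obj: total for obj, total in totals.items() if total > threshold}


NOW = datetime(2026, 5, 18, 12, 0)
EVENTS = [
    (NOW - timedelta(hours=2), "alice", 25),
    (NOW - timedelta(hours=5), "alice", 30),
    (NOW - timedelta(hours=20), "alice", 30),
    (NOW - timedelta(hours=30), "alice", 60),  # outside the 24h window
    (NOW - timedelta(hours=1), "bob", 30),
]

result = notables(EVENTS, NOW)  # alice totals 85 inside the window, bob stays at 30
```

Widening `window` to 48 hours pulls in the older 60-point event, which shows why the window and the threshold have to be tuned together.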

Notable enrichment: a case a human can read

| eval description="User ".risk_object." accumulated ".total_risk." risk over 24h"
| eval mitre_techniques=mvjoin(mitre_technique, ", ")
| eval contributing_rules=mvjoin(search_name, ", ")
| sendalert notable

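The notable lands in ES Incident Review, where the SOC analyst sees not only the score but also the N rules that make up those 80 points and the ATT&CK tactics they cover. This is the core advantage of RBA over per-rule alerts: the analyst reads a complete narrative instead of piecing one together.

Tuning playbook: four common kinds of drift

Playbook A: false positive accumulation

Symptom: a user triggers a notable N days in a row and the SOC closes it as FP after each review, yet the modifier keeps adding points.

Root cause: the modifier's scoring logic ignores the baseline. For example, a DBA connecting to prod with psql every day is normal, but the unusual_command rule treats it as anomalous, adds 15 points, and the score accumulates to the threshold.

Fixes:

Add a whitelist_lookup on the modifier side: DBA / SRE / approved service accounts skip specific modifiers
Advanced: add a signal_precision weight to the modifier, and drop rules with a historical FP rate > 30% to a weight of 5 points or less
Avoid ad hoc NOT user IN (...) exclusions; a long whitelist is an anti-pattern. Use role-based exclusion (query the AD group) instead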
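
The two fixes can be combined in one scoring step. The sketch below is an assumption-laden illustration: `EXEMPT_GROUPS`, `adjusted_score`, and the group names are hypothetical, and in a real deployment group membership would come from a directory lookup rather than a hard-coded dictionary. The 30% cutoff and the 5-point cap come from the text.

```python
# Hypothetical mapping: which directory groups are exempt from which modifier.
EXEMPT_GROUPS = {"unusual_command": {"dba", "sre"}}
FP_RATE_CUTOFF = 0.30   # rules above a 30% historical FP rate...
LOW_PRECISION_CAP = 5   # ...contribute at most 5 points


def adjusted_score(rule, base_score, user_groups, rule_fp_rate):
    """Apply role-based exclusion first, then the low-precision cap."""
    if EXEMPT_GROUPS.get(rule, set()) & set(user_groups):
        return 0  # role-based exclusion: this modifier is skipped entirely
    if rule_fp_rate > FP_RATE_CUTOFF:
        return min(base_score, LOW_PRECISION_CAP)
    return base_score
```

Because the exclusion is keyed on group rather than user name, a new DBA inherits it automatically and a departed one loses it, which is the point of avoiding a long per-user whitelist.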

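Playbook B: score inflation

Symptom: with the threshold at 80, the notables the SOC receives grow from 5 to 25 per day, but actual attacks have not increased to match.

Root cause: newly added detection rules were not aligned with the existing weight distribution. New rules all assign +30 / +40, the global average rises, and the threshold is effectively lowered.

Fixes:

For every new rule, run a shadow simulation of "+1 rule" and its effect on the daily notable count
Recalibrate the threshold: treat it as 1.5 × the p95 of daily total_risk rather than a fixed value
Quarterly review: run index=risk | stats sum(risk_score) by source to see how score is distributed across modifier sources; score concentrated in a few rules is a sign of inflation

Tuning note: score inflation and alert fatigue are the same symptom with different root causes. The former is fixed by threshold and rule weight calibration, the latter by changing modifier dimensions and whitelists.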
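
The quarterly review query has a straightforward equivalent in plain Python, which can be handy for checking an exported sample offline. This is a sketch under assumptions: `score_share_by_source` and the sample rule names are hypothetical, and what share counts as "concentrated" is left to the reader.

```python
from collections import Counter


def score_share_by_source(modifiers):
    """Return {source_rule: share of total risk score}.

    modifiers: iterable of (source_rule, risk_score) pairs, mirroring
    the result of `stats sum(risk_score) by source`.
    """
    totals = Counter()
    for source, score in modifiers:
        totals[source] += score
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {source: value / grand_total for source, value in totals.items()}


sample = [("new_rule_a", 40), ("new_rule_a", 20), ("geo_rule", 25), ("cmd_rule", 15)]
shares = score_share_by_source(sample)  # new_rule_a carries 60% of all score
```

A single rule carrying most of the score, as `new_rule_a` does here, is the concentration pattern the review is meant to surface.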

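Playbook C: threshold drift

Symptom: the threshold has not been touched for half a year while the attack landscape and business behavior have both changed. Either there are too many notables (threshold below baseline) or detections are missed (threshold above what a real attack accumulates).

Root cause: the threshold is a static value but the baseline is dynamic. Business changes (cloud migration, new departments, a shift in the WFH ratio) change how often modifiers fire.

Fixes:

Quarterly tuning cadence: each quarter, inspect the distribution of per-object totals, for example | tstats sum(All_Risk.calculated_risk_score) as total from datamodel=Risk.All_Risk by All_Risk.risk_object | stats p50(total) p95(total) p99(total)
Adaptive threshold: compute it dynamically as p95 × 1.3 and have a macro update it automatically (Playbook B uses 1.5 after adding rules; treat either multiplier as a starting point to tune)
Do not read threshold drift as "the rules are inaccurate"; the baseline moved, the rules are not wrong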
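
The adaptive threshold is simple arithmetic once the per-object totals are available. The following sketch assumes a nearest-rank percentile definition, which may differ slightly from the interpolation Splunk uses; `percentile` and `adaptive_threshold` are hypothetical helper names.

```python
def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it.

    pct is an integer between 1 and 100.
    """
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling of pct/100 * n
    return ordered[rank - 1]


def adaptive_threshold(totals, multiplier=1.3):
    """Threshold as p95 of observed risk totals times a tunable multiplier."""
    return percentile(totals, 95) * multiplier


totals = list(range(1, 101))            # stand-in for per-object risk totals
threshold = adaptive_threshold(totals)  # p95 is 95, so roughly 123.5
```

Recomputing this on a schedule and writing the result into the threshold the notable search reads is one way to keep the value from going stale.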

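Playbook D: decay design

Symptom: a user's low-score anomalies from 7 days ago are still counted in the score. The threshold fires a notable, but the cause is events scattered over 7 days, not a current attack episode.

Root cause: the default RBA search sums scores inside its window with no time decay. Once the window is widened beyond 24h (for example to 7 days), a low score from 7 days ago weighs the same as one from today.

Fix: add a decay function so that modifier weight decreases over time: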

| eval age_hours=(now() - _time)/3600
| eval decayed_score = calculated_risk_score * exp(-age_hours / 48)
| stats sum(decayed_score) as total_risk by risk_object

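exp(-age_hours/48) uses a 48-hour time constant, which corresponds to a half-life of about 33 hours: an event from 24h ago keeps about 61% of its weight, 48h about 37%, and 7 days about 3%. Tune the decay to the attack pattern: commodity attack 12-24h, APT 5-14 days.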
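
The decay arithmetic is easy to check by hand or in code. The sketch below mirrors the SPL `exp(-age_hours / 48)` expression; the function names are hypothetical. Note that in this form the 48 is the e-folding time constant, so the actual half-life is 48 × ln 2, about 33 hours.

```python
import math

TAU_HOURS = 48.0  # time constant used in exp(-age_hours / 48)


def decay_weight(age_hours, tau=TAU_HOURS):
    """Fraction of the original score that survives after age_hours."""
    return math.exp(-age_hours / tau)


def half_life(tau=TAU_HOURS):
    """Age at which the weight drops to exactly one half."""
    return tau * math.log(2)


def decayed_total(events, now_hours):
    """Sum of decayed scores; events is an iterable of (event_time_hours, score)."""
    return sum(score * decay_weight(now_hours - t) for t, score in events)


w24 = decay_weight(24)    # about 0.607
w48 = decay_weight(48)    # about 0.368
w168 = decay_weight(168)  # about 0.030 after 7 days
```

If a true 48-hour half-life is wanted instead, the exponent would be `-age_hours * ln(2) / 48`, which decays more slowly than the expression shown.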

Capacity planning

RBA capacity has four dimensions:

Dimension	How to estimate	Guideline
Risk index events/day	`total detection rules × average triggers per rule per day`	mid-size SOC: ~100K-500K / day
Risk datamodel size	`events/day × 365 days × 1KB avg`	100K/day × 365 × 1KB ≈ 36.5GB / year
Search head load	RBA tstats is ~10x cheaper than raw search, but by-user aggregation is still heavy at 1M+ users	run the notable trigger search hourly, not streaming
Indexer ingest	RBA adds little ingest (modifiers are derived from logs already ingested), but datamodel acceleration needs CPU	reserve 10-15% CPU per indexer for datamodel acceleration

Practical sizing: 500K modifiers/day, 5K users, and an hourly tstats trigger search need about 3 indexers + 1 search head (including non-RBA workload).
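
The two estimation formulas in the table reduce to a few multiplications. This sketch only restates that arithmetic; the function names are hypothetical, and 1KB is taken as 1000 bytes to match the 36.5GB figure.

```python
def risk_events_per_day(rule_count, avg_triggers_per_rule_per_day):
    """Risk index events/day = detection rules x average triggers per rule per day."""
    return rule_count * avg_triggers_per_rule_per_day


def yearly_storage_gb(events_per_day, avg_event_bytes=1000, days=365):
    """Raw size of one year of risk events, in decimal gigabytes."""
    return events_per_day * days * avg_event_bytes / 1e9


small = yearly_storage_gb(100_000)   # 36.5 GB per year
large = yearly_storage_gb(500_000)   # 182.5 GB per year
```

These are raw event sizes only; index overhead, replication, and datamodel acceleration summaries come on top and depend on the deployment.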

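Watch Splunk Cloud pricing: RBA does not increase ingest GB/day, but datamodel acceleration is compute work. Under Splunk Cloud workload pricing that compute counts against SVC (Splunk Virtual Compute) capacity, which is billed separately from ingest; self-managed on-prem indexers do not carry this cost.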

Integration / next steps

With SOAR / case management

After a notable fires, hand it to SOAR:

enrichment: automatically query AD / asset DB / threat intel and add user role, asset criticality, and known IoCs to the case
decision tree: choose the SOC tier from the risk score range (< 100 tier 1 / 100-200 tier 2 / 200+ tier 3 + page)
playbook automation: disable user / isolate endpoint / rotate credential run through the SOAR pipeline, not through manual clicks by a SOC analyst
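
The decision tree is a plain range lookup. In this sketch the `soc_tier` function and its return labels are hypothetical, and the text leaves the boundary at exactly 200 ambiguous, so assigning 200 to tier 3 is an assumption.

```python
def soc_tier(total_risk):
    """Map an accumulated risk score to a SOC tier, using the ranges from the text."""
    if total_risk < 100:
        return "tier 1"
    if total_risk < 200:
        return "tier 2"
    return "tier 3 + page"  # assumption: a score of exactly 200 pages tier 3
```

Keeping the boundaries in one function means the SOAR playbook and the reporting dashboard cannot disagree about which tier a score belongs to.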

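Compared with Elastic Security / Sentinel

Vendors name their RBA implementations differently: Splunk calls it RBA, Elastic calls it the Risk Engine, Microsoft Sentinel uses Fusion + UEBA aggregation, and Sumo Logic aggregates Signals into Insights (with Insight Trainer as its tuning aid). The underlying concept is the same (score aggregation + threshold notable); the details differ in how modifiers are written and how much ML automation is involved. When migrating across platforms, modifier logic usually has to be rewritten, while threshold and decay tuning experience carries over.

With UEBA

RBA and UEBA (user / entity behavior analytics) are complementary, not substitutes: UEBA uses ML to compute deviation from a baseline and outputs an anomaly score that feeds RBA as one more modifier source. The usual implementation order is static rules + RBA first, then UEBA as a supplement. Starting ML-first usually makes tuning cost explode.

Next topics

Threat object correlation: use threat_object to link the same attacker artifact across modifiers, so score aggregates across users and assets
Kill chain coverage analysis: break a notable down into "ATT&CK tactic coverage N/14"; the wider the coverage, the higher the priority
Risk-based response automation: score ranges trigger different SOAR playbooks automatically, and humans review only tier 3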

7.B Defender's View (Blue Team) and Control Plane Validation

Thu, 30 Apr 2026 00:00:00 +0000

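The core goal of the blue-team subcategory is to build a path for defensive assessment and control plane validation. Blue team here means the engineering handoff layer seen from the defender's side: it answers what to defend, which signals to watch, who takes over, how to validate, and how to write back.

Assessment categories

Category	Content direction	Covered by
Defense control map	Identity, entry point, data, supply chain, detection, and governance control planes	`7.B1` + `7.8`
Detection routing	signal, threshold, triage, severity, escalation	`7.B2` + `7.13`
Control validation	release gate, evidence chain, rollback, correctness	`7.B3` + `05/06`
Tabletop and game day	scenario, role, decision route, exercise write-back	`7.B4` + `7.19`
Incident handoff	owner, runbook, communication, post-incident review	`7.B2` + `08`
Materials	professional sources, field cases, scenarios, patterns	`7.BM` + `7.B1-7.B4`

Selection entry point

Blue-team analysis asks first: how does the defender make a risk visible and then contained? Once a risk can be described by a red-team problem card, the next step is to turn it into a control plane, signals, validation conditions, and a write-back location.

Relation to the main security module

This subcategory adds a defensive operations view to the main security module. The main module defines problem nodes and routing rules; the blue-team subcategory organizes those nodes into defensive assessment, control plane validation, and exercise material.

Relation to the red-team subcategory

Blue team and red team share one risk vocabulary. The red team confirms weaknesses along attack paths; the blue team confirms through the defense workflow whether the control plane can detect, escalate, validate, and write back.

Chapter list

Chapter	Topic	Goal
7.B1	Defense control map	Turn 7.x risk assessment into control planes and owners
7.B2	Detection-to-response routing	Turn signals into triage, severity, and escalation flow
7.B3	Security control validation	Define how control planes are validated with evidence and exercises
7.B4	Tabletop and Game Day	Turn problem cards into exercises and write-back tasks
7.B5	Detection Lifecycle	Make detection rules maintainable assets with a handoff process
7.B6	Incident Triage Loop	Turn signals into a loop of grading, takeover, handling, and evidence
7.B7	Threat-Informed Validation	Validate control planes and detection capability in a threat-informed way
7.B8	Defensive Vocabulary Map	Unify control plane, rule, and handoff language with defensive vocabulary
7.B9	Scenario Library	Turn high-risk situations into replayable exercise material
7.B10	Alert Fatigue	Build signal quality governance and a tuning strategy
7.B11	Vulnerability State Machine	Break vulnerability response into a state machine that can be handed off
7.B12	Defender Pressure	Extract a defender pressure model and write-back routing from real incidents
7.BM	Blue-team material library	Collect professional sources, field cases, tabletop scenarios, and control patterns

This subcategory first establishes the order of defensive assessment and the language of control plane validation, then hands off to the implementation chapters on deployment, reliability, and incident workflow.

For the engineering handoff of the blue-team chapters, see 7.18 How security control planes hand off to deployment and incident workflows and 7.19 Security exercises: from Abuse Case to Game Day.

Module completion status

The blue-team chapters now form a complete loop from control map, rule lifecycle, triage, and threat-informed validation to the scenario library and material library. The material library holds 11 field cases, 4 scenarios, and 7 control patterns, and connects to the main chapters through the 7.B12 defender pressure map, the 7.B9 scenario library, and the 7.24 write-back routing.

Outline for the next round of material

Type	Suggested cards	Exercise responsibility	Covered by
Field case	Identity abuse, entry-point exposure, supply chain drift, data exfiltration, coordination pressure	Provide real defender pressure and decision points	`7.B12`
Scenario	Identity takeover exercise, perimeter entry exercise, supply chain artifact exercise, low-frequency data exfiltration exercise	Provide tabletop and Game Day scripts	`7.B9`
Control pattern	owner, evidence chain, detection lifecycle, vulnerability response, exercise write-back	Provide portable control fields and validation methods	`7.B1` + `7.B3`
Write-back	Product write-back, architecture write-back, runbook write-back, release gate write-back	Turn exercise results into follow-up engineering tasks	`7.24`

Once this round is implemented, the value of the blue-team chapters moves from "explaining the defense workflow" to "providing exercise material that can be assembled directly".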

LLM Service Detection Signal Coverage

Tue, 12 May 2026 00:00:00 +0000

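This chapter brings the anomalous-behavior signals of LLM services into the existing detection framework of 7.13 Detection Coverage and Signal Governance. LLM service signals differ from those of an ordinary service in that you must look at three semantic layers (prompt / response / tool call), not only traffic and error rate. The key example of an LLM-specific signal is refusal rate; for general alerting vocabulary see the alert, alert-fatigue, and symptom-based-alert cards. This chapter focuses on that special layer and reuses the general detection workflow of 7.13.

Writing boundary of this chapter

This chapter focuses on detection signal design for production LLM services: tool call anomalies, signs that a prompt injection has triggered, abuse patterns, cost / token anomalies, and model behavior drift. General detection platform selection and SIEM / SOAR integration belong to 04-observability and 7.13.

Threat scope of this chapter

In-scope: LLM-specific detection signals (the prompt / response / tool call semantic layers), anomalous agent behavior, abuse patterns, cost anomalies, and model drift.

Out-of-scope (routed to other chapters):

General detection coverage and signal governance → 7.13 detection-coverage-and-signal-governance
Detection platform → 04-observability
IR workflow → 7.10 incident-case-to-control-workflow
Consequences of agent prompt injection → llm-prompt-injection-in-agent
Log / PII governance → llm-log-and-pii-governance

From this chapter to implementation

Mechanism: problem-node table → knowledge-card.
Delivery: handoff routing → 04-observability detection platform, 08-incident-response IR workflow.

Semantic layers of LLM service detection

Detection signals for an ordinary service concentrate on traffic / error / latency / auth events. An LLM service adds three semantic layers:

Prompt semantic layer: content patterns of user input, prompt length distribution, and frequency of special tokens / patterns.
Response semantic layer: content type of model output, refusal rate, output length distribution, and tool call occurrence patterns.
Tool call sequence layer: in agent scenarios, the order and frequency of tool calls and the dependency patterns across tools.

Signals in these three layers usually cannot be captured directly by a traditional monitoring stack; they need an LLM-specific telemetry pipeline.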

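Analysis model

LLM service detection designs signals at four levels:

Traffic level: same as an ordinary service, QPS / latency / error rate / auth events.
Content level: semantic features of prompts and responses (length, token type, sensitive terms).
Behavior level: tool call sequences, agent loop step counts, cross-service call patterns.
Cost level: accumulated tokens / calls and cost anomalies (one tenant spiking suddenly, cost-per-result soaring).

Assessment workflow

The assessment workflow extends "a detection platform that catches ordinary service anomalies" into "a detection platform that catches LLM-specific anomalies".

First, inventory which signal categories the current detection platform covers and which LLM-specific ones are missing.
Next, design the collection path for LLM-specific signals (log → metric → alert).
Then define baselines and anomaly thresholds so that false positives stay manageable.
Finally, hand off to the IR workflow and confirm that each alert maps to a concrete response action.

Problem nodes (case-triggered)

Problem node	Signal to read	Risk consequence	Prerequisite control
Anomalous tool call sequence	Tool calls surge within one session, with frequent jumps across tools	Injection drives the agent into an unintended loop	detection-coverage-and-signal-governance
Refusal rate drops suddenly	The model starts accepting prompts it used to refuse	Alignment is bypassed; an injection attack is in progress	symptom-based-alert
Token usage spikes	One tenant's cost jumps by an order of magnitude	abuse / DoS / automated attack	rate-limit
Prompt contains injection patterns	“ignore previous instructions” / heavy use of system prompt wording	Probing with known injection patterns	symptom-based-alert
Response contains PII patterns	Model output contains credit card / national ID number patterns	Training data leakage / hallucinated PII	data-protection
Cross-tenant pattern similarity	Different tenants show similar anomalous prompts at the same time	Coordinated attack / botnet	symptom-based-alert
Model drift	The same prompt yields clearly different response quality at different times	Model version switch problem / vendor-side change	contract-test

Common risk boundaries

The risk boundaries define when LLM detection coverage has entered a high-pressure state.

If tool call sequence, refusal rate, or token usage lacks a baseline, detection at the content / behavior / cost level is insufficient.
If known prompt injection patterns are not in the alert set, known threats are uncovered.
If cross-tenant pattern analysis is missing, the ability to detect coordinated attacks is insufficient.
If an alert maps to no IR response action, detection and response are disconnected.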

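Special considerations for LLM scenarios

How LLM service detection differs from ordinary service detection:

Signals are unstructured: prompts and responses are free text, not status codes and endpoint names. The detection pipeline needs NLP / embedding techniques, not only grep / regex.
Baselines drift: user behavior and LLM usage patterns keep evolving, so baselines need rolling-window updates more than those of an ordinary service.
The boundary between a "normal" prompt and an "injection" prompt is blurry: a user asking the LLM to write teaching material on prompt injection produces prompts that look like an attacker's test prompts. Detection has to combine intent and context.
Cost-based detection is a strong signal specific to LLMs: in a traditional service, "cost" maps to infrastructure and tends to be treated as an operations issue. In an LLM service, token cost ties directly to abuse, so a cost anomaly is itself a strong signal.
Cross-tenant correlation analysis: coordinated attacks and botnets may probe an LLM service with the same prompt from different accounts, so cross-tenant pattern analysis is more useful than for an ordinary service.
The model vendor is a third-party failure point: vendor-side model updates, API rate limiting, and policy changes directly affect service behavior. Vendor-side signals (status page, release notes) belong in the detection scope.

Core principles of signal design

Traffic level reuses existing monitoring: QPS / latency / error rate / 5xx are the same as for an ordinary service and use the existing platform.
Content level needs an NLP pipeline: prompt length distribution, sensitive-term detector, injection pattern detector, response PII detector.
Behavior level tracks tool call sequences: the tool call DAG of each session, compared against a baseline.
Cost level uses a tenant-scoped baseline: each tenant's tokens / cost follow a rolling baseline, and breaking the threshold triggers an alert.
Cross-tenant patterns use embedding similarity: run similarity analysis on prompt embeddings to find coordinated attacks.
Vendor-side signals are included: the vendor's status page, release notes, and incident announcements should be watched as an external signal source.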
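
The tenant-scoped cost baseline is the easiest of these principles to prototype. Below is a minimal sketch, assuming hourly token counts per tenant and a mean-plus-k-standard-deviations rule; the class name, the default parameters, and the spread floor are all assumptions for illustration, not a recommended production setting.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev


class TenantTokenBaseline:
    """Per-tenant rolling baseline over the last `window` usage samples."""

    def __init__(self, window=24, k=3.0, min_samples=6, min_spread_ratio=0.1):
        self.k = k
        self.min_samples = min_samples
        self.min_spread_ratio = min_spread_ratio
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, tenant, tokens):
        """Record one sample; return True if it breaks the tenant's baseline."""
        past = self.history[tenant]
        anomalous = False
        if len(past) >= self.min_samples:
            mu = mean(past)
            # Floor the spread so a perfectly flat history does not alert on noise.
            spread = max(pstdev(past), self.min_spread_ratio * mu)
            anomalous = tokens > mu + self.k * spread
        past.append(tokens)  # the window keeps rolling, anomalies included
        return anomalous


baseline = TenantTokenBaseline()
warmup = [baseline.observe("tenant-a", 1000) for _ in range(10)]
spike = baseline.observe("tenant-a", 10000)  # an order-of-magnitude jump
```

Because each tenant has its own window, a heavy but steady tenant does not alert while a small tenant that suddenly jumps does. A new tenant produces no alerts until `min_samples` observations exist, which is a deliberate cold-start trade-off in this sketch.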

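Case trigger references

Public cases on LLM service detection are still accumulating. Directions worth tracking:

Public write-ups of abuse detection pipelines at large LLM vendors
Real cases of prompt injection attacks in production agent scenarios
Botnet cases of token usage abuse

Once enough LLM-specific detection cases exist, they will be added under red-team/cases/llm-detection/. For general detection cases see 7.13 detection-coverage-and-signal-governance.

Fact-check note: detection baselines, attack patterns, and defense tooling for LLM services are all evolving quickly. The signal types listed in this chapter reflect common community practice as of May 2026. Specific thresholds, tooling, and commercial products change over time, so check the latest research and product documentation before citing.

Cited standards

Standard	Version / year	Applicable scenario
MITRE ATLAS	continuous	Reference for AI system threat tactics / detection tactics
OWASP LLM Top 10	2025	General reference for LLM application security
NIST AI RMF	1.0 (2023)	Reference for AI system risk detection
MITRE ATT&CK	continuous	General system threat tactics, partly applicable to LLM service infrastructure

For citation version and cadence rules see security-citation-currency-and-precision. Last reviewed: 2026-05-12.

Next-step routing

General detection coverage: 7.13 detection-coverage-and-signal-governance
Detection platform: 04-observability
Consequences of agent prompt injection: llm-prompt-injection-in-agent
Log / PII governance: llm-log-and-pii-governance
Incident case workflow: 7.10 incident-case-to-control-workflow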

7.B2 Routing from Detection to Response

Thu, 30 Apr 2026 00:00:00 +0000

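This article turns security detection signals into response routes. After reading it, you can take an alert, tripwire, audit signal, or external report and turn it into triage, severity, owner, and an escalation flow.

Core argument

The core idea of detection-to-response routing is that a signal must be able to drive a decision. Detection provides the observation; the response route defines who assesses it, how it is graded, when it escalates, and when it closes.

Reader entry point

This article follows on from 7.13 Detection Coverage and Signal Governance and 7.14 Security Governance Exceptions and Tripwire, and from Escalation Policy.

Routing fields

Field	Responsibility	Common sources
Signal	Describes the triggering event and the observed evidence	alert, audit log, external advisory
Triage question	Defines the first-round assessment questions	scope of impact, credibility, urgency
Severity	Maps to product impact and response tempo	incident severity
Owner	Defines the role that takes over and the escalation path	on-call, service owner, security owner
Exit condition	Defines the closing condition for this round of response	containment, validation, write-back

The point of the routing fields is to turn a signal into an executable task. When the fields are complete, the team can still assess and escalate consistently under pressure.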
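
One lightweight way to enforce field completeness is to treat the route as a record and check it before an alert goes live. The sketch below is only an illustration of that idea: the `ResponseRoute` class and `missing_fields` helper are hypothetical, with field names taken from the table above.

```python
from dataclasses import dataclass, fields


@dataclass
class ResponseRoute:
    signal: str
    triage_question: str
    severity: str
    owner: str
    exit_condition: str


def missing_fields(route):
    """Return the names of routing fields that were left blank."""
    return [f.name for f in fields(route) if not getattr(route, f.name).strip()]


route = ResponseRoute(
    signal="tripwire: exception past expiry",
    triage_question="Is the scope of impact still growing?",
    severity="SEV2",
    owner="",  # nobody assigned yet
    exit_condition="containment validated and write-back filed",
)
gaps = missing_fields(route)  # the owner field is flagged
```

Running a check like this in review catches routes with no owner before they are needed during an incident.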

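Signal classification

Signal classification establishes priority. Start by separating three sources:

Technical signals: monitoring, scanning, validation results.
Process signals: expired exceptions, delayed reviews, failed gates.
External signals: public vulnerabilities, supply chain advisories, customer reports.

Triage question design

Triage questions shorten the first round of decision making. Common questions include:

Is the scope of impact still growing?
Is the signal credible enough to trigger escalation?
Does the current evidence support containment?
Does the event need a cross-team decision?

Severity alignment

Severity alignment keeps security signals in step with incident tempo. This layer should attach directly to incident severity and the escalation policy.

In practice, define the grading rules first, then bind an owner, a communication tempo, and a closing condition to each grade.

Response routing

Response routing arranges the post-grading actions into a flow. A suggested minimum flow:

Containment: stabilize the impact surface first.
Evidence collection: preserve key evidence in parallel.
Communication: keep internal and external stakeholders in sync.
Write-back plan: reserve an entry point for write-back tasks.

Exit and write-back

Exit defines when this round of the event is complete. Before closing, confirm that:

The impact surface has converged to the target scope.
The event evidence can be traced back.
Follow-up tasks have entered problem cards and the workflow.

Write-back locations should be fixed to the detection rule, the problem card, and the incident workflow so that the next round of assessment converges faster.

Signals to read and their routes

Signal to read	Need it indicates	Next route
Alert names are clear but handlers assess them inconsistently	Triage questions are needed	7.B2 → 08
A tripwire fires but has no escalation target	An escalation route is needed	7.B2 → 7.14
Impact assessment is slow after an external advisory arrives	A service owner map is needed	7.B2 → 7.B1
Detection rules are not updated after the response ends	A write-back loop is needed	7.B2 → 7.16

The table can be used directly as an on-call checklist. Scanning it again after each event quickly shows what to strengthen first in the next round.

Required linked chapters

Completion criteria

When complete, this article lets the reader write one detection signal up as a response route. The route includes at least signal, triage question, severity, owner, escalation path, exit condition, and write-back target.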

7.B10 Alert Fatigue and Signal Quality

Thu, 30 Apr 2026 00:00:00 +0000

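This article establishes a governance method for alert fatigue. After reading it, you can turn noisy alerts into a set of signals that can be graded, handed off, and tuned.

Core argument

The core idea of alert fatigue governance is to manage alert quality as a system capability. Assessment efficiency and decision consistency are the main goals; alert volume serves as a supporting observation metric.

Reader entry point

This article follows on from 7.13 Detection Coverage and Signal Governance, 7.B5 Detection Engineering Lifecycle, and alert fatigue.

Signal quality fields

Field	Responsibility	Metric
Precision	Reduce false positive density	false positive rate
Recall	Keep hitting the important events	missed detection rate
Context richness	Provide enough context for assessment	triage completion rate
Routing quality	Provide the correct takeover route	misrouting rate
Actionability	Provide an executable next step	response start time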
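
Three of these metrics can be computed from triage outcome counts. The sketch below is an assumption-labeled illustration: `signal_quality` is a hypothetical helper, and "false positive rate" is taken here to mean the share of fired alerts that were closed as false positives, which is how SOC reporting usually uses the term.

```python
def signal_quality(true_pos, false_pos, false_neg, misrouted):
    """Compute alert-quality ratios from triage outcome counts.

    true_pos / false_pos: fired alerts closed as real / as false positive.
    false_neg: real events that produced no alert (found by other means).
    misrouted: fired alerts that first went to the wrong owner.
    """
    alerts = true_pos + false_pos
    real_events = true_pos + false_neg
    return {
        "false_positive_rate": false_pos / alerts if alerts else 0.0,
        "missed_detection_rate": false_neg / real_events if real_events else 0.0,
        "misrouting_rate": misrouted / alerts if alerts else 0.0,
    }


quality = signal_quality(true_pos=80, false_pos=20, false_neg=20, misrouted=10)
```

Tracking the false positive rate and the missed detection rate side by side makes it visible when a tuning round lowers one by raising the other.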

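Alert tiering

Alert tiering keeps on-call load under control. Tiers can be split by risk and required action:

Informational: observation-only signals.
Action-required: needs on-call handling.
Escalation-required: needs cross-team escalation.

Tuning cadence

A tuning cadence keeps alert quality improving. Each tuning round records at least the trigger condition, the false positive source, the change made, the scope of impact, and the rollback condition.

Alignment with the triage loop

Aligning with the triage loop keeps alert and response consistent. Alert content provides at least signal source, impact hint, recommended owner, and the next route.

Signals to read and their routes

Signal to read	Need it indicates	Next route
On-call staff keep dismissing the same kind of alert by hand	Rule tuning and tiering are needed	7.B10 → 7.B5
The alert description is not enough to support grading	Context fields must be added	7.B10 → 7.B6
Alert volume drops but missed detections rise	Precision and recall must be balanced	7.B10 → 7.B7
Alert changes lack change evidence	Release gate records must be added	7.B10 → 7.22

Required linked chapters

Completion criteria

When complete, this article lets the reader build a quality governance loop for an alerting system. The output includes at least quality fields, a tiering strategy, a tuning cadence, aligned routing, and write-back locations.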
