Detection-Rule on Tarragon

Splunk → Elastic Security Detection Rule Migration：6 段 phased playbook 跟 5 大踩雷

Mon, 18 May 2026 00:00:00 +0000

本文是跨 vendor migration playbook、cross-link 到 Splunk（source）跟 Elastic Security（target）兩個 vendor overview。Migration playbook 跟 vendor deep article methodology 的 6-section flow 不同 — 是 phased process（audit → translation → parallel run → cutover → cleanup）、強調 時間軸 跟 回退邊界。

為什麼遷：cost / multi-vendor / cloud-native 三條 driver

Splunk → Elastic 遷移在 2022+ 變主流選項、driver 通常三條疊加：

Driver	觸發場景
Cost	Splunk per-GB ingest pricing 在 5+ TB/day 規模累積到無法接受、Elastic fixed-tier pricing 可省 50-70%
Multi-vendor	想避免 SIEM lock-in、跟 Sentinel / Datadog Security 同時跑形成 portfolio
Cloud-native	已用 Elasticsearch / Kibana 做 application observability、想統一 stack 走 Elastic Cloud / ECK

反向 driver（Elastic → Splunk）也存在但少數 — 主要是 合規 / 政府客戶要 Splunk Cloud GovCloud、或 Splunk Premium ES 的 RBA + UEBA 成熟度仍領先。本文聚焦 Splunk → Elastic、反向流程結構相同但 schema 對位方向相反。

結構：phased migration 不是 6-section deep article

跟 single-feature deep article（Splunk RBA、Vault dynamic credential）不同、migration playbook 的核心是 time-sequenced phase + 回退邊界。6 段 phase：

Phase	內容	預估時長	回退邊界
Phase 0：rule audit	盤點 Splunk 端 rule、量化 precision / FP rate / alert volume	1-2 週	不影響 production
Phase 1：schema 對位	SPL ↔ KQL / ES	QL、CIM ↔ ECS、index ↔ data view 對應規格	1-2 週	不影響 production
Phase 2：translation	rule 一條條轉、AI-assisted + 人工 verify	4-12 週	翻譯失敗的 rule 退回 manual / 標 deferred
Phase 3：parallel run	兩 SIEM 同時跑、alert 兩邊產出、累積 confidence	4-8 週	切回單 Splunk、Elastic 端關 alert
Phase 4：cutover	alert routing 切到 Elastic、Splunk 仍 ingest 但不送 alert	1 週	routing 切回 Splunk、半小時內可逆
Phase 5：cleanup	Splunk ingest 停、歷史資料 archive 到 S3、license decommission	2-4 週	不可逆 — 過早走會失去歷史查詢能力

整個遷移週期 4-9 個月、跟 single deep article 1-2 小時完全不同 scale。

Phase 0：rule audit 建 baseline

遷移前必須先知道 current state：

-- Splunk rule 盤點
| rest /servicesNS/-/-/saved/searches
  splunk_server=local search="alert"
| where disabled=0
| eval rule_age=now()-strptime(updated, "%Y-%m-%dT%H:%M:%S")
| stats count, avg(rule_age) by app, owner

每條 rule 量化四個指標：

指標	怎麼算	用途
Alert volume / day	`index=_audit action=alert_fired rule_name=X` 過 30 天	高 volume 先翻、cutover 期間影響大
Precision (TP / total)	SOC review 過去 30 天 alert、標 TP / FP / unknown	低 precision 先翻（藉機 fix、不是直接複製問題）
Detection coverage	對應 MITRE ATT&CK technique	確認 Elastic 端有對應 coverage、不能漏 tactic
Owner / 維護狀態	rule 的 owner team + 最後 update 時間	Owner 失聯的 rule 翻譯成本爆、考慮直接退役

Audit 階段的關鍵決策：哪些 rule 不翻譯 — production 通常 30-50% rule 是 legacy / dead code / 已 deprecated；遷移是 清理機會、不是「全部複製過去」。

Phase 1：Schema 對位

Splunk 跟 Elastic 的 data model 沒有 1:1 mapping、必須先建對位 spec：

Splunk concept	Elastic 對應	對位難度
SPL search language	KQL（簡單）/ ES	QL（複雜 query、PG 14+ piped）	中、語法差距大但概念對齊
Index	Data view（read）/ data stream（write）	低、概念相同
CIM data model	Elastic Common Schema (ECS)	中、欄位命名差、有對照表（CIM→ECS open source）
Macros	Runtime fields / transforms / ingest pipeline	高、Splunk macro 是 SPL fragment、Elastic 沒對等概念
Lookups	Enrich processors / lookup index	中、邏輯對等但 lifecycle 管法不同
Correlation search	Detection rule（KQL / EQL / Threshold / ML）	中、Splunk 一條 search、Elastic 拆 rule type
Summary index	Transform / rollup	高、Splunk `tstats` summary index 概念複雜
Notable event	Alert + signal（Security app）	低、Elastic 7.x+ 已成熟
Saved search	Saved query	低
Dashboard	Kibana dashboard	中、Splunk XML/SimpleXML 跟 Kibana JSON 不可直接轉

Field mapping 是最大坑：Splunk 自由 schema（extract runtime）vs Elastic 強 type ECS。Splunk 端 src_ip 可能是 string；Elastic 端必須 source.ip 是 ip type — 任何 ingest pipeline 都要先把 raw event 轉成 ECS 結構。

Phase 2：Translation pipeline

實務 translation 用 3-tier hybrid：

Tier 1: vendor tool（cover 30-50%）

Elastic 官方提供 splunk-to-elastic migration assistant（SaaS / on-prem）— 對 簡單 SPL search 自動轉 KQL；cover ratio 視 SPL 複雜度而定。

Tier 2: LLM-assisted（cover 30-40%）

對 中等複雜 SPL（含 stats / eval / where）、用 Claude / GPT 翻譯：

1prompt template:
2"Convert this Splunk SPL to Elastic ES|QL. Preserve detection logic. List any
3unmappable functions.
4
5SPL:
6index=auth action=login user=* | bucket _time span=5m
7| stats count by user, src_ip, _time | where count > 10"

LLM output 必須 人工 verify：

對相同樣本資料跑 SPL vs ES|QL、output 對齊
FP rate 不能惡化
Threshold / window 對等（5m window 跟 5m window 對應）

Tier 3: manual（cover 10-30%）

剩下的是：

含 macro 跨 SPL fragment 的 rule（macro 必須先展開或 inline）
含 summary index 跟 tstats 的高效能 rule
用 transaction / streamstats 的 stateful query

這類 rule 翻譯成 KQL 邏輯後、通常 效能差 5-20x（Splunk summary index 是 precomputed、KQL 是 runtime）；要評估 改用 Elastic transform 或 接受效能下降。

Phase 3：Parallel run

雙 SIEM 同時跑是 最重要的 confidence-building 階段：

1                 ┌─→ Splunk ──→ alert ──┐
2data source ─┤                          ├─→ alert dedup ──→ SOAR / SOC
3                 └─→ Elastic ──→ alert ─┘

Dedup 策略：

Key：rule_name + event_id + timestamp_5min_bucket
Window：5-10 分鐘（兩端有不同處理 latency）
Routing：dedup 後送 SOAR、SOC 看「來自哪個 SIEM」標籤

跑 4-8 週累積：

指標	期望
Alert coverage 一致性	Elastic 抓到 Splunk 的 95%+ 對應 alert
FP rate 不惡化	Elastic FP / Splunk FP ≤ 1.2（允許 20% 浮動）
Detection latency 對等	Elastic 端 alert 時間在 Splunk 端 ± 5 分鐘內
Volume / day	Alert 總數兩端對齊（10% 內）

不對齊的 rule 退回 Phase 2 重新 translation；累積到 95%+ 對齊才能進 Phase 4。

Phase 4：Cutover — routing 切換

1Before cutover:
2  Splunk → SOAR (active routing)
3  Elastic → SOAR (parallel, marked test)
4
5After cutover:
6  Splunk → ingest 持續 / alert disabled
7  Elastic → SOAR (active routing)

Cutover 期間：

PagerDuty / Opsgenie 端 先建 Elastic integration、不立刻 disable Splunk
切換 dedup key 的 routing priority — 同一 alert 優先取 Elastic 那條
保留 Splunk ingest — 不立刻停、提供 fallback 半小時
SOC 24h 監視、無異常進入 Phase 5

回退邊界：cutover 失敗（Elastic 端 alert 大量遺漏 / 延遲）→ routing 切回 Splunk、Elastic 端 alert 再標 test、回 Phase 3。回退時間 30 分鐘內。

Phase 5：Cleanup — 不可逆階段

Splunk ingest 停、license decommission、歷史資料 archive：

1# 1. 歷史 archive 到 S3（Splunk DDAS / Smart Store / 第三方）
2splunk export ... | aws s3 cp - s3://splunk-archive/
3
4# 2. 確認 archive 可查（cold storage retrieve test）
5# 3. Splunk indexer disable / Splunk Cloud subscription downgrade

不可逆邊界：Splunk license 退掉、historical query 必須走 S3 + 重 ingest 才能跑、SLA 從即時變天級。決策關鍵：

法規 retention（GDPR / SOX / HIPAA）多久
Incident response 需要 historical query 的頻率
翻譯後的歷史資料 indexable in Elastic？多數情況 ECS 跟 CIM 結構差太大、historical 不直接可查

實務 default：Splunk Cloud 保留最低 tier 1 年、Elastic 接新資料；1 年後再評估 archive 策略。

Production 故障演練

Case 1：Macro 跨 SPL 沒對應 KQL function

徵兆：translation tool 把 macro \my_internal_lookup(…)`` 標 unmappable、人工翻譯後發現 macro 含 3 個巢狀 macro、共 80 行 SPL 邏輯；KQL 端拆成 5 個 runtime field + 2 個 ingest processor 才對等。

修法：

Audit 階段 用 splunk btool savedsearches list | grep 找所有 macro 使用點、估翻譯成本
Inline 策略：macro 在 5 處以下、直接 inline 到 detection rule、不重建 KQL macro
Ingest processor 策略：macro 是 資料轉換 邏輯、放 Elastic ingest pipeline、不放 detection rule
退役策略：macro 已 deprecated、不翻譯、把使用的 rule 一起退役

Case 2：Time zone parsing 差異

徵兆：parallel run 階段、Splunk 跟 Elastic 對同一個 raw event 解出的 _time 差 8 小時；dedup key 沒對齊、雙 alert 都觸發。

根因：Splunk _time 是 epoch、time zone 由 props.conf 端決定；Elastic ingest pipeline 用 date processor、time zone 預設 UTC。raw event 有 Asia/Taipei timestamp、Splunk 解 UTC、Elastic 解 local。

修法：

Ingest pipeline 統一：所有 raw event 在 ingest 時轉 UTC、不依賴 source-side time zone
dedup 容忍 window：dedup window 拉到 30 分鐘、cover time zone 漂移
schema 對位 spec 明示時區處理：Phase 1 spec 要列「所有時間戳統一 UTC」

Case 3：Summary index 翻譯效能爆

徵兆：Splunk 端 tstats count from datamodel=Authentication where _time>=-7d 跑 2 秒、翻譯成 KQL 後 Elastic 跑 45 秒；SOC dashboard 端 timeout。

根因：Splunk summary index 是 precomputed（小時 / 天聚合預先算好）、tstats 直接讀 summary；KQL 直接跑 search 是 raw event scan、效能差數量級。

修法：

Elastic Transform：Elastic 端建 continuous transform、把 raw event 預先 aggregate 到 transform index、KQL 查 transform index、效能對等
Rollup index（Elastic legacy）：給 metric-style data 用、deprecated 但仍可
接受 latency：dashboard query 可接受 30s、不必精準對等 Splunk

Case 4：Cutover 期間 PagerDuty dedup key 衝突

徵兆：cutover 後 24h、SOC 收到雙倍 alert；PagerDuty 兩條 incident 各標 splunk 跟 elastic source、實際是同一事件。

根因：PagerDuty 的 dedup key 用 rule_name + alert_id、Splunk alert_id 跟 Elastic signal_id 命名空間不同、PagerDuty 視為兩個獨立 incident。

修法：

預先設計 dedup key：用 rule_name + event_hash、不用 SIEM 內部 ID
PagerDuty routing rule：cutover 期間 disable Splunk source routing、不要靠 dedup
Phase 3 parallel run 期間就測試 dedup：不要拖到 cutover 才發現

Case 5：過早 decommission Splunk、歷史 incident 無法回溯

徵兆：cutover 後 6 個月、發生 incident、需要回查 12 個月前的 auth log；Splunk 已 decom、Elastic 端歷史資料缺、S3 archive 無索引、4 小時找不到 evidence。

根因：Cleanup phase 過早走、沒先做 historical query rehearsal；S3 archive 沒可用的索引層。

修法：

預防：Phase 5 前跑 5 個 historical query drill、驗證 incident response 時能用
架構：S3 archive 配 Elastic frozen tier（searchable snapshot）、6 個月 retrieve latency 接受
法規對齊：Cleanup 時間表必須跟 compliance retention requirement 對齊、不只是 cost-driven

Capacity / cost 對照

維度	Splunk Enterprise / Cloud	Elastic Security	取捨
Pricing model	per-GB ingest（昂貴 in scale）	fixed tier / data tier / per-resource	Elastic 5+ TB/day 規模便宜 50-70%
Ingest performance	強、Splunk forwarder 成熟	強、Elastic Agent / Filebeat	略接近、Splunk 對 unstructured raw 略優
Search performance	強、SPL + summary index	中、KQL runtime + transform	Splunk 對複雜 query 仍領先
Detection content	ES content + SOC content	Elastic Security 内建 detection rule + 開源	兩端都有、Elastic 對 cloud-native 較強
UEBA / ML	ES Premium UEBA、成熟	Elastic ML + 7.x+ rule type	Splunk 領先、Elastic 追趕中
Cloud-native	Splunk Cloud（managed but proprietary）	Elastic Cloud / ECK on K8s	Elastic 更 K8s-friendly
Lock-in	高（SPL / 自家 forwarder / ES app）	中（open-source core + commercial extension）	Elastic 較易遷出（理論上）
Total cost (5y, 10TB/day)	$5-15M USD	$1.5-5M USD	5-3 倍差

整合 / 下一步

跟 SOAR 整合

PagerDuty / Tines / Splunk SOAR：

cutover 期間 SOAR playbook 仍用 Splunk-shaped event、Phase 5 後改 Elastic-shaped
Playbook 內 SPL query 必須改寫 KQL / ES|QL、可 hybrid（短期保留 SOAR 端原 SPL 邏輯）

跟 case management 整合

Jira / ServiceNow / Elastic Cases：

Splunk notable → Jira ticket 用 link field 帶 splunk_url
Elastic alert → Jira 用 elastic_url
兩個 URL field 期間同時存在、Phase 5 後 archive

反向遷移（Elastic → Splunk）

結構 mirror 對稱、phase 仍 6 段、但 schema 對位方向相反：

KQL → SPL 翻譯（vendor tool 對等度低、ES|QL → SPL 更困難）
ECS → CIM 對位
多數企業不會反向遷、reverse migration 多半是合規驅動（特定客戶要 Splunk）

下一步議題

Multi-vendor SIEM portfolio：不選一家、Splunk + Elastic + Sentinel 同時跑、routing 邏輯按 cost / use case 切
AI-native detection：兩家都在發展、translation 流程可能再次重來
Compliance migration constraints：金融 / 政府客戶 SIEM migration 需通過 audit、phase 時間表會被拉長

7.B5 Detection Engineering Lifecycle

Thu, 30 Apr 2026 00:00:00 +0000

本篇的責任是建立偵測規則生命週期。讀者讀完後，能把一條 detection rule 從來源定義、驗證、調校、上線、退場整理成可維護流程。

核心論點

Detection engineering lifecycle 的核心概念是把規則當資產管理。規則資產包含來源、邏輯、測試、誤報處理、owner、驗收門檻與退場條件。

讀者入口

本篇適合銜接 7.B2 從偵測到回應的路由、7.13 偵測覆蓋率與訊號治理與 Sigma：偵測規則生命週期素材。

生命周期欄位

欄位	責任	常見來源
Rule source	描述規則來自哪個威脅假設與資料源	Sigma、事件復盤、演練結果
Detection logic	定義條件、例外、聚合方式	rule repository、query package
Validation evidence	證明規則可命中目標情境	測試事件、回放資料、對照 log
Tuning decision	收斂誤報與漏報	triage 結果、分析註記、例外記錄
Release condition	定義規則上線條件	release gate、變更審查
Retirement condition	定義規則退場條件	覆蓋重疊、威脅變化、資料源變動

生命周期欄位的核心是讓規則維護可以追溯。每次規則更新都能回查它解哪個風險、用哪個證據驗證、為何做這次調整。

規則來源治理

規則來源治理的責任是讓規則與威脅假設對齊。來源可來自公開框架、事件教訓、演練情境與稽核要求，並需要在建立時寫清楚 threat hypothesis 與 data dependency。

驗證節奏

驗證節奏的責任是確保規則在上線前後都保持有效。建議至少建立三層驗證：

邏輯驗證：條件可讀、可測、可重現。
資料驗證：log schema 與欄位品質可支撐判讀。
情境驗證：在事件回放或 game day 中能命中目標行為。

調校策略

調校策略的責任是把 alert 噪音轉成可判讀訊號。調校時同步記錄 false positive 情境、排除條件、影響範圍與回退方式，並和 incident severity 對齊分級節奏。

上線與退場

上線與退場的責任是讓規則變更進入受控流程。上線前需確認 evidence、owner 與回退路徑；退場時要確認替代規則、覆蓋遷移與歷史證據保留。

與事故流程的交接

與事故流程交接的責任是把規則命中轉成回應路由。規則命中後應直接輸出 triage 問題、owner、升級條件與 runbook 路由，讓 08 模組可以快速接手。

判讀訊號與路由

判讀訊號	代表需求	下一步路由
規則持續觸發但分析結論分散	需要調校紀錄與 triage 問題	7.B5 → 7.B2
規則上線後缺少驗證證據	需要補 validation evidence	7.B5 → 7.B3
相同風險出現多條重複規則	需要整理來源與退場條件	7.B5 → 7.B1
規則變更未進入放行流程	需要 release condition	7.B5 → 05
事故後規則未更新	需要 write-back 閉環	7.B5 → 7.24

判讀表格的作用是把規則問題轉成維護任務。每一列都能直接對應到 owner 與下一步交接章節。

必連章節

完稿判準

完稿時要讓讀者能為一條偵測規則設計完整生命週期。輸出至少包含來源、邏輯、驗證證據、調校策略、上線條件、退場條件與回寫位置。

Sigma：偵測規則生命週期素材