Prometheus on Tarragon

Prometheus Capacity Planning and Failure Modes

Mon, 22 Jun 2026 00:00:00 +0000

This article is a vendor deep-dive on Prometheus. It expands the "Cardinality management" and "Memory pressure" sections of the overview. Readers new to Prometheus should start with the Prometheus service page.

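Positioning

Prometheus has a different capacity model from a traditional database: its limits are set mainly by the number of active series (cardinality) and the retention period, not by row count or raw disk size. Understanding how Prometheus consumes resources is what lets you judge when a single instance is enough and when you need to offload through remote write or migrate to Mimir / Thanos.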

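Resource consumption model

Memory: driven by active series

Prometheus keeps recent time series in memory (the head block). Each active series costs roughly 3-4 KB of memory, covering index, chunks, and postings. This is a commonly cited rule of thumb for the Prometheus TSDB; the real figure depends on label length and chunk encoding.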

Active series	Estimated memory (head block)	Suitable machine size
100,000	~400 MB	Any VM
1 million	~4 GB	8 GB VM
5 million	~20 GB	32 GB VM
10 million	~40 GB	64 GB VM

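These figures cover the head block only, at about 4 KB per series. They exclude the transient overhead of query execution and WAL replay. Heavy PromQL queries (wide-range aggregations, joins across many series) can use several additional GB of transient memory.

Indicators to read: prometheus_tsdb_head_series is the current number of active series, and process_resident_memory_bytes is actual memory use. When the ratio between them drifts from expectation (for example 500,000 series but 10 GB of memory), likely causes are query memory pressure, heavy series churn, or a recent WAL replay.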
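The ratio described above can be tracked continuously with a recording rule and a coarse alert. This is a minimal sketch, assuming Prometheus scrapes itself under job="prometheus"; the file name, rule names, and the 16 KiB threshold are illustrative assumptions, not values from the source.

```yaml
# rules/prometheus_self.yml (hypothetical file name)
groups:
  - name: prometheus_self_capacity
    interval: 60s
    rules:
      # Resident memory per active series, in bytes. Both metrics come from
      # the same self-scrape target, so their label sets match one-to-one.
      - record: instance:prometheus_memory_bytes_per_series:ratio
        expr: |
          process_resident_memory_bytes{job="prometheus"}
          /
          prometheus_tsdb_head_series{job="prometheus"}
      - alert: PrometheusMemoryPerSeriesHigh
        # 16 KiB is an assumed threshold, roughly 4x the 4 KB rule of thumb.
        expr: instance:prometheus_memory_bytes_per_series:ratio > 16 * 1024
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} uses {{ $value | humanize1024 }}B of memory per active series"
```

A sustained high ratio does not say which cause applies; it only signals that memory is no longer explained by series count alone.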

Disk: driven by retention and ingestion rate

Prometheus disk usage = ingestion rate × retention period × compressed size per sample (about 1-2 bytes, a rule of thumb under Gorilla-style compression).

Ingestion rate	Retention	Estimated disk (at ~1 byte per sample)
100,000 samples/sec	15 days	~130 GB
100,000 samples/sec	30 days	~260 GB
500,000 samples/sec	15 days	~650 GB

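Disk I/O pressure usually comes from compaction: Prometheus periodically writes the head block out as a persistent block and merges older blocks. Disk writes and CPU use rise briefly during compaction. On SSD this is rarely a problem; on HDD it can cause scrape timeouts.

CPU: driven by scrape volume and query load

A single scrape is cheap (HTTP GET plus parsing), but the number of targets divided by the scrape interval sets the CPU baseline. 1,000 targets at a 15-second interval is about 67 scrapes per second, which one core can handle.

Queries are the main CPU consumer. Recording rule evaluation, alert rule evaluation, and dashboard panel queries each take a share. Once recording rules grow into the hundreds, evaluation CPU can become the bottleneck.

Indicator to read: when prometheus_rule_group_last_duration_seconds for a group exceeds that group's evaluation interval (prometheus_rule_group_interval_seconds), the rules cannot finish in time and alerts are delayed. The p99 of prometheus_rule_evaluation_duration_seconds shows how slow individual rule evaluations are.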

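Diagnosing runaway cardinality

Cardinality is the most common Prometheus capacity problem. One accidental high-cardinality label (user_id, request_id, a full URL) can push the series count from 100,000 to 1 million within minutes and consume several GB of memory.

Signals

prometheus_tsdb_head_series keeps growing with a steep slope
rate(prometheus_tsdb_head_series_created_total[5m]) rises (the rate at which new series are created)
Prometheus memory keeps climbing and ends in an OOM kill
Query latency increases (more series to scan)
Compaction takes longer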

How to locate the source

# Find the metric names with the most series
topk(10, count by (__name__)({__name__=~".+"}))

# Find the jobs (scrape targets) with the most series
topk(10, count by (job)({__name__=~".+"}))

# Find which label combinations of one metric are exploding
count by (method, status) (http_requests_total)

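Remediation

Label allowlist: drop high-cardinality labels in the scrape config or relabeling rules
Metric relabeling: metric_relabel_configs removes specific labels after the scrape and before ingestion
Recording rules as a replacement: aggregate the high-cardinality metric into low-cardinality recording rules, and have downstream consumers read only those
Move to traces: put dimensions such as user_id / request_id on trace span attributes instead of metric labels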
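The metric relabeling option above might look like the following fragment. It is a sketch only: the job name, target address, label names, and the myapp_debug_ prefix are hypothetical.

```yaml
# prometheus.yml fragment. Job name, target, and label names are hypothetical.
scrape_configs:
  - job_name: myapp
    static_configs:
      - targets: ["myapp:8080"]
    metric_relabel_configs:
      # Remove the per-request labels from every scraped series.
      - action: labeldrop
        regex: "user_id|request_id"
      # Drop one family of debug metrics entirely, matched by name.
      - source_labels: [__name__]
        regex: "myapp_debug_.*"
        action: drop
```

One caveat with labeldrop: series that differed only in the dropped labels collapse into the same label set, so where possible the cleaner fix is to stop emitting the label in the application.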

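Common failure modes

OOM kill

Trigger: active series exceed memory capacity, or a heavy query consumes a large amount of transient memory.

Symptoms: the kernel OOM killer terminates the Prometheus process. After restart, WAL replay can take from a minute to ten minutes or so depending on WAL size, and scrapes and queries are unavailable during that time.

Prevention: alert on memory use (process_resident_memory_bytes / machine memory > 70%), track the slope of cardinality growth, and set query timeouts.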
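Tracking the cardinality growth slope can be expressed with predict_linear. The rule below is a sketch under assumptions: the job label, the 6h lookback, the 24h horizon, and the 5 million ceiling (taken from the sizing table) would all need tuning per environment.

```yaml
groups:
  - name: prometheus_cardinality_growth
    rules:
      - alert: PrometheusSeriesGrowthTooFast
        # Extrapolate the last 6h of head series 24h ahead. 5e6 mirrors the
        # 5 million series row of the sizing table and is illustrative.
        expr: predict_linear(prometheus_tsdb_head_series{job="prometheus"}[6h], 24 * 3600) > 5e6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} is on track to exceed 5M active series within 24h"
```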

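Cascading scrape timeouts

Trigger: a target's metrics endpoint responds slowly (longer than scrape_timeout), or many targets slow down at the same time.

Symptoms: the up metric is 0, scrape_duration_seconds rises, and dashboards show gaps (missing data points). Each target is scraped in its own loop, so one slow target does not directly block the others, but many simultaneous timeouts hold connections and buffers open until the timeout expires and add pressure on the Prometheus process.

Fix: tune scrape_timeout (default 10s; too short a value causes false timeouts), move slow targets into their own scrape job, or optimize the metrics endpoint (expose fewer metrics).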
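Moving a slow target into its own job with a longer timeout could be configured roughly as follows. Job names, targets, and durations are assumptions for illustration; the one hard constraint is that scrape_timeout may not exceed scrape_interval.

```yaml
# prometheus.yml fragment. Job names and targets are hypothetical.
global:
  scrape_interval: 15s
  scrape_timeout: 10s
scrape_configs:
  # Fast targets keep the global defaults.
  - job_name: myapp
    static_configs:
      - targets: ["myapp:8080"]
  # The slow exporter gets a longer interval and timeout of its own.
  - job_name: slow-exporter
    scrape_interval: 60s
    scrape_timeout: 30s
    static_configs:
      - targets: ["slow-exporter:9100"]
```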

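WAL corruption

Trigger: the WAL can be damaged when the Prometheus process terminates abnormally (OOM kill, power loss).

Symptoms: WAL replay fails after restart and Prometheus cannot start. The error log reports a corrupted WAL or an invalid segment.

Fix: delete the damaged WAL segment (losing the data for that time range) and restart Prometheus. In severe cases, delete the whole data directory and start over (losing all history). The WAL's durability guarantees are weaker than a database's: Prometheus is designed to tolerate brief data loss, and long-term storage relies on remote write to Mimir / Thanos.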

Recording rule evaluation lag

Trigger: many recording rules with complex expressions, so evaluation time exceeds the evaluation interval.

Symptoms: prometheus_rule_group_last_duration_seconds exceeds prometheus_rule_group_interval_seconds. Dashboard panels that read recording rules show data lagging behind the current time. Alert rules run in the same evaluation pipeline, so evaluation lag also delays alerts.

Fix: split heavy recording rules into their own rule groups (each with its own evaluation interval), optimize the PromQL expressions (fewer aggregation layers, shorter time ranges), or offload recording rules to Mimir (the ruler component scales independently).
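The lag condition above can itself be alerted on. This sketch assumes Prometheus scrapes its own metrics; the 80% margin, durations, and rule names are illustrative choices.

```yaml
groups:
  - name: prometheus_rule_health
    rules:
      - alert: PrometheusRuleGroupSlow
        # Both metrics carry the rule_group label, so they match one-to-one.
        # Fires when a group uses more than 80% of its interval.
        expr: |
          prometheus_rule_group_last_duration_seconds
          >
          0.8 * prometheus_rule_group_interval_seconds
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Rule group {{ $labels.rule_group }} is close to overrunning its interval"
      - alert: PrometheusRuleGroupMissedIterations
        # Counts evaluations that were skipped because the previous one
        # was still running.
        expr: increase(prometheus_rule_group_iterations_missed_total[10m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Rule group {{ $labels.rule_group }} skipped evaluations"
```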

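When to move off a single Prometheus

Signal	Next step
Active series > 5 million and memory is tight (on a 32 GB VM, a ~20 GB head block plus query overhead approaches the limit)	Remote write to Mimir / Thanos for long-term storage
Cross-region / cross-cluster queries are needed	Thanos Query or Mimir multi-tenancy
Recording rule evaluation lag persists	Offload rule evaluation to the Mimir ruler
HA is required (a single Prometheus is a SPOF)	Two instances plus Thanos dedup
Retention must exceed 90 days but disk is insufficient	Remote write plus short local retention

The first step in moving off is usually adding remote write: Prometheus keeps scraping and serving short-term queries locally while long-term data goes to the remote backend. This is the lowest-risk path because it requires no changes to scrape configs or PromQL.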

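Where to go next

Prometheus service page: overview and day-to-day operations
4.7 Cardinality: the full cardinality governance strategy
4.2 Metrics basics: query-side design of recording rules and rollups
Grafana Stack: Mimir as the long-term storage backend for Prometheus
4.23 Observability query design: where recording rules fit in query design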

PromQL and Recording Rules in Practice

Mon, 22 Jun 2026 00:00:00 +0000

This article is a vendor deep-dive on Prometheus. It expands the "PromQL queries" and "Recording rules / Alerting rules" sections of the overview. Readers new to Prometheus should start with the Prometheus service page.

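Problem context

Recording rules precompute expensive real-time aggregations into low-latency series, which lowers dashboard query cost and stabilizes alerting expressions. Three triggers push a team to take PromQL and recording rules seriously:

Some Grafana dashboard panels take more than 10 seconds to load. The usual cause is that the panel queries a high-cardinality raw metric directly and runs a full range-query aggregation on every load. With recording rules the aggregation is precomputed and the dashboard reads only the resulting series, so query time drops from seconds to milliseconds.

An alert is meant to express "error rate over the last 5 minutes is above 1% and stays there for 2 minutes", but the PromQL as written either misses incidents (the rate collapses around a counter reset) or fires falsely (an absent series leads to a NaN comparison). The root cause is an imprecise understanding of how counter and gauge semantics differ.

Hundreds of recording rules have piled up with no naming convention. Nobody is sure whether a new rule overlaps an existing one or whether the evaluation order is correct. Without structured rule management, rule group evaluation time gradually creeps past the interval.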

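Core concepts

Querying counters versus gauges

A counter is a monotonically increasing cumulative value (total requests, total bytes sent) that resets only when the process restarts. A gauge is an instantaneous value (temperature, goroutine count, queue depth) that moves up and down freely.

Counters must be queried through rate() or increase(). The raw value carries no operational meaning ("5 million requests since startup" is not a useful signal). rate() returns the average per-second increase and increase() returns the total increase over the range. Both handle counter resets automatically: when the value suddenly drops (process restart), rate does not return a negative value.

Gauges can be read directly, with avg_over_time(), max_over_time() and similar functions for statistics over a range.

Two common mistakes: applying rate() to a gauge (rate treats every decrease as a counter reset, so the result is meaningless; use deriv() or delta() for a gauge's rate of change), and applying max_over_time directly to a counter (which returns the largest cumulative value, not the peak QPS).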

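rate versus increase

rate(http_requests_total[5m]) returns the average requests per second over 5 minutes. increase(http_requests_total[5m]) returns the total increase over 5 minutes, which equals rate() * 300.

The choice depends on the reader's mental model: SLI dashboards use rate ("how many per second" is intuitive); reports use increase ("how many in the past hour" is intuitive).

Range selection has a practical lower bound: the range must cover at least 2 scrape intervals. With a 15-second scrape interval, rate(...[30s]) is the bare minimum and can still come up empty when a scrape is late or fails; rate(...[15s]) often sees only one sample and returns no result at all. A range of about 4 scrape intervals is a safer floor. Production setups commonly default to [5m]: long enough to smooth brief jitter, short enough not to delay anomaly detection too much.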

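Bucket design for histogram_quantile

Prometheus histograms collect the distribution of observations using predefined bucket boundaries. histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) computes p95 latency.

Bucket boundaries directly determine accuracy. The default buckets (0.005, 0.01, 0.025, … 10) suit HTTP request latency. If a service's p50 is around 200ms and the only neighboring boundaries are 0.1 and 0.25, the p50 is linearly interpolated between 100ms and 250ms, which limits precision.

Design criterion: keep 2-3 adjacent buckets around both p50 and p99 so the interpolated result stays close to the true value. The SLO latency threshold should also sit on a bucket boundary; for example, if the SLO is p95 < 500ms, 500ms should be a bucket boundary.

Every bucket is a time series. A histogram with 10 buckets × 4 label combinations = 40 bucket series (the +Inf bucket, _sum, and _count add a few more per label combination). Growing to 30 buckets triples the series count for the same metric. Bucket design is a trade-off between precision and cardinality.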

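Label matching rules

PromQL binary operations (/, +, comparisons) by default require both sides to have exactly the same label set to match. This causes trouble in error-rate calculations: rate(http_requests_total{status=~"5.."}[5m]) and rate(http_requests_total[5m]) both carry the status label, so dividing them directly pairs each 5xx series only with itself and does not produce a ratio against total traffic.

The fix is to aggregate both sides so that the status label is dropped:

sum by (job, method) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (job, method) (rate(http_requests_total[5m]))

The on() and ignoring() modifiers control matching without aggregation, but they are harder to read. The recommended production approach is to use sum by() first to control the output label set so both sides line up.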

Configuration: common SLI patterns

Error rate

# recording rule: evaluated every 30s over a 5-minute rate window
groups:
  - name: sli_error_rate
    interval: 30s
    rules:
      - record: job:http_request_error_rate:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

The level:metric:operations naming convention comes from the official Prometheus recommendations: job is the aggregation level, http_request_error_rate is the meaning, and ratio_rate5m is the operation. Following the convention lets teammates read the aggregation granularity and computation straight from the rule name.

Latency percentile

      - record: job:http_request_duration_seconds:p95_rate5m
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

The le label holds the histogram bucket boundary. sum by (job, le) aggregates away the instance dimension while preserving the bucket structure. If le is left out, histogram_quantile has no bucket structure to work with and the result is wrong.

Throughput

      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

The three SLIs (error rate, latency, throughput) form the service's RED metrics (Rate, Errors, Duration). Once recording rules precompute them, a dashboard only needs to read three series.

Alerting rules on top of recording rules

  - name: sli_alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_request_error_rate:ratio_rate5m > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.job }} error rate above 1% for 5 minutes"

The alert expression reads the recording rule instead of the raw metric. This has two benefits: alert evaluation is faster (it reads a precomputed series), and the alert and the dashboard panels share the same recording rules (so the numbers agree).

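Failure modes and limits

Series churn makes absent() unreliable

absent(up{job="myapp"}) detects a target that has disappeared completely (no longer scraped). In Kubernetes, however, frequent rolling updates cause series churn: old pods' series vanish and new pods' series appear. absent() can fire falsely during that short window.

Fix: use absent_over_time(up{job="myapp"}[5m]) instead, which fires only if no series existed for the entire 5 minutes. Note that count(up{job="myapp"}) == 0 does not work as an alternative on its own: count() over an empty result returns nothing rather than 0, so the comparison never fires unless it is written as (count(up{job="myapp"}) or vector(0)) == 0.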
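As an alerting rule, the absent_over_time form might look like this. The group name, alert name, and severity are placeholders; job="myapp" follows the example in the text.

```yaml
groups:
  - name: target_presence
    rules:
      - alert: MyAppTargetsAbsent
        # Fires only when no up series for the job existed at any point
        # in the last 5 minutes, which rides out rolling-update churn.
        expr: absent_over_time(up{job="myapp"}[5m])
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "No up series for job {{ $labels.job }} in the last 5 minutes"
```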

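Circular dependencies between recording rules

A rule in group A reads a recording rule from group B, and group B in turn reads a result from group A. Prometheus evaluates rule groups independently, each on its own interval, with no ordering guarantee between groups, so with a circular or cross-group dependency one side reads the stale result of an earlier round.

Prevention: recording rules should form a DAG (directed acyclic graph). Organize rules in aggregation layers, with lower-level rules aggregating raw metrics and higher-level rules aggregating recording rule output. Rules within one group are evaluated sequentially in declaration order, so rules that depend on each other belong in the same group, with the dependency declared first.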
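A layered group of this kind could be sketched as below. The rule names follow the level:metric:operations convention; the two-layer split is an illustrative alternative way to build a per-job throughput series, not a rule set taken from the source.

```yaml
groups:
  - name: http_requests_layers
    interval: 30s
    rules:
      # Layer 1: aggregate the raw counter per instance.
      - record: instance:http_requests:rate5m
        expr: sum by (job, instance) (rate(http_requests_total[5m]))
      # Layer 2: reads layer 1. It is declared after layer 1 in the same
      # group, so it sees the value written in the same evaluation round.
      - record: job:http_requests:rate5m
        expr: sum by (job) (instance:http_requests:rate5m)
```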

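Large range queries and OOM

A dashboard panel runs rate(metric[30d]) over a 30-day range, so Prometheus has to load 30 days of samples into memory. At a 15-second interval, 30 days is about 172,800 samples per series; across 1 million series that is roughly 1.7 × 10^11 samples, which is not a query that can complete.

Fix: long ranges must go through step-down aggregation with recording rules. First compute a rate(...[5m]) recording rule every 30 seconds, then query avg_over_time(recording_rule[30d]). A recording rule usually has one to two orders of magnitude fewer series than the raw metric.

Prometheus 2.x provides the --query.max-samples flag to limit how many samples a single query may load (default 50 million); beyond that the query returns an error. This is the last line of defense against OOM, not something to rely on routinely.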

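Counter resets distorting rate

A counter drops to zero when the process restarts. rate() and increase() detect resets and compensate, but there are edge cases: if several restarts happen within one scrape interval (a crash loop, for example), rate() can underestimate the true value, because it can detect at most one reset between two samples and the increments before each crash are lost.

How to read this situation: if rate() is clearly lower than expected and pod restarts are recorded in the same period, the underestimate is expected behavior. The fix is to resolve the crash loop itself, not to adjust the PromQL.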

Capacity and cost

CPU cost of recording rules = number of rules × evaluation time per rule × (1 / evaluation interval).

Rule count	Average evaluation time	Interval	CPU cores consumed
50	10ms	30s	50 × 0.01 / 30 = 0.017 core
200	50ms	30s	200 × 0.05 / 30 = 0.33 core
500	100ms	15s	500 × 0.1 / 15 = 3.33 core

The evaluation times in the table are rules of thumb at 100,000 to 500,000 active series. Series count drives evaluation time: a complex aggregation over 1 million series can take 500ms or more, far from the table's assumptions. Measure your own environment with prometheus_rule_group_last_duration_seconds.

500 complex rules at a 15-second interval consume more than 3 CPU cores on rule evaluation. There are three ways to respond:

Relax the evaluation interval to 30s or 60s (at the cost of freshness)
Optimize the rule expressions (fewer aggregation layers)
Offload rule evaluation to the Mimir ruler (horizontal scaling)

The new series that recording rules produce also add cardinality. 200 recording rules × an average of 5 label combinations = 1,000 new series, which is usually acceptable. But if a recording rule performs no aggregation and merely aliases a metric (record: new_name, expr: old_metric), cardinality does not shrink and the only effect is extra write cost.

Indicator to read: the ratio of prometheus_rule_group_last_duration_seconds to prometheus_rule_group_interval_seconds. When the former exceeds the latter, evaluation cannot keep up and both dashboards and alerts lag. See the "Recording rule evaluation lag" section of Capacity Planning and Failure Modes.

Recording rules as a cost-control tool

The observability cost governance case study points out an underrated use: recording rules not only speed up queries, they are also a way to control remote write cost.

The pattern works like this: the application exposes a raw metric with 200 label combinations (per-endpoint × per-status × per-region), and a recording rule aggregates it down to 5 label combinations (per-service × per-region). If remote write uses write_relabel_configs to drop the raw series and forward only the aggregated series produced by recording rules, both remote write bandwidth and long-term storage cardinality fall sharply.

# Step 1 (rule file): the recording rule performs the aggregation
groups:
  - name: cost_optimized
    rules:
      - record: service_region:http_requests:rate5m
        expr: sum by (service, region) (rate(http_requests_total[5m]))

# Step 2 (prometheus.yml): remote write sends only the aggregated series
remote_write:
  - url: "http://mimir:9009/api/v1/push"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "service_region:.*"
        action: keep

The trade-off: long-term storage holds only aggregated data and cannot be drilled back down to the per-endpoint dimension. If an incident needs per-endpoint history, either keep the raw series in local Prometheus (short retention) or accept that long-term storage has aggregated granularity only.

When it fits: if dashboards and alerts look only at service-level aggregates and the per-endpoint dimension is needed only for live debugging (local 15-day retention is enough), the cost saving is worth it. If compliance requires per-endpoint history (for example the evidence chain in the FinTech case), the raw series cannot be dropped.

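How the evaluation interval affects CPU

A rule group's interval sets the evaluation frequency. Moving the same rules from a 30s to a 15s interval doubles CPU use. Moving from 30s to 60s halves it but makes alerts and dashboards less fresh.

Rules of thumb:

Scenario	Recommended interval	Rationale
SLI / SLO recording rules	30s	Balances freshness and cost; the smallest window in most burn-rate alerts is 5 minutes
Capacity trending rules	60s-120s	Trends do not need second-level freshness
High-frequency operational rules	15s	Cases that must align with the scrape interval (for example real-time anomaly detection)

Rule groups on a 15-second interval need close attention to evaluation time: if evaluation itself takes 12 seconds, only 3 seconds of buffer remain. When prometheus_rule_group_last_duration_seconds stays close to prometheus_rule_group_interval_seconds, either split the rule groups across different Prometheus instances or relax the interval.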
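Per-scenario intervals are set on each rule group. The sketch below pairs a 30s SLI group with a 120s capacity-trending group; the group names and the memory trending rule are illustrative assumptions.

```yaml
groups:
  # SLI rules: 30s keeps alerts reasonably fresh at moderate cost.
  - name: sli_rules
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
  # Capacity trending: evaluated 4x less often than the SLI group.
  - name: capacity_trending
    interval: 120s
    rules:
      - record: job:process_resident_memory_bytes:sum
        expr: sum by (job) (process_resident_memory_bytes)
```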

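Integration and next steps

Alertmanager

Alert rules live in the rule_files of Prometheus and are sent to Alertmanager once they fire. Alertmanager handles deduplication, grouping, inhibition, and routing (to PagerDuty / Slack / email). Alert expressions share semantics with recording rules: they read the recording rule rather than the raw metric.

Grafana dashboard

Grafana's Prometheus datasource queries PromQL directly. Dashboard panels should read recording rule series instead of raw PromQL, which shortens dashboard load time and keeps dashboard and alert numbers consistent.

Aligning with SLI/SLO

The SLI metrics that recording rules produce are the data source for 4.6 SLI/SLO signal design. SLO burn-rate alerts read the same recording rules. Make sure the time windows of the SLI recording rules line up with the SLO window (for example, with a 30-day rolling SLO window, the recording rules should provide at least 5m and 1h aggregation granularities for burn-rate calculation).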
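Providing both the 5m and 1h granularities could look like the group below. The 99.9% SLO (an error budget of 0.001) and the 14.4 fast-burn factor are assumptions used for illustration; they are not values given in the source.

```yaml
groups:
  - name: sli_error_ratio_windows
    interval: 30s
    rules:
      - record: job:http_request_error_rate:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_request_error_rate:ratio_rate1h
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[1h]))
          /
          sum by (job) (rate(http_requests_total[1h]))
      - alert: ErrorBudgetFastBurn
        # Assumed SLO: 99.9% over 30 days. Both windows must exceed the
        # burn threshold, so a short spike alone does not page.
        expr: |
          job:http_request_error_rate:ratio_rate5m > (14.4 * 0.001)
          and
          job:http_request_error_rate:ratio_rate1h > (14.4 * 0.001)
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.job }} is burning error budget at more than 14.4x"
```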

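Handoff routes

Prometheus service page: overview and entry point for day-to-day operations
Capacity Planning and Failure Modes: the resource impact as recording rules grow
Remote Write and Long-Term Storage Integration: deployment choices for recording rules in a remote write architecture
4.6 SLI/SLO signal design: how recording rules feed SLO burn rate
4.7 Cardinality governance: recording rules as a cardinality reduction tool
4.23 Observability query design: where recording rules fit in pre-aggregation and query tiering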

Remote Write and Long-Term Storage Integration

Mon, 22 Jun 2026 00:00:00 +0000

This article is a vendor deep-dive on Prometheus. It expands the "Remote write / read" section of the overview. Readers new to Prometheus should start with the Prometheus service page.

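Problem context

Remote write pushes Prometheus metrics to external long-term storage in near real time, lifting the single-instance retention ceiling and the lack of a unified query across instances. Three triggers push a team toward remote write and long-term storage:

Prometheus retains 15 days by default. The business needs to look back over 90 days of trends (capacity planning, quarterly SLO reports, cost attribution), and local disk cannot hold it. A bigger disk extends retention, but Prometheus query performance degrades as data grows: the local TSDB does not downsample, so a 90-day range query scans every raw sample.

Multiple Prometheus instances are spread across clusters (prod-us, prod-eu, staging), and the team needs one query entry point for cross-cluster metrics. Each Prometheus stores only its own data and cannot query across instances. Switching Grafana datasources by hand makes it easy to miss an anomaly in one cluster.

A single Prometheus is a SPOF: metrics are completely unavailable when the process crashes or the VM fails. Running two Prometheus instances that scrape the same targets provides HA, but the two copies differ slightly (scrape timing offsets), so downstream queries need dedup.

Remote write addresses all three: Prometheus keeps short-term local storage (scraping plus live queries) while streaming metrics to a long-term backend. The backend takes care of compaction, cross-instance queries, and HA dedup, plus downsampling where the backend supports it.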

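Core concepts

Remote write protocol

Prometheus sends time series to the remote write endpoint over HTTP POST. Each POST carries a batch of samples (protobuf-encoded, snappy-compressed) and is driven by the Prometheus WAL (write-ahead log): the WAL records every scraped sample, and remote write reads from the WAL and streams to the remote end.

This design makes remote write best-effort with a buffer: if the remote end is temporarily unreachable, samples wait in the WAL for retry. The buffer is bounded in time, not by --storage.tsdb.wal-segment-size (which only sets the size of each segment file, 128 MB by default): the WAL is truncated as the head block is compacted, roughly every two hours, so samples still unsent by then are lost.

Exemplar forwarding

Prometheus has supported exemplars since 2.26: a trace_id / span_id attached to a histogram or counter sample. Remote write can forward exemplars to backends that support them (Mimir, Grafana Cloud). Exemplars let a reader jump from a metric anomaly straight to the matching trace in a tracing backend such as Tempo, which makes them the key bridge from metrics to traces.

To enable: start Prometheus with --enable-feature=exemplar-storage and set send_exemplars: true on the remote_write entry; the backend must also accept exemplars.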
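In configuration terms, exemplar forwarding comes down to a feature flag on the command line plus one remote_write field. The fragment below is a sketch; the URL reuses the Mimir example address from this article and the max_exemplars value is an assumption.

```yaml
# Prometheus must be started with: --enable-feature=exemplar-storage
# prometheus.yml fragment:
storage:
  exemplars:
    # Size of the in-memory exemplar buffer (assumed value).
    max_exemplars: 100000
remote_write:
  - url: "http://mimir-distributor:9009/api/v1/push"
    # Forward exemplars alongside samples; the backend must accept them.
    send_exemplars: true
```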

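Dedup strategy

With a Prometheus HA pair, both instances scrape the same targets and both remote write to the same backend. The backend receives two copies of the samples that are nearly identical but not exactly the same (scrape times differ by about 1-2 seconds).

Thanos and Mimir each have a dedup mechanism. Thanos dedups at query time based on external_labels (the replica label), taking the value from one replica per time window. Mimir dedups on the write path: the distributor's HA tracker elects one replica per cluster and drops the samples from the other.

Dedup requires the two Prometheus instances to set different external_labels (for example replica: a / replica: b) so the backend can tell which series are copies of the same set.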

Configuration

Basic remote write settings

# prometheus.yml
remote_write:
  - url: "http://mimir-distributor:9009/api/v1/push"
    queue_config:
      capacity: 10000
      max_shards: 30
      max_samples_per_send: 5000
      batch_send_deadline: 5s
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "go_.*"
        action: drop

queue_config controls remote write parallelism and batch size:

capacity: number of samples buffered in memory per shard. Too small stalls reading from the WAL and hurts throughput; too large uses memory
max_shards: upper bound on the number of parallel send shards. Too few builds a backlog; too many can overwhelm the remote end
max_samples_per_send: samples per POST. 5000 is a common value
batch_send_deadline: flush within this time even if the batch is not full, so samples are not delayed too long under low traffic

write_relabel_configs filters series before remote write. Internal metrics that do not need long-term retention (Go runtime, scrape metadata) can be dropped here to reduce long-term cardinality and cost.

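External labels (HA and multi-cluster)

global:
  external_labels:
    cluster: prod-us
    replica: a

The cluster label identifies the source cluster and the replica label lets long-term storage dedup. The combination of external_labels must be unique per Prometheus instance. Check which label names the backend expects; Mimir's HA tracker, for example, looks for cluster and __replica__ by default.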

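Comparing three long-term storage options

Dimension	Mimir	Thanos	Cortex
Architecture	Microservices (distributor / ingester / compactor / querier)	Sidecar + Store Gateway + Compactor + Query	Microservices (same lineage as Mimir; Mimir is a Cortex fork)
Deployment complexity	Medium (Helm chart, at least 4 components)	Medium-high (sidecar tied to the Prometheus pod, components spread out)	High (many components, development has slowed)
Query layer	Native PromQL + split/merge	Thanos Query does fan-out + dedup	Native PromQL (shared with Mimir)
Multi-tenancy	Native (X-Scope-OrgID header)	Limited (via labels or separate deployments)	Native (inherited by Mimir)
Downsampling	Not built in (the compactor merges and dedups blocks)	Supported (compactor produces 5m and 1h resolutions)	Not built in
Development status	Active (led by Grafana Labs)	Active (CNCF incubating)	Community-maintained (Grafana Labs moved its effort to Mimir)
Object storage	S3 / GCS / Azure Blob	S3 / GCS / Azure Blob / local	S3 / GCS
Cost model	Self-managed compute + storage; Grafana Cloud bills by active series	Self-managed compute + storage	Self-managed (not recommended for new deployments)

Selection criteria, in order of three dimensions:

Already in the Grafana ecosystem (Grafana dashboards, Loki, Tempo): Mimir is the most natural choice. It has the deepest integration with the Grafana Stack, and Grafana Cloud offers managed Mimir.

Need to minimize changes to Prometheus: Thanos sidecar mode leaves the Prometheus setup largely untouched (the sidecar reads local TSDB blocks), which suits the incremental path of "add long-term storage first, keep Prometheus as is". But the sidecar is tied to the Prometheus pod, and deployment outside Kubernetes is more involved.

Multi-tenancy requirements: Mimir natively isolates tenants (separate TSDB per tenant, query isolation), while Thanos relies on labels or separate deployments.

Cortex is Mimir's predecessor and is not recommended for new deployments. Existing Cortex deployments can follow the Grafana Labs Mimir migration guide.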

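Uber M3: a fourth path

The Uber M3 case chose to build M3DB in-house rather than adopt Mimir / Thanos / Cortex. The reason is timing: when M3DB launched in 2018, Mimir did not exist, Cortex was at an early stage, and Thanos had only just been open-sourced. M3DB's design centers on namespace-level retention (different retention and resolution per namespace) and deep integration with Uber's etcd service discovery.

M3's experience bears directly on the later three: Mimir's per-tenant retention and Thanos's downsampling compactor address problems that M3 ran into first. A new deployment today does not need to retrace M3's path, because Mimir and Thanos are mature. But the design criteria the M3 case exposed still hold:

Cross-cluster queries need fan-out + dedup: all three implement this, but deployment configuration and dedup strategy differ
Downsampling is a necessary long-term cost control: without it, neither the performance nor the cost of a 90-day range query is acceptable
Multi-tenant isolation goes beyond the query layer: per-tenant ingestion rate limits and storage quotas are what prevent "one team's cardinality explosion dragging down the whole platform"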

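Failure modes and limits

Remote write backlog building up in the WAL

Trigger: the remote end is unreachable (network problems, overloaded backend) for more than a few minutes, and unsent samples pile up behind the remote write read position in the WAL.

Symptoms: prometheus_remote_storage_bytes_total stops growing (nothing gets out) and the remote write lag keeps increasing. Watch prometheus_tsdb_wal_storage_size_bytes and disk usage as well: if the disk fills up, Prometheus cannot write new samples and even local scraping is affected. If the outage outlasts WAL truncation (roughly two hours), the unsent samples for that period never reach the backend.

Fix: restore the remote connection first. The backlog catches up automatically once the connection is back, because Prometheus resends the pending samples in WAL order. If catch-up takes too long, max_shards can be raised temporarily to speed it up, but take care not to overwhelm a remote end that has only just recovered.

Prevention: monitor the gap between prometheus_remote_storage_queue_highest_sent_timestamp_seconds and the current time; the gap is the remote write delay. Alert when it exceeds 5 minutes. Bound total TSDB disk usage with --storage.tsdb.retention.size alongside time-based retention; --storage.tsdb.max-block-duration sets block length and does not cap disk usage.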
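The 5-minute lag alert described above can be written as a simple rule. This is a sketch: the group name, alert name, and for duration are assumptions, and the expression will also fire on an instance that has never sent a sample, because the metric stays at 0 until the first successful send.

```yaml
groups:
  - name: remote_write_health
    rules:
      - alert: RemoteWriteLagging
        # Seconds between now and the newest sample timestamp that was
        # successfully sent to this remote write destination.
        expr: time() - prometheus_remote_storage_queue_highest_sent_timestamp_seconds > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Remote write to {{ $labels.url }} is {{ $value | humanizeDuration }} behind"
```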

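Retry storm when the remote endpoint is unavailable

Trigger: the remote write endpoint returns 5xx, or 429 (rate limit) with retry_on_http_429 enabled, and Prometheus retries with exponential backoff. Many shards retrying at once raise CPU and network use.

Symptoms: prometheus_remote_storage_samples_retried_total grows, CPU use rises, and remote write delay widens. If the backend was already overloaded, the retry storm makes things worse.

Fix: min_backoff / max_backoff in the remote write queue_config control the retry interval (defaults 30ms / 5s). Raising min_backoff slows the retry rate. The longer-term fix is for the backend to return 429 with a Retry-After header, which Prometheus honors when retry_on_http_429 is enabled in queue_config.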
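A gentler retry profile might be configured as follows. The backoff values are illustrative assumptions rather than recommendations, and the URL reuses the Mimir example address from this article.

```yaml
remote_write:
  - url: "http://mimir-distributor:9009/api/v1/push"
    queue_config:
      # Start retries at 1s instead of the 30ms default.
      min_backoff: 1s
      # Cap the exponential backoff at 30s instead of the 5s default.
      max_backoff: 30s
      # Treat 429 responses as retryable so Retry-After is honored.
      retry_on_http_429: true
```

Longer backoff trades faster recovery for less pressure on a struggling backend, so it is worth checking the resulting lag against the WAL truncation window.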

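Metric semantics drift

Trigger: write_relabel_configs differ across Prometheus instances, or external_labels are misconfigured.

Symptoms: the same metric shows up in long-term storage as series with different semantics: some instances kept a label, others dropped it. Dashboard results are inconsistent (depending on which instance's series the query hits).

Fix: manage remote write write_relabel_configs centrally (a config template, or PrometheusSpec.remoteWrite in the Prometheus Operator). After every relabel change, verify that series label sets are consistent across all instances. Mimir's active series API can list the label sets of all current active series.

Remote write protocol version mismatch

Trigger: the Prometheus version and the long-term storage backend expect different remote write protocol versions. Prometheus 2.x uses remote write v1 (protobuf + snappy); some newer backends are starting to support v2 (native histogram support, metadata improvements).

Symptoms: the backend returns 400 Bad Request. By default Prometheus does not retry on 4xx (treated as a client error where retrying is pointless), and the samples are dropped. prometheus_remote_storage_samples_failed_total grows, but without the obvious retry storm that 5xx causes, so this silent loss is harder to notice.

Fix: confirm protocol compatibility between the Prometheus version and the backend. Mimir / Thanos documentation usually states the supported remote write protocol versions. On a mismatch, upgrade Prometheus or downgrade the backend configuration.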

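When a single Prometheus is no longer enough

When the three signals below appear together, remote write plus long-term storage goes from optional to necessary:

Active series exceed 5 million. Around 5 million series a single Prometheus starts to show memory pressure (head block ~20 GB), longer WAL replay (restarts take minutes), and compaction eating CPU. This is the ceiling Uber hit in the M3 project: metrics scraped across dozens of clusters added up to far more series than one machine could hold, and "run Prometheus on a bigger VM" was no answer, because Prometheus TSDB compaction is single-threaded and vertical scaling has diminishing returns.

Retention needs exceed 30 days. As local TSDB retention grows, range query performance degrades linearly: a 90-day range scans 6 times as many blocks as a 15-day range. Downsampling is available in some long-term storage backends (the Thanos compactor produces 5-minute and 1-hour resolutions), but the Prometheus local TSDB does not downsample. Uber's M3DB designed namespace-level retention (short-term 48h at full resolution, long-term 1y downsampled) so that query cost does not grow linearly with retention.

Unified cross-cluster queries. When multiple Prometheus instances each scrape a different cluster, engineers need one entry point to see "checkout error rate across all clusters". Switching Grafana datasources by hand is error-prone. Remote write brings every instance's metrics into the same long-term storage, with a single query entry point (Mimir querier / Thanos Query) doing the fan-out.

In mid-sized companies (50-200 services, 3+ Kubernetes clusters) these three needs usually surface together within 1-2 years. There is no need to wait for all three when planning remote write: any one of them is a reasonable trigger to start.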

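Capacity and cost

Remote write bandwidth

Remote write bandwidth ≈ ingestion rate × compressed size per sample (about 1-2 bytes with snappy, as a rough rule of thumb).

Ingestion rate	Estimated bandwidth	Reference scale
100,000 samples/sec	~100-200 KB/s	Small: 5-10 services, 1 cluster
500,000 samples/sec	~500 KB/s-1 MB/s	Medium: 50 services, 2-3 clusters
2 million samples/sec	~2-4 MB/s	Large: 200 services, 5+ clusters
10 million samples/sec	~10-20 MB/s	Platform scale: Uber M3 class

At a 15-second scrape interval each active series produces about 0.067 samples per second. 1 million active series therefore ingest about 67,000 samples/sec, or roughly 70-140 KB/s of remote write bandwidth. On an internal network this is rarely the bottleneck.

The real bottlenecks are elsewhere: round-trip latency caps per-shard throughput (each POST waits for its response before the next batch is sent), and backend ingestion capacity determines how many samples/sec can be absorbed. Mimir's distributors and ingesters scale horizontally, but every added ingester adds compute cost. Bandwidth is only the first step of capacity planning; size the real deployment by continuously watching Mimir's cortex_distributor_received_samples_total and cortex_ingester_memory_series.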

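Compaction and downsampling cost in long-term storage

The Mimir and Thanos compactors periodically merge blocks, and the Thanos compactor also downsamples (to 5m and 1h resolutions). Compaction uses CPU and disk I/O, but it runs on the long-term storage's own compute and does not affect Prometheus.

Cost structure:

Compute: CPU / memory for distributor + ingester + querier + compactor. The ingester is typically the most resource-hungry component in Mimir (it holds active series in memory)
Object storage: S3 / GCS volume ≈ ingestion rate × retention × compressed bytes per sample. Compaction and downsampling can reduce the volume further (typically 2-5x)
Query cost: long range queries read many blocks, which on cloud object storage means GET request charges. Mimir uses an index cache (memcached) to cut GET requests for repeated queries

Compared with the Prometheus local TSDB, long-term storage trades disk cost for object storage cost (usually cheaper) but adds compute cost (the ingesters, queriers, and compactors of the long-term store). To find the break-even point, compare local SSD cost × retention against object storage cost + compute cost. Once retention exceeds 30 days, the object storage cost advantage is usually clear.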

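Integration and next steps

Connecting to Grafana Stack LGTM

Mimir is the metrics backend of the Grafana Stack LGTM (Loki + Grafana + Tempo + Mimir). Once Prometheus remote writes to Mimir, Grafana uses Mimir as a Prometheus-compatible datasource and the query language remains PromQL. Exemplar forwarding lets Mimir metrics link to Tempo traces.

Connecting to the telemetry pipeline

Remote write is the metrics ingestion stage of the 4.11 telemetry pipeline. If the OpenTelemetry Collector is also in use, the Collector can act as a relay (receive Prometheus scrape → OTLP export → Mimir OTLP endpoint), but each extra hop adds a failure point. Direct Prometheus → Mimir remote write is the simplest path.

Connecting to cost attribution

The multi-tenancy of long-term storage lets 4.15 cost attribution split metrics cost by tenant / team / service. Mimir's per-tenant active series quota controls cardinality and cost at the same time.

Handoff routes

Prometheus service page: overview and entry point for day-to-day operations
PromQL and Recording Rules in Practice: where to deploy recording rules in a remote write architecture
Capacity Planning and Failure Modes: remote write as the offload path when capacity is exceeded
Grafana Stack: the full operations guide for Mimir as long-term storage
4.11 Telemetry Pipeline: where remote write fits in the pipeline architecture
4.15 Cost Attribution: splitting the cost of multi-tenant metrics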