Metrics on Tarragon

Cloud Monitoring Metrics Model and MQL

Mon, 22 Jun 2026 00:00:00 +0000

This article is a vendor deep dive for GCP Cloud Operations. It expands on the "Cloud Monitoring uptime checks / SLO" and "OTLP integration" sections of the overview. Readers new to GCP observability should start with the GCP Cloud Operations service page.

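Problem context

GCP services write metrics to Cloud Monitoring by default, so an engineer can open Metrics Explorer and immediately see CPU, memory, and request count. Problems usually show up in three places: the resource model of GCP built-in metrics and application-level business metrics describe the same thing in different vocabularies; PromQL users have to learn MQL syntax; and alerting policy condition types and notification channel configuration are more complex than expected. Understanding the Cloud Monitoring metrics model is what prevents runaway custom metrics, alert noise, and friction when connecting to the Prometheus ecosystem.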

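Core concepts

Monitored resource and metric descriptor

The Cloud Monitoring data model has two axes: the monitored resource describes "who produced this metric", and the metric descriptor describes "what this metric measures".

A monitored resource is a set of labels that GCP fills in automatically. For a GKE pod, the monitored resource type is k8s_pod, with the labels project_id, location, cluster_name, namespace_name, and pod_name. For a Cloud Run revision it is cloud_run_revision, with service_name, revision_name, and location. Engineers do not set these labels by hand; the GCP agent or SDK populates them.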

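A metric descriptor defines the metric's name, metric kind (GAUGE / DELTA / CUMULATIVE), value type (INT64 / DOUBLE / DISTRIBUTION), and custom labels. GCP built-in metrics use namespaced names such as compute.googleapis.com/instance/cpu/utilization. Custom metrics use custom.googleapis.com/ or workload.googleapis.com/ (the latter is the prefix used when writing through the OTel Collector), while metrics ingested through Managed Service for Prometheus are stored under prometheus.googleapis.com/.

The product of the two axes is the number of time series. On GCP, cardinality management therefore means controlling the number of monitored resource × metric label combinations. GCP enforces per-project time series quotas on custom metrics (default 500 per metric descriptor, which can be raised on request); writes beyond the quota are rejected.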
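To make the multiplication concrete, here is a minimal Python sketch. The function name estimate_series, the 40 pods, and the per-label value counts are illustrative assumptions, not GCP figures. The result is an upper bound, since not every label combination necessarily occurs in practice.

```python
from math import prod


def estimate_series(resource_instances: int, label_value_counts: dict[str, int]) -> int:
    """Upper bound on time series for one metric: resources x label combinations."""
    return resource_instances * prod(label_value_counts.values())


# Hypothetical workload: 40 pods, three custom labels.
labels = {"service": 5, "region": 3, "status_code": 8}
print(estimate_series(40, labels))  # 40 * 5 * 3 * 8 = 4800
```

Running this estimate before adding a label is a cheap way to see whether a single extra dimension pushes a metric past its series budget.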

MQL vs PromQL

Cloud Monitoring has two query languages. MQL (Monitoring Query Language) is GCP's own pipeline-style syntax:

fetch k8s_container
| metric 'kubernetes.io/container/cpu/core_usage_time'
| align rate(1m)
| every 1m
| group_by [resource.cluster_name, resource.namespace_name],
    [value_cpu_usage: aggregate(value.core_usage_time)]

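PromQL is also available in Cloud Monitoring (through Managed Service for Prometheus). The core differences between the two:

Aspect	MQL	PromQL (via Managed Prometheus)
Data source	All Cloud Monitoring metrics	Metrics written through Managed Prometheus
Query interface	Metrics Explorer / alerting condition	Grafana / Prometheus UI / API
Aggregation syntax	pipe-style `group_by`	function-style `sum by (label)`
GCP built-in and custom metrics	Native support for GCP built-in metrics	Must be converted to Prometheus format
Learning curve and portability	GCP-specific, not portable to other platforms	Cross-platform standard, portable to Mimir / Thanos

How to choose: in a pure GCP environment where the team has no Prometheus experience, MQL is the faster start. If a Prometheus / Grafana ecosystem is already in place, use Managed Prometheus with PromQL and bring GCP built-in metrics in through a Prometheus-compatible exporter. In a mixed environment, run both: use MQL for alerting on GCP-native metrics and PromQL for querying application metrics.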

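Configuration step by step

Designing and writing custom metrics

There are three common paths for custom metrics:

Path 1: write directly through the Cloud Monitoring API. The application uses the Cloud Monitoring client library to create a metric descriptor and write time series. This suits GCP-native applications and needs no extra agent.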

metric type: custom.googleapis.com/checkout/latency_ms
kind: GAUGE
value type: DISTRIBUTION
labels: [service, region, status_code]
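The descriptor above can be sketched as a request body in Python using only the standard library. The field names (type, metricKind, valueType, labels) follow the Cloud Monitoring v3 MetricDescriptor resource; the helper build_descriptor is hypothetical, and the sketch stops short of actually sending the body to the API.

```python
import json


def build_descriptor(metric_type: str, kind: str, value_type: str, labels: list[str]) -> dict:
    """Build a dict shaped like a Cloud Monitoring v3 MetricDescriptor."""
    if not metric_type.startswith("custom.googleapis.com/"):
        raise ValueError("custom metric types must start with custom.googleapis.com/")
    return {
        "type": metric_type,
        "metricKind": kind,
        "valueType": value_type,
        # Custom labels are declared as string-valued label descriptors here.
        "labels": [{"key": name, "valueType": "STRING"} for name in labels],
    }


body = build_descriptor(
    "custom.googleapis.com/checkout/latency_ms",
    "GAUGE",
    "DISTRIBUTION",
    ["service", "region", "status_code"],
)
print(json.dumps(body, indent=2))
```

Declaring the label keys up front, rather than letting them appear ad hoc at write time, keeps the label set reviewable before it starts generating series.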

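Path 2: OTel Collector + GCP exporter. The application produces metrics with the OTel SDK, and the OTel Collector writes them to Cloud Monitoring through the googlecloud exporter. The metric namespace is workload.googleapis.com/. This suits services that already have OTel instrumentation.

Path 3: Managed Service for Prometheus. Deploy GCP's Managed Prometheus collector (or a self-managed Prometheus with remote write); metrics are stored in the GCP-hosted Monarch backend and queried with PromQL. This suits Kubernetes environments where the team knows the Prometheus ecosystem.

The three paths can coexist. How to choose: first check whether the team's metrics ecosystem is GCP-native or Prometheus-native, then consider multi-cloud requirements. The strength of Managed Prometheus is portable PromQL; its weakness is that GCP built-in metrics need extra integration.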

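Alerting policy configuration

A Cloud Monitoring alerting policy has three parts: condition, notification channel, and documentation.

Condition types:

Metric threshold: the metric stays above a threshold for N minutes. Suits "error rate > 1% for 5 minutes".
Metric absence: the metric stops arriving. Suits detecting a broken scrape or a stopped service.
Forecasting: the metric is predicted to cross a threshold within N hours. Suits disk filling up or quota exhaustion.
Process health: whether a process on a GCE instance is alive.
Log-based: fires when a specific pattern appears in Cloud Logging. Suits turning error logs into alerts.
SLO burn rate: once an SLO is configured, fires when the burn rate exceeds a threshold. Corresponds to the burn-rate concept.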
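As an illustration of the metric threshold type, the following Python sketch builds a policy body for "error rate > 1% for 5 minutes". The field names follow the Cloud Monitoring v3 AlertPolicy resource; the helper threshold_policy, the metric type checkout/error_ratio, and the channel name are hypothetical, and nothing is sent to the API.

```python
def threshold_policy(name: str, metric_filter: str, threshold: float,
                     duration_s: int, channels: list[str]) -> dict:
    """Build a dict shaped like a v3 AlertPolicy with one threshold condition."""
    return {
        "displayName": name,
        "combiner": "OR",
        "conditions": [{
            "displayName": f"{name} condition",
            "conditionThreshold": {
                "filter": metric_filter,
                "comparison": "COMPARISON_GT",
                "thresholdValue": threshold,
                "duration": f"{duration_s}s",  # how long the breach must persist
                "aggregations": [{
                    "alignmentPeriod": "60s",
                    "perSeriesAligner": "ALIGN_MEAN",
                }],
            },
        }],
        "notificationChannels": channels,
    }


policy = threshold_policy(
    "checkout error rate",
    # Hypothetical custom metric holding an error ratio between 0 and 1.
    'metric.type = "custom.googleapis.com/checkout/error_ratio"',
    threshold=0.01,
    duration_s=300,
    channels=["projects/my-project/notificationChannels/123"],  # placeholder
)
print(policy["conditions"][0]["conditionThreshold"]["duration"])  # 300s
```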

Notification channels: Email / PagerDuty / Slack / Pub/Sub / Webhook / SMS. The Pub/Sub channel suits custom automation (alert received → trigger a Cloud Function).

Snooze and maintenance windows: temporarily suppress specific alerting policies. Use them during deployments or known maintenance.

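Managed Prometheus integration

Deployment modes for GCP Managed Service for Prometheus:

GKE mode: enable GKE monitoring and the Managed Prometheus collector is deployed automatically. No self-managed Prometheus server is needed.
Remote write mode: a self-managed Prometheus server with remote_write to the GCP Monarch endpoint. Local querying is preserved while long-term storage lives in GCP.
OTel Collector mode: the OTel Collector writes to Monarch through the googlemanagedprometheus exporter.

Query side: use the PromQL UI in the GCP Console, or deploy Grafana with a GMP datasource. The commonly used subset of PromQL is well supported (rate / histogram_quantile / aggregation); a few advanced features (subqueries) have limitations.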

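Failure drills and boundaries

Custom metric quota exhausted

Trigger: the number of custom metric descriptors exceeds the project quota (default 500), or the number of time series under a single metric descriptor exceeds its quota.

Symptom: the API returns 429 or a quota exceeded error. New time series cannot be written; existing ones are unaffected.

Fix: clean up metric descriptors that are no longer used (describe → delete), merge metrics with overlapping semantics, and reduce label cardinality. A quota increase can be requested under GCP Console → IAM & Admin → Quotas, but first confirm that the cause is not a design problem and that this many series are really needed.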

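Alerting policy fires late

Trigger: the alignment period or duration configured for the metrics in an alerting policy is too long.

Symptom: the anomaly has been going on for 10 minutes before the alert fires. The cause is that Cloud Monitoring's evaluation cycle adds on top of metrics ingestion delay. Ingestion delay for GCP built-in metrics is about 1-3 minutes; for custom metrics written through the API it is about 10-30 seconds.

Fix: shorten the condition's alignment period (1 minute) and its duration (too short a duration causes flapping). Log-based alerting conditions usually have less delay than metric-based ones (seconds vs minutes), so consider a log-based condition for urgent anomalies.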
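A rough way to reason about the delay is to add the pieces up. The sketch below assumes a simple additive model (ingestion delay + alignment period + duration), which is an approximation rather than Cloud Monitoring's documented evaluation behavior. The 180-second ingestion delay is the upper bound quoted above; the alignment and duration values are hypothetical.

```python
def worst_case_alert_delay(ingestion_delay_s: int, alignment_period_s: int,
                           duration_s: int) -> int:
    """Approximate seconds between anomaly start and alert firing (additive model)."""
    return ingestion_delay_s + alignment_period_s + duration_s


# Built-in metric, 5-minute alignment period and 5-minute duration.
before = worst_case_alert_delay(180, 300, 300)
# Same metric after shortening both settings to 1 minute.
after = worst_case_alert_delay(180, 60, 60)
print(before / 60, after / 60)  # 13.0 5.0 (minutes)
```

Under this model the ingestion delay becomes the floor: once alignment period and duration are short, further tuning of the policy yields little.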

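Managed Prometheus query results differ from self-managed Prometheus

Trigger: the same PromQL query returns different results on local Prometheus and on GMP.

Symptom: dashboard numbers do not match, and alerts fire inconsistently.

Fix: first check whether remote write is dropping samples (look at prometheus_remote_storage_samples_failed_total). Then check GMP's PromQL subset limitations (some subquery syntax is unsupported). Finally check metric naming: the metric name in local Prometheus may differ from the naming convention GMP applies on storage (a prefix added to __name__, or extra resource labels).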
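The first check can be turned into a number by comparing two snapshots of the remote write counters. This is a minimal sketch: it assumes the snapshots of prometheus_remote_storage_samples_failed_total and prometheus_remote_storage_samples_total have already been read (for example from the /metrics endpoint), and the function name failed_ratio and the sample values are hypothetical.

```python
def failed_ratio(failed_before: float, failed_after: float,
                 total_before: float, total_after: float) -> float:
    """Share of samples that failed remote write between two counter snapshots."""
    sent = total_after - total_before
    if sent <= 0:
        # No samples sent, or a counter reset: no meaningful ratio.
        return 0.0
    return (failed_after - failed_before) / sent


# Hypothetical snapshots taken a few minutes apart.
ratio = failed_ratio(100, 150, 10_000, 20_000)
print(f"{ratio:.2%}")  # 0.50%
```

A non-zero ratio points at the write path, so it is worth ruling out before comparing PromQL semantics or naming.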

Capacity and cost

Cloud Monitoring bills on ingested metrics volume. GCP built-in metrics (agent metrics excepted) are free. The first 150 MB of custom metrics per billing account is free; beyond that, billing is by volume.

Cost governance guidelines:

The largest cost source is usually high-frequency custom metrics or high-cardinality labels
Track ingestion volume with the monitoring.googleapis.com/billing/bytes_ingested metric
Lengthening the scrape interval (15s → 30s or 60s) directly reduces ingestion volume
Managed Prometheus is billed separately from custom metrics (per samples ingested)
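The effect of the scrape interval on ingestion is simple arithmetic, sketched below in Python. The 10,000 series figure is a hypothetical workload, and the function counts samples only; it does not apply any pricing.

```python
SECONDS_PER_DAY = 86_400


def samples_per_day(series: int, scrape_interval_s: int) -> int:
    """Samples ingested per day when every series is scraped at a fixed interval."""
    return series * SECONDS_PER_DAY // scrape_interval_s


for interval in (15, 30, 60):
    print(interval, samples_per_day(10_000, interval))
# 15 57600000
# 30 28800000
# 60 14400000
```

Doubling the interval halves the sample count, which is why this is usually the first lever to try before removing labels or metrics.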

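Integration and next steps

GCP Cloud Operations service page: overview and day-to-day operations
4.7 Cardinality governance: the complete cardinality governance strategy
4.6 SLI/SLO signal: signal design for SLO burn rate alerts
Prometheus: upstream concepts behind Managed Prometheus
OpenTelemetry: OTel Collector + GCP exporter integration
Cloud Logging queries, export, and compliance: the logs side of the same vendor

4.C11 Uber: M3, a Large-Scale Metrics Platform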

Mon, 22 Jun 2026 00:00:00 +0000

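Uber's M3 case shows the turning point where a metrics system moves from "every team runs its own Prometheus" to "one metrics platform shared by the whole company". The core judgment behind the turn: once total active series exceed the memory ceiling of a single Prometheus machine, and multiple teams need cross-cluster queries, building a platform layer costs less than continuing to replicate Prometheus instances horizontally.

Business background

Uber's service observability covers trip tracking, real-time pricing, ETA calculation, driver location, payment settlement, and push notifications. Every microservice exposes Prometheus-compatible metrics, and as the number of services grew into the thousands, the write rate reached billions of data points per second.

In the early days each team deployed its own Prometheus and managed its own retention, scrape config, and alerting rules. At small scale this worked well: each Prometheus instance only handled its own team's tens to hundreds of thousands of series. But once the organization grew to hundreds of teams and thousands of services, the scattered Prometheus instances created three problems.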

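Technical challenges

Single-machine memory ceiling

Prometheus's TSDB keeps active series in the in-memory head block, and each series consumes about 3-4 KB (see Prometheus capacity planning). When a single Prometheus instance has to scrape more than 10 million series, the head block alone needs roughly 30-40 GB of memory. Add the transient overhead of query execution and WAL replay, and a single machine easily runs out of memory (OOM).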
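The head block estimate is a single multiplication, sketched here in Python. It uses decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes) as a simplifying assumption and ignores query and WAL replay overhead.

```python
def head_block_gb(active_series: int, bytes_per_series: int) -> float:
    """Approximate head block memory in GB for a given active series count."""
    return active_series * bytes_per_series / 1e9


low = head_block_gb(10_000_000, 3_000)   # 3 KB per series
high = head_block_gb(10_000_000, 4_000)  # 4 KB per series
print(f"{low:.0f}-{high:.0f} GB")  # 30-40 GB
```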

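The team's first reaction was to split into multiple Prometheus instances by service, but that made cross-service queries difficult: to see the latency distribution of a request from gateway to payment, you have to query three Prometheus instances separately and correlate the results by hand.

Retention and long-term trends

Prometheus defaults to 15 days of retention. Capacity planning and quarterly trend analysis need 90 days or even a year of history. Stretching Prometheus retention to 90 days raises disk and memory requirements together, and compaction becomes less efficient at large data volumes.

What the team needed was tiered retention: recent data kept at full resolution, historical data downsampled and kept longer. Prometheus does not support downsampling natively.

High availability and cross-cluster queries

Prometheus has no native HA. The standard practice is to run two instances that scrape the same targets and rely on downstream deduplication. But the two instances store data independently and a query hits only one of them, so failover between instances leaves a brief gap in the data.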
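Downstream deduplication can be pictured as merging the two replicas' samples so that one fills the other's gaps. The Python sketch below is a deliberately simplified stand-in, not the algorithm used by any specific tool: it assumes both replicas produce samples on identical timestamps, whereas real deduplicators have to handle scrape times that drift apart.

```python
Sample = tuple[int, float]  # (timestamp in seconds, value)


def merge_replicas(primary: list[Sample], secondary: list[Sample]) -> list[Sample]:
    """Union samples by timestamp; the primary replica wins on conflicts."""
    merged = dict(secondary)
    merged.update(primary)
    return sorted(merged.items())


# Replica A missed the scrape at t=30 (for example, during a restart).
replica_a = [(0, 1.0), (15, 1.1), (45, 1.3)]
replica_b = [(0, 1.0), (15, 1.1), (30, 1.2), (45, 1.3)]
print(merge_replicas(replica_a, replica_b))
# [(0, 1.0), (15, 1.1), (30, 1.2), (45, 1.3)]
```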

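Cross-cluster queries are harder still. Prometheus federation can handle simple metric aggregation, but federation is itself a pull-based scrape: with too many federation targets or too many series, the federating Prometheus also runs out of memory.

Solution: the M3 platform

Uber built M3, a Prometheus-compatible distributed metrics platform made up of three core components.

M3DB: distributed time series storage

M3DB is a distributed TSDB whose data is spread across nodes by namespace and shard. Each namespace can have its own retention and resolution: for example, a realtime namespace keeps 2 days at full resolution, and an aggregated_1m namespace keeps 90 days at 1-minute resolution. This solves the retention tiering problem.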
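One way to picture retention tiering is as a lookup: given how far back a query reaches, pick the finest-resolution namespace that still retains that data. The Python sketch below illustrates the idea only and is not M3's actual namespace resolution logic; the 10-second resolution for the realtime namespace is an assumption, since the text only says "full resolution".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Namespace:
    name: str
    retention_days: int
    resolution_s: int


NAMESPACES = [
    Namespace("realtime", retention_days=2, resolution_s=10),  # resolution assumed
    Namespace("aggregated_1m", retention_days=90, resolution_s=60),
]


def pick_namespace(lookback_days: int, namespaces=NAMESPACES) -> Namespace:
    """Finest-resolution namespace whose retention covers the lookback window."""
    candidates = [ns for ns in namespaces if ns.retention_days >= lookback_days]
    if not candidates:
        raise ValueError("no namespace retains data that far back")
    return min(candidates, key=lambda ns: ns.resolution_s)


print(pick_namespace(1).name)   # realtime
print(pick_namespace(30).name)  # aggregated_1m
```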

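M3DB's memory model differs from Prometheus's: recent data lives in memory and cold data on disk, rather than every active series sitting in a head block as in Prometheus. This lets it handle far more series than a single Prometheus machine.

M3 Coordinator: unified query entry point

M3 Coordinator receives PromQL queries, translates them, fans them out to M3DB nodes, and aggregates the results before returning them. To Grafana and alerting rules, the M3 Coordinator API is compatible with Prometheus, so dashboards and alert configs do not need to change.

M3 Aggregator: write-path aggregation

High-cardinality raw series pass through M3 Aggregator for pre-aggregation before being written to M3DB: for example, per-second request counts are aggregated to per-minute values and then written to the long-term namespace. This keeps long-term storage volume and cost under control.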
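The per-second to per-minute example can be sketched as a windowed rollup. This Python snippet only illustrates the idea of write-path pre-aggregation and is not M3 Aggregator's implementation; the function name rollup_counts and the sample data are hypothetical.

```python
from collections import defaultdict


def rollup_counts(samples: list[tuple[int, int]], window_s: int = 60) -> list[tuple[int, int]]:
    """Sum per-second counts into fixed windows keyed by window start time."""
    buckets: dict[int, int] = defaultdict(int)
    for ts, count in samples:
        buckets[ts - ts % window_s] += count
    return sorted(buckets.items())


# (timestamp in seconds, request count during that second)
per_second = [(0, 5), (1, 7), (59, 3), (60, 2), (61, 4)]
print(rollup_counts(per_second))  # [(0, 15), (60, 6)]
```

Five raw points become two stored points here; at a steady one sample per second the reduction is 60 to 1 per series.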

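Trade-offs

Aspect	Prometheus standalone	M3 platform	Mimir / Thanos (alternatives)
Deployment complexity	Low (single binary)	High (M3DB + Coordinator + Aggregator)	Medium to high
Per-machine series ceiling	~5-10 million	Not applicable (distributed)	Not applicable
Retention tiering	None	Native support	Supported by Thanos compactor / Mimir
PromQL compatibility	Native	Compatible	Compatible
Community activity	High (CNCF)	Low (Uber-led, maintenance reduced after 2023)	High (Grafana Labs / community)
Suitable scale	Single team to mid-sized organization	Large organization (billions of series)	Mid-sized to large

M3's biggest risk is community activity: Uber has scaled back its development investment in M3 since 2023, and Grafana Mimir has become the more active alternative. For new projects, Mimir and Thanos both offer better community support and Grafana ecosystem integration than M3. M3's value is that it validated the design pattern of "distributed TSDB + write-path aggregation + retention tiering", a pattern that Mimir and Thanos adopt in different forms.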

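Links back to the course material

4.2 Metrics Basics: the foundational model of active series, cardinality, and recording rules; M3's pre-aggregation is the platform-level counterpart of recording rules.
4.11 Telemetry Pipeline: M3 Aggregator is an instance of the processing layer in the pipeline.
Prometheus Remote Write and long-term storage: M3 is one possible remote write target; the comparison with Mimir / Thanos / Cortex is in that article.
4.7 Cardinality governance: M3's per-namespace cardinality limit is a production example of a governance mechanism.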

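Warning signs

Readers who see the following signals in their own systems should revisit this case:

A single Prometheus instance's memory approaches the machine limit and it starts restarting on OOM
Multiple Prometheus instances scrape independently, and cross-service queries require manual correlation
15 days of retention is not enough for quarterly trend analysis, but resources cannot sustain a longer retention
The team starts asking "how many series do our metrics have in total, and who owns the most" but there is no unified cardinality view
Grafana federation dashboard queries keep getting slower or time out frequently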

References

M3: Uber’s Open Source, Large-scale Metrics Platform for Prometheus