<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Grafana on Tarragon</title><link>https://tarrragon.github.io/blog/tags/grafana/</link><description>Recent content in Grafana on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Fri, 26 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/tags/grafana/index.xml" rel="self" type="application/rss+xml"/><item><title>Monitoring and Observability in Air-Gapped Environments</title><link>https://tarrragon.github.io/blog/infra/air-gapped/air-gapped-monitoring/</link><pubDate>Fri, 26 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/infra/air-gapped/air-gapped-monitoring/</guid><description>&lt;p>Air-gapped environments cannot use SaaS monitoring services such as Datadog, New Relic, Sentry Cloud, or PagerDuty Cloud, because all of them need to send data out. The three core monitoring capabilities (metric collection, log aggregation, and alert notification) all have to be built inside the isolated network from self-hosted open-source tools. The principles are the same as in a connected environment (metrics share the lifecycle of the resources they describe, and every alarm maps to an action); the difference is that you manage tool deployment and storage planning yourself.&lt;/p>
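&lt;h2 id="metric-收集prometheus--grafana">Metric Collection: Prometheus + Grafana&lt;/h2>
&lt;p>Prometheus is a pull-based metric collection system: it scrapes each service's metric endpoint itself, so services never have to push data out. That design suits an air-gapped network naturally, because all traffic stays on the internal network and no outbound connection is needed.&lt;/p>
&lt;h3 id="離線安裝">Offline Installation&lt;/h3>
&lt;p>Prometheus and Grafana both ship as self-contained release tarballs or container images, so offline installation follows the same flow as &lt;a href="https://tarrragon.github.io/blog/infra/air-gapped/air-gapped-container/" data-link-title="斷網環境的容器與映像管理" data-link-desc="Private registry 架設、映像搬運（docker save/load、skopeo）、base image 更新週期、離線漏洞掃描">image transport&lt;/a>:&lt;/p>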





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="c1"># Outside: download the release tarballs&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">wget https://github.com/prometheus/prometheus/releases/download/v2.53.0/prometheus-2.53.0.linux-amd64.tar.gz
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">wget https://dl.grafana.com/oss/release/grafana-11.1.0.linux-amd64.tar.gz
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">&lt;span class="c1"># After transfer: extract, then set up a systemd service&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">tar xzf prometheus-2.53.0.linux-amd64.tar.gz
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">7&lt;/span>&lt;span class="cl">sudo mv prometheus-2.53.0.linux-amd64 /opt/prometheus&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>For a container deployment, move the image into the internal registry first, then pull it from there:&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="c1"># Inside: start from the internal registry&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">docker run -d -p 9090:9090 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="se">&lt;/span> -v /etc/prometheus:/etc/prometheus &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="se">&lt;/span> -v /data/prometheus:/prometheus &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">&lt;span class="se">&lt;/span> registry.internal:5000/prometheus:v2.53.0&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="scrape-設定">Scrape Configuration&lt;/h3>
&lt;p>Prometheus's &lt;code>prometheus.yml&lt;/code> defines the targets to scrape. Air-gapped environments usually use static config (targets listed by hand) rather than cloud-provider service discovery, which depends on cloud APIs:&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="nt">scrape_configs&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">job_name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;node-exporter&amp;#39;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">static_configs&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">targets&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="s1">&amp;#39;server-01:9100&amp;#39;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="s1">&amp;#39;server-02:9100&amp;#39;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="s1">&amp;#39;db-01:9100&amp;#39;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">job_name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;app&amp;#39;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">static_configs&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">targets&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="s1">&amp;#39;app-01:8080&amp;#39;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="s1">&amp;#39;app-02:8080&amp;#39;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">14&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">metrics_path&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;/metrics&amp;#39;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>When you add a machine, add it to the targets list by hand. If you run Consul (service discovery that works inside the network), Prometheus supports Consul SD and can discover new services automatically.&lt;/p>
&lt;h3 id="node-exporter">Node Exporter&lt;/h3>
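&lt;p>Install one node_exporter (a single binary with no dependencies) on every Linux machine you want to monitor; it exposes system metrics such as CPU, memory, disk, and network. Offline installation works the same way: download the binary, transfer it, extract it, and run it as a service.&lt;/p>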





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="c1"># Install after transfer&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">tar xzf node_exporter-1.8.1.linux-amd64.tar.gz
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">sudo cp node_exporter-1.8.1.linux-amd64/node_exporter /usr/local/bin/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">sudo useradd --no-create-home --shell /bin/false node_exporter
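&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">&lt;span class="c1"># Create the systemd service (omitted)&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="log-收集loki-或-elk">Log Collection: Loki or ELK&lt;/h2>
&lt;h3 id="grafana-loki輕量">Grafana Loki (Lightweight)&lt;/h3>
&lt;p>Loki is the log aggregation system of the Grafana ecosystem. Its label-based data model resembles Prometheus, but it stores log streams rather than metrics, and ingestion is push-based: agents send logs to Loki instead of Loki scraping them. Because it indexes only labels and not log content, its storage cost is far lower than Elasticsearch's.&lt;/p>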





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="c"># loki-config.yaml: basic settings&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">auth_enabled&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="kc">false&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">server&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">http_listen_port&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">3100&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">storage_config&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">filesystem&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">directory&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">/data/loki/chunks&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">schema_config&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">configs&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">from&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="ld">2024-01-01&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">store&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">tsdb&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">object_store&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">filesystem&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">schema&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">v13&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">14&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">index&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">15&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">prefix&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">index_&lt;/span>&lt;span class="w">
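&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">16&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">period&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">24h&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Pair it with Promtail (a log collection agent) on each machine to collect logs and push them to Loki:&lt;/p></description><content:encoded><![CDATA[<p>Air-gapped environments cannot use SaaS monitoring services such as Datadog, New Relic, Sentry Cloud, or PagerDuty Cloud, because all of them need to send data out. The three core monitoring capabilities (metric collection, log aggregation, and alert notification) all have to be built inside the isolated network from self-hosted open-source tools. The principles are the same as in a connected environment (metrics share the lifecycle of the resources they describe, and every alarm maps to an action); the difference is that you manage tool deployment and storage planning yourself.</p>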
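<h2 id="metric-收集prometheus--grafana">Metric Collection: Prometheus + Grafana</h2>
<p>Prometheus is a pull-based metric collection system: it scrapes each service's metric endpoint itself, so services never have to push data out. That design suits an air-gapped network naturally, because all traffic stays on the internal network and no outbound connection is needed.</p>
<h3 id="離線安裝">Offline Installation</h3>
<p>Prometheus and Grafana both ship as self-contained release tarballs or container images, so offline installation follows the same flow as <a href="/blog/infra/air-gapped/air-gapped-container/" data-link-title="斷網環境的容器與映像管理" data-link-desc="Private registry 架設、映像搬運（docker save/load、skopeo）、base image 更新週期、離線漏洞掃描">image transport</a>:</p>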





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Outside: download the release tarballs</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">wget https://github.com/prometheus/prometheus/releases/download/v2.53.0/prometheus-2.53.0.linux-amd64.tar.gz
</span></span><span class="line"><span class="ln">3</span><span class="cl">wget https://dl.grafana.com/oss/release/grafana-11.1.0.linux-amd64.tar.gz
</span></span><span class="line"><span class="ln">4</span><span class="cl">
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># After transfer: extract, then set up a systemd service</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">tar xzf prometheus-2.53.0.linux-amd64.tar.gz
</span></span><span class="line"><span class="ln">7</span><span class="cl">sudo mv prometheus-2.53.0.linux-amd64 /opt/prometheus</span></span></code></pre></div><p>For a container deployment, move the image into the internal registry first, then pull it from there:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Inside: start from the internal registry</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">docker run -d -p 9090:9090 <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  -v /etc/prometheus:/etc/prometheus <span class="se">\
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="se"></span>  -v /data/prometheus:/prometheus <span class="se">\
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="se"></span>  registry.internal:5000/prometheus:v2.53.0</span></span></code></pre></div><h3 id="scrape-設定">Scrape Configuration</h3>
<p>Prometheus's <code>prometheus.yml</code> defines the targets to scrape. Air-gapped environments usually use static config (targets listed by hand) rather than cloud-provider service discovery, which depends on cloud APIs:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="nt">scrape_configs</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">  </span>- <span class="nt">job_name</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;node-exporter&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="nt">static_configs</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">      </span>- <span class="nt">targets</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">          </span>- <span class="s1">&#39;server-01:9100&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">          </span>- <span class="s1">&#39;server-02:9100&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">          </span>- <span class="s1">&#39;db-01:9100&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">  </span>- <span class="nt">job_name</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;app&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">    </span><span class="nt">static_configs</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">      </span>- <span class="nt">targets</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">          </span>- <span class="s1">&#39;app-01:8080&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">          </span>- <span class="s1">&#39;app-02:8080&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w">    </span><span class="nt">metrics_path</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;/metrics&#39;</span></span></span></code></pre></div><p>When you add a machine, add it to the targets list by hand. If you run Consul (service discovery that works inside the network), Prometheus supports Consul SD and can discover new services automatically.</p>
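<p>After editing the targets list, it is worth validating the file before the server re-reads it. The following is a minimal sketch, not a prescribed procedure: <code>reload_prometheus</code> is a hypothetical helper name, the <code>PROMTOOL</code> override is an assumption made for illustration, and the default paths assume the tarball layout under <code>/opt/prometheus</code> and the config directory <code>/etc/prometheus</code> used earlier. It relies on <code>promtool check config</code>, which ships in the Prometheus release tarball, and on Prometheus re-reading its configuration when it receives SIGHUP.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"># Hypothetical helper: validate prometheus.yml, reload only if it passes.
reload_prometheus() {
  local promtool="${PROMTOOL:-/opt/prometheus/promtool}"   # assumed install path
  local config="${1:-/etc/prometheus/prometheus.yml}"      # assumed config path

  # promtool exits non-zero when the config has syntax or semantic errors.
  if "$promtool" check config "$config"; then
    # SIGHUP asks a running Prometheus to re-read its configuration.
    sudo kill -HUP "$(pidof prometheus)"
  else
    echo "config check failed; not reloading"
    return 1
  fi
}</code></pre></div>
<p>Calling <code>reload_prometheus</code> with no arguments checks the default config path. For the container deployment, the same signal can be sent with <code>docker kill --signal=HUP</code> followed by the container name.</p>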
<h3 id="node-exporter">Node Exporter</h3>
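<p>Install one node_exporter (a single binary with no dependencies) on every Linux machine you want to monitor; it exposes system metrics such as CPU, memory, disk, and network. Offline installation works the same way: download the binary, transfer it, extract it, and run it as a service.</p>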





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Install after transfer</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">tar xzf node_exporter-1.8.1.linux-amd64.tar.gz
</span></span><span class="line"><span class="ln">3</span><span class="cl">sudo cp node_exporter-1.8.1.linux-amd64/node_exporter /usr/local/bin/
</span></span><span class="line"><span class="ln">4</span><span class="cl">sudo useradd --no-create-home --shell /bin/false node_exporter
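</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># Create the systemd service (omitted)</span></span></span></code></pre></div><h2 id="log-收集loki-或-elk">Log Collection: Loki or ELK</h2>
<h3 id="grafana-loki輕量">Grafana Loki (Lightweight)</h3>
<p>Loki is the log aggregation system of the Grafana ecosystem. Its label-based data model resembles Prometheus, but it stores log streams rather than metrics, and ingestion is push-based: agents send logs to Loki instead of Loki scraping them. Because it indexes only labels and not log content, its storage cost is far lower than Elasticsearch's.</p>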





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c"># loki-config.yaml: basic settings</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w"></span><span class="nt">auth_enabled</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="nt">server</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">  </span><span class="nt">http_listen_port</span><span class="p">:</span><span class="w"> </span><span class="m">3100</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w"></span><span class="nt">storage_config</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">  </span><span class="nt">filesystem</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">    </span><span class="nt">directory</span><span class="p">:</span><span class="w"> </span><span class="l">/data/loki/chunks</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="nt">schema_config</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">  </span><span class="nt">configs</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">    </span>- <span class="nt">from</span><span class="p">:</span><span class="w"> </span><span class="ld">2024-01-01</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">      </span><span class="nt">store</span><span class="p">:</span><span class="w"> </span><span class="l">tsdb</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">      </span><span class="nt">object_store</span><span class="p">:</span><span class="w"> </span><span class="l">filesystem</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">      </span><span class="nt">schema</span><span class="p">:</span><span class="w"> </span><span class="l">v13</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w">      </span><span class="nt">index</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w">        </span><span class="nt">prefix</span><span class="p">:</span><span class="w"> </span><span class="l">index_</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w">        </span><span class="nt">period</span><span class="p">:</span><span class="w"> </span><span class="l">24h</span></span></span></code></pre></div><p>Pair it with Promtail (a log collection agent) on each machine to collect logs and push them to Loki:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c"># promtail-config.yaml</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w"></span><span class="nt">clients</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">  </span>- <span class="nt">url</span><span class="p">:</span><span class="w"> </span><span class="l">http://loki.internal:3100/loki/api/v1/push</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="nt">scrape_configs</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">  </span>- <span class="nt">job_name</span><span class="p">:</span><span class="w"> </span><span class="l">system</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="nt">static_configs</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">      </span>- <span class="nt">targets</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">localhost]</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">        </span><span class="nt">labels</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">          </span><span class="nt">job</span><span class="p">:</span><span class="w"> </span><span class="l">syslog</span><span class="w">
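</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">          </span><span class="nt">__path__</span><span class="p">:</span><span class="w"> </span><span class="l">/var/log/*.log</span></span></span></code></pre></div><h3 id="elk-stack功能豐富">ELK Stack (Feature-Rich)</h3>
<p>Elasticsearch + Logstash + Kibana is the most fully featured log platform, but it is resource-hungry (plan on at least 4GB of RAM for Elasticsearch). It fits scenarios that need full-text search over log content.</p>
<p>Offline installation: Elastic provides offline packages (<code>.deb</code> / <code>.rpm</code>), or you can use Docker images. All three components have to be transferred.</p>
<p>Selection criteria: small environments of five machines or fewer should use Loki (lightweight, and it shares dashboards with Prometheus + Grafana). Teams that need full-text search or already have ELK experience should use ELK.</p>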
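<h2 id="告警沒有外部-webhook-怎麼通知">Alerting: How to Notify Without External Webhooks</h2>
<p>In a connected environment, alerts usually go to a Slack webhook, the PagerDuty API, or an email relay service. None of those paths work in an air-gapped environment.</p>
<h3 id="內部-smtp">Internal SMTP</h3>
<p>If the isolated network has an email server (many corporate intranets run Exchange or Postfix), Prometheus Alertmanager can send alerts by email:</p>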





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c"># alertmanager.yml</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w"></span><span class="nt">route</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">  </span><span class="nt">receiver</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;email-team&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="nt">receivers</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">  </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;email-team&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="nt">email_configs</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">      </span>- <span class="nt">to</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;oncall@internal.corp&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">        </span><span class="nt">from</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;alertmanager@internal.corp&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">        </span><span class="nt">smarthost</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;smtp.internal.corp:25&#39;</span><span class="w">
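</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">        </span><span class="nt">require_tls</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span></span></span></code></pre></div><h3 id="內部即時通訊">Internal Chat</h3>
<p>If the internal network runs Mattermost (a self-hosted Slack alternative) or Rocket.Chat, Alertmanager can deliver alerts to the tool's incoming webhook endpoint. Because Mattermost's incoming webhooks accept Slack-compatible payloads, Alertmanager's <code>slack_configs</code> receiver can usually be pointed at the internal webhook URL.</p>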
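<h3 id="實體告警">Physical Alerting</h3>
<p>In the extreme case (no email, no chat), have Alertmanager hand alerts to a small internal webhook receiver that writes them to a file or database, and pair that with an on-call rota that checks it regularly. Alternatively, put Grafana dashboards on a large control-room screen so that on-call staff watch the board directly.</p>
<p>Alert design follows the same principles as in a connected environment: symptom-based alerts (error rate, latency) take priority over cause-based ones (CPU, memory), and thresholds are set to avoid alert fatigue. The difference is that notifications may arrive more slowly (email is slower than a Slack push), so thresholds should be slightly more conservative so that alerts fire earlier.</p>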
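<h2 id="metric-與-log-的儲存規劃">Storage Planning for Metrics and Logs</h2>
<p>SaaS monitoring storage scales automatically in the cloud. Self-hosted storage has to be planned: when the disk fills up, Prometheus stops collecting and Loki stops writing.</p>
<h3 id="容量估算">Capacity Estimation</h3>
<p>Prometheus storage depends on the number of series, the scrape interval, and the retention period. A rough formula:</p>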





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">daily_storage ≈ active_series × bytes_per_sample (~2 B, already compressed) × (86400 / scrape_interval)</span></span></code></pre></div><p>With 10,000 active series and a 15-second scrape interval, that is about 115 MB per day: roughly 3.5GB for 30 days of retention and roughly 10GB for 90 days. Budgeting about 5GB and 15GB respectively leaves headroom for overhead such as the WAL.</p>
<p>Loki storage depends on log volume. As a rough estimate, 10GB of raw logs per day compresses to about 1-2GB in Loki, so 30 days of retention needs about 30-60GB.</p>
<h3 id="retention-設定">Retention Settings</h3>
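<p>The Prometheus arithmetic can be sanity-checked with a short shell function. This is a sketch under stated assumptions: <code>estimate_gb</code> is a hypothetical helper, and its default of 2 bytes per sample is an assumed upper-end figure for samples as stored on disk, so no further compression factor is applied. It ignores WAL and index overhead, so treat the result as a floor rather than a budget.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"># Hypothetical helper: rough Prometheus disk estimate in GB (decimal).
# Usage: estimate_gb ACTIVE_SERIES SCRAPE_INTERVAL_SECONDS RETENTION_DAYS [BYTES_PER_SAMPLE]
estimate_gb() {
  local series="$1" interval="$2" days="$3" bytes="${4:-2}"   # 2 B/sample is an assumption
  # samples per second x bytes per sample x seconds per day x days, converted to GB
  awk -v s="$series" -v i="$interval" -v d="$days" -v b="$bytes" \
    'BEGIN { printf "%.1f\n", (s / i) * b * 86400 * d / 1000000000 }'
}

estimate_gb 10000 15 30   # 30 days of retention
estimate_gb 10000 15 90   # 90 days of retention</code></pre></div>
<p>With the example inputs this prints 3.5 and 10.4, so the 5GB and 15GB planning figures include some headroom above the raw sample data.</p>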





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln">1</span><span class="cl"><span class="c"># prometheus.yml</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="nt">global</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">  </span><span class="nt">scrape_interval</span><span class="p">:</span><span class="w"> </span><span class="l">15s</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c"># Prometheus 2.x sets retention with startup flags, not in prometheus.yml:</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c">#   --storage.tsdb.retention.time=30d</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="c">#   --storage.tsdb.retention.size=10GB  (whichever limit is reached first)</span></span></span></code></pre></div><p>Once a limit is exceeded, Prometheus deletes the oldest data automatically. Before setting retention, confirm that the disk has enough space: in an air-gapped environment, expanding disk capacity (procurement plus installation) can take weeks to months.</p>
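<p>A quick pre-flight check before choosing a size limit could look like the following. It is a sketch only: <code>has_room</code> is a hypothetical helper, and <code>/data/prometheus</code> in the usage note is the data directory assumed from the container example earlier. It compares the free space reported by <code>df</code> with the planned limit.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"># Hypothetical helper: succeed only if DIR's filesystem has at least NEEDED_GIB free.
# Usage: has_room DIR NEEDED_GIB
has_room() {
  local dir="$1" needed_gib="$2" free_gib
  # df -P prints one line per filesystem; with -k, field 4 is available space in KiB.
  free_gib=$(df -Pk "$dir" | awk 'NR == 2 { printf "%d", $4 / 1048576 }')
  echo "free: ${free_gib} GiB, planned limit: ${needed_gib} GiB"
  [ "$free_gib" -ge "$needed_gib" ]
}</code></pre></div>
<p>For example, <code>has_room /data/prometheus 10</code> returns a non-zero status when less than 10 GiB is free, which makes it usable as a guard in a deployment script.</p>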
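<h2 id="ntp-時間同步">NTP Time Synchronization</h2>
<p>One problem that is easy to overlook in an air-gapped environment is time synchronization. Machines that cannot reach an NTP server such as <code>pool.ntp.org</code> drift, and after a few days the clocks of different machines can differ by seconds. When the metric timestamps in Prometheus and the log timestamps in Loki are a few seconds apart, metrics and logs no longer line up during incident investigation.</p>
<p>The fix is to run an NTP server inside the isolated network and have every machine sync from it:</p>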





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Internal NTP server (chrony)</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># /etc/chrony/chrony.conf</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="nb">local</span> stratum <span class="m">10</span>         <span class="c1"># with no upstream source, serve local time as stratum 10</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">allow 10.0.0.0/16        <span class="c1"># allow the internal subnet to sync</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"># Every other machine (in its own chrony.conf): point at the internal NTP server</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl">server ntp.internal iburst</span></span></code></pre></div><p>If the gateway of the isolated network is allowed to open NTP (UDP 123), let the gateway synchronize from an external NTP source and have the internal machines synchronize from the gateway; accuracy can then stay at the millisecond level.</p>
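<p>Drift is easier to catch when it is alerted on rather than discovered during an incident. A minimal sketch of Prometheus alert rules for this, assuming node_exporter runs on every machine with its timex collector enabled and is scraped by Prometheus; the file name, alert names, and the 50 ms threshold are illustrative values, not recommendations from the original text.</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml"># clock-skew.rules.yml (hypothetical file name)
groups:
  - name: clock-skew
    rules:
      - alert: HostClockOffsetHigh
        # node_timex_offset_seconds comes from node_exporter&#39;s timex collector
        expr: abs(node_timex_offset_seconds) &gt; 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: &#34;Clock offset above 50 ms on {{ $labels.instance }}&#34;
      - alert: HostClockNotSynchronized
        # 0 means the kernel reports the clock as unsynchronized
        expr: node_timex_sync_status == 0
        for: 15m
        labels:
          severity: warning
</code></pre>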
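<p>Time estimates: an initial Prometheus + Grafana + Alertmanager build takes roughly 1-2 days. Loki + Promtail takes about half a day to a day. The NTP server takes about 2 hours. Ongoing maintenance is mostly ferrying Prometheus/Loki version upgrades into the network (1-2 hours each time) and monitoring storage capacity.</p>
<h2 id="跨分類引用">Cross-category references</h2>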
<ul>
<li>→ <a href="/blog/infra/air-gapped/air-gapped-principles/" data-link-title="斷網環境的通用原則" data-link-desc="離線套件管理、內容搬運、變更追蹤的共通操作模式 — 所有斷網情境都要先建立的基礎能力">General principles for air-gapped environments</a>: offline installation of monitoring tools follows the content ferry pattern</li>
<li>→ <a href="/blog/infra/air-gapped/air-gapped-container/" data-link-title="斷網環境的容器與映像管理" data-link-desc="Private registry 架設、映像搬運（docker save/load、skopeo）、base image 更新週期、離線漏洞掃描">Container management in air-gapped environments</a>: ferrying the Prometheus/Grafana/Loki container images</li>
<li>→ <a href="/blog/infra/06-observability-logging/" data-link-title="模組六：可觀測性與 log 一併寫進 code" data-link-desc="log group、metric、alarm 跟基礎設施同生命週期管理，出事時追得到查得到">Module 6, observability and logs</a>: observability IaC for connected environments</li>
<li>→ <a href="/blog/infra/takeover/legacy-external-monitoring/" data-link-title="無 SSH 環境的監控與告警" data-link-desc="無 SSH 環境沒辦法裝 agent、沒辦法串 log pipeline，用外部 HTTP check、錯誤追蹤服務與效能基線建立最低成本的監控能力">Monitoring and alerting without SSH</a>: the opposite extreme, fully external monitoring</li>
<li>→ <a href="/blog/monitoring/04-collector/" data-link-title="模組四：Collector 設計" data-link-desc="收 → 驗 → 存 → 查 → 觸發的完整鏈路 — Go 單一 binary、可插拔 Storage Backend、rule engine">Monitoring 04, Collector architecture and deployment</a>: application-level monitoring with the SDK and Collector; in an air-gapped environment the Collector endpoint has to point at a self-hosted backend</li>
<li>→ <a href="/blog/monitoring/06-commercial-comparison/self-hosted-vs-commercial/" data-link-title="自架 vs 商業的判斷決策表" data-link-desc="使用者數、網路範圍、功能需求、合規要求四個維度判斷該自架還是用商業方案">Monitoring 06, Self-hosted vs Commercial</a>: air-gapped environments can only take the self-hosted route</li>
</ul>
]]></content:encoded></item><item><title>Operating the LGTM Stack: Loki + Grafana + Tempo + Mimir</title><link>https://tarrragon.github.io/blog/backend/04-observability/vendors/grafana-stack/lgtm-stack-operations/</link><pubDate>Mon, 22 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/04-observability/vendors/grafana-stack/lgtm-stack-operations/</guid><description>&lt;blockquote>
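&lt;p>This is a vendor deep-dive article for &lt;a href="https://tarrragon.github.io/blog/backend/04-observability/vendors/grafana-stack/" data-link-title="Grafana Stack" data-link-desc="Grafana / Loki / Tempo / Mimir / Pyroscope 全棧">Grafana Stack&lt;/a> that expands on the component-composition section of the overview. Readers new to Grafana Stack should start with the &lt;a href="https://tarrragon.github.io/blog/backend/04-observability/vendors/grafana-stack/" data-link-title="Grafana Stack" data-link-desc="Grafana / Loki / Tempo / Mimir / Pyroscope 全棧">Grafana Stack service page&lt;/a>.&lt;/p>&lt;/blockquote>
&lt;h2 id="定位">Positioning&lt;/h2>
&lt;p>Grafana Stack (LGTM = Loki + Grafana + Tempo + Mimir) is a complete option for a self-hosted observability platform. The three backends each store and query one signal type, and Grafana is the shared query and visualization layer. Understanding each component's responsibility boundary, deployment mode, and failure characteristics is what prevents the black-box situation of &amp;ldquo;four components installed, and nobody knows which one broke&amp;rdquo;.&lt;/p>
&lt;h2 id="四元件的責任分工">Responsibilities of the four components&lt;/h2>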
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Component&lt;/th>
 &lt;th>Signal type&lt;/th>
 &lt;th>Query language&lt;/th>
 &lt;th>Storage backend&lt;/th>
 &lt;th>Role&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Loki&lt;/td>
 &lt;td>Log&lt;/td>
 &lt;td>LogQL&lt;/td>
 &lt;td>Object storage + index (TSDB, formerly BoltDB)&lt;/td>
 &lt;td>Log aggregation, grep replacement&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Mimir&lt;/td>
 &lt;td>Metric&lt;/td>
 &lt;td>PromQL&lt;/td>
 &lt;td>Object storage&lt;/td>
 &lt;td>Scalable long-term storage for Prometheus&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Tempo&lt;/td>
 &lt;td>Trace&lt;/td>
 &lt;td>TraceQL&lt;/td>
 &lt;td>Object storage&lt;/td>
 &lt;td>Trace storage, span search&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Grafana&lt;/td>
 &lt;td>Visualization&lt;/td>
 &lt;td>n/a&lt;/td>
 &lt;td>n/a&lt;/td>
 &lt;td>Dashboard, alert, data source&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
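&lt;p>Grafana is the query / visualization layer; Loki / Mimir / Tempo are the storage / query layer. Grafana itself stores no observability data. It connects to data sources (Loki / Mimir / Tempo / Prometheus / Elasticsearch) to query and render.&lt;/p>
&lt;p>The four components are deployed independently, scale independently, and expose their own health metrics. A failure in one does not take the others down: when Loki is down, Grafana's metric dashboards and trace queries keep working, and only log panels report errors.&lt;/p>
&lt;h2 id="部署模式">Deployment modes&lt;/h2>
&lt;h3 id="monolithic-mode">Monolithic mode&lt;/h3>
&lt;p>Each backend runs all of its internal components in a single process / container, and the whole stack is usually brought up together. Suitable for small scale (a few GB of logs per day, a few hundred thousand metric series, a small volume of traces). Deployment is the simplest: one docker-compose file or Helm chart starts the full set.&lt;/p>
&lt;p>The limitation is the lack of fine-grained scaling: when log volume is high but metric volume is low, monolithic mode cannot add resources to only Loki's write path; the entire Loki process has to grow.&lt;/p>
&lt;h3 id="microservices-mode">Microservices mode&lt;/h3>
&lt;p>Each component is split into separate deployments with their own autoscaling. Loki splits into distributor / ingester / querier / compactor; Mimir splits into a similar set of components; Tempo has a corresponding layering.&lt;/p>
&lt;p>Suitable for medium to large scale. Deployment and operational complexity rises significantly: every sub-service of every component needs its own health check and autoscaling settings, and the stateful ones need a persistent volume.&lt;/p>
&lt;h3 id="選擇判準">Selection criteria&lt;/h3>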
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Condition&lt;/th>
 &lt;th>Recommended mode&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Team &amp;lt; 5 people, daily logs &amp;lt; 10 GB&lt;/td>
 &lt;td>Monolithic&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Need to scale one signal type independently&lt;/td>
 &lt;td>Microservices&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Prefer not to self-manage, budget is sufficient&lt;/td>
 &lt;td>Grafana Cloud&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Prometheus already in place, only log / trace to add&lt;/td>
 &lt;td>Add Loki + Tempo incrementally&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;h2 id="常見故障模式">Common failure modes&lt;/h2>
&lt;h3 id="lokiingester-oom">Loki: ingester OOM&lt;/h3>
&lt;p>Loki ingesters hold log chunks in memory and can OOM under heavy traffic. The trigger is a sudden spike in log volume (an error storm after a deploy, or a service switched to debug log level).&lt;/p></description><content:encoded><![CDATA[<blockquote>
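<p>This is a vendor deep-dive article for <a href="/blog/backend/04-observability/vendors/grafana-stack/" data-link-title="Grafana Stack" data-link-desc="Grafana / Loki / Tempo / Mimir / Pyroscope 全棧">Grafana Stack</a> that expands on the component-composition section of the overview. Readers new to Grafana Stack should start with the <a href="/blog/backend/04-observability/vendors/grafana-stack/" data-link-title="Grafana Stack" data-link-desc="Grafana / Loki / Tempo / Mimir / Pyroscope 全棧">Grafana Stack service page</a>.</p></blockquote>
<h2 id="定位">Positioning</h2>
<p>Grafana Stack (LGTM = Loki + Grafana + Tempo + Mimir) is a complete option for a self-hosted observability platform. The three backends each store and query one signal type, and Grafana is the shared query and visualization layer. Understanding each component's responsibility boundary, deployment mode, and failure characteristics is what prevents the black-box situation of &ldquo;four components installed, and nobody knows which one broke&rdquo;.</p>
<h2 id="四元件的責任分工">Responsibilities of the four components</h2>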
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Signal type</th>
          <th>Query language</th>
          <th>Storage backend</th>
          <th>Role</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Loki</td>
          <td>Log</td>
          <td>LogQL</td>
          <td>Object storage + index (TSDB, formerly BoltDB)</td>
          <td>Log aggregation, grep replacement</td>
      </tr>
      <tr>
          <td>Mimir</td>
          <td>Metric</td>
          <td>PromQL</td>
          <td>Object storage</td>
          <td>Scalable long-term storage for Prometheus</td>
      </tr>
      <tr>
          <td>Tempo</td>
          <td>Trace</td>
          <td>TraceQL</td>
          <td>Object storage</td>
          <td>Trace storage, span search</td>
      </tr>
      <tr>
          <td>Grafana</td>
          <td>Visualization</td>
          <td>n/a</td>
          <td>n/a</td>
          <td>Dashboard, alert, data source</td>
      </tr>
  </tbody>
</table>
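<p>Grafana is the query / visualization layer; Loki / Mimir / Tempo are the storage / query layer. Grafana itself stores no observability data. It connects to data sources (Loki / Mimir / Tempo / Prometheus / Elasticsearch) to query and render.</p>
<p>The four components are deployed independently, scale independently, and expose their own health metrics. A failure in one does not take the others down: when Loki is down, Grafana's metric dashboards and trace queries keep working, and only log panels report errors.</p>
<h2 id="部署模式">Deployment modes</h2>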
<h3 id="monolithic-mode">Monolithic mode</h3>
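<p>Each backend runs all of its internal components in a single process / container, and the whole stack is usually brought up together. Suitable for small scale (a few GB of logs per day, a few hundred thousand metric series, a small volume of traces). Deployment is the simplest: one docker-compose file or Helm chart starts the full set.</p>
<p>The limitation is the lack of fine-grained scaling: when log volume is high but metric volume is low, monolithic mode cannot add resources to only Loki's write path; the entire Loki process has to grow.</p>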
<h3 id="microservices-mode">Microservices mode</h3>
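<p>Each component is split into separate deployments with their own autoscaling. Loki splits into distributor / ingester / querier / compactor; Mimir splits into a similar set of components; Tempo has a corresponding layering.</p>
<p>Suitable for medium to large scale. Deployment and operational complexity rises significantly: every sub-service of every component needs its own health check and autoscaling settings, and the stateful ones need a persistent volume.</p>
<h3 id="選擇判準">Selection criteria</h3>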
<table>
  <thead>
      <tr>
          <th>Condition</th>
          <th>Recommended mode</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Team &lt; 5 people, daily logs &lt; 10 GB</td>
          <td>Monolithic</td>
      </tr>
      <tr>
          <td>Need to scale one signal type independently</td>
          <td>Microservices</td>
      </tr>
      <tr>
          <td>Prefer not to self-manage, budget is sufficient</td>
          <td>Grafana Cloud</td>
      </tr>
      <tr>
          <td>Prometheus already in place, only log / trace to add</td>
          <td>Add Loki + Tempo incrementally</td>
      </tr>
  </tbody>
</table>
<h2 id="常見故障模式">Common failure modes</h2>
<h3 id="lokiingester-oom">Loki: ingester OOM</h3>
<p>Loki ingesters hold log chunks in memory and can OOM under heavy traffic. The trigger is a sudden spike in log volume (an error storm after a deploy, or a service switched to debug log level).</p>
<p>Metrics to read: <code>loki_ingester_memory_chunks</code> and <code>process_resident_memory_bytes</code>. Remediation: shorten the chunk flush interval (writing to object storage more often relieves memory pressure), add ingester replicas, or limit log volume at the pipeline layer (OTel Collector).</p>
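<h3 id="mimircompactor-卡住">Mimir: compactor stuck</h3>
<p>The Mimir compactor merges the blocks written by ingesters. When it stalls, the block count keeps growing, queries have to scan more blocks, and latency rises.</p>
<p>Metrics to read: <code>cortex_compactor_runs_completed_total</code> stops increasing while <code>cortex_bucket_blocks_count</code> keeps growing. Remediation: check write permissions and latency on object storage, give the compactor more resources (CPU / memory), or temporarily pause ingestion so the compactor can catch up.</p>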
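<p>The two signals above can be turned into alerts. A minimal sketch of a Prometheus-style rule group, assuming the Mimir metrics are scraped under these names; the alert names, the 6-hour window, and the 20% growth threshold are illustrative assumptions to tune per cluster.</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml"># mimir-compactor.rules.yml (hypothetical file name)
groups:
  - name: mimir-compactor
    rules:
      - alert: MimirCompactorNoCompletedRuns
        # No compaction run finished in the last 6 hours (assumed window)
        expr: increase(cortex_compactor_runs_completed_total[6h]) == 0
        for: 30m
        labels:
          severity: warning
      - alert: MimirBucketBlocksGrowing
        # Block count is more than 20% above its value 24 hours ago (assumed threshold)
        expr: sum(cortex_bucket_blocks_count) &gt; 1.2 * sum(cortex_bucket_blocks_count offset 1d)
        for: 1h
        labels:
          severity: warning
</code></pre>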
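<h3 id="tempotrace-not-found">Tempo: trace not found</h3>
<p>A lookup by trace ID returns &ldquo;trace not found&rdquo; even though the trace was ingested. Common causes: Tempo's bloom filter / compacted block index does not include the trace yet (there is a delay between ingestion and queryability), or the trace has been deleted by the retention policy.</p>
<p>How to diagnose: check whether the trace's timestamp falls inside the retention window, check <code>tempo_ingester_traces_created_total</code> to confirm ingestion is healthy, and check that the compactor is running normally.</p>
<h3 id="grafanadashboard-provisioning-漂移">Grafana: dashboard provisioning drift</h3>
<p>When dashboards are managed through provisioning (YAML / JSON files), a dashboard edited by hand in the UI is overwritten at the next provisioning sync. A team member adjusts a panel in the UI, and the change disappears after the next Grafana restart.</p>
<p>Remediation: route all dashboard changes through a git → provisioning pipeline (GitOps) and use the UI only for temporary tweaks and exploration. Set <code>allowUiUpdates</code> to false in the provisioning config so that every change has to go through git.</p>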
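<p>A minimal sketch of a dashboard provider file with UI saves disabled. The provider name, folder name, and dashboard directory are assumptions for illustration; the keys themselves are the standard Grafana dashboard provisioning keys.</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml"># /etc/grafana/provisioning/dashboards/default.yaml
apiVersion: 1
providers:
  - name: git-managed          # hypothetical provider name
    folder: Services           # hypothetical Grafana folder
    type: file
    disableDeletion: true
    updateIntervalSeconds: 30  # how often Grafana rescans the directory
    allowUiUpdates: false      # block saving over provisioned dashboards from the UI
    options:
      path: /var/lib/grafana/dashboards   # assumed directory the git pipeline writes JSON into
</code></pre>
<p>With this in place, a UI edit can still be exported as JSON and committed, which keeps exploration possible without letting the running instance diverge from git.</p>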
<h2 id="dashboard-provisioning">Dashboard Provisioning</h2>
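<p>How dashboards are managed determines long-term maintenance cost. Building dashboards by hand in the UI is the fastest way to start, but as the number of dashboards grows, three problems appear: inconsistent versions, no way to roll back, and unclear ownership.</p>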
<h3 id="infrastructure-as-code">Infrastructure as Code</h3>
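<p>Dashboard JSON lives in a git repo and is synced into Grafana through provisioning. Changes go through PR review, have version history, and can be rolled back.</p>
<p>Grafana's provisioning mechanism reads a YAML config that points at the source of the dashboard JSON. The built-in dashboard provider reads from a local file path, so dashboards that originate over HTTP or from the API need a sidecar or sync step that writes them to disk first. With a Helm chart deployment, the dashboard JSON goes into a ConfigMap or a persistent volume.</p>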
<h3 id="grafonnet--jsonnet">Grafonnet / Jsonnet</h3>
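<p>Use Grafonnet (Grafana's dashboard-as-code library for the Jsonnet language) to generate dashboard JSON. It suits large numbers of similar dashboards: one dashboard per service, with the same structure but different data sources and labels.</p>
<p>Grafonnet has a steeper learning curve than writing JSON directly, but it starts to pay back in maintenance efficiency once there are more than 20 dashboards.</p>
<h2 id="下一步路由">Where to go next</h2>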
<ul>
<li><a href="/blog/backend/04-observability/vendors/grafana-stack/" data-link-title="Grafana Stack" data-link-desc="Grafana / Loki / Tempo / Mimir / Pyroscope 全棧">Grafana Stack service page</a>: overview and day-to-day operations</li>
<li><a href="/blog/backend/04-observability/vendors/prometheus/" data-link-title="Prometheus" data-link-desc="Pull-based metrics 主流 OSS、PromQL 與 alerting">Prometheus service page</a>: the upstream metric source for Mimir</li>
<li><a href="/blog/backend/04-observability/vendors/opentelemetry/collector-deployment-patterns/" data-link-title="OTel Collector 部署模式：agent / gateway / sidecar 與 pipeline 設計" data-link-desc="說明 OpenTelemetry Collector 三種部署位置的責任分工、receivers/processors/exporters pipeline 設計，以及 collector 失效、記憶體壓力與 backpressure 的故障演練與容量邊界">OTel Collector deployment patterns</a>: the ingestion entry point for LGTM</li>
<li><a href="/blog/backend/04-observability/telemetry-pipeline/" data-link-title="4.11 Telemetry Pipeline 架構" data-link-desc="把 log / metric / trace 的 agent → collector → ingest → storage → query 分層治理">4.11 telemetry pipeline</a>: governance of each pipeline layer</li>
<li><a href="/blog/backend/04-observability/observability-operating-model/" data-link-title="4.18 Observability Operating Model" data-link-desc="定義 platform / service team / on-call 對訊號、dashboard、alert 與成本的 ownership">4.18 operating model</a>: ownership of dashboards / alerts</li>
</ul>
]]></content:encoded></item><item><title>Grafana Loki: design and operational limits</title><link>https://tarrragon.github.io/blog/backend/04-observability/vendors/grafana-stack/loki-design-operational-limits/</link><pubDate>Tue, 23 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/04-observability/vendors/grafana-stack/loki-design-operational-limits/</guid><description>&lt;blockquote>
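&lt;p>This is a vendor deep-dive article for &lt;a href="https://tarrragon.github.io/blog/backend/04-observability/vendors/grafana-stack/" data-link-title="Grafana Stack" data-link-desc="Grafana / Loki / Tempo / Mimir / Pyroscope 全棧">Grafana Stack&lt;/a> that expands on the &amp;ldquo;Loki design and limits&amp;rdquo; section of the overview. Readers new to Grafana Stack should start with the &lt;a href="https://tarrragon.github.io/blog/backend/04-observability/vendors/grafana-stack/" data-link-title="Grafana Stack" data-link-desc="Grafana / Loki / Tempo / Mimir / Pyroscope 全棧">Grafana Stack service page&lt;/a>.&lt;/p>&lt;/blockquote>
&lt;h2 id="問題情境">Problem context&lt;/h2>
&lt;p>When a team migrates from the ELK stack or CloudWatch Logs to Grafana Stack, Loki is the default choice for the log backend. The biggest shock after migration is a fundamental difference in query model: Elasticsearch builds a full-text index (every field is indexed at write time and can be searched arbitrarily at query time), whereas Loki indexes only labels (only stream labels are indexed at write time; a query first selects streams and then greps their content).&lt;/p>
&lt;p>This difference is a deliberate design choice. Loki's goal is to be &amp;ldquo;Prometheus for logs&amp;rdquo;: manage logs with the same label scheme as Prometheus metrics so that log queries and metric queries share one set of label selectors. The cost is giving up index-backed full-text search, so content searches are slower. Understanding this design philosophy is what makes it possible to design labels correctly, write efficient LogQL, and avoid the common performance traps.&lt;/p>
&lt;h2 id="核心概念">Core concepts&lt;/h2>
&lt;h3 id="like-prometheus-but-for-logs">Like Prometheus, but for logs&lt;/h3>
&lt;p>Prometheus identifies a time series by its label set: &lt;code>{job=&amp;quot;checkout&amp;quot;, instance=&amp;quot;10.0.1.5&amp;quot;}&lt;/code> is one series. Loki identifies a log stream the same way: &lt;code>{job=&amp;quot;checkout&amp;quot;, namespace=&amp;quot;production&amp;quot;}&lt;/code> is one stream. All log entries of a stream are stored in the same set of chunks.&lt;/p>
&lt;p>Elasticsearch's indexing model is &amp;ldquo;build an inverted index at write time, use the index at query time&amp;rdquo;. Loki's is &amp;ldquo;record only the stream label → chunk mapping at write time; at query time, select streams by label, then grep inside the chunks&amp;rdquo;.&lt;/p>
&lt;p>This means:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Queries with a label filter are fast&lt;/strong>: Loki scans only the chunks of the matching streams&lt;/li>
&lt;li>&lt;strong>Queries without a selective label filter are slow&lt;/strong>: Loki has to scan the chunks of every stream (effectively a full scan)&lt;/li>
&lt;li>&lt;strong>Label cardinality is as sensitive as in Prometheus&lt;/strong>: a high-cardinality label creates a large number of streams, each stream's chunks are tiny, and the index bloats&lt;/li>
&lt;/ul>
&lt;h3 id="stream-與-chunk">Stream and chunk&lt;/h3>
&lt;p>One stream = one unique label set. Each stream's log entries are stored in chunks in time order. The chunk is Loki's smallest storage unit.&lt;/p>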





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">Stream: {job=&amp;#34;checkout&amp;#34;, namespace=&amp;#34;production&amp;#34;}
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl"> └─ Chunk 1: [2026-06-22T00:00 ~ 2026-06-22T01:00] (compressed)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl"> └─ Chunk 2: [2026-06-22T01:00 ~ 2026-06-22T02:00] (compressed)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl"> └─ ...&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Chunks live in object storage (S3 / GCS / MinIO); the index lives in a separate index store (BoltDB or TSDB, with TSDB the default since 3.0). Object storage is cheap compared with the SSDs Elasticsearch needs, and that is where Loki's cost advantage comes from.&lt;/p></description><content:encoded><![CDATA[<blockquote>
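<p>This is a vendor deep-dive article for <a href="/blog/backend/04-observability/vendors/grafana-stack/" data-link-title="Grafana Stack" data-link-desc="Grafana / Loki / Tempo / Mimir / Pyroscope 全棧">Grafana Stack</a> that expands on the &ldquo;Loki design and limits&rdquo; section of the overview. Readers new to Grafana Stack should start with the <a href="/blog/backend/04-observability/vendors/grafana-stack/" data-link-title="Grafana Stack" data-link-desc="Grafana / Loki / Tempo / Mimir / Pyroscope 全棧">Grafana Stack service page</a>.</p></blockquote>
<h2 id="問題情境">Problem context</h2>
<p>When a team migrates from the ELK stack or CloudWatch Logs to Grafana Stack, Loki is the default choice for the log backend. The biggest shock after migration is a fundamental difference in query model: Elasticsearch builds a full-text index (every field is indexed at write time and can be searched arbitrarily at query time), whereas Loki indexes only labels (only stream labels are indexed at write time; a query first selects streams and then greps their content).</p>
<p>This difference is a deliberate design choice. Loki's goal is to be &ldquo;Prometheus for logs&rdquo;: manage logs with the same label scheme as Prometheus metrics so that log queries and metric queries share one set of label selectors. The cost is giving up index-backed full-text search, so content searches are slower. Understanding this design philosophy is what makes it possible to design labels correctly, write efficient LogQL, and avoid the common performance traps.</p>
<h2 id="核心概念">Core concepts</h2>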
<h3 id="like-prometheus-but-for-logs">Like Prometheus, but for logs</h3>
<p>Prometheus identifies a time series by its label set: <code>{job=&quot;checkout&quot;, instance=&quot;10.0.1.5&quot;}</code> is one series. Loki identifies a log stream the same way: <code>{job=&quot;checkout&quot;, namespace=&quot;production&quot;}</code> is one stream. All log entries of a stream are stored in the same set of chunks.</p>
<p>Elasticsearch's indexing model is &ldquo;build an inverted index at write time, use the index at query time&rdquo;. Loki's is &ldquo;record only the stream label → chunk mapping at write time; at query time, select streams by label, then grep inside the chunks&rdquo;.</p>
<p>This means:</p>
<ul>
<li><strong>Queries with a label filter are fast</strong>: Loki scans only the chunks of the matching streams</li>
<li><strong>Queries without a selective label filter are slow</strong>: Loki has to scan the chunks of every stream (effectively a full scan)</li>
<li><strong>Label cardinality is as sensitive as in Prometheus</strong>: a high-cardinality label creates a large number of streams, each stream's chunks are tiny, and the index bloats</li>
</ul>
<h3 id="stream-與-chunk">Stream and chunk</h3>
<p>One stream = one unique label set. Each stream's log entries are stored in chunks in time order. The chunk is Loki's smallest storage unit.</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Stream: {job=&#34;checkout&#34;, namespace=&#34;production&#34;}
</span></span><span class="line"><span class="ln">2</span><span class="cl">  └─ Chunk 1: [2026-06-22T00:00 ~ 2026-06-22T01:00] (compressed)
</span></span><span class="line"><span class="ln">3</span><span class="cl">  └─ Chunk 2: [2026-06-22T01:00 ~ 2026-06-22T02:00] (compressed)
</span></span><span class="line"><span class="ln">4</span><span class="cl">  └─ ...</span></span></code></pre></div><p>Chunks live in object storage (S3 / GCS / MinIO); the index lives in a separate index store (BoltDB or TSDB, with TSDB the default since 3.0). Object storage is cheap compared with the SSDs Elasticsearch needs, and that is where Loki's cost advantage comes from.</p>
<h3 id="跟-elasticsearch-的根本差異">Fundamental differences from Elasticsearch</h3>
<table>
  <thead>
      <tr>
          <th>Aspect</th>
          <th>Loki</th>
          <th>Elasticsearch</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Indexed data</td>
          <td>Labels only (stream metadata)</td>
          <td>All fields (full-text + structured)</td>
      </tr>
      <tr>
          <td>Query model</td>
          <td>Label selector → stream → grep content</td>
          <td>Query DSL / KQL → inverted index lookup</td>
      </tr>
      <tr>
          <td>Write cost</td>
          <td>Low (no content index)</td>
          <td>High (inverted index + doc values)</td>
      </tr>
      <tr>
          <td>Query cost</td>
          <td>Depends on how well streams are narrowed (more precise labels, faster queries)</td>
          <td>Depends on index coverage (queries on indexed fields are fast)</td>
      </tr>
      <tr>
          <td>Storage cost</td>
          <td>Low (object storage)</td>
          <td>High (SSD / local disk)</td>
      </tr>
      <tr>
          <td>Full-text search</td>
          <td>No index-backed search (line filters grep the content)</td>
          <td>Native support</td>
      </tr>
      <tr>
          <td>Best fit</td>
          <td>Log aggregation where the Prometheus/Grafana ecosystem is already in place</td>
          <td>Log analytics / SIEM that needs full-text search</td>
      </tr>
  </tbody>
</table>
<p>Reading the table: if the team's query pattern is &ldquo;pick a service/namespace/pod, then look at the log entries in a time range&rdquo;, Loki is enough. If the pattern is &ldquo;search all logs for an error message or a request ID&rdquo;, Elasticsearch's full-text index fits better.</p>
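<h2 id="配置-step-by-step">Configuration step by step</h2>
<h3 id="label-設計原則">Label design principles</h3>
<p>Label design is the most important operational decision in Loki. The principles match Prometheus: low cardinality, stable, and meaningful for queries.</p>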
<table>
  <thead>
      <tr>
          <th>Label</th>
          <th>Cardinality</th>
          <th>Suitable as a label</th>
          <th>Reason</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>job</code></td>
          <td>Low (number of services)</td>
          <td>Yes</td>
          <td>Narrows to a specific service</td>
      </tr>
      <tr>
          <td><code>namespace</code></td>
          <td>Low</td>
          <td>Yes</td>
          <td>Narrows to a specific environment</td>
      </tr>
      <tr>
          <td><code>pod_name</code></td>
          <td>Medium (number of pods)</td>
          <td>Depends</td>
          <td>Common on K8s, but frequent pod recreation produces many short-lived streams</td>
      </tr>
      <tr>
          <td><code>level</code> (info/warn/error)</td>
          <td>Low (3-5 values)</td>
          <td>Yes</td>
          <td>Fast filtering of error logs</td>
      </tr>
      <tr>
          <td><code>request_id</code></td>
          <td>Extremely high (per request)</td>
          <td>No</td>
          <td>One stream per request, tiny chunks, index explosion</td>
      </tr>
      <tr>
          <td><code>user_id</code></td>
          <td>High</td>
          <td>No</td>
          <td>Same as above</td>
      </tr>
      <tr>
          <td><code>trace_id</code></td>
          <td>Extremely high</td>
          <td>No</td>
          <td>Look traces up in Tempo, not through a Loki label</td>
      </tr>
  </tbody>
</table>
<p>request_id / user_id / trace_id should not be labels. Put them in the log content as structured JSON fields and extract them at query time with a LogQL line filter or parser.</p>
<h3 id="logql-常見查詢模式">Common LogQL query patterns</h3>
<p><strong>Stream selector + line filter</strong> (the most basic):</p>





<pre tabindex="0"><code class="language-logql" data-lang="logql">{job=&#34;checkout&#34;, namespace=&#34;production&#34;} |= &#34;error&#34; |= &#34;timeout&#34;</code></pre><p>Select the streams first, then grep for log lines that contain both &ldquo;error&rdquo; and &ldquo;timeout&rdquo;. <code>|=</code> means contains, <code>!=</code> means does not contain, and <code>|~</code> is a regex match.</p>
<p><strong>JSON parser + label filter</strong> (JSON logs):</p>





<pre tabindex="0"><code class="language-logql" data-lang="logql">{job=&#34;checkout&#34;} | json | status_code &gt;= 500 | line_format &#34;{{.method}} {{.path}} {{.status_code}}&#34;</code></pre><p><code>| json</code> parses the fields of a JSON log entry; later stages can filter on those fields and format the output.</p>
<p><strong>Metric aggregation</strong> (log → metric):</p>





<pre tabindex="0"><code class="language-logql" data-lang="logql">sum by (status_code) (rate({job=&#34;checkout&#34;} | json | __error__=&#34;&#34; [5m]))</code></pre><p>Computes the per-second log entry rate for each status_code, averaged over a 5-minute window. This is Loki's &ldquo;metrics from logs&rdquo; capability: no separate metrics pipeline is needed, and time series are produced directly from logs.</p>
<h3 id="loki-config-核心段">Core Loki config sections</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c"># loki-config.yaml</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w"></span><span class="nt">schema_config</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">  </span><span class="nt">configs</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span>- <span class="nt">from</span><span class="p">:</span><span class="w"> </span><span class="ld">2024-01-01</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">      </span><span class="nt">store</span><span class="p">:</span><span class="w"> </span><span class="l">tsdb</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">      </span><span class="nt">object_store</span><span class="p">:</span><span class="w"> </span><span class="l">s3</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">      </span><span class="nt">schema</span><span class="p">:</span><span class="w"> </span><span class="l">v13</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">      </span><span class="nt">index</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">        </span><span class="nt">prefix</span><span class="p">:</span><span class="w"> </span><span class="l">loki_index_</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">        </span><span class="nt">period</span><span class="p">:</span><span class="w"> </span><span class="l">24h</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"></span><span class="nt">storage_config</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">  </span><span class="nt">tsdb_shipper</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w">    </span><span class="nt">active_index_directory</span><span class="p">:</span><span class="w"> </span><span class="l">/loki/index</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w">    </span><span class="nt">cache_location</span><span class="p">:</span><span class="w"> </span><span class="l">/loki/cache</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w">  </span><span class="nt">aws</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w">    </span><span class="nt">s3</span><span class="p">:</span><span class="w"> </span><span class="l">s3://loki-chunks-bucket</span><span class="w">
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="w">    </span><span class="nt">region</span><span class="p">:</span><span class="w"> </span><span class="l">us-east-1</span><span class="w">
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="w"></span><span class="nt">limits_config</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="w">  </span><span class="nt">ingestion_rate_mb</span><span class="p">:</span><span class="w"> </span><span class="m">10</span><span class="w">
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="w">  </span><span class="nt">ingestion_burst_size_mb</span><span class="p">:</span><span class="w"> </span><span class="m">20</span><span class="w">
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="w">  </span><span class="nt">max_streams_per_user</span><span class="p">:</span><span class="w"> </span><span class="m">10000</span><span class="w">
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="w">  </span><span class="nt">max_label_name_length</span><span class="p">:</span><span class="w"> </span><span class="m">1024</span><span class="w">
</span></span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="w">  </span><span class="nt">max_label_value_length</span><span class="p">:</span><span class="w"> </span><span class="m">2048</span></span></span></code></pre></div><p><code>limits_config</code> is the safety net. <code>max_streams_per_user</code> caps the number of streams per tenant; once it is exceeded, logs for new streams are rejected (HTTP 429). This is the last line of defense against a label cardinality explosion.</p>
<h2 id="故障與邊界">Failures and limits</h2>
<h3 id="label-cardinality-爆炸">Label cardinality explosion</h3>
<p><strong>Trigger</strong>: a label carries high-cardinality values (pod UID, request ID, container ID). Every unique label set creates a stream, so the stream count grows quickly.</p>
<p><strong>Symptoms</strong>: <code>loki_ingester_memory_streams</code> keeps rising, ingester memory grows, and eventually the <code>max_streams_per_user</code> limit trips (429 errors). It is the log version of a Prometheus series explosion.</p>
<p><strong>Fix</strong>: find the label that produces the streams. Loki's <code>/loki/api/v1/labels</code> and <code>/loki/api/v1/label/{name}/values</code> APIs list the label names and the values of each label. Once the high-cardinality label is identified, remove it in the promtail / alloy pipeline and move the value into a structured field in the log content.</p>
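<p>A minimal sketch of the removal step in a Promtail config, using the <code>labeldrop</code> pipeline stage. The job name, log path, and the two label names are assumptions for illustration; the sketch also assumes some earlier stage or relabeling rule attached those labels in the first place.</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml"># promtail-config.yaml (fragment)
scrape_configs:
  - job_name: checkout
    static_configs:
      - targets: [localhost]
        labels:
          job: checkout
          namespace: production
          __path__: /var/log/checkout/*.log   # assumed log path
    pipeline_stages:
      # Remove high-cardinality labels from the stream label set.
      # The values stay available inside the JSON log line for query-time parsing.
      - labeldrop:
          - request_id   # hypothetical label names
          - pod_uid
</code></pre>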
<h3 id="stream-rate-limit">Stream rate limit</h3>
<p><strong>Trigger</strong>: the ingestion rate of a single stream exceeds <code>per_stream_rate_limit</code> (default 3 MB/s). Usually one service is flooding debug logs.</p>
<p><strong>Symptoms</strong>: Loki returns 429 with a <code>rate limit exceeded</code> error. Some log entries are dropped.</p>
<p><strong>Fix</strong>: deal with the log flood first (lower the debug log level or add sampling). If the volume is legitimate (a high-QPS service), raise <code>per_stream_rate_limit</code> or split the stream (add one low-cardinality label to spread the traffic).</p>
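<p>A minimal sketch of raising the per-stream limits in <code>limits_config</code>. The values shown are illustrative assumptions, not recommendations; size them from the measured rate of the busiest legitimate stream.</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml"># loki-config.yaml (fragment)
limits_config:
  per_stream_rate_limit: 5MB          # sustained rate allowed per stream (default 3MB)
  per_stream_rate_limit_burst: 20MB   # short bursts allowed above the sustained rate (default 15MB)
</code></pre>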
<h3 id="大時間範圍查詢-timeout">Timeouts on large time-range queries</h3>
<p><strong>Trigger</strong>: a LogQL query has no precise label filter and a time range longer than 24 hours. Loki has to scan a large number of chunks and hits the query timeout (3 minutes by default).</p>
<p><strong>Symptoms</strong>: Grafana shows a query timeout error.</p>
<p><strong>Fix</strong>: narrow the stream set with a label selector first (<code>{job=&quot;checkout&quot;, namespace=&quot;production&quot;}</code> rather than <code>{namespace=&quot;production&quot;}</code>), then narrow further with line filters. If the business needs log analytics over long time ranges, consider LogQL metric aggregations (<code>rate(...)</code> / <code>count_over_time(...)</code>) instead of scanning raw logs.</p>
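<p>One way to make long-range analytics cheap is to precompute the aggregation. A minimal sketch of a Loki ruler recording rule, assuming the ruler is enabled and configured to remote-write results to a Prometheus-compatible store; the file name, group name, and recorded series name are hypothetical.</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml"># rules/checkout.yaml, loaded by the Loki ruler (hypothetical file name)
groups:
  - name: checkout-log-metrics
    interval: 1m
    rules:
      # Evaluated every minute over a short window, so no single query scans days of chunks
      - record: checkout:error_lines:rate5m
        expr: |
          sum by (namespace) (
            rate({job=&#34;checkout&#34;, namespace=&#34;production&#34;} |= &#34;error&#34; [5m])
          )
</code></pre>
<p>A 30-day dashboard then reads the recorded series with PromQL instead of re-scanning raw log chunks on every refresh.</p>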
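<h3 id="chunk-target-size-與-ingestion-rate-的關係">Chunk target size and ingestion rate</h3>
<p><code>chunk_target_size</code> (default 1.5 MB) controls the chunk size. A stream with a low ingestion rate may take hours to fill one chunk, and the chunk sits in ingester memory for that whole time. A large number of low-rate streams (= high-cardinality labels) leaves the ingester holding many unflushed chunks at once, which consumes memory.</p>
<p>Remediation: lower <code>chunk_idle_period</code> (default 30 minutes; a chunk that receives no new entries for this long is flushed even if it is not full), or reduce the number of low-rate streams by lowering label cardinality.</p>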
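<p>A minimal sketch of the ingester settings involved. The shortened values are illustrative assumptions; the trade-off is that flushing earlier relieves memory but writes more, smaller chunks to object storage.</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml"># loki-config.yaml (fragment)
ingester:
  chunk_target_size: 1572864   # 1.5 MB, the default
  chunk_idle_period: 10m       # flush chunks that received nothing for 10 minutes (default 30m)
  max_chunk_age: 1h            # flush any chunk older than this, full or not (default 2h)
</code></pre>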
<h2 id="容量與成本">Capacity and cost</h2>
<p>Loki's cost structure is fundamentally different from Elasticsearch's:</p>
<table>
  <thead>
      <tr>
          <th>Cost item</th>
          <th>Loki</th>
          <th>Elasticsearch</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Storage</td>
          <td>Object storage (S3/GCS), cheap</td>
          <td>SSD / local disk, expensive</td>
      </tr>
      <tr>
          <td>Index</td>
          <td>Small (labels only)</td>
          <td>Large (inverted index + doc values)</td>
      </tr>
      <tr>
          <td>Query compute</td>
          <td>Greps chunks on every query, CPU-intensive</td>
          <td>Uses the index, relatively light</td>
      </tr>
      <tr>
          <td>Suitable workload</td>
          <td>High volume, low query frequency</td>
          <td>High query frequency, needs full-text</td>
      </tr>
  </tbody>
</table>
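<p>Loki costs far less than Elasticsearch when the workload is &ldquo;write TBs of logs per day, query occasionally&rdquo;. When the workload is &ldquo;hundreds of queries a day that need fast full-text search&rdquo;, Elasticsearch's pre-indexed queries perform better, and Loki's per-query grep compute ends up costing more.</p>
<p>For cost governance, monitor <code>loki_ingester_bytes_received_total</code> (ingestion volume) and <code>loki_querier_query_duration_seconds</code> (query cost). If query duration keeps rising, first check whether the label filters are too loose or the query time ranges are too wide.</p>
<h2 id="整合與下一步">Integration and next steps</h2>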
<ul>
<li><a href="/blog/backend/04-observability/vendors/grafana-stack/" data-link-title="Grafana Stack" data-link-desc="Grafana / Loki / Tempo / Mimir / Pyroscope 全棧">Grafana Stack service page</a>: overview and full-stack operations</li>
<li><a href="../lgtm-stack-operations/">LGTM Stack Operations</a>: where Loki sits in an LGTM deployment</li>
<li><a href="/blog/backend/04-observability/audit-log-governance/" data-link-title="4.12 Audit Log 邊界與 PII 治理" data-link-desc="把稽核訊號從 operational log 拆出、按法規與不變性治理">4.12 Audit Log Governance</a>: Loki is a poor fit for compliance queries on audit logs (no immutable storage guarantee, no fine-grained access control); use BigQuery or a dedicated audit backend for compliance needs</li>
<li><a href="/blog/backend/04-observability/cases/healthcare-access-traceability-and-retention/" data-link-title="Healthcare：存取可追溯性與保留邊界" data-link-desc="在資料主權限制下，建立可追溯存取證據與分層保留策略。">Healthcare access traceability case</a>: tiered retention is implemented in Loki with tenant-level retention policies</li>
<li><a href="/blog/backend/04-observability/log-schema/" data-link-title="4.1 log schema 與搜尋規劃" data-link-desc="整理 log 欄位、索引與搜尋策略">4.1 Log Schema</a>: log field design affects Loki label design and parser efficiency</li>
<li><a href="/blog/backend/04-observability/vendors/elastic-stack/ilm-log-pipeline/" data-link-title="Index Lifecycle Management 與 Log Pipeline" data-link-desc="說明 Elasticsearch ILM policy 設計、data stream / rollover、Beats vs Elastic Agent 採集選擇、ingest pipeline 與 shard sizing、cross-cluster 策略與 cost governance">Elasticsearch ILM and Log Pipeline</a>: the alternative when full-text search is required</li>
</ul>
]]></content:encoded></item><item><title>Datadog → Grafana Stack: breaking a $50K/month bill down into self-hosted observability</title><link>https://tarrragon.github.io/blog/backend/04-observability/vendors/datadog/migrate-to-grafana-stack/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/04-observability/vendors/datadog/migrate-to-grafana-stack/</guid><description>&lt;blockquote>
&lt;p>This article is a cross-vendor migration playbook that cross-links to &lt;a href="https://tarrragon.github.io/blog/backend/04-observability/vendors/datadog/" data-link-title="Datadog" data-link-desc="All-in-one SaaS 觀測平台、APM / Logs / Metrics / RUM / Security">Datadog&lt;/a> (source) and &lt;a href="https://tarrragon.github.io/blog/backend/04-observability/vendors/grafana-stack/" data-link-title="Grafana Stack" data-link-desc="Grafana / Loki / Tempo / Mimir / Pyroscope 全棧">Grafana Stack&lt;/a> (target). Compared with the three previous migrations (&lt;a href="https://tarrragon.github.io/blog/backend/07-security-data-protection/vendors/splunk/migrate-to-elastic-security/" data-link-title="Splunk → Elastic Security Detection Rule Migration：6 段 phased playbook 跟 5 大踩雷" data-link-desc="從 Splunk Enterprise Security 遷到 Elastic Security 的 detection rule translation playbook：SPL ↔ KQL/ES|QL schema 對位、AI-assisted translation pipeline、parallel run 比對、cutover routing、5 個 production 踩雷（macro 沒對應 / time zone 差異 / summary index 不對位 / alert dedup key 衝突 / 過早 decommission）、capacity / cost 對照">Splunk → Elastic&lt;/a>, phased / &lt;a href="https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/redis/migrate-to-dragonflydb/" data-link-title="Redis → DragonflyDB：drop-in 相容下的容量躍升 &amp;#43; 5 個踩雷" data-link-desc="DragonflyDB 號稱 Redis drop-in 替代、單機 throughput 25x、記憶體效率 30% 提升；遷移流程簡單但有 5 個 production 踩雷（RDB 版本差 / Lua 腳本不全支援 / Pub-Sub fanout 行為差異 / Cluster mode 兼容度 / Modules 不支援）、跟 Sentinel / Cluster 模式對位">Redis → DragonflyDB&lt;/a>, drop-in / &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &amp;#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 
對位">PostgreSQL → Aurora&lt;/a>, hybrid), this one is a &lt;em>cost-driven multi-tool migration&lt;/em>: rather than swapping one product for another, it splits an &lt;em>all-in-one SaaS&lt;/em> into &lt;em>five dedicated OSS / cloud components&lt;/em>.&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is a cross-vendor migration playbook that cross-links to <a href="/blog/backend/04-observability/vendors/datadog/" data-link-title="Datadog" data-link-desc="All-in-one SaaS 觀測平台、APM / Logs / Metrics / RUM / Security">Datadog</a> (source) and <a href="/blog/backend/04-observability/vendors/grafana-stack/" data-link-title="Grafana Stack" data-link-desc="Grafana / Loki / Tempo / Mimir / Pyroscope 全棧">Grafana Stack</a> (target). Compared with the three previous migrations (<a href="/blog/backend/07-security-data-protection/vendors/splunk/migrate-to-elastic-security/" data-link-title="Splunk → Elastic Security Detection Rule Migration：6 段 phased playbook 跟 5 大踩雷" data-link-desc="從 Splunk Enterprise Security 遷到 Elastic Security 的 detection rule translation playbook：SPL ↔ KQL/ES|QL schema 對位、AI-assisted translation pipeline、parallel run 比對、cutover routing、5 個 production 踩雷（macro 沒對應 / time zone 差異 / summary index 不對位 / alert dedup key 衝突 / 過早 decommission）、capacity / cost 對照">Splunk → Elastic</a>, phased / <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-dragonflydb/" data-link-title="Redis → DragonflyDB：drop-in 相容下的容量躍升 &#43; 5 個踩雷" data-link-desc="DragonflyDB 號稱 Redis drop-in 替代、單機 throughput 25x、記憶體效率 30% 提升；遷移流程簡單但有 5 個 production 踩雷（RDB 版本差 / Lua 腳本不全支援 / Pub-Sub fanout 行為差異 / Cluster mode 兼容度 / Modules 不支援）、跟 Sentinel / Cluster 模式對位">Redis → DragonflyDB</a>, drop-in / <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora</a>, hybrid), this one is a <em>cost-driven multi-tool migration</em>: rather than swapping one product for another, it splits an <em>all-in-one SaaS</em> into <em>five dedicated OSS / cloud components</em>.</p></blockquote>
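<h2 id="50kmonth-bill-拆解先看錢花在哪再決定怎麼遷">Breaking down a $50K/month bill: see where the money goes before deciding how to migrate</h2>
<p>For a mid-size SaaS (100-500 hosts, 5K-50K metric series, TB-level logs per day), the monthly Datadog bill looks like this:</p>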
<table>
  <thead>
      <tr>
          <th>Billing item</th>
          <th>Average unit price</th>
          <th>Mid-size SaaS estimate / month</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Infrastructure host</td>
          <td>$15-23 / host</td>
          <td>200 host × $20 = $4,000</td>
      </tr>
      <tr>
          <td>APM host</td>
          <td>$31 / host</td>
          <td>100 host × $31 = $3,100</td>
      </tr>
      <tr>
          <td>Custom metrics</td>
          <td>$5 / 100 series ($0.05 each)</td>
          <td>30K series × $0.05 = $1,500</td>
      </tr>
      <tr>
          <td>Log ingest</td>
          <td>$0.10 / GB ingested</td>
          <td>50TB × $0.10 = $5,000</td>
      </tr>
      <tr>
          <td>Log retention (15-day)</td>
          <td>$1.27 / million events</td>
          <td>5G events × $1.27 = $6,350</td>
      </tr>
      <tr>
          <td>Log indexing</td>
          <td>$1.70 / million events</td>
          <td>5G events × $1.70 = $8,500</td>
      </tr>
      <tr>
          <td>Network</td>
          <td>$5 / host</td>
          <td>200 × $5 = $1,000</td>
      </tr>
      <tr>
          <td>RUM / Session</td>
          <td>$1.50 / 1,000 sessions</td>
          <td>3M sessions × $1.50 / 1,000 = $4,500</td>
      </tr>
      <tr>
          <td>Synthetics</td>
          <td>$5 / 10K test runs</td>
          <td>50K test = $25</td>
      </tr>
      <tr>
          <td>Total</td>
          <td>-</td>
          <td><strong>≈ $34,000 / month</strong> (conservative estimate)</td>
      </tr>
  </tbody>
</table>
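<p>The subtotals in the table can be re-derived with a few lines of arithmetic. The sketch below is one way to do that in Python. Quantities are expressed in each item's billing unit and were chosen so that every line reproduces the table's monthly subtotal (for example 5,000 million log events and 3,000 thousand sessions); those volumes are an assumption about what the estimate intends, not figures published by Datadog.</p>

```python
# Each entry: (billing item, quantity in billing units, unit price in USD).
# Quantities are illustrative assumptions that reproduce the table's subtotals.
LINE_ITEMS = [
    ("Infrastructure host", 200, 20.00),       # hosts
    ("APM host", 100, 31.00),                  # hosts
    ("Custom metrics", 30_000, 0.05),          # series
    ("Log ingest", 50_000, 0.10),              # GB (50 TB)
    ("Log retention (15-day)", 5_000, 1.27),   # million events
    ("Log indexing", 5_000, 1.70),             # million events
    ("Network", 200, 5.00),                    # hosts
    ("RUM / Session", 3_000, 1.50),            # thousand sessions
    ("Synthetics", 5, 5.00),                   # blocks of 10K test runs
]


def estimate_bill(items):
    """Return the monthly subtotal per billing item, rounded to cents."""
    return {name: round(qty * price, 2) for name, qty, price in items}


def total_bill(items):
    """Return the monthly total across all billing items."""
    return round(sum(estimate_bill(items).values()), 2)


if __name__ == "__main__":
    for name, subtotal in estimate_bill(LINE_ITEMS).items():
        print(f"{name:24s} ${subtotal:>10,.2f}")
    print(f"{'Total':24s} ${total_bill(LINE_ITEMS):>10,.2f}")
```

<p>Keeping the line items as data makes it easy to swap in your own host counts and log volumes before comparing against a self-hosted estimate.</p>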
<p>Scaling out to a production footprint of 500 hosts / 100 TB of logs puts the bill in the $80K-150K / month range. A Grafana stack (self-hosted on K8s plus selected Grafana Cloud services) with equivalent capacity typically costs $8K-30K / month, <em>a 2.5-5x cost reduction</em>.</p>
<p>Cost is not the only driver, though. Others include:</p>
<ul>
<li><strong>Multi-cloud / hybrid</strong>: Datadog is centralized, whereas Grafana can be deployed in a distributed fashion to meet data residency requirements</li>
<li><strong>OpenTelemetry-first</strong>: the Grafana stack is OTel-native, while Datadog still relies on a vendor-specific agent</li>
<li><strong>Long-term retention</strong>: running 1-year retention on Loki with an S3 cold tier is 10-50x cheaper than Datadog</li>
</ul>
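<h2 id="五個責任五個-component不是替換一個產品">Five responsibilities, five components: not a one-product swap</h2>
<p>Datadog is an <em>all-in-one SaaS</em>: a single agent plus a single UI cover five responsibilities. The Grafana stack hands each responsibility to a dedicated component:</p>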
<table>
  <thead>
      <tr>
          <th>Responsibility</th>
          <th>Handled by Datadog</th>
          <th>Grafana Stack counterpart</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Metric</td>
          <td>Datadog metric</td>
          <td>Mimir (Prometheus-compatible long-term storage)</td>
      </tr>
      <tr>
          <td>Log</td>
          <td>Datadog Logs</td>
          <td>Loki (label-indexed logs)</td>
      </tr>
      <tr>
          <td>Trace</td>
          <td>Datadog APM</td>
          <td>Tempo (trace-only, backed by object storage)</td>
      </tr>
      <tr>
          <td>Dashboard</td>
          <td>Datadog dashboard</td>
          <td>Grafana</td>
      </tr>
      <tr>
          <td>Agent / shipper</td>
          <td>Datadog Agent</td>
          <td>Alloy (OTel-based collector), which succeeds the older Grafana Agent / Promtail</td>
      </tr>
  </tbody>
</table>
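<p>The migration is <em>five independent streams</em>, not a single cutover. SREs need to let go of the "one agent covers everything" mental model.</p>
<h2 id="migration-結構每個-component-各自-phased整體-staggered">Migration structure: each component phased on its own, staggered overall</h2>
<p>Unlike the three previous migrations, which were linear flows, this one is <em>5 parallel migration streams</em> plus cross-stream coordination:</p>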





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">           Phase 0      Phase 1                Phase 2          Phase 3
</span></span><span class="line"><span class="cl">           Audit        Deploy                 Dual-ship        Cutover
</span></span><span class="line"><span class="cl">Metric     [audit] ──→  [deploy Mimir]    ──→  [dual-ship] ──→  [cutover]
</span></span><span class="line"><span class="cl">APM        [audit] ──→  [deploy Tempo]    ──→  [dual-ship] ──→  [cutover]
</span></span><span class="line"><span class="cl">Log        [audit] ──→  [deploy Loki]     ──→  [dual-ship] ──→  [cutover]
</span></span><span class="line"><span class="cl">Dashboard  [audit] ──→  [deploy Grafana]  ──→  [rebuild]   ──→  [cutover]
</span></span><span class="line"><span class="cl">Alert      [audit] ──→  [deploy Alertmgr] ──→  [parallel]  ──→  [cutover]</span></span></code></pre></div><p>Each stream runs its own dual-ship and cutover and does not need to stay in sync with the others. Typically <em>metrics migrate first</em> (cardinality problems surface fastest), then logs, and APM last (trace correlation depends most on dashboards and alerts).</p>
<h2 id="agent-migrationdatadog-agent--otel-collector--alloy">Agent migration: Datadog Agent → OTel Collector / Alloy</h2>
<p>Datadog Agent is a vendor-specific binary; pull it out and replace it with OpenTelemetry Collector / Grafana Alloy:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl">// Alloy config (Alloy syntax, HCL-like)
</span></span><span class="line"><span class="cl">// Discover pods once; the metric and log pipelines both reuse these targets.
</span></span><span class="line"><span class="cl">discovery.kubernetes &#34;pods&#34; {
</span></span><span class="line"><span class="cl">  role = &#34;pod&#34;
</span></span><span class="line"><span class="cl">}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">prometheus.scrape &#34;k8s_pods&#34; {
</span></span><span class="line"><span class="cl">  targets    = discovery.kubernetes.pods.targets
</span></span><span class="line"><span class="cl">  forward_to = [prometheus.remote_write.mimir.receiver]
</span></span><span class="line"><span class="cl">}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">prometheus.remote_write &#34;mimir&#34; {
</span></span><span class="line"><span class="cl">  endpoint {
</span></span><span class="line"><span class="cl">    url = &#34;https://mimir.internal/api/v1/push&#34;
</span></span><span class="line"><span class="cl">  }
</span></span><span class="line"><span class="cl">}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">loki.source.kubernetes &#34;pods&#34; {
</span></span><span class="line"><span class="cl">  targets    = discovery.kubernetes.pods.targets
</span></span><span class="line"><span class="cl">  forward_to = [loki.write.production.receiver]
</span></span><span class="line"><span class="cl">}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">// Assumed Loki push endpoint; adjust to your deployment.
</span></span><span class="line"><span class="cl">loki.write &#34;production&#34; {
</span></span><span class="line"><span class="cl">  endpoint {
</span></span><span class="line"><span class="cl">    url = &#34;https://loki.internal/loki/api/v1/push&#34;
</span></span><span class="line"><span class="cl">  }
</span></span><span class="line"><span class="cl">}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">otelcol.receiver.otlp &#34;default&#34; {
</span></span><span class="line"><span class="cl">  grpc {}
</span></span><span class="line"><span class="cl">  output {
</span></span><span class="line"><span class="cl">    traces = [otelcol.exporter.otlp.tempo.input]
</span></span><span class="line"><span class="cl">  }
</span></span><span class="line"><span class="cl">}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">// Assumed Tempo OTLP gRPC endpoint; adjust to your deployment.
</span></span><span class="line"><span class="cl">otelcol.exporter.otlp &#34;tempo&#34; {
</span></span><span class="line"><span class="cl">  client {
</span></span><span class="line"><span class="cl">    endpoint = &#34;tempo.internal:4317&#34;
</span></span><span class="line"><span class="cl">  }
</span></span><span class="line"><span class="cl">}</span></span></code></pre></div><p>During the migration, running a <em>dual-shipper</em> is standard practice:</p>
<ul>
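<li>Datadog Agent and Alloy run side by side (agent capacity doubles in the short term)</li>
<li>Ship from the same host to both backends at once and check that the data agrees</li>
<li>Gradually disable Datadog Agent's metric / log / APM submodules</li>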
</ul>
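<p>One way to check that both backends agree during dual-shipping is to pull the same series over the same window from each side and diff them. The following is a minimal sketch, assuming the values have already been fetched into plain dictionaries; the series keys and the 5% tolerance are hypothetical, and no Datadog or Mimir API is called here.</p>

```python
from math import isclose


def compare_backends(datadog, mimir, rel_tol=0.05):
    """Diff per-series values collected from both backends for one window.

    Both arguments map a series key to a numeric value. The result lists
    series that exist on only one side and series whose values diverge
    by more than rel_tol (relative to the larger of the two values).
    """
    shared = set(datadog) & set(mimir)
    return {
        "missing_in_mimir": sorted(set(datadog) - set(mimir)),
        "missing_in_datadog": sorted(set(mimir) - set(datadog)),
        "diverging": sorted(
            key
            for key in shared
            if not isclose(datadog[key], mimir[key], rel_tol=rel_tol)
        ),
    }


if __name__ == "__main__":
    dd = {"http_requests{svc=api}": 1000.0, "cpu_user{host=a}": 0.40}
    mm = {"http_requests{svc=api}": 1010.0}
    print(compare_backends(dd, mm))
```

<p>Running a report like this per stream gives a concrete exit criterion for the dual-ship phase: an empty "missing" list and a "diverging" list that only contains series you can explain.</p>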
<h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1cardinality-爆mimir-端-series-暴增">Case 1: cardinality explosion, series count surges on the Mimir side</h3>
<p><strong>Symptom</strong>: 30K series on the Datadog side become 500K series once shipped to Mimir, and the Mimir ingesters OOM.</p>
<p><strong>Root cause</strong>: Datadog internally applies <em>automatic aggregation</em> and <em>low-cardinality enforcement</em> to tags; Prometheus / Mimir count <em>every unique label set</em> as one series, so high-cardinality labels from application code (user_id / request_id) blow up the series count directly.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Audit phase</strong>: run <code>topk(100, count by (__name__) ({__name__=~&quot;.+&quot;}))</code> to find high-cardinality metrics</li>
<li><strong>Drop high-cardinality labels</strong>: use <code>relabel</code> rules in Alloy / OTel Collector to drop unbounded labels such as user_id</li>
<li><strong>Rework histogram buckets</strong>: high cardinality usually comes from label combinations, so switch to fixed-bucket histograms</li>
<li><strong>Convert metrics to logs where appropriate</strong>: a request ID is trace context and should not be a metric label</li>
</ol>
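<p>The PromQL query in the audit step ranks metric names by series count; the follow-up question is which <em>label</em> is responsible. A minimal sketch of that second step, assuming the label sets have already been exported as a list of dictionaries (the threshold of 100 distinct values is a hypothetical cut-off, not a Mimir default):</p>

```python
from collections import defaultdict


def label_cardinality(series):
    """Count distinct values per label name across a list of label sets."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}


def unbounded_labels(series, max_values=100):
    """Return (label, distinct_count) pairs above max_values, worst first."""
    counts = label_cardinality(series)
    flagged = [(name, n) for name, n in counts.items() if n > max_values]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample = [
        {"__name__": "http_requests_total", "method": "GET", "user_id": str(i)}
        for i in range(500)
    ]
    print(unbounded_labels(sample))
```

<p>Labels that show up in this list are the candidates for the relabel drop rules; labels with a small, fixed set of values (method, status class, region) can stay.</p>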
<h3 id="case-2log-volume-cost-預估失準">Case 2: log volume cost estimate misses</h3>
<p><strong>Symptom</strong>: one month after deploying Loki, the S3 bill is 2x the estimate; both object storage and query GB-scanned exceed expectations.</p>
<p><strong>Root cause</strong>: Datadog applies automatic sampling / aggregation to logs and bills on indexed events; Loki does <em>full raw ingest</em> plus S3 cold storage and is billed on actual bytes. Raw log volume is 3-10x higher than indexed events.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Ingest-side sampling</strong>: sample debug / info logs in Alloy / Promtail and ingest only warn / error in full</li>
<li><strong>Log structure</strong>: JSON logs compress better than text logs, cutting Loki's S3 footprint by about 50%</li>
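<li><strong>Retention tier</strong>: hot tier for 7 days on S3 Standard, cold tier for 1 year on S3 Glacier, governed by a retention budget (archive classes that need a restore before read cannot be queried directly, so verify the storage class first)</li>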
</ol>
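<p>The ingest-side sampling rule can be prototyped before it is written as an Alloy or Promtail pipeline stage. The sketch below is an illustration in Python rather than collector configuration; the level names and the 10% default rate are assumptions. It keeps every warn / error line and a deterministic fraction of everything else.</p>

```python
import zlib

# Levels that are always ingested in full (assumed naming).
ALWAYS_KEEP = {"warn", "warning", "error", "fatal"}


def keep_log(level, line, sample_rate=0.1):
    """Decide whether a log line should be shipped.

    Lines at warn level or above are always kept. Other lines are kept
    when a hash of the line falls under sample_rate, so the decision is
    deterministic: replaying the same input yields the same output.
    """
    if level.lower() in ALWAYS_KEEP:
        return True
    bucket = zlib.crc32(line.encode("utf-8")) % 10_000
    return int(sample_rate * 10_000) > bucket


if __name__ == "__main__":
    print(keep_log("error", "payment failed"))
    print(keep_log("debug", "cache miss key=abc"))
```

<p>Because the hash covers only the line text, identical lines always get the same decision; include a timestamp or trace ID in the hashed key if repeated lines should be sampled independently.</p>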
<h3 id="case-3datadog-dashboard-不能直接轉-grafana">Case 3: Datadog dashboards cannot be converted straight to Grafana</h3>
<p><strong>Symptom</strong>: the migration plan assumed "automatic dashboard conversion"; in practice, running Datadog API export → Grafana import left 80% of dashboards with missing widgets or mismatched metrics.</p>
<p><strong>Root cause</strong>:</p>
<ul>
<li>Datadog query syntax is not directly compatible with the PromQL used by Grafana / Mimir</li>
<li>Datadog widget types (top-list / hostmap) have no direct Grafana equivalent</li>
<li>Tag-based aggregation maps to Prometheus labels, but the syntax differs</li>
</ul>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Accept a rebuild</strong>: production-grade dashboards must be rebuilt by hand; do not expect automatic conversion</li>
<li><strong>Prioritize</strong>: rebuild the 30% that is <em>SOC-facing / production-critical</em> first and deprecate the rest</li>
<li><strong>Add 4-6 weeks to the migration window</strong>: dashboard rebuild is an underestimated effort</li>
</ol>
<h3 id="case-4alert-routing-換邏輯pagerduty-integration-不通">Case 4: alert routing logic changes, PagerDuty integration breaks</h3>
<p><strong>Symptom</strong>: after cutover, alerts stop reaching PagerDuty and the SOC takes half an hour to notice. The webhook on the alert side is configured correctly, but the payload format differs from Datadog's, so PagerDuty-side rules filter the events out.</p>
<p><strong>Root cause</strong>:</p>
<ul>
<li>Datadog alert payloads carry <code>event_type=alert</code>, which the PagerDuty integration uses for routing</li>
<li>Alertmanager's default payload has a different structure</li>
<li>PagerDuty rules were written against the Datadog event schema, so Alertmanager events do not match</li>
</ul>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Pre-cutover test</strong>: do a dry run of Alertmanager → PagerDuty and send a test alert to verify delivery</li>
<li><strong>PagerDuty Service</strong>: create a separate Grafana-source Service instead of sharing the Datadog Service</li>
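<li><strong>Alertmanager template</strong>: shape the payload so it approximates the Datadog structure. Alertmanager's generic webhook receiver sends a fixed JSON body, so in practice this usually means templating the fields of the PagerDuty receiver or placing a small translating adapter behind the webhook</li>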
</ol>
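<p>A pre-cutover check can be as simple as translating a sample Alertmanager alert into the event PagerDuty expects and validating it before anything is sent. The sketch below assumes the PagerDuty Events API v2 body (<code>routing_key</code>, <code>event_action</code>, <code>payload.summary</code>, <code>payload.source</code>, <code>payload.severity</code>) and the alert fields of Alertmanager's webhook format; the severity mapping and the fallback values are hypothetical choices.</p>

```python
# Map the Alertmanager "severity" label onto PagerDuty severities (assumed mapping).
SEVERITY_MAP = {"critical": "critical", "warning": "warning", "info": "info"}


def to_pagerduty_event(alert, routing_key):
    """Translate one Alertmanager webhook alert into an Events API v2 body."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    name = labels.get("alertname", "unnamed alert")
    resolved = alert.get("status") == "resolved"
    return {
        "routing_key": routing_key,
        "event_action": "resolve" if resolved else "trigger",
        "dedup_key": alert.get("fingerprint", name),
        "payload": {
            "summary": annotations.get("summary", name),
            "source": labels.get("instance", "alertmanager"),
            "severity": SEVERITY_MAP.get(labels.get("severity", ""), "error"),
            "custom_details": labels,
        },
    }


def validate_event(event):
    """Return the required Events API v2 fields that are missing or empty."""
    missing = [k for k in ("routing_key", "event_action") if not event.get(k)]
    payload = event.get("payload", {})
    missing += [
        f"payload.{k}" for k in ("summary", "source", "severity") if not payload.get(k)
    ]
    return missing


if __name__ == "__main__":
    sample = {
        "status": "firing",
        "labels": {"alertname": "HighErrorRate", "severity": "critical"},
        "annotations": {"summary": "5xx above 2%"},
        "fingerprint": "abc123",
    }
    event = to_pagerduty_event(sample, "EXAMPLE_ROUTING_KEY")
    print(event, validate_event(event))
```

<p>Running the translated event through the PagerDuty rules in a test Service before cutover shows whether the routing conditions written for Datadog events still match.</p>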
<h3 id="case-5slo-definition-跟-monitor-type-對不上">Case 5: SLO definitions do not line up with monitor types</h3>
<p><strong>Symptom</strong>: a Datadog SLO reports 99.9% availability; after moving to Grafana SLO + Mimir, the measured 9X% figure no longer agrees. The SOC compared 5 SLOs on dashboards and 4 of them were off by 0.1-0.3%.</p>
<p><strong>Root cause</strong>:</p>
<ul>
<li>Datadog computes an SLO over its time window with an internal query; Grafana SLO uses a formula written in PromQL</li>
<li>Datadog handles missing data for <code>success_rate</code> differently from PromQL's defaults</li>
<li>Time bucket boundaries are handled differently</li>
</ul>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Redefine the SLO in PromQL</strong>: do not try to "copy" it; "redefine" it and write the PromQL expression carefully</li>
<li><strong>Accept ±0.1% drift</strong>: run production-critical SLOs dual-track for 1-2 months and tune the PromQL until the drift is acceptable</li>
<li><strong>SLO migration is not a subset of dashboard migration</strong>: treat it as an independent stream and budget more time</li>
</ol>
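<p>The effect of missing-data handling on an SLO number is easy to see with a toy calculation. The sketch below is illustrative only: it does not reproduce how Datadog or PromQL actually treat gaps, and the three policies (<code>exclude</code>, <code>good</code>, <code>bad</code>) are hypothetical names. It shows that the same request counts yield different availability depending on the policy.</p>

```python
def availability(buckets, missing_as="exclude"):
    """Compute availability from (good, total) request counts per time bucket.

    A bucket of None means no data was recorded for that interval.
    missing_as controls how such gaps are treated:
      "exclude" ignores them,
      "good"    counts them as fully successful,
      "bad"     counts them as fully failed.
    Gaps are weighted with the mean request volume of observed buckets.
    """
    observed = [b for b in buckets if b is not None]
    good = sum(g for g, _ in observed)
    total = sum(t for _, t in observed)
    gaps = len(buckets) - len(observed)
    if gaps and observed and missing_as != "exclude":
        mean_total = total / len(observed)
        total += gaps * mean_total
        if missing_as == "good":
            good += gaps * mean_total
    return good / total if total else float("nan")


if __name__ == "__main__":
    window = [(999, 1000)] * 9 + [None]
    for policy in ("exclude", "good", "bad"):
        print(policy, round(availability(window, policy), 4))
```

<p>With nine buckets at 999/1000 and one gap, the three policies give 0.999, 0.9991, and 0.8991. Differences of this kind are worth ruling out first when two SLO implementations disagree by a fraction of a percent.</p>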
<h2 id="capacity--cost-對照">Capacity / cost comparison</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Datadog</th>
          <th>Grafana Stack (self-hosted on K8s)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Setup cost</td>
          <td>Low (SaaS)</td>
          <td>Medium-high (K8s deploy + storage backend)</td>
      </tr>
      <tr>
          <td>Operational cost (200 hosts)</td>
          <td>$34K / month</td>
          <td>$8-12K / month (including S3 + K8s)</td>
      </tr>
      <tr>
          <td>Operational cost (500 hosts)</td>
          <td>$80-150K / month</td>
          <td>$15-30K / month</td>
      </tr>
      <tr>
          <td>Operational FTE</td>
          <td>0.1-0.3</td>
          <td>1-2 FTE (K8s + storage + Grafana operator)</td>
      </tr>
      <tr>
          <td>Long-term retention</td>
          <td>$1.27 / million events for 15+ days</td>
          <td>S3 + Loki: ~$0.02 / GB / month</td>
      </tr>
      <tr>
          <td>Multi-cloud / hybrid</td>
          <td>Limited to Datadog regions</td>
          <td>Deploy anywhere</td>
      </tr>
      <tr>
          <td>Vendor lock-in</td>
          <td>High</td>
          <td>Low (OSS + OTel)</td>
      </tr>
      <tr>
          <td>Time to value</td>
          <td>1-2 weeks</td>
          <td>4-8 weeks</td>
      </tr>
      <tr>
          <td>Migration cost (one-time)</td>
          <td>-</td>
          <td>1-3 FTE × 3 months</td>
      </tr>
  </tbody>
</table>
<p><strong>Break-even point</strong>: at around 150 hosts, self-hosted becomes cheaper once costs are amortized over 3 years; below 100 hosts, SaaS delivers the better ROI.</p>
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-opentelemetry-對齊">Aligning with OpenTelemetry</h3>
<p>The migration is an opportunity for an <em>OTel-first transformation</em>:</p>
<ul>
<li>Application code uses the OTel SDK, avoiding Datadog SDK lock-in</li>
<li>Trace context propagation follows W3C Trace Context</li>
<li>Future backend changes no longer require touching the application</li>
</ul>
<h3 id="跟-splunk--elastic-對照">Compared with <a href="/blog/backend/07-security-data-protection/vendors/splunk/migrate-to-elastic-security/" data-link-title="Splunk → Elastic Security Detection Rule Migration：6 段 phased playbook 跟 5 大踩雷" data-link-desc="從 Splunk Enterprise Security 遷到 Elastic Security 的 detection rule translation playbook：SPL ↔ KQL/ES|QL schema 對位、AI-assisted translation pipeline、parallel run 比對、cutover routing、5 個 production 踩雷（macro 沒對應 / time zone 差異 / summary index 不對位 / alert dedup key 衝突 / 過早 decommission）、capacity / cost 對照">Splunk → Elastic</a></h3>
<p>Both are <em>cost-driven SaaS migrations</em>, but the details differ:</p>
<ul>
<li>Splunk → Elastic is in the SIEM domain, where schema translation is the core issue</li>
<li>Datadog → Grafana is a multi-tool split, where rebuilding agents and dashboards is the core issue</li>
<li>Shared pattern: dual-ship → parallel run → cutover</li>
</ul>
<h3 id="反向遷移grafana-stack--datadog">Reverse migration (Grafana Stack → Datadog)</h3>
<p>It happens, but rarely. The main motive is <em>operational complexity reduction</em> (not wanting to self-manage Mimir / Loki); the schema mapping runs in the opposite direction and the agent goes back to Datadog Agent.</p>
<h3 id="下一步議題">Topics for next steps</h3>
<ul>
<li><strong>Grafana Cloud hybrid</strong>: run some components (Tempo) on Grafana Cloud SaaS and self-host the rest in a hybrid architecture</li>
<li><strong>OpenTelemetry Collector vs Alloy trade-offs</strong>: both are OTel-based; Alloy is Grafana's own distribution of the Collector</li>
<li><strong>Vector vs Alloy vs Fluentd</strong>: the log shipper battleground, compared on cost / features / OTel integration</li>
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Source vendor: <a href="/blog/backend/04-observability/vendors/datadog/" data-link-title="Datadog" data-link-desc="All-in-one SaaS 觀測平台、APM / Logs / Metrics / RUM / Security">Datadog</a></li>
<li>Target vendor: <a href="/blog/backend/04-observability/vendors/grafana-stack/" data-link-title="Grafana Stack" data-link-desc="Grafana / Loki / Tempo / Mimir / Pyroscope 全棧">Grafana Stack</a></li>
<li>Parallel vendors: <a href="/blog/backend/04-observability/vendors/elastic-stack/" data-link-title="Elastic Stack" data-link-desc="ELK：Elasticsearch / Logstash / Kibana &#43; Beats / APM">Elastic Stack</a> / <a href="/blog/backend/04-observability/vendors/opentelemetry/" data-link-title="OpenTelemetry" data-link-desc="可觀測性開放標準、SDK 與 Collector">OpenTelemetry</a></li>
<li>Parallel migration playbooks: <a href="/blog/backend/07-security-data-protection/vendors/splunk/migrate-to-elastic-security/" data-link-title="Splunk → Elastic Security Detection Rule Migration：6 段 phased playbook 跟 5 大踩雷" data-link-desc="從 Splunk Enterprise Security 遷到 Elastic Security 的 detection rule translation playbook：SPL ↔ KQL/ES|QL schema 對位、AI-assisted translation pipeline、parallel run 比對、cutover routing、5 個 production 踩雷（macro 沒對應 / time zone 差異 / summary index 不對位 / alert dedup key 衝突 / 過早 decommission）、capacity / cost 對照">Splunk → Elastic Security</a> / <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-dragonflydb/" data-link-title="Redis → DragonflyDB：drop-in 相容下的容量躍升 &#43; 5 個踩雷" data-link-desc="DragonflyDB 號稱 Redis drop-in 替代、單機 throughput 25x、記憶體效率 30% 提升；遷移流程簡單但有 5 個 production 踩雷（RDB 版本差 / Lua 腳本不全支援 / Pub-Sub fanout 行為差異 / Cluster mode 兼容度 / Modules 不支援）、跟 Sentinel / Cluster 模式對位">Redis → DragonflyDB</a> / <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora</a></li>
<li>Methodology: <a href="/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">Writing methodology for vendor deep-dive technical articles</a></li>
</ul>
]]></content:encoded></item></channel></rss>