<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Msk on Tarragon</title><link>https://tarrragon.github.io/blog/tags/msk/</link><description>Recent content in Msk on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Tue, 19 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/tags/msk/index.xml" rel="self" type="application/rss+xml"/><item><title>Self-managed Kafka → AWS MSK: breaking down a $15K/month operational cost on the way to managed</title><link>https://tarrragon.github.io/blog/backend/03-message-queue/vendors/kafka/migrate-to-msk/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/03-message-queue/vendors/kafka/migrate-to-msk/</guid><description>&lt;blockquote>
&lt;p>This post is a cross-vendor migration playbook that cross-links &lt;a href="https://tarrragon.github.io/blog/backend/03-message-queue/vendors/kafka/" data-link-title="Apache Kafka" data-link-desc="Distributed event streaming platform、log-based 模型">Kafka&lt;/a> and AWS MSK. Running the &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology 6-dimension audit&lt;/a> maps it to &lt;em>Operational = High (self-managed → AWS managed) → Type C operational redesign hybrid&lt;/em>.&lt;/p>&lt;/blockquote>
&lt;h2 id="15kmonth-operational-cost-拆解">$15K/month operational cost breakdown&lt;/h2>
&lt;p>Like &lt;a href="https://tarrragon.github.io/blog/backend/04-observability/vendors/datadog/migrate-to-grafana-stack/" data-link-title="Datadog → Grafana Stack：把 $50K/month bill 拆解到 self-hosted observability" data-link-desc="Datadog 五層計費（host APM / metric / log ingest / log retention / RUM）拆解、對位 Grafana Stack（Mimir / Loki / Tempo / Grafana / Alloy）的 5 層責任；OTel-based agent migration、5 個 production 踩雷（cardinality 爆 / log volume cost / dashboard 不直接轉 / alert routing 換邏輯 / SLO definition 差異）、cost reality check">Datadog → Grafana Stack&lt;/a> (H cost variant), this post opens with a cost breakdown rather than a "why migrate" driver list:&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Self-managed Kafka cost item&lt;/th>
 &lt;th>Mid-size (3 brokers + 3 ZK + monitoring) / month&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>EC2 (3× r6g.xlarge broker)&lt;/td>
 &lt;td>$660&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>EBS (3× 1TB io2)&lt;/td>
 &lt;td>$1,500&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>EC2 (3× t3.medium ZK / KRaft)&lt;/td>
 &lt;td>$90&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Monitoring (Prometheus + Grafana on EC2)&lt;/td>
 &lt;td>$200&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Backup S3 (1TB)&lt;/td>
 &lt;td>$25&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Cross-AZ traffic&lt;/td>
 &lt;td>$300&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>Operational FTE (0.5)&lt;/strong>&lt;/td>
 &lt;td>&lt;strong>$5,000-8,000&lt;/strong>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Patching window cost&lt;/td>
 &lt;td>$200 (downtime opportunity)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Infrastructure subtotal (excluding FTE)&lt;/td>
 &lt;td>$2,975&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Total with 0.5 FTE&lt;/td>
 &lt;td>$7,975-10,975&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Total with 1 FTE&lt;/td>
 &lt;td>&lt;strong>$12,975-18,975&lt;/strong>&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>The largest cost block is operational FTE, not infrastructure&lt;/strong>. MSK shifts 50-80% of the operational work to AWS and leaves application and cost monitoring to the SRE team.&lt;/p>
&lt;p>Running the &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">6-dimension diff audit&lt;/a>:&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This post is a cross-vendor migration playbook that cross-links <a href="/blog/backend/03-message-queue/vendors/kafka/" data-link-title="Apache Kafka" data-link-desc="Distributed event streaming platform、log-based 模型">Kafka</a> and AWS MSK. Running the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology 6-dimension audit</a> maps it to <em>Operational = High (self-managed → AWS managed) → Type C operational redesign hybrid</em>.</p></blockquote>
<h2 id="15kmonth-operational-cost-拆解">$15K/month operational cost breakdown</h2>
<p>Like <a href="/blog/backend/04-observability/vendors/datadog/migrate-to-grafana-stack/" data-link-title="Datadog → Grafana Stack：把 $50K/month bill 拆解到 self-hosted observability" data-link-desc="Datadog 五層計費（host APM / metric / log ingest / log retention / RUM）拆解、對位 Grafana Stack（Mimir / Loki / Tempo / Grafana / Alloy）的 5 層責任；OTel-based agent migration、5 個 production 踩雷（cardinality 爆 / log volume cost / dashboard 不直接轉 / alert routing 換邏輯 / SLO definition 差異）、cost reality check">Datadog → Grafana Stack</a> (H cost variant), this post opens with a cost breakdown rather than a "why migrate" driver list:</p>
<table>
  <thead>
      <tr>
          <th>Self-managed Kafka cost item</th>
          <th>Mid-size (3 brokers + 3 ZK + monitoring) / month</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>EC2 (3× r6g.xlarge broker)</td>
          <td>$660</td>
      </tr>
      <tr>
          <td>EBS (3× 1TB io2)</td>
          <td>$1,500</td>
      </tr>
      <tr>
          <td>EC2 (3× t3.medium ZK / KRaft)</td>
          <td>$90</td>
      </tr>
      <tr>
          <td>Monitoring (Prometheus + Grafana on EC2)</td>
          <td>$200</td>
      </tr>
      <tr>
          <td>Backup S3 (1TB)</td>
          <td>$25</td>
      </tr>
      <tr>
          <td>Cross-AZ traffic</td>
          <td>$300</td>
      </tr>
      <tr>
          <td><strong>Operational FTE (0.5)</strong></td>
          <td><strong>$5,000-8,000</strong></td>
      </tr>
      <tr>
          <td>Patching window cost</td>
          <td>$200 (downtime opportunity)</td>
      </tr>
      <tr>
          <td>Infrastructure subtotal (excluding FTE)</td>
          <td>$2,975</td>
      </tr>
      <tr>
          <td>Total with 0.5 FTE</td>
          <td>$7,975-10,975</td>
      </tr>
      <tr>
          <td>Total with 1 FTE</td>
          <td><strong>$12,975-18,975</strong></td>
      </tr>
  </tbody>
</table>
<p><strong>The largest cost block is operational FTE, not infrastructure</strong>. MSK shifts 50-80% of the operational work to AWS and leaves application and cost monitoring to the SRE team.</p>
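<p>The totals in the cost table can be re-derived from its line items. The sketch below sums the non-FTE rows and then adds the FTE row; the variable names are illustrative, and scaling the 0.5 FTE range linearly up to 1 FTE is an assumption, not something the table states.</p>

```python
# Monthly line items from the cost table, in USD.
INFRA_ITEMS = {
    "ec2_brokers": 660,
    "ebs_io2": 1500,
    "ec2_zk_or_kraft": 90,
    "monitoring": 200,
    "s3_backup": 25,
    "cross_az_traffic": 300,
    "patching_window": 200,
}
HALF_FTE_RANGE = (5000, 8000)  # the table's 0.5 FTE row


def monthly_total(fte_count):
    """Return (low, high) monthly cost for a given number of ops FTEs."""
    infra = sum(INFRA_ITEMS.values())
    scale = fte_count / 0.5  # assumption: FTE cost scales linearly
    low, high = HALF_FTE_RANGE
    return (infra + low * scale, infra + high * scale)


print(sum(INFRA_ITEMS.values()))  # non-FTE subtotal
print(monthly_total(0.5))
print(monthly_total(1.0))
```

<p>With these inputs the non-FTE items come to $2,975 a month, so even the low end of the 0.5 FTE row outweighs every infrastructure item combined.</p>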
<p>Running the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">6-dimension diff audit</a>:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Assessment</th>
          <th>Level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>Same Kafka protocol; client SDK unchanged</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>Self-managed → AWS managed; HA and patching handled by AWS</td>
          <td><strong>High</strong></td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>Same Kafka log-based model</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Components</td>
          <td>Still a single Kafka cluster</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>Auth config changes (IAM / SASL); everything else unchanged</td>
          <td>Low-Medium</td>
      </tr>
      <tr>
          <td>Data topology</td>
          <td>Same broker + partition layout</td>
          <td>Low</td>
      </tr>
  </tbody>
</table>
<p>Operational = High (everything else Low to Low-Medium) → <strong>Type C operational redesign hybrid</strong>.</p>
<h2 id="為什麼遷fte--availability--consistency-三條-driver">Why migrate: three drivers (FTE / availability / consistency)</h2>
<ul>
<li><strong>Operational FTE</strong>: running self-managed Kafka + ZooKeeper / KRaft + Prometheus end to end takes 0.5-1 FTE; MSK takes over patching and HA</li>
<li><strong>Availability</strong>: MSK spreads brokers across AZs and recovers failed brokers automatically; on a self-managed cluster, broker-failure RTO is 30 minutes to 2 hours</li>
<li><strong>Consistency with cloud stack</strong>: for teams already deep on AWS (RDS / S3 / Lambda), MSK lives in the same VPC with IAM auth, which lowers cross-vendor setup cost</li>
</ul>
<p>Reverse drivers (MSK → self-managed):</p>
<ul>
<li>At large throughput / GB scale, MSK's cost across brokers inverts (cost &gt; self-managed)</li>
<li>Kafka customization is required (custom plugins / early KRaft adoption / non-AWS regions)</li>
<li>Multi-cloud / hybrid architectures that want to avoid vendor lock-in</li>
</ul>
<h2 id="operational-redesign-對位">Operational redesign mapping</h2>
<p>Same Type C pattern as <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora</a> / <a href="/blog/backend/01-database/vendors/mongodb/migrate-to-atlas/" data-link-title="MongoDB → Atlas：Atlas 不是 MongoDB &#43; managed、是另一個 product" data-link-desc="Atlas 號稱「MongoDB managed」但 operational model 完全不同（auto-scaling / VPC peering / IAM-driven access / 內建 backup / billing 模型）；本文採用 Type C operational redesign hybrid 結構、4-phase operational migration &#43; drop-in cutover、5 個 production 踩雷（連線數限制 / IP whitelist / backup retention / IAM token 過期 / billing 暴漲）">MongoDB → Atlas</a>:</p>
<table>
  <thead>
      <tr>
          <th>Operational concept</th>
          <th>Self-managed Kafka</th>
          <th>MSK</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cluster bootstrap</td>
          <td>Manually configure brokers + ZK + server.properties</td>
          <td>Created in one step via console / Terraform</td>
      </tr>
      <tr>
          <td>HA</td>
          <td>Self-managed replicas + ISR + broker placement</td>
          <td>Automatic multi-AZ + auto-recovery</td>
      </tr>
      <tr>
          <td>Patching</td>
          <td>Rolling restart, manual or tool-driven</td>
          <td>Automatic, in MSK's monthly maintenance window</td>
      </tr>
      <tr>
          <td>Backup</td>
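          <td>Self-managed MirrorMaker / cluster snapshot</td>
          <td>No built-in topic backup; replicate with MSK Replicator / MirrorMaker 2 or archive to S3 through a sink connector</td>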
      </tr>
      <tr>
          <td>Authentication</td>
          <td>Self-managed SASL/SCRAM / mTLS</td>
          <td>IAM auth (recommended) / SASL/SCRAM via Secrets Manager</td>
      </tr>
      <tr>
          <td>Monitoring</td>
          <td>Self-built Prometheus + JMX exporter</td>
          <td>CloudWatch + open monitoring + Prometheus</td>
      </tr>
      <tr>
          <td>Sizing</td>
          <td>Manually chosen broker instance class</td>
          <td>MSK broker size (kafka.m5.large and up)</td>
      </tr>
      <tr>
          <td>Configuration</td>
          <td>Full control over server.properties</td>
          <td>MSK configuration (only a restricted set of parameters is tunable)</td>
      </tr>
      <tr>
          <td>Cluster topology</td>
          <td>Self-managed placement / rack awareness</td>
          <td>Automatic multi-AZ + rack-aware placement</td>
      </tr>
      <tr>
          <td>Tiered storage</td>
          <td>Kafka 3.6+, self-managed</td>
          <td>MSK Tiered Storage (automatic tiering to S3)</td>
      </tr>
  </tbody>
</table>
<p>Every row needs its own migration plan: application code stays the same, but <em>the entire body of operational knowledge is replaced</em>.</p>
<h2 id="4-phase-migrationtype-c-標準流程">4-phase migration (standard Type C flow)</h2>
<h3 id="phase-0pre-migration-audit">Phase 0: Pre-migration audit</h3>
<ul>
<li><strong>Workload sizing → MSK broker class</strong>: current throughput / partition count / topic count</li>
<li><strong>Application connection pattern audit</strong>: do client producers / consumers use SASL, mTLS, or plaintext, and which applications use which?</li>
<li><strong>Topic config audit</strong>: retention / replication factor / cleanup policy</li>
<li><strong>Backup pattern audit</strong>: is MirrorMaker or another cross-cluster mirror in place?</li>
</ul>
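<p>One way to make the topic config audit repeatable is to snapshot each topic's settings and diff them against what ends up on MSK. A minimal sketch, assuming the settings were already exported into plain dictionaries; the export step, the <code>AUDITED_KEYS</code> list, and the sample topic are hypothetical.</p>

```python
# Settings the topic config audit calls out; extend as needed.
AUDITED_KEYS = ("retention.ms", "cleanup.policy", "replication.factor")


def diff_topic_configs(source, target):
    """Compare per-topic settings between two clusters.

    source and target map a topic name to a dict of settings.
    Returns a list of (topic, key, source_value, target_value)
    mismatches; a topic missing on the target shows up with
    target_value None for every audited key it defines.
    """
    mismatches = []
    for topic in sorted(source):
        target_cfg = target.get(topic, {})
        for key in AUDITED_KEYS:
            src_val = source[topic].get(key)
            tgt_val = target_cfg.get(key)
            if src_val != tgt_val:
                mismatches.append((topic, key, src_val, tgt_val))
    return mismatches


self_managed = {
    "orders": {"retention.ms": "604800000", "cleanup.policy": "delete", "replication.factor": 3},
}
msk = {
    "orders": {"retention.ms": "86400000", "cleanup.policy": "delete", "replication.factor": 3},
}
print(diff_topic_configs(self_managed, msk))
```

<p>Running the same diff again after MM2 has synced topic configs gives a quick check that nothing drifted before cutover.</p>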
<h3 id="phase-1msk-cluster-建置2-3-週">Phase 1: MSK cluster provisioning (2-3 weeks)</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl">resource "aws_msk_cluster" "main" {
  cluster_name           = "production"
  kafka_version          = "3.6.0"
  number_of_broker_nodes = 3

  broker_node_group_info {
    instance_type   = "kafka.m5.large"
    client_subnets  = var.private_subnets
    security_groups = [aws_security_group.msk.id]
    storage_info {
      ebs_storage_info {
        volume_size = 1000
        # No provisioned_throughput block here: MSK only supports
        # provisioned storage throughput on kafka.m5.4xlarge and
        # larger brokers, so it would be rejected on kafka.m5.large.
      }
    }
  }

  client_authentication {
    sasl {
      iam   = true  # IAM auth (recommended)
      scram = false
    }
  }

  configuration_info {
    arn      = aws_msk_configuration.main.arn
    revision = aws_msk_configuration.main.latest_revision
  }

  encryption_info {
    encryption_in_transit {
      client_broker = "TLS"
    }
  }

  logging_info {
    broker_logs {
      cloudwatch_logs {
        enabled   = true
        log_group = aws_cloudwatch_log_group.msk.name
      }
    }
  }
}</code></pre></div><h3 id="phase-2data-migrationmirrormaker-20">Phase 2: Data migration (MirrorMaker 2.0)</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text">Self-managed Kafka ──(MM2)──→ MSK
                       │
                consumer offset sync
                       │
                topic config sync</code></pre></div><p>MM2 runs for 1-7 days, depending on topic volume and retention period; move on to cutover once the MM2 replication lag has caught up.</p>
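<p>"Caught up" is easier to act on as an explicit rule. The sketch below is one hedged way to express it: cutover is allowed only after every partition has stayed at or below a lag threshold for several consecutive samples. How the lag is measured (for example from MM2 metrics) is left out, and the function name, threshold, and sample count are assumptions.</p>

```python
def ready_for_cutover(lag_samples, max_lag=0, stable_samples=3):
    """Decide whether mirroring has caught up.

    lag_samples: list of dicts, oldest first, each mapping a
    (topic, partition) tuple to the replication lag in records.
    Ready only when every partition is at or below max_lag in each
    of the last stable_samples samples.
    """
    if len(lag_samples) < stable_samples:
        return False
    recent = lag_samples[-stable_samples:]
    return all(
        lag <= max_lag
        for sample in recent
        for lag in sample.values()
    )


samples = [
    {("orders", 0): 120},
    {("orders", 0): 0},
    {("orders", 0): 0},
    {("orders", 0): 0},
]
print(ready_for_cutover(samples))
```

<p>Requiring several consecutive quiet samples avoids cutting over on a momentary dip while producers are idle.</p>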
<h3 id="phase-3cutover">Phase 3: Cutover</h3>
<ul>
<li>Switch the application's bootstrap.servers from the self-managed cluster to MSK</li>
<li>Move producers over gradually (10% → 50% → 100%)</li>
<li>When consumers switch, they resume from the offsets that MM2 synced</li>
<li>Keep the self-managed cluster as a read-only standby for 2 weeks</li>
</ul>
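<p>The gradual producer switch (10% → 50% → 100%) can be sketched as deterministic, hash-based routing. This is an illustration under assumptions, not the post's actual rollout mechanism: the function name and the idea of routing per message key are hypothetical, and a real rollout might instead move whole services or producer instances.</p>

```python
import zlib

ROLLOUT_STEPS = (10, 50, 100)  # percent of producer traffic on MSK


def target_cluster(message_key, msk_percent):
    """Route a key to 'msk' or 'self-managed' deterministically.

    CRC32 puts each key in a stable bucket 0-99, so a key that has
    moved to MSK stays there as msk_percent grows.
    """
    bucket = zlib.crc32(message_key.encode("utf-8")) % 100
    return "msk" if bucket < msk_percent else "self-managed"


for step in ROLLOUT_STEPS:
    print(step, target_cluster("order-42", step))
```

<p>Because the bucket depends only on the key, raising the percentage never moves a key back to the old cluster, which keeps the rollout monotonic.</p>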
<h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1iam-auth-沒設application-連不上">Case 1: IAM auth not set up, application cannot connect</h3>
<p><strong>Symptom</strong>: after cutover the application reports <code>SaslAuthenticationException: Access denied</code>; the MSK broker logs in CloudWatch show that the IAM principal is not recognized.</p>
<p><strong>Root cause</strong>: MSK IAM auth requires the client to use an <em>MSK IAM auth library</em> (<code>aws-msk-iam-auth</code> for Java, <code>aws-msk-iam-sasl-signer-python</code> for Python); the application uses a standard Kafka client, which does not know how to produce the IAM signature.</p>
<p><strong>Fix</strong>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># kafka-python + MSK IAM auth
from aws_msk_iam_sasl_signer import MSKAuthTokenProvider
from kafka import KafkaProducer

REGION = 'us-east-1'


class MskIamTokenProvider:
    """Token provider that kafka-python calls when it needs a token.

    Some kafka-python releases also require this class to subclass the
    library's AbstractTokenProvider; check the version in use.
    """

    def token(self):
        # generate_auth_token returns (token, expiry_ms)
        token, _expiry_ms = MSKAuthTokenProvider.generate_auth_token(REGION)
        return token


producer = KafkaProducer(
    bootstrap_servers='b-1.mycluster.kafka.us-east-1.amazonaws.com:9098',
    security_protocol='SASL_SSL',
    sasl_mechanism='OAUTHBEARER',
    sasl_oauth_token_provider=MskIamTokenProvider(),
)</code></pre></div><p>The EKS pod must have an IAM role (IRSA) that allows the <code>kafka-cluster:Connect</code> action on the MSK cluster.</p>
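<p>The IAM side of this fix can be sketched as a policy document. The example below builds a least-privilege policy for one producer/consumer pair; the <code>kafka-cluster:*</code> action names follow MSK IAM access control, while the helper name, account ID, cluster UUID, topic, and group are hypothetical placeholders.</p>

```python
import json


def msk_client_policy(region, account_id, cluster_name, cluster_uuid, topic, group):
    """Build an IAM policy document for one topic and one consumer group."""
    base = f"arn:aws:kafka:{region}:{account_id}"
    suffix = f"{cluster_name}/{cluster_uuid}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["kafka-cluster:Connect"],
                "Resource": f"{base}:cluster/{suffix}",
            },
            {
                "Effect": "Allow",
                "Action": [
                    "kafka-cluster:DescribeTopic",
                    "kafka-cluster:WriteData",
                    "kafka-cluster:ReadData",
                ],
                "Resource": f"{base}:topic/{suffix}/{topic}",
            },
            {
                "Effect": "Allow",
                "Action": ["kafka-cluster:DescribeGroup", "kafka-cluster:AlterGroup"],
                "Resource": f"{base}:group/{suffix}/{group}",
            },
        ],
    }


policy = msk_client_policy(
    "us-east-1", "111122223333", "production", "hypothetical-uuid",
    "orders", "orders-consumer",
)
print(json.dumps(policy, indent=2))
```

<p>Attaching <code>kafka-cluster:Connect</code> alone lets the client authenticate; without the topic and group statements the next failure would be an authorization error on produce or fetch.</p>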
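<h3 id="case-2version-pinning360-跟-self-managed-行為差">Case 2: Version pinning, 3.6.0 behaves differently from self-managed</h3>
<p><strong>Symptom</strong>: after cutover to MSK 3.6.0, some consumers on old clients fail; the new brokers run with a different default <code>inter.broker.protocol.version</code>, and the old clients do not handle the resulting behavior.</p>
<p><strong>Root cause</strong>: the broker configuration changes with the newer Kafka version on MSK, and old clients (&lt; 2.8) no longer match the new brokers' protocol; the self-managed cluster may have been running an older broker version, which hid the problem.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Pre-migration</strong>: upgrade every client to Kafka client library 2.8+</li>
<li><strong>Align the MSK kafka_version with self-managed</strong>: create the MSK cluster on the same version as the self-managed cluster (3.0 / 3.5, for example), then upgrade after cutover</li>
<li><strong>Phased rollout</strong>: keep old data with <em>Tiered Storage</em> and a retention policy while new producers / consumers move to the new version</li>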
</ol>
<h3 id="case-3metric-pipeline-失效soc-dashboard-無數據">Case 3: Metric pipeline breaks, SOC dashboard shows no data</h3>
<p><strong>Symptom</strong>: after cutover the Grafana dashboards show zero MSK metrics; the old JMX exporter cannot scrape MSK; CloudWatch has the metrics, but the SOC side is not wired to CloudWatch.</p>
<p><strong>Root cause</strong>: MSK does not expose raw JMX; metrics come through CloudWatch or open monitoring (Prometheus + Grafana), which is not equivalent to a self-built JMX-based pipeline.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Enable open monitoring</strong>: in the MSK cluster config set <code>open_monitoring.prometheus.jmx_exporter.enabled_in_broker = true</code>, then have Prometheus scrape the MSK brokers</li>
<li><strong>CloudWatch → Prometheus</strong>: use <code>cloudwatch-exporter</code> to pull CloudWatch metrics into Prometheus</li>
<li><strong>Dashboard refresh</strong>: rewrite the Grafana dashboards for MSK-specific metric names (<code>kafka_server_*</code> → <code>aws_kafka_*</code>, or a unified alias)</li>
</ol>
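<p>Once open monitoring is enabled, Prometheus needs a target list for the brokers. The sketch below generates <code>file_sd</code> entries, assuming MSK's open monitoring ports (11001 for the JMX exporter, 11002 for the node exporter); the broker host names and the helper name are hypothetical.</p>

```python
import json

JMX_EXPORTER_PORT = 11001   # MSK open monitoring: Kafka JMX metrics
NODE_EXPORTER_PORT = 11002  # MSK open monitoring: host metrics


def msk_file_sd_targets(broker_hosts, cluster_label):
    """Build Prometheus file_sd entries for a list of broker DNS names."""
    return [
        {
            "labels": {"job": "jmx", "cluster": cluster_label},
            "targets": [f"{host}:{JMX_EXPORTER_PORT}" for host in broker_hosts],
        },
        {
            "labels": {"job": "node", "cluster": cluster_label},
            "targets": [f"{host}:{NODE_EXPORTER_PORT}" for host in broker_hosts],
        },
    ]


brokers = ["b-1.example.internal", "b-2.example.internal", "b-3.example.internal"]
print(json.dumps(msk_file_sd_targets(brokers, "production"), indent=2))
```

<p>The generated JSON can be written to a file referenced from a <code>file_sd_configs</code> scrape job; the MSK security group also has to allow the Prometheus host on both ports.</p>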
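<h3 id="case-4cross-cluster-mirrormm2--msk配置複雜">Case 4: Cross-cluster mirror (MM2 → MSK) is complex to configure</h3>
<p><strong>Symptom</strong>: MM2 has run for a week, yet consumer offsets are not synced between self-managed and MSK; after the application switches over it <em>re-reads the whole backlog of old data</em>, causing duplicate processing.</p>
<p><strong>Root cause</strong>: MM2 consumer offset sync needs a <em>cross-cluster</em> mapping because source offsets do not translate directly to target offsets, and MM2's group offset sync is off by default.</p>
<p><strong>Fix</strong>:</p>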





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-properties" data-lang="properties"># MM2 config (connect-mirror-maker.properties)
clusters=source,target
source.bootstrap.servers=self-managed-kafka:9092
target.bootstrap.servers=msk-cluster:9098
target.security.protocol=SASL_SSL
target.sasl.mechanism=AWS_MSK_IAM
target.sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
target.sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
source-&gt;target.enabled=true
source-&gt;target.topics=.*
# Off by default. Keep comments on their own line: a trailing comment would become part of the value.
sync.group.offsets.enabled=true
emit.checkpoints.enabled=true
checkpoints.topic.replication.factor=3</code></pre></div><p><strong>Architecture</strong>: at switchover, consumers resume from offsets translated through the <em>MM2 checkpoint</em> topic rather than reading the internal offsets directly; the application uses <em>idempotent</em> handling plus a <em>dedup key</em> to avoid duplicate processing.</p>
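<p>The idempotent handling plus dedup key can be illustrated with a small wrapper around the message handler. This is a sketch under assumptions: the class name is hypothetical, and the in-memory set stands in for whatever durable store a real consumer group would share.</p>

```python
class DedupProcessor:
    """Apply a handler at most once per dedup key.

    The seen-set lives in memory for brevity; a real consumer would
    keep it in a durable store shared by all consumer instances.
    """

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()

    def process(self, dedup_key, payload):
        """Return True if the payload was handled, False if skipped."""
        if dedup_key in self._seen:
            return False
        self._handler(payload)
        self._seen.add(dedup_key)
        return True


handled = []
processor = DedupProcessor(handled.append)
print(processor.process("order-1001", {"amount": 25}))
print(processor.process("order-1001", {"amount": 25}))  # replayed after cutover
```

<p>The key is recorded only after the handler succeeds, so a handler failure leaves the message eligible for a retry instead of silently dropping it.</p>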
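<h3 id="case-5msk-billing-暴漲tiered-storage--cross-az-沒控">Case 5: MSK bill spikes, Tiered Storage / cross-AZ traffic uncontrolled</h3>
<p><strong>Symptom</strong>: the first month's MSK bill is 50% above the estimate; the breakdown points to cross-AZ traffic (producers / consumers crossing AZs) plus Tiered Storage charges on data that is still hot.</p>
<p><strong>Root cause</strong>:</p>
<ul>
<li>MSK's automatic multi-AZ deployment makes some cross-AZ traffic unavoidable; the partition leader a producer writes to may sit in another AZ</li>
<li>Tiered Storage adds storage cost for hot data (retention &lt; 24 hours); it is only cost-effective for cold data</li>
</ul>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>AZ-aware application routing</strong>: have consumers fetch from a same-AZ replica (set <code>client.rack</code> on the client and enable rack-aware replica selection on the brokers); producers always write to the partition leader, so their cross-AZ share can only be reduced by where producers and leaders are placed</li>
<li><strong>Align retention with the hot tier</strong>: topics with &lt; 24 hours of retention stay on broker-local storage; only topics with 24 hours or more use Tiered Storage</li>
<li><strong>Reserved instance</strong>: MSK has no direct reserved pricing, but EBS / data transfer can be prepaid for a 10-20% reduction</li>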
</ol>
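<p>Both root causes lend themselves to a quick estimate before the first bill arrives. The sketch below assumes producers and partition leaders are spread evenly across AZs, and it encodes the 24-hour retention rule from the list above; function names and the sample volume are hypothetical, and the resulting GB figure still has to be multiplied by the current inter-AZ data transfer rate.</p>

```python
def cross_az_produce_gb(monthly_produce_gb, az_count=3):
    """Expected produce traffic that leaves the producer's AZ.

    Assumes partition leaders and producers are spread evenly across
    az_count AZs, so a leader sits in another AZ with probability
    (az_count - 1) / az_count.
    """
    return monthly_produce_gb * (az_count - 1) / az_count


def storage_tier(retention_hours, hot_threshold_hours=24):
    """Pick a storage tier from topic retention."""
    if retention_hours < hot_threshold_hours:
        return "broker-local"
    return "tiered"


print(cross_az_produce_gb(3000))
print(storage_tier(12), storage_tier(168))
```

<p>With three AZs, roughly two thirds of produce traffic crosses an AZ boundary under this even-spread assumption, which is why cross-AZ traffic deserves its own line in the cost estimate.</p>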
<h2 id="capacity--cost">Capacity / cost</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Self-managed Kafka</th>
          <th>MSK</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cluster cost (3 broker)</td>
          <td>$660 EC2 + $1500 EBS = $2,160</td>
          <td>$2,500-3,500 (including storage + multi-AZ)</td>
      </tr>
      <tr>
          <td>Operational FTE</td>
          <td>0.5-1 FTE = $5K-10K</td>
          <td>0.1-0.3 FTE = $1K-3K</td>
      </tr>
      <tr>
          <td>Patch / maintenance</td>
          <td>Manual + downtime opportunity</td>
          <td>Auto + maintenance window scheduled</td>
      </tr>
      <tr>
          <td>Backup</td>
          <td>Self-managed MirrorMaker</td>
          <td>Not built in; MSK Replicator / MirrorMaker 2 or an S3 sink connector</td>
      </tr>
      <tr>
          <td>Metric / monitoring</td>
          <td>Prometheus + Grafana self-deploy</td>
          <td>CloudWatch + open monitoring</td>
      </tr>
      <tr>
          <td>Cross-AZ traffic</td>
          <td>Limited by VPC layout</td>
          <td>Automatic multi-AZ; watch the cross-AZ traffic cost</td>
      </tr>
      <tr>
          <td>Tiered storage</td>
          <td>Kafka 3.6+ self-managed</td>
          <td>MSK built-in tiered storage</td>
      </tr>
      <tr>
          <td>Total (3 brokers, mid-size)</td>
          <td>$7K-11K / mo (including FTE)</td>
          <td>$3.5K-6.5K / mo (including FTE)</td>
      </tr>
      <tr>
          <td>Migration cost</td>
          <td>-</td>
          <td>1-3 FTE × 1-2 months</td>
      </tr>
  </tbody>
</table>
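<p><strong>Reading the numbers</strong>: for organizations with &lt; 50 brokers, MSK usually breaks even in 6-12 months and saves FTE cost afterward; for large organizations with 50+ brokers, self-managing may actually cost less.</p>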
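<p>The break-even claim is simple arithmetic: a one-off migration cost divided by the monthly saving. The sketch below works one example using mid-range totals from the table; the $30,000 migration cost is a hypothetical figure (the table gives only FTE-months), and the function name is illustrative.</p>

```python
import math


def break_even_months(migration_cost, self_managed_monthly, msk_monthly):
    """Months until cumulative savings cover the one-off migration cost.

    Returns None when MSK is not cheaper per month.
    """
    monthly_saving = self_managed_monthly - msk_monthly
    if monthly_saving <= 0:
        return None
    return math.ceil(migration_cost / monthly_saving)


# Mid-range monthly totals from the table; migration cost is hypothetical.
print(break_even_months(30000, 9000, 5000))
```

<p>Under these inputs the migration pays for itself in 8 months, inside the 6-12 month range; a smaller monthly saving or a larger migration team pushes the break-even point out quickly.</p>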
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-kafka--nats-migration-對位">Comparison with <a href="/blog/backend/03-message-queue/vendors/kafka/migrate-from-to-nats/" data-link-title="Kafka ↔ NATS：不是 migration、是 messaging paradigm 重設計" data-link-desc="Kafka 跟 NATS 不是同類產品（log-based event streaming vs subject-based messaging）、&#39;migration&#39; 字面上不成立；本文釐清兩家 paradigm 邊界、什麼情境真的能換、application 模式重設計的 5 個踩雷（consumer offset 觀念差 / retention model / exactly-once 假設 / schema registry 缺位 / fan-out 模式差）、跟 JetStream 對位 &#43; 混合架構">Kafka ↔ NATS migration</a></h3>
<p>Two ways out of self-managed Kafka:</p>
<ul>
<li>MSK: operational simplification, protocol drop-in, moderate cost increase; suits organizations that <em>keep using the Kafka paradigm</em></li>
<li>NATS: a paradigm shift that requires application changes; suits use cases that need <em>plain messaging without event sourcing</em></li>
</ul>
<p>Most organizations do not need a paradigm shift, so MSK is the more reasonable choice; NATS is for those that genuinely need lightweight messaging.</p>
<h3 id="跟-confluent-cloud-對位">Comparison with <a href="https://www.confluent.io/confluent-cloud/">Confluent Cloud</a></h3>
<p>Confluent Cloud is another managed Kafka offering and runs across clouds (AWS / GCP / Azure); MSK is AWS-only but integrates more deeply with IAM / VPC. Multi-cloud organizations go with Confluent; AWS-deep organizations go with MSK.</p>
<h3 id="跟-iam--secrets-manager-整合">Integration with IAM / Secrets Manager</h3>
<p>MSK + IAM auth + Secrets Manager (see <a href="/blog/backend/07-security-data-protection/vendors/hashicorp-vault/migrate-to-aws-secrets-manager/" data-link-title="Vault → AWS Secrets Manager：「secret」不是「secret」、identity model 才是核心差異" data-link-desc="Vault → AWS Secrets Manager migration 表面是 secret store 替換、實際核心是 identity model 對位（Vault token &#43; policy vs AWS IAM &#43; resource policy）；驗證 [#128](/report/data-topology-as-audit-dimension/) self-aware limitation 提出的 identity axis 候選 — identity 是否獨立 audit 軸；5 個 production 踩雷（IAM principal 對位 / dynamic credential 對等失敗 / lease lifecycle 模型不同 / audit log 結構差 / 計費模型反轉）">Vault → AWS Secrets Manager migration</a>) is the standard combination for an AWS-deep stack; short-lived credentials + IRSA are the production best practice.</p>
<h3 id="反向-migrationmsk--self-managed">Reverse migration (MSK → self-managed)</h3>
<p>Rare, and usually driven by a <em>cost inversion</em> (at large scale) or a <em>multi-cloud strategy</em>. The flow is the mirror image of this one. Note that MSK Tiered Storage data cannot be exported directly: <em>disable tiered storage first</em> and recall the data.</p>
<h3 id="下一步議題">Next topics</h3>
<ul>
<li><strong>MSK Connect</strong>: managed Kafka Connect; reduces connector operations, but the plugin ecosystem is smaller than self-managed Connect</li>
<li><strong>MSK Serverless</strong>: suits bursty workloads; steady workloads end up more expensive</li>
<li><strong>Cost monitoring playbook</strong>: run an MSK billing breakdown once a month to catch unexpected egress / tiered storage cost</li>
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Source vendor: <a href="/blog/backend/03-message-queue/vendors/kafka/" data-link-title="Apache Kafka" data-link-desc="Distributed event streaming platform、log-based 模型">Kafka</a></li>
<li>Parallel migration playbooks (Type C): <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora</a> / <a href="/blog/backend/01-database/vendors/mongodb/migrate-to-atlas/" data-link-title="MongoDB → Atlas：Atlas 不是 MongoDB &#43; managed、是另一個 product" data-link-desc="Atlas 號稱「MongoDB managed」但 operational model 完全不同（auto-scaling / VPC peering / IAM-driven access / 內建 backup / billing 模型）；本文採用 Type C operational redesign hybrid 結構、4-phase operational migration &#43; drop-in cutover、5 個 production 踩雷（連線數限制 / IP whitelist / backup retention / IAM token 過期 / billing 暴漲）">MongoDB → Atlas</a></li>
<li>Parallel H cost variant: <a href="/blog/backend/04-observability/vendors/datadog/migrate-to-grafana-stack/" data-link-title="Datadog → Grafana Stack：把 $50K/month bill 拆解到 self-hosted observability" data-link-desc="Datadog 五層計費（host APM / metric / log ingest / log retention / RUM）拆解、對位 Grafana Stack（Mimir / Loki / Tempo / Grafana / Alloy）的 5 層責任；OTel-based agent migration、5 個 production 踩雷（cardinality 爆 / log volume cost / dashboard 不直接轉 / alert routing 換邏輯 / SLO definition 差異）、cost reality check">Datadog → Grafana Stack</a></li>
<li>Parallel paradigm shift: <a href="/blog/backend/03-message-queue/vendors/kafka/migrate-from-to-nats/" data-link-title="Kafka ↔ NATS：不是 migration、是 messaging paradigm 重設計" data-link-desc="Kafka 跟 NATS 不是同類產品（log-based event streaming vs subject-based messaging）、&#39;migration&#39; 字面上不成立；本文釐清兩家 paradigm 邊界、什麼情境真的能換、application 模式重設計的 5 個踩雷（consumer offset 觀念差 / retention model / exactly-once 假設 / schema registry 缺位 / fan-out 模式差）、跟 JetStream 對位 &#43; 混合架構">Kafka ↔ NATS</a></li>
<li>Methodology: <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a></li>
</ul>
]]></content:encoded></item></channel></rss>