<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Deployment-Platform on Tarragon</title><link>https://tarrragon.github.io/blog/tags/deployment-platform/</link><description>Recent content in Deployment-Platform on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Tue, 19 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/tags/deployment-platform/index.xml" rel="self" type="application/rss+xml"/><item><title>Docker Swarm → Kubernetes: Where 5 Swarm Production Clusters Hit the Wall</title><link>https://tarrragon.github.io/blog/backend/05-deployment-platform/vendors/kubernetes/migrate-from-docker-swarm/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/05-deployment-platform/vendors/kubernetes/migrate-from-docker-swarm/</guid><description>&lt;blockquote>
&lt;p>This is a cross-vendor migration playbook linking Docker Swarm and &lt;a href="https://tarrragon.github.io/blog/backend/05-deployment-platform/vendors/kubernetes/" data-link-title="Kubernetes" data-link-desc="Container orchestration 主流、GKE / EKS / AKS / 自管">Kubernetes&lt;/a>. Running the &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology 6-dimension audit&lt;/a> classifies it as &lt;em>Paradigm = High (Swarm's simple container orchestration → the K8s declarative resource model) → Type E paradigm shift&lt;/em>.&lt;/p>&lt;/blockquote>
&lt;h2 id="5-個-swarm-production-cluster-撞牆數據">5 個 Swarm production cluster 撞牆數據&lt;/h2>
&lt;p>從 2020-2024 觀察 5 個中型 organization 的 Swarm production cluster lifecycle、典型撞牆點：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Cluster&lt;/th>
 &lt;th>Scale (peak)&lt;/th>
 &lt;th>Wall hit&lt;/th>
 &lt;th>Migration triggered&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>A (SaaS startup)&lt;/td>
 &lt;td>80 services / 12 nodes&lt;/td>
 &lt;td>Rising service discovery latency; no sidecar mesh&lt;/td>
 &lt;td>2022&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>B (E-commerce)&lt;/td>
 &lt;td>150 services / 25 nodes&lt;/td>
 &lt;td>Hand-written rolling update + canary logic grew too complex&lt;/td>
 &lt;td>2023&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>C (Fintech)&lt;/td>
 &lt;td>60 services / 15 nodes&lt;/td>
 &lt;td>Self-managed secret rotation + RBAC; compliance was hard&lt;/td>
 &lt;td>2023&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>D (Media)&lt;/td>
 &lt;td>200 services / 40 nodes&lt;/td>
 &lt;td>Hand-written autoscaling; traffic prediction failed&lt;/td>
 &lt;td>2024&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>E (Logistics)&lt;/td>
 &lt;td>100 services / 20 nodes&lt;/td>
 &lt;td>No multi-region support&lt;/td>
 &lt;td>2024&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Patterns common to all 5:&lt;/p>
&lt;ul>
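&lt;li>&lt;strong>Swarm is simple, but in these clusters its practical ceiling was roughly 100-200 services / 20-40 nodes&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Cross-service governance (mesh / RBAC / secrets / autoscaling) needs &lt;em>bolt-on&lt;/em> tooling, until the total complexity exceeds that of K8s&lt;/strong>&lt;/li>
&lt;li>&lt;strong>No native multi-region&lt;/strong>, which limits disaster recovery&lt;/li>
&lt;li>&lt;strong>Shrinking ecosystem, low community activity, slow feature development&lt;/strong>&lt;/li>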
&lt;/ul>
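&lt;p>The wall is not "Swarm can't run it" but "Swarm won't solve &lt;em>cross-service governance&lt;/em> for you, so you write it yourself". Kubernetes is not simpler; it &lt;em>brings governance problems into the framework&lt;/em>.&lt;/p>
&lt;h2 id="為什麼遷ceiling--ecosystem--multi-region-三條-driver">Why migrate: three drivers (ceiling / ecosystem / multi-region)&lt;/h2>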
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Driver&lt;/th>
 &lt;th>Trigger&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Ceiling&lt;/td>
 &lt;td>Beyond 100-200 services, Swarm's service discovery latency and scheduling could not keep up (in the clusters above)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Ecosystem&lt;/td>
 &lt;td>The K8s ecosystem (Helm / Operators / mesh / GitOps) is mature; Swarm lacks equivalent tools&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Multi-region&lt;/td>
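 &lt;td>Swarm offers no multi-cluster / multi-region management; on K8s this is covered by ecosystem tooling (e.g. GitOps multi-cluster deploys, Karmada), though the original KubeFed federation project has been archived&lt;/td>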
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Reverse drivers (K8s → Swarm):&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This is a cross-vendor migration playbook linking Docker Swarm and <a href="/blog/backend/05-deployment-platform/vendors/kubernetes/" data-link-title="Kubernetes" data-link-desc="Container orchestration 主流、GKE / EKS / AKS / 自管">Kubernetes</a>. Running the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology 6-dimension audit</a> classifies it as <em>Paradigm = High (Swarm's simple container orchestration → the K8s declarative resource model) → Type E paradigm shift</em>.</p></blockquote>
<h2 id="5-個-swarm-production-cluster-撞牆數據">Where 5 Swarm production clusters hit the wall</h2>
<p>The Swarm production clusters of 5 mid-sized organizations, observed over their lifecycle from 2020 to 2024, and where each one hit the wall:</p>
<table>
  <thead>
      <tr>
          <th>Cluster</th>
          <th>Scale (peak)</th>
          <th>Wall hit</th>
          <th>Migration triggered</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>A (SaaS startup)</td>
          <td>80 services / 12 nodes</td>
          <td>Rising service discovery latency; no sidecar mesh</td>
          <td>2022</td>
      </tr>
      <tr>
          <td>B (E-commerce)</td>
          <td>150 services / 25 nodes</td>
          <td>Hand-written rolling update + canary logic grew too complex</td>
          <td>2023</td>
      </tr>
      <tr>
          <td>C (Fintech)</td>
          <td>60 services / 15 nodes</td>
          <td>Self-managed secret rotation + RBAC; compliance was hard</td>
          <td>2023</td>
      </tr>
      <tr>
          <td>D (Media)</td>
          <td>200 services / 40 nodes</td>
          <td>Hand-written autoscaling; traffic prediction failed</td>
          <td>2024</td>
      </tr>
      <tr>
          <td>E (Logistics)</td>
          <td>100 services / 20 nodes</td>
          <td>No multi-region support</td>
          <td>2024</td>
      </tr>
  </tbody>
</table>
<p>Patterns common to all 5:</p>
<ul>
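<li><strong>Swarm is simple, but in these clusters its practical ceiling was roughly 100-200 services / 20-40 nodes</strong></li>
<li><strong>Cross-service governance (mesh / RBAC / secrets / autoscaling) needs <em>bolt-on</em> tooling, until the total complexity exceeds that of K8s</strong></li>
<li><strong>No native multi-region</strong>, which limits disaster recovery</li>
<li><strong>Shrinking ecosystem, low community activity, slow feature development</strong></li>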
</ul>
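<p>The wall is not "Swarm can't run it" but "Swarm won't solve <em>cross-service governance</em> for you, so you write it yourself". Kubernetes is not simpler; it <em>brings governance problems into the framework</em>.</p>
<h2 id="為什麼遷ceiling--ecosystem--multi-region-三條-driver">Why migrate: three drivers (ceiling / ecosystem / multi-region)</h2>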
<table>
  <thead>
      <tr>
          <th>Driver</th>
          <th>Trigger</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Ceiling</td>
          <td>Beyond 100-200 services, Swarm's service discovery latency and scheduling could not keep up (in the clusters above)</td>
      </tr>
      <tr>
          <td>Ecosystem</td>
          <td>The K8s ecosystem (Helm / Operators / mesh / GitOps) is mature; Swarm lacks equivalent tools</td>
      </tr>
      <tr>
          <td>Multi-region</td>
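          <td>Swarm offers no multi-cluster / multi-region management; on K8s this is covered by ecosystem tooling (e.g. GitOps multi-cluster deploys, Karmada), though the original KubeFed federation project has been archived</td>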
      </tr>
  </tbody>
</table>
<p>Reverse drivers (K8s → Swarm):</p>
<ul>
<li>Purely internal tools / small scale (&lt; 30 services), where K8s is overkill</li>
<li>Edge / IoT scenarios, where Swarm's smaller footprint helps</li>
</ul>
<h2 id="6-維-audit">6-dimension audit</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td><strong>High</strong> (docker-compose stack.yml → K8s YAML; both are YAML, but the schemas are completely different)</td>
      </tr>
      <tr>
          <td>Operational</td>
          <td>Medium (self-managed Swarm → self-hosted or managed K8s)</td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td><strong>High</strong> (simple container orchestration → declarative resource model)</td>
      </tr>
      <tr>
          <td>Components</td>
          <td>Low (one orchestration system replaced by another)</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>Low (container images unchanged)</td>
      </tr>
      <tr>
          <td>Data topology</td>
          <td>Low</td>
      </tr>
  </tbody>
</table>
<p>Schema and Paradigm are both High → this is primarily a <strong>Type E paradigm shift</strong>, with the high Schema dimension covered in its own section.</p>
<h2 id="paradigm-對位">Paradigm mapping</h2>
<table>
  <thead>
      <tr>
          <th>Concept</th>
          <th>Swarm</th>
          <th>K8s</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Workload unit</td>
          <td>Service</td>
          <td>Deployment + Pod + Service</td>
      </tr>
      <tr>
          <td>Stack definition</td>
          <td>stack.yml (docker-compose format)</td>
          <td>YAML manifest (multiple resources)</td>
      </tr>
      <tr>
          <td>Networking</td>
          <td>Overlay network (built-in)</td>
          <td>CNI plugin (Calico / Cilium, etc.)</td>
      </tr>
      <tr>
          <td>Service discovery</td>
          <td>DNS-based built-in</td>
          <td>DNS-based (CoreDNS) + Service object</td>
      </tr>
      <tr>
          <td>Load balancing</td>
          <td>Built-in routing mesh</td>
          <td>Service + Ingress + LoadBalancer</td>
      </tr>
      <tr>
          <td>Secret management</td>
          <td>Docker secrets</td>
          <td>K8s Secret + external Vault / Secrets Manager</td>
      </tr>
      <tr>
          <td>Rolling update</td>
          <td><code>docker service update --image ...</code></td>
          <td>Deployment + rolling update + readiness probe</td>
      </tr>
      <tr>
          <td>Autoscaling</td>
          <td>Manual (<code>docker service scale</code>)</td>
          <td>HPA (Horizontal Pod Autoscaler)</td>
      </tr>
      <tr>
          <td>RBAC</td>
          <td>None in open-source Swarm (RBAC came only with Docker Enterprise, now owned by Mirantis)</td>
          <td>First-class (Role / RoleBinding / ServiceAccount)</td>
      </tr>
      <tr>
          <td>Persistent storage</td>
          <td>Volume + driver plugin</td>
          <td>PV / PVC + CSI driver</td>
      </tr>
      <tr>
          <td>Service mesh</td>
          <td>None (typically an edge proxy such as Traefik is bolted on, which is not a full mesh)</td>
          <td>Istio / Linkerd / Cilium</td>
      </tr>
      <tr>
          <td>GitOps</td>
          <td>No native support</td>
          <td>Argo CD / Flux (mature, K8s-native controllers)</td>
      </tr>
  </tbody>
</table>
<h2 id="schema-gapdocker-compose-vs-k8s-yaml">Schema gap: docker-compose vs K8s YAML</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"># Docker Swarm stack.yml
version: '3.8'
services:
  webapp:
    image: myapp:1.0
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        order: start-first     # start the new task before stopping the old one (like maxSurge 1 / maxUnavailable 0)
      restart_policy:
        condition: on-failure
    networks:
      - frontend
    ports:
      - "8080:8080"            # published on every node through the routing mesh
networks:
  frontend:
    driver: overlay            # networks used by services must be declared for docker stack deploy</code></pre></div>




<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"># K8s equivalent: Deployment + Service (exposing it outside the cluster also needs an Ingress or a LoadBalancer Service)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels: { app: webapp }
  template:
    metadata:
      labels: { app: webapp }
    spec:
      containers:
        - name: webapp
          image: myapp:1.0
          ports:
            - containerPort: 8080
          readinessProbe:          # no counterpart in the stack.yml above; assumes the app serves /healthz
            httpGet:
              path: /healthz
              port: 8080
          resources:
            requests:              # used for scheduling and for HPA utilization math
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  name: webapp
spec:
  selector: { app: webapp }        # type defaults to ClusterIP: reachable only inside the cluster
  ports:
    - port: 8080
      targetPort: 8080</code></pre></div><p>One Swarm service becomes 2-3 K8s resources (Deployment + Service, plus possibly Ingress / HPA). The application does not change, but <em>deployment-side work grows roughly 5-10x</em>. Unlike Swarm's published port, the ClusterIP Service above is reachable only from inside the cluster.</p>
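<p>To match Swarm's published <code>8080:8080</code> port, the Service still needs an entry point from outside the cluster. Below is a minimal sketch of the Ingress mentioned above. It assumes an ingress controller is installed under the IngressClass <code>nginx</code> and uses a hypothetical host, <code>webapp.example.com</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"># Ingress routing external HTTP traffic to the webapp Service (sketch).
# Assumptions: an ingress controller registered as IngressClass "nginx";
# the host name is hypothetical.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: webapp
spec:
  ingressClassName: nginx
  rules:
    - host: webapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: webapp          # the ClusterIP Service defined above
                port:
                  number: 8080</code></pre></div>
<p>Setting <code>type: LoadBalancer</code> on the Service is the alternative. It is closer to Swarm's per-port publishing, but each Service gets its own cloud load balancer.</p>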
<h2 id="migration-流程">Migration process</h2>
<h3 id="partial-migration--混合架構">Partial migration + hybrid architecture</h3>
<p>This follows the same Type E pattern as <a href="/blog/backend/03-message-queue/vendors/kafka/migrate-from-to-nats/" data-link-title="Kafka ↔ NATS：不是 migration、是 messaging paradigm 重設計" data-link-desc="Kafka 跟 NATS 不是同類產品（log-based event streaming vs subject-based messaging）、&#39;migration&#39; 字面上不成立；本文釐清兩家 paradigm 邊界、什麼情境真的能換、application 模式重設計的 5 個踩雷（consumer offset 觀念差 / retention model / exactly-once 假設 / schema registry 缺位 / fan-out 模式差）、跟 JetStream 對位 &#43; 混合架構">Kafka ↔ NATS</a> / <a href="/blog/backend/05-deployment-platform/vendors/consul/migrate-from-etcd/" data-link-title="etcd → Consul：KV &#43; N 個 extras feature matrix" data-link-desc="etcd → Consul 是 Type E paradigm shift expansion — 從 pure KV store 升到 service mesh / discovery / health check / multi-DC；本文用對照表 &#43; paradigm expansion 路線、5 個 production 踩雷（API 對位 / lock semantics / watch event model / multi-DC topology / ACL system）">etcd → Consul</a>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text">1. Audit applications: list every Swarm stack + service
2. Classify and plan:
   - Simple stateless: move to K8s first (low risk)
   - Stateful (DB / queue): evaluate a K8s operator, or keep it on Swarm
   - Critical services: run both during a dual-run period to confirm K8s behaves the same
3. Build the K8s cluster:
   - Managed (EKS / GKE / AKS) vs self-hosted (kubeadm)
   - Set up ingress controller / cert-manager / monitoring
4. Migrate applications (per stack):
   - Write K8s YAML / Helm charts
   - Configure readiness/liveness probes / resource requests
   - Map networking + secrets
5. Cutover + Swarm decommission:
   - Once some stacks are cut over, decide whether to keep Swarm (legacy / edge)
   - Most organizations decommission Swarm completely</code></pre></div><p>The whole effort typically takes 3-6 months, depending on the number of stacks and application complexity.</p>
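<p>During the dual-run period, workloads already on K8s may still need to call services that remain on Swarm. One way to keep their configuration stable is a Service of type <code>ExternalName</code>, for which cluster DNS returns a CNAME to an outside host. The sketch below assumes a hypothetical namespace <code>shop</code> and a hypothetical host <code>orders.swarm.internal.example.com</code> where the Swarm service is reachable.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"># Hybrid phase: K8s callers use the name "orders" while the service still runs on Swarm.
# Namespace and external host name are hypothetical.
apiVersion: v1
kind: Service
metadata:
  name: orders
  namespace: shop
spec:
  type: ExternalName
  externalName: orders.swarm.internal.example.com   # DNS answers with a CNAME to this host</code></pre></div>
<p>When <code>orders</code> moves to K8s, replace this object with a normal selector-based Service of the same name, and callers keep using <code>orders</code> unchanged. ExternalName works only at the DNS level. HTTP Host headers and TLS server names therefore still carry <code>orders</code>, so check that the Swarm side accepts them.</p>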
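<h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1networking-model-差cross-service-connectivity-失效">Case 1: Networking model differences break cross-service connectivity</h3>
<p><strong>Symptom</strong>: after cutover, service A cannot reach service B. The names the app used on Swarm, such as <code>tasks.service_b</code>, do not resolve on K8s, where the Service lives at <code>service-b.namespace.svc.cluster.local</code>.</p>
<p><strong>Root cause</strong>: on a Swarm overlay network, services reach each other by service name. <code>service_b</code> resolves to the service VIP, and <code>tasks.service_b</code> resolves to the individual task IPs.</p>
<p>K8s differs in three ways:</p>
<ul>
<li>Service names must be DNS labels (lowercase letters, digits and <code>-</code>; no <code>_</code>), so <code>service_b</code> becomes something like <code>service-b</code>.</li>
<li><code>tasks.*</code> has no direct equivalent; the closest is a headless Service.</li>
<li>Calls across namespaces need <code>service-b.&lt;namespace&gt;</code> or the FQDN.</li>
</ul>
<p>The application-side service URLs were hardcoded with the Swarm names, so they broke.</p>
<p><strong>Fix</strong>:</p>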
<ol>
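<li>Move service URLs into config / env variables, and use short names (<code>service-b</code>) that resolve through the cluster DNS search domains within a namespace</li>
<li>Keep the pod default <code>dnsPolicy: ClusterFirst</code>, and use <code>kubectl get svc -A</code> to confirm that every expected Service exists</li>
<li>Add a default deny-all NetworkPolicy with explicit allow rules. K8s allows all pod-to-pod traffic by default, and NetworkPolicy is enforced only by a CNI that supports it (e.g. Calico / Cilium)</li>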
</ol>
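<p>The sketch below covers fix 3. It also adds a headless Service as the nearest stand-in for Swarm's <code>tasks.&lt;service&gt;</code> lookups. It assumes both services run in a hypothetical namespace <code>shop</code>, pods carry the labels <code>app: service-a</code> / <code>app: service-b</code>, service B listens on 8080, and the CNI enforces NetworkPolicy.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"># 1) Default deny: selects every pod in the namespace and allows no ingress traffic.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: shop
spec:
  podSelector: {}
  policyTypes:
    - Ingress
---
# 2) Explicit allow: only service-a pods may reach service-b pods on TCP 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-a-to-b
  namespace: shop
spec:
  podSelector:
    matchLabels:
      app: service-b
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: service-a
      ports:
        - protocol: TCP
          port: 8080
---
# 3) Headless Service: DNS returns one A record per ready pod (like tasks.service_b)
#    instead of a single virtual IP.
apiVersion: v1
kind: Service
metadata:
  name: service-b-pods
  namespace: shop
spec:
  clusterIP: None
  selector:
    app: service-b
  ports:
    - port: 8080
      targetPort: 8080</code></pre></div>
<p>This sketch denies ingress only. A default-deny for egress would also block DNS lookups unless port 53 to the cluster DNS is allowed, so add egress rules deliberately.</p>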
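<h3 id="case-2secret-rotation-從-swarm-secrets-換-vault--secrets-manager">Case 2: Moving secret rotation from Swarm secrets to Vault / Secrets Manager</h3>
<p><strong>Symptom</strong>: on Swarm, secrets were rotated through <code>docker secret</code>. After the move to K8s, Secrets are <em>static values</em> and nothing rotates them.</p>
<p><strong>Root cause</strong>: neither side rotates secrets by itself. Swarm secrets are immutable, so rotation was a workflow: create a new secret, then run <code>docker service update --secret-rm ... --secret-add ...</code>. That workflow does not carry over. A K8s Secret is K8s-native but <em>not auto-rotated</em>. Rotation needs an external Vault / Secrets Manager plus an agent (Vault Agent Injector / external-secrets-operator).</p>
<p><strong>Fix</strong>:</p>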
<ol>
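<li>Deploy external-secrets-operator on K8s, integrated with AWS Secrets Manager / Vault</li>
<li>Have the application read secrets from mounted files or env variables, never hardcoded in code</li>
<li>Rotate on the vendor side, and the operator syncs the new value into the K8s Secret. The kubelet refreshes volume-mounted Secrets, but not when they are mounted with <code>subPath</code>. Env variables change only when the pod restarts. The app must therefore re-read the file or be restarted, for example by a controller such as Stakater Reloader</li>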
</ol>
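<p>A sketch of fix 1 with external-secrets-operator. It assumes a <code>ClusterSecretStore</code> named <code>aws-secrets-manager</code> is already configured. The secret lives under the hypothetical remote key <code>prod/webapp/db</code> and has a <code>password</code> field. The operator re-reads the remote value every <code>refreshInterval</code> and updates the K8s Secret <code>webapp-db</code>. Mount that Secret into the pod as a volume without <code>subPath</code>, so the kubelet can refresh the file after a rotation.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"># ExternalSecret: syncs a remote secret into the K8s Secret "webapp-db".
# Older external-secrets-operator releases serve this kind as external-secrets.io/v1beta1;
# check which apiVersion your installed release serves.
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: webapp-db
spec:
  refreshInterval: 1h              # how often the remote value is re-read
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager      # assumed to exist already
  target:
    name: webapp-db                # K8s Secret created and owned by the operator
  data:
    - secretKey: password          # key inside the resulting K8s Secret
      remoteRef:
        key: prod/webapp/db        # hypothetical remote secret name
        property: password         # field within a JSON-formatted secret value</code></pre></div>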
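<h3 id="case-3readiness-probe-沒設rolling-update-期間-traffic-loss">Case 3: Missing readiness probe causes traffic loss during rolling updates</h3>
<p><strong>Symptom</strong>: after cutover, 5-10% of requests fail during deploys. The cause turns out to be pods receiving traffic before startup has finished.</p>
<p><strong>Root cause</strong>: Swarm's <code>restart_policy</code> has no readiness equivalent. Swarm's only gate is a Docker <code>HEALTHCHECK</code>, which Kubernetes ignores. Without a readiness probe, K8s marks a pod Ready as soon as its containers start, so slow-starting applications receive traffic before they are ready.</p>
<p><strong>Fix</strong>:</p>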
<ol>
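<li><strong>Always add a readiness probe</strong>: HTTP / TCP / exec check</li>
<li><strong>Allow for startup time</strong>: for JVM applications, reserve 30-60s with <code>initialDelaySeconds</code> or a <code>startupProbe</code></li>
<li><strong>Set <code>minReadySeconds</code></strong>: e.g. 30s on the Deployment, so a new pod must stay ready that long before it counts as available</li>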
</ol>
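<p>A sketch of the three fixes applied to the <code>webapp</code> Deployment from the schema section. The <code>/healthz</code> endpoint and all timing values are assumptions to tune per application. The sketch uses a <code>startupProbe</code> instead of a long <code>initialDelaySeconds</code>: it holds back the other probes only until the app has actually started, rather than for a fixed worst-case delay.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"># Deployment with probes + minReadySeconds (timing values are illustrative).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  replicas: 3
  minReadySeconds: 30            # a new pod must stay Ready 30s before it counts as available
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0          # never drop below the desired replica count
  selector:
    matchLabels: { app: webapp }
  template:
    metadata:
      labels: { app: webapp }
    spec:
      containers:
        - name: webapp
          image: myapp:1.0
          ports:
            - containerPort: 8080
          startupProbe:            # allows up to 30 x 2s = 60s for a slow (e.g. JVM) start
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 2
            failureThreshold: 30
          readinessProbe:          # gates Service traffic; failing pods are removed from endpoints
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 5
            failureThreshold: 3</code></pre></div>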
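<h3 id="case-4hpa-預設不啟autoscaling-失效">Case 4: HPA is off by default, so autoscaling stops working</h3>
<p><strong>Symptom</strong>: on Swarm, a cron-based autoscaling script handled scaling. After the move to K8s the script no longer works, and nothing scales up at traffic peaks.</p>
<p><strong>Root cause</strong>: K8s HPA is not on by default. It needs <em>explicit configuration</em> per workload, plus an installed metrics source (metrics-server).</p>
<p><strong>Fix</strong>:</p>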





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">autoscaling/v2</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w"></span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">HorizontalPodAutoscaler</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">webapp-hpa</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w"></span><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">  </span><span class="nt">scaleTargetRef</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">    </span><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">apps/v1</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">    </span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">Deployment</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">    </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">webapp</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">  </span><span class="nt">minReplicas</span><span class="p">:</span><span class="w"> </span><span class="m">3</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">  </span><span class="nt">maxReplicas</span><span class="p">:</span><span class="w"> </span><span class="m">20</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">  </span><span class="nt">metrics</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">    </span>- <span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l">Resource</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w">      </span><span class="nt">resource</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w">        </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">cpu</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w">        </span><span class="nt">target</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w">          </span><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l">Utilization</span><span class="w">
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="w">          </span><span class="nt">averageUtilization</span><span class="p">:</span><span class="w"> </span><span class="m">70</span></span></span></code></pre></div><p>Install metrics-server (or KEDA for event-driven autoscaling) and configure an HPA for each Deployment.</p>
<h3 id="case-5yaml-維護地獄helm--kustomize-配置遲">Case 5: YAML maintenance hell, Helm / Kustomize adopted too late</h3>
<p><strong>Symptom</strong>: after cutover, the config grows from 5 files (the Swarm stack files) to 50+ K8s manifests. Changing a single application setting means editing N files.</p>
<p><strong>Root cause</strong>: K8s YAML is <em>very verbose</em> and far less compact than docker-compose. It also has no built-in templating or environment abstraction.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Helm chart</strong>: package each application as a chart and move environment differences into <code>values.yaml</code></li>
<li><strong>Kustomize</strong>: base + overlay pattern, with no templating</li>
<li><strong>GitOps with Argo CD / Flux</strong>: declarative deployment, fewer manual kubectl operations</li>
</ol>
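<p>Fix 2 (Kustomize base + overlay) can be sketched as a shell script that generates the files. This is a minimal sketch. The directory name <code>webapp-kustomize</code>, the image <code>registry.example.com/webapp</code>, and the replica count and tag are hypothetical placeholders. The script renders the result with <code>kubectl kustomize</code> only when kubectl is installed.</p>

```shell
#!/usr/bin/env bash
# Sketch: generate a Kustomize base + production overlay.
# All names below (directory, image, tag, replica count) are hypothetical.
set -eu

root="${1:-webapp-kustomize}"
mkdir -p "$root/base" "$root/overlays/production"

# Base: the full Deployment, written once and shared by every environment.
cat > "$root/base/deployment.yaml" <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
        - name: webapp
          image: registry.example.com/webapp:1.0.0
EOF

cat > "$root/base/kustomization.yaml" <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
EOF

# Overlay: only the per-environment differences, no templating.
cat > "$root/overlays/production/kustomization.yaml" <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
replicas:
  - name: webapp
    count: 3
images:
  - name: registry.example.com/webapp
    newTag: 1.4.2
EOF

# Render the merged manifest if kubectl (which bundles kustomize) is available.
if command -v kubectl >/dev/null 2>&1; then
  kubectl kustomize "$root/overlays/production"
fi
```

<p>The production overlay states only what differs from the base (replica count, image tag). Adding a staging overlay means writing one more small <code>kustomization.yaml</code>, not another copy of the Deployment.</p>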
<h2 id="capacity--cost">Capacity / cost</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Docker Swarm</th>
          <th>Kubernetes (managed)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cluster cost (mid-tier)</td>
          <td>$300-800 / mo</td>
          <td>$500-1500 / mo (EKS / GKE / AKS control plane + nodes)</td>
      </tr>
      <tr>
          <td>Operational FTE</td>
          <td>0.3-0.8</td>
          <td>0.5-1.5 self-managed; 0.3-0.7 on managed K8s</td>
      </tr>
      <tr>
          <td>Ecosystem maturity</td>
          <td>Low, declining</td>
          <td>High, actively growing</td>
      </tr>
      <tr>
          <td>Multi-region</td>
          <td>No native support</td>
          <td>Mature multi-cluster setups (the KubeFed federation project itself is archived)</td>
      </tr>
      <tr>
          <td>Migration cost</td>
          <td>-</td>
          <td>2-4 FTE × 3-6 months</td>
      </tr>
      <tr>
          <td>Long-term ROI</td>
          <td>Negative (shrinking community)</td>
          <td>Positive (feature growth)</td>
      </tr>
  </tbody>
</table>
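<p><strong>Reading</strong>: small organizations with &lt; 30 services can stay on Swarm. At 50+ services you start hitting Swarm's ceiling, and a migration is worth evaluating. At 100+ services or multi-region, migrating is a must.</p>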
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-service-mesh-整合">Integrating with a service mesh</h3>
<p>After cutover, evaluate a service mesh (Istio / Linkerd / Cilium) for mTLS, observability and traffic policy. Treat it as a separate phase: do not roll out a mesh right after the Swarm migration.</p>
<h3 id="跟-gitops-整合">Integrating with GitOps</h3>
<p>K8s + Argo CD / Flux is a <em>natural pair</em>. Adopt GitOps directly during the migration so that manual kubectl operations never pile up.</p>
<h3 id="跟-vault--aws-secrets-manager-對齊">Aligning with <a href="/blog/backend/07-security-data-protection/vendors/hashicorp-vault/migrate-to-aws-secrets-manager/" data-link-title="Vault → AWS Secrets Manager：「secret」不是「secret」、identity model 才是核心差異" data-link-desc="Vault → AWS Secrets Manager migration 表面是 secret store 替換、實際核心是 identity model 對位（Vault token &#43; policy vs AWS IAM &#43; resource policy）；驗證 [#128](/report/data-topology-as-audit-dimension/) self-aware limitation 提出的 identity axis 候選 — identity 是否獨立 audit 軸；5 個 production 踩雷（IAM principal 對位 / dynamic credential 對等失敗 / lease lifecycle 模型不同 / audit log 結構差 / 計費模型反轉）">Vault → AWS Secrets Manager</a></h3>
<p>Swarm secrets → K8s Secret → external secrets management is a <em>3-step evolution</em>, not a single step. Use K8s Secrets during the migration, then move to Vault / Secrets Manager.</p>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Target vendor: <a href="/blog/backend/05-deployment-platform/vendors/kubernetes/" data-link-title="Kubernetes" data-link-desc="Container orchestration 主流、GKE / EKS / AKS / 自管">Kubernetes</a></li>
<li>Parallel migration playbooks (Type E): <a href="/blog/backend/03-message-queue/vendors/kafka/migrate-from-to-nats/" data-link-title="Kafka ↔ NATS：不是 migration、是 messaging paradigm 重設計" data-link-desc="Kafka 跟 NATS 不是同類產品（log-based event streaming vs subject-based messaging）、&#39;migration&#39; 字面上不成立；本文釐清兩家 paradigm 邊界、什麼情境真的能換、application 模式重設計的 5 個踩雷（consumer offset 觀念差 / retention model / exactly-once 假設 / schema registry 缺位 / fan-out 模式差）、跟 JetStream 對位 &#43; 混合架構">Kafka ↔ NATS</a> / <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-memcached/" data-link-title="Redis → Memcached：Memcached 不是 simpler Redis、是 cache paradigm" data-link-desc="Redis → Memcached 是 Type E paradigm reduction migration — 從 multi-paradigm（KV &#43; 資料結構 &#43; pub/sub &#43; Lua &#43; streams）退到 pure cache；不是「remove Redis features」、是「重新分配 Redis-specific feature 到對應 specialized 服務」；5 個 production 踩雷 &#43; paradigm reduction 路線">Redis → Memcached</a> / <a href="/blog/backend/05-deployment-platform/vendors/consul/migrate-from-etcd/" data-link-title="etcd → Consul：KV &#43; N 個 extras feature matrix" data-link-desc="etcd → Consul 是 Type E paradigm shift expansion — 從 pure KV store 升到 service mesh / discovery / health check / multi-DC；本文用對照表 &#43; paradigm expansion 路線、5 個 production 踩雷（API 對位 / lock semantics / watch event model / multi-DC topology / ACL system）">etcd → Consul</a> / <a href="/blog/backend/04-observability/vendors/honeycomb/migrate-from-sentry/" data-link-title="Sentry → Honeycomb：trace 不是 error、是不同 observability paradigm" data-link-desc="Sentry → Honeycomb 是 paradigm shift — Sentry 主軸是 error tracking &#43; transaction trace、Honeycomb 主軸是 high-cardinality wide-event observability；本文釐清 paradigm 邊界、5 個 production 踩雷（event schema 對位 / sampling 行為 / error grouping 失效 / cost 模型差 / alert paradigm shift）">Sentry → Honeycomb</a></li>
<li>Methodology: <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a></li>
</ul>
]]></content:encoded></item><item><title>Terraform → OpenTofu: drop-in at the HCL and state-file level, done once CI runners switch binaries</title><link>https://tarrragon.github.io/blog/backend/05-deployment-platform/vendors/terraform/migrate-to-opentofu/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/05-deployment-platform/vendors/terraform/migrate-to-opentofu/</guid><description>&lt;blockquote>
&lt;p>This article is a cross-vendor migration playbook, cross-linking &lt;a href="https://tarrragon.github.io/blog/backend/05-deployment-platform/vendors/terraform/" data-link-title="Terraform / OpenTofu" data-link-desc="Infrastructure as Code 主流工具">Terraform&lt;/a> (source) and OpenTofu (target). It is the standard shape of a Type B drop-in migration. Running the &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology 6-dimension audit&lt;/a> maps it to &lt;em>all 6 dimensions Low → Type B drop-in&lt;/em>. This article verifies that the skill's Type B anatomy holds in the IaC domain.&lt;/p>&lt;/blockquote>
&lt;h2 id="hcl--state-file--provider-三層-diff-sample">HCL / state file / provider: a three-layer diff sample&lt;/h2>
&lt;p>Like the earlier &lt;a href="https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/redis/migrate-to-dragonflydb/" data-link-title="Redis → DragonflyDB：drop-in 相容下的容量躍升 &amp;#43; 5 個踩雷" data-link-desc="DragonflyDB 號稱 Redis drop-in 替代、單機 throughput 25x、記憶體效率 30% 提升；遷移流程簡單但有 5 個 production 踩雷（RDB 版本差 / Lua 腳本不全支援 / Pub-Sub fanout 行為差異 / Cluster mode 兼容度 / Modules 不支援）、跟 Sentinel / Cluster 模式對位">Redis → DragonflyDB&lt;/a> playbook, this is a Type B drop-in. This article uses a code-led entry, opening with 3 diff samples that show it really is a drop-in:&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-hcl" data-lang="hcl">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="c1"># 1. HCL syntax: 完全相同 (Terraform 1.5.x baseline)
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">resource&lt;/span> &lt;span class="s2">&amp;#34;aws_s3_bucket&amp;#34; &amp;#34;logs&amp;#34;&lt;/span> {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="n"> bucket&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;myapp-logs&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="n"> tags&lt;/span> &lt;span class="o">=&lt;/span> {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">&lt;span class="n"> Env&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;production&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl"> }
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">7&lt;/span>&lt;span class="cl">}
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">8&lt;/span>&lt;span class="cl">&lt;span class="err">#&lt;/span> &lt;span class="k">兩家&lt;/span> &lt;span class="k">binary&lt;/span> &lt;span class="k">都接受&lt;/span>&lt;span class="err">、&lt;/span>&lt;span class="k">執行結果一致&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>




&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="c1"># 2. State file: 完全相同 schema&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">$ cat terraform.tfstate &lt;span class="p">|&lt;/span> jq &lt;span class="s1">&amp;#39;.version, .terraform_version&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="m">4&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="s2">&amp;#34;1.5.7&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="c1"># 切 OpenTofu 後 re-init、state 保留&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">$ tofu init
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">$ cat terraform.tfstate &lt;span class="p">|&lt;/span> jq &lt;span class="s1">&amp;#39;.version, .terraform_version&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="m">4&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="s2">&amp;#34;1.6.0&amp;#34;&lt;/span> &lt;span class="c1"># tool version 標記變、其他不變&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>




&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-hcl" data-lang="hcl">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="c1"># 3. Provider: registry 路徑唯一明顯差異
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">terraform&lt;/span> {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl"> &lt;span class="k">required_providers&lt;/span> {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="n"> aws&lt;/span> &lt;span class="o">=&lt;/span> {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="n"> source&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;hashicorp/aws&amp;#34;&lt;/span>&lt;span class="c1"> # 兩家共用 source 字串
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="n"> version&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;~&amp;gt; 5.0&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl"> }
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl"> }
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">}&lt;span class="c1">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="c1"># Terraform 從 registry.terraform.io 拉
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="err">#&lt;/span> &lt;span class="k">OpenTofu&lt;/span> &lt;span class="k">預設從&lt;/span> &lt;span class="k">registry&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">opentofu&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">org&lt;/span> &lt;span class="k">拉&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="k">fallback&lt;/span> &lt;span class="k">到&lt;/span> &lt;span class="k">terraform&lt;/span> &lt;span class="k">registry&lt;/span>&lt;span class="p">)&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>3 層 diff sample 顯示：HCL / state schema / 主流 provider 配置完全相容；唯一明顯差異在 &lt;em>registry routing&lt;/em>。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is a cross-vendor migration playbook, cross-linking <a href="/blog/backend/05-deployment-platform/vendors/terraform/" data-link-title="Terraform / OpenTofu" data-link-desc="Infrastructure as Code 主流工具">Terraform</a> (source) and OpenTofu (target). It is the standard shape of a Type B drop-in migration. Running the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology 6-dimension audit</a> maps it to <em>all 6 dimensions Low → Type B drop-in</em>. This article verifies that the skill's Type B anatomy holds in the IaC domain.</p></blockquote>
<h2 id="hcl--state-file--provider-三層-diff-sample">HCL / state file / provider: a three-layer diff sample</h2>
<p>Like the earlier <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-dragonflydb/" data-link-title="Redis → DragonflyDB：drop-in 相容下的容量躍升 &#43; 5 個踩雷" data-link-desc="DragonflyDB 號稱 Redis drop-in 替代、單機 throughput 25x、記憶體效率 30% 提升；遷移流程簡單但有 5 個 production 踩雷（RDB 版本差 / Lua 腳本不全支援 / Pub-Sub fanout 行為差異 / Cluster mode 兼容度 / Modules 不支援）、跟 Sentinel / Cluster 模式對位">Redis → DragonflyDB</a> playbook, this is a Type B drop-in. This article uses a code-led entry, opening with 3 diff samples that show it really is a drop-in:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 1. HCL syntax: identical (Terraform 1.5.x baseline)
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">resource</span> <span class="s2">&#34;aws_s3_bucket&#34; &#34;logs&#34;</span> {
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="n">  bucket</span> <span class="o">=</span> <span class="s2">&#34;myapp-logs&#34;</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="n">  tags</span> <span class="o">=</span> {
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="n">    Env</span> <span class="o">=</span> <span class="s2">&#34;production&#34;</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">  }
</span></span><span class="line"><span class="ln">7</span><span class="cl">}
</span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="c1"># Both binaries accept this and produce the same result</span></span></span></code></pre></div>




<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 2. State file: identical schema</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">$ cat terraform.tfstate <span class="p">|</span> jq <span class="s1">&#39;.version, .terraform_version&#39;</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="m">4</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="s2">&#34;1.5.7&#34;</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
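</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"># After switching to OpenTofu: init keeps the state; the next state write (apply) records the new version</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">$ tofu init <span class="o">&amp;&amp;</span> tofu apply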
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">$ cat terraform.tfstate <span class="p">|</span> jq <span class="s1">&#39;.version, .terraform_version&#39;</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="m">4</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="s2">&#34;1.6.0&#34;</span>  <span class="c1"># only the tool version field changes</span></span></span></code></pre></div>




<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 3. Provider: the registry endpoint is the only notable difference
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">terraform</span> {
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">  <span class="k">required_providers</span> {
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">    aws</span> <span class="o">=</span> {
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="n">      source</span>  <span class="o">=</span> <span class="s2">&#34;hashicorp/aws&#34;</span><span class="c1">     # same source string in both tools
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"></span><span class="n">      version</span> <span class="o">=</span> <span class="s2">&#34;~&gt; 5.0&#34;</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">    }
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">  }
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">}<span class="c1">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"># Terraform resolves it against registry.terraform.io
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"># OpenTofu resolves the same string against registry.opentofu.org by default</span></span></span></code></pre></div><p>The three-layer diff sample shows that HCL, the state schema, and mainstream provider configuration are fully compatible. The only notable difference is <em>registry routing</em>.</p>
<p>Running the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">6-dimension diff audit</a>:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Assessment</th>
          <th>Level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>HCL fully compatible; CLI commands map one-to-one (terraform → tofu)</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>Same workflow (init / plan / apply)</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>Same declarative IaC model</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Components</td>
          <td>Both ship as a single binary</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>None (it is an infrastructure tool, not an application)</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Data topology</td>
          <td>Same single-state-file backend model</td>
          <td>Low</td>
      </tr>
  </tbody>
</table>
<p>All 6 dimensions Low → Type B drop-in.</p>
<h2 id="為什麼遷license--governance--community-三條-driver">Why migrate: three drivers (license / governance / community)</h2>
<p>The earlier <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-dragonflydb/" data-link-title="Redis → DragonflyDB：drop-in 相容下的容量躍升 &#43; 5 個踩雷" data-link-desc="DragonflyDB 號稱 Redis drop-in 替代、單機 throughput 25x、記憶體效率 30% 提升；遷移流程簡單但有 5 個 production 踩雷（RDB 版本差 / Lua 腳本不全支援 / Pub-Sub fanout 行為差異 / Cluster mode 兼容度 / Modules 不支援）、跟 Sentinel / Cluster 模式對位">Redis → DragonflyDB</a> migration was driven by cost and performance. Terraform → OpenTofu is driven mainly by governance:</p>
<table>
  <thead>
      <tr>
          <th>Driver</th>
          <th>Trigger scenario</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>License</strong></td>
          <td>In 2023-08, Terraform moved to the BSL (Business Source License), which restricts use in offerings that compete with HashiCorp; OpenTofu stays open source under MPL 2.0</td>
      </tr>
      <tr>
          <td><strong>Vendor neutrality</strong></td>
          <td>Multi-cloud / multi-client setups that want to avoid HashiCorp lock-in can use OpenTofu, which is governed under the Linux Foundation</td>
      </tr>
      <tr>
          <td><strong>Community / feature</strong></td>
          <td>OpenTofu adds its own features, such as state encryption (1.7+), that set it apart from commercial Terraform; features are community-driven</td>
      </tr>
  </tbody>
</table>
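<p>Drivers in the opposite direction (OpenTofu → Terraform):</p>
<ul>
<li>Dependence on specific Terraform Cloud / Enterprise features (e.g. Sentinel for policy as code, which has no direct OpenTofu counterpart; OPA-based tools are the usual substitute)</li>
<li>Existing modules that are maintained in the Terraform registry and not yet published to the OpenTofu registry</li>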
</ul>
<h2 id="相容性-audit">Compatibility audit</h2>
<p>Run before cutover:</p>
<table>
  <thead>
      <tr>
          <th>議題</th>
          <th>處理方式</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Terraform version pin (<code>required_version = &quot;&gt;= 1.5.0, &lt; 1.6.0&quot;</code>)</td>
          <td>Relax to <code>&gt;= 1.6.0</code> or drop the upper bound, since OpenTofu checks the constraint against its own version (which starts at 1.6.0)</td>
      </tr>
      <tr>
          <td>Provider source (registry path)</td>
          <td>Mainstream providers (aws / azurerm / gcp / k8s) use the same source string; for in-house / third-party providers, confirm they are available in the OpenTofu registry</td>
      </tr>
      <tr>
          <td>Terraform Cloud / Enterprise features</td>
          <td>Sentinel policies → OPA / Conftest (third-party policy tools, not built into OpenTofu); check workspace API parity item by item</td>
      </tr>
      <tr>
          <td>CLI binary name in CI pipelines</td>
          <td><code>terraform plan</code> → <code>tofu plan</code>, or keep compatibility with a <code>terraform</code> → <code>tofu</code> alias / symlink</td>
      </tr>
      <tr>
          <td>State backend (S3 / GCS / Azure / Consul / Terraform Cloud)</td>
          <td>S3 / GCS / Azure are fully compatible; both tools support the Consul backend; Terraform Cloud uses its own remote backend and does not carry over directly</td>
      </tr>
      <tr>
          <td>Module source</td>
          <td>git-based modules are fully compatible; for registry modules, confirm they are available in the OpenTofu registry</td>
      </tr>
  </tbody>
</table>
<p>Audit output: a "100% drop-in" block plus a "needs handling" block. The latter usually covers &lt; 5% of the code.</p>
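<p>Part of this audit can be automated with grep. This is a minimal sketch. It assumes the IaC code and the CI files are checked out under one directory. The helper names <code>audit_report</code> and <code>iac_audit</code> are hypothetical. The patterns are heuristics for four rows of the table (version upper bound, Terraform Cloud private registry sources, Terraform Cloud backend, hard-coded binary name); they are not an HCL parser.</p>

```shell
#!/usr/bin/env bash
# Sketch of a pre-cutover grep audit (hypothetical helpers, heuristic patterns).

# audit_report LABEL REGEX DIR [grep --include options...]
# Prints LABEL plus the matching lines; returns 0 only if something matched.
audit_report() {
  local label="$1" pattern="$2" dir="$3"
  shift 3
  local hits
  hits=$(grep -rnE "$@" -e "$pattern" "$dir" 2>/dev/null)
  [ -n "$hits" ] || return 1
  printf '[needs handling] %s\n' "$label"
  printf '%s\n' "$hits" | sed 's/^/    /'
}

# iac_audit DIR: lists each "needs handling" category; returns 1 if any were found.
iac_audit() {
  local dir="${1:-.}" found=0
  if audit_report 'required_version has an upper bound' \
      'required_version[[:space:]]*=[[:space:]]*"[^"]*<' "$dir" --include='*.tf'; then
    found=$((found + 1))
  fi
  if audit_report 'source from the Terraform Cloud private registry' \
      'source[[:space:]]*=[[:space:]]*"app\.terraform\.io/' "$dir" --include='*.tf'; then
    found=$((found + 1))
  fi
  if audit_report 'Terraform Cloud backend (cloud block or remote backend)' \
      '^[[:space:]]*(cloud[[:space:]]*[{]|backend[[:space:]]+"remote")' "$dir" --include='*.tf'; then
    found=$((found + 1))
  fi
  if audit_report 'hard-coded terraform binary in CI / scripts' \
      '(^|[^[:alnum:]_.-])terraform[[:space:]]+(init|plan|apply|destroy|validate)' "$dir" \
      --include='*.yml' --include='*.yaml' --include='*.sh' --include='Makefile'; then
    found=$((found + 1))
  fi
  printf '%d categories need handling\n' "$found"
  [ "$found" -eq 0 ]
}
```

<p>Usage: <code>iac_audit ./infra</code>. Each printed category maps to a row in the "needs handling" block, and everything that produces no output falls into the drop-in block.</p>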
<h2 id="step-by-step-cutover">Step-by-step cutover</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 1. Install OpenTofu (per OS)</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">brew install opentofu                <span class="c1"># macOS</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">snap install --classic opentofu      <span class="c1"># Ubuntu</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"># https://opentofu.org/docs/intro/install/</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"># 2. Run tofu init in the workspace</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">$ <span class="nb">cd</span> terraform-workspace/
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">$ tofu init -upgrade
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"># Upgrades providers / modules, re-initializes the backend, keeps the state</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"># 3. Plan diff (should show 0 changes)</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">$ tofu plan
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1"># Plan: 0 to add, 0 to change, 0 to destroy.</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1"># Any diff means provider versions are misaligned; check the lock file</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="c1"># 4. Apply (to be safe, run it on staging first)</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">$ tofu apply
</span></span><span class="line"><span class="ln">18</span><span class="cl">
</span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="c1"># 5. Switch the binary in CI / CD pipelines</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="c1"># Before</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl">terraform init
</span></span><span class="line"><span class="ln">22</span><span class="cl">terraform plan -out<span class="o">=</span>tfplan
</span></span><span class="line"><span class="ln">23</span><span class="cl">terraform apply tfplan
</span></span><span class="line"><span class="ln">24</span><span class="cl">
</span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="c1"># After</span>
</span></span><span class="line"><span class="ln">26</span><span class="cl">tofu init
</span></span><span class="line"><span class="ln">27</span><span class="cl">tofu plan -out<span class="o">=</span>tfplan
</span></span><span class="line"><span class="ln">28</span><span class="cl">tofu apply tfplan
</span></span><span class="line"><span class="ln">29</span><span class="cl"><span class="c1"># Or keep the literal terraform name via an alias / symlink</span></span></span></code></pre></div><p>A cutover usually takes &lt; 1 day for a single workspace. Organizations with many workspaces switch them one at a time over 1-4 weeks, depending on scale.</p>
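<p>The "0 changes" check in step 3 can be enforced in CI instead of read by eye. This is a minimal sketch. It assumes the binary name comes from a hypothetical <code>IAC_BINARY</code> variable that defaults to <code>tofu</code>. <code>plan -detailed-exitcode</code> exits 0 when there are no changes, 2 when changes are present, and 1 on error, so the cutover stops on anything but 0.</p>

```shell
#!/usr/bin/env bash
# Sketch: fail the cutover unless the plan is empty.
# IAC_BINARY is a hypothetical variable; it defaults to tofu.

assert_no_changes() {
  local bin="${IAC_BINARY:-tofu}" rc=0
  # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present
  "$bin" plan -detailed-exitcode -input=false || rc=$?
  case "$rc" in
    0) echo "plan: 0 changes, safe to continue the cutover" ;;
    2) echo "plan has changes: check provider versions and .terraform.lock.hcl" >&2 ;;
    *) echo "plan failed with exit code $rc" >&2 ;;
  esac
  return "$rc"
}

# In a CI job:  assert_no_changes || exit 1
```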
<h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1provider-version-driftstaging-plan-出現意外-diff">Case 1: Provider version drift, unexpected diff in the staging plan</h3>
<p><strong>Symptom</strong>: <code>tofu plan</code> shows in-place updates on 100+ resources even though no business config was changed.</p>
<p><strong>Root cause</strong>: the provider versions pinned in <code>.terraform.lock.hcl</code> do not line up between the Terraform and OpenTofu registries (same version, slightly different binary checksums). OpenTofu fetches the new checksums at init and treats the provider as changed.</p>
<p><strong>Fix</strong>:</p>
<ol>
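<li><strong>Align up front</strong>: run <code>tofu init -upgrade</code> to rebuild the lock file and record the OpenTofu-side checksums in it</li>
<li><strong>Commit the lock file for CI</strong>: keep the lock file under version control, and align it before a different binary runs against it</li>
<li><strong>If the plan still differs</strong>: the cause is usually that the provider's internal schema handles nil values differently. Suppress the diff temporarily with <code>lifecycle.ignore_changes</code>, then fix each item.</li>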
</ol>
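<h3 id="case-2state-file-lock-機制微差">Case 2: Subtle differences in the state lock mechanism</h3>
<p><strong>Symptom</strong>: two CI pipelines run <code>tofu apply</code> at the same time. The lock should reject one of them, but both run, causing a race condition in production.</p>
<p><strong>Root cause</strong>: Terraform's DynamoDB lock and OpenTofu's lock share the same schema but use slightly different lock_id rules. When stale lock entries remain, OpenTofu fails to parse them, treats the state as unlocked, and carries on.</p>
<p><strong>Fix</strong>:</p>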
<ol>
<li><strong>Manually clear stale entries from the DynamoDB lock table</strong>: during cutover, remove old locks first with <code>aws dynamodb delete-item</code></li>
<li><strong>One-way traffic switch</strong>: freeze all CI during cutover and let only one pipeline run, to avoid races</li>
<li><strong>Architecture</strong>: use a <em>fully replicated lock backend</em> (e.g. Consul) to avoid backend-specific lock quirks</li>
</ol>
<h3 id="case-3terraform-cloud-workspace-不能直接搬">Case 3: Terraform Cloud workspaces cannot be moved directly</h3>
<p><strong>Symptom</strong>: the team already runs 100+ pipelines in Terraform Cloud workspaces and wants to switch to OpenTofu. It then finds that <code>terraform login</code>, the workspace API, and the VCS integration are all HashiCorp-specific.</p>
<p><strong>Root cause</strong>: OpenTofu has no counterpart to the Terraform Cloud service. A self-managed setup pairs a backend such as S3 with a third-party platform (Atlantis / Spacelift / env0), and that is not a 1:1 replacement.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Keep Terraform Cloud for production</strong> (OpenTofu does not replace it) and use OpenTofu for dev / sandbox</li>
<li><strong>Move off Terraform Cloud</strong>: migrate state to S3 and run PR-based plan/apply with Atlantis (mature open source)</li>
<li><strong>Evaluate Spacelift / env0</strong> as commercial alternatives that support OpenTofu and offer comparable workspace features</li>
</ol>
<h3 id="case-4ci-pipeline-寫死-terraform-binary-name">Case 4: CI pipelines hard-code the <code>terraform</code> binary name</h3>
<p><strong>Symptom</strong>: after cutover, CI runs <code>terraform plan</code> and gets "command not found". The team's 100+ pipelines, GitHub Actions, GitLab CI jobs, and shell scripts all hard-code <code>terraform</code>.</p>
<p><strong>Root cause</strong>: the rollout plan never grepped the whole organization for references to the binary name.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Alias strategy</strong>: in the CI image, run <code>ln -s /usr/local/bin/tofu /usr/local/bin/terraform</code> and keep this compatibility shim for 1-3 months</li>
<li><strong>Gradually switch to <code>tofu</code></strong>: update pipeline files together with the IaC team, and remove the alias only once 100% of them are migrated</li>
<li><strong>Architecture</strong>: avoid hard-coding the binary in pipelines / scripts; use an env variable such as <code>IAC_BINARY=${IAC_BINARY:-tofu}</code></li>
</ol>
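<p>Fix 3 can be sketched as a small wrapper that every pipeline step sources. This is a minimal sketch. It assumes the <code>IAC_BINARY</code> variable from the list above, and the function name <code>iac</code> is hypothetical. With this wrapper, pointing a single job back at Terraform only takes <code>IAC_BINARY=terraform</code>.</p>

```shell
#!/usr/bin/env bash
# Sketch: resolve the IaC binary once; pipeline steps call `iac`, never a hard-coded name.
IAC_BINARY="${IAC_BINARY:-tofu}"

iac() {
  # Fail loudly (exit 127, like "command not found") if the binary is missing.
  if ! command -v "$IAC_BINARY" >/dev/null 2>&1; then
    echo "iac: '$IAC_BINARY' not found on PATH" >&2
    return 127
  fi
  "$IAC_BINARY" "$@"
}

# Example pipeline steps:
#   iac init -input=false
#   iac plan -out=tfplan
#   iac apply tfplan
```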
<h3 id="case-5registry-routing自家-module-拉不到">Case 5: Registry routing, in-house modules cannot be fetched</h3>
<p><strong>Symptom</strong>: after cutover, <code>tofu init</code> reports "not found" for private in-house modules. The same modules work fine under Terraform.</p>
<p><strong>Root cause</strong>: the private modules are registered in the <em>Terraform Cloud private registry</em>, an endpoint OpenTofu does not know about by default. The registry source URL has to be set explicitly.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Explicit source URL</strong>: replace <code>source = &quot;app.terraform.io/myorg/myapp/aws&quot;</code> with a git source or a self-hosted module registry</li>
<li><strong>Architecture</strong>: use git-based module sources (<code>source = &quot;git::ssh://git@github.com/myorg/myapp.git&quot;</code>) to avoid registry lock-in</li>
<li><strong>Long term</strong>: publish in-house modules to the OpenTofu registry, Terraform Cloud, and git at the same time, for cross-tool compatibility</li>
</ol>
<h2 id="capacity--cost">Capacity / cost</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Terraform</th>
          <th>OpenTofu</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Binary cost</td>
          <td>Free (community edition)</td>
          <td>Free (always)</td>
      </tr>
      <tr>
          <td>Terraform Cloud cost</td>
          <td>$20 / user / month; enterprise tiers cost more</td>
          <td>No equivalent service (use Atlantis / Spacelift / env0)</td>
      </tr>
      <tr>
          <td>State storage</td>
          <td>S3 / self-managed backend; low</td>
          <td>S3 / self-managed backend; low</td>
      </tr>
      <tr>
          <td>Migration cost</td>
          <td>-</td>
          <td>1-5 person-days (including audit, cutover, and CI changes)</td>
      </tr>
      <tr>
          <td>License risk</td>
          <td>BSL restricts commercial use in competing products</td>
          <td>MPL 2.0 open source; no license risk</td>
      </tr>
      <tr>
          <td>Long-term governance</td>
          <td>Single vendor (HashiCorp)</td>
          <td>Linux Foundation, with multi-vendor contributions</td>
      </tr>
  </tbody>
</table>
<p><strong>Takeaway</strong>: for pure IaC users, switching to OpenTofu is low-risk and removes license risk. Organizations that rely heavily on Terraform Cloud features should either stay or evaluate commercial alternatives (Spacelift / env0).</p>
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-atlantis--spacelift--env0-整合">Integrating with <a href="https://www.runatlantis.io/">Atlantis / Spacelift / env0</a></h3>
<p>OpenTofu has no Terraform Cloud equivalent, so it needs a third-party orchestrator:</p>
<ul>
<li><strong>Atlantis</strong>: self-hosted, open source, and lightweight; suits small to mid-size teams</li>
<li><strong>Spacelift</strong>: SaaS with policy as code and first-class OpenTofu support</li>
<li><strong>env0</strong>: SaaS with cost estimation and full-featured workflows</li>
</ul>
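<h3 id="跟-terragrunt-整合">Integrating with <a href="https://terragrunt.gruntwork.io/">Terragrunt</a></h3>
<p>Terragrunt, a wrapper shared by OpenTofu and Terraform, already supports OpenTofu 1.6+. The multi-environment configuration abstraction stays in place, and swapping the underlying binary is transparent.</p>
<h3 id="反向-migrationopentofu--terraform">Reverse migration (OpenTofu → Terraform)</h3>
<p>This is rare; it usually happens only when an organization signs a commercial contract tied to HashiCorp Enterprise. The process mirrors the forward direction. Watch for OpenTofu-only features added after the fork, such as state encryption (1.7) and provider <code>for_each</code> (1.9), which Terraform may lack.</p>
<h3 id="下一步議題">Next topics</h3>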
<ul>
<li><strong>State encryption (OpenTofu 1.7+)</strong>: encrypts sensitive state; only Terraform's commercial offering has an equivalent feature</li>
<li><strong>Other IaC tools (Pulumi / CDK)</strong>: Pulumi and AWS CDK follow a different (imperative) paradigm and are outside the scope of this migration</li>
<li><strong>Long-term provider ecosystem split</strong>: the two registries evolve independently, so review provider compatibility quarterly</li>
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Source vendor: <a href="/blog/backend/05-deployment-platform/vendors/terraform/" data-link-title="Terraform / OpenTofu" data-link-desc="Infrastructure as Code 主流工具">Terraform</a></li>
<li>Parallel migration playbook (Type B): <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-dragonflydb/" data-link-title="Redis → DragonflyDB：drop-in 相容下的容量躍升 &#43; 5 個踩雷" data-link-desc="DragonflyDB 號稱 Redis drop-in 替代、單機 throughput 25x、記憶體效率 30% 提升；遷移流程簡單但有 5 個 production 踩雷（RDB 版本差 / Lua 腳本不全支援 / Pub-Sub fanout 行為差異 / Cluster mode 兼容度 / Modules 不支援）、跟 Sentinel / Cluster 模式對位">Redis → DragonflyDB</a></li>
<li>Methodology: <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> / <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">#127 Process content structure is determined by the largest diff dimension</a></li>
</ul>
]]></content:encoded></item><item><title>etcd → Consul: KV + N extras feature matrix</title><link>https://tarrragon.github.io/blog/backend/05-deployment-platform/vendors/consul/migrate-from-etcd/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/05-deployment-platform/vendors/consul/migrate-from-etcd/</guid><description>&lt;blockquote>
&lt;p>This is a cross-vendor migration playbook that cross-links &lt;a href="https://etcd.io/">etcd&lt;/a> and &lt;a href="https://tarrragon.github.io/blog/backend/05-deployment-platform/vendors/consul/" data-link-title="Consul" data-link-desc="Service registry / mesh / KV / DNS">Consul&lt;/a>. Running the &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology 6-dimension audit&lt;/a> maps it to &lt;em>Paradigm = High (pure KV → service mesh paradigm) → Type E paradigm shift&lt;/em>. It is the dual of &lt;a href="https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/redis/migrate-to-memcached/" data-link-title="Redis → Memcached：Memcached 不是 simpler Redis、是 cache paradigm" data-link-desc="Redis → Memcached 是 Type E paradigm reduction migration — 從 multi-paradigm（KV &amp;#43; 資料結構 &amp;#43; pub/sub &amp;#43; Lua &amp;#43; streams）退到 pure cache；不是「remove Redis features」、是「重新分配 Redis-specific feature 到對應 specialized 服務」；5 個 production 踩雷 &amp;#43; paradigm reduction 路線">Redis → Memcached&lt;/a> (paradigm reduction); this article covers the &lt;em>paradigm expansion&lt;/em> (upgrade) direction.&lt;/p>&lt;/blockquote>
&lt;h2 id="kv--n-個-extrasfeature-matrix">KV + N extras: feature matrix&lt;/h2>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Concept&lt;/th>
 &lt;th>etcd&lt;/th>
 &lt;th>Consul&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Core paradigm&lt;/td>
 &lt;td>Pure KV with Raft consensus&lt;/td>
 &lt;td>Service mesh (KV + 6 other capabilities)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Data store&lt;/td>
 &lt;td>KV with versioned values + watch&lt;/td>
 &lt;td>KV + service catalog + health checks + sessions&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>API style&lt;/td>
 &lt;td>gRPC + HTTP/REST&lt;/td>
 &lt;td>HTTP/REST + gRPC (Connect) + DNS&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Service discovery&lt;/td>
 &lt;td>None (application-managed)&lt;/td>
 &lt;td>Built-in (DNS / HTTP API)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Health check&lt;/td>
 &lt;td>None&lt;/td>
 &lt;td>Built-in (HTTP / TCP / script / TTL)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Service mesh&lt;/td>
 &lt;td>None&lt;/td>
 &lt;td>Connect (mTLS + intentions + service-to-service)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Multi-DC&lt;/td>
 &lt;td>Not supported (per-cluster only)&lt;/td>
 &lt;td>Built-in WAN federation&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>ACL system&lt;/td>
 &lt;td>RBAC (etcd 3.5+)&lt;/td>
 &lt;td>Token-based ACL + namespaces (Enterprise)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Lock primitive&lt;/td>
 &lt;td>Lease + transaction&lt;/td>
 &lt;td>Session + KV check-and-set&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Watch event model&lt;/td>
 &lt;td>Event stream (gRPC stream)&lt;/td>
 &lt;td>Long-polling blocking query (X-Consul-Index)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Distributed config&lt;/td>
 &lt;td>KV + watch&lt;/td>
 &lt;td>KV + watch + template rendering (consul-template)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Use case 對映&lt;/td>
 &lt;td>K8s control plane / pure distributed KV&lt;/td>
 &lt;td>Service mesh + service discovery + config + KV&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>The core difference is not "Consul has more features" but "Consul is a service mesh paradigm"&lt;/strong>: service discovery, health checks, and Connect mTLS are first-class, and KV is only one sub-feature.&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This is a cross-vendor migration playbook that cross-links <a href="https://etcd.io/">etcd</a> and <a href="/blog/backend/05-deployment-platform/vendors/consul/" data-link-title="Consul" data-link-desc="Service registry / mesh / KV / DNS">Consul</a>. Running the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology 6-dimension audit</a> maps it to <em>Paradigm = High (pure KV → service mesh paradigm) → Type E paradigm shift</em>. It is the dual of <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-memcached/" data-link-title="Redis → Memcached：Memcached 不是 simpler Redis、是 cache paradigm" data-link-desc="Redis → Memcached 是 Type E paradigm reduction migration — 從 multi-paradigm（KV &#43; 資料結構 &#43; pub/sub &#43; Lua &#43; streams）退到 pure cache；不是「remove Redis features」、是「重新分配 Redis-specific feature 到對應 specialized 服務」；5 個 production 踩雷 &#43; paradigm reduction 路線">Redis → Memcached</a> (paradigm reduction); this article covers the <em>paradigm expansion</em> (upgrade) direction.</p></blockquote>
<h2 id="kv--n-個-extrasfeature-matrix">KV + N extras: feature matrix</h2>
<table>
  <thead>
      <tr>
          <th>Concept</th>
          <th>etcd</th>
          <th>Consul</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Core paradigm</td>
          <td>Pure KV with Raft consensus</td>
          <td>Service mesh (KV + 6 other capabilities)</td>
      </tr>
      <tr>
          <td>Data store</td>
          <td>KV with versioned values + watch</td>
          <td>KV + service catalog + health checks + sessions</td>
      </tr>
      <tr>
          <td>API style</td>
          <td>gRPC + HTTP/REST</td>
          <td>HTTP/REST + gRPC (Connect) + DNS</td>
      </tr>
      <tr>
          <td>Service discovery</td>
          <td>None (application-managed)</td>
          <td>Built-in (DNS / HTTP API)</td>
      </tr>
      <tr>
          <td>Health check</td>
          <td>None</td>
          <td>Built-in (HTTP / TCP / script / TTL)</td>
      </tr>
      <tr>
          <td>Service mesh</td>
          <td>None</td>
          <td>Connect (mTLS + intentions + service-to-service)</td>
      </tr>
      <tr>
          <td>Multi-DC</td>
          <td>Not supported (per-cluster only)</td>
          <td>Built-in WAN federation</td>
      </tr>
      <tr>
          <td>ACL system</td>
          <td>RBAC (etcd 3.5+)</td>
          <td>Token-based ACL + namespaces (Enterprise)</td>
      </tr>
      <tr>
          <td>Lock primitive</td>
          <td>Lease + transaction</td>
          <td>Session + KV check-and-set</td>
      </tr>
      <tr>
          <td>Watch event model</td>
          <td>Event stream (gRPC stream)</td>
          <td>Long-polling blocking query (X-Consul-Index)</td>
      </tr>
      <tr>
          <td>Distributed config</td>
          <td>KV + watch</td>
          <td>KV + watch + template rendering (consul-template)</td>
      </tr>
      <tr>
          <td>Use case 對映</td>
          <td>K8s control plane / pure distributed KV</td>
          <td>Service mesh + service discovery + config + KV</td>
      </tr>
  </tbody>
</table>
<p><strong>The core difference is not "Consul has more features" but "Consul is a service mesh paradigm"</strong>: service discovery, health checks, and Connect mTLS are first-class, and KV is only one sub-feature.</p>
<p>Running the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">6-dimension diff audit</a>:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Assessment</th>
          <th>Level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>KV API maps 1:1, plus N extra APIs</td>
          <td>Medium</td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>Both Raft-based; similar operations</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>Pure KV → service mesh</td>
          <td><strong>High</strong></td>
      </tr>
      <tr>
          <td>Components</td>
          <td>Still a single cluster</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>KV API changes, plus new service registration / health checks</td>
          <td>Medium</td>
      </tr>
      <tr>
          <td>Data topology</td>
          <td>Single DC → multi-DC (if using federation)</td>
          <td>Low-Medium</td>
      </tr>
  </tbody>
</table>
<p>Paradigm = High (all others Low-Medium) → <strong>Type E paradigm shift</strong>. KV is a sub-feature, not the entire migration scope.</p>
<h2 id="為什麼遷3-條-expansion-driver">Why migrate: 3 expansion drivers</h2>
<ul>
<li><strong>Service mesh adoption</strong>: etcd was originally there to run the K8s control plane. Now the application side needs a service mesh (mTLS, intentions, traffic shifting), and Consul covers all of it in one product</li>
<li><strong>Multi-DC strategy</strong>: etcd does not span DCs, which forces active-passive failover; Consul WAN federation supports active-active across multiple DCs</li>
<li><strong>Configuration management</strong>: consul-template and envconsul are simpler than etcd watch plus a hand-written reloader</li>
</ul>
<p>Reverse drivers (Consul → etcd):</p>
<ul>
<li>A pure K8s control plane scenario with no need for service discovery, health checks, or a mesh, where etcd is simple and sufficient</li>
<li>Resource constraints: Consul agents use more resources than etcd and may not fit on low-end VMs</li>
</ul>
<h2 id="paradigm-expansion-路線">Paradigm expansion path</h2>
<p>As the dual of <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-memcached/" data-link-title="Redis → Memcached：Memcached 不是 simpler Redis、是 cache paradigm" data-link-desc="Redis → Memcached 是 Type E paradigm reduction migration — 從 multi-paradigm（KV &#43; 資料結構 &#43; pub/sub &#43; Lua &#43; streams）退到 pure cache；不是「remove Redis features」、是「重新分配 Redis-specific feature 到對應 specialized 服務」；5 個 production 踩雷 &#43; paradigm reduction 路線">Redis → Memcached paradigm reduction</a> (which removes features), moving to Consul <em>adds features</em>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">etcd KV pattern         → Consul KV API (1:1 mapping)
</span></span><span class="line"><span class="cl">etcd watch              → Consul blocking query / consul-template
</span></span><span class="line"><span class="cl">etcd lease + lock       → Consul session + KV CAS
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">(newly added)
</span></span><span class="line"><span class="cl">(none)                  → Consul service registration (services.json / API)
</span></span><span class="line"><span class="cl">(none)                  → Consul health check (HTTP / TCP / TTL)
</span></span><span class="line"><span class="cl">(none)                  → Consul service discovery (DNS / HTTP)
</span></span><span class="line"><span class="cl">(none)                  → Consul Connect (mTLS + intentions)
</span></span><span class="line"><span class="cl">(none)                  → Consul WAN federation (multi-DC)
</span></span><span class="line"><span class="cl">(none)                  → Consul ACL token + policy</span></span></code></pre></div><p>The migration is not just a KV API mapping; it <em>adds capabilities</em> to the application.</p>
<h2 id="api-對位">API mapping</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># etcd basic KV</span>
</span></span><span class="line"><span class="cl">etcdctl put /myapp/config/db_url <span class="s1">&#39;postgres://...&#39;</span>
</span></span><span class="line"><span class="cl">etcdctl get /myapp/config/db_url
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Consul KV (equivalent; the Consul keys here drop the leading slash)</span>
</span></span><span class="line"><span class="cl">consul kv put myapp/config/db_url <span class="s1">&#39;postgres://...&#39;</span>
</span></span><span class="line"><span class="cl">consul kv get myapp/config/db_url</span></span></code></pre></div>
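<p>Because the KV calls map 1:1, the initial data copy can be scripted. Below is a minimal sketch. It assumes the <code>etcd3</code> and <code>python-consul</code> clients used in the application example, and it follows this article's convention of dropping the leading slash on the Consul side. Only current values are copied; leases, revisions, and watches are not.</p>

```python
def etcd_key_to_consul(key):
    """Map an etcd key such as /myapp/config/db_url to myapp/config/db_url."""
    if isinstance(key, bytes):
        key = key.decode("utf-8")
    return key.lstrip("/")


def copy_prefix(etcd_client, consul_client, prefix):
    """Copy every key under `prefix` from etcd to Consul KV; return the count.

    etcd_client: an etcd3 client; get_prefix() yields (value, metadata) pairs.
    consul_client: a python-consul Consul instance; kv.put() writes one key.
    """
    copied = 0
    for value, meta in etcd_client.get_prefix(prefix):
        consul_client.kv.put(etcd_key_to_consul(meta.key), value)
        copied += 1
    return copied
```

<p>Usage, with clients built as in the application example: <code>copy_prefix(etcd3.client(host='etcd', port=2379), consul.Consul(host='consul', port=8500), '/myapp/config/')</code>.</p>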




<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># etcd watch</span>
</span></span><span class="line"><span class="cl">etcdctl watch --prefix /myapp/config/
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Consul blocking query (long polling); -i prints the response headers</span>
</span></span><span class="line"><span class="cl">curl -i <span class="s1">&#39;http://consul:8500/v1/kv/myapp/config?recurse&amp;index=5&amp;wait=10s&#39;</span>
</span></span><span class="line"><span class="cl"><span class="c1"># the X-Consul-Index response header is the watch cursor: pass it as ?index= on the next request</span></span></span></code></pre></div>
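<p>A blocking query is a loop: send the last seen <code>X-Consul-Index</code>, wait for the response, and repeat. Below is a minimal standard-library sketch. It assumes an agent at <code>base_url</code> and no ACL token. KV values come back base64-encoded in the JSON body. The index reset follows Consul's blocking-query guidance to start over if the index ever goes backwards.</p>

```python
import base64
import json
import urllib.error
import urllib.request


def next_index(prev, new):
    # If the index goes backwards (e.g. after a snapshot restore), start over at 0.
    return 0 if new < prev else new


def decode_entries(entries):
    """Map Key -> decoded bytes; Value is base64 in the JSON (None for folder keys)."""
    return {
        e["Key"]: base64.b64decode(e["Value"]) if e["Value"] is not None else None
        for e in entries
    }


def watch_prefix(base_url, prefix, on_change, wait_s=10, rounds=None):
    """Long-poll /v1/kv/{prefix}?recurse and call on_change(dict) when the index moves."""
    index, n = 0, 0
    while rounds is None or n < rounds:
        n += 1
        url = f"{base_url}/v1/kv/{prefix}?recurse&index={index}&wait={wait_s}s"
        try:
            # Consul adds jitter (up to wait/16), so the socket timeout needs slack.
            with urllib.request.urlopen(url, timeout=wait_s * 2) as resp:
                new_index = int(resp.headers["X-Consul-Index"])
                entries = json.loads(resp.read())
        except urllib.error.HTTPError as err:
            if err.code != 404:  # 404 means no keys under the prefix yet
                raise
            new_index = int(err.headers.get("X-Consul-Index", index))
            entries = []
        if new_index != index:
            on_change(decode_entries(entries))
        index = next_index(index, new_index)
```

<p>Usage: <code>watch_prefix("http://consul:8500", "myapp/config", print)</code>. Each callback receives a full snapshot of the prefix rather than a per-key event, so handlers that expect etcd-style events have to diff consecutive snapshots.</p>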




<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># etcd transaction (atomic compare-then-put): create the key only if it does not exist (mod revision 0)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># non-interactive stdin format: compares, blank line, success ops, blank line, failure ops</span>
</span></span><span class="line"><span class="cl">etcdctl txn <span class="s">&lt;&lt;EOF
</span></span></span><span class="line"><span class="cl"><span class="s">mod(&#34;/myapp/lock&#34;) = &#34;0&#34;
</span></span></span><span class="line"><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="cl"><span class="s">put /myapp/lock &#34;owner1&#34;
</span></span></span><span class="line"><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="cl"><span class="s">EOF</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Consul session + KV CAS (equivalent)</span>
</span></span><span class="line"><span class="cl"><span class="nv">SESSION_ID</span><span class="o">=</span><span class="k">$(</span>curl -s -X PUT <span class="s1">&#39;http://consul:8500/v1/session/create&#39;</span> <span class="p">|</span> jq -r .ID<span class="k">)</span>
</span></span><span class="line"><span class="cl">curl -X PUT <span class="s1">&#39;http://consul:8500/v1/kv/myapp/lock?acquire=&#39;</span><span class="nv">$SESSION_ID</span> -d <span class="s1">&#39;owner1&#39;</span>
</span></span><span class="line"><span class="cl"><span class="c1"># response body: true if acquired, false if another session already holds the lock</span></span></span></code></pre></div><h2 id="application-重設計">Application redesign</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Before: etcd</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">etcd3</span>
</span></span><span class="line"><span class="cl"><span class="n">etcd</span> <span class="o">=</span> <span class="n">etcd3</span><span class="o">.</span><span class="n">client</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="s1">&#39;etcd&#39;</span><span class="p">,</span> <span class="n">port</span><span class="o">=</span><span class="mi">2379</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">etcd</span><span class="o">.</span><span class="n">put</span><span class="p">(</span><span class="s1">&#39;/myapp/config/db_url&#39;</span><span class="p">,</span> <span class="s1">&#39;postgres://...&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">db_url</span> <span class="o">=</span> <span class="n">etcd</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">&#39;/myapp/config/db_url&#39;</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">decode</span><span class="p">()</span>  <span class="c1"># get() returns (value bytes, metadata)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># After: Consul (KV-only)</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">consul</span>
</span></span><span class="line"><span class="cl"><span class="n">c</span> <span class="o">=</span> <span class="n">consul</span><span class="o">.</span><span class="n">Consul</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="s1">&#39;consul&#39;</span><span class="p">,</span> <span class="n">port</span><span class="o">=</span><span class="mi">8500</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">c</span><span class="o">.</span><span class="n">kv</span><span class="o">.</span><span class="n">put</span><span class="p">(</span><span class="s1">&#39;myapp/config/db_url&#39;</span><span class="p">,</span> <span class="s1">&#39;postgres://...&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">_</span><span class="p">,</span> <span class="n">kv</span> <span class="o">=</span> <span class="n">c</span><span class="o">.</span><span class="n">kv</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">&#39;myapp/config/db_url&#39;</span><span class="p">)</span>  <span class="c1"># returns (index, entry dict)</span>
</span></span><span class="line"><span class="cl"><span class="n">db_url</span> <span class="o">=</span> <span class="n">kv</span><span class="p">[</span><span class="s1">&#39;Value&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">decode</span><span class="p">()</span>  <span class="c1"># Value is bytes</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># (newly added) After: Consul service registration with an HTTP health check</span>
</span></span><span class="line"><span class="cl"><span class="n">c</span><span class="o">.</span><span class="n">agent</span><span class="o">.</span><span class="n">service</span><span class="o">.</span><span class="n">register</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">name</span><span class="o">=</span><span class="s1">&#39;myapp&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">service_id</span><span class="o">=</span><span class="s1">&#39;myapp-1&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">address</span><span class="o">=</span><span class="s1">&#39;10.0.0.10&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">port</span><span class="o">=</span><span class="mi">8080</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">check</span><span class="o">=</span><span class="n">consul</span><span class="o">.</span><span class="n">Check</span><span class="o">.</span><span class="n">http</span><span class="p">(</span><span class="s1">&#39;http://10.0.0.10:8080/health&#39;</span><span class="p">,</span> <span class="s1">&#39;10s&#39;</span><span class="p">,</span> <span class="s1">&#39;5s&#39;</span><span class="p">,</span> <span class="s1">&#39;30s&#39;</span><span class="p">)</span>  <span class="c1"># interval, timeout, deregister-after</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># DNS-based discovery (how other services find myapp; 8600 is the default agent DNS port)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># dig @127.0.0.1 -p 8600 +short myapp.service.consul SRV</span></span></span></code></pre></div><h2 id="migration-流程">Migration process</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">1. Pre-migration audit
</span></span><span class="line"><span class="cl">   - List every application that uses etcd
</span></span><span class="line"><span class="cl">   - For each application, assess whether it *needs* Consul extras (service discovery / health / mesh)
</span></span><span class="line"><span class="cl">   - Tag pure-KV use cases as *low-effort migration* and those that will use extras as *value-add migration*
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">2. Consul cluster build
</span></span><span class="line"><span class="cl">   - Cross-DC design (plan the WAN federation)
</span></span><span class="line"><span class="cl">   - ACL system configuration (do not leave it default-open)
</span></span><span class="line"><span class="cl">   - Capacity sizing (Consul agents are heavier than etcd)
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">3. Application migration (per app)
</span></span><span class="line"><span class="cl">   - Pure KV: swap the SDK, map the API calls, cut over
</span></span><span class="line"><span class="cl">   - Service discovery: add registration + health check + DNS lookup
</span></span><span class="line"><span class="cl">   - Service mesh: add the Connect proxy + intentions
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">4. Dual-run period
</span></span><span class="line"><span class="cl">   - etcd keeps running while applications move to Consul gradually
</span></span><span class="line"><span class="cl">   - Verify each application after its cutover
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">5. etcd decommission
</span></span><span class="line"><span class="cl">   - Confirm every application has moved
</span></span><span class="line"><span class="cl">   - The K8s control plane stays on etcd (if it is the only remaining etcd user, keep that cluster)</span></span></code></pre></div><p>The whole migration takes 2-4 months, depending on the number of applications and how many extras they adopt.</p>
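<p>The tagging rule in step 1 is mechanical enough to encode. Below is a minimal sketch with a hypothetical <code>App</code> record that lists which Consul extras each application needs. The extra names are illustrative labels, not Consul identifiers.</p>

```python
from dataclasses import dataclass, field

# Illustrative labels for the Consul extras named in step 1.
EXTRAS = {"service_discovery", "health_check", "mesh"}
LOW, VALUE_ADD = "low-effort migration", "value-add migration"


@dataclass
class App:
    name: str
    wants: set = field(default_factory=set)  # subset of EXTRAS


def classify(app):
    """Tag one application; reject extras the audit does not know about."""
    unknown = app.wants - EXTRAS
    if unknown:
        raise ValueError(f"{app.name}: unknown extras {sorted(unknown)}")
    return VALUE_ADD if app.wants else LOW


def plan(apps):
    """Group application names by tag, keeping input order within each group."""
    groups = {LOW: [], VALUE_ADD: []}
    for app in apps:
        groups[classify(app)].append(app.name)
    return groups
```

<p>The low-effort group follows step 3's pure-KV path. The value-add group needs the registration, health-check, or Connect work.</p>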
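<h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1kv-api-對位看似-11watch-event-model-不同">Case 1: The KV API looks like a 1:1 mapping, but the watch event model differs</h3>
<p><strong>Symptom</strong>: after the application switches from etcd watch to Consul blocking queries, event-handling latency grows from 50ms to 1-5s. The application assumes events are pushed immediately, but it is now polling.</p>
<p><strong>Root cause</strong>: etcd watch is a gRPC stream that pushes each event as it happens. A Consul blocking query is long polling with a <code>wait</code> timeout: a change is delivered promptly only while a request is open, and the client has to re-issue the request after every response.</p>
<p><strong>Fix</strong>:</p>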
<ol>
<li><strong>Lower the <code>wait</code> timeout</strong> to match business requirements (the default is 5 min; it can be set to, e.g., 10s)</li>
<li><strong>Concurrent polling across instances</strong>: each of the N application instances polls independently, so a single slow poller does not delay the event</li>
<li><strong>Architecture</strong>: for critical events, use the Consul event API (<code>PUT /v1/event/fire/&lt;name&gt;</code>) with a blocking query on the event endpoint, kept separate from KV changes</li>
<li><strong>Keep etcd for critical watches</strong>: do not migrate mission-critical watches off etcd</li>
</ol>
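<p>For the event-API option above, firing an event is one HTTP call. Below is a minimal standard-library sketch that only builds the request; send it with <code>urllib.request.urlopen</code>. The event name and JSON payload shape are assumptions. Consul user events travel over gossip and are best-effort, so they suit "wake up and re-read" signals better than carrying authoritative state.</p>

```python
import json
import urllib.request


def build_event_fire(base_url, name, payload, token=None):
    """Build (not send) PUT /v1/event/fire/{name} with a JSON-encoded payload."""
    req = urllib.request.Request(
        f"{base_url}/v1/event/fire/{name}",
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
    )
    if token:
        req.add_header("X-Consul-Token", token)  # needed once ACLs are enforced
    return req


# Consumers can long-poll GET /v1/event/list?name=NAME with an index/wait
# loop similar to the KV blocking query.
```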
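<h3 id="case-2session-based-lock-跟-etcd-lease-差">Case 2: Session-based locks differ from etcd leases</h3>
<p><strong>Symptom</strong>: with etcd, a lease had a 5s TTL, and if the lease-holding application lost connectivity, the lock was released automatically within 5s. After switching to Consul sessions, the session TTL still applies, but integrating it with health checks is complicated, and locks are occasionally not released.</p>
<p><strong>Root cause</strong>: a Consul session has two behaviors. With <code>delete</code>, invalidating the session deletes the keys it holds, which also frees the lock. With <code>release</code> (the default), the lock is released but the key is kept. Combining a TTL with health checks makes the resulting behavior hard to reason about.</p>
<p><strong>Fix</strong>:</p>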





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Set the session behavior explicitly</span>
</span></span><span class="line"><span class="cl"><span class="n">session_id</span> <span class="o">=</span> <span class="n">c</span><span class="o">.</span><span class="n">session</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">name</span><span class="o">=</span><span class="s1">&#39;myapp-lock&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">ttl</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span>           <span class="c1"># 15s TTL; renew before it lapses</span>
</span></span><span class="line"><span class="cl">    <span class="n">behavior</span><span class="o">=</span><span class="s1">&#39;delete&#39;</span> <span class="c1"># on invalidation, delete the keys this session holds (freeing the lock)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">acquired</span> <span class="o">=</span> <span class="n">c</span><span class="o">.</span><span class="n">kv</span><span class="o">.</span><span class="n">put</span><span class="p">(</span><span class="s1">&#39;myapp/lock&#39;</span><span class="p">,</span> <span class="s1">&#39;owner1&#39;</span><span class="p">,</span> <span class="n">acquire</span><span class="o">=</span><span class="n">session_id</span><span class="p">)</span>  <span class="c1"># True if acquired</span></span></span></code></pre></div><p>Session TTLs range from 10s to 86400s and cannot go below 10s (etcd can use 1s), so Consul is not a fit for critical low-latency locks.</p>
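<p>A TTL session only keeps a lock while someone renews it. Below is a minimal sketch of the acquire, renew, and destroy cycle on top of python-consul (<code>session.create</code>, <code>session.renew</code>, <code>session.destroy</code>, <code>kv.put(..., acquire=...)</code>). The renewal cadence of one third of the TTL is an assumption. Two Consul details help explain slow releases. First, the TTL is a lower bound: Consul may invalidate a session later than the TTL. Second, <code>session.create</code> accepts a <code>lock_delay</code> (15s by default) that blocks re-acquiring the key right after the session is invalidated.</p>

```python
import threading


def hold_lock(c, key, owner, ttl=15, stop=None):
    """Acquire `key` under a TTL session and renew it until `stop` is set.

    c is a python-consul Consul client. Returns False if the lock was not
    acquired; otherwise blocks until `stop` is set, then releases and returns True.
    """
    stop = stop or threading.Event()
    session_id = c.session.create(name=f"{owner}-lock", ttl=ttl, behavior="delete")
    try:
        if not c.kv.put(key, owner, acquire=session_id):
            return False  # another session holds the lock
        # Renew well inside the TTL; stop.wait() returns True once stop is set.
        while not stop.wait(ttl / 3):
            c.session.renew(session_id)
        return True
    finally:
        # Destroying the session invalidates it now instead of waiting out the TTL;
        # with behavior="delete" the lock key is removed as well.
        c.session.destroy(session_id)
```

<p>Run it in a background thread and set <code>stop</code> when the work guarded by the lock is done. If renewals stop (process pause, network loss), Consul invalidates the session once the TTL has passed, and the lock is freed.</p>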
<h3 id="case-3multi-dc-failoverkv-寫到-wrong-dc">Case 3: Multi-DC failover; KV written to the wrong DC</h3>
<p><strong>Symptom</strong>: after a cross-DC deployment, an application writes a KV entry but cannot read it back. The application hardcodes one DC's endpoint for writes (us-east), while its reads come from us-west.</p>
<p><strong>Root cause</strong>: Consul WAN federation does not sync KV across DCs automatically. KV is <em>per-DC</em>, and cross-DC sync requires a <em>Consul Enterprise license</em> or a self-managed <em>consul-replicate</em>.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Each application instance connects to its local DC's Consul</strong>: writes and reads stay in the same DC</li>
<li><strong>Cross-DC KV replication</strong>: run consul-replicate yourself, or upgrade to Enterprise</li>
<li><strong>Architecture</strong>: move config shared across DCs to <em>DB-backed config</em> (durable and cross-DC), and keep only DC-local config in Consul KV</li>
</ol>
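<p>The first fix comes down to always talking to the local agent. Below is a minimal sketch. It assumes an agent listens on <code>127.0.0.1:8500</code> on every host. The HTTP API's <code>dc</code> query parameter forwards a request to another federated datacenter; that request reads or writes the remote DC's own KV store and replicates nothing.</p>

```python
from urllib.parse import quote, urlencode

LOCAL_AGENT = "http://127.0.0.1:8500"  # assumption: one agent per host


def kv_url(key, dc=None, agent=LOCAL_AGENT):
    """URL for a KV key via the local agent.

    Without dc, the agent answers from its own datacenter, so a host's writes
    and reads hit the same per-DC store. With dc, the request is forwarded to
    that datacenter: an explicit cross-DC access, not replication.
    """
    url = f"{agent}/v1/kv/{quote(key)}"
    if dc:
        url += "?" + urlencode({"dc": dc})
    return url
```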
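<h3 id="case-4acl-system-預設-opencutover-後曝險">Case 4: The ACL system is open by default, creating exposure after cutover</h3>
<p><strong>Symptom</strong>: one month after the Consul cluster went live, a SOC audit found that any application could read any KV entry. ACLs had never been configured, so every token had full permissions.</p>
<p><strong>Root cause</strong>: Consul ACLs are disabled by default and have to be enabled and <em>bootstrapped</em>. Many setup tutorials skip ACLs for simplicity, and nobody added them after cutover.</p>
<p><strong>Fix</strong>:</p>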





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Prerequisite: agents run with acl.enabled = true (and acl.default_policy = &#34;deny&#34; in production)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Bootstrap the ACL system; this prints the initial management token</span>
</span></span><span class="line"><span class="cl">consul acl bootstrap
</span></span><span class="line"><span class="cl"><span class="c1"># Keep that token as the root credential; the commands below authenticate with its SecretID</span>
</span></span><span class="line"><span class="cl"><span class="nb">export</span> <span class="nv">CONSUL_HTTP_TOKEN</span><span class="o">=</span><span class="s2">&#34;REPLACE_WITH_BOOTSTRAP_SECRET_ID&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Create a read-only policy scoped to the myapp/ prefix</span>
</span></span><span class="line"><span class="cl">consul acl policy create -name <span class="s1">&#39;myapp-readonly&#39;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  -rules <span class="s1">&#39;key_prefix &#34;myapp/&#34; { policy = &#34;read&#34; }&#39;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Create a token for the application, bound to that policy</span>
</span></span><span class="line"><span class="cl">consul acl token create -policy-name <span class="s1">&#39;myapp-readonly&#39;</span></span></span></code></pre></div><p>Bootstrap ACLs as the very first step of a production setup; never defer it.</p>
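<p>On the application side, the token created above is sent with every HTTP API call in the <code>X-Consul-Token</code> header. The following minimal sketch uses only the standard library. It assumes the token is injected through the <code>CONSUL_HTTP_TOKEN</code> environment variable, which the <code>consul</code> CLI also reads. The function name is hypothetical, and the request is built but not sent.</p>

```python
import os
import urllib.request


def consul_kv_read_request(key: str, agent_addr: str = "http://127.0.0.1:8500") -> urllib.request.Request:
    """Build an authenticated KV read request (hypothetical helper; not sent here)."""
    token = os.environ.get("CONSUL_HTTP_TOKEN")
    if not token:
        # With default_policy = "deny", an anonymous call would not be authorized;
        # failing fast makes a missing token obvious at startup.
        raise RuntimeError("CONSUL_HTTP_TOKEN is not set")
    req = urllib.request.Request(f"{agent_addr}/v1/kv/{key}", method="GET")
    req.add_header("X-Consul-Token", token)
    return req
```

<p>Pass the result to <code>urllib.request.urlopen</code>. A token bound to <code>myapp-readonly</code> can read <code>myapp/</code> keys but cannot write them.</p>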
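<h3 id="case-5health-check-failure-連鎖service-discovery-失效">Case 5: Cascading health check failures break service discovery</h3>
<p><strong>Symptom</strong>: a GC pause stops an application instance from answering its health check for 5 seconds, and Consul marks it failed. DNS queries stop returning the instance, and traffic moves elsewhere. After the GC finishes the instance is healthy again, but Consul still reports it as failed, and recovery takes minutes.</p>
<p><strong>Root cause</strong>: a failed Consul health check enters the critical state and needs <em>N consecutive successes</em> to return to passing. The default threshold is only one or two successes, but the actual recovery time depends on the check interval.</p>
<p><strong>Fix</strong>:</p>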
<ol>
<li><strong>Set <code>success_before_passing</code> low</strong> (1) for fast recovery</li>
<li><strong>Set <code>failures_before_critical</code> high</strong> (3-5) to tolerate transient failures</li>
<li><strong>Multi-check strategy</strong>: combine HTTP, TCP, and script checks instead of relying on a single check</li>
<li><strong>Application-side hint</strong>: for JVM applications, set <code>MaxGCPauseMillis</code> so GC pauses stay below the health check interval (the JVM treats it as a pause-time goal, not a hard guarantee)</li>
</ol>
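<p>The threshold fixes map onto check fields in the agent HTTP API, where <code>SuccessBeforePassing</code> and <code>FailuresBeforeCritical</code> are the API-side names of <code>success_before_passing</code> and <code>failures_before_critical</code>. The sketch below builds the body for <code>PUT /v1/agent/service/register</code> with one HTTP check and one TCP check. The service name <code>myapp</code>, port 8080, the <code>/health</code> path, and the 10s interval are all hypothetical. Script checks are left out because they also require script checks to be enabled on the agent.</p>

```python
import json


def service_registration(name: str, port: int, interval_s: int = 10) -> str:
    """JSON body for PUT /v1/agent/service/register (hypothetical values)."""
    common = {
        "Interval": f"{interval_s}s",
        "Timeout": "2s",
        # Tolerate transient failures (for example a GC pause) before going critical...
        "FailuresBeforeCritical": 3,
        # ...but return to passing after a single success.
        "SuccessBeforePassing": 1,
    }
    checks = [
        {"Name": f"{name}-http", "HTTP": f"http://localhost:{port}/health", **common},
        {"Name": f"{name}-tcp", "TCP": f"localhost:{port}", **common},
    ]
    return json.dumps({"Name": name, "Port": port, "Checks": checks})
```

<p>With a 10s interval and three required failures, a single check missed during a 5-second pause should not mark the instance critical on its own. The second check type also keeps one flaky endpoint from deciding routing alone.</p>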
<h2 id="capacity--cost">Capacity / cost</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>etcd</th>
          <th>Consul</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cluster baseline</td>
          <td>3-5 node Raft cluster</td>
          <td>3-5 server + N agent (per host)</td>
      </tr>
      <tr>
          <td>Memory per node</td>
          <td>2-8GB</td>
          <td>4-16GB (including agent)</td>
      </tr>
      <tr>
          <td>Operational FTE</td>
          <td>0.2-0.5</td>
          <td>0.5-1.0 (more features, more operational work)</td>
      </tr>
      <tr>
          <td>Feature surface</td>
          <td>Pure KV</td>
          <td>KV + service mesh + multi-DC + ACL</td>
      </tr>
      <tr>
          <td>Setup complexity</td>
          <td>Low</td>
          <td>Medium-High</td>
      </tr>
      <tr>
          <td>Multi-DC support</td>
          <td>Not supported</td>
          <td>Built-in WAN federation</td>
      </tr>
      <tr>
          <td>License</td>
          <td>Apache 2.0 (open)</td>
          <td>MPL 2.0 (community) / commercial (enterprise)</td>
      </tr>
      <tr>
          <td>Migration cost</td>
          <td>-</td>
          <td>1-3 FTE × 2-4 months</td>
      </tr>
  </tbody>
</table>
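<p><strong>Reading the table</strong>: pure KV use cases belong on etcd, and heavy service mesh, multi-DC, or discovery needs belong on Consul. A hybrid deployment is the long-term default: the K8s control plane keeps running etcd while the service mesh runs on Consul.</p>
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-kubernetes-對位">Positioning relative to Kubernetes</h3>
<p>The K8s control plane <em>always</em> uses etcd and is never switched to Consul. Consul provides service mesh and cross-cluster discovery <em>outside</em> K8s, so the two coexist rather than compete.</p>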
<h3 id="跟-vault-整合">Integration with <a href="/blog/backend/07-security-data-protection/vendors/hashicorp-vault/" data-link-title="HashiCorp Vault" data-link-desc="Self-hosted secret management 與 dynamic credential / encryption-as-a-service / PKI engine、跨雲跨環境的 secret 控制面">Vault</a></h3>
<p>Consul and Vault belong to the same HashiCorp ecosystem: Consul handles service discovery and mesh, and Vault handles secrets. Consul ACL tokens can be issued dynamically by Vault's Consul secrets engine.</p>
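<p>The sketch below shows that flow over Vault's HTTP API. It assumes the Consul secrets engine is mounted at the default <code>consul/</code> path and that a role named <code>myapp</code> (hypothetical) exists. The application reads <code>GET /v1/consul/creds/myapp</code> with its Vault token and uses the returned <code>data.token</code> as its Consul ACL token. Network calls are left out so the request and parsing logic stand alone.</p>

```python
import json
import urllib.request


def consul_creds_request(vault_addr: str, vault_token: str, role: str = "myapp") -> urllib.request.Request:
    # Consul secrets engine mounted at the default "consul/" path (assumption).
    base = vault_addr.rstrip("/")
    req = urllib.request.Request(f"{base}/v1/consul/creds/{role}", method="GET")
    req.add_header("X-Vault-Token", vault_token)
    return req


def consul_token_from_response(raw: bytes) -> tuple[str, int]:
    """Extract the dynamic Consul token and its lease duration from Vault's JSON reply."""
    payload = json.loads(raw)
    # The secret sits under "data"; lease_duration (seconds) says when it must be renewed or re-read.
    return payload["data"]["token"], payload["lease_duration"]
```

<p>Because the Consul token is a Vault lease, the application should renew it or fetch a new one before <code>lease_duration</code> elapses rather than caching it forever.</p>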
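<h3 id="跟-istio--linkerd-對位">Positioning relative to <a href="https://istio.io/">Istio / Linkerd</a></h3>
<p>Consul Connect follows the service mesh paradigm, alongside Istio and Linkerd. Most K8s-native organizations use Istio or Linkerd; Consul's strength is a mesh that spans <em>K8s, VMs, and multiple DCs</em>.</p>
<h3 id="反向-migrationconsul--etcd">Reverse migration (Consul → etcd)</h3>
<p>A few organizations make this move when simplifying their stack. The process mirrors the forward migration, but <em>dropping the service mesh and multi-DC support is a deliberate downgrade</em>. Do not pretend the two setups are functionally equivalent.</p>
<h3 id="下一步議題">Next topics</h3>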
<ul>
<li><strong>Consul Connect production rollout</strong>: mesh adoption is incremental, with per-service intentions rolled out gradually</li>
<li><strong>Multi-DC topology design</strong>: active-active vs active-passive, traded off against RPO/RTO and cost</li>
<li><strong>Integration with the Kubernetes Gateway API</strong>: strategies for integrating the service mesh paradigm inside vs outside K8s</li>
</ul>
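<p>Per-service intentions can be expressed as <code>service-intentions</code> config entries, one per destination service. The sketch below builds the body for <code>PUT /v1/config</code>. It allows an explicit list of callers and denies everyone else through the <code>*</code> wildcard. The service names <code>api</code>, <code>web</code>, and <code>billing</code> are hypothetical.</p>

```python
import json


def service_intentions(destination: str, allowed_sources: list[str]) -> str:
    """Body for PUT /v1/config creating a service-intentions entry (hypothetical names)."""
    sources = [{"Name": src, "Action": "allow"} for src in allowed_sources]
    # Wildcard deny for every other caller; Consul gives exact source names higher
    # precedence than "*", so the explicit allows above still take effect.
    sources.append({"Name": "*", "Action": "deny"})
    return json.dumps({"Kind": "service-intentions", "Name": destination, "Sources": sources})
```

<p>An incremental rollout adds one such entry per service at a time, starting with low-risk services, instead of flipping a mesh-wide default in one step.</p>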
<h2 id="相關連結">Related links</h2>
<ul>
<li>Target vendor: <a href="/blog/backend/05-deployment-platform/vendors/consul/" data-link-title="Consul" data-link-desc="Service registry / mesh / KV / DNS">Consul</a></li>
<li>Parallel migration playbooks (Type E): <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-memcached/" data-link-title="Redis → Memcached：Memcached 不是 simpler Redis、是 cache paradigm" data-link-desc="Redis → Memcached 是 Type E paradigm reduction migration — 從 multi-paradigm（KV &#43; 資料結構 &#43; pub/sub &#43; Lua &#43; streams）退到 pure cache；不是「remove Redis features」、是「重新分配 Redis-specific feature 到對應 specialized 服務」；5 個 production 踩雷 &#43; paradigm reduction 路線">Redis → Memcached</a> (the paradigm-reduction counterpart) / <a href="/blog/backend/03-message-queue/vendors/kafka/migrate-from-to-nats/" data-link-title="Kafka ↔ NATS：不是 migration、是 messaging paradigm 重設計" data-link-desc="Kafka 跟 NATS 不是同類產品（log-based event streaming vs subject-based messaging）、&#39;migration&#39; 字面上不成立；本文釐清兩家 paradigm 邊界、什麼情境真的能換、application 模式重設計的 5 個踩雷（consumer offset 觀念差 / retention model / exactly-once 假設 / schema registry 缺位 / fan-out 模式差）、跟 JetStream 對位 &#43; 混合架構">Kafka ↔ NATS</a></li>
<li>Related integration: <a href="/blog/backend/07-security-data-protection/vendors/hashicorp-vault/" data-link-title="HashiCorp Vault" data-link-desc="Self-hosted secret management 與 dynamic credential / encryption-as-a-service / PKI engine、跨雲跨環境的 secret 控制面">HashiCorp Vault</a></li>
<li>Methodology: <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a></li>
</ul>
]]></content:encoded></item></channel></rss>