<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Eks on Tarragon</title><link>https://tarrragon.github.io/blog/tags/eks/</link><description>Recent content in Eks on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Fri, 26 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/tags/eks/index.xml" rel="self" type="application/rss+xml"/><item><title>IaC for the Compute Platform: ECS and EKS</title><link>https://tarrragon.github.io/blog/infra/05-core-services/compute-ecs-eks/</link><pubDate>Fri, 26 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/infra/05-core-services/compute-ecs-eks/</guid><description>&lt;p>Compute is the execution vehicle for business code. What the infra layer describes is "compute capacity and wiring": which subnets it runs in, which IAM role it uses, which load balancer target group it attaches to, and how capacity scales with load. Which version of the code actually runs is decided by the deployment process. This boundary lets infra changes and application releases each move at their own pace: an infra apply does not change the image, and the deployment pipeline does not change subnets.&lt;/p>
&lt;p>The deployment order of core services follows the direction of dependency (whatever is depended on gets built first). Compute sits on the third layer of this &lt;a href="https://tarrragon.github.io/blog/infra/05-core-services/deployment-order-database/" data-link-title="部署順序與資料庫上 IaC" data-link-desc="核心服務的依賴圖決定部署順序，資料庫作為第一批上層服務需要最謹慎的 IaC 描述 — 涵蓋 RDS 接線、連線管理、read replica 與端點暴露">four-layer dependency structure&lt;/a>: it references the subnets, security groups, and IAM roles below it, and is in turn referenced by the load balancer target group above it. So in the IaC definition of compute resources, subnet IDs, security group IDs, and IAM role ARNs should all be references rather than hard-coded values, so that the upper layers follow automatically when a lower layer is rebuilt.&lt;/p>
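&lt;h2 id="ecs-vs-eks-選型">Choosing between ECS and EKS&lt;/h2>
&lt;p>Both ECS and EKS can run containers. They differ in the operational model of the control plane and in ecosystem fit. The choice depends on team capability and business needs, not on feature count: both can reach the baseline goal of "containers running in private subnets, using an IAM role to access resources, attached to an ALB to receive traffic".&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Dimension&lt;/th>
 &lt;th>ECS&lt;/th>
 &lt;th>EKS&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Control plane operations&lt;/td>
 &lt;td>Fully managed by AWS&lt;/td>
 &lt;td>AWS manages the API server; you manage the add-ons&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Learning curve&lt;/td>
 &lt;td>Low (AWS-native concepts)&lt;/td>
 &lt;td>High (Kubernetes ecosystem)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Cross-cloud portability&lt;/td>
 &lt;td>Low (AWS-only)&lt;/td>
 &lt;td>High (Kubernetes standard)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>IaC toolchain&lt;/td>
 &lt;td>Entirely the Terraform AWS provider&lt;/td>
 &lt;td>Terraform builds the cluster, workloads go through Helm&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Best fit&lt;/td>
 &lt;td>Single-cloud AWS, team without K8s experience&lt;/td>
 &lt;td>Existing K8s capability or a need for its ecosystem&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>The ECS control plane is managed by AWS. Services, task definitions, and target groups are all AWS-native resources that the Terraform provider describes directly, so the cognitive load is low. The Fargate launch type goes one step further: you do not manage EC2 instances at all, you only declare how much CPU and memory a task needs, and AWS schedules it onto the underlying hosts.&lt;/p>
&lt;p>The EKS control plane is managed Kubernetes. IaC describes the cluster itself and its node groups, while workloads (Deployment, Service) go through Kubernetes manifests or Helm charts. This means the infra toolchain spans two systems, Terraform and Kubernetes: Terraform handles the cluster infrastructure, kubectl / Helm handle the workloads, and the two keep separate state and separate change workflows.&lt;/p>
&lt;p>The complexity of EKS is only worth taking on when the team already has Kubernetes skills or needs its ecosystem (service mesh, custom schedulers, multi-cloud deployment, the community operator ecosystem). Otherwise, the low overhead of ECS is the default starting point. A self-check: if the team chose EKS but only uses the most basic Deployment + Service, with no service mesh, CRDs, or cross-cloud needs, it is paying the operational cost of Kubernetes without collecting the return, and moving back to ECS is usually the more reasonable choice.&lt;/p>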
&lt;h3 id="fargate-vs-ec2-launch-type">Fargate vs EC2 launch type&lt;/h3>
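&lt;p>ECS splits further into the EC2 launch type and the Fargate launch type. With the EC2 launch type you manage the capacity provider made up of EC2 instances yourself: AMI updates, instance scaling, and OS-level security patching are all the team's responsibility. With Fargate, AWS manages the compute instances, so there is no EC2 capacity provider to configure and no AMI to maintain, which further shrinks the operational surface.&lt;/p></description><content:encoded><![CDATA[<p>Compute is the execution vehicle for business code. What the infra layer describes is "compute capacity and wiring": which subnets it runs in, which IAM role it uses, which load balancer target group it attaches to, and how capacity scales with load. Which version of the code actually runs is decided by the deployment process. This boundary lets infra changes and application releases each move at their own pace: an infra apply does not change the image, and the deployment pipeline does not change subnets.</p>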
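<p>The deployment order of core services follows the direction of dependency (whatever is depended on gets built first). Compute sits on the third layer of this <a href="/blog/infra/05-core-services/deployment-order-database/" data-link-title="部署順序與資料庫上 IaC" data-link-desc="核心服務的依賴圖決定部署順序，資料庫作為第一批上層服務需要最謹慎的 IaC 描述 — 涵蓋 RDS 接線、連線管理、read replica 與端點暴露">four-layer dependency structure</a>: it references the subnets, security groups, and IAM roles below it, and is in turn referenced by the load balancer target group above it. So in the IaC definition of compute resources, subnet IDs, security group IDs, and IAM role ARNs should all be references rather than hard-coded values, so that the upper layers follow automatically when a lower layer is rebuilt.</p>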
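<h2 id="ecs-vs-eks-選型">Choosing between ECS and EKS</h2>
<p>Both ECS and EKS can run containers. They differ in the operational model of the control plane and in ecosystem fit. The choice depends on team capability and business needs, not on feature count: both can reach the baseline goal of "containers running in private subnets, using an IAM role to access resources, attached to an ALB to receive traffic".</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>ECS</th>
          <th>EKS</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Control plane operations</td>
          <td>Fully managed by AWS</td>
          <td>AWS manages the API server; you manage the add-ons</td>
      </tr>
      <tr>
          <td>Learning curve</td>
          <td>Low (AWS-native concepts)</td>
          <td>High (Kubernetes ecosystem)</td>
      </tr>
      <tr>
          <td>Cross-cloud portability</td>
          <td>Low (AWS-only)</td>
          <td>High (Kubernetes standard)</td>
      </tr>
      <tr>
          <td>IaC toolchain</td>
          <td>Entirely the Terraform AWS provider</td>
          <td>Terraform builds the cluster, workloads go through Helm</td>
      </tr>
      <tr>
          <td>Best fit</td>
          <td>Single-cloud AWS, team without K8s experience</td>
          <td>Existing K8s capability or a need for its ecosystem</td>
      </tr>
  </tbody>
</table>
<p>The ECS control plane is managed by AWS. Services, task definitions, and target groups are all AWS-native resources that the Terraform provider describes directly, so the cognitive load is low. The Fargate launch type goes one step further: you do not manage EC2 instances at all, you only declare how much CPU and memory a task needs, and AWS schedules it onto the underlying hosts.</p>
<p>The EKS control plane is managed Kubernetes. IaC describes the cluster itself and its node groups, while workloads (Deployment, Service) go through Kubernetes manifests or Helm charts. This means the infra toolchain spans two systems, Terraform and Kubernetes: Terraform handles the cluster infrastructure, kubectl / Helm handle the workloads, and the two keep separate state and separate change workflows.</p>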
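<p>As a rough sketch of the Terraform side of that split, the cluster and a managed node group might be declared as below. The role names (<code>aws_iam_role.eks_cluster</code>, <code>aws_iam_role.eks_node</code>), the resource names, and the sizing values are assumptions for illustration and do not come from the original configuration; workloads would still be applied separately with kubectl or Helm.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Hypothetical sketch: Terraform owns the cluster and node group only.
resource &#34;aws_eks_cluster&#34; &#34;main&#34; {
  name     = &#34;main-${var.env}&#34;
  role_arn = aws_iam_role.eks_cluster.arn # assumed to be declared elsewhere

  vpc_config {
    # Referenced, not hard-coded, so the cluster follows a subnet rebuild.
    subnet_ids = [for s in aws_subnet.private : s.id]
  }
}

resource &#34;aws_eks_node_group&#34; &#34;default&#34; {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = &#34;default-${var.env}&#34;
  node_role_arn   = aws_iam_role.eks_node.arn # assumed to be declared elsewhere
  subnet_ids      = [for s in aws_subnet.private : s.id]

  scaling_config {
    desired_size = 2 # illustrative values
    min_size     = 2
    max_size     = 6
  }
}
</code></pre></div>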
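<p>The complexity of EKS is only worth taking on when the team already has Kubernetes skills or needs its ecosystem (service mesh, custom schedulers, multi-cloud deployment, the community operator ecosystem). Otherwise, the low overhead of ECS is the default starting point. A self-check: if the team chose EKS but only uses the most basic Deployment + Service, with no service mesh, CRDs, or cross-cloud needs, it is paying the operational cost of Kubernetes without collecting the return, and moving back to ECS is usually the more reasonable choice.</p>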
<h3 id="fargate-vs-ec2-launch-type">Fargate vs EC2 launch type</h3>
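<p>ECS splits further into the EC2 launch type and the Fargate launch type. With the EC2 launch type you manage the capacity provider made up of EC2 instances yourself: AMI updates, instance scaling, and OS-level security patching are all the team's responsibility. With Fargate, AWS manages the compute instances, so there is no EC2 capacity provider to configure and no AMI to maintain, which further shrinks the operational surface.</p>
<p>Fargate has three costs: a higher unit price (roughly 20-40% more than EC2 for the same vCPU/memory), no support for GPU workloads, and slightly longer startup latency (a cold start of about 30-60 seconds, versus near-instant on EC2 when an instance is already running). For most web APIs and non-GPU background jobs, Fargate is the initial choice, because the operations time saved usually outweighs the premium. Move to the EC2 launch type once traffic is steady and cost optimization matters; what you take on at that point is capacity provider configuration and instance management. For a sense of scale: a 2 vCPU / 4 GB Fargate task running continuously costs about $70 a month, while an EC2 t3.medium of the same size costs about $30. Note that t3.medium is a burstable instance type, so this pair is not a like-for-like comparison, which is why the gap looks wider than 20-40%; exact prices also vary by region. The monthly difference is not significant with a small number of services. Only once the task count exceeds 10-20 and traffic is steady do the savings from the EC2 launch type justify the engineering effort of switching.</p>
<p>The HCL examples that follow use ECS Fargate for illustration. The EKS wiring skeleton (subnets, IAM, target group) is similar; the differences lie in the resource types of the orchestration layer.</p>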
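<h2 id="task-definition描述容器規格與接線">Task definition: describing container specs and wiring</h2>
<p>A task definition is how ECS declares "what one unit of work looks like": which container image to run, how much CPU and memory to give it, which ports to open, which IAM role to use, and where logs go. It is the core resource of compute IaC.</p>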





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_ecs_task_definition&#34; &#34;api&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  family</span>                   <span class="o">=</span> <span class="s2">&#34;api-${var.env}&#34;</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  requires_compatibilities</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;FARGATE&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">  network_mode</span>             <span class="o">=</span> <span class="s2">&#34;awsvpc&#34;</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="n">  cpu</span>                      <span class="o">=</span> <span class="k">var</span><span class="p">.</span><span class="k">task_cpu</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">  memory</span>                   <span class="o">=</span> <span class="k">var</span><span class="p">.</span><span class="k">task_memory</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="n">  execution_role_arn</span>       <span class="o">=</span> <span class="k">aws_iam_role</span><span class="p">.</span><span class="k">ecs_execution</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="n">  task_role_arn</span>            <span class="o">=</span> <span class="k">aws_iam_role</span><span class="p">.</span><span class="k">api_task</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">  container_definitions</span> <span class="o">=</span> <span class="k">jsonencode</span><span class="p">([</span>{
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="n">    name</span>  <span class="o">=</span> <span class="s2">&#34;api&#34;</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="n">    image</span> <span class="o">=</span> <span class="s2">&#34;${var.ecr_repo_url}:${var.image_tag}&#34;</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="n">    portMappings</span> <span class="o">=</span><span class="n"> [{ containerPort</span> <span class="o">=</span><span class="n"> 8080, protocol</span> <span class="o">=</span> <span class="s2">&#34;tcp&#34;</span> }<span class="p">]</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="n">    logConfiguration</span> <span class="o">=</span> {
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="n">      logDriver</span> <span class="o">=</span> <span class="s2">&#34;awslogs&#34;</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="n">      options</span> <span class="o">=</span> {
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="n">        &#34;awslogs-group&#34;</span>         <span class="o">=</span> <span class="k">aws_cloudwatch_log_group</span><span class="p">.</span><span class="k">api</span><span class="p">.</span><span class="k">name</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="n">        &#34;awslogs-region&#34;</span>        <span class="o">=</span> <span class="k">var</span><span class="p">.</span><span class="k">region</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="n">        &#34;awslogs-stream-prefix&#34;</span> <span class="o">=</span> <span class="s2">&#34;api&#34;</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl">      }
</span></span><span class="line"><span class="ln">21</span><span class="cl">    }
</span></span><span class="line"><span class="ln">22</span><span class="cl">  }<span class="p">])</span>
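</span></span><span class="line"><span class="ln">23</span><span class="cl">}</span></span></code></pre></div><p>This definition contains three deliberate design choices:</p>
<p><strong>Image version decoupling</strong>: <code>var.image_tag</code> gets a stable default in the infra <code>tfvars</code> (such as <code>latest</code> or a baseline version), and the deployment pipeline overrides that value to roll out new versions. An infra apply does not change the image, and the deployment pipeline does not change subnets. The two differ in change frequency and review rigor, and mixing them makes the fast one wait for the slow one. If every release requires editing the infra Terraform code and running apply, the image version is not decoupled from infra. Instead, let the deployment pipeline register a new task definition revision with the new image tag and call <code>aws ecs update-service</code> directly, without going through Terraform.</p>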
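<p>One way to keep a later <code>terraform apply</code> from reverting a pipeline-driven release is to have Terraform ignore drift on the service's task definition. The fragment below is a sketch under that assumption and is not part of the original configuration; it shows only the arguments that would be added to the <code>aws_ecs_service</code> resource for this API.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl">resource &#34;aws_ecs_service&#34; &#34;api&#34; {
  # ... cluster, network_configuration, load_balancer, and so on ...

  # Terraform supplies the task definition when the service is created.
  task_definition = aws_ecs_task_definition.api.arn

  lifecycle {
    # Revisions rolled out later by the deployment pipeline are left alone.
    ignore_changes = [task_definition]
  }
}
</code></pre></div>
<p>The trade-off is that changes to the Terraform-managed task definition (CPU, memory, roles) no longer roll out on apply by themselves; the pipeline has to carry them into the next revision it registers.</p>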
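<p><strong>Division of labor between the two IAM roles</strong>: <code>execution_role_arn</code> is the identity the ECS agent uses to pull the image and write logs. Its permissions are at the ECS platform level and have nothing to do with business logic. <code>task_role_arn</code> is the identity the application code inside the container obtains at runtime. Its permissions map to business needs, such as reading and writing an S3 bucket or sending messages to an SQS queue. Putting both on the same role mixes platform permissions with business permissions and violates least privilege (see <a href="/blog/infra/02-identity-credentials/" data-link-title="模組二：身分與憑證地基 — IAM 與 OIDC" data-link-desc="IAM role / policy 設計、最小權限，以及用 OIDC 短期憑證取代長期 access key">Module 2: Identity and Credential Foundations</a>).</p>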





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_iam_role&#34; &#34;api_task&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  name</span>               <span class="o">=</span> <span class="s2">&#34;api-task-${var.env}&#34;</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  assume_role_policy</span> <span class="o">=</span> <span class="k">data</span><span class="p">.</span><span class="k">aws_iam_policy_document</span><span class="p">.</span><span class="k">ecs_assume</span><span class="p">.</span><span class="k">json</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">}
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_iam_role_policy&#34; &#34;api_task&#34;</span> {
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="n">  role</span>   <span class="o">=</span> <span class="k">aws_iam_role</span><span class="p">.</span><span class="k">api_task</span><span class="p">.</span><span class="k">id</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="n">  policy</span> <span class="o">=</span> <span class="k">data</span><span class="p">.</span><span class="k">aws_iam_policy_document</span><span class="p">.</span><span class="k">api_permissions</span><span class="p">.</span><span class="k">json</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">}
</span></span><span class="line"><span class="ln">10</span><span class="cl">
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="k">data</span> <span class="s2">&#34;aws_iam_policy_document&#34; &#34;api_permissions&#34;</span> {
</span></span><span class="line"><span class="ln">12</span><span class="cl">  <span class="k">statement</span> {
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="n">    actions</span>   <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;s3:GetObject&#34;, &#34;s3:PutObject&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="n">    resources</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;${aws_s3_bucket.uploads.arn}/*&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">  }
</span></span><span class="line"><span class="ln">16</span><span class="cl">  <span class="k">statement</span> {
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="n">    actions</span>   <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;sqs:SendMessage&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="n">    resources</span> <span class="o">=</span> <span class="p">[</span><span class="k">aws_sqs_queue</span><span class="p">.</span><span class="k">notifications</span><span class="p">.</span><span class="k">arn</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl">  }
</span></span><span class="line"><span class="ln">20</span><span class="cl">}</span></span></code></pre></div><p><strong>Log wiring</strong>: <code>logConfiguration</code> routes the container's stdout/stderr to CloudWatch Logs, and the log group name references a resource declared in the same IaC. This is exactly what <a href="/blog/infra/06-observability-logging/" data-link-title="模組六：可觀測性與 log 一併寫進 code" data-link-desc="log group、metric、alarm 跟基礎設施同生命週期管理，出事時追得到查得到">Module 6: Observability and Logs</a> means by "monitoring shares the resource's lifecycle".</p>
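<p>The snippets above reference a trust policy, an execution role, and a log group that are never shown. A minimal sketch of what they might look like follows; the role name, the log group naming scheme, and the retention period are assumptions for illustration, while the service principal and the managed policy ARN are the standard ones for ECS tasks.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Trust policy shared by both roles: only ECS tasks may assume them.
data &#34;aws_iam_policy_document&#34; &#34;ecs_assume&#34; {
  statement {
    actions = [&#34;sts:AssumeRole&#34;]
    principals {
      type        = &#34;Service&#34;
      identifiers = [&#34;ecs-tasks.amazonaws.com&#34;]
    }
  }
}

# Platform-level identity: pull images from ECR, write to CloudWatch Logs.
resource &#34;aws_iam_role&#34; &#34;ecs_execution&#34; {
  name               = &#34;ecs-execution-${var.env}&#34; # name is an assumption
  assume_role_policy = data.aws_iam_policy_document.ecs_assume.json
}

resource &#34;aws_iam_role_policy_attachment&#34; &#34;ecs_execution&#34; {
  role       = aws_iam_role.ecs_execution.name
  policy_arn = &#34;arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy&#34;
}

# Log group declared next to the task definition that writes to it.
resource &#34;aws_cloudwatch_log_group&#34; &#34;api&#34; {
  name              = &#34;/ecs/api-${var.env}&#34; # naming scheme is an assumption
  retention_in_days = 30                    # illustrative value
}
</code></pre></div>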
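<h2 id="ecs-service部署模式與網路接線">ECS service: deployment mode and network wiring</h2>
<p>An ECS service controls "how many tasks to run, how new versions are deployed, and which target group to attach to". It manages the running instances of a task definition.</p>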





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_ecs_service&#34; &#34;api&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  name</span>            <span class="o">=</span> <span class="s2">&#34;api-${var.env}&#34;</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  cluster</span>         <span class="o">=</span> <span class="k">aws_ecs_cluster</span><span class="p">.</span><span class="k">main</span><span class="p">.</span><span class="k">id</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">  task_definition</span> <span class="o">=</span> <span class="k">aws_ecs_task_definition</span><span class="p">.</span><span class="k">api</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="n">  desired_count</span>   <span class="o">=</span> <span class="k">var</span><span class="p">.</span><span class="k">api_desired_count</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">  launch_type</span>     <span class="o">=</span> <span class="s2">&#34;FARGATE&#34;</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">  <span class="k">network_configuration</span> {
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="n">    subnets</span>          <span class="o">=</span> <span class="p">[</span><span class="k">for</span> <span class="k">s</span> <span class="k">in</span> <span class="k">aws_subnet</span><span class="p">.</span><span class="k">private</span> <span class="err">:</span> <span class="k">s</span><span class="p">.</span><span class="k">id</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">    security_groups</span>  <span class="o">=</span> <span class="p">[</span><span class="k">aws_security_group</span><span class="p">.</span><span class="k">api</span><span class="p">.</span><span class="k">id</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="n">    assign_public_ip</span> <span class="o">=</span> <span class="kt">false</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">  }
</span></span><span class="line"><span class="ln">13</span><span class="cl">
</span></span><span class="line"><span class="ln">14</span><span class="cl">  <span class="k">load_balancer</span> {
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="n">    target_group_arn</span> <span class="o">=</span> <span class="k">aws_lb_target_group</span><span class="p">.</span><span class="k">api</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="n">    container_name</span>   <span class="o">=</span> <span class="s2">&#34;api&#34;</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="n">    container_port</span>   <span class="o">=</span> <span class="m">8080</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">  }
</span></span><span class="line"><span class="ln">19</span><span class="cl">
</span></span><span class="line"><span class="ln">20</span><span class="cl">  <span class="k">deployment_circuit_breaker</span> {
</span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="n">    enable</span>   <span class="o">=</span> <span class="kt">true</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="n">    rollback</span> <span class="o">=</span> <span class="kt">true</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl">  }
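</span></span><span class="line"><span class="ln">24</span><span class="cl">}</span></span></code></pre></div><p><code>network_configuration</code> places tasks in private subnets and applies a security group. It determines where these containers sit in the network topology (see <a href="/blog/infra/03-network-foundation/" data-link-title="模組三：網路地基 — VPC 與分層" data-link-desc="VPC、public / private subnet 切分、route table、NAT、security group 設計">Module 3: Network Foundations</a>). <code>assign_public_ip = false</code> keeps containers from getting a public IP: outbound traffic leaves through NAT, and inbound traffic arrives through the ALB.</p>
<p><code>deployment_circuit_breaker</code> is the built-in ECS safeguard: if tasks keep failing to start while a new version is being deployed (failed health checks, container crashes), ECS automatically rolls back to the previous version. This behavior must be enabled explicitly and is off by default. With it off, tasks of the bad version fail to start over and over: the new version never comes up, the deployment never rolls back to the old one, and the service is stuck in a degraded state.</p>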
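<h2 id="連線管理運算到資料庫的接線">Connection management: wiring compute to the database</h2>
<p>There is a piece of wiring between compute and the database that is often skipped: connection management. When stateless compute scales horizontally, every task opens its own connections to RDS, which can easily exhaust the database's connection limit. The default RDS connection limit is derived from the memory of the instance class (for example, <code>db.r6g.large</code> allows on the order of 1000 connections), so an ECS service running 50 tasks reaches that limit if each task opens 20 connections.</p>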
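<p>The arithmetic above (tasks times connections per task, compared with the database limit) can be kept next to the infrastructure as a check. The sketch below assumes Terraform 1.5 or later for the <code>check</code> block; <code>db_pool_size_per_task</code> and <code>db_max_connections</code> are hypothetical variables introduced here, and their defaults simply mirror the numbers in the text.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl">variable &#34;db_pool_size_per_task&#34; {
  type    = number
  default = 20 # connections each task may open (assumed)
}

variable &#34;db_max_connections&#34; {
  type    = number
  default = 1000 # approximate limit for the chosen instance class (assumed)
}

locals {
  # Worst case: every task at maximum scale holds a full pool.
  peak_db_connections = var.api_max_count * var.db_pool_size_per_task
}

# A failing check is reported as a warning; it does not block plan or apply.
check &#34;db_connection_budget&#34; {
  assert {
    condition     = local.peak_db_connections &lt;= var.db_max_connections
    error_message = &#34;Peak task connections (${local.peak_db_connections}) exceed the database connection limit.&#34;
  }
}
</code></pre></div>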
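<p>When you see signs that "scaling compute out drags the DB down", introduce a connection pool or a managed connection proxy. RDS Proxy sits between compute and RDS and collapses the many short-lived connections from the compute side into a small number of long-lived connections to the database. It can also be written into IaC, with its endpoint exported for compute to reference:</p>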





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_db_proxy&#34; &#34;main&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  name</span>                   <span class="o">=</span> <span class="s2">&#34;api-proxy-${var.env}&#34;</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  engine_family</span>          <span class="o">=</span> <span class="s2">&#34;POSTGRESQL&#34;</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">  role_arn</span>               <span class="o">=</span> <span class="k">aws_iam_role</span><span class="p">.</span><span class="k">rds_proxy</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="n">  vpc_subnet_ids</span>         <span class="o">=</span> <span class="p">[</span><span class="k">for</span> <span class="k">s</span> <span class="k">in</span> <span class="k">aws_subnet</span><span class="p">.</span><span class="k">private</span> <span class="err">:</span> <span class="k">s</span><span class="p">.</span><span class="k">id</span><span class="p">]</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">  vpc_security_group_ids</span> <span class="o">=</span> <span class="p">[</span><span class="k">aws_security_group</span><span class="p">.</span><span class="k">rds_proxy</span><span class="p">.</span><span class="k">id</span><span class="p">]</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">  <span class="k">auth</span> {
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="n">    auth_scheme</span> <span class="o">=</span> <span class="s2">&#34;SECRETS&#34;</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">    secret_arn</span>  <span class="o">=</span> <span class="k">aws_secretsmanager_secret</span><span class="p">.</span><span class="k">db_password</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">  }
</span></span><span class="line"><span class="ln">12</span><span class="cl">}
</span></span><span class="line"><span class="ln">13</span><span class="cl">
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="k">output</span> <span class="s2">&#34;db_endpoint&#34;</span> {
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="n">  value</span> <span class="o">=</span> <span class="k">aws_db_proxy</span><span class="p">.</span><span class="k">main</span><span class="p">.</span><span class="k">endpoint</span>
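</span></span><span class="line"><span class="ln">16</span><span class="cl">}</span></span></code></pre></div><p>The connection string on the compute side points at the proxy endpoint rather than the RDS endpoint. The proxy's security group allows traffic from the compute security group, and traffic from the proxy to RDS is governed by a rule on the RDS security group that admits the proxy's security group. The security boundary gains one more layer but becomes clearer.</p>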
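<p>The proxy resource above still needs a registered target and the two security group rules just described. A possible sketch follows; <code>aws_db_instance.main</code> and <code>aws_security_group.rds</code> are assumed names, port 5432 follows from the PostgreSQL engine family, and the pool percentage is illustrative. Egress rules are omitted.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Register the database behind the proxy.
resource &#34;aws_db_proxy_default_target_group&#34; &#34;main&#34; {
  db_proxy_name = aws_db_proxy.main.name

  connection_pool_config {
    max_connections_percent = 75 # illustrative share of the DB limit
  }
}

resource &#34;aws_db_proxy_target&#34; &#34;main&#34; {
  db_proxy_name          = aws_db_proxy.main.name
  target_group_name      = aws_db_proxy_default_target_group.main.name
  db_instance_identifier = aws_db_instance.main.identifier # assumed name
}

# Compute to proxy: the proxy admits the API tasks' security group.
resource &#34;aws_vpc_security_group_ingress_rule&#34; &#34;proxy_from_api&#34; {
  security_group_id            = aws_security_group.rds_proxy.id
  referenced_security_group_id = aws_security_group.api.id
  ip_protocol                  = &#34;tcp&#34;
  from_port                    = 5432
  to_port                      = 5432
}

# Proxy to RDS: the database admits only the proxy's security group.
resource &#34;aws_vpc_security_group_ingress_rule&#34; &#34;rds_from_proxy&#34; {
  security_group_id            = aws_security_group.rds.id # assumed name
  referenced_security_group_id = aws_security_group.rds_proxy.id
  ip_protocol                  = &#34;tcp&#34;
  from_port                    = 5432
  to_port                      = 5432
}
</code></pre></div>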
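<h2 id="auto-scaling容量隨負載擴縮">Auto-scaling: capacity that follows load</h2>
<p>An ECS service's <code>desired_count</code> is a static starting capacity. To adjust capacity dynamically with load, add Application Auto Scaling. Its job is to grow more tasks when load rises and shrink back when load falls to save money.</p>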
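<p>One caveat once Application Auto Scaling is attached: a later <code>terraform apply</code> would otherwise reset the running task count to <code>var.api_desired_count</code>. A common way to avoid that, sketched here as an assumption rather than as part of the original configuration, is to ignore drift on <code>desired_count</code> (alongside any other attributes the service already ignores):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl">resource &#34;aws_ecs_service&#34; &#34;api&#34; {
  # ... other arguments unchanged ...

  # Used when the service is first created; Application Auto Scaling
  # owns the value afterwards.
  desired_count = var.api_desired_count

  lifecycle {
    ignore_changes = [desired_count]
  }
}
</code></pre></div>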
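<p>The core decision in auto-scaling is "which metric triggers scaling". Common metrics fall into two categories:</p>
<table>
  <thead>
      <tr>
          <th>Metric type</th>
          <th>Typical metrics</th>
          <th>When to use</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Resource utilization</td>
          <td>CPU utilization, memory utilization</td>
          <td>Compute-bound services where CPU tracks load</td>
      </tr>
      <tr>
          <td>Business throughput</td>
          <td>ALB request count per target</td>
          <td>I/O-bound services with low CPU but high concurrency</td>
      </tr>
  </tbody>
</table>
<p>CPU utilization is the most intuitive metric, but it misleads on I/O-bound services: a task waiting on external API responses shows low CPU yet has no spare capacity for new requests. In that case the ALB request count per target (the average number of requests each task handles) reflects real load better.</p>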





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_appautoscaling_target&#34; &#34;api&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  max_capacity</span>       <span class="o">=</span> <span class="k">var</span><span class="p">.</span><span class="k">api_max_count</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  min_capacity</span>       <span class="o">=</span> <span class="k">var</span><span class="p">.</span><span class="k">api_min_count</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">  resource_id</span>        <span class="o">=</span> <span class="s2">&#34;service/${aws_ecs_cluster.main.name}/${aws_ecs_service.api.name}&#34;</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="n">  scalable_dimension</span> <span class="o">=</span> <span class="s2">&#34;ecs:service:DesiredCount&#34;</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">  service_namespace</span>  <span class="o">=</span> <span class="s2">&#34;ecs&#34;</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">}
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_appautoscaling_policy&#34; &#34;api_cpu&#34;</span> {
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">  name</span>               <span class="o">=</span> <span class="s2">&#34;api-cpu-${var.env}&#34;</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="n">  policy_type</span>        <span class="o">=</span> <span class="s2">&#34;TargetTrackingScaling&#34;</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="n">  resource_id</span>        <span class="o">=</span> <span class="k">aws_appautoscaling_target</span><span class="p">.</span><span class="k">api</span><span class="p">.</span><span class="k">resource_id</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="n">  scalable_dimension</span> <span class="o">=</span> <span class="k">aws_appautoscaling_target</span><span class="p">.</span><span class="k">api</span><span class="p">.</span><span class="k">scalable_dimension</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="n">  service_namespace</span>  <span class="o">=</span> <span class="k">aws_appautoscaling_target</span><span class="p">.</span><span class="k">api</span><span class="p">.</span><span class="k">service_namespace</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">
</span></span><span class="line"><span class="ln">16</span><span class="cl">  <span class="k">target_tracking_scaling_policy_configuration</span> {
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="n">    target_value</span>       <span class="o">=</span> <span class="m">60</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">    <span class="k">predefined_metric_specification</span> {
</span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="n">      predefined_metric_type</span> <span class="o">=</span> <span class="s2">&#34;ECSServiceAverageCPUUtilization&#34;</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl">    }
</span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="n">    scale_in_cooldown</span>  <span class="o">=</span> <span class="m">300</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="n">    scale_out_cooldown</span> <span class="o">=</span> <span class="m">60</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl">  }
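</span></span><span class="line"><span class="ln">24</span><span class="cl">}</span></span></code></pre></div><p><code>target_value = 60</code> means the target is to hold average CPU at 60%, leaving 40% headroom for bursts. <code>scale_out_cooldown</code> is short (60 seconds) so scale-out reacts quickly; <code>scale_in_cooldown</code> is long (300 seconds) so that a brief dip in load does not trigger an immediate scale-in, only for the next wave of traffic to force a scale-out again.</p>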
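<p>For the I/O-bound case, the same target-tracking shape can follow ALB request count per target instead of CPU. This is a sketch only: <code>aws_lb.main</code> is an assumed name for the load balancer, and the target value of 500 is a placeholder that would need to come from load testing.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl">resource &#34;aws_appautoscaling_policy&#34; &#34;api_requests&#34; {
  name               = &#34;api-requests-${var.env}&#34;
  policy_type        = &#34;TargetTrackingScaling&#34;
  resource_id        = aws_appautoscaling_target.api.resource_id
  scalable_dimension = aws_appautoscaling_target.api.scalable_dimension
  service_namespace  = aws_appautoscaling_target.api.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value = 500 # requests per target; placeholder value

    predefined_metric_specification {
      predefined_metric_type = &#34;ALBRequestCountPerTarget&#34;
      # Label format: load balancer arn_suffix, a slash, target group arn_suffix.
      resource_label = &#34;${aws_lb.main.arn_suffix}/${aws_lb_target_group.api.arn_suffix}&#34;
    }

    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}
</code></pre></div>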
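<p>Once auto-scaling is configured, review the scaling activity log regularly to confirm it scales at the right moments. If it has never triggered, there are two possibilities: <code>min_capacity</code> is already above actual demand (wasted resources), or the target value is set too high (scale-out comes too late).</p>
<p><code>max_capacity</code> is the cost guardrail: set a ceiling you can accept, so that abnormal traffic (crawlers, attacks, upstream retry storms) cannot push the task count, and with it the bill, far beyond expectations. Runtime cost optimization is covered in <a href="/blog/devops/08-cost-management/" data-link-title="模組八：成本管理" data-link-desc="雲端帳單怎麼不失控 — reserved instance、spot instance、right-sizing、成本監控告警">devops Module 8: Cost Management</a>.</p>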
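<p>One way to make that ceiling explicit is a validated variable, so that raising it has to pass through review. The bound of 50 below is purely illustrative and not a recommendation.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl">variable &#34;api_max_count&#34; {
  type        = number
  description = &#34;Upper bound on running tasks; doubles as a cost ceiling.&#34;

  validation {
    # Illustrative bound: plans with a larger value are rejected.
    condition     = var.api_max_count &gt;= 1 &amp;&amp; var.api_max_count &lt;= 50
    error_message = &#34;The api_max_count value must be between 1 and 50.&#34;
  }
}
</code></pre></div>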
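<p>As scale grows, auto-scaling behaves differently. <a href="/blog/backend/09-performance-capacity/cases/niantic-pokemon-go-fifty-x-surge-gcp/" data-link-title="9.C8 Niantic Pokémon GO：在 GCP 上承載 50 倍突發流量" data-link-desc="Pokémon GO 上線時實際流量達原始預估 50 倍、Google CRE 怎麼即時補容量">When Pokémon GO launched, actual traffic reached 50 times the forecast</a>. A burst like this is not something auto-scaling can plan for ahead of time: 50x headroom would make everyday cost unreasonable. Niantic's infra-level prerequisite was that GKE brought container startup time down to seconds, which made reacting to the surge possible; it also relied on Google CRE to add node capacity in real time. <a href="/blog/backend/09-performance-capacity/cases/zoom-covid-surge-dynamodb/" data-link-title="9.C18 Zoom：COVID 期間從 1000 萬到 3 億 DAU 的 30 倍突發" data-link-desc="Zoom 在 2020 年 COVID 爆發時、日活從 1000 萬衝到 3 億、用 DynamoDB 撐住會議後端">Zoom's 30x surge during COVID</a> was structural growth instead: after daily active users rose from 10 million to 300 million they did not fall back, so the baseline for capacity planning had to be recalibrated permanently. The shared lesson of the two cases: <code>max_capacity</code> should leave room for bursts, but handling extreme bursts depends on platform capabilities (the fast startup that containers allow) and vendor support (the elasticity of managed services), not on anything IaC configuration can solve alone.</p>
<p>Multi-cluster governance is another dimension of scale. <a href="/blog/backend/09-performance-capacity/cases/riot-games-eks-multi-cluster/" data-link-title="9.C12 Riot Games：246 個 EKS cluster 的多遊戲多地區治理" data-link-desc="Riot Games 從 Mesos 遷移到 EKS、用 246 個 cluster 跨遊戲跨地區治理、年省 1000 萬美金">Riot Games runs 246 EKS clusters across multiple games and regions</a>, with separate clusters per game (to keep games from affecting each other), Terraform for IaC, and Karpenter for node lifecycle, saving 10 million US dollars a year. The infra-level lesson: when the number of compute clusters grows from single digits to dozens or even hundreds, the cluster itself becomes a resource that needs IaC governance, with cluster creation, version upgrades, and security baselines all standardized. <a href="/blog/backend/05-deployment-platform/cases/conde-nast-platform-modernization-eks/" data-link-title="5.C2 Condé Nast：EKS 平台整併與標準化" data-link-desc="多地區異質 Kubernetes 平台整併為統一控制面的案例。">Condé Nast's EKS platform consolidation</a> confirms the same pattern: multiple teams each maintaining heterogeneous K8s clusters leads to inconsistent security baselines. After consolidating onto a unified platform, they replaced kube2iam (which carries a race condition risk) with IRSA (OIDC federation), eliminating node-level credential sharing.</p>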
<h2 id="跨分類引用">Cross-category references</h2>
<ul>
<li>→ <a href="/blog/infra/02-identity-credentials/" data-link-title="模組二：身分與憑證地基 — IAM 與 OIDC" data-link-desc="IAM role / policy 設計、最小權限，以及用 OIDC 短期憑證取代長期 access key">Module 2: Identity and Credential Foundations</a>: least-privilege design for the execution role and the task role</li>
<li>→ <a href="/blog/infra/03-network-foundation/" data-link-title="模組三：網路地基 — VPC 與分層" data-link-desc="VPC、public / private subnet 切分、route table、NAT、security group 設計">Module 3: Network Foundations</a>: compute placed in private subnets, and security group wiring</li>
<li>→ <a href="/blog/infra/06-observability-logging/" data-link-title="模組六：可觀測性與 log 一併寫進 code" data-link-desc="log group、metric、alarm 跟基礎設施同生命週期管理，出事時追得到查得到">Module 6: Observability and Logs</a>: the log group shares a lifecycle with the task definition</li>
<li>→ <a href="/blog/devops/08-cost-management/" data-link-title="模組八：成本管理" data-link-desc="雲端帳單怎麼不失控 — reserved instance、spot instance、right-sizing、成本監控告警">devops Module 8: Cost Management</a>: cost guardrails for auto-scaling, and mixing spot / Fargate Spot</li>
</ul>
]]></content:encoded></item></channel></rss>