<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Module 5: Core Services in IaC on Tarragon</title><link>https://tarrragon.github.io/blog/infra/05-core-services/</link><description>Recent content in Module 5: Core Services in IaC on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Fri, 26 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/infra/05-core-services/index.xml" rel="self" type="application/rss+xml"/><item><title>Deployment Order and Databases in IaC</title><link>https://tarrragon.github.io/blog/infra/05-core-services/deployment-order-database/</link><pubDate>Fri, 26 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/infra/05-core-services/deployment-order-database/</guid><description>&lt;p>With the foundation in place, the services that actually carry business traffic go into IaC in "foundation → upper layers" order. &lt;a href="https://tarrragon.github.io/blog/infra/02-identity-credentials/" data-link-title="模組二：身分與憑證地基 — IAM 與 OIDC" data-link-desc="IAM role / policy 設計、最小權限，以及用 OIDC 短期憑證取代長期 access key">Identity (IAM)&lt;/a>, &lt;a href="https://tarrragon.github.io/blog/infra/03-network-foundation/" data-link-title="模組三：網路地基 — VPC 與分層" data-link-desc="VPC、public / private subnet 切分、route table、NAT、security group 設計">networking (VPC / subnet)&lt;/a>, and &lt;a href="https://tarrragon.github.io/blog/infra/04-environment-separation/" data-link-title="模組四：環境分離與模組化" data-link-desc="dev / staging / prod 切分、目錄結構 vs workspace、用可重用 module 避免環境漂移">environment separation&lt;/a> form the base plane. This layer sits on top of them and describes the database, compute, storage, and ingress, which is where business traffic actually lands. How order and dependencies are expressed determines whether this layer can be rebuilt, torn down, and evolved cleanly. The shared principle: describe each service's "identity and wiring" rather than stuffing every runtime parameter into code.&lt;/p>
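&lt;p>This article first establishes how the dependency graph drives deployment order, then turns to the class of core service that needs the most careful description: the database. A database holds state that cannot be rebuilt, so its IaC description carries three dimensions that stateless resources lack: protection policy, connection management, and read/write splitting.&lt;/p>
&lt;h2 id="核心服務的部署順序">Deployment order of core services&lt;/h2>
&lt;p>The deployment order of core services follows the direction of dependency: what is depended on gets built first, and what depends on others gets built later. Networking and identity are shared prerequisites for almost every upper-layer service: the database goes into private subnets, compute needs an IAM role before it can read S3, and the load balancer attaches to public subnets and references a security group. If these base planes are not yet in place, upper-layer resources fail at apply time because a subnet ID or role ARN cannot be found, or worse, they land in the default VPC and bypass the isolation design entirely.&lt;/p>
&lt;p>Letting the IaC tool derive the order from its dependency graph is more reliable than ordering by hand. When a compute resource references attributes of a subnet and a security group, Terraform resolves a "subnet before compute" edge and schedules the apply accordingly. A hand-maintained "do A, then B" checklist goes stale as resources are added, while the dependency graph evolves with the code itself.&lt;/p>
&lt;h3 id="四層依賴結構">The four-layer dependency structure&lt;/h3>
&lt;p>The dependency graph typically unfolds in four layers:&lt;/p>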
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Layer&lt;/th>
 &lt;th>Resources&lt;/th>
 &lt;th>Depends on&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>1&lt;/td>
 &lt;td>VPC, subnet, security group, IAM role&lt;/td>
 &lt;td>None (foundation layer, built in Modules 2 through 4)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>2&lt;/td>
 &lt;td>RDS, ElastiCache, S3 bucket&lt;/td>
 &lt;td>References subnet group, security group&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>3&lt;/td>
 &lt;td>ECS service / EKS workload, RDS Proxy&lt;/td>
 &lt;td>References subnet, IAM role, DB endpoint&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>4&lt;/td>
 &lt;td>ALB, listener, target group, ACM certificate&lt;/td>
 &lt;td>References public subnet, security group; forwards traffic to ECS&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
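&lt;p>These four layers need no manual orchestration. As long as the references in the code are correct, Terraform derives the apply order from them. When the order in the plan output looks counterintuitive, for example the database being created before the security group it is supposed to use, it usually means a reference is broken and there is no dependency edge between the two. One caveat: the layers describe the traffic path, not always the reference direction. An &lt;code>aws_ecs_service&lt;/code> references its target group ARN, so Terraform creates the ALB target group before the service, not after.&lt;/p>
&lt;h3 id="順序失控的徵兆">Signs that ordering has broken down&lt;/h3>
&lt;p>An early sign that ordering has broken down: an upper-layer resource definition contains hardcoded subnet IDs or VPC IDs.&lt;/p>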





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-hcl" data-lang="hcl">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="c1"># Hardcoded IDs: the dependency graph is broken, so the upper layer will not follow when the base layer is rebuilt
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">resource&lt;/span> &lt;span class="s2">&amp;#34;aws_db_subnet_group&amp;#34; &amp;#34;private&amp;#34;&lt;/span> {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="n"> subnet_ids&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;subnet-0abc123&amp;#34;, &amp;#34;subnet-0def456&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
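&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">}&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This code has no reference to the underlying subnet resources. Once the base layer is rebuilt and the IDs change, the upper layer does not follow: the configuration keeps pointing at IDs that no longer exist, and the mismatch between what the code declares and what exists in the cloud (a form of drift) starts here. The fix is to replace the hardcoded IDs with references to attributes of the underlying resources:&lt;/p>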





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-hcl" data-lang="hcl">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="c1"># Attribute reference: the dependency graph is inferred, and the upper layer picks up new IDs when the base layer is rebuilt
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">resource&lt;/span> &lt;span class="s2">&amp;#34;aws_db_subnet_group&amp;#34; &amp;#34;private&amp;#34;&lt;/span> {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="n"> subnet_ids&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="k">for&lt;/span> &lt;span class="k">s&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="k">aws_subnet&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">private&lt;/span> &lt;span class="err">:&lt;/span> &lt;span class="k">s&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">id&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">}&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>When the dependency crosses state boundaries (the network foundation and the core services live in different states), a data source replaces the direct reference. That trade-off is covered in &lt;a href="https://tarrragon.github.io/blog/infra/05-core-services/stateful-protection-dependency/" data-link-title="Stateful 資源保護與跨服務依賴表達" data-link-desc="stateful 資源的保護策略（multi-AZ、備份、刪除保護）、stateful 與 stateless 的操作差異，以及用 output 與 data source 表達服務間依賴">service dependencies and cross-state references&lt;/a>.&lt;/p></description><content:encoded><![CDATA[<p>With the foundation in place, the services that actually carry business traffic go into IaC in "foundation → upper layers" order. <a href="/blog/infra/02-identity-credentials/" data-link-title="模組二：身分與憑證地基 — IAM 與 OIDC" data-link-desc="IAM role / policy 設計、最小權限，以及用 OIDC 短期憑證取代長期 access key">Identity (IAM)</a>, <a href="/blog/infra/03-network-foundation/" data-link-title="模組三：網路地基 — VPC 與分層" data-link-desc="VPC、public / private subnet 切分、route table、NAT、security group 設計">networking (VPC / subnet)</a>, and <a href="/blog/infra/04-environment-separation/" data-link-title="模組四：環境分離與模組化" data-link-desc="dev / staging / prod 切分、目錄結構 vs workspace、用可重用 module 避免環境漂移">environment separation</a> form the base plane. This layer sits on top of them and describes the database, compute, storage, and ingress, which is where business traffic actually lands. How order and dependencies are expressed determines whether this layer can be rebuilt, torn down, and evolved cleanly. The shared principle: describe each service's "identity and wiring" rather than stuffing every runtime parameter into code.</p>
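<p>This article first establishes how the dependency graph drives deployment order, then turns to the class of core service that needs the most careful description: the database. A database holds state that cannot be rebuilt, so its IaC description carries three dimensions that stateless resources lack: protection policy, connection management, and read/write splitting.</p>
<h2 id="核心服務的部署順序">Deployment order of core services</h2>
<p>The deployment order of core services follows the direction of dependency: what is depended on gets built first, and what depends on others gets built later. Networking and identity are shared prerequisites for almost every upper-layer service: the database goes into private subnets, compute needs an IAM role before it can read S3, and the load balancer attaches to public subnets and references a security group. If these base planes are not yet in place, upper-layer resources fail at apply time because a subnet ID or role ARN cannot be found, or worse, they land in the default VPC and bypass the isolation design entirely.</p>
<p>Letting the IaC tool derive the order from its dependency graph is more reliable than ordering by hand. When a compute resource references attributes of a subnet and a security group, Terraform resolves a "subnet before compute" edge and schedules the apply accordingly. A hand-maintained "do A, then B" checklist goes stale as resources are added, while the dependency graph evolves with the code itself.</p>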
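<p>A minimal sketch of how attribute references become graph edges. The resource names (<code>main</code>, <code>app</code>) and the CIDR ranges are placeholders chosen for illustration, not values from this series:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Hypothetical names and CIDR ranges, for illustration only.
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "app" {
  vpc_id     = aws_vpc.main.id # edge: VPC before subnet
  cidr_block = "10.0.1.0/24"
}

resource "aws_security_group" "app" {
  name   = "app"
  vpc_id = aws_vpc.main.id # edge: VPC before security group
}
</code></pre></div>
<p>Neither block states an order. The two <code>aws_vpc.main.id</code> references are enough for Terraform to create the VPC first. Running <code>terraform graph</code> prints the resolved graph in DOT format, which is a quick way to confirm that an expected edge exists.</p>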
<h3 id="四層依賴結構">The four-layer dependency structure</h3>
<p>The dependency graph typically unfolds in four layers:</p>
<table>
  <thead>
      <tr>
          <th>Layer</th>
          <th>Resources</th>
          <th>Depends on</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>VPC, subnet, security group, IAM role</td>
          <td>None (foundation layer, built in Modules 2 through 4)</td>
      </tr>
      <tr>
          <td>2</td>
          <td>RDS, ElastiCache, S3 bucket</td>
          <td>References subnet group, security group</td>
      </tr>
      <tr>
          <td>3</td>
          <td>ECS service / EKS workload, RDS Proxy</td>
          <td>References subnet, IAM role, DB endpoint</td>
      </tr>
      <tr>
          <td>4</td>
          <td>ALB, listener, target group, ACM certificate</td>
          <td>References public subnet, security group; forwards traffic to ECS</td>
      </tr>
  </tbody>
</table>
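<p>These four layers need no manual orchestration. As long as the references in the code are correct, Terraform derives the apply order from them. When the order in the plan output looks counterintuitive, for example the database being created before the security group it is supposed to use, it usually means a reference is broken and there is no dependency edge between the two. One caveat: the layers describe the traffic path, not always the reference direction. An <code>aws_ecs_service</code> references its target group ARN, so Terraform creates the ALB target group before the service, not after.</p>
<h3 id="順序失控的徵兆">Signs that ordering has broken down</h3>
<p>An early sign that ordering has broken down: an upper-layer resource definition contains hardcoded subnet IDs or VPC IDs.</p>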





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Hardcoded IDs: the dependency graph is broken, so the upper layer will not follow when the base layer is rebuilt
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">resource</span> <span class="s2">&#34;aws_db_subnet_group&#34; &#34;private&#34;</span> {
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="n">  subnet_ids</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;subnet-0abc123&#34;, &#34;subnet-0def456&#34;</span><span class="p">]</span>
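</span></span><span class="line"><span class="ln">4</span><span class="cl">}</span></span></code></pre></div><p>This code has no reference to the underlying subnet resources. Once the base layer is rebuilt and the IDs change, the upper layer does not follow: the configuration keeps pointing at IDs that no longer exist, and the mismatch between what the code declares and what exists in the cloud (a form of drift) starts here. The fix is to replace the hardcoded IDs with references to attributes of the underlying resources:</p>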





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Attribute reference: the dependency graph is inferred, and the upper layer picks up new IDs when the base layer is rebuilt
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">resource</span> <span class="s2">&#34;aws_db_subnet_group&#34; &#34;private&#34;</span> {
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="n">  subnet_ids</span> <span class="o">=</span> <span class="p">[</span><span class="k">for</span> <span class="k">s</span> <span class="k">in</span> <span class="k">aws_subnet</span><span class="p">.</span><span class="k">private</span> <span class="err">:</span> <span class="k">s</span><span class="p">.</span><span class="k">id</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">}</span></span></code></pre></div><p>When the dependency crosses state boundaries (the network foundation and the core services live in different states), a data source replaces the direct reference. That trade-off is covered in <a href="/blog/infra/05-core-services/stateful-protection-dependency/" data-link-title="Stateful 資源保護與跨服務依賴表達" data-link-desc="stateful 資源的保護策略（multi-AZ、備份、刪除保護）、stateful 與 stateless 的操作差異，以及用 output 與 data source 表達服務間依賴">service dependencies and cross-state references</a>.</p>
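<p>As a rough sketch of the cross-state variant, the subnet IDs can be looked up with data sources instead of being referenced directly. The tag conventions below (<code>Name = "main"</code> on the VPC, <code>Tier = "private"</code> on the subnets) are assumptions made for this example and would have to match whatever the network state actually applies:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Assumption: the network state tags its VPC with Name = "main"
# and its private subnets with Tier = "private".
data "aws_vpc" "main" {
  tags = {
    Name = "main"
  }
}

data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.main.id]
  }

  tags = {
    Tier = "private"
  }
}

resource "aws_db_subnet_group" "private" {
  name       = "app-private"
  subnet_ids = data.aws_subnets.private.ids # resolved at plan time, no hardcoded IDs
}
</code></pre></div>
<p>The lookup runs on every plan, so a rebuilt subnet is picked up as long as the tags stay stable. What is lost compared with a direct reference is the ordering guarantee: the network state has to be applied first, because Terraform has no edge across states.</p>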
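<h3 id="隱性依賴與-depends_on">Implicit dependencies and depends_on</h3>
<p>Automatic inference only covers edges created by attribute references. Occasionally two resources depend on each other without any attribute reference. For example, an IAM policy attachment must complete before an ECS task uses the role, but the task references the role ARN, not any output of the attachment. In that case, declare the edge explicitly with <code>depends_on</code>:</p>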





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_ecs_service&#34; &#34;api&#34;</span> {<span class="c1">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1">  # ...
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"></span><span class="n">  depends_on</span> <span class="o">=</span> <span class="p">[</span><span class="k">aws_iam_role_policy_attachment</span><span class="p">.</span><span class="k">ecs_task_s3</span><span class="p">]</span>
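</span></span><span class="line"><span class="ln">4</span><span class="cl">}</span></span></code></pre></div><p><code>depends_on</code> should appear only where automatic inference cannot reach. A module littered with <code>depends_on</code> usually means the references are not explicit enough, and the implicit dependencies should be rewritten as attribute references.</p>
<h2 id="資料庫rds">Database (RDS)</h2>
<p>The database is the core-service resource that needs the most careful description, because it holds state that cannot be rebuilt. IaC defines its instance class, engine version, subnet group (which determines the private subnets it lands in), parameter group, and security groups. Do not hardcode the connection endpoint; expose it as an output for upper-layer compute to reference. A Multi-AZ failover keeps the same endpoint DNS name, but rebuilding the instance or changing its identifier produces a new endpoint, and references pick that up automatically.</p>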





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_db_instance&#34; &#34;primary&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  identifier</span>             <span class="o">=</span> <span class="s2">&#34;app-${var.env}-primary&#34;</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  engine</span>                 <span class="o">=</span> <span class="s2">&#34;postgres&#34;</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">  engine_version</span>         <span class="o">=</span> <span class="s2">&#34;16.3&#34;</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="n">  instance_class</span>         <span class="o">=</span> <span class="k">var</span><span class="p">.</span><span class="k">db_instance_class</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">  allocated_storage</span>      <span class="o">=</span> <span class="m">100</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="n">  storage_encrypted</span>      <span class="o">=</span> <span class="kt">true</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="n">  db_subnet_group_name</span>   <span class="o">=</span> <span class="k">aws_db_subnet_group</span><span class="p">.</span><span class="k">private</span><span class="p">.</span><span class="k">name</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">  vpc_security_group_ids</span> <span class="o">=</span> <span class="p">[</span><span class="k">aws_security_group</span><span class="p">.</span><span class="k">db</span><span class="p">.</span><span class="k">id</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="n">  multi_az</span>                  <span class="o">=</span><span class="n"> var.env</span> <span class="o">==</span> <span class="s2">&#34;prod&#34;</span> <span class="err">?</span> <span class="kt">true</span> <span class="err">:</span> <span class="kt">false</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="n">  backup_retention_period</span>   <span class="o">=</span><span class="n"> var.env</span> <span class="o">==</span> <span class="s2">&#34;prod&#34;</span> <span class="err">?</span> <span class="m">14</span> <span class="err">:</span> <span class="m">1</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="n">  backup_window</span>             <span class="o">=</span> <span class="s2">&#34;03:00-04:00&#34;</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="n">  deletion_protection</span>       <span class="o">=</span><span class="n"> var.env</span> <span class="o">==</span> <span class="s2">&#34;prod&#34;</span> <span class="err">?</span> <span class="kt">true</span> <span class="err">:</span> <span class="kt">false</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="n">  skip_final_snapshot</span>       <span class="o">=</span><span class="n"> var.env</span> <span class="o">==</span> <span class="s2">&#34;prod&#34;</span> <span class="err">?</span> <span class="kt">false</span> <span class="err">:</span> <span class="kt">true</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="n">  final_snapshot_identifier</span> <span class="o">=</span><span class="n"> var.env</span> <span class="o">==</span> <span class="s2">&#34;prod&#34;</span> <span class="err">?</span> <span class="s2">&#34;app-prod-final&#34;</span> <span class="err">:</span> <span class="k">null</span> <span class="c1"># fixed name: a timestamp() here would show a change on every plan</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">
</span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="n">  tags</span> <span class="o">=</span><span class="n"> { service</span> <span class="o">=</span> <span class="s2">&#34;payments&#34;</span> }
</span></span><span class="line"><span class="ln">20</span><span class="cl">}
</span></span><span class="line"><span class="ln">21</span><span class="cl">
</span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="k">output</span> <span class="s2">&#34;db_endpoint&#34;</span> {
</span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="n">  value</span> <span class="o">=</span> <span class="k">aws_db_instance</span><span class="p">.</span><span class="k">primary</span><span class="p">.</span><span class="k">endpoint</span>
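</span></span><span class="line"><span class="ln">24</span><span class="cl">}</span></span></code></pre></div><h3 id="加密的不可逆性">Encryption is irreversible</h3>
<p><code>storage_encrypted = true</code> ensures storage-level encryption is in effect from the moment the resource is created. RDS cannot enable encryption on an existing instance after the fact; if it was missed, the only way out is a rebuild. The remediation path is to take a snapshot, copy it with a KMS key to produce an encrypted copy, then restore a new instance from the encrypted snapshot. This requires downtime or an endpoint cutover, which is expensive for a production database already serving traffic. If a prod RDS instance has <code>storage_encrypted</code> set to false, the earlier this debt is paid, the cheaper it is.</p>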
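<p>The snapshot, copy, restore path can be sketched in Terraform roughly as follows. This is an illustration under assumptions, not a tested migration: <code>aws_db_instance.legacy</code> stands for the existing unencrypted instance, <code>aws_kms_key.rds</code> for the key to encrypt with, and all identifiers are hypothetical.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Hypothetical names throughout; aws_db_instance.legacy is the unencrypted instance.
resource "aws_db_snapshot" "plain" {
  db_instance_identifier = aws_db_instance.legacy.identifier
  db_snapshot_identifier = "app-prod-pre-encryption"
}

# Copying a snapshot with a KMS key produces an encrypted copy.
resource "aws_db_snapshot_copy" "encrypted" {
  source_db_snapshot_identifier = aws_db_snapshot.plain.db_snapshot_arn
  target_db_snapshot_identifier = "app-prod-encrypted"
  kms_key_id                    = aws_kms_key.rds.arn
}

# The restored instance inherits encryption from the snapshot.
resource "aws_db_instance" "encrypted" {
  identifier             = "app-prod-primary-enc"
  snapshot_identifier    = aws_db_snapshot_copy.encrypted.target_db_snapshot_identifier
  instance_class         = var.db_instance_class
  db_subnet_group_name   = aws_db_subnet_group.private.name
  vpc_security_group_ids = [aws_security_group.db.id]
}
</code></pre></div>
<p>Writes that reach the old instance after the snapshot is taken are not in the restored copy, so the cutover needs either a write freeze or a separate replication step. That is where the downtime cost comes from.</p>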
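<h3 id="parameter-group-的角色">The role of the parameter group</h3>
<p>A parameter group defines engine-level behavior parameters (such as <code>max_connections</code>, <code>work_mem</code>, and <code>log_min_duration_statement</code>) and is the configuration skeleton of an RDS instance. Describing it in IaC puts these parameters under version control: a change to <code>max_connections</code> shows up in a PR diff instead of being made in the Console one day without anyone noticing.</p>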





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_db_parameter_group&#34; &#34;postgres16&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  family</span> <span class="o">=</span> <span class="s2">&#34;postgres16&#34;</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  name</span>   <span class="o">=</span> <span class="s2">&#34;app-${var.env}-pg16&#34;</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">  <span class="k">parameter</span> {
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">    name</span>  <span class="o">=</span> <span class="s2">&#34;log_min_duration_statement&#34;</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="n">    value</span> <span class="o">=</span> <span class="s2">&#34;1000&#34;</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">  }
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl">  <span class="k">parameter</span> {
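</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="n">    name</span>         <span class="o">=</span> <span class="s2">&#34;shared_preload_libraries&#34;</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="n">    value</span>        <span class="o">=</span> <span class="s2">&#34;pg_stat_statements&#34;</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="n">    apply_method</span> <span class="o">=</span> <span class="s2">&#34;pending-reboot&#34;</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">  }
</span></span><span class="line"><span class="ln">15</span><span class="cl">}</span></span></code></pre></div><p>Some parameters are static: a change only takes effect after the RDS instance reboots, and in Terraform their <code>parameter</code> block must set <code>apply_method = &quot;pending-reboot&quot;</code> (<code>shared_preload_libraries</code> above is one of them). Before changing a parameter, confirm whether it is dynamic (applies immediately) or static (needs a reboot). The Terraform plan does not flag the reboot explicitly, so cross-check against the AWS documentation.</p>
<h3 id="連線管理">Connection management</h3>
<p>Between compute and the database lies a piece of wiring that often gets skipped: connection management. When stateless compute scales horizontally, every instance opens its own connections, and the database's connection limit fills up quickly. An ECS service that scales from 5 tasks to 50, with each task opening 10 connections, jumps from 50 connections to 500. The default <code>max_connections</code> on a <code>db.r6g.large</code> is roughly 1,600 (it is derived from instance memory), so 500 connections already consume close to a third.</p>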
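<p>The arithmetic above can be kept next to the scaling settings as a small <code>locals</code> block, so the connection budget is visible in review. The numbers are the illustrative figures from the paragraph, and the local names are made up for this sketch:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Illustrative figures only; 1600 is an approximate default, not a measured value.
locals {
  max_tasks            = 50
  connections_per_task = 10
  db_max_connections   = 1600

  peak_connections = local.max_tasks * local.connections_per_task      # 50 * 10 = 500
  utilization      = local.peak_connections / local.db_max_connections # 500 / 1600 = 0.3125
}

output "db_connection_utilization" {
  value = local.utilization
}
</code></pre></div>
<p>When the auto-scaling ceiling or the per-task pool size changes in a PR, the resulting utilization changes with it, which makes an approaching limit harder to miss.</p>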
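<p>When the signal "scaling compute drags the DB down" appears, the answer is a connection pool or a managed connection proxy. RDS Proxy is AWS's managed option: it acts as a pooling layer between compute and RDS, collapsing hundreds of short-lived client connections into a small number of long-lived connections to RDS. Define it in the same IaC and output the proxy endpoint for compute to reference:</p>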





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_db_proxy&#34; &#34;app&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  name</span>                   <span class="o">=</span> <span class="s2">&#34;app-${var.env}-proxy&#34;</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  engine_family</span>          <span class="o">=</span> <span class="s2">&#34;POSTGRESQL&#34;</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">  role_arn</span>               <span class="o">=</span> <span class="k">aws_iam_role</span><span class="p">.</span><span class="k">rds_proxy</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="n">  vpc_subnet_ids</span>         <span class="o">=</span> <span class="p">[</span><span class="k">for</span> <span class="k">s</span> <span class="k">in</span> <span class="k">aws_subnet</span><span class="p">.</span><span class="k">private</span> <span class="err">:</span> <span class="k">s</span><span class="p">.</span><span class="k">id</span><span class="p">]</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">  vpc_security_group_ids</span> <span class="o">=</span> <span class="p">[</span><span class="k">aws_security_group</span><span class="p">.</span><span class="k">db</span><span class="p">.</span><span class="k">id</span><span class="p">]</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">  <span class="k">auth</span> {
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="n">    auth_scheme</span> <span class="o">=</span> <span class="s2">&#34;SECRETS&#34;</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">    secret_arn</span>  <span class="o">=</span> <span class="k">aws_secretsmanager_secret</span><span class="p">.</span><span class="k">db_password</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">  }
</span></span><span class="line"><span class="ln">12</span><span class="cl">}
</span></span><span class="line"><span class="ln">13</span><span class="cl">
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="k">output</span> <span class="s2">&#34;db_proxy_endpoint&#34;</span> {
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="n">  value</span> <span class="o">=</span> <span class="k">aws_db_proxy</span><span class="p">.</span><span class="k">app</span><span class="p">.</span><span class="k">endpoint</span>
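</span></span><span class="line"><span class="ln">16</span><span class="cl">}</span></span></code></pre></div><p>When compute references <code>db_proxy_endpoint</code> instead of <code>db_endpoint</code>, connection management moves from each task handling it alone to the proxy consolidating it. RDS Proxy also preserves client connections across failover: when the primary fails over to the standby, the connections the proxy holds are not all dropped and rebuilt, so the application usually sees a brief delay rather than connection errors.</p>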
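<p>On its own, an <code>aws_db_proxy</code> resource does not yet point at a database; the proxy also needs a target group and a target. A minimal sketch, reusing the <code>aws_db_proxy.app</code> and <code>aws_db_instance.primary</code> names from above (the pool percentage is a hypothetical starting value):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl">resource "aws_db_proxy_default_target_group" "app" {
  db_proxy_name = aws_db_proxy.app.name

  connection_pool_config {
    max_connections_percent = 75 # hypothetical cap on the share of max_connections the proxy may use
  }
}

resource "aws_db_proxy_target" "primary" {
  db_proxy_name          = aws_db_proxy.app.name
  target_group_name      = aws_db_proxy_default_target_group.app.name
  db_instance_identifier = aws_db_instance.primary.identifier
}
</code></pre></div>
<p>Because the target references the instance identifier, Terraform orders the proxy wiring after the database without any <code>depends_on</code>.</p>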
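<p>The signal for whether RDS Proxy is needed is the connection growth curve. If compute scales within a fixed range and peak connections stay well below <code>max_connections</code>, connect directly. If compute scales frequently or connections may approach the limit, the proxy is worth adding. The proxy carries its own cost (billed per vCPU of the database instance behind it), so it does not pay off in every environment; dev can usually connect directly.</p>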
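<p>One way to express "proxy in some environments, direct connection in others" is a toggle plus a single output that compute always reads. This is a sketch under assumptions: <code>enable_db_proxy</code> and <code>db_connect_host</code> are hypothetical names, and it presumes <code>aws_db_proxy.app</code> is declared with <code>count = var.enable_db_proxy ? 1 : 0</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Hypothetical toggle; assumes aws_db_proxy.app uses count = var.enable_db_proxy ? 1 : 0.
variable "enable_db_proxy" {
  type    = bool
  default = false
}

output "db_connect_host" {
  # aws_db_instance.address is the hostname without the port,
  # which matches the shape of the proxy endpoint.
  value = var.enable_db_proxy ? aws_db_proxy.app[0].endpoint : aws_db_instance.primary.address
}
</code></pre></div>
<p>Compute then references one output in every environment, and switching an environment onto the proxy is a one-line variable change.</p>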
<h3 id="read-replica">read replica</h3>
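<p>When reads far outnumber writes and the workload can tolerate replication lag (typically milliseconds to seconds), a read replica is the next step for steering read requests away from the primary. In IaC the replica is a separate resource that references the primary's identifier:</p>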





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_db_instance&#34; &#34;read_replica&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  identifier</span>             <span class="o">=</span> <span class="s2">&#34;app-${var.env}-replica&#34;</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  replicate_source_db</span>    <span class="o">=</span> <span class="k">aws_db_instance</span><span class="p">.</span><span class="k">primary</span><span class="p">.</span><span class="k">identifier</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">  instance_class</span>         <span class="o">=</span> <span class="k">var</span><span class="p">.</span><span class="k">db_replica_class</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="n">  vpc_security_group_ids</span> <span class="o">=</span> <span class="p">[</span><span class="k">aws_security_group</span><span class="p">.</span><span class="k">db</span><span class="p">.</span><span class="k">id</span><span class="p">]</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">}
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="k">output</span> <span class="s2">&#34;db_replica_endpoint&#34;</span> {
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="n">  value</span> <span class="o">=</span> <span class="k">aws_db_instance</span><span class="p">.</span><span class="k">read_replica</span><span class="p">.</span><span class="k">endpoint</span>
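</span></span><span class="line"><span class="ln">10</span><span class="cl">}</span></span></code></pre></div><p>Compute references different endpoints for reads and writes: writes go to <code>db_endpoint</code> (or <code>db_proxy_endpoint</code>), reads go to <code>db_replica_endpoint</code>. The routing logic is the application layer's responsibility; infra only exposes the endpoints.</p>
<p>The limits of a read replica need to be stated clearly: it relieves read pressure on the primary, but it is not a backup. Replication is asynchronous, yet the replica still replays every change from the primary, including data deleted by mistake. Protection that restores to a point in time comes from backup retention and PITR (point-in-time recovery); their IaC description is in <a href="/blog/infra/05-core-services/stateful-protection-dependency/" data-link-title="Stateful 資源保護與跨服務依賴表達" data-link-desc="stateful 資源的保護策略（multi-AZ、備份、刪除保護）、stateful 與 stateless 的操作差異，以及用 output 與 data source 表達服務間依賴">stateful protection strategies</a>.</p>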
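<h3 id="引擎版本升級的取捨">Trade-offs in engine version upgrades</h3>
<p>Once the RDS engine version (<code>engine_version</code>) is written into IaC, a version upgrade becomes a change that needs PR review. Upgrades come in two kinds. A minor upgrade (16.2 → 16.3) is usually backward compatible and can be applied automatically during the maintenance window. A major upgrade (15 → 16) may carry breaking changes and needs validation in dev, a backup, and a scheduled maintenance window first.</p>
<p>Pinning <code>engine_version</code> in IaC is a deliberate choice: every version change has to go through a PR. Note that if automatic minor version upgrades stay enabled, AWS can still move the instance to a newer minor version during a maintenance window, and a fully pinned version will then disagree with what is actually running. The cost of pinning is a periodic check for versions approaching end of life. Once the pinned version passes the end of AWS's support period, AWS forces an upgrade, the code no longer matches reality, and a later Terraform apply may fail or try to reconcile the difference. That is far less controllable than upgrading on your own schedule.</p>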
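<p>A fragment showing the arguments that control this behavior on <code>aws_db_instance</code>. It is not a complete resource; the remaining arguments are assumed to be the ones from the earlier definition, and the chosen values are one possible policy rather than a recommendation from the source:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl">resource "aws_db_instance" "primary" {
  # ... other arguments as in the earlier definition ...

  engine_version              = "16.3"
  auto_minor_version_upgrade  = false # minor upgrades happen only through a PR
  allow_major_version_upgrade = false # set to true in the PR that performs a major upgrade
}
</code></pre></div>
<p>With <code>auto_minor_version_upgrade</code> left at its default of <code>true</code>, the alternative is to pin only the major version in <code>engine_version</code> and let AWS pick the minor, which trades review control for fewer manual bumps.</p>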
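<p>Governance concerns also shift as databases scale. After <a href="/blog/backend/09-performance-capacity/cases/netflix-aurora-consolidation/" data-link-title="9.C23 Netflix：把關聯式 DB 統一到 Aurora、效能 &#43;75%、成本 -28%" data-link-desc="Netflix 把多套關聯式 DB 統一到 Aurora、效能提升 75%、成本下降 28%、串流數十億小時">Netflix consolidated its scattered relational databases onto Aurora</a>, cost dropped 28%: the idle capacity summed across instances that many teams had each provisioned on their own far exceeded that of a consolidated fleet. The lesson for the infra layer is that the IaC description of RDS is not only about a single instance's settings; in the long run it also has to manage how clusters are distributed and merged. Another dimension is data residency driven by compliance: <a href="/blog/backend/09-performance-capacity/cases/hard-rock-digital-cockroachdb-sports-betting/" data-link-title="9.C41 Hard Rock Digital：CockroachDB on AWS Outposts、Wire Act 合規 &#43; 跨州單一邏輯 DB" data-link-desc="Hard Rock Digital 用 CockroachDB 跨 AWS Outposts &#43; US-East-1、Wire Act 強制資料留州、單一邏輯 DB 解多州 sportsbook、100 node 32 vCPU 撐 Super Bowl">Hard Rock Digital must keep data within specific states under the Wire Act</a>, and runs compute on premises on AWS Outposts. In cases like this, the choice of region and availability zone is driven by regulatory constraints rather than being a purely technical decision.</p>
<h2 id="跨分類引用">Cross-references</h2>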
<ul>
<li>→ <a href="/blog/infra/03-network-foundation/" data-link-title="模組三：網路地基 — VPC 與分層" data-link-desc="VPC、public / private subnet 切分、route table、NAT、security group 設計">Module 3: Network Foundation</a>: the database's subnet group references the private subnets</li>
<li>→ <a href="/blog/infra/02-identity-credentials/" data-link-title="模組二：身分與憑證地基 — IAM 與 OIDC" data-link-desc="IAM role / policy 設計、最小權限，以及用 OIDC 短期憑證取代長期 access key">Module 2: Identity and Credentials Foundation</a>: the RDS Proxy IAM role and secret access</li>
<li>→ <a href="/blog/infra/04-environment-separation/" data-link-title="模組四：環境分離與模組化" data-link-desc="dev / staging / prod 切分、目錄結構 vs workspace、用可重用 module 避免環境漂移">Module 4: Environment Separation and Modularization</a>: prod and dev use the same module with different parameter values</li>
<li>→ <a href="/blog/infra/05-core-services/stateful-protection-dependency/" data-link-title="Stateful 資源保護與跨服務依賴表達" data-link-desc="stateful 資源的保護策略（multi-AZ、備份、刪除保護）、stateful 與 stateless 的操作差異，以及用 output 與 data source 表達服務間依賴">Stateful protection and cross-state references</a>: the full discussion of backup retention, deletion protection, and multi-AZ</li>
<li>→ <a href="/blog/infra/05-core-services/compute-ecs-eks/" data-link-title="運算平台上 IaC — ECS 與 EKS" data-link-desc="容器運算平台的 IaC 描述：ECS 與 EKS 選型、task definition 與映像版本解耦、IAM task role 分離、auto-scaling 策略">Compute in IaC</a>: how compute references the database endpoints</li>
<li>→ <a href="/blog/backend/01-database/" data-link-title="模組一：資料庫與持久化" data-link-desc="整理 SQL、transaction、migration 與 repository adapter 的後端實務">Backend Module 1: Database</a>: service-side discussion of schema design, migrations, and queries</li>
</ul>
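]]></content:encoded></item><item><title>Compute Platforms in IaC: ECS and EKS</title><link>https://tarrragon.github.io/blog/infra/05-core-services/compute-ecs-eks/</link><pubDate>Fri, 26 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/infra/05-core-services/compute-ecs-eks/</guid><description>&lt;p>Compute is what executes business code. The infra layer describes "compute capacity and wiring": which subnets it runs in, which IAM role it assumes, which load balancer target group it attaches to, and how capacity scales with load. Which version of the code actually runs is decided by the deployment process. This boundary lets infra changes and application releases each move at their own pace: an infra apply does not change the image, and the deployment pipeline does not change the subnets.&lt;/p>
&lt;p>The deployment order of core services follows the direction of dependency (what is depended on gets built first), and compute sits in the third layer of this &lt;a href="https://tarrragon.github.io/blog/infra/05-core-services/deployment-order-database/" data-link-title="部署順序與資料庫上 IaC" data-link-desc="核心服務的依賴圖決定部署順序，資料庫作為第一批上層服務需要最謹慎的 IaC 描述 — 涵蓋 RDS 接線、連線管理、read replica 與端點暴露">four-layer dependency structure&lt;/a>: it references the subnets, security groups, and IAM roles beneath it, and it is wired to the load balancer above it through a target group. So in the IaC definition of compute, subnet IDs, security group IDs, and IAM role ARNs should all be references rather than hardcoded values; only then does the upper layer follow automatically when the base layer is rebuilt.&lt;/p>
&lt;h2 id="ecs-vs-eks-選型">Choosing between ECS and EKS&lt;/h2>
&lt;p>ECS and EKS both run containers; they differ in the operating model of the control plane and in ecosystem fit. The choice depends on team capability and business needs, not on feature count: both meet the baseline goal of "containers run in private subnets, access resources through an IAM role, and receive traffic behind an ALB".&lt;/p>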
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Dimension&lt;/th>
 &lt;th>ECS&lt;/th>
 &lt;th>EKS&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Control plane operations&lt;/td>
 &lt;td>Fully managed by AWS&lt;/td>
 &lt;td>AWS manages the API server; add-ons are self-managed&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Learning curve&lt;/td>
 &lt;td>Low (AWS-native concepts)&lt;/td>
 &lt;td>High (Kubernetes ecosystem)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Cross-cloud portability&lt;/td>
 &lt;td>Low (AWS only)&lt;/td>
 &lt;td>High (Kubernetes standard)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>IaC toolchain&lt;/td>
 &lt;td>Terraform AWS provider throughout&lt;/td>
 &lt;td>Terraform builds the cluster; workloads go through Helm&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Best fit&lt;/td>
 &lt;td>Single-cloud AWS, team without K8s experience&lt;/td>
 &lt;td>Existing K8s capability or a need for its ecosystem&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
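&lt;p>The ECS control plane is managed by AWS. Services, task definitions, and target groups are all AWS-native resources that the Terraform provider describes directly, which keeps the mental overhead low. The Fargate launch type goes a step further: there are no EC2 instances to manage at all. You describe how much CPU and memory a task needs, and AWS schedules it onto the underlying hosts.&lt;/p>
&lt;p>The EKS control plane is managed Kubernetes. IaC describes the cluster itself and its node groups, while workloads (Deployment, Service) go through Kubernetes manifests or Helm charts. The infra toolchain therefore spans two systems, Terraform and Kubernetes: Terraform owns the cluster infrastructure, kubectl / Helm own the workloads, and their state and change processes are separate.&lt;/p>
&lt;p>The complexity of EKS is worth taking on only when the team already has Kubernetes capability or needs its ecosystem (service mesh, custom schedulers, multi-cloud deployment, the community operator ecosystem). Otherwise the low overhead of ECS is the default starting point. A self-check: if the team chose EKS but uses only basic Deployment + Service, with no service mesh, CRDs, or cross-cloud needs, it is paying the operating cost of Kubernetes without collecting the return, and moving back to ECS is usually the more sensible choice.&lt;/p>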
&lt;h3 id="fargate-vs-ec2-launch-type">Fargate vs EC2 launch type&lt;/h3>
&lt;p>ECS 的執行模式再分 EC2 launch type 和 Fargate launch type。EC2 launch type 需要自己管理 EC2 instance 組成的 capacity provider — AMI 更新、instance 擴縮、OS 層安全修補都是團隊的責任。Fargate 由 AWS 代管運算實例，不需要配 capacity provider、不需要管 AMI，進一步降低運維面。&lt;/p></description><content:encoded><![CDATA[<p>運算是業務程式碼的執行載體。infra 這層描述的是「運算容量與接線」— 它跑在哪些 subnet、套用哪個 IAM role、掛到哪個 load balancer 的 target group、以及容量怎麼隨負載擴縮。實際跑什麼版本的程式碼由部署流程決定，這個邊界讓 infra 變更與應用發布各走各的節奏 — infra apply 不會因此改動映像，部署 pipeline 不會因此改動 subnet。</p>
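<p>The deployment order of core services follows the direction of dependency (whatever is depended on gets built first). In this <a href="/blog/infra/05-core-services/deployment-order-database/" data-link-title="部署順序與資料庫上 IaC" data-link-desc="核心服務的依賴圖決定部署順序，資料庫作為第一批上層服務需要最謹慎的 IaC 描述 — 涵蓋 RDS 接線、連線管理、read replica 與端點暴露">four-layer dependency structure</a>, compute sits on the third layer: it references the subnets, security groups, and IAM roles beneath it, and is in turn referenced by the load balancer target group above it. In the IaC definition of a compute resource, subnet IDs, security group IDs, and IAM role ARNs should therefore be references rather than hard-coded values, so that the upper layers follow automatically when the lower layers are rebuilt.</p>
<h2 id="ecs-vs-eks-選型">Choosing between ECS and EKS</h2>
<p>Both ECS and EKS run containers; they differ in the operating model of the control plane and in ecosystem fit. The choice depends on team capability and business needs rather than feature count: both can reach the baseline goal of "containers run in private subnets, access resources through an IAM role, and receive traffic behind an ALB."</p>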
<table>
  <thead>
      <tr>
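          <th>Dimension</th>
          <th>ECS</th>
          <th>EKS</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Control-plane operations</td>
          <td>Fully managed by AWS</td>
          <td>AWS manages the API server; add-ons are self-managed</td>
      </tr>
      <tr>
          <td>Learning curve</td>
          <td>Low (AWS-native concepts)</td>
          <td>High (Kubernetes ecosystem)</td>
      </tr>
      <tr>
          <td>Cross-cloud portability</td>
          <td>Low (AWS-specific)</td>
          <td>High (Kubernetes standard)</td>
      </tr>
      <tr>
          <td>IaC toolchain</td>
          <td>Everything through the Terraform AWS provider</td>
          <td>Terraform builds the cluster; workloads go through Helm</td>
      </tr>
      <tr>
          <td>Best fit</td>
          <td>Single-cloud AWS, team without K8s experience</td>
          <td>Existing K8s capability, or a need for its ecosystem</td>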
      </tr>
  </tbody>
</table>
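<p>ECS's control plane is managed by AWS. Services, task definitions, and target groups are all AWS-native resources that the Terraform AWS provider describes directly, so the cognitive load is low. The Fargate launch type goes one step further: there are no EC2 instances to manage at all. You describe only how much CPU and memory a task needs, and AWS schedules it onto the underlying hosts.</p>
<p>EKS's control plane is managed Kubernetes. IaC describes the cluster itself and its node groups, while workloads (Deployment, Service) go through Kubernetes manifests or Helm charts. The infra toolchain therefore spans two systems: Terraform owns the cluster infrastructure and kubectl / Helm own the workloads, and the two keep separate state and separate change workflows.</p>
<p>EKS's complexity is worth taking on only when the team already has Kubernetes capability or needs its ecosystem (service mesh, custom schedulers, multi-cloud deployment, the community operator ecosystem). Otherwise, ECS's low overhead is the default starting point. A self-check: if the team chose EKS but uses only the most basic Deployment + Service, with no service mesh, CRDs, or cross-cloud deployment, it is paying the operational cost of Kubernetes without collecting the return, and falling back to ECS is usually the more reasonable choice.</p>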
<h3 id="fargate-vs-ec2-launch-type">Fargate vs EC2 launch type</h3>
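<p>ECS execution modes split further into the EC2 launch type and the Fargate launch type. With the EC2 launch type you manage the capacity provider made up of EC2 instances yourself: AMI updates, instance scaling, and OS-level security patching are all the team's responsibility. With Fargate, AWS manages the compute instances; there is no EC2-backed capacity provider to configure and no AMI to maintain, which shrinks the operational surface further.</p>
<p>Fargate's cost shows up on three fronts: a higher unit price (roughly 20-40% more than EC2 for the same vCPU/memory), no support for GPU workloads, and slightly longer startup latency (a cold start of about 30-60 seconds, versus near-instant on EC2 when an instance is already running). For most web APIs and non-GPU background jobs, Fargate is the initial choice: the operations time saved usually outweighs the premium. Switch back to the EC2 launch type when traffic is steady and cost needs optimizing; what you add at that point is capacity provider configuration and instance management. For a sense of scale: a Fargate task running continuously at 2 vCPU / 4 GB costs about $70 per month, while an EC2 t3.medium of the same size costs about $30 (t3 is a burstable instance type, so this comparison shows a wider gap than the 20-40% figure, which fits fixed-performance instances). The monthly difference is insignificant with a handful of services; only once the task count passes 10-20 and traffic is steady do the savings justify the engineering work of switching.</p>
<p>The HCL examples that follow use ECS Fargate. The wiring skeleton for EKS (subnets, IAM, target groups) is similar; the differences lie in the resource types at the orchestration layer.</p>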
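<p>For comparison, a minimal sketch of what the EKS side of that skeleton might look like. It assumes the same <code>aws_subnet.private</code> collection used elsewhere in this article; <code>aws_iam_role.eks_cluster</code>, <code>aws_iam_role.eks_node</code>, the resource names, and the node counts are hypothetical and not taken from the original setup.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Sketch only: an EKS cluster wired to the same foundation layers.
# The two IAM roles are assumed to be declared elsewhere (hypothetical names).
resource &#34;aws_eks_cluster&#34; &#34;main&#34; {
  name     = &#34;main-${var.env}&#34;
  role_arn = aws_iam_role.eks_cluster.arn

  vpc_config {
    subnet_ids = [for s in aws_subnet.private : s.id]
  }
}

resource &#34;aws_eks_node_group&#34; &#34;default&#34; {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = &#34;default-${var.env}&#34;
  node_role_arn   = aws_iam_role.eks_node.arn
  subnet_ids      = [for s in aws_subnet.private : s.id]

  # Illustrative sizes, not a recommendation.
  scaling_config {
    desired_size = 2
    min_size     = 1
    max_size     = 4
  }
}
</code></pre></div><p>The subnet and IAM references have the same shape as in the ECS examples; workloads (Deployment, Service) would still be described outside Terraform, through manifests or Helm.</p>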
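<h2 id="task-definition描述容器規格與接線">Task definition: describing container specs and wiring</h2>
<p>A task definition is ECS's declaration of "what one unit of work looks like": which container image to run, how much CPU and memory to give it, which ports to open, which IAM role to use, and where logs go. It is the core resource of compute IaC.</p>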





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_ecs_task_definition&#34; &#34;api&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  family</span>                   <span class="o">=</span> <span class="s2">&#34;api-${var.env}&#34;</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  requires_compatibilities</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;FARGATE&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">  network_mode</span>             <span class="o">=</span> <span class="s2">&#34;awsvpc&#34;</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="n">  cpu</span>                      <span class="o">=</span> <span class="k">var</span><span class="p">.</span><span class="k">task_cpu</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">  memory</span>                   <span class="o">=</span> <span class="k">var</span><span class="p">.</span><span class="k">task_memory</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="n">  execution_role_arn</span>       <span class="o">=</span> <span class="k">aws_iam_role</span><span class="p">.</span><span class="k">ecs_execution</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="n">  task_role_arn</span>            <span class="o">=</span> <span class="k">aws_iam_role</span><span class="p">.</span><span class="k">api_task</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">  container_definitions</span> <span class="o">=</span> <span class="k">jsonencode</span><span class="p">([</span>{
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="n">    name</span>  <span class="o">=</span> <span class="s2">&#34;api&#34;</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="n">    image</span> <span class="o">=</span> <span class="s2">&#34;${var.ecr_repo_url}:${var.image_tag}&#34;</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="n">    portMappings</span> <span class="o">=</span><span class="n"> [{ containerPort</span> <span class="o">=</span><span class="n"> 8080, protocol</span> <span class="o">=</span> <span class="s2">&#34;tcp&#34;</span> }<span class="p">]</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="n">    logConfiguration</span> <span class="o">=</span> {
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="n">      logDriver</span> <span class="o">=</span> <span class="s2">&#34;awslogs&#34;</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="n">      options</span> <span class="o">=</span> {
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="n">        &#34;awslogs-group&#34;</span>         <span class="o">=</span> <span class="k">aws_cloudwatch_log_group</span><span class="p">.</span><span class="k">api</span><span class="p">.</span><span class="k">name</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="n">        &#34;awslogs-region&#34;</span>        <span class="o">=</span> <span class="k">var</span><span class="p">.</span><span class="k">region</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="n">        &#34;awslogs-stream-prefix&#34;</span> <span class="o">=</span> <span class="s2">&#34;api&#34;</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl">      }
</span></span><span class="line"><span class="ln">21</span><span class="cl">    }
</span></span><span class="line"><span class="ln">22</span><span class="cl">  }<span class="p">])</span>
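</span></span><span class="line"><span class="ln">23</span><span class="cl">}</span></span></code></pre></div><p>This definition contains three deliberate design choices:</p>
<p><strong>Image version decoupling</strong>: <code>var.image_tag</code> gets a stable default in the infra <code>tfvars</code> (such as <code>latest</code> or a baseline version), and the deployment pipeline overrides that value to push a new version. An infra apply never changes the image and the deployment pipeline never changes a subnet. The two change at different rates and with different review rigor, and mixing them makes the fast one wait for the slow one. If every release requires editing the infra Terraform code and running apply, the image version is not decoupled from infra. Let the deployment pipeline call <code>aws ecs update-service</code> directly, or register a new task definition revision with the updated image tag, without going through Terraform.</p>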
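<p>One common way to keep Terraform from fighting the pipeline is to have the service ignore drift on its task definition. The fragment below is a sketch under that assumption, not the author's configuration; it shows only the <code>lifecycle</code> block that would be added to the <code>aws_ecs_service</code> resource defined later in this article.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl">resource &#34;aws_ecs_service&#34; &#34;api&#34; {
  # ... other arguments as in the full service definition in this article ...

  lifecycle {
    # The deployment pipeline registers new task definition revisions,
    # so Terraform should not revert the service to the revision it last applied.
    ignore_changes = [task_definition]
  }
}
</code></pre></div><p>The trade-off is that changes to CPU, memory, or roles in the Terraform-managed task definition no longer reach the running service by themselves; the pipeline has to pick them up on its next deploy.</p>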
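<p><strong>Division of labor between the two IAM roles</strong>: <code>execution_role_arn</code> is the identity the ECS agent uses to pull the image and write logs. Its permissions are at the ECS platform level and have nothing to do with business logic. <code>task_role_arn</code> is the identity the application code inside the container obtains at runtime. Its permissions map to business needs, such as reading and writing a particular S3 bucket or sending to a particular SQS queue. Putting both on one role mixes platform permissions with business permissions and violates least privilege (see <a href="/blog/infra/02-identity-credentials/" data-link-title="模組二：身分與憑證地基 — IAM 與 OIDC" data-link-desc="IAM role / policy 設計、最小權限，以及用 OIDC 短期憑證取代長期 access key">Module 2: Identity and credential foundation</a>).</p>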
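<p>The platform side might be sketched as follows. <code>ecs_assume</code> and <code>ecs_execution</code> are the names the surrounding examples already reference; the role name string and the choice of the AWS-managed <code>AmazonECSTaskExecutionRolePolicy</code> are assumptions for illustration.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Trust policy shared by both roles: the ECS tasks service may assume them.
data &#34;aws_iam_policy_document&#34; &#34;ecs_assume&#34; {
  statement {
    actions = [&#34;sts:AssumeRole&#34;]
    principals {
      type        = &#34;Service&#34;
      identifiers = [&#34;ecs-tasks.amazonaws.com&#34;]
    }
  }
}

# Execution role: platform-level identity (assumed name).
resource &#34;aws_iam_role&#34; &#34;ecs_execution&#34; {
  name               = &#34;ecs-execution-${var.env}&#34;
  assume_role_policy = data.aws_iam_policy_document.ecs_assume.json
}

# AWS-managed policy covering image pulls from ECR and writes to CloudWatch Logs.
resource &#34;aws_iam_role_policy_attachment&#34; &#34;ecs_execution&#34; {
  role       = aws_iam_role.ecs_execution.name
  policy_arn = &#34;arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy&#34;
}
</code></pre></div><p>The task role, by contrast, carries only the permissions the business logic needs:</p>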





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_iam_role&#34; &#34;api_task&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  name</span>               <span class="o">=</span> <span class="s2">&#34;api-task-${var.env}&#34;</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  assume_role_policy</span> <span class="o">=</span> <span class="k">data</span><span class="p">.</span><span class="k">aws_iam_policy_document</span><span class="p">.</span><span class="k">ecs_assume</span><span class="p">.</span><span class="k">json</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">}
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_iam_role_policy&#34; &#34;api_task&#34;</span> {
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="n">  role</span>   <span class="o">=</span> <span class="k">aws_iam_role</span><span class="p">.</span><span class="k">api_task</span><span class="p">.</span><span class="k">id</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="n">  policy</span> <span class="o">=</span> <span class="k">data</span><span class="p">.</span><span class="k">aws_iam_policy_document</span><span class="p">.</span><span class="k">api_permissions</span><span class="p">.</span><span class="k">json</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">}
</span></span><span class="line"><span class="ln">10</span><span class="cl">
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="k">data</span> <span class="s2">&#34;aws_iam_policy_document&#34; &#34;api_permissions&#34;</span> {
</span></span><span class="line"><span class="ln">12</span><span class="cl">  <span class="k">statement</span> {
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="n">    actions</span>   <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;s3:GetObject&#34;, &#34;s3:PutObject&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="n">    resources</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;${aws_s3_bucket.uploads.arn}/*&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">  }
</span></span><span class="line"><span class="ln">16</span><span class="cl">  <span class="k">statement</span> {
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="n">    actions</span>   <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;sqs:SendMessage&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="n">    resources</span> <span class="o">=</span> <span class="p">[</span><span class="k">aws_sqs_queue</span><span class="p">.</span><span class="k">notifications</span><span class="p">.</span><span class="k">arn</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl">  }
</span></span><span class="line"><span class="ln">20</span><span class="cl">}</span></span></code></pre></div><p><strong>Log wiring</strong>: <code>logConfiguration</code> routes the container's stdout/stderr to CloudWatch Logs, and the log group name references a resource declared in the same IaC. This is exactly what <a href="/blog/infra/06-observability-logging/" data-link-title="模組六：可觀測性與 log 一併寫進 code" data-link-desc="log group、metric、alarm 跟基礎設施同生命週期管理，出事時追得到查得到">Module 6: Observability and logs</a> means by "monitoring shares the resource's lifecycle."</p>
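<p>The log group that the task definition references could be declared like this. The name pattern and the 30-day retention are assumptions for illustration; retention in particular should follow whatever your compliance requirements say.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl">resource &#34;aws_cloudwatch_log_group&#34; &#34;api&#34; {
  name              = &#34;/ecs/api-${var.env}&#34; # assumed naming convention
  retention_in_days = 30                    # assumed value; without it, logs never expire
}
</code></pre></div><p>Because the task definition reads <code>aws_cloudwatch_log_group.api.name</code>, Terraform creates the log group before the task definition and removes both together on teardown.</p>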
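<h2 id="ecs-service部署模式與網路接線">ECS service: deployment mode and network wiring</h2>
<p>An ECS service controls "how many tasks to run, how to roll out a new version, and which target group to attach to." It manages the running instances of a task definition.</p>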





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_ecs_service&#34; &#34;api&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  name</span>            <span class="o">=</span> <span class="s2">&#34;api-${var.env}&#34;</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  cluster</span>         <span class="o">=</span> <span class="k">aws_ecs_cluster</span><span class="p">.</span><span class="k">main</span><span class="p">.</span><span class="k">id</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">  task_definition</span> <span class="o">=</span> <span class="k">aws_ecs_task_definition</span><span class="p">.</span><span class="k">api</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="n">  desired_count</span>   <span class="o">=</span> <span class="k">var</span><span class="p">.</span><span class="k">api_desired_count</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">  launch_type</span>     <span class="o">=</span> <span class="s2">&#34;FARGATE&#34;</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">  <span class="k">network_configuration</span> {
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="n">    subnets</span>          <span class="o">=</span> <span class="p">[</span><span class="k">for</span> <span class="k">s</span> <span class="k">in</span> <span class="k">aws_subnet</span><span class="p">.</span><span class="k">private</span> <span class="err">:</span> <span class="k">s</span><span class="p">.</span><span class="k">id</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">    security_groups</span>  <span class="o">=</span> <span class="p">[</span><span class="k">aws_security_group</span><span class="p">.</span><span class="k">api</span><span class="p">.</span><span class="k">id</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="n">    assign_public_ip</span> <span class="o">=</span> <span class="kt">false</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">  }
</span></span><span class="line"><span class="ln">13</span><span class="cl">
</span></span><span class="line"><span class="ln">14</span><span class="cl">  <span class="k">load_balancer</span> {
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="n">    target_group_arn</span> <span class="o">=</span> <span class="k">aws_lb_target_group</span><span class="p">.</span><span class="k">api</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="n">    container_name</span>   <span class="o">=</span> <span class="s2">&#34;api&#34;</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="n">    container_port</span>   <span class="o">=</span> <span class="m">8080</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">  }
</span></span><span class="line"><span class="ln">19</span><span class="cl">
</span></span><span class="line"><span class="ln">20</span><span class="cl">  <span class="k">deployment_circuit_breaker</span> {
</span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="n">    enable</span>   <span class="o">=</span> <span class="kt">true</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="n">    rollback</span> <span class="o">=</span> <span class="kt">true</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl">  }
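</span></span><span class="line"><span class="ln">24</span><span class="cl">}</span></span></code></pre></div><p><code>network_configuration</code> places tasks in private subnets and applies a security group, which determines where these containers sit in the network topology (see <a href="/blog/infra/03-network-foundation/" data-link-title="模組三：網路地基 — VPC 與分層" data-link-desc="VPC、public / private subnet 切分、route table、NAT、security group 設計">Module 3: Network foundation</a>). <code>assign_public_ip = false</code> keeps containers from getting a public IP: outbound traffic leaves through NAT and inbound traffic arrives through the ALB.</p>
<p><code>deployment_circuit_breaker</code> is ECS's built-in protection: if tasks keep failing to start while a new version is rolling out (health checks fail, the container crashes), ECS automatically rolls back to the previous version. This behavior must be enabled explicitly; it is off by default. With it off, tasks from the bad version fail to start over and over: the new version never comes up, nothing rolls the deployment back, and the service is stuck in a degraded state.</p>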
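<p>Both the <code>load_balancer</code> block and the circuit breaker depend on a target group whose health check reflects whether a task is really serving. A minimal sketch, assuming a VPC resource named <code>aws_vpc.main</code> and a <code>/healthz</code> endpoint (both hypothetical), with illustrative thresholds:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl">resource &#34;aws_lb_target_group&#34; &#34;api&#34; {
  name        = &#34;api-${var.env}&#34;
  port        = 8080
  protocol    = &#34;HTTP&#34;
  vpc_id      = aws_vpc.main.id # hypothetical VPC resource name
  target_type = &#34;ip&#34;            # tasks in awsvpc network mode register by IP

  health_check {
    path                = &#34;/healthz&#34; # hypothetical endpoint
    healthy_threshold   = 2
    unhealthy_threshold = 3
    interval            = 15
    timeout             = 5
  }
}
</code></pre></div><p>A task that fails this health check is deregistered and replaced; repeated failures during a rollout are one of the signals the circuit breaker counts.</p>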
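<h2 id="連線管理運算到資料庫的接線">Connection management: wiring compute to the database</h2>
<p>Between compute and the database there is a stretch of wiring that is often skipped: connection management. When stateless compute scales horizontally, every task opens its own connections to RDS, which can easily exhaust the database's connection limit. The RDS connection limit is determined by the instance class (for example, about 1000 connections on <code>db.r6g.large</code>), and an ECS service running 50 tasks hits that limit if each task opens 20 connections.</p>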
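<p>That arithmetic (50 tasks x 20 connections = 1000) can be made visible at plan time. The sketch below assumes Terraform 1.5 or later for <code>check</code> blocks; <code>db_max_connections</code> and <code>pool_size_per_task</code> are hypothetical variables introduced for illustration, while <code>var.api_max_count</code> is the auto-scaling ceiling used later in this article.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl">variable &#34;db_max_connections&#34; {
  type    = number
  default = 1000 # approximate limit quoted above for db.r6g.large
}

variable &#34;pool_size_per_task&#34; {
  type    = number
  default = 20 # connections each task's pool may open (assumed)
}

locals {
  # Worst case: every task allowed by auto-scaling opens a full pool.
  worst_case_connections = var.api_max_count * var.pool_size_per_task
}

check &#34;db_connection_budget&#34; {
  assert {
    condition     = local.worst_case_connections &lt;= var.db_max_connections
    error_message = &#34;api_max_count * pool_size_per_task exceeds the database connection limit.&#34;
  }
}
</code></pre></div><p>A failed <code>check</code> produces a warning rather than blocking the apply, so treat it as an early signal rather than a hard gate.</p>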
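<p>When the signal "scaling compute drags the DB down instead" shows up, bring in a connection pool or a managed connection proxy. RDS Proxy sits between compute and RDS, collapsing the large number of short-lived connections from the compute side into a small number of long-lived connections to the database. It can also be written into IaC, with its endpoint exported for compute to reference:</p>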





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_db_proxy&#34; &#34;main&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  name</span>                   <span class="o">=</span> <span class="s2">&#34;api-proxy-${var.env}&#34;</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  engine_family</span>          <span class="o">=</span> <span class="s2">&#34;POSTGRESQL&#34;</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">  role_arn</span>               <span class="o">=</span> <span class="k">aws_iam_role</span><span class="p">.</span><span class="k">rds_proxy</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="n">  vpc_subnet_ids</span>         <span class="o">=</span> <span class="p">[</span><span class="k">for</span> <span class="k">s</span> <span class="k">in</span> <span class="k">aws_subnet</span><span class="p">.</span><span class="k">private</span> <span class="err">:</span> <span class="k">s</span><span class="p">.</span><span class="k">id</span><span class="p">]</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">  vpc_security_group_ids</span> <span class="o">=</span> <span class="p">[</span><span class="k">aws_security_group</span><span class="p">.</span><span class="k">rds_proxy</span><span class="p">.</span><span class="k">id</span><span class="p">]</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">  <span class="k">auth</span> {
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="n">    auth_scheme</span> <span class="o">=</span> <span class="s2">&#34;SECRETS&#34;</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">    secret_arn</span>  <span class="o">=</span> <span class="k">aws_secretsmanager_secret</span><span class="p">.</span><span class="k">db_password</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">  }
</span></span><span class="line"><span class="ln">12</span><span class="cl">}
</span></span><span class="line"><span class="ln">13</span><span class="cl">
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="k">output</span> <span class="s2">&#34;db_endpoint&#34;</span> {
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="n">  value</span> <span class="o">=</span> <span class="k">aws_db_proxy</span><span class="p">.</span><span class="k">main</span><span class="p">.</span><span class="k">endpoint</span>
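</span></span><span class="line"><span class="ln">16</span><span class="cl">}</span></span></code></pre></div><p>The connection string on the compute side points at the proxy endpoint rather than the RDS endpoint. The proxy's security group allows traffic from the compute security group, and traffic from the proxy to RDS is governed by the rule between the proxy's own security group and the RDS security group. The security boundary gains one more layer but becomes clearer.</p>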
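<p>The proxy also needs a target to forward to, and the two security group hops described above can be written as explicit rules. This is a sketch under assumptions: <code>aws_db_instance.main</code> and <code>aws_security_group.rds</code> are hypothetical names for the database and its security group, port 5432 assumes PostgreSQL, and the 75% pool share is illustrative.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl">resource &#34;aws_db_proxy_default_target_group&#34; &#34;main&#34; {
  db_proxy_name = aws_db_proxy.main.name

  connection_pool_config {
    max_connections_percent = 75 # assumed share of DB connections the proxy may use
  }
}

resource &#34;aws_db_proxy_target&#34; &#34;main&#34; {
  db_proxy_name          = aws_db_proxy.main.name
  target_group_name      = aws_db_proxy_default_target_group.main.name
  db_instance_identifier = aws_db_instance.main.identifier # hypothetical DB resource
}

# Hop 1: only the API tasks' security group may reach the proxy.
resource &#34;aws_security_group_rule&#34; &#34;api_to_proxy&#34; {
  type                     = &#34;ingress&#34;
  from_port                = 5432
  to_port                  = 5432
  protocol                 = &#34;tcp&#34;
  security_group_id        = aws_security_group.rds_proxy.id
  source_security_group_id = aws_security_group.api.id
}

# Hop 2: only the proxy's security group may reach the database.
resource &#34;aws_security_group_rule&#34; &#34;proxy_to_rds&#34; {
  type                     = &#34;ingress&#34;
  from_port                = 5432
  to_port                  = 5432
  protocol                 = &#34;tcp&#34;
  security_group_id        = aws_security_group.rds.id # hypothetical RDS security group
  source_security_group_id = aws_security_group.rds_proxy.id
}
</code></pre></div><p>With these two rules in place, the API security group no longer needs any direct rule toward the database.</p>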
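<h2 id="auto-scaling容量隨負載擴縮">Auto-scaling: capacity that follows load</h2>
<p>An ECS service's <code>desired_count</code> is a static starting capacity. To have capacity adjust dynamically with load, add Application Auto Scaling. Its job is to grow more tasks when load rises and shrink back to save money when load falls.</p>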
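<p>Once auto-scaling owns the task count, a later <code>terraform apply</code> would otherwise reset the service to the static <code>desired_count</code>. A common pattern, shown here as a fragment and not as part of the original configuration, is to ignore drift on that one argument:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl">resource &#34;aws_ecs_service&#34; &#34;api&#34; {
  # ... other arguments as in the service definition above ...

  lifecycle {
    # desired_count is only the starting capacity; Application Auto Scaling
    # adjusts it afterwards, so Terraform should not reset it on apply.
    ignore_changes = [desired_count]
  }
}
</code></pre></div>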
<p>The core auto-scaling decision is "which metric triggers scaling." Common metrics fall into two categories:</p>
<table>
  <thead>
      <tr>
          <th>Metric type</th>
          <th>Typical metrics</th>
          <th>When it fits</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Resource utilization</td>
          <td>CPU utilization, memory utilization</td>
          <td>Compute-intensive services where CPU tracks load</td>
      </tr>
      <tr>
          <td>Business throughput</td>
          <td>ALB request count per target</td>
          <td>I/O-intensive services with low CPU but high concurrency</td>
      </tr>
  </tbody>
</table>
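<p>CPU utilization is the most intuitive metric, but it misleads on I/O-intensive services: a task waiting on an external API response shows low CPU yet has no spare capacity for new requests. In that case the ALB's request count per target (the average number of requests each task handles) reflects real load better.</p>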





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_appautoscaling_target&#34; &#34;api&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  max_capacity</span>       <span class="o">=</span> <span class="k">var</span><span class="p">.</span><span class="k">api_max_count</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  min_capacity</span>       <span class="o">=</span> <span class="k">var</span><span class="p">.</span><span class="k">api_min_count</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">  resource_id</span>        <span class="o">=</span> <span class="s2">&#34;service/${aws_ecs_cluster.main.name}/${aws_ecs_service.api.name}&#34;</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="n">  scalable_dimension</span> <span class="o">=</span> <span class="s2">&#34;ecs:service:DesiredCount&#34;</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">  service_namespace</span>  <span class="o">=</span> <span class="s2">&#34;ecs&#34;</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">}
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_appautoscaling_policy&#34; &#34;api_cpu&#34;</span> {
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">  name</span>               <span class="o">=</span> <span class="s2">&#34;api-cpu-${var.env}&#34;</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="n">  policy_type</span>        <span class="o">=</span> <span class="s2">&#34;TargetTrackingScaling&#34;</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="n">  resource_id</span>        <span class="o">=</span> <span class="k">aws_appautoscaling_target</span><span class="p">.</span><span class="k">api</span><span class="p">.</span><span class="k">resource_id</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="n">  scalable_dimension</span> <span class="o">=</span> <span class="k">aws_appautoscaling_target</span><span class="p">.</span><span class="k">api</span><span class="p">.</span><span class="k">scalable_dimension</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="n">  service_namespace</span>  <span class="o">=</span> <span class="k">aws_appautoscaling_target</span><span class="p">.</span><span class="k">api</span><span class="p">.</span><span class="k">service_namespace</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">
</span></span><span class="line"><span class="ln">16</span><span class="cl">  <span class="k">target_tracking_scaling_policy_configuration</span> {
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="n">    target_value</span>       <span class="o">=</span> <span class="m">60</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">    <span class="k">predefined_metric_specification</span> {
</span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="n">      predefined_metric_type</span> <span class="o">=</span> <span class="s2">&#34;ECSServiceAverageCPUUtilization&#34;</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl">    }
</span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="n">    scale_in_cooldown</span>  <span class="o">=</span> <span class="m">300</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="n">    scale_out_cooldown</span> <span class="o">=</span> <span class="m">60</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl">  }
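</span></span><span class="line"><span class="ln">24</span><span class="cl">}</span></span></code></pre></div><p><code>target_value = 60</code> means the target is to hold average CPU at 60%, leaving 40% headroom for bursts. <code>scale_out_cooldown</code> is short (60 seconds) so that scale-out reacts quickly; <code>scale_in_cooldown</code> is long (300 seconds) so that a brief dip in load does not trigger an immediate scale-in, only for the next wave of traffic to force another scale-out.</p>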
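<p>For the I/O-intensive case, the same target-tracking shape can follow the ALB request count instead of CPU. The sketch below assumes a load balancer resource named <code>aws_lb.main</code> (hypothetical) and an illustrative target of 500 requests per target; the real number should come from load testing.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl">resource &#34;aws_appautoscaling_policy&#34; &#34;api_requests&#34; {
  name               = &#34;api-requests-${var.env}&#34;
  policy_type        = &#34;TargetTrackingScaling&#34;
  resource_id        = aws_appautoscaling_target.api.resource_id
  scalable_dimension = aws_appautoscaling_target.api.scalable_dimension
  service_namespace  = aws_appautoscaling_target.api.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value = 500 # assumed request count per target; tune from load tests

    predefined_metric_specification {
      predefined_metric_type = &#34;ALBRequestCountPerTarget&#34;
      # Label format: the load balancer arn_suffix, a slash, then the target group arn_suffix.
      resource_label = &#34;${aws_lb.main.arn_suffix}/${aws_lb_target_group.api.arn_suffix}&#34;
    }

    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}
</code></pre></div><p>Both policies can be attached to the same scalable target; with multiple target-tracking policies, scale-out happens when any one of them calls for it, while scale-in waits until all of them agree.</p>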
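<p>Once auto-scaling is set up, review the scaling activity log regularly to confirm it scales at the right moments. If it has never triggered, there are two possibilities: <code>min_capacity</code> is already above actual demand (wasted resources), or the target value is set too high (scaling comes too late).</p>
<p><code>max_capacity</code> is the cost guardrail: set a ceiling you can accept, so that abnormal traffic (crawlers, attacks, upstream retry storms) cannot push the task count to a bill far beyond expectations. Runtime cost optimization is covered in <a href="/blog/devops/08-cost-management/" data-link-title="模組八：成本管理" data-link-desc="雲端帳單怎麼不失控 — reserved instance、spot instance、right-sizing、成本監控告警">devops Module 8: Cost management</a>.</p>
<p>As scale grows, auto-scaling behaves differently. <a href="/blog/backend/09-performance-capacity/cases/niantic-pokemon-go-fifty-x-surge-gcp/" data-link-title="9.C8 Niantic Pokémon GO：在 GCP 上承載 50 倍突發流量" data-link-desc="Pokémon GO 上線時實際流量達原始預估 50 倍、Google CRE 怎麼即時補容量">When Pokémon GO launched, actual traffic reached 50 times the estimate</a>. A surge like that is not something auto-scaling can plan for in advance, since 50x headroom would make everyday cost unreasonable. Niantic's infra-layer premise was that GKE brought container startup time down to seconds, which made reacting to the surge possible; it also relied on Google CRE to add node capacity in real time. <a href="/blog/backend/09-performance-capacity/cases/zoom-covid-surge-dynamodb/" data-link-title="9.C18 Zoom：COVID 期間從 1000 萬到 3 億 DAU 的 30 倍突發" data-link-desc="Zoom 在 2020 年 COVID 爆發時、日活從 1000 萬衝到 3 億、用 DynamoDB 撐住會議後端">Zoom's 30x surge during COVID</a>, by contrast, was structural growth: after daily actives rose from 10 million to 300 million they did not fall back, and the baseline for capacity planning had to be recalibrated permanently. The shared lesson from both cases: the auto-scaling <code>max_capacity</code> setting should leave room for bursts, but handling extreme surges depends on platform capability (fast startup through containerization) and vendor support (the elasticity of managed services), and is not something IaC configuration can solve on its own.</p>
<p>Multi-cluster governance is another dimension of scale. <a href="/blog/backend/09-performance-capacity/cases/riot-games-eks-multi-cluster/" data-link-title="9.C12 Riot Games：246 個 EKS cluster 的多遊戲多地區治理" data-link-desc="Riot Games 從 Mesos 遷移到 EKS、用 246 個 cluster 跨遊戲跨地區治理、年省 1000 萬美金">Riot Games runs 246 EKS clusters across multiple games and regions</a>, one independent cluster per game (so games cannot affect each other), with Terraform for IaC and Karpenter for node lifecycle, saving USD 10 million a year. The infra-layer lesson: when the number of compute clusters grows from single digits to dozens or even hundreds, the cluster itself becomes a resource that needs IaC governance, and cluster creation, version upgrades, and security baselines all have to be standardized. <a href="/blog/backend/05-deployment-platform/cases/conde-nast-platform-modernization-eks/" data-link-title="5.C2 Condé Nast：EKS 平台整併與標準化" data-link-desc="多地區異質 Kubernetes 平台整併為統一控制面的案例。">Condé Nast's EKS platform consolidation</a> confirms the same pattern: teams each maintaining heterogeneous K8s clusters leads to inconsistent security baselines, and after consolidating onto a unified platform they replaced kube2iam (which carries a race condition risk) with IRSA (OIDC federation), eliminating node-level credential sharing.</p>
<h2 id="跨分類引用">Cross-category references</h2>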
<ul>
<li>→ <a href="/blog/infra/02-identity-credentials/" data-link-title="模組二：身分與憑證地基 — IAM 與 OIDC" data-link-desc="IAM role / policy 設計、最小權限，以及用 OIDC 短期憑證取代長期 access key">Module 2: Identity and credential foundation</a>: least-privilege design for the execution role and the task role</li>
<li>→ <a href="/blog/infra/03-network-foundation/" data-link-title="模組三：網路地基 — VPC 與分層" data-link-desc="VPC、public / private subnet 切分、route table、NAT、security group 設計">Module 3: Network foundation</a>: compute in private subnets, security group wiring</li>
<li>→ <a href="/blog/infra/06-observability-logging/" data-link-title="模組六：可觀測性與 log 一併寫進 code" data-link-desc="log group、metric、alarm 跟基礎設施同生命週期管理，出事時追得到查得到">Module 6: Observability and logs</a>: the log group shares the task definition's lifecycle</li>
<li>→ <a href="/blog/devops/08-cost-management/" data-link-title="模組八：成本管理" data-link-desc="雲端帳單怎麼不失控 — reserved instance、spot instance、right-sizing、成本監控告警">devops Module 8: Cost management</a>: auto-scaling cost guardrails and mixing spot / Fargate Spot</li>
</ul>
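]]></content:encoded></item><item><title>Storage in IaC: security and lifecycle of S3 buckets</title><link>https://tarrragon.github.io/blog/infra/05-core-services/storage-s3/</link><pubDate>Fri, 26 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/infra/05-core-services/storage-s3/</guid><description>&lt;p>An S3 bucket resource describes the existence of object storage, its name, encryption settings, versioning, and access policy. The bucket itself has no state problem in the cost-of-rebuild sense; the difficulty is in what it holds. An empty bucket can be rebuilt at any time, while a bucket holding production data, like RDS, must not be destroyed casually. Writing the security settings and lifecycle rules into IaC turns these lines of defense into version-controlled, reviewable code instead of implicit settings scattered across the Console.&lt;/p>
&lt;h2 id="bucket-的四道安全防線">The four security lines of defense for a bucket&lt;/h2>
&lt;p>An S3 bucket needs at least four separate resources in IaC, each corresponding to one line of defense. Terraform splitting them into separate resources is a design choice: each line of defense can be reviewed, adjusted, and tracked through its change history on its own.&lt;/p>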





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-hcl" data-lang="hcl">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="k">resource&lt;/span> &lt;span class="s2">&amp;#34;aws_s3_bucket&amp;#34; &amp;#34;assets&amp;#34;&lt;/span> {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="n"> bucket&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;acme-${var.env}-assets&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="n"> tags&lt;/span> &lt;span class="o">=&lt;/span>&lt;span class="n"> { service&lt;/span> &lt;span class="o">=&lt;/span>&lt;span class="n"> &amp;#34;cdn-origin&amp;#34;, env&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="k">var&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">env&lt;/span> }
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">}
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="k">resource&lt;/span> &lt;span class="s2">&amp;#34;aws_s3_bucket_versioning&amp;#34; &amp;#34;assets&amp;#34;&lt;/span> {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">&lt;span class="n"> bucket&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="k">aws_s3_bucket&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">assets&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">id&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="n"> versioning_configuration { status&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;Enabled&amp;#34;&lt;/span> }
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">}
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">&lt;span class="k">resource&lt;/span> &lt;span class="s2">&amp;#34;aws_s3_bucket_server_side_encryption_configuration&amp;#34; &amp;#34;assets&amp;#34;&lt;/span> {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl">&lt;span class="n"> bucket&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="k">aws_s3_bucket&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">assets&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">id&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">14&lt;/span>&lt;span class="cl"> &lt;span class="k">rule&lt;/span> {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">15&lt;/span>&lt;span class="cl"> &lt;span class="k">apply_server_side_encryption_by_default&lt;/span> {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">16&lt;/span>&lt;span class="cl">&lt;span class="n"> sse_algorithm&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;aws:kms&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">17&lt;/span>&lt;span class="cl"> }
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">18&lt;/span>&lt;span class="cl"> }
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">19&lt;/span>&lt;span class="cl">}
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">20&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">21&lt;/span>&lt;span class="cl">&lt;span class="k">resource&lt;/span> &lt;span class="s2">&amp;#34;aws_s3_bucket_public_access_block&amp;#34; &amp;#34;assets&amp;#34;&lt;/span> {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">22&lt;/span>&lt;span class="cl">&lt;span class="n"> bucket&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="k">aws_s3_bucket&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">assets&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">id&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">23&lt;/span>&lt;span class="cl">&lt;span class="n"> block_public_acls&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kt">true&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">24&lt;/span>&lt;span class="cl">&lt;span class="n"> block_public_policy&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kt">true&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">25&lt;/span>&lt;span class="cl">&lt;span class="n"> ignore_public_acls&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kt">true&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">26&lt;/span>&lt;span class="cl">&lt;span class="n"> restrict_public_buckets&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kt">true&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">27&lt;/span>&lt;span class="cl">}&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="versioning">versioning&lt;/h3>
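&lt;p>&lt;code>versioning&lt;/code> keeps the previous version every time an object is overwritten. After an accidental overwrite you can roll back to the last good version from the version history; after an accidental delete the object is only hidden behind a delete marker and the previous version still exists. For a bucket holding production data this defense is mandatory: without versioning, a single mistake means permanent data loss.&lt;/p>
&lt;p>Once versioning is enabled, noncurrent versions accumulate and take up storage. Pairing it with a lifecycle rule that sets &lt;code>noncurrent_version_expiration&lt;/code> controls how many days old versions are kept, so storage cost does not grow without bound. The number of days is a trade-off between recoverability and storage cost: 30 days usually covers the gap between noticing a problem and rolling back, while data under compliance requirements is kept as long as the rules demand.&lt;/p>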
&lt;h3 id="server-side-encryption">server-side encryption&lt;/h3>
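&lt;p>&lt;code>server_side_encryption&lt;/code> ensures objects are encrypted at rest in S3. &lt;code>aws:kms&lt;/code> uses a key managed in KMS, and encryption is transparent to the application: objects are encrypted on write and decrypted on read with no application code changes. The reason to choose &lt;code>aws:kms&lt;/code> over &lt;code>AES256&lt;/code> (SSE-S3) is access-control granularity: a customer managed KMS key has its own key policy, so "who can decrypt" is managed separately from "who can read the bucket", which suits cross-account or cross-team setups. That separation needs a customer managed key set through &lt;code>kms_master_key_id&lt;/code>; without it S3 uses the AWS managed &lt;code>aws/s3&lt;/code> key, whose key policy cannot be edited.&lt;/p>
&lt;p>When a KMS-encrypted bucket is read from another account, the reading account needs &lt;code>kms:Decrypt&lt;/code> on the KMS key in addition to read permission on the bucket. Without it the request fails with &lt;code>AccessDenied&lt;/code>, and because the error usually reads like an S3 permission problem rather than a KMS one, troubleshooting tends to start in the wrong place.&lt;/p>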
&lt;h3 id="public-access-block">public access block&lt;/h3>
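&lt;p>Setting all four booleans of &lt;code>public_access_block&lt;/code> to true closes off public exposure at the bucket level. Even if someone later adds a public bucket policy or ACL by mistake, the block still stops it. It is a backstop: it guards against misconfiguration, not against normal operations.&lt;/p>
&lt;p>Static analysis tools (checkov / tfsec) flag buckets that lack a public access block. This is a typical catch for the automated guardrails in &lt;a href="https://tarrragon.github.io/blog/infra/07-infra-as-pr/" data-link-title="模組七：infra 走 PR 流程與自動化護欄" data-link-desc="infra 變更走 PR → plan → review diff → 合併 → apply，配 fmt / validate / tflint / checkov / tfsec 與 Atlantis 自動化，讓基礎設施可審查、可回溯、可交接">Module 7: infra through the PR flow&lt;/a>: a bucket that omits it is stopped at the PR stage instead of being discovered after it reaches production.&lt;/p></description><content:encoded><![CDATA[<p>An S3 bucket resource describes the object store's existence, name, encryption settings, versioning, and access policy. The bucket itself carries no state that is costly to rebuild; the difficulty lies in what it holds. An empty bucket can be recreated at any time, but a bucket holding production data, like RDS, must not be destroyed casually. Writing the security settings and lifecycle rules into IaC turns these defenses into version-controlled, reviewable code rather than implicit settings scattered across the Console.</p>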
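<h2 id="bucket-的四道安全防線">The four lines of defense for a bucket</h2>
<p>In IaC, an S3 bucket needs at least four separate resources, each corresponding to one line of defense. Terraform splits them into separate resources by design: each defense can be reviewed, adjusted, and tracked through its change history independently.</p>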





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_s3_bucket&#34; &#34;assets&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  bucket</span> <span class="o">=</span> <span class="s2">&#34;acme-${var.env}-assets&#34;</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">  tags</span> <span class="o">=</span><span class="n"> { service</span> <span class="o">=</span><span class="n"> &#34;cdn-origin&#34;, env</span> <span class="o">=</span> <span class="k">var</span><span class="p">.</span><span class="k">env</span> }
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">}
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_s3_bucket_versioning&#34; &#34;assets&#34;</span> {
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="n">  bucket</span> <span class="o">=</span> <span class="k">aws_s3_bucket</span><span class="p">.</span><span class="k">assets</span><span class="p">.</span><span class="k">id</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="n">  versioning_configuration { status</span> <span class="o">=</span> <span class="s2">&#34;Enabled&#34;</span> }
</span></span><span class="line"><span class="ln">10</span><span class="cl">}
</span></span><span class="line"><span class="ln">11</span><span class="cl">
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_s3_bucket_server_side_encryption_configuration&#34; &#34;assets&#34;</span> {
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="n">  bucket</span> <span class="o">=</span> <span class="k">aws_s3_bucket</span><span class="p">.</span><span class="k">assets</span><span class="p">.</span><span class="k">id</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">  <span class="k">rule</span> {
</span></span><span class="line"><span class="ln">15</span><span class="cl">    <span class="k">apply_server_side_encryption_by_default</span> {
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="n">      sse_algorithm</span> <span class="o">=</span> <span class="s2">&#34;aws:kms&#34;</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">    }
</span></span><span class="line"><span class="ln">18</span><span class="cl">  }
</span></span><span class="line"><span class="ln">19</span><span class="cl">}
</span></span><span class="line"><span class="ln">20</span><span class="cl">
</span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_s3_bucket_public_access_block&#34; &#34;assets&#34;</span> {
</span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="n">  bucket</span>                  <span class="o">=</span> <span class="k">aws_s3_bucket</span><span class="p">.</span><span class="k">assets</span><span class="p">.</span><span class="k">id</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="n">  block_public_acls</span>       <span class="o">=</span> <span class="kt">true</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="n">  block_public_policy</span>     <span class="o">=</span> <span class="kt">true</span>
</span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="n">  ignore_public_acls</span>      <span class="o">=</span> <span class="kt">true</span>
</span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="n">  restrict_public_buckets</span> <span class="o">=</span> <span class="kt">true</span>
</span></span><span class="line"><span class="ln">27</span><span class="cl">}</span></span></code></pre></div><h3 id="versioning">versioning</h3>
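<p><code>versioning</code> keeps the previous version every time an object is overwritten. After an accidental overwrite you can roll back to the last good version from the version history; after an accidental delete the object is only hidden behind a delete marker and the previous version still exists. For a bucket holding production data this defense is mandatory: without versioning, a single mistake means permanent data loss.</p>
<p>Once versioning is enabled, noncurrent versions accumulate and take up storage. Pairing it with a lifecycle rule that sets <code>noncurrent_version_expiration</code> controls how many days old versions are kept, so storage cost does not grow without bound. The number of days is a trade-off between recoverability and storage cost: 30 days usually covers the gap between noticing a problem and rolling back, while data under compliance requirements is kept as long as the rules demand.</p>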
<h3 id="server-side-encryption">server-side encryption</h3>
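<p><code>server_side_encryption</code> ensures objects are encrypted at rest in S3. <code>aws:kms</code> uses a key managed in KMS, and encryption is transparent to the application: objects are encrypted on write and decrypted on read with no application code changes. The reason to choose <code>aws:kms</code> over <code>AES256</code> (SSE-S3) is access-control granularity: a customer managed KMS key has its own key policy, so "who can decrypt" is managed separately from "who can read the bucket", which suits cross-account or cross-team setups. That separation needs a customer managed key set through <code>kms_master_key_id</code>; without it S3 uses the AWS managed <code>aws/s3</code> key, whose key policy cannot be edited.</p>
<p>When a KMS-encrypted bucket is read from another account, the reading account needs <code>kms:Decrypt</code> on the KMS key in addition to read permission on the bucket. Without it the request fails with <code>AccessDenied</code>, and because the error usually reads like an S3 permission problem rather than a KMS one, troubleshooting tends to start in the wrong place.</p>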
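<p>The split between "who can decrypt" and "who can read the bucket" can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's configuration: the key <code>aws_kms_key.assets</code> is a hypothetical name, and <code>arn:aws:iam::111222333444:role/data-reader</code> stands in for an example reader role in another account.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Assumption: a customer managed key dedicated to this bucket (hypothetical name).
data &#34;aws_caller_identity&#34; &#34;current&#34; {}

resource &#34;aws_kms_key&#34; &#34;assets&#34; {
  description         = &#34;Encrypts objects in the assets bucket&#34;
  enable_key_rotation = true

  policy = jsonencode({
    Version = &#34;2012-10-17&#34;
    Statement = [
      {
        # Keeps the owning account able to administer the key through IAM.
        Sid       = &#34;AllowOwnerAccountAdmin&#34;
        Effect    = &#34;Allow&#34;
        Principal = { AWS = &#34;arn:aws:iam::${data.aws_caller_identity.current.account_id}:root&#34; }
        Action    = &#34;kms:*&#34;
        Resource  = &#34;*&#34;
      },
      {
        # Key-side half of the cross-account grant: decrypt only.
        Sid       = &#34;AllowCrossAccountDecrypt&#34;
        Effect    = &#34;Allow&#34;
        Principal = { AWS = &#34;arn:aws:iam::111222333444:role/data-reader&#34; }
        Action    = [&#34;kms:Decrypt&#34;, &#34;kms:DescribeKey&#34;]
        Resource  = &#34;*&#34;
      }
    ]
  })
}

# Variant of the encryption resource shown earlier, now pinned to that key.
resource &#34;aws_s3_bucket_server_side_encryption_configuration&#34; &#34;assets&#34; {
  bucket = aws_s3_bucket.assets.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = &#34;aws:kms&#34;
      kms_master_key_id = aws_kms_key.assets.arn
    }
    bucket_key_enabled = true
  }
}
</code></pre></div>
<p>The key policy is only the resource side of the grant. The reader role still needs an IAM policy in its own account that allows <code>kms:Decrypt</code> on this key's ARN, just as it needs one for the bucket.</p>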
<h3 id="public-access-block">public access block</h3>
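<p>Setting all four booleans of <code>public_access_block</code> to true closes off public exposure at the bucket level. Even if someone later adds a public bucket policy or ACL by mistake, the block still stops it. It is a backstop: it guards against misconfiguration, not against normal operations.</p>
<p>Static analysis tools (checkov / tfsec) flag buckets that lack a public access block. This is a typical catch for the automated guardrails in <a href="/blog/infra/07-infra-as-pr/" data-link-title="模組七：infra 走 PR 流程與自動化護欄" data-link-desc="infra 變更走 PR → plan → review diff → 合併 → apply，配 fmt / validate / tflint / checkov / tfsec 與 Atlantis 自動化，讓基礎設施可審查、可回溯、可交接">Module 7: infra through the PR flow</a>: a bucket that omits it is stopped at the PR stage instead of being discovered after it reaches production.</p>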
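<p>The same backstop also exists at account level. The sketch below is an optional complement, not part of the original setup: it assumes no bucket in the account is meant to be public, and the resource name <code>this</code> is arbitrary.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Account-wide backstop: applies to every bucket in the account,
# including buckets created outside Terraform.
# Assumption: nothing in this account needs to be publicly readable.
resource &#34;aws_s3_account_public_access_block&#34; &#34;this&#34; {
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
</code></pre></div>
<p>With the account-level block in place, a bucket that forgets its own <code>aws_s3_bucket_public_access_block</code> is still covered. The per-bucket resource remains worth keeping, so the intent stays visible next to each bucket.</p>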
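<p>Periodically sweep the public-access state of every bucket in the account with the CLI. The script lists each bucket whose public access block has at least one flag set to false; every hit should come with an answer to "is this exposure intentional, and why". Buckets with no public access block configuration at all make the lookup fail, and because the error output is discarded they are not listed, so check those separately:</p>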





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">aws s3api list-buckets --query <span class="s1">&#39;Buckets[].Name&#39;</span> --output text <span class="p">|</span> tr <span class="s1">&#39;\t&#39;</span> <span class="s1">&#39;\n&#39;</span> <span class="p">|</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>  <span class="k">while</span> <span class="nb">read</span> b<span class="p">;</span> <span class="k">do</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">    <span class="nv">status</span><span class="o">=</span><span class="k">$(</span>aws s3api get-public-access-block --bucket <span class="s2">&#34;</span><span class="nv">$b</span><span class="s2">&#34;</span> 2&gt;/dev/null <span class="p">|</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="se"></span>      jq -r <span class="s1">&#39;.PublicAccessBlockConfiguration | to_entries[] | select(.value==false) | .key&#39;</span><span class="k">)</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">    <span class="o">[</span> -n <span class="s2">&#34;</span><span class="nv">$status</span><span class="s2">&#34;</span> <span class="o">]</span> <span class="o">&amp;&amp;</span> <span class="nb">echo</span> <span class="s2">&#34;</span><span class="nv">$b</span><span class="s2">: </span><span class="nv">$status</span><span class="s2">&#34;</span>
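</span></span><span class="line"><span class="ln">6</span><span class="cl">  <span class="k">done</span></span></span></code></pre></div><h2 id="生命週期規則">Lifecycle rules</h2>
<p>Storage cost grows with the volume of data stored and how long it is kept. Lifecycle rules let IaC describe "after how long a class of objects moves to a cheaper storage class, and after how long it is deleted", which turns cost control into a version-controlled setting.</p>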





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_s3_bucket_lifecycle_configuration&#34; &#34;assets&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  bucket</span> <span class="o">=</span> <span class="k">aws_s3_bucket</span><span class="p">.</span><span class="k">assets</span><span class="p">.</span><span class="k">id</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">  <span class="k">rule</span> {
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="n">    id</span>     <span class="o">=</span> <span class="s2">&#34;archive-old-logs&#34;</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">    status</span> <span class="o">=</span> <span class="s2">&#34;Enabled&#34;</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="n">    filter { prefix</span> <span class="o">=</span> <span class="s2">&#34;logs/&#34;</span> }
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">    <span class="k">transition</span> {
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">      days</span>          <span class="o">=</span> <span class="m">30</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="n">      storage_class</span> <span class="o">=</span> <span class="s2">&#34;GLACIER_IR&#34;</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">    }
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="n">    expiration { days</span> <span class="o">=</span> <span class="m">365</span> }
</span></span><span class="line"><span class="ln">14</span><span class="cl">  }
</span></span><span class="line"><span class="ln">15</span><span class="cl">
</span></span><span class="line"><span class="ln">16</span><span class="cl">  <span class="k">rule</span> {
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="n">    id</span>     <span class="o">=</span> <span class="s2">&#34;cleanup-old-versions&#34;</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="n">    status</span> <span class="o">=</span> <span class="s2">&#34;Enabled&#34;</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl">    <span class="k">filter</span> {}
</span></span><span class="line"><span class="ln">20</span><span class="cl">
</span></span><span class="line"><span class="ln">21</span><span class="cl">    <span class="k">noncurrent_version_expiration</span> {
</span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="n">      noncurrent_days</span> <span class="o">=</span> <span class="m">30</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl">    }
</span></span><span class="line"><span class="ln">24</span><span class="cl">  }
</span></span><span class="line"><span class="ln">25</span><span class="cl">}</span></span></code></pre></div><h3 id="儲存層的取捨">Storage class trade-offs</h3>
<p>S3 offers several storage classes, each trading access latency against storage price:</p>
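<table>
  <thead>
      <tr>
          <th>Storage class</th>
          <th>Access latency</th>
          <th>Typical use</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Standard</td>
          <td>Milliseconds</td>
          <td>Hot data that is read frequently</td>
      </tr>
      <tr>
          <td>Standard-IA</td>
          <td>Milliseconds</td>
          <td>Rarely accessed data that must be readable immediately when needed</td>
      </tr>
      <tr>
          <td>Glacier Instant Retrieval</td>
          <td>Milliseconds</td>
          <td>Archives accessed about once a quarter</td>
      </tr>
      <tr>
          <td>Glacier Flexible Retrieval</td>
          <td>Minutes to hours</td>
          <td>Audit retention, yearly lookups</td>
      </tr>
      <tr>
          <td>Glacier Deep Archive</td>
          <td>Around 12 hours</td>
          <td>Regulatory retention, almost never accessed</td>
      </tr>
  </tbody>
</table>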
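<p>The day counts in <code>transition</code> rules should be derived from business needs: logs need immediate reads while they are used for debugging (Standard); after 30 days they are rarely opened except for incident reviews (Glacier Instant Retrieval or Standard-IA); after a year they can be expired or moved to a deeper archive tier. With these rules in IaC, "why do logs live for only one year" becomes a decision that can be discussed in a PR, not a setting someone clicked in the Console that the rest of the team may not know about.</p>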
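<p>Deriving the day counts from a retention requirement might look like the sketch below. Everything in it is an assumption for illustration: a separate, hypothetical <code>audit_logs</code> bucket, a seven-year retention period, and the chosen thresholds. It is kept in its own bucket because a bucket holds only one lifecycle configuration.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Hypothetical bucket used only for this example.
resource &#34;aws_s3_bucket&#34; &#34;audit_logs&#34; {
  bucket = &#34;acme-${var.env}-audit-logs&#34;
}

resource &#34;aws_s3_bucket_lifecycle_configuration&#34; &#34;audit_logs&#34; {
  bucket = aws_s3_bucket.audit_logs.id

  rule {
    id     = &#34;tier-then-expire&#34;
    status = &#34;Enabled&#34;
    filter {}

    # Debugging window is over: keep millisecond reads, lower the price.
    transition {
      days          = 30
      storage_class = &#34;STANDARD_IA&#34;
    }

    # Only opened for incident reviews or audits from here on.
    transition {
      days          = 90
      storage_class = &#34;GLACIER&#34;
    }

    # Kept purely for retention.
    transition {
      days          = 365
      storage_class = &#34;DEEP_ARCHIVE&#34;
    }

    # Assumed seven-year retention requirement.
    expiration { days = 2555 }
  }
}
</code></pre></div>
<p>Each threshold maps to one sentence of justification, which is what makes the rule reviewable: a reviewer can challenge "90 days" without having to reverse-engineer it from the Console.</p>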
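<h2 id="bucket-policy-與跨帳號存取">Bucket policy and cross-account access</h2>
<p>A bucket policy describes who may perform which operations on the bucket; it is access control at the bucket level. It differs from an IAM policy in where it is attached: an IAM policy is attached to an identity and defines "what this identity may do", while a bucket policy is attached to the resource and defines "who this bucket lets in". Both are evaluated on every request. Within a single account, an allow on either side is enough; across accounts, both the identity side and the resource side must allow the request. An explicit deny on either side always wins.</p>
<p>Cross-account access is the most common use of a bucket policy. For an IAM role in one account to read an S3 bucket in another, both ends must grant access: the bucket policy allows that role's ARN, and an IAM policy in the role's own account allows the operations on this bucket.</p>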





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_s3_bucket_policy&#34; &#34;cross_account_read&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  bucket</span> <span class="o">=</span> <span class="k">aws_s3_bucket</span><span class="p">.</span><span class="k">assets</span><span class="p">.</span><span class="k">id</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">  policy</span> <span class="o">=</span> <span class="k">jsonencode</span><span class="p">(</span>{
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="n">    Version</span> <span class="o">=</span> <span class="s2">&#34;2012-10-17&#34;</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">    Statement</span> <span class="o">=</span> <span class="p">[</span>{
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="n">      Sid</span>       <span class="o">=</span> <span class="s2">&#34;AllowCrossAccountRead&#34;</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="n">      Effect</span>    <span class="o">=</span> <span class="s2">&#34;Allow&#34;</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="n">      Principal</span> <span class="o">=</span><span class="n"> { AWS</span> <span class="o">=</span> <span class="s2">&#34;arn:aws:iam::111222333444:role/data-reader&#34;</span> }
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">      Action</span>    <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;s3:GetObject&#34;, &#34;s3:ListBucket&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="n">      Resource</span> <span class="o">=</span> <span class="p">[</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">        <span class="k">aws_s3_bucket</span><span class="p">.</span><span class="k">assets</span><span class="p">.</span><span class="k">arn</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">        <span class="s2">&#34;${aws_s3_bucket.assets.arn}/*&#34;</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">      <span class="p">]</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">    }<span class="p">]</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">  }<span class="p">)</span>
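</span></span><span class="line"><span class="ln">17</span><span class="cl">}</span></span></code></pre></div><p>A common bucket policy pitfall is <code>Principal: &quot;*&quot;</code>, which lets anyone in. It is the same class of risk as <code>0.0.0.0/0</code> in a security group. Apart from deliberately public content such as a static website, there is almost no legitimate reason to set Principal to a wildcard; even CloudFront Origin Access Control (OAC) grants access to the <code>cloudfront.amazonaws.com</code> service principal, scoped by an <code>AWS:SourceArn</code> condition, rather than to <code>*</code>. checkov's <code>CKV_AWS_70</code> rule targets exactly this.</p>
<p>The benefit of keeping the bucket policy in IaC is that every grant has a PR record: who added a cross-account grant, when, why, and whether a reviewer approved it. A bucket policy edited in the Console has none of that, so when an unfamiliar grant turns up one day, all you can do is dig through CloudTrail and guess when it was added.</p>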
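<p>The bucket policy shown earlier is only the resource side of the cross-account grant. The identity side lives in the reader account's own Terraform and might look like the sketch below. The assumptions: the role is named <code>data-reader</code> as in the example ARN, <code>var.env</code> resolves to <code>prod</code> so that the bucket is <code>acme-prod-assets</code>, and the policy name is hypothetical.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Applied in the reader account (111222333444), not in the bucket-owning account.
resource &#34;aws_iam_role_policy&#34; &#34;data_reader_s3&#34; {
  name = &#34;read-acme-assets&#34; # hypothetical policy name
  role = &#34;data-reader&#34;

  policy = jsonencode({
    Version = &#34;2012-10-17&#34;
    Statement = [
      {
        # ListBucket is a bucket-level action, so it targets the bucket ARN.
        Sid      = &#34;ListAssetsBucket&#34;
        Effect   = &#34;Allow&#34;
        Action   = &#34;s3:ListBucket&#34;
        Resource = &#34;arn:aws:s3:::acme-prod-assets&#34;
      },
      {
        # GetObject is an object-level action, so it targets the object ARNs.
        Sid      = &#34;ReadAssetsObjects&#34;
        Effect   = &#34;Allow&#34;
        Action   = &#34;s3:GetObject&#34;
        Resource = &#34;arn:aws:s3:::acme-prod-assets/*&#34;
      }
    ]
  })
}
</code></pre></div>
<p>Because the two halves live in different accounts and usually in different repositories, it helps to link the two PRs to each other so that a reviewer sees the whole grant.</p>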
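<h2 id="事件通知">Event notifications</h2>
<p>S3 event notifications let a bucket trigger downstream processing automatically when objects are created, deleted, or restored: thumbnailing after a write, virus scanning after an upload, alerting after a delete. Writing these triggers into IaC makes "what does this bucket trigger" a fact that can be looked up, not implicit wiring scattered across the Console.</p>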





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_s3_bucket_notification&#34; &#34;assets&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  bucket</span> <span class="o">=</span> <span class="k">aws_s3_bucket</span><span class="p">.</span><span class="k">assets</span><span class="p">.</span><span class="k">id</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">  <span class="k">lambda_function</span> {
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="n">    lambda_function_arn</span> <span class="o">=</span> <span class="k">aws_lambda_function</span><span class="p">.</span><span class="k">thumbnail</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">    events</span>              <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;s3:ObjectCreated:*&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="n">    filter_prefix</span>       <span class="o">=</span> <span class="s2">&#34;uploads/&#34;</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="n">    filter_suffix</span>       <span class="o">=</span> <span class="s2">&#34;.jpg&#34;</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">  }
</span></span><span class="line"><span class="ln">10</span><span class="cl">}
</span></span><span class="line"><span class="ln">11</span><span class="cl">
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_lambda_permission&#34; &#34;allow_s3&#34;</span> {
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="n">  statement_id</span>  <span class="o">=</span> <span class="s2">&#34;AllowS3Invoke&#34;</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="n">  action</span>        <span class="o">=</span> <span class="s2">&#34;lambda:InvokeFunction&#34;</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="n">  function_name</span> <span class="o">=</span> <span class="k">aws_lambda_function</span><span class="p">.</span><span class="k">thumbnail</span><span class="p">.</span><span class="k">function_name</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="n">  principal</span>     <span class="o">=</span> <span class="s2">&#34;s3.amazonaws.com&#34;</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="n">  source_arn</span>    <span class="o">=</span> <span class="k">aws_s3_bucket</span><span class="p">.</span><span class="k">assets</span><span class="p">.</span><span class="k">arn</span>
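</span></span><span class="line"><span class="ln">18</span><span class="cl">}</span></span></code></pre></div><p>Two parts of an event notification setup are easy to overlook. The first is permissions: for S3 to invoke the Lambda, the function's resource-based policy must allow S3 to call it (the <code>aws_lambda_permission</code> above). S3 validates the destination when the notification is saved, so if the permission is missing or has not been created yet, the apply typically fails with <code>Unable to validate the following destination configurations</code>. Terraform infers no ordering between the two resources, so adding <code>depends_on = [aws_lambda_permission.allow_s3]</code> to the notification avoids that race. The second is the filter: a notification with no prefix / suffix fires for every object in the bucket that matches the subscribed event types, which can produce far more invocations than expected. Use the filter to narrow the trigger to the paths and file types that actually need processing.</p>
<p>Event notifications can also be routed to SQS or SNS, which suits asynchronous queue processing or fan-out to several consumers. The choice depends on how the downstream side consumes events: Lambda fits lightweight, immediate processing (responses within milliseconds); SQS fits batch processing that needs backpressure and retries; SNS fits cases where the same event has to reach several services at once.</p>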
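<p>Routing part of the traffic to SQS could be sketched as below. The queue name, the <code>imports/</code> prefix, and the policy document name are assumptions made for illustration. The notification resource is restated with both targets because a bucket keeps a single notification configuration.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Hypothetical queue for objects that need batch processing.
resource &#34;aws_sqs_queue&#34; &#34;uploads&#34; {
  name = &#34;assets-uploads-${var.env}&#34;
}

# Queue-side permission: only this bucket may send messages.
data &#34;aws_iam_policy_document&#34; &#34;allow_s3_send&#34; {
  statement {
    effect    = &#34;Allow&#34;
    actions   = [&#34;sqs:SendMessage&#34;]
    resources = [aws_sqs_queue.uploads.arn]

    principals {
      type        = &#34;Service&#34;
      identifiers = [&#34;s3.amazonaws.com&#34;]
    }

    condition {
      test     = &#34;ArnEquals&#34;
      variable = &#34;aws:SourceArn&#34;
      values   = [aws_s3_bucket.assets.arn]
    }
  }
}

resource &#34;aws_sqs_queue_policy&#34; &#34;uploads&#34; {
  queue_url = aws_sqs_queue.uploads.id
  policy    = data.aws_iam_policy_document.allow_s3_send.json
}

# One notification resource per bucket: the Lambda target from the example
# above and the new queue target live side by side.
resource &#34;aws_s3_bucket_notification&#34; &#34;assets&#34; {
  bucket = aws_s3_bucket.assets.id

  lambda_function {
    lambda_function_arn = aws_lambda_function.thumbnail.arn
    events              = [&#34;s3:ObjectCreated:*&#34;]
    filter_prefix       = &#34;uploads/&#34;
    filter_suffix       = &#34;.jpg&#34;
  }

  queue {
    queue_arn     = aws_sqs_queue.uploads.arn
    events        = [&#34;s3:ObjectCreated:*&#34;]
    filter_prefix = &#34;imports/&#34; # assumed path, chosen not to overlap uploads/
  }

  # Both destinations must already accept S3 before the notification is saved.
  depends_on = [
    aws_lambda_permission.allow_s3,
    aws_sqs_queue_policy.uploads,
  ]
}
</code></pre></div>
<p>The two prefixes are kept disjoint on purpose: S3 rejects notification configurations in which the same event type has overlapping filters across destinations.</p>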
<h2 id="跨分類引用">Cross-references</h2>
<ul>
<li>→ <a href="/blog/infra/07-infra-as-pr/" data-link-title="模組七：infra 走 PR 流程與自動化護欄" data-link-desc="infra 變更走 PR → plan → review diff → 合併 → apply，配 fmt / validate / tflint / checkov / tfsec 與 Atlantis 自動化，讓基礎設施可審查、可回溯、可交接">Module 7: infra through the PR flow</a>: checkov / tfsec catch buckets that lack a public access block or encryption</li>
<li>→ <a href="/blog/infra/08-governance-habits/" data-link-title="模組八：治理好習慣 — 規模長大後不失控的最小節奏" data-link-desc="tagging 規範、secrets 不進 code、成本可見性、最小可行節奏，規模長大後不失控">Module 8: Governance habits</a>: bucket tagging and cost attribution</li>
<li>→ <a href="/blog/infra/02-identity-credentials/" data-link-title="模組二：身分與憑證地基 — IAM 與 OIDC" data-link-desc="IAM role / policy 設計、最小權限，以及用 OIDC 短期憑證取代長期 access key">Module 2: Identity and credential foundations</a>: where the bucket policy and IAM policy permission models intersect</li>
</ul>
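]]></content:encoded></item><item><title>Putting the entry point into IaC: ALB, TLS, and health checks</title><link>https://tarrragon.github.io/blog/infra/05-core-services/loadbalancer-alb/</link><pubDate>Fri, 26 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/infra/05-core-services/loadbalancer-alb/</guid><description>&lt;p>The ALB (Application Load Balancer) describes the first stop for traffic entering the system. Its wiring job in IaC is to pin down three layers: the listener decides which ports and protocols are listened on, the target group decides which compute backends receive the traffic, and the health check decides whether a backend is healthy enough to take it. The ALB itself is stateless: rebuilding it loses no data but does change its DNS name, so public-facing services usually put a stable DNS record (a Route 53 alias or CNAME) in front of it so the domain users see does not change when the ALB is rebuilt.&lt;/p>
&lt;p>The ALB sits in the public subnets and references its own security group, whose inbound rules usually open only 80 and 443 to &lt;code>0.0.0.0/0&lt;/code> (one of the few places where fully open is reasonable, because receiving public traffic is the ALB's job). Backend compute nodes live in private subnets, and their security group allows inbound traffic only from the ALB's security group. This group-to-group reference makes the rule follow membership rather than IP addresses (see &lt;a href="https://tarrragon.github.io/blog/infra/03-network-foundation/" data-link-title="模組三：網路地基 — VPC 與分層" data-link-desc="VPC、public / private subnet 切分、route table、NAT、security group 設計">Module 3: Network foundations&lt;/a>).&lt;/p>
&lt;h2 id="alb-與-listener-設定">ALB and listener configuration&lt;/h2>
&lt;p>The ALB resource itself describes which subnets it attaches to, which security group it uses, and whether it is public-facing (&lt;code>internal = false&lt;/code>) or internal. A listener is a listening endpoint attached to the ALB; each listener is bound to one port + protocol combination.&lt;/p>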





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-hcl" data-lang="hcl">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">resource&lt;/span> &lt;span class="s2">&amp;#34;aws_lb&amp;#34; &amp;#34;api&amp;#34;&lt;/span> {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="n"> name&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;api-${var.env}&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="n"> internal&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kt">false&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="n"> load_balancer_type&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;application&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">&lt;span class="n"> security_groups&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="k">aws_security_group&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">alb&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">id&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">&lt;span class="n"> subnets&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="k">for&lt;/span> &lt;span class="k">s&lt;/span> &lt;span class="k">in&lt;/span> &lt;span class="k">aws_subnet&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">public&lt;/span> &lt;span class="err">:&lt;/span> &lt;span class="k">s&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">id&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">7&lt;/span>&lt;span class="cl">}&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="http-到-https-的強制跳轉">Forcing the HTTP-to-HTTPS redirect&lt;/h3>
&lt;p>A production service usually creates two listeners: port 443 accepts HTTPS traffic and forwards it to the backend, while port 80 accepts HTTP traffic and immediately returns a 301 redirect to HTTPS, so that users who start with &lt;code>http://&lt;/code> still end up on an encrypted connection.&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-hcl" data-lang="hcl">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="k">resource&lt;/span> &lt;span class="s2">&amp;#34;aws_lb_listener&amp;#34; &amp;#34;https&amp;#34;&lt;/span> {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="n"> load_balancer_arn&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="k">aws_lb&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">api&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">arn&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="n"> port&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="m">443&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="n"> protocol&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;HTTPS&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="n"> ssl_policy&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;ELBSecurityPolicy-TLS13-1-2-2021-06&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="n"> certificate_arn&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="k">aws_acm_certificate&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">api&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">arn&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl"> &lt;span class="k">default_action&lt;/span> {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="n"> type&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;forward&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="n"> target_group_arn&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="k">aws_lb_target_group&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">api&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">arn&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl"> }
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">}
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">14&lt;/span>&lt;span class="cl">&lt;span class="k">resource&lt;/span> &lt;span class="s2">&amp;#34;aws_lb_listener&amp;#34; &amp;#34;http_redirect&amp;#34;&lt;/span> {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">15&lt;/span>&lt;span class="cl">&lt;span class="n"> load_balancer_arn&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="k">aws_lb&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">api&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">arn&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">16&lt;/span>&lt;span class="cl">&lt;span class="n"> port&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="m">80&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">17&lt;/span>&lt;span class="cl">&lt;span class="n"> protocol&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;HTTP&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">18&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">19&lt;/span>&lt;span class="cl"> &lt;span class="k">default_action&lt;/span> {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">20&lt;/span>&lt;span class="cl">&lt;span class="n"> type&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;redirect&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">21&lt;/span>&lt;span class="cl"> &lt;span class="k">redirect&lt;/span> {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">22&lt;/span>&lt;span class="cl">&lt;span class="n"> port&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;443&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">23&lt;/span>&lt;span class="cl">&lt;span class="n"> protocol&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;HTTPS&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">24&lt;/span>&lt;span class="cl">&lt;span class="n"> status_code&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;HTTP_301&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">25&lt;/span>&lt;span class="cl"> }
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">26&lt;/span>&lt;span class="cl"> }
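&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">27&lt;/span>&lt;span class="cl">}&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>ssl_policy&lt;/code> determines which TLS versions and cipher suites the ALB accepts. The choice trades security against compatibility: &lt;code>ELBSecurityPolicy-TLS13-1-2-2021-06&lt;/code> accepts only TLS 1.2 and 1.3, which blocks downgrade attacks onto obsolete protocols but rejects very old clients still on TLS 1.0/1.1. For a public-facing API or website, TLS 1.2 and above is a reasonable floor; if there is a concrete legacy-client requirement (embedded devices, for example), lower it deliberately and knowing the cost.&lt;/p></description><content:encoded><![CDATA[<p>The ALB (Application Load Balancer) describes the first stop for traffic entering the system. Its wiring responsibility in IaC is to pin down three layers: the listener decides which ports and protocols are accepted, the target group decides which compute backends receive the traffic, and the health check decides whether a backend is healthy enough to take traffic. The ALB itself is stateless: rebuilding it loses no data but does change its DNS name, so public services usually put a stable DNS record (a Route 53 alias or a CNAME) in front of it so that the domain users see does not change when the ALB is rebuilt.</p>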
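<p>The ALB sits in public subnets and references its own security group, whose inbound rules usually open only 80 and 443 to <code>0.0.0.0/0</code> (one of the few places where fully open is reasonable, because receiving public traffic is the ALB's job). Backend compute nodes live in private subnets, and their security group allows inbound traffic only from the ALB's security group. This group-to-group reference makes the rule follow membership rather than IP addresses (see <a href="/blog/infra/03-network-foundation/" data-link-title="模組三：網路地基 — VPC 與分層" data-link-desc="VPC、public / private subnet 切分、route table、NAT、security group 設計">Module 3: Network Foundation</a>).</p>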
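<p>The group-to-group rule described above can be sketched roughly as follows. This is a minimal sketch, not the article's own configuration: <code>aws_security_group.app</code>, the backend port 8080, and the open egress on the ALB group are assumptions chosen for illustration.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Sketch only: resource names other than aws_security_group.alb are hypothetical.
resource &#34;aws_security_group&#34; &#34;alb&#34; {
  name   = &#34;alb-${var.env}&#34;
  vpc_id = aws_vpc.main.id

  # Public HTTP, only so the listener can redirect it to HTTPS
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = &#34;tcp&#34;
    cidr_blocks = [&#34;0.0.0.0/0&#34;]
  }

  # Public HTTPS
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = &#34;tcp&#34;
    cidr_blocks = [&#34;0.0.0.0/0&#34;]
  }

  # Open egress keeps the sketch free of a reference cycle with the app group
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = &#34;-1&#34;
    cidr_blocks = [&#34;0.0.0.0/0&#34;]
  }
}

resource &#34;aws_security_group&#34; &#34;app&#34; {
  name   = &#34;app-${var.env}&#34;
  vpc_id = aws_vpc.main.id

  # Inbound only from members of the ALB security group, not from any CIDR
  ingress {
    from_port       = 8080
    to_port         = 8080
    protocol        = &#34;tcp&#34;
    security_groups = [aws_security_group.alb.id]
  }
}</code></pre></div>
<p>Because the backend rule names a security group rather than an address range, it keeps working when the ALB's nodes change IP.</p>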
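<h2 id="alb-與-listener-設定">ALB and listener configuration</h2>
<p>The ALB resource itself describes which subnets it attaches to, which security group it uses, and whether it is internet-facing (<code>internal = false</code>) or internal. A listener is a listening endpoint attached to the ALB; each listener binds one port + protocol combination.</p>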





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_lb&#34; &#34;api&#34;</span> {
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="n">  name</span>               <span class="o">=</span> <span class="s2">&#34;api-${var.env}&#34;</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="n">  internal</span>           <span class="o">=</span> <span class="kt">false</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="n">  load_balancer_type</span> <span class="o">=</span> <span class="s2">&#34;application&#34;</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="n">  security_groups</span>    <span class="o">=</span> <span class="p">[</span><span class="k">aws_security_group</span><span class="p">.</span><span class="k">alb</span><span class="p">.</span><span class="k">id</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="n">  subnets</span>            <span class="o">=</span> <span class="p">[</span><span class="k">for</span> <span class="k">s</span> <span class="k">in</span> <span class="k">aws_subnet</span><span class="p">.</span><span class="k">public</span> <span class="err">:</span> <span class="k">s</span><span class="p">.</span><span class="k">id</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl">}</span></span></code></pre></div><h3 id="http-到-https-的強制跳轉">Forcing HTTP to HTTPS</h3>
<p>Production services usually create two listeners: port 443 accepts HTTPS traffic and forwards it to the backend, while port 80 accepts HTTP traffic and immediately returns a 301 redirect to HTTPS, so users who type an <code>http://</code> URL still end up on an encrypted connection.</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_lb_listener&#34; &#34;https&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  load_balancer_arn</span> <span class="o">=</span> <span class="k">aws_lb</span><span class="p">.</span><span class="k">api</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  port</span>              <span class="o">=</span> <span class="m">443</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">  protocol</span>          <span class="o">=</span> <span class="s2">&#34;HTTPS&#34;</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="n">  ssl_policy</span>        <span class="o">=</span> <span class="s2">&#34;ELBSecurityPolicy-TLS13-1-2-2021-06&#34;</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">  certificate_arn</span>   <span class="o">=</span> <span class="k">aws_acm_certificate</span><span class="p">.</span><span class="k">api</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">  <span class="k">default_action</span> {
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="n">    type</span>             <span class="o">=</span> <span class="s2">&#34;forward&#34;</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">    target_group_arn</span> <span class="o">=</span> <span class="k">aws_lb_target_group</span><span class="p">.</span><span class="k">api</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">  }
</span></span><span class="line"><span class="ln">12</span><span class="cl">}
</span></span><span class="line"><span class="ln">13</span><span class="cl">
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_lb_listener&#34; &#34;http_redirect&#34;</span> {
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="n">  load_balancer_arn</span> <span class="o">=</span> <span class="k">aws_lb</span><span class="p">.</span><span class="k">api</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="n">  port</span>              <span class="o">=</span> <span class="m">80</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="n">  protocol</span>          <span class="o">=</span> <span class="s2">&#34;HTTP&#34;</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">
</span></span><span class="line"><span class="ln">19</span><span class="cl">  <span class="k">default_action</span> {
</span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="n">    type</span> <span class="o">=</span> <span class="s2">&#34;redirect&#34;</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl">    <span class="k">redirect</span> {
</span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="n">      port</span>        <span class="o">=</span> <span class="s2">&#34;443&#34;</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="n">      protocol</span>    <span class="o">=</span> <span class="s2">&#34;HTTPS&#34;</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="n">      status_code</span> <span class="o">=</span> <span class="s2">&#34;HTTP_301&#34;</span>
</span></span><span class="line"><span class="ln">25</span><span class="cl">    }
</span></span><span class="line"><span class="ln">26</span><span class="cl">  }
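</span></span><span class="line"><span class="ln">27</span><span class="cl">}</span></span></code></pre></div><p><code>ssl_policy</code> determines which TLS versions and cipher suites the ALB accepts. The choice trades security against compatibility: <code>ELBSecurityPolicy-TLS13-1-2-2021-06</code> accepts only TLS 1.2 and 1.3, which blocks downgrade attacks onto obsolete protocols but rejects very old clients still on TLS 1.0/1.1. For a public-facing API or website, TLS 1.2 and above is a reasonable floor; if there is a concrete legacy-client requirement (embedded devices, for example), lower it deliberately and knowing the cost.</p>
<h3 id="多服務共用-alb">Sharing one ALB across services</h3>
<p>One ALB can carry multiple listener rules that use the host header or the path to route traffic to different target groups. This lets several microservices share one ALB (saving cost) instead of each running its own:</p>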





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_lb_listener_rule&#34; &#34;auth&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  listener_arn</span> <span class="o">=</span> <span class="k">aws_lb_listener</span><span class="p">.</span><span class="k">https</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  priority</span>     <span class="o">=</span> <span class="m">10</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">  <span class="k">condition</span> {
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">    path_pattern { values</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;/auth/*&#34;</span><span class="p">]</span> }
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">  }
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">  <span class="k">action</span> {
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">    type</span>             <span class="o">=</span> <span class="s2">&#34;forward&#34;</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="n">    target_group_arn</span> <span class="o">=</span> <span class="k">aws_lb_target_group</span><span class="p">.</span><span class="k">auth</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">  }
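</span></span><span class="line"><span class="ln">13</span><span class="cl">}</span></span></code></pre></div><p>A common consolidation opportunity: if every service has its own ALB but all traffic enters through the same entry point and differs only by path, they can be collapsed into one ALB plus listener rules. Each ALB carries a fixed hourly charge, so every ALB removed takes one line item off the monthly bill. Conversely, separate ALBs make sense when services differ enough in security level or traffic profile to need independent security groups and WAF rules.</p>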
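<p>Routing by host header follows the same shape as a path rule. The sketch below assumes a second target group, <code>aws_lb_target_group.admin</code>, and an <code>admin</code> subdomain; both are hypothetical names used only to illustrate the condition block.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Sketch only: the admin target group and hostname are assumptions.
resource &#34;aws_lb_listener_rule&#34; &#34;admin&#34; {
  listener_arn = aws_lb_listener.https.arn
  priority     = 20 # evaluated after lower-numbered rules on the same listener

  condition {
    host_header {
      values = [&#34;admin.${var.domain}&#34;]
    }
  }

  action {
    type             = &#34;forward&#34;
    target_group_arn = aws_lb_target_group.admin.arn
  }
}</code></pre></div>
<p>Requests that match no rule fall through to the listener's <code>default_action</code>.</p>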
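<h2 id="target-group-與健康檢查">Target groups and health checks</h2>
<p>A target group defines a set of backends that receive traffic (ECS tasks, EC2 instances, or IPs) together with the check logic that decides whether those backends are healthy. It is the bridge between the ALB and the actual compute.</p>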





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_lb_target_group&#34; &#34;api&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  name</span>        <span class="o">=</span> <span class="s2">&#34;api-${var.env}-tg&#34;</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  port</span>        <span class="o">=</span> <span class="m">8080</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">  protocol</span>    <span class="o">=</span> <span class="s2">&#34;HTTP&#34;</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="n">  vpc_id</span>      <span class="o">=</span> <span class="k">aws_vpc</span><span class="p">.</span><span class="k">main</span><span class="p">.</span><span class="k">id</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">  target_type</span> <span class="o">=</span> <span class="s2">&#34;ip&#34;</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">  <span class="k">health_check</span> {
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="n">    path</span>                <span class="o">=</span> <span class="s2">&#34;/healthz&#34;</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">    interval</span>            <span class="o">=</span> <span class="m">15</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="n">    healthy_threshold</span>   <span class="o">=</span> <span class="m">2</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="n">    unhealthy_threshold</span> <span class="o">=</span> <span class="m">3</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="n">    timeout</span>             <span class="o">=</span> <span class="m">5</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="n">    matcher</span>             <span class="o">=</span> <span class="s2">&#34;200&#34;</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">  }
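</span></span><span class="line"><span class="ln">16</span><span class="cl">}</span></span></code></pre></div><h3 id="健康檢查的閾值設計">Designing health check thresholds</h3>
<p>Health check paths and thresholds are the most commonly overlooked signals. The interaction between the parameters determines two time windows: how long before a new backend starts taking traffic, and how long before a bad backend is removed.</p>
<p><code>healthy_threshold = 2</code> with <code>interval = 15</code> means a newly started backend needs two consecutive passes, roughly 30 seconds, before it takes traffic. <code>unhealthy_threshold = 3</code> means it is removed only after three consecutive failures (about 45 seconds). Thresholds that are too loose leave a broken backend in rotation, so some users keep receiving errors; thresholds that are too strict mark a backend unhealthy at the moment of deployment, while a new container is starting and the application is still initializing, and the repeated removal and re-admission shows up to users as intermittent failures.</p>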
<table>
  <thead>
      <tr>
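          <th>Parameter</th>
          <th>Risk if too small</th>
          <th>Risk if too large</th>
          <th>Suggested starting point</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>interval</code></td>
          <td>Health checks put extra load on the backends</td>
          <td>Bad backends take longer to detect</td>
          <td>15-30 seconds</td>
      </tr>
      <tr>
          <td><code>healthy_threshold</code></td>
          <td>Backend takes traffic before it is fully ready</td>
          <td>Long wait after a deployment before traffic is distributed</td>
          <td>2-3 checks</td>
      </tr>
      <tr>
          <td><code>unhealthy_threshold</code></td>
          <td>Transient blips remove a healthy backend</td>
          <td>Bad backend keeps receiving traffic for too long</td>
          <td>2-3 checks</td>
      </tr>
      <tr>
          <td><code>timeout</code></td>
          <td>Normal but slow responses are misjudged as failures</td>
          <td>A backend that is really down takes long to confirm</td>
          <td>5 seconds</td>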
      </tr>
  </tbody>
</table>
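<p>The two windows can be written down as plain arithmetic so that a change to one threshold visibly changes the other numbers. This is only a sketch: the local names are hypothetical, and the result ignores the per-check <code>timeout</code> and the offset of the first check, so real timings will be somewhat longer.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Sketch only: local names are assumptions; values mirror the target group shown earlier.
locals {
  hc_interval            = 15
  hc_healthy_threshold   = 2
  hc_unhealthy_threshold = 3

  # Approximate seconds before a new backend starts taking traffic (15 * 2 = 30)
  seconds_until_in_service = local.hc_interval * local.hc_healthy_threshold

  # Approximate seconds before a failing backend is removed (15 * 3 = 45)
  seconds_until_removed = local.hc_interval * local.hc_unhealthy_threshold
}

output &#34;health_check_windows&#34; {
  value = {
    in_service = local.seconds_until_in_service
    removed    = local.seconds_until_removed
  }
}</code></pre></div>
<p>Feeding the same locals into the <code>health_check</code> block would keep the documented windows and the actual settings from drifting apart.</p>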
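<h3 id="健康檢查路徑的選擇">Choosing the health check path</h3>
<p>The endpoint behind <code>path</code> should reflect whether the application can actually serve requests, not merely that the process is alive. A bare endpoint that only returns 200 (a so-called liveness check) proves the HTTP server is running but not that it can reach the database or read required config. A more reasonable approach is to have <code>/healthz</code> check at least the connections to core dependencies (ping the DB, for example) and return 503 on failure. The cost is that the health check goes unhealthy together with the core dependency: if the DB drops briefly, every backend is marked unhealthy at once. When every target in a target group is unhealthy the ALB fails open and keeps routing to them, so users see the application's own errors rather than an ALB-generated 503, and an orchestrator such as ECS may start replacing tasks that are not themselves at fault. Reporting unhealthy is still the honest signal when the application cannot serve requests, but keep the check cheap and limited to dependencies without which no request can succeed.</p>
<p>How to read it: after a deployment, watch the number of healthy / unhealthy transitions in the target group. If new targets flap between healthy and unhealthy on every deployment, the initial wait is too short: the application's startup time exceeds <code>healthy_threshold * interval</code>. Consider raising <code>healthy_threshold</code>, or set a startup grace period on the ECS service (<code>healthCheckGracePeriodSeconds</code>) so that failed ALB health checks during initialization do not get the task replaced. The container-level <code>startPeriod</code> plays the same role for the task definition's own container health check, not for the ALB's.</p>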
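<p>On ECS, the service-level grace period is where a startup allowance for load balancer health checks is usually expressed. The sketch below assumes a Fargate service; the cluster, task definition, private subnets, <code>aws_security_group.app</code>, the container name, and the 60-second value are all hypothetical placeholders.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Sketch only: every referenced resource except aws_lb_target_group.api is an assumption.
resource &#34;aws_ecs_service&#34; &#34;api&#34; {
  name            = &#34;api-${var.env}&#34;
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 2
  launch_type     = &#34;FARGATE&#34;

  # The scheduler ignores failing load balancer health checks for this long
  # after a task starts, so slow initialization does not trigger replacement.
  health_check_grace_period_seconds = 60

  network_configuration {
    subnets         = [for s in aws_subnet.private : s.id]
    security_groups = [aws_security_group.app.id]
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.api.arn
    container_name   = &#34;api&#34;
    container_port   = 8080
  }
}</code></pre></div>
<p>A sensible grace period is a little longer than the slowest observed startup; setting it far higher only delays the replacement of tasks that are truly broken.</p>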
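<h2 id="tls-憑證acm-簽發dns-驗證與自動續期">TLS certificates: ACM issuance, DNS validation, and automatic renewal</h2>
<p>The TLS certificate referenced by the HTTPS listener is also part of the ALB's wiring. A certificate issued by ACM (AWS Certificate Manager) is described entirely in IaC, covering the domain and the DNS validation method, so the whole chain of "certificate exists, is validated, is attached" lives in version control instead of being a manually uploaded certificate in the Console that expires with nobody watching.</p>
<p>When an ACM certificate uses DNS validation, ACM requires a validation value to be placed in a specific DNS record. Terraform can create that record automatically and wait for validation to pass:</p>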





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_acm_certificate&#34; &#34;api&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  domain_name</span>       <span class="o">=</span> <span class="s2">&#34;api.${var.domain}&#34;</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  validation_method</span> <span class="o">=</span> <span class="s2">&#34;DNS&#34;</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="n">  lifecycle { create_before_destroy</span> <span class="o">=</span> <span class="kt">true</span> }
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">}
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_route53_record&#34; &#34;cert_validation&#34;</span> {
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="n">  for_each</span> <span class="o">=</span> {
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">    for dvo in aws_acm_certificate.api.domain_validation_options : dvo.domain_name</span> <span class="o">=</span><span class="err">&gt;</span> <span class="k">dvo</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">  }
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="n">  zone_id</span> <span class="o">=</span> <span class="k">data</span><span class="p">.</span><span class="k">aws_route53_zone</span><span class="p">.</span><span class="k">main</span><span class="p">.</span><span class="k">zone_id</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="n">  name</span>    <span class="o">=</span> <span class="k">each</span><span class="p">.</span><span class="k">value</span><span class="p">.</span><span class="k">resource_record_name</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="n">  type</span>    <span class="o">=</span> <span class="k">each</span><span class="p">.</span><span class="k">value</span><span class="p">.</span><span class="k">resource_record_type</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="n">  records</span> <span class="o">=</span> <span class="p">[</span><span class="k">each</span><span class="p">.</span><span class="k">value</span><span class="p">.</span><span class="k">resource_record_value</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="n">  ttl</span>     <span class="o">=</span> <span class="m">60</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">}
</span></span><span class="line"><span class="ln">18</span><span class="cl">
</span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_acm_certificate_validation&#34; &#34;api&#34;</span> {
</span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="n">  certificate_arn</span>         <span class="o">=</span> <span class="k">aws_acm_certificate</span><span class="p">.</span><span class="k">api</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="n">  validation_record_fqdns</span> <span class="o">=</span> <span class="p">[</span><span class="k">for</span> <span class="k">r</span> <span class="k">in</span> <span class="k">aws_route53_record</span><span class="p">.</span><span class="k">cert_validation</span> <span class="err">:</span> <span class="k">r</span><span class="p">.</span><span class="k">fqdn</span><span class="p">]</span>
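</span></span><span class="line"><span class="ln">22</span><span class="cl">}</span></span></code></pre></div><h3 id="create_before_destroy-的必要性">Why create_before_destroy is needed</h3>
<p><code>create_before_destroy = true</code> ensures that when a certificate change forces replacement (adding a SAN, for example), the new certificate is created before the old one is deleted, so the listener always has a usable certificate during the handover. Terraform's default order is destroy-then-create. With a certificate that is still attached to a listener, ACM normally refuses the delete because the certificate is in use, so the default order tends to end in a failed apply; where the delete does go through, the listener is left without a certificate and HTTPS connections fail until the new one is issued and validated.</p>
<p>ACM-issued certificates renew automatically: as long as the DNS validation record is still in place (it is managed by Terraform, so it stays), ACM starts renewing 60 days before expiry. This brings the cost of certificate management close to zero: no calendar reminders, no manual download and upload. Signal to watch: an alarm on the CloudWatch <code>DaysToExpiry</code> metric dropping below 30 means automatic renewal has failed, usually because the DNS validation record was deleted by hand or the Route 53 zone changed.</p>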
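<p>The expiry signal can live in the same code as the certificate. A minimal sketch, assuming an SNS topic named <code>aws_sns_topic.alerts</code> exists for notifications (that name and the alarm name are hypothetical); the namespace and metric are the ones ACM publishes to CloudWatch.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Sketch only: aws_sns_topic.alerts and the alarm name are assumptions.
resource &#34;aws_cloudwatch_metric_alarm&#34; &#34;cert_expiry&#34; {
  alarm_name          = &#34;acm-api-${var.env}-days-to-expiry&#34;
  namespace           = &#34;AWS/CertificateManager&#34;
  metric_name         = &#34;DaysToExpiry&#34;
  statistic           = &#34;Minimum&#34;
  period              = 86400 # one day, in seconds
  evaluation_periods  = 1
  comparison_operator = &#34;LessThanOrEqualToThreshold&#34;
  threshold           = 30

  dimensions = {
    CertificateArn = aws_acm_certificate.api.arn
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}</code></pre></div>
<p>Since renewal normally starts 60 days out, a threshold of 30 leaves time to repair the validation record before the certificate actually expires.</p>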
<h3 id="多網域憑證san">Multi-domain certificates (SAN)</h3>
<p>One ACM certificate can cover several domains (Subject Alternative Names), for example <code>api.example.com</code> and <code>admin.example.com</code> sharing one certificate. In IaC, list them with <code>subject_alternative_names</code>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_acm_certificate&#34; &#34;multi&#34;</span> {
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="n">  domain_name</span>               <span class="o">=</span> <span class="s2">&#34;api.${var.domain}&#34;</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="n">  subject_alternative_names</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;admin.${var.domain}&#34;, &#34;*.internal.${var.domain}&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="n">  validation_method</span>         <span class="o">=</span> <span class="s2">&#34;DNS&#34;</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="n">  lifecycle { create_before_destroy</span> <span class="o">=</span> <span class="kt">true</span> }
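</span></span><span class="line"><span class="ln">7</span><span class="cl">}</span></span></code></pre></div><p>Whether to share one certificate or issue separate ones depends on lifecycle: if the domains always go live, go away, and change together, sharing one saves maintenance; if they evolve independently, separate certificates keep each change smaller in scope.</p>
<h2 id="dns-zone-管理與-alb-的銜接">DNS zone management and connecting to the ALB</h2>
<h3 id="hosted-zonedns-記錄的容器">Hosted zone: the container for DNS records</h3>
<p>A Route 53 hosted zone is the container for all DNS records under a domain. A public hosted zone manages externally visible domains (such as <code>example.com</code>); a private hosted zone manages internal domains resolvable only inside the VPC (such as <code>internal.example.com</code>), letting services reach each other by DNS name rather than IP.</p>
<p>Multi-environment DNS commonly uses subdomain delegation: production uses <code>example.com</code> (the parent zone), while dev and staging use <code>dev.example.com</code> and <code>staging.example.com</code> (child zones). Child zones can live in different accounts and be managed by different teams; the parent zone only needs one set of NS records pointing to each child zone. This aligns the DNS boundaries between environments with the account boundaries.</p>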





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_route53_zone&#34; &#34;main&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  name</span> <span class="o">=</span> <span class="k">var</span><span class="p">.</span><span class="k">domain</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">}
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_route53_zone&#34; &#34;staging&#34;</span> {
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">  name</span> <span class="o">=</span> <span class="s2">&#34;staging.${var.domain}&#34;</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">}
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_route53_record&#34; &#34;staging_ns&#34;</span> {
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">  zone_id</span> <span class="o">=</span> <span class="k">aws_route53_zone</span><span class="p">.</span><span class="k">main</span><span class="p">.</span><span class="k">zone_id</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="n">  name</span>    <span class="o">=</span> <span class="s2">&#34;staging.${var.domain}&#34;</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="n">  type</span>    <span class="o">=</span> <span class="s2">&#34;NS&#34;</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="n">  ttl</span>     <span class="o">=</span> <span class="m">300</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="n">  records</span> <span class="o">=</span> <span class="k">aws_route53_zone</span><span class="p">.</span><span class="k">staging</span><span class="p">.</span><span class="k">name_servers</span>
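</span></span><span class="line"><span class="ln">15</span><span class="cl">}</span></span></code></pre></div><p>The hosted zone is also a dependency of ACM DNS validation: when ACM issues a certificate, a validation record has to be written into the corresponding zone, and if the zone does not exist or is not in the same account, the wiring fails. Create the zone before the ACM certificate so the dependency graph comes out right naturally.</p>
<h3 id="alb-的穩定-dns-記錄">A stable DNS record for the ALB</h3>
<p>The ALB's DNS name changes when it is rebuilt. The way to stay stable externally is to create an alias record in Route 53 pointing at the ALB: users connect to <code>api.example.com</code>, and DNS resolves to the ALB's current address:</p>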





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_route53_record&#34; &#34;api&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  zone_id</span> <span class="o">=</span> <span class="k">data</span><span class="p">.</span><span class="k">aws_route53_zone</span><span class="p">.</span><span class="k">main</span><span class="p">.</span><span class="k">zone_id</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  name</span>    <span class="o">=</span> <span class="s2">&#34;api.${var.domain}&#34;</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">  type</span>    <span class="o">=</span> <span class="s2">&#34;A&#34;</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">  <span class="k">alias</span> {
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="n">    name</span>                   <span class="o">=</span> <span class="k">aws_lb</span><span class="p">.</span><span class="k">api</span><span class="p">.</span><span class="k">dns_name</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="n">    zone_id</span>                <span class="o">=</span> <span class="k">aws_lb</span><span class="p">.</span><span class="k">api</span><span class="p">.</span><span class="k">zone_id</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="n">    evaluate_target_health</span> <span class="o">=</span> <span class="kt">true</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">  }
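</span></span><span class="line"><span class="ln">11</span><span class="cl">}</span></span></code></pre></div><p><code>evaluate_target_health = true</code> lets Route 53 treat this record as unhealthy when every target behind the ALB is unhealthy. If ALBs in multiple regions are set up with failover routing, this setting lets the DNS layer switch to the healthy region automatically. That is groundwork for cross-region disaster recovery, which the devops modules cover.</p>
<h2 id="waf-與下一步">WAF and next steps</h2>
<p>An ALB can have AWS WAF (Web Application Firewall) attached, so traffic passes a layer of rules before reaching the application: blocking known malicious IPs, catching common SQL injection / XSS patterns, and rate-limiting requests per IP. WAF rules can also be written into IaC, which turns "which traffic gets blocked" into reviewable code rather than a Console setting. Detailed WAF design belongs to the security layer (see <a href="/blog/backend/07-security-data-protection/" data-link-title="模組七：資安與資料保護" data-link-desc="以問題驅動方式擴充資安知識網：先定義服務環節問題，再以案例作為觸發式參考">backend Module 7: Security and Data Protection</a>); here it is enough to establish that its attachment point is the ALB.</p>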
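<p>The attachment itself is a small resource. A minimal sketch, assuming a web ACL named <code>aws_wafv2_web_acl.api</code> is defined elsewhere with <code>REGIONAL</code> scope (the name is hypothetical, and the rules it contains are out of scope here):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Sketch only: aws_wafv2_web_acl.api is an assumed, separately defined web ACL.
resource &#34;aws_wafv2_web_acl_association&#34; &#34;api&#34; {
  resource_arn = aws_lb.api.arn            # the ALB shown earlier
  web_acl_arn  = aws_wafv2_web_acl.api.arn # must be a REGIONAL web ACL for an ALB
}</code></pre></div>
<p>Keeping the association next to the ALB and the rules in the security layer lets each be reviewed by the team that owns it.</p>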
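<p>That completes the IaC description of the four categories of core services. The next step is making these services observable, with logs, metrics, and alarms created in the same lifecycle as the resources; see <a href="/blog/infra/06-observability-logging/" data-link-title="模組六：可觀測性與 log 一併寫進 code" data-link-desc="log group、metric、alarm 跟基礎設施同生命週期管理，出事時追得到查得到">Module 6: Observability and Logs</a>.</p>
<h2 id="跨分類引用">Cross-category references</h2>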
<ul>
<li>→ <a href="/blog/infra/03-network-foundation/" data-link-title="模組三：網路地基 — VPC 與分層" data-link-desc="VPC、public / private subnet 切分、route table、NAT、security group 設計">Module 3: Network Foundation</a>: security group design for the ALB, group-to-group references</li>
<li>→ <a href="/blog/infra/05-core-services/stateful-protection-dependency/" data-link-title="Stateful 資源保護與跨服務依賴表達" data-link-desc="stateful 資源的保護策略（multi-AZ、備份、刪除保護）、stateful 與 stateless 的操作差異，以及用 output 與 data source 表達服務間依賴">Module 5: Protection strategy for stateful resources</a>: the ALB is stateless, but the ACM certificate and DNS records it references have lifecycle considerations of their own</li>
<li>→ <a href="/blog/devops/01-load-balancing/" data-link-title="模組一：負載平衡與反向代理" data-link-desc="流量進來怎麼分給多個服務實例 — nginx / HAProxy / DNS round-robin 的選型和健康檢查路由設計">devops Module 1: Load Balancing</a>: runtime tuning of the ALB, including cross-AZ traffic distribution, connection draining, and sticky sessions</li>
<li>→ <a href="/blog/backend/07-security-data-protection/" data-link-title="模組七：資安與資料保護" data-link-desc="以問題驅動方式擴充資安知識網：先定義服務環節問題，再以案例作為觸發式參考">backend Module 7: Security and Data Protection</a>: WAF rule design</li>
</ul>
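]]></content:encoded></item><item><title>Stateful Resource Protection and Expressing Cross-Service Dependencies</title><link>https://tarrragon.github.io/blog/infra/05-core-services/stateful-protection-dependency/</link><pubDate>Fri, 26 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/infra/05-core-services/stateful-protection-dependency/</guid><description>&lt;p>Once core services are written into IaC, stateful resources need protection and operating rules entirely different from stateless ones. Databases, S3 buckets holding production data, and persistent volumes share one trait: rebuilding them is extremely costly or outright irreversible. A compute node that dies is replaced by starting another; data that is deleted is gone. That difference propagates into how the IaC is written, how strictly changes are reviewed, and how drift is handled.&lt;/p>
&lt;p>This article also covers how dependencies between services are expressed, with outputs and data sources, because dependency expression directly affects the blast radius of stateful resources: whether the database shares a state with compute and is applied together with it, or is split into an independent state that evolves on its own, decides how many resources one failed apply can affect.&lt;/p>
&lt;h2 id="stateful-資源的保護策略">Protection strategy for stateful resources&lt;/h2>
&lt;p>The IaC description of a stateful resource should treat "protecting state" as a first-class requirement, not an option bolted on afterwards. The three aspects of protection (availability, recoverability, and protection against accidental deletion) each map to a different mechanism; discussing them together blurs the judgment.&lt;/p>
&lt;h3 id="multi-az-的職責邊界">The scope of multi-AZ responsibility&lt;/h3>
&lt;p>multi-AZ is switched on with a single boolean attribute; behind it, RDS maintains a synchronous replica in another availability zone. What it provides is availability: when the primary's availability zone fails, RDS automatically fails over to the standby, and service recovers after a window ranging from seconds to a minute or two.&lt;/p>
&lt;p>The boundaries of multi-AZ need to be stated explicitly, because treating it as a tool that does more than its job leaves you without footing in an incident:&lt;/p>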
&lt;ul>
&lt;li>&lt;strong>standby 是熱備不可讀&lt;/strong>。multi-AZ 的 standby 不接受任何查詢流量，所以它不提供讀取擴展。要分攤讀流量得另開 read replica，這是另一個資源、另一個端點、另一套複寫延遲要管。&lt;/li>
&lt;li>&lt;strong>failover 有切換窗口&lt;/strong>。切換期間應用的資料庫連線會中斷、需要重連。應用層如果沒有處理連線中斷的重試邏輯，failover 就會變成一段可見的服務中斷，而非透明切換。&lt;/li>
&lt;li>&lt;strong>它不防邏輯損壞&lt;/strong>。誤刪一張 table、一筆錯誤的批次 UPDATE、一段有 bug 的 migration script — 這些操作會同步複製到 standby。multi-AZ 防的是硬體與可用區故障，邏輯損壞的防線是備份與時間點還原（PITR）。&lt;/li>
&lt;/ul>
&lt;p>這三條邊界說明 multi-AZ 和 backup 的職責正交：前者解可用性，後者解可還原性。兩者要分別配置、分別驗證。成本參考：multi-AZ RDS 的費用約為 single-AZ 的兩倍（standby instance 按相同規格計費）。這筆費用對應的能力是可用區故障時的分鐘級自動 failover——判斷值不值得時，用主庫所承載的服務停機每小時的商業代價來衡量。&lt;/p>
&lt;h3 id="備份保留與時間點還原">備份保留與時間點還原&lt;/h3>
&lt;p>backup 用保留天數與備份視窗描述。RDS 依此每日自動快照並保留交易日誌，以支援還原到任意時間點（PITR）。自動備份的保留上限是 35 天，更長的留存要靠手動快照或匯出到 S3 自行管理。&lt;/p>
&lt;p>&lt;code>backup_retention_period&lt;/code> 取多少天，以 RPO（Recovery Point Objective）與合規要求反推。RPO 問的是「出事時最多能接受遺失多久的資料」— PITR 能還原到最近 5 分鐘內的時間點，但前提是自動備份有開、交易日誌有保留。保留天數決定的是「能回溯多遠」：14 天是 AWS RDS 自動備份 35 天上限的保守折衷，足以涵蓋多數營運場景下「發現問題到決定還原」的時間差；受監理的服務往 30 天推，以滿足稽核追溯窗口。&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-hcl" data-lang="hcl">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">resource&lt;/span> &lt;span class="s2">&amp;#34;aws_db_instance&amp;#34; &amp;#34;primary&amp;#34;&lt;/span> {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="n"> multi_az&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kt">true&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="n"> backup_retention_period&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="m">14&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="n"> backup_window&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;03:00-04:00&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">&lt;span class="n"> deletion_protection&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kt">true&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">&lt;span class="n"> skip_final_snapshot&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kt">false&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">7&lt;/span>&lt;span class="cl">&lt;span class="n"> final_snapshot_identifier&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;app-prod-final-${formatdate(&amp;#34;YYYYMMDD&amp;#34;, timestamp())}&amp;#34;&lt;/span>
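&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">8&lt;/span>&lt;span class="cl">}&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Schedule the backup window in a traffic trough (for example the early hours, keeping in mind that the window is specified in UTC) so snapshot IO does not compete with peak traffic. Manual snapshots are described as a separate resource; a common use is as a safety point before major changes such as major version upgrades, schema migrations, or any operation that alters the data structure.&lt;/p>
&lt;h3 id="刪除保護與-final-snapshot">Deletion protection and the final snapshot&lt;/h3>
&lt;p>&lt;code>deletion_protection = true&lt;/code> prevents &lt;code>terraform destroy&lt;/code> from deleting the instance directly: protection must first be turned off in a separate apply, and that step itself shows up in the plan and gets caught in review. &lt;code>skip_final_snapshot = false&lt;/code> ensures that even when deletion is intended, a final snapshot is taken first. Together they are the hard minimum for a production database.&lt;/p>
&lt;p>The signal to catch in review: if a production stateful resource has &lt;code>backup_retention_period&lt;/code> set to 0 or &lt;code>deletion_protection&lt;/code> set to false, state protection has not been written into code. Treat these attributes as defaults for production databases, not as adjustable preferences.&lt;/p>
&lt;p>S3 bucket protection follows the same idea with different mechanisms. Versioning lets overwritten or deleted objects be restored to an earlier version; MFA delete requires a second authentication factor before deletion; lifecycle rules control how long old versions are kept. These three map to the responsibilities "recoverable", "protected against accidental deletion", and "cost-controlled" respectively; see &lt;a href="https://tarrragon.github.io/blog/infra/05-core-services/storage-s3/" data-link-title="儲存上 IaC — S3 bucket 的安全與生命週期" data-link-desc="S3 bucket 的加密、版本控制、公開存取封鎖、生命週期規則、bucket policy 與事件通知怎麼寫進 IaC，讓儲存的安全與成本防線可審查可追蹤">Storage (S3)&lt;/a>.&lt;/p>
&lt;h3 id="跨-region-災難復原的邊界">The boundary of cross-region disaster recovery&lt;/h3>
&lt;p>multi-AZ handles availability-zone-level failures: when a single data center has a problem, another availability zone in the same region takes over. Cross-region disaster recovery (cross-region read replica, S3 cross-region replication, Route 53 failover routing) is a higher tier of availability investment that addresses the extreme case of an entire region being unavailable. Its cost and complexity rise significantly: cross-region replication has lag, failover routing needs health checks and DNS TTLs to work together, and the infra in both regions must be maintained separately. Most services evaluate whether they need cross-region only after single-region multi-AZ plus backups are done, based on how much region-level failure the business RTO (Recovery Time Objective) can tolerate.&lt;/p></description><content:encoded><![CDATA[<p>Once core services are written into IaC, stateful resources need protection and operating rules that differ sharply from those for stateless resources. Databases, S3 buckets holding production data, and persistent volumes share one trait: rebuilding them is extremely costly or outright irreversible. If a compute node dies you start another one; if data is deleted, it is gone. This difference carries through to how the resources are described in IaC, how strictly changes are reviewed, and how drift is handled.</p>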
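<p>This article also covers how dependencies between services are expressed, through outputs and data sources, because dependency expression directly affects the blast radius of stateful resources: whether the database and compute sit in one state and are applied together, or are split into independent states that evolve separately, determines how many resources a single failed apply can affect.</p>
<h2 id="stateful-資源的保護策略">Protection strategies for stateful resources</h2>
<p>The IaC description of a stateful resource should treat "protecting state" as a first-class requirement, not an option bolted on afterwards. Protection has three aspects (availability, recoverability, and protection against accidental deletion), each backed by a different mechanism; discussing them together blurs the decision.</p>
<h3 id="multi-az-的職責邊界">The responsibility boundary of multi-AZ</h3>
<p>multi-AZ is enabled with a single boolean attribute; behind it, RDS maintains a synchronous replica in another availability zone. Its job is availability: when the availability zone hosting the primary fails, RDS automatically fails over to the standby, and service resumes after a window ranging from seconds to a minute or two.</p>
<p>The boundaries of multi-AZ need to be stated explicitly, because treating it as a tool that does more than its job leads to nasty surprises during an incident:</p>
<ul>
<li><strong>The standby is a hot standby and cannot be read</strong>. A multi-AZ standby accepts no query traffic, so it provides no read scaling. To offload reads you have to create a separate read replica, which is another resource, another endpoint, and another source of replication lag to manage.</li>
<li><strong>Failover has a switchover window</strong>. During the switch, the application's database connections drop and must be re-established. If the application layer has no retry logic for dropped connections, failover becomes a visible outage rather than a transparent switch.</li>
<li><strong>It does not protect against logical corruption</strong>. An accidentally dropped table, a wrong batch UPDATE, a buggy migration script: all of these are synchronously replicated to the standby. multi-AZ guards against hardware and availability-zone failures; the defense against logical corruption is backups and point-in-time recovery (PITR).</li>
</ul>
<p>These three boundaries show that multi-AZ and backup have orthogonal responsibilities: the former addresses availability, the latter recoverability. They must be configured and verified separately. Cost reference: multi-AZ RDS costs roughly twice as much as single-AZ (the standby instance is billed at the same spec). What this money buys is automatic failover within minutes when an availability zone fails; to judge whether it is worth it, weigh it against the hourly business cost of downtime for the service the primary database supports.</p>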
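<p>The standby boundary above says read traffic needs a separate read replica. A minimal sketch of what that extra resource could look like, assuming the <code>aws_db_instance.primary</code> instance from this article lives in the same configuration; the identifier and instance class are hypothetical placeholders, not values from the original setup:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Hypothetical same-region read replica of the primary instance.
resource "aws_db_instance" "read_replica" {
  identifier          = "app-prod-replica-1" # placeholder name
  replicate_source_db = aws_db_instance.primary.identifier
  instance_class      = "db.t3.medium" # placeholder size

  # A replica holds no unique data, so a final snapshot is optional here.
  skip_final_snapshot = true
}

# The replica has its own hostname; read-only traffic must be pointed at it explicitly.
output "read_replica_address" {
  value = aws_db_instance.read_replica.address
}
</code></pre></div>
<p>The application has to route read-only queries to this separate address and tolerate replication lag; multi-AZ alone provides no such endpoint.</p>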
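<h3 id="備份保留與時間點還原">Backup retention and point-in-time recovery</h3>
<p>Backups are described by a retention period in days and a backup window. Based on these, RDS takes a daily automated snapshot and retains transaction logs to support restoring to any point in time within the retention period (PITR). The retention limit for automated backups is 35 days; longer retention requires manual snapshots or exports to S3 that you manage yourself.</p>
<p><code>backup_retention_period</code> is derived from the RPO (Recovery Point Objective) and compliance requirements. RPO asks "how much data can we afford to lose when something goes wrong"; PITR can restore to a point within the most recent 5 minutes, provided automated backups are enabled and transaction logs are retained. The retention period determines "how far back you can go": 14 days is a conservative middle ground under the 35-day limit for AWS RDS automated backups, enough to cover the gap between noticing a problem and deciding to restore in most operational scenarios; regulated services push toward 30 days to satisfy audit look-back windows.</p>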





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_db_instance&#34; &#34;primary&#34;</span> {
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="n">  multi_az</span>                  <span class="o">=</span> <span class="kt">true</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="n">  backup_retention_period</span>   <span class="o">=</span> <span class="m">14</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="n">  backup_window</span>             <span class="o">=</span> <span class="s2">&#34;03:00-04:00&#34;</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="n">  deletion_protection</span>       <span class="o">=</span> <span class="kt">true</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="n">  skip_final_snapshot</span>       <span class="o">=</span> <span class="kt">false</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="n">  final_snapshot_identifier</span> <span class="o">=</span> <span class="s2">&#34;app-prod-final-${formatdate(&#34;YYYYMMDD&#34;, timestamp())}&#34;</span>
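</span></span><span class="line"><span class="ln">8</span><span class="cl">}</span></span></code></pre></div><p>Schedule the backup window in a traffic trough (for example the early hours, keeping in mind that the window is specified in UTC) so snapshot IO does not compete with peak traffic. Manual snapshots are described as a separate resource; a common use is as a safety point before major changes such as major version upgrades, schema migrations, or any operation that alters the data structure.</p>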
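<p>As a sketch of a manual snapshot described as its own resource, assuming the <code>aws_db_instance.primary</code> instance above; the resource name and snapshot identifier are hypothetical, chosen only for illustration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Manual snapshot taken as a safety point before a risky change.
resource "aws_db_snapshot" "pre_migration" {
  db_instance_identifier = aws_db_instance.primary.identifier
  db_snapshot_identifier = "app-prod-pre-migration" # placeholder name
}
</code></pre></div>
<p>Unlike automated backups, a manual snapshot is not expired by the backup retention period. Because Terraform manages it here, removing the resource block from code deletes the snapshot on the next apply, so keep the block for as long as the safety point is needed.</p>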
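<h3 id="刪除保護與-final-snapshot">Deletion protection and the final snapshot</h3>
<p><code>deletion_protection = true</code> prevents <code>terraform destroy</code> from deleting the instance directly: protection must first be turned off in a separate apply, and that step itself shows up in the plan and gets caught in review. <code>skip_final_snapshot = false</code> ensures that even when deletion is intended, a final snapshot is taken first. Together they are the hard minimum for a production database.</p>
<p>The signal to catch in review: if a production stateful resource has <code>backup_retention_period</code> set to 0 or <code>deletion_protection</code> set to false, state protection has not been written into code. Treat these attributes as defaults for production databases, not as adjustable preferences.</p>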
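<p>One way to make that review signal mechanical is an input validation. The sketch below assumes the database module takes its retention period from a variable; the variable name is hypothetical and not part of the original configuration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Hypothetical module input that feeds backup_retention_period.
variable "backup_retention_days" {
  type        = number
  default     = 14
  description = "Days of automated backups to keep (RDS allows up to 35)."

  validation {
    # 0 disables automated backups, and PITR along with them.
    condition     = var.backup_retention_days != 0
    error_message = "backup_retention_days must not be 0: automated backups would be disabled."
  }
}
</code></pre></div>
<p>With this in place, a plan that sets retention to 0 fails before it reaches review instead of relying on a reviewer to notice the value.</p>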
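<p>S3 bucket protection follows the same idea with different mechanisms. Versioning lets overwritten or deleted objects be restored to an earlier version; MFA delete requires a second authentication factor before deletion; lifecycle rules control how long old versions are kept. These three map to the responsibilities "recoverable", "protected against accidental deletion", and "cost-controlled" respectively; see <a href="/blog/infra/05-core-services/storage-s3/" data-link-title="儲存上 IaC — S3 bucket 的安全與生命週期" data-link-desc="S3 bucket 的加密、版本控制、公開存取封鎖、生命週期規則、bucket policy 與事件通知怎麼寫進 IaC，讓儲存的安全與成本防線可審查可追蹤">Storage (S3)</a>.</p>
<h3 id="跨-region-災難復原的邊界">The boundary of cross-region disaster recovery</h3>
<p>multi-AZ handles availability-zone-level failures: when a single data center has a problem, another availability zone in the same region takes over. Cross-region disaster recovery (cross-region read replica, S3 cross-region replication, Route 53 failover routing) is a higher tier of availability investment that addresses the extreme case of an entire region being unavailable. Its cost and complexity rise significantly: cross-region replication has lag, failover routing needs health checks and DNS TTLs to work together, and the infra in both regions must be maintained separately. Most services evaluate whether they need cross-region only after single-region multi-AZ plus backups are done, based on how much region-level failure the business RTO (Recovery Time Objective) can tolerate.</p>
<p>Cross-region infra investment is easier to justify under the contractual obligations of B2B SaaS. <a href="/blog/backend/09-performance-capacity/cases/genesys-dynamodb-99999-availability/" data-link-title="9.C24 Genesys：用 DynamoDB 在 15 region 跑出 99.999% 可用性" data-link-desc="Genesys 客服平台用 DynamoDB 為預設資料層、跨 15 主 region &#43; 5 衛星 region、達成 12 個月 99.999% 可用性">Genesys's customer-service platform achieves 99.999% availability using DynamoDB across 15 regions</a>, which amounts to roughly 5 minutes of downtime per year. For B2B SaaS, a service outage means the customer's end users cannot get through on the phone, so availability is a contractual obligation rather than a marketing statement. The decision criterion at the infra layer: cases where multi-AZ is not enough (the business needs cross-region failover) are usually driven by contractual SLAs rather than by technical judgment.</p>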
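<h2 id="stateful-與-stateless-的操作差異">Operational differences between stateful and stateless</h2>
<p>The fundamental difference between stateful and stateless resources is the cost of rebuilding. That difference leads to three operational consequences, each of which affects day-to-day PR review and the apply workflow.</p>
<h3 id="刪除保護的必要性">Why deletion protection is necessary</h3>
<p>Rebuilding stateless resources (ECS services, ALBs, stateless compute) just swaps in a new set of instances, recovering within minutes with no data loss, so they can be destroyed and recreated frequently; this is what IaC handles best. Rebuilding a stateful resource means data loss or a lengthy restore, at a cost of possibly hours of downtime and irreversible loss. Enabling deletion protection means an accidental destroy first requires explicitly disabling protection, which adds one more human confirmation.</p>
<h3 id="drift-容忍度">Drift tolerance</h3>
<p>Drift in stateless resources can be erased by rebuilding: one apply returns them to the state in code, and the only side effect is a brief rolling update of new instances. Drift in stateful resources must be handled carefully, because IaC's "correct back to the code state" action may trigger a restart or even a rebuild.</p>
<p>A common scenario: someone manually changes an RDS parameter group, and the Terraform plan shows it will be reverted to the version in code. Whether that revert is an <code>update in-place</code> (change settings, no rebuild) or a <code>replace</code> (delete, then create) depends on which parameter was changed: some parameter changes require a restart, and some require rebuilding the whole instance. The way to read it is to run plan first and look at the result of the drift correction: <code>update in-place</code> is usually safe (it may trigger a restart), whereas <code>replace</code> means delete-then-create for a database and needs extra confirmation on prod.</p>
<h3 id="變更審查強度">Change review rigor</h3>
<p>The plan output for changes to stateful resources should be read line by line, with particular wariness toward anything shown as <code>replace</code> (<code>-/+</code>) or marked <code>forces replacement</code>. Changes to some fields look harmless but trigger a replace:</p>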
<table>
  <thead>
      <tr>
          <th>Field</th>
          <th>Expected behavior</th>
          <th>Actual behavior</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Renaming RDS <code>identifier</code></td>
          <td>Just a rename</td>
          <td>forces replacement</td>
      </tr>
      <tr>
          <td>RDS <code>engine_version</code> major version</td>
          <td>Upgrade the engine version</td>
          <td>May be replace or in-place</td>
      </tr>
      <tr>
          <td>Changing RDS <code>storage_type</code></td>
          <td>Switch the storage type</td>
          <td>forces replacement for some combinations</td>
      </tr>
      <tr>
          <td>Renaming S3 bucket <code>bucket</code></td>
          <td>Just a rename</td>
          <td>forces replacement</td>
      </tr>
  </tbody>
</table>
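<p>When review shows <code>forces replacement</code> on a stateful resource, on the prod path you should almost always pause and confirm the rollback path (has a manual snapshot been taken?) before deciding whether to continue. A common practice is to build this difference into the process: changes to stateful resources go through stricter PR review and staged rollout (apply in dev first to verify, confirm the change is in-place, then promote to prod). Automated guardrails are covered in <a href="/blog/infra/07-infra-as-pr/" data-link-title="模組七：infra 走 PR 流程與自動化護欄" data-link-desc="infra 變更走 PR → plan → review diff → 合併 → apply，配 fmt / validate / tflint / checkov / tfsec 與 Atlantis 自動化，讓基礎設施可審查、可回溯、可交接">Module 7: infra through the PR workflow</a>.</p>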
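<p>Terraform's <code>prevent_destroy</code> lifecycle flag is one code-level guardrail for this situation: any plan that would destroy the resource, including a replace, fails with an error. A minimal sketch on an S3 bucket; the resource name and bucket name are hypothetical:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl">resource "aws_s3_bucket" "uploads" {
  bucket = "app-prod-uploads-example" # placeholder; renaming it forces replacement

  lifecycle {
    # Any plan that would destroy or replace this bucket errors out.
    prevent_destroy = true
  }
}
</code></pre></div>
<p>This complements review rather than replacing it: if the whole resource block is removed from code, the flag goes with it, so the plan still has to be read.</p>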
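<h2 id="服務之間的依賴怎麼表達">How to express dependencies between services</h2>
<p>Dependencies between services are expressed with outputs and data sources, so that references become traceable edges in code rather than implicit conventions people must remember. The choice of reference style directly affects state size and blast radius.</p>
<h3 id="同-state-內的引用">References within the same state</h3>
<p>Within one state, referencing a resource attribute directly is enough to establish a dependency. When compute references the database endpoint, IaC automatically infers the "database before compute" edge, and the upper layer automatically picks up the new value when the endpoint changes:</p>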





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_ecs_task_definition&#34; &#34;api&#34;</span> {
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="n">  container_definitions</span> <span class="o">=</span> <span class="k">jsonencode</span><span class="p">([</span>{
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="n">    environment</span> <span class="o">=</span> <span class="p">[</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="n">      { name</span> <span class="o">=</span><span class="n"> &#34;DB_HOST&#34;, value</span> <span class="o">=</span> <span class="k">aws_db_instance</span><span class="p">.</span><span class="k">primary</span><span class="p">.</span><span class="k">address</span> }
</span></span><span class="line"><span class="ln">5</span><span class="cl">    <span class="p">]</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">  }<span class="p">])</span>
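</span></span><span class="line"><span class="ln">7</span><span class="cl">}</span></span></code></pre></div><p>The benefit of same-state references is the most complete dependency graph: a single apply resolves every reference to the correct value. The cost is that the larger the state, the larger the blast radius of a single apply. With a state that contains network, database, compute, and LB, one failed apply can leave the whole stack partially applied.</p>
<h3 id="跨-state-的-data-source">Data sources across states</h3>
<p>Across states (for example when the network foundation and the core services live in different Terraform states, echoing the split in <a href="/blog/infra/04-environment-separation/" data-link-title="模組四：環境分離與模組化" data-link-desc="dev / staging / prod 切分、目錄結構 vs workspace、用可重用 module 避免環境漂移">Module 4: Environment Separation and Modularization</a>), the downstream side uses data sources to read, in read-only fashion, resources the upstream has already created:</p>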





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">data</span> <span class="s2">&#34;aws_vpc&#34; &#34;main&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  tags</span> <span class="o">=</span><span class="n"> { Name</span> <span class="o">=</span> <span class="s2">&#34;app-${var.env}&#34;</span> }
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">}
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="k">data</span> <span class="s2">&#34;aws_subnets&#34; &#34;private&#34;</span> {
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">  <span class="k">filter</span> {
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="n">    name</span>   <span class="o">=</span> <span class="s2">&#34;vpc-id&#34;</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="n">    values</span> <span class="o">=</span> <span class="p">[</span><span class="k">data</span><span class="p">.</span><span class="k">aws_vpc</span><span class="p">.</span><span class="k">main</span><span class="p">.</span><span class="k">id</span><span class="p">]</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">  }
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">  tags</span> <span class="o">=</span><span class="n"> { tier</span> <span class="o">=</span> <span class="s2">&#34;private&#34;</span> }
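</span></span><span class="line"><span class="ln">11</span><span class="cl">}</span></span></code></pre></div><p>The downstream queries the upstream's VPC and subnets and obtains the IDs it needs to place its own resources, rather than copy-pasting hard-coded values.</p>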
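<p>To show how those lookups are consumed, here is a sketch of a DB subnet group built from the queried subnet IDs. It assumes the <code>data.aws_subnets.private</code> block above and the <code>var.env</code> variable already exist in the configuration; the resource name and naming pattern are hypothetical:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Places the database in the private subnets found by the data source above.
resource "aws_db_subnet_group" "primary" {
  name       = "app-${var.env}-db" # placeholder naming pattern
  subnet_ids = data.aws_subnets.private.ids
}
</code></pre></div>
<p>No subnet ID appears as a literal here, so an upstream rebuild that keeps the same tags is picked up on the next plan.</p>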
<h3 id="同-state-vs-跨-state-的取捨">Same-state vs cross-state trade-offs</h3>
<p>The trade-off between the two approaches lies between coupling and isolation:</p>
<table>
  <thead>
      <tr>
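          <th>Dimension</th>
          <th>Same-state reference</th>
          <th>Cross-state data source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Dependency graph</td>
          <td>Complete, inferred automatically</td>
          <td>Crosses state boundaries; requires agreeing that upstream applies first</td>
      </tr>
      <tr>
          <td>Blast radius</td>
          <td>The larger the state, the larger a single apply</td>
          <td>States are independent; small blast radius</td>
      </tr>
      <tr>
          <td>Best fit</td>
          <td>A small number of tightly coupled resources</td>
          <td>Separating the foundation layer from the service layer</td>
      </tr>
      <tr>
          <td>Drift risk</td>
          <td>Low (references are tracked automatically)</td>
          <td>Medium (after an upstream rebuild the data source may find nothing)</td>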
      </tr>
  </tbody>
</table>
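<p>Grep through the core-services HCL: if lots of hard-coded subnet IDs or VPC IDs turn up, a data source should have been used but was not. These hard-coded values are the source of drift and broken references when the upstream is rebuilt later. Replace them with data sources so the dependencies become explicit in code and visible to tools and review.</p>
<p>The reliability of a data source query depends on how stable the query criteria are. Querying by purpose-defined <code>tags</code> (such as the <code>tier</code> tag in the subnet lookup) is more stable than querying by <code>Name</code>: such tags are values you define and control, whereas the Name of some resources may change on rebuild. Using the <code>terraform_remote_state</code> data source to read the upstream state's outputs directly is the most precise approach, but it couples the backend configuration of the two states, so when the upstream moves its state the downstream must change too. While the team is small and states are not split much, the coupling cost of <code>terraform_remote_state</code> is usually acceptable; as the team grows, using tag-based data sources or SSM Parameter Store as an intermediate layer lets upstream and downstream evolve independently.</p>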
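<p>The SSM Parameter Store variant can be sketched as follows. The upstream state publishes an ID under an agreed path and the downstream reads it back. The parameter path is a hypothetical convention, <code>aws_vpc.main</code> and <code>var.env</code> are assumed to exist, and the two halves would live in different configurations:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># --- Upstream (network state): publish the VPC ID ---
resource "aws_ssm_parameter" "vpc_id" {
  name  = "/infra/${var.env}/network/vpc_id" # hypothetical path convention
  type  = "String"
  value = aws_vpc.main.id
}

# --- Downstream (core services state): read it back ---
data "aws_ssm_parameter" "vpc_id" {
  name = "/infra/${var.env}/network/vpc_id"
}

locals {
  # The data source marks value as sensitive; a VPC ID is not a secret.
  vpc_id = nonsensitive(data.aws_ssm_parameter.vpc_id.value)
}
</code></pre></div>
<p>Only the parameter path is shared between the two sides, so the upstream can move its state backend without the downstream changing.</p>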
<h2 id="跨分類引用">Cross-category references</h2>
<ul>
<li>→ <a href="/blog/infra/03-network-foundation/" data-link-title="模組三：網路地基 — VPC 與分層" data-link-desc="VPC、public / private subnet 切分、route table、NAT、security group 設計">Module 3: Network Foundation</a>: which subnets the core services land in and how security groups are referenced</li>
<li>→ <a href="/blog/infra/04-environment-separation/" data-link-title="模組四：環境分離與模組化" data-link-desc="dev / staging / prod 切分、目錄結構 vs workspace、用可重用 module 避免環境漂移">Module 4: Environment Separation and Modularization</a>: strategies for splitting across states</li>
<li>→ <a href="/blog/infra/07-infra-as-pr/" data-link-title="模組七：infra 走 PR 流程與自動化護欄" data-link-desc="infra 變更走 PR → plan → review diff → 合併 → apply，配 fmt / validate / tflint / checkov / tfsec 與 Atlantis 自動化，讓基礎設施可審查、可回溯、可交接">Module 7: infra through the PR workflow</a>: automated guardrails for stateful changes</li>
</ul>
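]]></content:encoded></item><item><title>ACM Certificates, DNS, and HTTPS Setup</title><link>https://tarrragon.github.io/blog/infra/05-core-services/acm-tls-dns-setup/</link><pubDate>Fri, 26 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/infra/05-core-services/acm-tls-dns-setup/</guid><description>&lt;p>HTTPS needs three components working together: a DNS zone that manages the domain's records, a TLS certificate that proves domain ownership, and an entry point (the ALB listener) that terminates TLS connections with that certificate. In IaC each of the three is an independent resource, but their creation order has dependencies: the zone must exist first, only then can the certificate be validated via DNS, and only after validation can it be attached to a listener. Writing this chain into Terraform, so that certificate request, validation, and renewal are all under version control, is the structural way to avoid finding out that a certificate expired because nobody was watching.&lt;/p>
&lt;h2 id="route-53-hosted-zone">Route 53 Hosted Zone&lt;/h2>
&lt;p>A hosted zone is the set of DNS records Route 53 uses to manage a domain. After the zone is created, Route 53 assigns a set of NS (Name Server) records, and DNS resolution for the domain is handled by these name servers.&lt;/p>
&lt;h3 id="public-vs-private-zone">Public vs Private Zone&lt;/h3>
&lt;p>A public hosted zone corresponds to a domain resolvable from the internet (such as &lt;code>example.com&lt;/code>) and holds the A / CNAME / MX records for public-facing services. A private hosted zone is resolvable only inside the specified VPCs and is used for internal service discovery (for example &lt;code>db.internal.example.com&lt;/code> resolving to the RDS private IP). Most projects need both: a public zone for external traffic and a private zone for internal service-to-service connections.&lt;/p>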





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-hcl" data-lang="hcl">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="k">resource&lt;/span> &lt;span class="s2">&amp;#34;aws_route53_zone&amp;#34; &amp;#34;public&amp;#34;&lt;/span> {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="n"> name&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;example.com&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="n"> tags&lt;/span> &lt;span class="o">=&lt;/span>&lt;span class="n"> { Environment&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;production&amp;#34;&lt;/span> }
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">}
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="k">resource&lt;/span> &lt;span class="s2">&amp;#34;aws_route53_zone&amp;#34; &amp;#34;private&amp;#34;&lt;/span> {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="n"> name&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;internal.example.com&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl"> &lt;span class="k">vpc&lt;/span> {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="n"> vpc_id&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="k">aws_vpc&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">main&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">id&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl"> }
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl">&lt;span class="n"> tags&lt;/span> &lt;span class="o">=&lt;/span>&lt;span class="n"> { Environment&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;production&amp;#34;&lt;/span> }
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">14&lt;/span>&lt;span class="cl">}&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="子網域-delegation">Subdomain delegation&lt;/h3>
&lt;p>When dev / staging / prod each use a separate account, each account creates its own hosted zone to manage a subdomain (such as &lt;code>dev.example.com&lt;/code>). The parent domain's zone needs a set of NS records pointing to the subdomain's zone; this is called delegation.&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-hcl" data-lang="hcl">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">resource&lt;/span> &lt;span class="s2">&amp;#34;aws_route53_record&amp;#34; &amp;#34;dev_ns&amp;#34;&lt;/span> {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="n"> zone_id&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="k">aws_route53_zone&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">public&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">zone_id&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="n"> name&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;dev.example.com&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="n"> type&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;NS&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">&lt;span class="n"> ttl&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="m">300&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">&lt;span class="n"> records&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="k">aws_route53_zone&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">dev&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="k">name_servers&lt;/span>
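&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">7&lt;/span>&lt;span class="cl">}&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The delegation NS record points to the name servers of the child account's zone. All DNS records inside the child account (such as &lt;code>api.dev.example.com&lt;/code>) are managed by the child account's zone, and the parent account does not need to configure them one by one. Cross-account delegation requires the Terraform on each side to manage its own zone, with the NS record living in the parent account's state.&lt;/p>
&lt;p>To check that the setup is correct: the name servers returned by &lt;code>dig dev.example.com NS&lt;/code> should be those of the child account's zone, not the parent's. If the parent's NS come back, delegation has not taken effect and DNS records under the subdomain will not resolve.&lt;/p>
&lt;h2 id="acm-憑證申請與-dns-驗證">ACM certificate request and DNS validation&lt;/h2>
&lt;p>AWS Certificate Manager (ACM) provides free TLS certificates, on condition that domain ownership is validated via DNS or email. DNS validation is the IaC-friendly option: ACM asks for a CNAME record under the specified domain, with the record value supplied by ACM, and once validation passes the certificate is issued automatically.&lt;/p>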





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-hcl" data-lang="hcl">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="k">resource&lt;/span> &lt;span class="s2">&amp;#34;aws_acm_certificate&amp;#34; &amp;#34;main&amp;#34;&lt;/span> {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="n"> domain_name&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;example.com&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="n"> subject_alternative_names&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;*.example.com&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="n"> validation_method&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;DNS&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl"> &lt;span class="k">lifecycle&lt;/span> {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="n"> create_before_destroy&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kt">true&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl"> }
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="n"> tags&lt;/span> &lt;span class="o">=&lt;/span>&lt;span class="n"> { Environment&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;production&amp;#34;&lt;/span> }
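&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">}&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Adding &lt;code>*.example.com&lt;/code> to &lt;code>subject_alternative_names&lt;/code> lets the same certificate cover every first-level subdomain (such as &lt;code>api.example.com&lt;/code> and &lt;code>admin.example.com&lt;/code>), saving a separate request per subdomain.&lt;/p></description><content:encoded><![CDATA[<p>HTTPS needs three components working together: a DNS zone that manages the domain's records, a TLS certificate that proves domain ownership, and an entry point (the ALB listener) that terminates TLS connections with that certificate. In IaC each of the three is an independent resource, but their creation order has dependencies: the zone must exist first, only then can the certificate be validated via DNS, and only after validation can it be attached to a listener. Writing this chain into Terraform, so that certificate request, validation, and renewal are all under version control, is the structural way to avoid finding out that a certificate expired because nobody was watching.</p>
<h2 id="route-53-hosted-zone">Route 53 Hosted Zone</h2>
<p>A hosted zone is the set of DNS records Route 53 uses to manage a domain. After the zone is created, Route 53 assigns a set of NS (Name Server) records, and DNS resolution for the domain is handled by these name servers.</p>
<h3 id="public-vs-private-zone">Public vs Private Zone</h3>
<p>A public hosted zone corresponds to a domain resolvable from the internet (such as <code>example.com</code>) and holds the A / CNAME / MX records for public-facing services. A private hosted zone is resolvable only inside the specified VPCs and is used for internal service discovery (for example <code>db.internal.example.com</code> resolving to the RDS private IP). Most projects need both: a public zone for external traffic and a private zone for internal service-to-service connections.</p>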





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_route53_zone&#34; &#34;public&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  name</span> <span class="o">=</span> <span class="s2">&#34;example.com&#34;</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  tags</span> <span class="o">=</span><span class="n"> { Environment</span> <span class="o">=</span> <span class="s2">&#34;production&#34;</span> }
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">}
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_route53_zone&#34; &#34;private&#34;</span> {
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="n">  name</span> <span class="o">=</span> <span class="s2">&#34;internal.example.com&#34;</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">  <span class="k">vpc</span> {
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">    vpc_id</span> <span class="o">=</span> <span class="k">aws_vpc</span><span class="p">.</span><span class="k">main</span><span class="p">.</span><span class="k">id</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">  }
</span></span><span class="line"><span class="ln">12</span><span class="cl">
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="n">  tags</span> <span class="o">=</span><span class="n"> { Environment</span> <span class="o">=</span> <span class="s2">&#34;production&#34;</span> }
</span></span><span class="line"><span class="ln">14</span><span class="cl">}</span></span></code></pre></div><h3 id="子網域-delegation">Subdomain delegation</h3>
<p>When dev / staging / prod each use a separate account, each account creates its own hosted zone to manage a subdomain (such as <code>dev.example.com</code>). The parent domain's zone needs a set of NS records pointing to the subdomain's zone; this is called delegation.</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_route53_record&#34; &#34;dev_ns&#34;</span> {
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="n">  zone_id</span> <span class="o">=</span> <span class="k">aws_route53_zone</span><span class="p">.</span><span class="k">public</span><span class="p">.</span><span class="k">zone_id</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="n">  name</span>    <span class="o">=</span> <span class="s2">&#34;dev.example.com&#34;</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="n">  type</span>    <span class="o">=</span> <span class="s2">&#34;NS&#34;</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="n">  ttl</span>     <span class="o">=</span> <span class="m">300</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="n">  records</span> <span class="o">=</span> <span class="k">aws_route53_zone</span><span class="p">.</span><span class="k">dev</span><span class="p">.</span><span class="k">name_servers</span>
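</span></span><span class="line"><span class="ln">7</span><span class="cl">}</span></span></code></pre></div><p>The delegation NS record points to the name servers of the child account's zone. All DNS records inside the child account (such as <code>api.dev.example.com</code>) are managed by the child account's zone, and the parent account does not need to configure them one by one. Cross-account delegation requires the Terraform on each side to manage its own zone, with the NS record living in the parent account's state. In that setup the <code>records</code> value has to be passed in from the child zone's <code>name_servers</code> (as a variable or via remote state), because a direct <code>aws_route53_zone.dev</code> reference only resolves when both zones are in the same configuration.</p>
<p>To check that the setup is correct: the name servers returned by <code>dig dev.example.com NS</code> should be those of the child account's zone, not the parent's. If the parent's NS come back, delegation has not taken effect and DNS records under the subdomain will not resolve.</p>
<h2 id="acm-憑證申請與-dns-驗證">ACM certificate request and DNS validation</h2>
<p>AWS Certificate Manager (ACM) provides free TLS certificates, on condition that domain ownership is validated via DNS or email. DNS validation is the IaC-friendly option: ACM asks for a CNAME record under the specified domain, with the record value supplied by ACM, and once validation passes the certificate is issued automatically.</p>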





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_acm_certificate&#34; &#34;main&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  domain_name</span>               <span class="o">=</span> <span class="s2">&#34;example.com&#34;</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  subject_alternative_names</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;*.example.com&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">  validation_method</span>         <span class="o">=</span> <span class="s2">&#34;DNS&#34;</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">  <span class="k">lifecycle</span> {
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="n">    create_before_destroy</span> <span class="o">=</span> <span class="kt">true</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">  }
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">  tags</span> <span class="o">=</span><span class="n"> { Environment</span> <span class="o">=</span> <span class="s2">&#34;production&#34;</span> }
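</span></span><span class="line"><span class="ln">11</span><span class="cl">}</span></span></code></pre></div><p>Adding <code>*.example.com</code> to <code>subject_alternative_names</code> lets one certificate cover every first-level subdomain (such as <code>api.example.com</code> and <code>admin.example.com</code>), so there is no need to request a certificate per subdomain. The wildcard matches a single label only; a name like <code>a.b.example.com</code> is not covered.</p>
<h3 id="dns-驗證記錄">DNS validation records</h3>
<p>Once the certificate is requested, ACM produces a set of CNAME records used for validation; the certificate is issued only after they resolve. Terraform can create these records in Route 53 automatically so that validation needs no manual step:</p>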





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_route53_record&#34; &#34;cert_validation&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  for_each</span> <span class="o">=</span> {
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">    for dvo in aws_acm_certificate.main.domain_validation_options : dvo.domain_name</span> <span class="o">=</span><span class="err">&gt;</span> {
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">      name</span>   <span class="o">=</span> <span class="k">dvo</span><span class="p">.</span><span class="k">resource_record_name</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="n">      record</span> <span class="o">=</span> <span class="k">dvo</span><span class="p">.</span><span class="k">resource_record_value</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">      type</span>   <span class="o">=</span> <span class="k">dvo</span><span class="p">.</span><span class="k">resource_record_type</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">    }
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">  }
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">  zone_id</span> <span class="o">=</span> <span class="k">aws_route53_zone</span><span class="p">.</span><span class="k">public</span><span class="p">.</span><span class="k">zone_id</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="n">  name</span>    <span class="o">=</span> <span class="k">each</span><span class="p">.</span><span class="k">value</span><span class="p">.</span><span class="k">name</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="n">  type</span>    <span class="o">=</span> <span class="k">each</span><span class="p">.</span><span class="k">value</span><span class="p">.</span><span class="k">type</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="n">  ttl</span>     <span class="o">=</span> <span class="m">300</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="n">  records</span> <span class="o">=</span> <span class="p">[</span><span class="k">each</span><span class="p">.</span><span class="k">value</span><span class="p">.</span><span class="k">record</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="n">  allow_overwrite</span> <span class="o">=</span> <span class="kt">true</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">}
</span></span><span class="line"><span class="ln">18</span><span class="cl">
</span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_acm_certificate_validation&#34; &#34;main&#34;</span> {
</span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="n">  certificate_arn</span>         <span class="o">=</span> <span class="k">aws_acm_certificate</span><span class="p">.</span><span class="k">main</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="n">  validation_record_fqdns</span> <span class="o">=</span> <span class="p">[</span><span class="k">for</span> <span class="k">record</span> <span class="k">in</span> <span class="k">aws_route53_record</span><span class="p">.</span><span class="k">cert_validation</span> <span class="err">:</span> <span class="k">record</span><span class="p">.</span><span class="k">fqdn</span><span class="p">]</span>
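</span></span><span class="line"><span class="ln">22</span><span class="cl">}</span></span></code></pre></div><p>The <code>aws_acm_certificate_validation</code> resource does not count as applied until ACM confirms validation. If a DNS record is wrong or the zone's NS delegation is broken, the resource blocks until it times out. Start troubleshooting by checking that the validation CNAME resolves from public DNS.</p>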
<h3 id="create_before_destroy">create_before_destroy</h3>
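<p><code>lifecycle { create_before_destroy = true }</code> makes Terraform create the new certificate before deleting the old one whenever the certificate has to be replaced (for example when adding a SAN or changing the domain). Without it, the default order is destroy first, then create. For a certificate still attached to an ALB listener, that either fails outright, because ACM refuses to delete a certificate that is in use, or, if the certificate was detached first, leaves HTTPS without a certificate until the new one is validated, which can take minutes to tens of minutes.</p>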
<h2 id="alb-https-listener">ALB HTTPS Listener</h2>
<p>Once the certificate is validated, attach it to the ALB's HTTPS listener:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_lb_listener&#34; &#34;https&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  load_balancer_arn</span> <span class="o">=</span> <span class="k">aws_lb</span><span class="p">.</span><span class="k">main</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  port</span>              <span class="o">=</span> <span class="m">443</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">  protocol</span>          <span class="o">=</span> <span class="s2">&#34;HTTPS&#34;</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="n">  ssl_policy</span>        <span class="o">=</span> <span class="s2">&#34;ELBSecurityPolicy-TLS13-1-2-2021-06&#34;</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">  certificate_arn</span>   <span class="o">=</span> <span class="k">aws_acm_certificate_validation</span><span class="p">.</span><span class="k">main</span><span class="p">.</span><span class="k">certificate_arn</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">  <span class="k">default_action</span> {
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="n">    type</span>             <span class="o">=</span> <span class="s2">&#34;forward&#34;</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">    target_group_arn</span> <span class="o">=</span> <span class="k">aws_lb_target_group</span><span class="p">.</span><span class="k">app</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">  }
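</span></span><span class="line"><span class="ln">12</span><span class="cl">}</span></span></code></pre></div><p><code>ssl_policy</code> determines the TLS versions and cipher suites. <code>ELBSecurityPolicy-TLS13-1-2-2021-06</code> supports TLS 1.2 and 1.3 and disables older protocol versions known to be insecure. The choice is a trade-off between compatibility and security: a TLS 1.3-only policy is the strictest but can lock out older clients, so most workloads use the combined 1.2 + 1.3 policy.</p>
<p><code>certificate_arn</code> references <code>aws_acm_certificate_validation</code> rather than <code>aws_acm_certificate</code> directly, which guarantees the listener is created only after the certificate has passed validation.</p>
<h3 id="http--https-重導">HTTP → HTTPS redirect</h3>
<p>Alongside it, create an HTTP listener that redirects all port 80 traffic to 443:</p>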





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_lb_listener&#34; &#34;http_redirect&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  load_balancer_arn</span> <span class="o">=</span> <span class="k">aws_lb</span><span class="p">.</span><span class="k">main</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  port</span>              <span class="o">=</span> <span class="m">80</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">  protocol</span>          <span class="o">=</span> <span class="s2">&#34;HTTP&#34;</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">  <span class="k">default_action</span> {
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="n">    type</span> <span class="o">=</span> <span class="s2">&#34;redirect&#34;</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">    <span class="k">redirect</span> {
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="n">      port</span>        <span class="o">=</span> <span class="s2">&#34;443&#34;</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">      protocol</span>    <span class="o">=</span> <span class="s2">&#34;HTTPS&#34;</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="n">      status_code</span> <span class="o">=</span> <span class="s2">&#34;HTTP_301&#34;</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">    }
</span></span><span class="line"><span class="ln">13</span><span class="cl">  }
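</span></span><span class="line"><span class="ln">14</span><span class="cl">}</span></span></code></pre></div><p>A 301 permanent redirect lets browsers remember the redirect and go straight to HTTPS on later visits. The security group must still allow inbound port 80, otherwise the redirect never happens: the client's connection to port 80 is blocked and it sees a connection failure instead of a redirect response.</p>
<h2 id="多網域與-san-憑證">Multiple domains and SAN certificates</h2>
<p>By default an ACM certificate supports up to 10 domain names (SANs, Subject Alternative Names); the quota can be raised on request. For most setups the primary domain plus a wildcard (<code>example.com</code> + <code>*.example.com</code>) is enough. If there are several distinct root domains (such as <code>example.com</code> and <code>example-app.com</code>), they can go on the same certificate:</p>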
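<p>The port 80 requirement can be written next to the listeners. This is a sketch that assumes the ALB uses a security group declared as <code>aws_security_group.alb</code>; that name and the open CIDR range are assumptions for illustration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Assumes aws_security_group.alb is the security group attached to the ALB (hypothetical name).
resource &#34;aws_vpc_security_group_ingress_rule&#34; &#34;alb_http&#34; {
  security_group_id = aws_security_group.alb.id
  description       = &#34;HTTP, kept open only so the listener can answer with a redirect&#34;
  cidr_ipv4         = &#34;0.0.0.0/0&#34;
  ip_protocol       = &#34;tcp&#34;
  from_port         = 80
  to_port           = 80
}

resource &#34;aws_vpc_security_group_ingress_rule&#34; &#34;alb_https&#34; {
  security_group_id = aws_security_group.alb.id
  description       = &#34;HTTPS traffic served by the 443 listener&#34;
  cidr_ipv4         = &#34;0.0.0.0/0&#34;
  ip_protocol       = &#34;tcp&#34;
  from_port         = 443
  to_port           = 443
}
</code></pre></div><p>Keeping both rules beside the listeners makes it harder to remove port 80 later without noticing that the redirect depends on it.</p>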





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_acm_certificate&#34; &#34;multi_domain&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  domain_name</span>               <span class="o">=</span> <span class="s2">&#34;example.com&#34;</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  subject_alternative_names</span> <span class="o">=</span> <span class="p">[</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">    <span class="s2">&#34;*.example.com&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">    <span class="s2">&#34;example-app.com&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">    <span class="s2">&#34;*.example-app.com&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">  <span class="p">]</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="n">  validation_method</span> <span class="o">=</span> <span class="s2">&#34;DNS&#34;</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl">  <span class="k">lifecycle</span> {
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="n">    create_before_destroy</span> <span class="o">=</span> <span class="kt">true</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">  }
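</span></span><span class="line"><span class="ln">13</span><span class="cl">}</span></span></code></pre></div><p>Every domain on the certificate must pass DNS validation (a wildcard name shares its validation record with its base domain). If the domains live in different hosted zones, each validation record has to be created in the zone that hosts its domain.</p>
<p>When more than 10 names are needed, or when certificates for different domains have to be managed independently (different teams owning different domains), attach extra certificates with <code>aws_lb_listener_certificate</code>:</p>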
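<p>One way to send each validation record to the right zone is a lookup map keyed by domain name. The sketch below assumes a second hosted zone declared as <code>aws_route53_zone.app</code> for <code>example-app.com</code>; that resource and the <code>validation_zone_ids</code> local are hypothetical names introduced for illustration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl">locals {
  # Hypothetical mapping from each name on the certificate to its hosted zone.
  validation_zone_ids = {
    &#34;example.com&#34;       = aws_route53_zone.public.zone_id
    &#34;*.example.com&#34;     = aws_route53_zone.public.zone_id
    &#34;example-app.com&#34;   = aws_route53_zone.app.zone_id
    &#34;*.example-app.com&#34; = aws_route53_zone.app.zone_id
  }
}

resource &#34;aws_route53_record&#34; &#34;multi_domain_validation&#34; {
  for_each = {
    for dvo in aws_acm_certificate.multi_domain.domain_validation_options : dvo.domain_name =&gt; {
      name    = dvo.resource_record_name
      record  = dvo.resource_record_value
      type    = dvo.resource_record_type
      zone_id = local.validation_zone_ids[dvo.domain_name]
    }
  }

  zone_id = each.value.zone_id
  name    = each.value.name
  type    = each.value.type
  ttl     = 300
  records = [each.value.record]

  # A wildcard and its base domain yield the same CNAME, so overwriting must be allowed.
  allow_overwrite = true
}
</code></pre></div><p>A name missing from the map fails at plan time with an invalid index error, which is arguably preferable to silently creating the record in the wrong zone.</p>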





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_lb_listener_certificate&#34; &#34;additional&#34;</span> {
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="n">  listener_arn</span>    <span class="o">=</span> <span class="k">aws_lb_listener</span><span class="p">.</span><span class="k">https</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="n">  certificate_arn</span> <span class="o">=</span> <span class="k">aws_acm_certificate</span><span class="p">.</span><span class="k">other_domain</span><span class="p">.</span><span class="k">arn</span>
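</span></span><span class="line"><span class="ln">4</span><span class="cl">}</span></span></code></pre></div><p>The ALB picks the matching certificate automatically based on SNI (Server Name Indication).</p>
<h2 id="穩定的-dns-別名記錄">Stable DNS alias records</h2>
<p>An ALB's DNS name changes when the ALB is recreated, so public-facing services should not use it directly. Use a Route 53 alias record to point a stable domain name at the ALB:</p>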





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_route53_record&#34; &#34;app&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  zone_id</span> <span class="o">=</span> <span class="k">aws_route53_zone</span><span class="p">.</span><span class="k">public</span><span class="p">.</span><span class="k">zone_id</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  name</span>    <span class="o">=</span> <span class="s2">&#34;api.example.com&#34;</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">  type</span>    <span class="o">=</span> <span class="s2">&#34;A&#34;</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">  <span class="k">alias</span> {
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="n">    name</span>                   <span class="o">=</span> <span class="k">aws_lb</span><span class="p">.</span><span class="k">main</span><span class="p">.</span><span class="k">dns_name</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="n">    zone_id</span>                <span class="o">=</span> <span class="k">aws_lb</span><span class="p">.</span><span class="k">main</span><span class="p">.</span><span class="k">zone_id</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="n">    evaluate_target_health</span> <span class="o">=</span> <span class="kt">true</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">  }
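</span></span><span class="line"><span class="ln">11</span><span class="cl">}</span></span></code></pre></div><p>Queries against alias records that target AWS resources are free (standard A/CNAME queries cost about $0.40 per million), and alias records work at the zone apex (such as <code>example.com</code>), where a regular CNAME is not allowed. <code>evaluate_target_health = true</code> lets Route 53 stop answering with this record when the ALB is unhealthy, which is used together with failover routing.</p>
<h2 id="憑證續期監控">Monitoring certificate renewal</h2>
<p>DNS-validated ACM certificates renew automatically, as long as the certificate is in use by an AWS service and the validation CNAME record still exists and resolves. While that record is in place, ACM starts renewal 60 days before expiry.</p>
<p>Common causes of failed automatic renewal: the validation CNAME was deleted by hand, the hosted zone's NS delegation broke, or the zone was deleted and recreated so its name servers changed. Use a CloudWatch alarm on days-to-expiry to get notified early when renewal fails:</p>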
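<p>For the failover pairing mentioned above, a sketch could look like the following. It assumes a second load balancer declared as <code>aws_lb.standby</code> (a hypothetical resource, not defined in this article), and the two records would take the place of the simple <code>api.example.com</code> record rather than sit beside it:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Primary answer: returned while the main ALB has healthy targets.
resource &#34;aws_route53_record&#34; &#34;app_primary&#34; {
  zone_id        = aws_route53_zone.public.zone_id
  name           = &#34;api.example.com&#34;
  type           = &#34;A&#34;
  set_identifier = &#34;primary&#34;

  failover_routing_policy {
    type = &#34;PRIMARY&#34;
  }

  alias {
    name                   = aws_lb.main.dns_name
    zone_id                = aws_lb.main.zone_id
    evaluate_target_health = true
  }
}

# Secondary answer: returned when the primary is evaluated as unhealthy.
# aws_lb.standby is a hypothetical standby load balancer.
resource &#34;aws_route53_record&#34; &#34;app_secondary&#34; {
  zone_id        = aws_route53_zone.public.zone_id
  name           = &#34;api.example.com&#34;
  type           = &#34;A&#34;
  set_identifier = &#34;secondary&#34;

  failover_routing_policy {
    type = &#34;SECONDARY&#34;
  }

  alias {
    name                   = aws_lb.standby.dns_name
    zone_id                = aws_lb.standby.zone_id
    evaluate_target_health = true
  }
}
</code></pre></div><p>With alias targets, the health signal comes from the ALB's target groups, so no separate Route 53 health check is declared in this sketch.</p>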





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_cloudwatch_metric_alarm&#34; &#34;cert_expiry&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  alarm_name</span>          <span class="o">=</span> <span class="s2">&#34;acm-cert-expiry-${aws_acm_certificate.main.domain_name}&#34;</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  comparison_operator</span> <span class="o">=</span> <span class="s2">&#34;LessThanThreshold&#34;</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">  evaluation_periods</span>  <span class="o">=</span> <span class="m">1</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="n">  metric_name</span>         <span class="o">=</span> <span class="s2">&#34;DaysToExpiry&#34;</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">  namespace</span>           <span class="o">=</span> <span class="s2">&#34;AWS/CertificateManager&#34;</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="n">  period</span>              <span class="o">=</span> <span class="m">86400</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="n">  statistic</span>           <span class="o">=</span> <span class="s2">&#34;Minimum&#34;</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="n">  threshold</span>           <span class="o">=</span> <span class="m">30</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">  alarm_actions</span>       <span class="o">=</span> <span class="p">[</span><span class="k">aws_sns_topic</span><span class="p">.</span><span class="k">oncall</span><span class="p">.</span><span class="k">arn</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="n">  dimensions</span> <span class="o">=</span> {
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="n">    CertificateArn</span> <span class="o">=</span> <span class="k">aws_acm_certificate</span><span class="p">.</span><span class="k">main</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">  }
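</span></span><span class="line"><span class="ln">15</span><span class="cl">}</span></span></code></pre></div><p>This alarm fires when the certificate has fewer than 30 days left. Normally ACM begins renewal 60 days ahead of expiry and finishes well before the 30-day mark, so a 30-day alert means automatic renewal has failed and someone needs to check the validation records.</p>
<h2 id="跨分類引用">Cross-references</h2>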
<ul>
<li>→ <a href="/blog/infra/05-core-services/loadbalancer-alb/" data-link-title="入口上 IaC — ALB、TLS 與健康檢查" data-link-desc="Application Load Balancer 的 listener、target group、健康檢查閾值設計，以及用 ACM 把 TLS 憑證的簽發、驗證與掛載整條鏈寫進版本控制">Ingress as IaC: ALB</a>: full configuration of ALB listeners, target groups, and health checks</li>
<li>→ <a href="/blog/infra/03-network-foundation/" data-link-title="模組三：網路地基 — VPC 與分層" data-link-desc="VPC、public / private subnet 切分、route table、NAT、security group 設計">Module 3: Network foundation</a>: design of the public subnets and security groups the ALB lives in</li>
<li>→ <a href="/blog/infra/07-infra-as-pr/" data-link-title="模組七：infra 走 PR 流程與自動化護欄" data-link-desc="infra 變更走 PR → plan → review diff → 合併 → apply，配 fmt / validate / tflint / checkov / tfsec 與 Atlantis 自動化，讓基礎設施可審查、可回溯、可交接">Module 7: Infra through the PR workflow</a>: certificate and DNS changes go through PR review</li>
</ul>
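]]></content:encoded></item><item><title>ECS Fargate cost analysis and optimization</title><link>https://tarrragon.github.io/blog/infra/05-core-services/ecs-fargate-cost-optimization/</link><pubDate>Fri, 26 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/infra/05-core-services/ecs-fargate-cost-optimization/</guid><description>&lt;p>Fargate outsources the operational side of compute to AWS: no EC2 instances to manage, no AMI updates, no capacity provider scaling logic. The price of that simplicity is a higher unit cost. When a service is small or its traffic is unstable, the simplicity is worth paying for; when a service is large, steady, and always on, the unit-cost advantage of the EC2 launch type adds up to a level that justifies switching. This article aims to help readers judge where their service sits on the cost curve and which levers are available.&lt;/p>
&lt;h2 id="fargate-計價模型">Fargate pricing model&lt;/h2>
&lt;p>Fargate bills each task separately for vCPU-hours and memory-hours, from the moment the container image pull begins until the task terminates. Usage is metered per second with a one-minute minimum, so a task that runs for less than a minute is charged for a full minute.&lt;/p>
&lt;p>Unit prices for ap-northeast-1 (Tokyo), as an order-of-magnitude reference at the time of writing; the AWS pricing page is authoritative:&lt;/p>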
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Resource&lt;/th>
 &lt;th>Unit price (per hour)&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>1 vCPU&lt;/td>
 &lt;td>~$0.05056&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>1 GB RAM&lt;/td>
 &lt;td>~$0.00553&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
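&lt;p>A 1 vCPU / 2 GB task running continuously for a month (730 hours) costs about $0.05056 × 730 + $0.00553 × 2 × 730 ≈ $44.98. This figure is the baseline for every comparison that follows.&lt;/p>
&lt;p>Fargate billing has another frequently overlooked aspect: task sizes can only be chosen from AWS's predefined vCPU/memory combinations. An application that needs 0.3 vCPU / 512 MB does not fit the smallest configuration (0.25 vCPU / 0.5 GB) and has to take 0.5 vCPU / 1 GB, paying for 0.2 vCPU and 0.5 GB it does not use. This step-wise waste is proportionally largest on small task sizes.&lt;/p>
&lt;h2 id="fargate-vs-ec2-launch-type-的成本比較">Cost comparison: Fargate vs EC2 launch type&lt;/h2>
&lt;p>The EC2 launch type has a different cost structure: you pay for EC2 instance hours regardless of how many tasks run on them, and ECS itself is free. What you save is the Fargate markup; what you take on is instance management (AMI updates, capacity provider configuration, and instances that keep billing while idle).&lt;/p>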
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Scenario&lt;/th>
 &lt;th>Fargate monthly cost&lt;/th>
 &lt;th>EC2 monthly cost (t3.medium unless noted)&lt;/th>
 &lt;th>Difference&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>1 task, 1 vCPU / 2 GB, always on&lt;/td>
 &lt;td>~$45&lt;/td>
 &lt;td>~$30 (shared instance)&lt;/td>
 &lt;td>+50%&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>5 tasks, 0.5 vCPU / 1 GB each&lt;/td>
 &lt;td>~$113&lt;/td>
 &lt;td>~$30 (assumes one t3.medium holds them)&lt;/td>
 &lt;td>+277%&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>20 tasks, 1 vCPU / 2 GB each&lt;/td>
 &lt;td>~$900&lt;/td>
 &lt;td>~$240 (4 × t3.xlarge)&lt;/td>
 &lt;td>+275%&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Highly variable traffic, 10 tasks at peak / 1 off-peak&lt;/td>
 &lt;td>~$180 (weighted average)&lt;/td>
 &lt;td>~$150 (peak capacity must be provisioned)&lt;/td>
 &lt;td>+20%&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
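&lt;p>Key points for reading the table:&lt;/p>
&lt;ul>
&lt;li>With few, always-on tasks, Fargate's premium is highest in percentage terms (+50% to +277%) but small in absolute terms (roughly $15 to $83 per month), which is not worth taking on the operational burden of instance management&lt;/li>
&lt;li>With many, always-on tasks, EC2's absolute savings become substantial ($660/month), and the operational cost of switching pays off&lt;/li>
&lt;li>With highly variable traffic, Fargate's advantage is pay-per-use: when the task count drops off-peak, billing stops, whereas idle EC2 instances keep billing. The larger the swings, the closer Fargate's cost-effectiveness comes to EC2's, or exceeds it&lt;/li>
&lt;/ul>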
&lt;h2 id="fargate-spot">Fargate Spot&lt;/h2>
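&lt;p>Fargate Spot runs on spare AWS capacity at roughly 30% of the on-demand price (a discount of about 70%). The trade-off is that AWS can reclaim the capacity at any time: the task receives a SIGTERM and is then stopped.&lt;/p></description><content:encoded><![CDATA[<p>Fargate outsources the operational side of compute to AWS: no EC2 instances to manage, no AMI updates, no capacity provider scaling logic. The price of that simplicity is a higher unit cost. When a service is small or its traffic is unstable, the simplicity is worth paying for; when a service is large, steady, and always on, the unit-cost advantage of the EC2 launch type adds up to a level that justifies switching. This article aims to help readers judge where their service sits on the cost curve and which levers are available.</p>
<h2 id="fargate-計價模型">Fargate pricing model</h2>
<p>Fargate bills each task separately for vCPU-hours and memory-hours, from the moment the container image pull begins until the task terminates. Usage is metered per second with a one-minute minimum, so a task that runs for less than a minute is charged for a full minute.</p>
<p>Unit prices for ap-northeast-1 (Tokyo), as an order-of-magnitude reference at the time of writing; the AWS pricing page is authoritative:</p>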
<table>
  <thead>
      <tr>
          <th>Resource</th>
          <th>Unit price (per hour)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1 vCPU</td>
          <td>~$0.05056</td>
      </tr>
      <tr>
          <td>1 GB RAM</td>
          <td>~$0.00553</td>
      </tr>
  </tbody>
</table>
<p>A 1 vCPU / 2 GB task running continuously for a month (730 hours) costs about $0.05056 × 730 + $0.00553 × 2 × 730 ≈ $44.98. This figure is the baseline for every comparison that follows.</p>
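<p>The same arithmetic can be kept in Terraform locals so that other task sizes are easy to compare. This is only a sketch: the local and output names are invented for illustration, and the unit prices are the indicative ones from the table above, not values fetched from AWS:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl">locals {
  # Indicative ap-northeast-1 unit prices copied from the table above.
  fargate_vcpu_usd_per_hour = 0.05056
  fargate_gb_usd_per_hour   = 0.00553
  hours_per_month           = 730

  # Task size under evaluation; change these two values to compare sizes.
  task_vcpu      = 1
  task_memory_gb = 2

  # Monthly cost = (vCPU charge + memory charge) per hour, times hours in the month.
  task_monthly_usd = (
    local.task_vcpu * local.fargate_vcpu_usd_per_hour +
    local.task_memory_gb * local.fargate_gb_usd_per_hour
  ) * local.hours_per_month
}

output &#34;task_monthly_usd&#34; {
  # Rounded to cents for display.
  value = format(&#34;%.2f&#34;, local.task_monthly_usd)
}
</code></pre></div><p>Setting <code>task_vcpu</code> to 0.5 and <code>task_memory_gb</code> to 1 reproduces the per-task figure behind the five-task row of the comparison table.</p>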
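<p>Fargate billing has another frequently overlooked aspect: task sizes can only be chosen from AWS's predefined vCPU/memory combinations. An application that needs 0.3 vCPU / 512 MB does not fit the smallest configuration (0.25 vCPU / 0.5 GB) and has to take 0.5 vCPU / 1 GB, paying for 0.2 vCPU and 0.5 GB it does not use. This step-wise waste is proportionally largest on small task sizes.</p>
<h2 id="fargate-vs-ec2-launch-type-的成本比較">Cost comparison: Fargate vs EC2 launch type</h2>
<p>The EC2 launch type has a different cost structure: you pay for EC2 instance hours regardless of how many tasks run on them, and ECS itself is free. What you save is the Fargate markup; what you take on is instance management (AMI updates, capacity provider configuration, and instances that keep billing while idle).</p>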
<table>
  <thead>
      <tr>
          <th>Scenario</th>
          <th>Fargate monthly cost</th>
          <th>EC2 monthly cost (t3.medium unless noted)</th>
          <th>Difference</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1 task, 1 vCPU / 2 GB, always on</td>
          <td>~$45</td>
          <td>~$30 (shared instance)</td>
          <td>+50%</td>
      </tr>
      <tr>
          <td>5 tasks, 0.5 vCPU / 1 GB each</td>
          <td>~$113</td>
          <td>~$30 (assumes one t3.medium holds them)</td>
          <td>+277%</td>
      </tr>
      <tr>
          <td>20 tasks, 1 vCPU / 2 GB each</td>
          <td>~$900</td>
          <td>~$240 (4 × t3.xlarge)</td>
          <td>+275%</td>
      </tr>
      <tr>
          <td>Highly variable traffic, 10 tasks at peak / 1 off-peak</td>
          <td>~$180 (weighted average)</td>
          <td>~$150 (peak capacity must be provisioned)</td>
          <td>+20%</td>
      </tr>
  </tbody>
</table>
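<p>Key points for reading the table:</p>
<ul>
<li>With few, always-on tasks, Fargate's premium is highest in percentage terms (+50% to +277%) but small in absolute terms (roughly $15 to $83 per month), which is not worth taking on the operational burden of instance management</li>
<li>With many, always-on tasks, EC2's absolute savings become substantial ($660/month), and the operational cost of switching pays off</li>
<li>With highly variable traffic, Fargate's advantage is pay-per-use: when the task count drops off-peak, billing stops, whereas idle EC2 instances keep billing. The larger the swings, the closer Fargate's cost-effectiveness comes to EC2's, or exceeds it</li>
</ul>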
<h2 id="fargate-spot">Fargate Spot</h2>
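<p>Fargate Spot runs on spare AWS capacity at roughly 30% of the on-demand price (a discount of about 70%). The trade-off is that AWS can reclaim the capacity at any time: the task receives a SIGTERM and is then stopped.</p>
<p>When it fits: the task can shut down gracefully within 120 seconds, and either the application retries or an upstream load balancer automatically removes unhealthy targets. Batch jobs, background workers, and interruptible queue consumers are typical Spot candidates. Public-facing APIs usually run a mix: baseline capacity on on-demand, elastic scale-out on Spot.</p>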





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_ecs_service&#34; &#34;api&#34;</span> {<span class="c1">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1">  # ...
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="c1"></span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">  <span class="k">capacity_provider_strategy</span> {
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="n">    capacity_provider</span> <span class="o">=</span> <span class="s2">&#34;FARGATE&#34;</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">    weight</span>            <span class="o">=</span> <span class="m">1</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="n">    base</span>              <span class="o">=</span> <span class="m">2</span><span class="c1">  # 至少 2 個 on-demand task 保底
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span>  }
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl">  <span class="k">capacity_provider_strategy</span> {
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="n">    capacity_provider</span> <span class="o">=</span> <span class="s2">&#34;FARGATE_SPOT&#34;</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="n">    weight</span>            <span class="o">=</span> <span class="m">3</span><span class="c1">  # 擴張時 3/4 的 task 用 Spot
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1"></span>  }
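</span></span><span class="line"><span class="ln">14</span><span class="cl">}</span></span></code></pre></div><p><code>base = 2</code> keeps at least two on-demand tasks running (they cannot be reclaimed), and the <code>weight</code> ratio sends most tasks beyond the base to Spot. When an interruption happens, the ECS service scheduler launches replacement tasks according to the same strategy; it does not automatically fall back to on-demand when Spot capacity is unavailable. Replacement also takes time (task start plus passing health checks), so service capacity dips briefly in the meantime.</p>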
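<p>The 120-second graceful shutdown condition for Spot maps to the container-level <code>stopTimeout</code> setting. The following is a minimal sketch, assuming a hypothetical <code>worker</code> task definition and a placeholder image name; a real definition would also carry roles, logging, and networking details:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Hypothetical task definition for an interruptible worker.
resource &#34;aws_ecs_task_definition&#34; &#34;worker&#34; {
  family                   = &#34;worker&#34;
  requires_compatibilities = [&#34;FARGATE&#34;]
  network_mode             = &#34;awsvpc&#34;
  cpu                      = 256
  memory                   = 512

  container_definitions = jsonencode([
    {
      name      = &#34;worker&#34;
      image     = &#34;example/worker:latest&#34; # placeholder image
      essential = true

      # Seconds ECS waits between SIGTERM and SIGKILL; 120 is the Fargate maximum.
      stopTimeout = 120
    }
  ])
}
</code></pre></div><p>The setting only buys time. The application still has to trap SIGTERM, stop taking new work, and finish or hand back in-flight work within that window.</p>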
<h2 id="compute-savings-plans">Compute Savings Plans</h2>
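<p>Compute Savings Plans are a commitment-based discount that covers Fargate (as well as EC2 and Lambda): commit to a fixed compute spend of X dollars per hour for a 1-year or 3-year term in exchange for a discount (roughly 20% for 1 year and 40% for 3 years, depending on the plan).</p>
<p>The key decision is how high to set the commitment ($/hr) relative to actual usage. A conservative choice is 80% of the lowest usage over the past 3 months: that portion is almost certain to be consumed and earns the discount, while anything above the commitment is billed at on-demand rates automatically, so nothing is wasted.</p>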





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Fargate spend trend over roughly the last 90 days</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">aws ce get-cost-and-usage <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  --time-period <span class="nv">Start</span><span class="o">=</span>2026-03-01,End<span class="o">=</span>2026-06-01 <span class="se">\
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="se"></span>  --granularity MONTHLY <span class="se">\
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="se"></span>  --metrics <span class="s2">&#34;UnblendedCost&#34;</span> <span class="se">\
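</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="se"></span>  --filter <span class="s1">&#39;{&#34;Dimensions&#34;:{&#34;Key&#34;:&#34;SERVICE&#34;,&#34;Values&#34;:[&#34;Amazon Elastic Container Service&#34;]}}&#39;</span></span></span></code></pre></div><p>Savings Plans and Fargate Spot complement each other, but they do not stack on the same task: Savings Plans discounts apply to on-demand usage, not to Spot. Use Savings Plans to lower the cost of the always-on baseline and Spot to lower the cost of elastic scale-out; together the two can bring the blended Fargate unit price close to EC2 on-demand.</p>
<h2 id="task-規格的-rightsizing">Rightsizing task specifications</h2>
<p>If a Fargate task's vCPU and memory are set too high, the excess is billed every hour. Rightsizing aims to match the task size to actual usage while leaving enough safety margin.</p>
<h3 id="量測實際使用量">Measuring actual usage</h3>
<p>With CloudWatch Container Insights enabled, CPU and memory usage for each task are reported automatically. <code>CpuUtilized</code> is reported in CPU units, so divide it by <code>CpuReserved</code> to get a utilization percentage. Look at the p95 over 7 to 14 days:</p>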
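<p>The conservative commitment rule (80% of the lowest recent usage) is simple enough to write down as a calculation. In the sketch below, the three monthly figures are made-up sample numbers standing in for the Cost Explorer output, and all local names are hypothetical:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl">locals {
  # Hypothetical monthly Fargate on-demand spend in USD for the last three months.
  monthly_spend_usd = [620, 580, 655]
  hours_in_month    = 730

  # Convert the lowest month into an hourly run rate.
  lowest_hourly_usd = min(local.monthly_spend_usd...) / local.hours_in_month

  # Commit to 80% of that floor; usage above it stays on on-demand rates.
  commitment_usd_per_hour = local.lowest_hourly_usd * 0.8
}

output &#34;savings_plan_commitment_usd_per_hour&#34; {
  value = format(&#34;%.3f&#34;, local.commitment_usd_per_hour)
}
</code></pre></div><p>Treat the result as a starting point for discussion; the Savings Plans recommendations in Cost Explorer remain the reference for the final commitment.</p>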





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># p95 CPU of the ECS service over the last 7 days</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">aws cloudwatch get-metric-statistics <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  --namespace ECS/ContainerInsights <span class="se">\
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="se"></span>  --metric-name CpuUtilized <span class="se">\
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="se"></span>  --dimensions <span class="nv">Name</span><span class="o">=</span>ServiceName,Value<span class="o">=</span>api <span class="nv">Name</span><span class="o">=</span>ClusterName,Value<span class="o">=</span>prod <span class="se">\
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="se"></span>  --start-time 2026-06-19T00:00:00Z <span class="se">\
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="se"></span>  --end-time 2026-06-26T00:00:00Z <span class="se">\
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="se"></span>  --period <span class="m">3600</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="se"></span>  --extended-statistics p95</span></span></code></pre></div><h3 id="判斷調整方向">Deciding which way to adjust</h3>
<table>
  <thead>
      <tr>
          <th>p95 utilization</th>
          <th>Assessment</th>
          <th>Action</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CPU &lt; 30%</td>
          <td>Over-provisioned, clearly wasteful</td>
          <td>Step down one vCPU tier</td>
      </tr>
      <tr>
          <td>CPU 30-70%</td>
          <td>Reasonable range, enough headroom for peaks</td>
          <td>Keep as is</td>
      </tr>
      <tr>
          <td>CPU &gt; 80%</td>
          <td>Not enough headroom; peaks may trigger throttling</td>
          <td>Step up one vCPU tier or add tasks</td>
      </tr>
      <tr>
          <td>Memory &lt; 40%</td>
          <td>Over-provisioned</td>
          <td>Step down one memory tier</td>
      </tr>
      <tr>
          <td>Memory &gt; 80%</td>
          <td>Risk of OOM kill</td>
          <td>Step up one memory tier</td>
      </tr>
  </tbody>
</table>
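<p>After each adjustment, watch for 3 to 5 days to confirm there is no performance regression before starting the next round. Change only one dimension at a time (CPU or memory); changing both at once makes it impossible to attribute the effect.</p>
<h3 id="fargate-可選的規格組合">Valid Fargate size combinations</h3>
<p>Fargate vCPU and memory cannot be paired arbitrarily. The commonly used combinations are:</p>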
<table>
  <thead>
      <tr>
          <th>vCPU</th>
          <th>Allowed memory range</th>
          <th>Typical use</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0.25</td>
          <td>0.5 / 1 / 2 GB</td>
          <td>Lightweight sidecar, cron job</td>
      </tr>
      <tr>
          <td>0.5</td>
          <td>1 / 2 / 3 / 4 GB</td>
          <td>Small API, worker</td>
      </tr>
      <tr>
          <td>1</td>
          <td>2 / 3 / 4 / 5 / 6 / 7 / 8 GB</td>
          <td>Standard API, mid-size worker</td>
      </tr>
      <tr>
          <td>2</td>
          <td>4 to 16 GB (1 GB increments)</td>
          <td>High-load API, batch processing</td>
      </tr>
      <tr>
          <td>4</td>
          <td>8 to 30 GB (1 GB increments)</td>
          <td>Data processing, ML inference</td>
      </tr>
  </tbody>
</table>
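<p>When choosing, start from the smallest combination that runs at all, then adjust after measuring with Container Insights. A common source of waste is setting every task to 1 vCPU / 2 GB, so that a sidecar using only 0.1 vCPU / 256 MB gets the same spec as the main service.</p>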
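<p>One way to keep sizing per task rather than copy-pasted is a size map that each task definition reads from. The following is a sketch under assumptions: the task names (<code>api</code>, <code>nightly-cron</code>), the variables <code>execution_role_arn</code> and <code>images</code>, and the chosen sizes are all hypothetical. Values are in CPU units (1024 = 1 vCPU) and MiB, and each pair is one of the combinations from the table above.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Hypothetical inputs for this sketch.
variable "execution_role_arn" { type = string }
variable "images" { type = map(string) } # task name to image URI

locals {
  # One entry per task; a rightsizing change becomes a one-line diff here.
  task_sizes = {
    "api"          = { cpu = 1024, memory = 2048 } # 1 vCPU / 2 GB
    "nightly-cron" = { cpu = 256, memory = 512 }   # 0.25 vCPU / 0.5 GB
  }
}

resource "aws_ecs_task_definition" "task" {
  for_each = local.task_sizes

  family                   = each.key
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = each.value.cpu
  memory                   = each.value.memory
  execution_role_arn       = var.execution_role_arn

  container_definitions = jsonencode([
    {
      name      = each.key
      image     = var.images[each.key]
      essential = true
    }
  ])
}
</code></pre></div>
<p>Keeping the sizes in one map also makes the "one dimension per round" rule easy to review: a pull request that touches both <code>cpu</code> and <code>memory</code> of the same entry stands out.</p>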
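<h2 id="何時從-fargate-切到-ec2">When to move from Fargate to EC2</h2>
<p>The decision to switch depends not only on the cost difference but also on operational capacity. The EC2 launch type requires managing AMI updates (security patches), instance draining (moving tasks off an instance before shutting it down during a rolling update), the capacity provider's scaling logic, and the instances' security groups and IAM roles.</p>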
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Stay on Fargate</th>
          <th>Move to EC2</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Monthly cost difference</td>
          <td>&lt; $200</td>
          <td>&gt; $500, sustained for 3 months</td>
      </tr>
      <tr>
          <td>Team operations capacity</td>
          <td>No one dedicated to managing instances</td>
          <td>Has a platform engineer or DevOps</td>
      </tr>
      <tr>
          <td>Traffic pattern</td>
          <td>Highly variable, with clear off-peak periods</td>
          <td>Steady, running 24/7</td>
      </tr>
      <tr>
          <td>GPU requirement</td>
          <td>Not needed</td>
          <td>Needed (Fargate does not support GPU)</td>
      </tr>
      <tr>
          <td>Startup speed</td>
          <td>Cold start is acceptable</td>
          <td>Needs &lt;1s startup (EC2 instance already online)</td>
      </tr>
  </tbody>
</table>
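<p>A mixed deployment is a common middle path: run baseline capacity on EC2 (lower cost, faster startup) and peak elasticity on Fargate Spot (pay per use, no capacity to reserve). This means maintaining two kinds of capacity provider at once, which adds complexity.</p>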
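<p>A single ECS capacity provider strategy cannot mix Auto Scaling group providers with Fargate providers, so one way to sketch the mixed deployment is two services behind the same target group. Everything named here is an assumption for illustration: the variables, the service names, the container name <code>api</code>, port 8080, and a baseline of 4 tasks. The sketch also assumes <code>FARGATE_SPOT</code> is already associated with the cluster and that the task definition uses <code>awsvpc</code> networking and is compatible with both EC2 and Fargate.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># All variables below are hypothetical inputs for this sketch.
variable "cluster_id" { type = string }
variable "task_definition_arn" { type = string }
variable "ec2_capacity_provider" { type = string } # name of an Auto Scaling group capacity provider
variable "subnet_ids" { type = list(string) }
variable "security_group_ids" { type = list(string) }
variable "target_group_arn" { type = string }

# Baseline: a fixed number of tasks on EC2 instances.
resource "aws_ecs_service" "api_base" {
  name            = "api-base"
  cluster         = var.cluster_id
  task_definition = var.task_definition_arn
  desired_count   = 4

  capacity_provider_strategy {
    capacity_provider = var.ec2_capacity_provider
    weight            = 1
  }

  network_configuration {
    subnets         = var.subnet_ids
    security_groups = var.security_group_ids
  }

  load_balancer {
    target_group_arn = var.target_group_arn
    container_name   = "api"
    container_port   = 8080
  }
}

# Burst: starts at zero; an auto-scaling policy (not shown) raises desired_count at peak.
resource "aws_ecs_service" "api_burst" {
  name            = "api-burst"
  cluster         = var.cluster_id
  task_definition = var.task_definition_arn
  desired_count   = 0

  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 1
  }

  network_configuration {
    subnets         = var.subnet_ids
    security_groups = var.security_group_ids
  }

  load_balancer {
    target_group_arn = var.target_group_arn
    container_name   = "api"
    container_port   = 8080
  }

  # Let auto-scaling own the task count so apply does not reset it to zero.
  lifecycle {
    ignore_changes = [desired_count]
  }
}
</code></pre></div>
<p>The duplicated blocks are the complexity cost mentioned above made visible: every change to networking or the load balancer wiring has to land in both services.</p>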
<h2 id="成本監控">Cost monitoring</h2>
<p>Attributing ECS cost to individual services takes two mechanisms: tag propagation at the task level and the tag dimension in Cost Explorer.</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_ecs_service&#34; &#34;api&#34;</span> {<span class="c1">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1">  # ...
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="c1"></span><span class="n">  propagate_tags</span> <span class="o">=</span> <span class="s2">&#34;SERVICE&#34;</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="n">  tags</span> <span class="o">=</span> {
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">    service</span>     <span class="o">=</span> <span class="s2">&#34;payment-api&#34;</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="n">    env</span>         <span class="o">=</span> <span class="s2">&#34;prod&#34;</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="n">    cost-center</span> <span class="o">=</span> <span class="s2">&#34;cc-payments&#34;</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">  }
</span></span><span class="line"><span class="ln">10</span><span class="cl">}</span></span></code></pre></div><p><code>propagate_tags = &quot;SERVICE&quot;</code> propagates the service's tags to every task it launches, so Cost Explorer can break down Fargate cost by <code>service</code> or <code>cost-center</code> once those keys have been activated as cost allocation tags in Billing. This lines up with the tagging conventions in <a href="/blog/infra/08-governance-habits/" data-link-title="模組八：治理好習慣 — 規模長大後不失控的最小節奏" data-link-desc="tagging 規範、secrets 不進 code、成本可見性、最小可行節奏，規模長大後不失控">Module 8: Good Governance Habits</a>: tags are the foundation of cost visibility.</p>
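<p>User-defined tags only show up as a Cost Explorer dimension after they are activated as cost allocation tags. A minimal sketch of doing that in Terraform, assuming a recent AWS provider that includes <code>aws_ce_cost_allocation_tag</code> and that this runs in the account that owns billing (the resource label <code>cost_tags</code> is illustrative):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"># Activate the tag keys used on the ECS service as cost allocation tags.
resource "aws_ce_cost_allocation_tag" "cost_tags" {
  for_each = toset(["service", "env", "cost-center"])

  tag_key = each.value
  status  = "Active"
}
</code></pre></div>
<p>A tag key can only be activated after it has appeared on a billed resource, so the first apply may need to wait until the tagged tasks have run for a while.</p>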
<p>Check the Fargate cost trend in Cost Explorer regularly (at the start or middle of each month):</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">aws ce get-cost-and-usage <span class="se">\
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>  --time-period <span class="nv">Start</span><span class="o">=</span>2026-06-01,End<span class="o">=</span>2026-06-26 <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  --granularity DAILY <span class="se">\
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="se"></span>  --metrics <span class="s2">&#34;UnblendedCost&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="se"></span>  --group-by <span class="nv">Type</span><span class="o">=</span>TAG,Key<span class="o">=</span>service <span class="se">\
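</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="se"></span>  --filter <span class="s1">&#39;{&#34;Dimensions&#34;:{&#34;Key&#34;:&#34;SERVICE&#34;,&#34;Values&#34;:[&#34;Amazon Elastic Container Service&#34;]}}&#39;</span></span></span></code></pre></div><p>When cost jumps suddenly, first determine whether the task count went up (auto-scaling kicked in) or the unit price changed (a Savings Plan expired, or tasks that were on Spot were replaced by on-demand tasks after interruptions). The two call for different responses: for the former, review the scaling policy; for the latter, check the Savings Plans expiry date and how often Spot capacity is being reclaimed.</p>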
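<p>Diagnosing a jump presupposes noticing it. One possible way to get an early signal is an AWS Budgets alert scoped to the ECS service dimension. This is a sketch, not a recommendation from the original text: the budget name, the 1500 USD limit, the 80% threshold, and the email address are placeholders.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl">resource "aws_budgets_budget" "ecs_monthly" {
  name         = "ecs-fargate-monthly" # placeholder name
  budget_type  = "COST"
  limit_amount = "1500" # placeholder monthly limit
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # Same service dimension as the Cost Explorer queries in this section.
  cost_filter {
    name   = "Service"
    values = ["Amazon Elastic Container Service"]
  }

  # Notify when actual spend passes 80% of the limit.
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["platform@example.com"] # placeholder address
  }
}
</code></pre></div>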
<h2 id="跨分類引用">Cross-category references</h2>
<ul>
<li>→ <a href="/blog/infra/05-core-services/compute-ecs-eks/" data-link-title="運算平台上 IaC — ECS 與 EKS" data-link-desc="容器運算平台的 IaC 描述：ECS 與 EKS 選型、task definition 與映像版本解耦、IAM task role 分離、auto-scaling 策略">Compute Platforms on IaC</a>: choosing between ECS and EKS, and where Fargate fits</li>
<li>→ <a href="/blog/infra/08-governance-habits/" data-link-title="模組八：治理好習慣 — 規模長大後不失控的最小節奏" data-link-desc="tagging 規範、secrets 不進 code、成本可見性、最小可行節奏，規模長大後不失控">Module 8: Good Governance Habits</a>: the foundation for tagging and cost visibility</li>
<li>→ <a href="/blog/devops/08-cost-management/" data-link-title="模組八：成本管理" data-link-desc="雲端帳單怎麼不失控 — reserved instance、spot instance、right-sizing、成本監控告警">devops Module 8: Cost Management</a>: runtime RI / Spot / rightsizing strategies</li>
</ul>
]]></content:encoded></item></channel></rss>