Github-Actions on Tarragon

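From CI/CD Failure to Fix and Release

Wed, 06 May 2026 00:00:00 +0000

The core job of CI/CD failure handling is to turn a red build into a clear next step. A red build is a signal from the verification or delivery layer; the engineering process has to identify the failing layer, reproduce the same conditions, and, once the fix is in, let the CI pipeline prove again that the change is releasable.

What to check first after a failure

The first step after a failure is to locate the workflow and the job. A CI/CD system splits a push, pull request, tag, or release into multiple workflows, and each workflow contains multiple jobs; the real next step depends on which layer failed.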

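| Failure location | Common causes | Next step |
| --- | --- | --- |
| Lint / format | Code, docs, or config formatting violations | Run the same lint / format command locally |
| Test | Regression in unit, integration, browser, or device tests | Download the report and reproduce locally under the same conditions |
| Build | Compilation, bundling, packaging, or static output failure | Run the production build entry point locally |
| Package | Failure producing an image, app bundle, or artifact | Check version, signing, registry, or paths |
| Deploy | Hosting, runtime, store, or permission settings | First confirm that the build artifact was produced |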

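A lint / format failure means a static contract was not met. Typical cases are code formatting, document formatting, type checks, schemas, or config rules that violate the project's conventions. The fix path is usually short: read the error message, fix the source, run the formatter if needed, and commit the fix.

A test failure means some behavior or contract did not match expectations. Start with the report, screenshot, trace, device log, or error context to determine whether the feature really regressed, the test's assumption is outdated, or the test environment lacks a production-like artifact. Before changing the test itself, confirm which user or system behavior it was guarding.

A build failure means the pipeline has not yet produced a deployable artifact. It usually comes from compile errors, bundle configuration, dependency versions, environment variables, templates, or resource paths. Use the project's defined production build command as the minimal reproduction entry point.

A deploy failure means the release action did not complete. First distinguish whether the artifact exists, whether the release channel's permissions are correct, and whether environment protection has approved the run. If tests and build succeeded, a deploy failure is most likely a release-channel problem; if the artifact was never produced, go back to the build or package stage.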

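Reproducing locally

Local reproduction ensures the fix is built on the same verification conditions. CI is a set of commands run in a clean environment; once the same failure reproduces locally, the fix can be verified quickly.

make build
make test
make deploy-dry-run

The build command verifies that the production artifact can be produced. This step should stay close to the build entry point CI uses, so that development mode does not mask production problems.

The test command verifies the artifact or the program's behavior. For a frontend it may be a browser test, for a backend an integration / contract test, for an app a device test, and for Docker an image scan or smoke test.

The deploy dry-run command verifies pre-release conditions. A high-risk deployment should at least be able to check the artifact, permissions, environment, and version information; a project without a dry-run should still keep an equivalent preflight check.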

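Fixing and re-triggering

The core of the fix flow is to let CI re-verify through a new commit. The normal flow does not require deleting the failed commit or force-pushing; the failed commit stays in history, and the follow-up fix commit forms a clear repair trail.

1. Read the failed job's log or artifact.
2. Reproduce locally with the corresponding command.
3. Change the smallest necessary scope.
4. Run the same local command to confirm the fix.
5. Commit and push.
6. Wait for GitHub Actions to run again.

The benefit of this flow is traceability. When a similar failure shows up later, the commit history and CI log show how it was diagnosed at the time.

Release gate routing

A release gate turns "whether to enter the next stage" into an explicit condition. This page covers only the operational routing after a failure; the design principles of required checks, job needs, environment protection, and artifact handoff live separately in "CI gate and workflow boundaries".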

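Common scenarios

When CI fails but the local run passes, check environment differences first. Common differences include language version, package manager version, missing submodules, missing build artifacts, test dependencies not installed, time zone, or file-name case sensitivity. For this kind of problem, write the versions and build prerequisites into the workflow, Makefile, or script so that the reproduction conditions become part of the project.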
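As a minimal sketch of "writing the prerequisites into the workflow", the fragment below pins the items listed above. The `.nvmrc` file, the `make test` target, and the choice of Node as the toolchain are assumptions for illustration, not this project's actual configuration.

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    env:
      TZ: UTC                          # fix the time zone so date-dependent tests match local runs
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: recursive        # avoid "missing submodule" differences
      - uses: actions/setup-node@v4
        with:
          node-version-file: .nvmrc    # hypothetical file; CI and local read the same version
      - run: npm ci                    # install exactly what the lock file records
      - run: make test                 # hypothetical target shared with local runs
```

Keeping the version in a file that local tooling also reads means a version bump changes both environments in the same commit.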

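When tests are unstable, first mark the flaky test as such and assign an owner. In the short term it can be quarantined or rerun; in the long term the source of instability has to be found, for example wrong wait conditions, external network dependencies, time assumptions, unstable test data, or an animation transition that has not finished. Unstable tests lower trust in the gate, so they are themselves a CI problem that needs governance.

When deploy fails but tests pass, look at the artifact and permissions first. If the build output exists and can be downloaded, the problem is usually in the deployment channel, token permission, or environment protection; if the artifact is missing, go back to the build job.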

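Anti-patterns and alternatives

| Anti-pattern | Risk | Alternative |
| --- | --- | --- |
| Rerunning as soon as the build turns red | Hides flaky tests or environment problems | Read the failure log first, then decide whether to rerun |
| Using `--no-verify` or skipping CI | Carries a local problem into the main branch | Fix the gate or record the exception explicitly |
| CI and local commands differ | Passes locally but fails in CI | Converge the commands into a Makefile / npm script |
| Tests hit external services directly | Network and third-party state pollute the verdict | Use fixtures, mocks, or a controlled environment |

What these anti-patterns share is that they strip CI of its diagnostic value. The goal of CI is for a green build to mean "this change is releasable under the defined conditions".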

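Minimum viable flow

The minimum viable flow gives every change the same path. For a small static site or a personal blog, doing the following four things first is enough to form a stable release rhythm.

- A push or PR triggers lint / test / build.
- The production build has a single entry point.
- Artifacts or reports are kept when tests fail.
- Deploy accepts only output that has passed tests and build.

Once this flow is in place, a red CI build becomes a clear routing signal: which layer broke, which command reproduces it, and which gate approves the fix.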
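The four points above could be sketched as a single workflow like the one below. This is an illustration under assumptions, not a reference implementation: the `make lint`, `make test`, `make build`, and `make deploy` targets and the `report/` and `public/` paths are hypothetical names.

```yaml
name: ci
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make lint
      - run: make test
      - run: make build              # single production build entry point
      # Keep evidence when any earlier step failed (hypothetical report path).
      - uses: actions/upload-artifact@v4
        if: failure()
        with:
          name: test-report
          path: report/
      # Hand the verified build output to the deploy job.
      - uses: actions/upload-artifact@v4
        with:
          name: site
          path: public/

  deploy:
    needs: verify                    # runs only if verify succeeded
    if: github.event_name == 'push'  # never deploy from a pull request
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          name: site
          path: public/
      - run: make deploy
```

Because `deploy` downloads the artifact produced by `verify` instead of rebuilding, what gets published is exactly what was tested.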

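If the change involves backend services, the Runbook, Rollback Strategy, and Release Gate entries in the backend knowledge cards can be used to refine the failure-handling order and approval conditions.

Where to go next

- To understand where CI sits in the reliability module: read "6.1 CI pipeline".
- To see a static-site deployment case: read "This blog project's deployment".
- To understand CI gate design: read "CI gate and workflow boundaries".
- To understand release-blocking strategy: read "6.8 Release Gate and change cadence".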

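GitHub Actions workflows of this blog project

Wed, 06 May 2026 00:00:00 +0000

This blog's GitHub Actions workflows split content checks, browser regression tests, Hugo publishing, and Claude collaboration into separate automation flows. Each workflow is an independent entry point; when maintaining one, first be clear whether it protects content quality, user-facing behavior, the release artifact, or the collaboration process.

Workflow overview

The project currently has five workflows. Three belong to the main CI / CD flow, and two support Claude collaboration.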

| Workflow | File | Trigger | Core responsibility |
| --- | --- | --- | --- |
| `md-check` | `.github/workflows/md-check.yml` | push / pull request to `main` | Check the content Markdown contract |
| `Playwright tests` | `.github/workflows/playwright.yml` | push / pull request to `main` | Verify browser-level behavior and layout regressions |
| `Deploy Hugo site to Pages` | `.github/workflows/deploy.yml` | push to `main` | Build Hugo, generate the search index, and deploy |
| `Claude Code` | `.github/workflows/claude.yml` | issue / comment / review that calls Claude | Let Claude read issues, PRs, and CI results |
| `Claude Code Review` | `.github/workflows/claude-code-review.yml` | PR opened / synchronize and similar events | Run a Claude code review on the PR |

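The table serves as an entry point. When GitHub Actions turns red, match the workflow name first and classify the failure as content check, browser test, deployment, or collaboration flow.

`md-check`

md-check keeps the Markdown under content/ on one structural contract. It first uses Go to build scripts/mdtools, then runs the formatter check, lint, and card link check in order.

name: md-check
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

The core steps of this workflow are:

- actions/checkout@v6
- actions/setup-go@v6
- go build -o ../../bin/mdtools
- ./bin/mdtools fmt --check content/
- ./bin/mdtools lint content/
- ./bin/mdtools cards content/

When md-check fails, the next step is to run the same commands locally. A fmt --check failure means the formatting can be fixed with fmt --fix; a lint failure means a structural contract such as headings, front matter, URLs, or code blocks is violated; a cards failure means card links, orphans, or the K4 rule need fixing.

./bin/mdtools fmt --check content/
./bin/mdtools lint content/
./bin/mdtools cards content/

When maintaining this workflow, keep the rule source aligned with "Blog Markdown writing conventions and mdtools checks". When changing scripts/mdtools/internal/rules/, update the conventions article as well so that CI behavior and documentation do not diverge.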

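`Playwright tests`

Playwright tests verifies user-visible behavior. It first builds the complete Hugo site and the Pagefind indexes, then uses Chromium to verify search, layout, and interaction.

name: Playwright tests
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

The core steps of this workflow are:

- checkout, including submodules
- install Hugo 0.148.2 extended
- install Node 24
- npm ci
- npx playwright install --with-deps chromium
- make site
- npx playwright test
- upload playwright-report/ on failure

make site is the key prerequisite of this workflow. It produces the Hugo static files and three Pagefind indexes: pagefind, pagefind-title, and pagefind-content. If Playwright runs after only hugo --minify, the search tests fail because the indexes are missing.

When Playwright fails, the next step is to download playwright-report or read the error context. If the failure is on the search page, first confirm that make site completed successfully; if it is a layout failure, look at the screenshot, bounding box, or computed style; if it is an interaction failure, check whether the selector still targets the real DOM.

make site
npm test

When maintaining this workflow, tests should guard user behavior rather than implementation details. For layout behavior such as the responsive TOC, viewport tests can pin down the desktop, laptop, and mobile states.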

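`Deploy Hugo site to Pages`

Deploy Hugo site to Pages builds the content on main into a GitHub Pages artifact and deploys it. It is triggered only by a push to main and never deploys from a pull request.

name: Deploy Hugo site to Pages
on:
  push:
    branches:
      - main

This workflow has two jobs:

| Job | Responsibility | Key setting |
| --- | --- | --- |
| `build` | checkout, Hugo build, Pagefind, artifact | `runs-on: ubuntu-latest` |
| `deploy` | Publish to GitHub Pages | `needs: build` |

The build job first runs hugo --minify and writes the output to hugo-build-output.txt. That step currently sets continue-on-error: true, so when the Hugo build fails the job moves on to the Claude Debug step, which tries to have Claude analyze the error and commit a fix.

Fail if build was not fixed is the second line of defense. If the original Hugo build failed, the workflow runs hugo --minify again; if Claude did not fix it, this step stops the workflow.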
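The two-layer pattern described above can be sketched roughly as follows. This is a reconstruction from the description, not a copy of the project's deploy.yml: the step id `hugo` and the placeholder body of the middle step are assumptions.

```yaml
steps:
  - name: Build with Hugo
    id: hugo
    continue-on-error: true          # let later steps run even if the build fails
    shell: bash                      # explicit bash runs with pipefail, so tee does not mask failure
    run: hugo --minify 2>&1 | tee hugo-build-output.txt

  # Hypothetical placeholder for the Claude Debug step.
  - name: Attempt automated fix
    if: steps.hugo.outcome == 'failure'
    run: echo "analyze hugo-build-output.txt and commit a fix here"

  # Second line of defense: rebuild, and fail the job if the build is still broken.
  - name: Fail if build was not fixed
    if: steps.hugo.outcome == 'failure'
    run: hugo --minify
```

`steps.hugo.outcome` holds the result before continue-on-error is applied, which is why it can still read `failure` while the job keeps running.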

The Pagefind indexes are generated after the Hugo build:

npx -y pagefind --site public --root-selector main
npx -y pagefind --site public --root-selector "article.article-content > h1" --output-subdir pagefind-title
npx -y pagefind --site public --root-selector ".article-body" --output-subdir pagefind-content

When deploy fails, diagnose layer by layer. If the build job failed, go back to Hugo or Pagefind; if Upload artifact succeeded but the deploy job failed, check the Pages environment, permissions, the artifact, and the GitHub Pages settings.

One thing to note about this workflow: the deploy workflow itself has no direct needs on md-check or Playwright tests, because they are independent workflows. This is the project's actual boundary today; for gate design principles see "CI gate and workflow boundaries".
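For readers wondering what an explicit dependency could look like, one option is to call the check workflows as reusable workflows from the deploy workflow. The sketch below is hypothetical and is not the project's current configuration; it assumes md-check.yml and playwright.yml each add `workflow_call:` under `on:`.

```yaml
name: Deploy Hugo site to Pages
on:
  push:
    branches: [main]

jobs:
  md-check:
    uses: ./.github/workflows/md-check.yml     # assumes `on: workflow_call` was added there
  playwright:
    uses: ./.github/workflows/playwright.yml   # same assumption
  build:
    needs: [md-check, playwright]              # build waits for both checks
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: recursive
      - run: make site
      - uses: actions/upload-pages-artifact@v3
        with:
          path: public
  deploy:
    needs: build
    runs-on: ubuntu-latest
    permissions:
      pages: write
      id-token: write
    environment:
      name: github-pages
    steps:
      - uses: actions/deploy-pages@v4
```

The trade-off is that the checks then run a second time on every push to main, in addition to their own push / pull request triggers.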

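`Claude Code`

Claude Code provides an interactive entry point for working with Claude. It does not fix code automatically on every push; it is triggered when an issue, comment, or review contains @claude.

on:
  issue_comment:
    types: [created]
  pull_request_review_comment:
    types: [created]
  issues:
    types: [opened, assigned]
  pull_request_review:
    types: [submitted]

The gate of this workflow is written in the job's if. It actually runs only in the following cases:

- an issue comment contains @claude
- a pull request review comment contains @claude
- a pull request review body contains @claude
- an issue title or body contains @claude

This workflow grants Claude the actions: read permission so that it can read CI results on a PR. That matters for requests like "ask Claude why CI failed", because Claude needs to read the workflow run, job log, or check result to make a judgment.

When maintaining this workflow, the focus is least privilege. It currently grants contents: read, pull-requests: read, issues: read, id-token: write, and actions: read, which suits interactive analysis; only if Claude is later meant to commit directly do write permissions and protection conditions need to be re-evaluated.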
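Put together, the gate and the permission set described above might look like the fragment below. It is reconstructed from the prose, so the job id `claude` and the placeholder step are assumptions; the actual claude.yml may differ in detail.

```yaml
jobs:
  claude:
    # Run only when the triggering text mentions @claude.
    if: |
      (github.event_name == 'issue_comment' && contains(github.event.comment.body, '@claude')) ||
      (github.event_name == 'pull_request_review_comment' && contains(github.event.comment.body, '@claude')) ||
      (github.event_name == 'pull_request_review' && contains(github.event.review.body, '@claude')) ||
      (github.event_name == 'issues' && (contains(github.event.issue.body, '@claude') || contains(github.event.issue.title, '@claude')))
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: read
      issues: read
      id-token: write
      actions: read        # lets Claude read workflow runs and check results
    steps:
      - uses: actions/checkout@v4   # placeholder; the Claude action step is omitted here
```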

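`Claude Code Review`

Claude Code Review runs a Claude code review when PR events occur. It differs from Claude Code: Claude Code Review is PR review automation, while Claude Code is the interactive entry point invoked with @claude.

on:
  pull_request:
    types: [opened, synchronize, ready_for_review, reopened]

This workflow uses code-review@claude-code-plugins, and the prompt is:

/code-review:code-review ${{ github.repository }}/pull/${{ github.event.pull_request.number }}

Its job is to provide a review perspective. A Claude review can point out risks, logic problems, or test gaps; the responsibility for actually blocking merge and release still lies with Required Checks, the test workflows, and the deploy gate.

When maintaining this workflow, a path filter can be added depending on the PR type. To trigger only on code or workflow changes in the future, enable the paths setting; to have article content reviewed as well, keep the current trigger on every PR.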
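A path filter of the kind mentioned above could look like this. The listed paths are illustrative assumptions and would need to match the repository's real layout.

```yaml
on:
  pull_request:
    types: [opened, synchronize, ready_for_review, reopened]
    paths:
      - ".github/workflows/**"   # workflow changes
      - "scripts/**"             # hypothetical code directory
      - "**/*.go"                # hypothetical: Go sources anywhere in the repo
```

With this filter, a PR that touches only files under content/ would not trigger the review.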

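Release-blocking boundaries of this project

The release-blocking boundaries of this blog have to be read from both the YAML and the GitHub repository settings. This section records only what can be determined from the project's current YAML; the principles of required checks, environment protection, and artifact handoff are not covered on this page.

The blocking relationships that can be confirmed directly from the YAML are:

| Relationship | Explicit in YAML? | Notes |
| --- | --- | --- |
| `deploy` waits for `build` | Yes | The `deploy` job has `needs: build` |
| `deploy` waits for `md-check` | No | `md-check` is a separate workflow |
| `deploy` waits for Playwright | No | `Playwright tests` is a separate workflow |
| PR must pass tests before merge | Check repository settings | Depends on GitHub branch protection settings |
| Pages deploy requires manual approval | Check environment settings | Depends on GitHub Pages environment protection settings |

If tests are ever red while Pages still publishes, this page only points out the current workflow boundary; the concrete change should be decided in "CI gate and workflow boundaries", cross-checked against Required Checks and Environment Protection.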

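Maintenance routing on failure

On failure, locate the workflow first, then the job, then reproduce locally. This avoids fixing the wrong problem at the wrong layer.

| Red location | What to look at first | Local reproduction |
| --- | --- | --- |
| `md-check` | mdtools messages | `./bin/mdtools lint content/` |
| `Playwright tests` | `playwright-report` / error context | `make site`, then `npm test` |
| Hugo build in `Deploy` | `hugo-build-output.txt` | `hugo --minify` |
| Pagefind in `Deploy` | Pagefind command output | `make site` |
| Pages step in `Deploy` | artifact / permission / environment | GitHub Actions UI + Pages settings |
| `Claude Code` | secret / permission / trigger `if` | Check the `@claude` trigger text and secrets |
| `Claude Code Review` | plugin marketplace / token | Check the PR event, secret, and action log |

This routing table also works as a maintenance checklist. When adding a workflow, document at least three things: the trigger, which artifact or log to read on failure, and which command reproduces it locally.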

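Maintenance notes for this project

These notes record operational reminders tied directly to the current YAML. They are updated as the workflow implementation changes and do not cover general CI design principles.

- Playwright tests depends on make site to produce the Pagefind indexes; when search tests fail, first confirm that the production build is complete.
- The Hugo build in deploy.yml uses continue-on-error: true, and the later Claude Debug and retry build steps catch the failure.
- Claude Code is currently a read-oriented interactive entry point; if it is ever meant to write to the repo, the permissions need to be reviewed again.
- When .github/workflows/*.yml changes, update this page as well so that the maintenance entry point stays trustworthy.

Where to go next

- Handling a red CI build: read "From CI/CD Failure to Fix and Release".
- CI gate design principles: read "CI gate and workflow boundaries".
- Where CI sits in the reliability module: read "6.1 CI pipeline".
- Release gate design: read "6.8 Release Gate and change cadence".
- Markdown check rules: read "Blog Markdown writing conventions and mdtools checks".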

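Terraform CI Pipeline Setup Guide

Fri, 26 Jun 2026 00:00:00 +0000

For a Terraform PR flow to pay off, plan and apply have to run automatically in CI rather than on an engineer's machine. This article builds a complete pipeline with GitHub Actions: checks and plan run when a PR is opened, the plan result is posted back as a PR comment for reviewers, and apply runs only after the merge to the main branch. The whole pipeline obtains credentials as short-lived tokens through OIDC (see "OIDC Trust Policy Setup Guide") and stores no long-lived keys.

The two stages of the pipeline

The pipeline has two trigger points, each with its own responsibility:

| Stage | Trigger | Responsibility | On failure |
| --- | --- | --- | --- |
| Plan | PR opened or updated | Format check, syntax validation, static scan, plan diff | PR cannot be merged |
| Apply | Merge to main | Apply the planned changes to the cloud | Manual intervention required |

The two stages use different IAM roles: the plan role has read-only permissions (it can run terraform plan but cannot change any resource), and the apply role has write permissions. This separation ensures that no code in the PR stage can quietly modify cloud resources.

Complete workflow for the plan stage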

name: Terraform Plan
on:
  pull_request:
    paths:
      - 'infra/**'

permissions:
  id-token: write
  contents: read
  pull-requests: write

jobs:
  plan:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: infra/environments/prod

    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/infra-plan
          aws-region: ap-northeast-1

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.9.0

      - name: Format check
        run: terraform fmt -check -recursive -diff

      - name: Init
        run: terraform init -input=false

      - name: Validate
        run: terraform validate

      - name: TFLint
        uses: terraform-linters/setup-tflint@v4
        with:
          tflint_version: latest
      - run: tflint --recursive --format compact

      - name: Plan
        id: plan
        shell: bash
        run: |
          # Piping into tee would otherwise hide terraform's exit status,
          # so read it from PIPESTATUS and fail only on a real error.
          set +e
          terraform plan -no-color -input=false -out=tfplan \
            -detailed-exitcode 2>&1 | tee plan-output.txt
          code=${PIPESTATUS[0]}
          # 0 = no changes, 2 = changes present, anything else = error
          if [ "$code" -ne 0 ] && [ "$code" -ne 2 ]; then exit "$code"; fi
        continue-on-error: true

      - name: Comment plan on PR
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('infra/environments/prod/plan-output.txt', 'utf8');
            const truncated = plan.length > 60000
              ? plan.substring(0, 60000) + '\n\n... (truncated)'
              : plan;
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: `### Terraform Plan\n\`\`\`\n${truncated}\n\`\`\``
            });

      - name: Fail if plan errored
        if: steps.plan.outcome == 'failure'
        run: exit 1

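What each step does

Format check verifies that the HCL follows the standard layout. It does not affect behavior, but it removes diff noise: when formatting is inconsistent, the PR diff mixes in pure formatting changes and reviewers cannot tell which lines are logic changes. The -diff flag makes CI print exactly which lines do not conform, and the author can fix them by running terraform fmt locally.

Init initializes the providers and the backend. -input=false keeps CI from hanging on interactive input. If the backend is misconfigured (the bucket does not exist, permissions are insufficient), this step fails and the later steps do not waste time.

Validate checks HCL syntax and internal consistency: undeclared variables, type mismatches, missing required arguments. It does not contact the cloud and only reads code, so it can run without AWS credentials (it sits after init because validate needs the provider schemas).

TFLint checks correctness at the provider level: an instance type that does not exist in the region, deprecated arguments, names that break conventions. It covers the "syntax is right but the value is wrong" problems that validate cannot catch.

Plan is the core output of the pipeline. -detailed-exitcode makes the exit code distinguish three states: 0 = no diff, 1 = error, 2 = diff present. -out=tfplan saves the plan as a binary file that the apply stage can execute directly, avoiding inconsistency caused by the time gap between plan and apply.

Comment posts the plan output back to the PR so reviewers see the actual changes alongside the code diff. The plan output can be long (hundreds of lines), so the script truncates it at 60,000 characters to stay under GitHub's comment size limit and keeps the beginning. Note that Terraform prints the `Plan: N to add, N to change, N to destroy` summary line at the end of the output, so a truncated comment may lose that line.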

Apply stage

name: Terraform Apply
on:
  push:
    branches: [main]
    paths:
      - 'infra/**'

permissions:
  id-token: write
  contents: read

jobs:
  apply:
    runs-on: ubuntu-latest
    environment: production
    defaults:
      run:
        working-directory: infra/environments/prod

    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/infra-apply
          aws-region: ap-northeast-1

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.9.0

      - name: Init
        run: terraform init -input=false

      - name: Plan (verify)
        run: terraform plan -no-color -input=false -detailed-exitcode

      - name: Apply
        run: terraform apply -auto-approve -input=false

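Environment protection rules

The line environment: production enables GitHub's environment protection. Configure it in the repo under Settings → Environments → production:

- Required reviewers: at least one designated person must approve before the apply job runs
- Wait timer: apply starts N minutes after the merge (to give people time to react)
- Deployment branches: only the main branch is allowed to trigger

This layer adds a manual confirmation before apply for high-risk changes (a plan that shows destroy or replace). Routine low-risk changes (adding a tag, adjusting a parameter) can pass straight through. The trade-off is that requiring a click for every apply slows down frequent small changes; the deployment rule can be scoped so that only the production environment is gated.

Why the apply stage reruns plan

Rerunning plan before apply verifies that reality after the merge matches what was seen during PR review. Hours or days can pass between opening and merging a PR; in that time someone may have changed cloud resources by hand (drift), or another PR may have applied first. Rerunning plan confirms that the diff matches expectations, and if it does not, the pipeline stops instead of applying blindly.

If the plan stage saved the plan with -out=tfplan, apply can instead run terraform apply tfplan to execute exactly the reviewed plan. The cost is that the plan file has to be passed between jobs (as a GitHub Actions artifact), and the plan file goes stale: if the state was modified after the plan was made, apply refuses to run.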
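A minimal sketch of passing the saved plan between jobs is shown below. It assumes plan and apply are two jobs in the same workflow run, and it omits the credential steps for brevity; the artifact name `tfplan` is a hypothetical choice. Passing a plan from a PR run to a later push run needs extra wiring that is not shown here.

```yaml
jobs:
  plan:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: infra/environments/prod
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
      - run: terraform plan -input=false -out=tfplan
      - uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: infra/environments/prod/tfplan   # `uses` steps ignore working-directory

  apply:
    needs: plan
    runs-on: ubuntu-latest
    environment: production
    defaults:
      run:
        working-directory: infra/environments/prod
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - uses: actions/download-artifact@v4
        with:
          name: tfplan
          path: infra/environments/prod
      - run: terraform init -input=false
      - run: terraform apply -input=false tfplan   # executes exactly the saved plan
```

A saved plan file can contain sensitive values, so it is worth treating the artifact with the same care as the state.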

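Pipeline design for multiple environments

When managing dev / staging / prod, there are two common pipeline structures:

Single workflow with a matrix: one YAML file uses strategy.matrix to run all three environments, each with its own working directory and IAM role. The benefit is a single YAML file to maintain; the cost is that all three plans run in the same PR run and reviewers have to read three plan outputs.

One workflow per environment: three YAML files, each triggered by changes in its own environment directory (paths: ['infra/environments/dev/**']). The benefit is that only the touched environment runs and PR comments stay clean; the cost is duplication across three files.

Most teams start with a single workflow plus matrix and switch to separate workflows once there are more than three environments or the apply strategy differs per environment (dev automatic, prod requiring approval).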
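The single-workflow variant could be sketched like this. The per-environment role ARNs are hypothetical names modeled on the `infra-plan` role used earlier.

```yaml
jobs:
  plan:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    strategy:
      fail-fast: false               # one failing environment should not cancel the others
      matrix:
        include:
          - env: dev
            role: arn:aws:iam::123456789012:role/infra-plan-dev       # hypothetical
          - env: staging
            role: arn:aws:iam::123456789012:role/infra-plan-staging   # hypothetical
          - env: prod
            role: arn:aws:iam::123456789012:role/infra-plan-prod      # hypothetical
    defaults:
      run:
        working-directory: infra/environments/${{ matrix.env }}
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ matrix.role }}
          aws-region: ap-northeast-1
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
      - run: terraform plan -no-color -input=false
```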

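Security boundaries

The CI pipeline is the automated executor of infra changes, so its security equals the permissions of the apply role. A few boundaries have to hold:

Narrowing OIDC claims: the apply role's trust policy should allow only a specific repo's main branch or production environment to assume it (see "OIDC Trust Policy Setup Guide"). Because the apply job above declares environment: production, its token carries the environment form of the sub claim, so the condition has to match that value. If the claim checks only the repo and not the branch or environment, anyone can push a modified workflow on a feature branch and trigger apply.

Review of workflow changes: YAML changes under .github/workflows/ should go through PR review just like infra code. Changing a workflow changes the pipeline's behavior; adding a terraform destroy step is enough to wipe an entire environment on merge. GitHub's CODEOWNERS feature can force specific people to review workflow changes.

Secrets and environment variables: OIDC replaces the access keys stored in repo secrets, but a workflow may still use other secrets (a Terraform Cloud token, a Slack webhook URL). These secrets should be scoped to a specific environment rather than exposed to every branch.

This article focuses on GitHub Actions. If the team chooses Atlantis (a long-running service with built-in state locking and apply semantics), see the selection discussion in the Atlantis section of the main article.

Cross-category references

→ OIDC Trust Policy Setup Guide: where the pipeline's credentials come from
→ checkov / tfsec rule configuration: how to configure static security scanning in the pipeline
→ Infra through the PR flow, with automated guardrails: the review principles behind the pipeline
→ Module 4: environment separation and modularization: the multi-environment directory layout decides the pipeline's working directory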

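OIDC Trust Policy Setup Guide

Fri, 26 Jun 2026 00:00:00 +0000

OIDC federation lets a CI/CD pipeline access cloud resources with short-lived tokens instead of long-lived access keys. The setup itself is not complicated, but one wrong character in the trust policy's claim conditions turns it into either "any repo can assume this role" or "nothing can assume it at all". This article walks through the full setup of OIDC federation between GitHub Actions and AWS, from creating the provider to designing the trust policy to testing. Other CI platforms (GitLab CI, CircleCI) work on the same principle and differ only in issuer URL and claim structure:

| Platform | Issuer URL | Example sub claim format |
| --- | --- | --- |
| GitHub Actions | `token.actions.githubusercontent.com` | `repo:{org}/{repo}:ref:refs/heads/{branch}` |
| GitLab CI | `gitlab.com` | `project_path:{group}/{project}:ref_type:branch:ref:main` |
| CircleCI | `oidc.circleci.com/org/{org-id}` | `org/{org-id}/project/{project-id}/user/{user-id}` |

This article focuses on GitHub Actions; for other platforms, swap in the issuer URL and sub condition.

Creating the OIDC provider

The OIDC provider is a resource in the AWS account that declares "I trust tokens issued by this external identity provider". The GitHub Actions OIDC issuer URL is fixed, and each AWS account needs only one provider.

resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["ffffffffffffffffffffffffffffffffffffffff"]
}

Setting client_id_list to sts.amazonaws.com follows the audience value GitHub officially recommends. Since 2023 AWS no longer uses thumbprint_list to verify GitHub's certificate chain (it uses a root certificate list that AWS maintains itself), but the field is still required, so filling it with 40 f characters as a placeholder is enough.

The provider is created once. Multiple roles can share the same provider; they differ in how each trust policy is written.

Trust policy design: narrowing claims

The trust policy decides "who can assume this role". The OIDC token carries several claims (describing which repo, which branch, and which workflow is running); the trust policy uses conditions to match these claims and allows the assume only when all of them match.

Minimum viable trust policy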

data "aws_iam_policy_document" "ci_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:my-org/my-app:ref:refs/heads/main"]
    }
  }
}

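Each of the two conditions guards one boundary. aud verifies that the audience is correct (preventing a token issued for another purpose from being used to assume the role). sub verifies which repo and branch the request comes from, and it is the most important narrowing point.

Structure of the sub claim

The GitHub Actions sub claim has the format repo:{owner}/{repo}:{context}, where context depends on how the workflow was triggered:

| Trigger | sub claim value |
| --- | --- |
| push to branch | `repo:my-org/my-app:ref:refs/heads/main` |
| pull request | `repo:my-org/my-app:pull_request` |
| environment deploy | `repo:my-org/my-app:environment:production` |
| tag push | `repo:my-org/my-app:ref:refs/tags/v1.0.0` |
| manual dispatch | `repo:my-org/my-app:ref:refs/heads/main` |

The sub condition in the trust policy should be narrowed to the level actually needed. To allow only pushes to the main branch, write repo:my-org/my-app:ref:refs/heads/main; to allow only deploys to the production environment, write repo:my-org/my-app:environment:production.

Environment-based narrowing (recommended)

The GitHub Actions environment feature puts the environment name into the sub claim. Combined with environment protection rules (required reviewers, wait timer), this gives one gate at the trust policy layer and another at the GitHub layer:

condition {
  test     = "StringEquals"
  variable = "token.actions.githubusercontent.com:sub"
  values   = ["repo:my-org/my-app:environment:production"]
}

The matching workflow settings:

jobs:
  apply:
    environment: production
    permissions:
      id-token: write
      contents: read

Only when the workflow declares environment: production and passes the environment's protection rules does the runner receive a token whose sub claim carries environment:production, and only then can it assume this role.

Separating the plan role and the apply role

Split plan and apply into two roles, each with the least privilege it needs. Plan needs only read permissions (reading state and the current cloud state); apply needs write permissions (creating, modifying, and deleting resources). The benefit is that even if the plan in the PR stage is compromised, the attacker can only read and not change anything.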

resource "aws_iam_role" "infra_plan" {
  name               = "infra-plan"
  assume_role_policy = data.aws_iam_policy_document.plan_trust.json
}

resource "aws_iam_role" "infra_apply" {
  name               = "infra-apply"
  assume_role_policy = data.aws_iam_policy_document.apply_trust.json
}

resource "aws_iam_role_policy_attachment" "plan_readonly" {
  role       = aws_iam_role.infra_plan.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}

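How the trust policies differ: the plan role allows PR-triggered runs from any branch (repo:my-org/my-app:pull_request); the apply role allows only the main branch or the production environment. Because the apply job below declares environment: production, its token's sub takes the environment form, so the apply role's condition is repo:my-org/my-app:environment:production.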

jobs:
  plan:
    if: github.event_name == 'pull_request'
    permissions:
      id-token: write
      contents: read
      pull-requests: write
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/infra-plan
          aws-region: ap-northeast-1
      - run: terraform plan -out=plan.tfplan

  apply:
    if: github.ref == 'refs/heads/main'
    environment: production
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/infra-apply
          aws-region: ap-northeast-1
      - run: terraform apply -auto-approve

Common configuration mistakes

Audience mismatch

Error: Not authorized to perform sts:AssumeRoleWithWebIdentity

The most common cause is that the aud condition in the trust policy does not match the OIDC provider's client_id_list. Both must be sts.amazonaws.com. An old version of the configure-aws-credentials action (v1) defaults to sigstore as the audience, which does not match sts.amazonaws.com. Confirm that the action version is v4 or later.

sub condition too broad

condition {
  test     = "StringLike"
  variable = "token.actions.githubusercontent.com:sub"
  values   = ["repo:my-org/*"]
}

This allows any branch of any repo under my-org to assume the role. If the organization has public repos or repos with loose fork permissions, an attacker can trigger a workflow in those repos to assume the production role. Narrow it at least to the repo level (repo:my-org/my-app:*), and for production narrow it to a branch or environment.

sub condition too tight

condition {
  test     = "StringEquals"
  variable = "token.actions.githubusercontent.com:sub"
  values   = ["repo:my-org/my-app:ref:refs/heads/main"]
}

This allows only workflows triggered by a push to main. A PR-triggered workflow gets the sub repo:my-org/my-app:pull_request, which does not match this condition, so the plan stage fails. If plan has to run in the PR stage, the plan role's trust policy needs the PR sub pattern.

Forgetting to set permissions

jobs:
  deploy:
    # the permissions block is missing
    steps:
      - uses: aws-actions/configure-aws-credentials@v4

GitHub Actions issues an OIDC token only when the workflow declares permissions: { id-token: write }. Without that line, configure-aws-credentials cannot obtain a token and reports an error along the lines of "OIDC token not available". The message is not intuitive: it says the token does not exist, not that a permission is missing.
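For contrast, a corrected version of the fragment above might look like this. The role ARN reuses the `infra-apply` example from this article, and the final step is just one simple way to confirm that the assume worked.

```yaml
jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # allows the job to request an OIDC token
      contents: read    # needed by actions/checkout
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/infra-apply
          aws-region: ap-northeast-1
      - run: aws sts get-caller-identity   # prints the assumed role's identity
```

Declaring `permissions` at the job level resets every unlisted permission to none, which is why `contents: read` has to be listed explicitly.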

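Forgetting the provider in a multi-account setup

If the organization has multiple AWS accounts, each account needs its own OIDC provider. The Federated principal in the trust policy must point to the provider ARN in the same account and cannot reference one across accounts. For cross-account deployments, the workflow switches accounts with different role-to-assume values; each account's role trusts the same GitHub OIDC issuer, but through its own independent provider resource.

Testing and verification

Verification steps after setup:

- Trigger the workflows by hand: push a harmless commit to main and open a test PR, then check whether the configure-aws-credentials step succeeds
- Check CloudTrail: search for AssumeRoleWithWebIdentity events and confirm that the source identity and the assumed role are correct
- Negative test: trigger a workflow from a repo or branch outside the trust policy's allowed range and confirm that the assume is rejected
- Permission scope test: attempt a write operation in the plan job (such as aws s3 rm) and confirm that it is rejected, which verifies that the plan role's read-only restriction really works

# Search CloudTrail for OIDC assume events
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRoleWithWebIdentity \
  --max-items 5

Once verification passes, this OIDC setup replaces every access key stored in CI environment variables. The old keys can be scheduled for deactivation and deletion; for the cadence see "access key rotation". Ongoing trust policy maintenance comes down to two things: update the sub conditions when a repo is added, and correct the org/repo paths in the sub conditions everywhere when the organization is renamed.

Time estimate: creating the OIDC provider, designing the trust policy, and verifying the workflow takes about 1-2 hours. The OIDC provider and IAM roles incur no additional cost by themselves.

Cross-category references

→ Identity and credential foundations: the conceptual basis of OIDC and permission boundary design
→ Infra through the PR flow: how the plan/apply CI pipeline uses the roles configured here
→ Cross-account strategy: OIDC provider configuration in a multi-account environment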

Jenkins → GitHub Actions: mapping and translating the 5 pipeline lifecycle stages

Tue, 19 May 2026 00:00:00 +0000

This article is a cross-vendor migration playbook that cross-links Jenkins and GitHub Actions. Running the 6-dimension audit from migration-playbook-methodology gives Schema = High (Groovy DSL ↔ YAML workflow), which maps to a Type A phased translation.

Mapping and translating the 5 pipeline lifecycle stages

This article is organized by the 5 pipeline lifecycle stages (variant E). It does not open with the "why migrate" drivers; it looks at how Jenkins and GHA each handle the 5 stages:

| Lifecycle stage | Jenkins mechanism | GHA mechanism |
| --- | --- | --- |
| 1. Source / SCM | SCM polling / webhook trigger | `on: [push, pull_request]` event |
| 2. Build / Package | `stage('Build') { sh 'mvn package' }` | `jobs.build.steps[].run: mvn package` |
| 3. Test / parallel matrix | `parallel { ... }` + agents | `jobs.test.strategy.matrix: ...` |
| 4. Security scan | Plugin (Snyk / SonarQube / Aqua) | Action (snyk/actions / sonarsource-actions) |
| 5. Deploy / promote | Deploy plugin + approval gate | `environment: production` + reviewer approval |

The 6-dimension diff audit:

| Dimension | Assessment | Level |
| --- | --- | --- |
| Schema / API | Groovy DSL ↔ YAML, completely different syntax | High |
| Operational model | Self-hosted Jenkins → GHA SaaS / self-hosted runners | Medium |
| Paradigm | Imperative pipeline → declarative workflow + events | Medium |
| Components | Jenkins + plugins → GHA + actions marketplace | Low |
| Application change | Build scripts mostly unchanged; the CI integration side changes | Low |
| Data topology | Same single build state | Low |

Schema = High (the rest Medium to Low) → mainly a Type A phased translation, plus separate sections for paradigm and operational model.

Why migrate: three drivers (cost / vendor / cloud-native)

- Cost: self-hosted Jenkins is "free software + high ops cost"; GHA's per-minute billing is cheaper for small and mid-sized teams
- Vendor consolidation: the repositories already live on GitHub, and moving into GHA removes one external system
- Cloud-native: GHA matrix builds and reusable workflows come with first-class actions for cloud-native deploys (K8s / serverless)
Phase 0: Audit + classify

# Inventory the Jenkins workspace
find . -name "Jenkinsfile" -o -name "*.groovy"
# lists every pipeline file

# Count plugin usage
# import / @Library / sh "tool plugin..." inside the Jenkinsfile
grep -rE "@Library|import|tools\s*\{" Jenkinsfile*

# Rate the complexity of each pipeline
# - Simple linear pipeline: 1-3 stages, no shared library
# - Medium: parallel stages + 2-5 shared libraries
# - Complex: conditional branches + dynamic stages + 10+ plugins / 5+ shared libraries

Audit output:

- A summary such as "100 pipelines: 35 simple / 50 medium / 15 complex"
- Estimated translation time per complexity level (simple 0.5 day / medium 2 days / complex 5-10 days)
- A list of plugin dependencies mapped to GHA action replacements
Phase 1: Schema mapping (Groovy DSL ↔ YAML)

// Jenkins Declarative Pipeline
pipeline {
  agent { label 'docker-build' }
  stages {
    stage('Test') {
      parallel {
        stage('Unit') { steps { sh 'mvn test' } }
        stage('Integration') { steps { sh 'mvn verify' } }
      }
    }
  }
  post {
    failure { mail to: 'devops@', subject: 'Build failed' }
  }
}

# GHA workflow equivalent
name: CI
on: [push]
jobs:
  test:
    runs-on: [self-hosted, docker-build]
    strategy:
      matrix:
        suite: [unit, integration]
    steps:
      - uses: actions/checkout@v4
      - name: Run ${{ matrix.suite }}
        run: |
          case "${{ matrix.suite }}" in
            unit) mvn test ;;
            integration) mvn verify ;;
          esac
  notify-failure:
    needs: test
    if: failure()
    runs-on: ubuntu-latest
    steps:
      - uses: dawidd6/action-send-mail@v3
        with:
          to: devops@
          subject: Build failed

Mapping differences:

- parallel { ... } → strategy.matrix (different granularity: a matrix is "the same steps with different parameters", parallel is "different steps")
- post.failure → a separate job + if: failure()
- @Library shared library → reusable workflow (uses: ./.github/workflows/reusable.yml)
- Jenkins tools { jdk 'java17' } → the setup-java action (the toolchain is configured by hand)

Phase 2: Translation pipeline (3-tier hybrid)

The same 3 tiers as the Splunk → Elastic translation:

- Tier 1: community tool (a jenkins-to-actions converter, covering 30-50% of simple pipelines)
- Tier 2: LLM-assisted (Claude / GPT translates medium-complexity pipelines, verified by a human)
- Tier 3: manual (shared libraries turned into reusable workflows, conditional dynamic stages rewritten)

Phase 3: Parallel run (both CI systems for 4-8 weeks)

Repository ──┬─→ Jenkins webhook ──→ Jenkinsfile pipeline
             └─→ GitHub Action ────→ .github/workflows/ci.yml

Compare:
- Both sides give the same result for the same commit
- Latency / cost / artifact location line up

A diff dashboard tracks three metrics (test pass rate / build time / failure mode); cutover starts only once the two sides agree 95% or more of the time.

Phase 4: Cutover + cleanup

1. Disable the Jenkins webhook
2. GHA becomes the primary CI
3. Jenkins stays on standby for 2 weeks as a fallback
4. Decommission the Jenkins controller + agents

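Production failure drills

Case 1: Shared library equivalence, reusable workflows not expressive enough

Symptom: a complex Jenkins shared library (with Groovy classes, closures, and dynamic variables) loses fidelity when translated into a reusable workflow, and some dynamic logic cannot be expressed.

Root cause: Jenkins Groovy is imperative and a full programming language; a GHA reusable workflow is declarative YAML with limited expressiveness.

Fixes:

- Move complex logic into scripts: the reusable workflow acts only as an orchestrator, and complex logic goes into .github/scripts/*.sh or actions/javascript-action
- Custom composite action: wrap multi-step logic in a composite action, which is more reusable than a reusable workflow
- Retire over-engineered shared libraries: the translation process tends to reveal that only about 10% of the library code is really used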

Case 2: Ephemeral workspace, build cache misses

Symptom: after cutover, build time rises from 5 minutes to 20 minutes; Maven / Gradle / node_modules / Docker layers are fetched from scratch every time.

Root cause: a Jenkins agent workspace is persistent and the build cache survives across builds; a GHA ephemeral runner is a fresh VM every time and carries no cache by default.

Fixes:

- actions/cache@v4: build the cache key from lock files, such as hashFiles('**/pom.xml'), and reuse it across builds
- Self-hosted runner with cache: run critical pipelines on self-hosted runners with a persistent volume
- Docker layer cache: use docker/build-push-action with the BuildKit cache so the full image is not rebuilt
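The first fix could look like the fragment below for a Maven project. The `~/.m2/repository` path is Maven's default local repository, and the `mvn -B package` step is an assumed build command.

```yaml
steps:
  - uses: actions/checkout@v4
  - uses: actions/cache@v4
    with:
      path: ~/.m2/repository
      # Exact hit when no pom.xml changed ...
      key: ${{ runner.os }}-maven-${{ hashFiles('**/pom.xml') }}
      # ... otherwise fall back to the most recent cache with this prefix.
      restore-keys: |
        ${{ runner.os }}-maven-
  - run: mvn -B package
```

The `restore-keys` fallback means a dependency bump still starts from a mostly warm cache instead of an empty one.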

Case 3: No plugin equivalent, CI features regress

Symptom: Jenkins uses 50+ plugins and the GHA action marketplace has no match for some of them; the team loses first-class support for things like SonarQube quality gates, Jira integration, and custom reports.

Root cause: the Jenkins plugin ecosystem has accumulated over 20+ years, the GHA marketplace over 5; some niche plugins have no equivalent action in GHA.

Fixes:

- API-based integration: call the vendor API directly with curl instead of depending on a plugin / action
- Write your own action: implement critical features as a composite / JavaScript action and publish it to the marketplace
- Retire old plugins: audit real plugin usage during the translation; 80% can be retired

Case 4: Self-hosted runner setup + scaling

Symptom: the production workload needs GPU / large-memory runners; GHA hosted runner specs are not enough, the team turns to self-hosted runners, and finds scaling / security / monitoring more complex than with Jenkins agents.

Root cause: GHA self-hosted runners are ephemeral, and scaling needs a runner controller (actions-runner-controller on K8s); this corresponds to Jenkins agents / the Kubernetes plugin, but the setup differs.

Fixes:

- actions-runner-controller (ARC): K8s-native runner scaling, the counterpart of the Jenkins K8s plugin
- Runner labels: route jobs with labels (runs-on: [self-hosted, gpu, linux])
- Security: ephemeral runners use short-lived tokens and do not persist secrets across jobs

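Case 5: Matrix build vs parallel stage, expressiveness gap

Symptom: Jenkins has dynamic parallelism (which stages run is decided at runtime and varies with input); a GHA matrix written as a literal is static at workflow load time and cannot express this.

Root cause: a GHA matrix is declarative and is expanded when the workflow is parsed; deciding stages at runtime takes an if: condition plus multiple jobs, or a matrix computed by an earlier job.

Fixes:

- Dynamic matrix: run a jobs.set-matrix job first to compute the matrix and output JSON, then have later jobs use strategy.matrix: ${{ fromJSON(needs.set-matrix.outputs.matrix) }}
- Conditional jobs: write each dynamic stage as a separate job and control it with if:
- Redesign: 90% of dynamic logic can be turned into a static matrix + condition; purely runtime-dynamic logic is usually over-engineering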
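The dynamic matrix fix can be sketched as follows. The step id `compute` and the hard-coded JSON are assumptions for illustration; a real pipeline would compute the list from changed files or inputs.

```yaml
jobs:
  set-matrix:
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.compute.outputs.matrix }}
    steps:
      - id: compute
        # Emit the matrix as a single-line JSON string.
        run: echo 'matrix={"suite":["unit","integration"]}' >> "$GITHUB_OUTPUT"

  test:
    needs: set-matrix
    runs-on: ubuntu-latest
    strategy:
      # Job outputs are strings, so fromJSON is needed to turn it back into an object.
      matrix: ${{ fromJSON(needs.set-matrix.outputs.matrix) }}
    steps:
      - uses: actions/checkout@v4
      - run: echo "running ${{ matrix.suite }}"
```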

Capacity / cost

維度	Self-managed Jenkins	GitHub Actions
Compute cost	EC2 + agent licenses	per-minute billing（free tier + over-cap）
Operational FTE	0.5-1.5 FTE	0.1-0.3 FTE
Plugin / action ecosystem	20+ 年成熟	5 年快速成長
Cold start	Agent ready < 1 min	Hosted runner 30-60s spin-up
Self-hosted scaling	Jenkins K8s plugin	ARC（actions-runner-controller）
Security	Self-managed VPC + secret	OIDC + repository secret + environment
Migration cost	-	1-3 FTE × 1-3 個月

判讀：100+ pipeline organization 切 GHA 通常 6-12 月 ROI 持平、之後省 ops cost；< 30 pipeline 早就該切。

整合 / 下一步

跟 GitLab CI 對位

GitLab CI YAML 語法跟 GHA 接近、shared library 對應 include:、self-hosted runner 對等；Jenkins → GitLab CI migration 流程跟本文鏡像對稱、3-tier translation pipeline 通用。

跟 Circle CI 對位

CircleCI orb 對等 GHA composite action；跨 SaaS CI 切換比 Jenkins → GHA 簡單（都 YAML-based）。

反向 migration（GHA → Jenkins）

少數 enterprise（金融 / 政府）合規要求 self-hosted CI / on-prem；GHA → Jenkins 鏡像對稱、注意 Jenkins shared library 表達力更強、reusable workflow 內 dynamic 邏輯可不必拆。

Next topics

Mixing reusable workflows and composite actions: reusable workflows suit cross-repo orchestration; composite actions suit logic encapsulation within a single repo
OIDC + cloud deploy: replace long-lived cloud credentials with OIDC tokens; a GHA migration is a good opportunity to make this upgrade
Cost optimization: minute-based billing means high-volume CI needs monitoring and budget alerts
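The OIDC item can be sketched as a deploy job. This is a minimal sketch that assumes AWS as the cloud; the role ARN, account ID, region, and the final command are placeholders, and the IAM role is assumed to already trust GitHub's OIDC provider.

```yaml
# Sketch only: AWS is an assumption; the role ARN and region are placeholders.
permissions:
  id-token: write   # lets the job request an OIDC token
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/example-deploy-role
          aws-region: us-east-1
      # Confirms which identity the short-lived credentials resolve to.
      - run: aws sts get-caller-identity
```

No long-lived access key is stored in repository secrets; the credentials exist only for the duration of the job.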

Deployment of this blog project

Wed, 06 May 2026 00:00:00 +0000

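The deployment of this blog project is one concrete case of deploying a static front-end site. This folder records only what the project actually uses (Hugo, Pagefind, Playwright, GitHub Pages, and the Claude workflows) and does not treat those details as general rules for every CI/CD setting.

Project positioning

The deployment artifact of this project is a static website. Hugo generates the HTML, Pagefind generates the search index, GitHub Pages does the hosting, and Playwright verifies search and layout behavior.

Document	Responsibility
GitHub Actions workflow	Records the actual configuration in this project's `.github/workflows/`

Relationship to general CI/CD

This folder is the instance layer. General gate principles, differences between deployment settings, and the failure-handling flow live in the parent articles; this folder only answers "how is this blog project deployed today, and where do I look when it fails". Term definitions link back to the CI knowledge cards.

Next-step routing

This project's workflows: read GitHub Actions workflow.
General front-end deployment notes: read Front-end deployment CI/CD.
CI gate principles: read CI gate and workflow boundaries.
Markdown CI rules: read Blog Markdown writing conventions and mdtools checks.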

CI step silent hang: the time vacuum is the signal, and a happy log is an anti-signal

Thu, 28 May 2026 00:00:00 +0000

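Core issue: when a CI step appears to "run for a long time and then time out", separate "genuinely not enough time" from "a silent hang consumed the time", because the two need completely different fixes. The signal of a silent hang is a long time vacuum between the last happy log line and the cancel, not a final error message. After a first misattribution, the second failure is not the moment to raise the timeout again; it is the moment to stop and reread the detailed log.

Case outline: this blog's Playwright CI kept timing out. The initial diagnosis of "missing cache + timeout too tight" led to adding a cache and bumping the timeout, and the step still timed out. Rereading the detailed log showed that the chromium download finished in 2 seconds, followed by 24 min 31 s with no log output at all before the cancel: an extract-zip regression in Playwright 1.59 on Node.js 24.16.0 (microsoft/playwright#41000, upstream nodejs/node#63487). After upgrading to Playwright 1.60.0 the step dropped from a 25-minute hang to 22 seconds.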

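1. In a silent hang, the happy log is an anti-signal

When a CI step times out, the first instinct is to look at how long the step ran. A step that hits a 15-minute timeout and gets killed reads as "not enough time, bump the timeout". That intuition matches one failure mode: the step really needs 16 minutes to finish.

A second failure mode looks much the same but needs a completely different fix: the silent hang. After some point the step stops emitting log output while the process keeps running (no crash) until the external timeout kills it. On the surface it matches "not enough time" (the step runs for a long time and is then cancelled), but the root cause is a stuck process that will not finish no matter how much time it gets.

The key signal for a silent hang is a long time vacuum between the last happy log line and the cancel message. A "happy log" is a message that looks like success (for example: download 100% complete, build succeeded, X tests passed). These messages are especially misleading because they suggest the task is making progress. The last line before a silent hang is usually exactly this kind of happy log, so it signals the opposite of a normal finish.

Three timeout patterns compared

Signal	Likely root cause	Fix
Steady progress throughout the step, still progressing when the timeout hits	Genuinely not enough time	Bump the timeout
A failure message (exception / non-zero exit) followed by a timeout	Logic error in the code	Read the message and fix it
A long time vacuum after the last log line, then cancel	Silent hang, possibly an upstream bug	Search the upstream issue tracker; do not raise the timeout

The third pattern is the easiest to misjudge, because "no output between log lines" is not treated as a signal, yet the absence of messages is itself the signal. People who write debug logs remember to add error messages, but a silent hang usually sits at a wait point inside the tool that emits no log, so there is no error message to read.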

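2. Why the initial diagnosis of "missing cache + bump timeout" was a false positive

The first read of the failing CI log surfaces three easy observations:

timeout-minutes: 15 in the workflow YAML
the step ran for 15m 6s (almost exactly the timeout limit)
the step is named Install Playwright browsers (a 170 MiB download)

The intuitive conclusion: "missing cache + timeout too tight". It looks like it ought to be right, because both are well-known optimization points for Install Playwright browsers. The fix: add actions/cache and bump the timeout to 25 min.

The step still timed out after the fix, this time at 25m 6s (again right at the limit).

The signal here should have been "the same step still hits the limit with a 1.67× timeout". If time were the problem, the run should land somewhere in the middle after the bump (say, finishing at 18-20 min). Hitting the limit every time means the step never ends on its own: it is a hang.

During the initial diagnosis it is easy to skip this signal and keep asking "is the cache step misconfigured?". That line of attribution is wrong, because the underlying assumption that the cache is the bottleneck was never verified.

Anatomy of one round of false positive

Step	Easy thing to do	What to do instead
See the timeout	Assume "not enough time"	First separate "not enough time" from "silent hang"
Read the high-level log	Assume "slow download"	Compare timestamps before and after the download
Propose a fix	Add cache + bump timeout	First confirm the bottleneck really is the download
Fix still fails	Assume "cache did not hit"	Recognize that the same step hitting the limit again is a hang signal

Each step is reasonable on its own; together they polish a false positive into something ever more refined. The anatomy applies to any situation where a first diagnosis is acted on without verification, not only CI.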

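3. The R in WRAP is a stop signal on the second failure

The R (Reality Test) in the WRAP decision framework asks: "what evidence would show that this approach works?" It is not only a pre-decision check; it is also a stop signal after consecutive failures.

Continuing to raise the timeout in the same direction after the second failure is autopilot. At this point WRAP should prompt three questions:

"Two fixes of the same kind have failed; is the underlying assumption wrong?"
"Do I have the data to judge where it is really stuck?" (data sufficiency gate)
"What is the base rate for this kind of problem?" (base-rate thinking)

The stop signal fires when fixes in the same direction have failed twice in a row, not three times. The second failure is the point to go back to the data; a third attempt wastes a cycle and reinforces the wrong assumption.

In this case the right move after the second failure was to stop, grep the timestamp sequence in the detailed log, and find the 24-minute gap between "download complete" and "cancel". Only then was the silent hang confirmed. Without that turn, the third attempt would most likely have been "an even larger timeout" or "a different cache key", and it would still have failed.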

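4. How to read the detailed log: look for the stretch with no output

Step logs on CI platforms are usually long, and the eye skips things when scanning. When a silent hang is suspected, do not read sequentially; collect three timestamps and compute one gap:

Step start timestamp (usually in the log header)
Step end (cancel / fail) timestamp
Timestamp of the last meaningful output line
The time vacuum from #3 to #2

A large vacuum (> 1 minute) plus a happy log at #3 means a silent hang is likely.

With GitHub Actions, the concrete approach using the gh CLI: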

# Fetch all log lines for one step (filter by step name)
gh run view <run-id> --log --job <job-id> | rg "Install Playwright browsers"

# Grab the last few lines to inspect the tail of the vacuum
gh run view <run-id> --log --job <job-id> | rg "Install Playwright browsers" | tail -3

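The tail of the log in this case (simplified):

2026-05-27T09:59:44.110Z  | 100% of 170.4 MiB
2026-05-27T10:24:15.201Z  ##[error]The operation was canceled.

A 24 min 31 s vacuum, with "download 100% complete" as the last happy log line: silent hang confirmed.

The core of this reading is that the time vacuum takes priority over message content. Engineers are used to scanning message content for error keywords, but a silent hang has no error keyword to find, only a time vacuum. It becomes visible only after switching to a different type of signal.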

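5. Search priority for upstream issues

Once a silent hang is confirmed, the next step is usually to search the upstream issue tracker rather than keep reasoning about the root cause. A silent hang is mostly a bug in a tool or dependency rather than a mistake in your own config, because a config error usually produces an error message and is not silent.

Query strategy: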

gh api 'search/issues?q=repo:<owner>/<repo>+<keywords>+is:issue&per_page=10&sort=updated'

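The key is to choose symptom words, not guess words, as keywords. Symptom words describe what was actually observed (hangs after download, stuck during extract); guess words describe a presumed root cause (slow, timeout, network issue). Guess words return a flood of unrelated issues; symptom words usually hit directly.

In this case, the second result for the query playwright install hangs chromium was issue #41000, whose title matched exactly: "playwright install chromium hangs after download completes on Node.js 24.16.0 (extract-zip)". The issue points to upstream nodejs/node#63487 and gives two workarounds (upgrade to Playwright 1.60.0 or pin Node 24.15.0). From query to confirmed root cause took under 5 minutes.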
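The Node-pinning workaround can be sketched as workflow steps. This is a minimal sketch, not this project's actual workflow: the step layout, the `npm ci` install, and the 5-minute step timeout are assumptions; only the version number comes from the issue's workarounds. The other workaround (Playwright 1.60.0) is a dependency bump in package.json and needs no workflow change.

```yaml
# Sketch only: step layout and the timeout value are assumptions.
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-node@v4
    with:
      node-version: '24.15.0'   # pin to the release before the regression
  - run: npm ci
  - name: Install Playwright browsers
    run: npx playwright install --with-deps chromium
    # A tight step-level limit makes any future hang fail fast.
    timeout-minutes: 5
```

An exact pin is easy to forget, so it is worth leaving a comment that links the upstream issue and removing the pin once the fix has shipped.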

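Why the issue tracker should come before self-reasoning

An engineer's instinct is to work out the root cause alone. With problems like a CI silent hang, though, the root cause usually lies in a subtle interaction between tool version, runtime version, OS, and container image, not in your own codebase. What reasoning cannot find has often already been reported in the community issue tracker.

The trade-off between "reason first, then search" and "search first, then reason":

Problem scope	What comes first	Why
Logic bug inside your own codebase	Reason	You know it best; reasoning is usually faster
Upstream tool / runtime / OS / container	Search issues	You lack upstream knowledge; reasoning tends to get stuck on a wrong assumption
Boundary between the two (your config triggers an upstream bug)	Both in parallel	Search for known issues while reasoning about your own config

A silent hang falls into the second category by default, so the issue tracker should be searched first.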

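6. Putting it together: signal → action mapping

The lessons from this case, organized as a reusable signal table:

Signal	Action
Step times out and the last line is a happy log	Compute the timestamp vacuum and check for a silent hang
Two fixes in the same direction have both failed	Stop, go back to the data, and stop adding timeout / retry
Silent hang confirmed	Search the upstream issue tracker with symptom words
Issue found, with a workaround	Apply the workaround; do not reason first
No issue found	Only then return to self-debugging and add verbose logging (`DEBUG=` env)

The order of this table matters: the action in each row is the precondition for the next row. Skip any row and the later judgments rest on a wrong assumption.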
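The last row of the table can be sketched as a workflow step. This is a minimal sketch for the Playwright case only: the `pw:install` namespace is an assumption about which Playwright debug logger covers browser installation, and the 5-minute limit is illustrative.

```yaml
# Sketch only: the DEBUG namespace and the timeout value are assumptions.
- name: Install Playwright browsers (verbose)
  run: npx playwright install chromium
  env:
    DEBUG: pw:install   # assumed namespace for install-time debug output
  # Keep the limit tight so a repeat hang surfaces quickly.
  timeout-minutes: 5
```

With verbose output enabled, the last line before the vacuum names the internal phase that stalled, which gives a sharper symptom word for the next issue search.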

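Scope

The principle "in a silent hang, the happy log is an anti-signal" applies to all non-interactive processes (CI, cron jobs, background workers, container init):

Docker build stalls (especially RUN apt-get / npm install / pip install): the same silent hang pattern
CI cache restore stalls: cache operations over many small files can hang silently
Database migration stalls: a schema change plus a long transaction can hang silently
Any process cancelled close to its timeout limit: check for a silent hang before proposing a fix

The principle "the R in WRAP is a stop signal on the second failure" is not limited to CI; it applies wherever fixes in the same direction keep failing: debugging, configuration tuning, performance optimization.

References

microsoft/playwright issue #41000: the upstream issue for this case (Playwright 1.57-1.59 extract-zip hang on Node 24.16.0)
nodejs/node issue #63487: the upstream Node 24.16 extract-zip / yauzl regression
Article on this blog: Operating the R stage of the WRAP decision framework (detailed use of the Reality Test)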

Auto-debugging CI build failures with Claude Code GitHub Actions

Wed, 04 Mar 2026 00:00:00 +0000

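What this is

Claude Code GitHub Actions lets Claude take part directly in your GitHub workflow. Main features:

Interactive assistant: mention @claude in a PR or issue comment and Claude analyzes the code and replies
Automatic code review: reviews the changes when a PR is opened
CI debugging and repair: analyzes the error and fixes it when a build fails

See the official documentation for the full feature list.

Setup

`/install-github-app` (recommended)

Run /install-github-app in the Claude Code terminal; it walks you through the whole setup.

Key steps in the flow:

Choose a repo: specify the GitHub repository to install into
Install the Claude GitHub App: installed automatically on the chosen repo, with Read & Write permission on Contents, Issues, and Pull requests
Choose an authentication method: choosing the long-lived token generates an OAuth token and stores it in GitHub Secrets as CLAUDE_CODE_OAUTH_TOKEN
Create the workflow files: two workflows are created and pushed automatically:
- claude.yml: @claude interactive replies
- claude-code-review.yml: automatic PR code review

No further setup is needed afterwards.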

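Manual setup (using an Anthropic API key)

If you prefer not to use /install-github-app, you can do it by hand:

Go to github.com/apps/claude and install the App on your repo
In the repo, open Settings → Secrets and variables → Actions and add ANTHROPIC_API_KEY
Create the workflow files under .github/workflows/ yourself

How the two authentication methods differ:

Authentication	Secret name	Intended for
OAuth Token	`CLAUDE_CODE_OAUTH_TOKEN`	Pro/Max users; set up automatically by `/install-github-app`
API Key	`ANTHROPIC_API_KEY`	Direct Anthropic API use; obtain the key manually at console.anthropic.com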

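Adding automatic CI debugging

The workflows created by /install-github-app only handle @claude interaction and code review. To trigger a Claude fix automatically when a build fails, modify the existing deploy workflow.

First, add the permissions Claude needs (the workflow may originally have only contents: read):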

permissions:
  contents: write        # Claude needs to write the fixed files
  pull-requests: write   # Claude may need to open a PR
  issues: write          # Claude reports the result
  pages: write           # existing deploy permission
  id-token: write        # existing deploy permission

Then add the Claude debugging logic around the build step:

# Add continue-on-error and an id to the existing build step
- name: Build
  id: hugo-build
  # An explicit `shell: bash` enables pipefail, so a hugo failure is not masked by tee's exit code
  shell: bash
  run: hugo --minify 2>&1 | tee hugo-build-output.txt
  continue-on-error: true

# Trigger Claude debugging when the build fails
- name: Claude Debug on Build Failure
  if: steps.hugo-build.outcome == 'failure'
  uses: anthropics/claude-code-action@v1
  with:
    # Pick one, depending on your authentication method
    claude_code_oauth_token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
    # anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
    # The prompt is a plain string and is not run through a shell, so $(cat ...)
    # would not expand; point Claude at the log file instead.
    prompt: |
      Hugo build failed. The full error output is saved in
      hugo-build-output.txt in the workspace root. Read that file first.

      Please analyze the error, find the problematic file(s),
      fix the YAML front matter or content issue, and commit the fix.
    claude_args: "--max-turns 10"

# Rebuild after the fix to verify it
- name: Retry build after fix
  if: steps.hugo-build.outcome == 'failure'
  run: hugo --minify

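Core design:

continue-on-error: true: a build failure does not stop the job, so the later Claude step gets a chance to run
if: steps.hugo-build.outcome == 'failure': triggers only on failure, so a normal build consumes no API quota
After the fix, hugo --minify runs again to verify that the build succeeds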

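Billing

Billing depends on which authentication method you use:

Authentication	Billing source	Notes
OAuth Token	Subscription quota (Pro/Max)	Shares one quota pool with the claude.ai web app, Claude Code CLI, and Claude Desktop
API Key	Separate API billing	Pay per token used, fully separate from subscription quota

The OAuth token quota is shared, so heavy GitHub Actions use eats into your everyday quota on claude.ai and the CLI. If CI triggers frequently, switch to an API key to keep the two from affecting each other.

See the Claude pricing page for detailed rates.

Settings that reduce cost

Setting	Notes
`--max-turns 10`	Caps the number of iterations and prevents endless loops
Trigger only on `failure`	A normal build consumes no API quota
`@claude` trigger phrase	Interactive mode starts only when explicitly invoked

Pairing with CLAUDE.md

Create CLAUDE.md in the repo root. Claude reads it automatically as context, which improves the accuracy of its fixes.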
