Kubernetes on Tarragon

Kubernetes Graceful Shutdown: The Termination Sequence Is Not What You Think

Mon, 18 May 2026 00:00:00 +0000

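This article is an implementation-layer deep dive that follows the Kubernetes overview. The overview placed K8s on the spectrum of deployment platforms; this article focuses on pod termination, the issue that bites most often in production and is most deeply misunderstood: the sequence, the configuration, five failure cases, and service mesh integration.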

Graceful shutdown done wrong: every deploy eats 502s

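The most common trigger: you deploy a new image, a Prometheus alert reports a wave of 502 / 503 within 5 minutes, and the SRE digging through the application log sees "processing request" and "connection closed" alternating. The application itself has no bug. K8s and the traffic sources are simply out of step during pod termination: the old pod gets SIGKILLed while it is still handling requests, and new requests keep landing on a pod that is about to shut down.

Many teams respond by raising terminationGracePeriodSeconds from 30 to 120, which masks the problem for a while; the symptom returns in a different form at the next rolling update, HPA scale-down, or node drain. The root cause is the termination sequence: a pod is not graceful just because it received SIGTERM, and every step of the sequence has its own failure mode.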

Termination sequence: five steps, and each one can blow up

After K8s receives a delete pod request, the following happens in chronological order:

Time	Event	Source of action
t=0	API server marks the pod as Terminating	kubelet receives the delete
t=0	Pod is removed from Service Endpoints (async)	endpoint controller
t=0	kubelet runs the preStop hook (if defined)	container runtime
t=preStop ends	container receives SIGTERM	container runtime
t=terminationGracePeriodSeconds (counted from t=0)	container receives SIGKILL	container runtime

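Key misconceptions:

"Pod removed from the Service" and "container receives SIGTERM" happen in parallel, not in sequence. The endpoint controller updates the Endpoints object, kube-proxy rewrites iptables, and only then does traffic on each node actually stop. This chain usually takes 1-5 seconds, and without a preStop hook SIGTERM has already been delivered to the application in the meantime.
The preStop hook runs while the container is still running and SIGTERM has not yet been sent. Setting preStop to sleep 10 is standard production practice: the sleep gives the endpoint controller time to remove the pod from the Service, so new requests stop arriving before SIGTERM.
terminationGracePeriodSeconds is counted from the start of preStop, not from SIGTERM. With a 10s preStop sleep plus 30s of application graceful shutdown, set it to at least 40s.
Graceful shutdown is not something the framework does automatically. The application must handle SIGTERM itself: refuse new requests, wait for in-flight requests to complete, close DB connections, flush logs. A process that does not handle SIGTERM either dies immediately with requests in flight or, when it runs as PID 1 and ignores the signal, is force-killed when the grace period ends.
Readiness probes still run during Terminating, but the result does not affect Service traffic (the pod is already removed from Endpoints). However, if the application does not actively fail its readiness check, a service mesh or external LB may keep sending requests (behavior varies by mesh).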

Configuration overview

Deployment spec

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60          # total budget from t=0 (preStop + SIGTERM handling), then SIGKILL
      containers:
        - name: app
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]
          readinessProbe:
            httpGet:
              path: /healthz/ready
              port: 8080
            periodSeconds: 5
            failureThreshold: 2

Timeline: at t=0 preStop starts its 10s sleep → at t=10s the container receives SIGTERM → at t=60s SIGKILL (not t=70s: the grace period is counted from t=0, so the application has 50s after SIGTERM).

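Application handling SIGTERM (Go example)

package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

// ready backs the readiness endpoint; the /healthz/ready handler should return 503 once it is false.
var ready atomic.Bool

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)

	server := &http.Server{Addr: ":8080"}
	go server.ListenAndServe()
	ready.Store(true)

	<-sigs // wait for SIGTERM
	log.Println("SIGTERM received, draining...")

	// 1. Fail readiness (stops mesh-aware traffic)
	ready.Store(false)

	// 2. Wait 5s so at least one readiness probe (periodSeconds: 5) sees the failure;
	//    reaching failureThreshold: 2 can take up to 10s
	time.Sleep(5 * time.Second)

	// 3. Gracefully shut down the server (refuse new requests, wait for in-flight ones)
	ctx, cancel := context.WithTimeout(context.Background(), 45*time.Second)
	defer cancel()
	if err := server.Shutdown(ctx); err != nil {
		log.Printf("shutdown: %v", err)
	}

	// 4. Close DB / cache / message consumer here, e.g. db.Close(), consumer.Stop()

	// 5. Flush logs (e.g. logger.Sync()) and return so the process exits
}

Key point: server.Shutdown(ctx) refuses new requests and waits for in-flight ones. Set the ctx timeout to the grace period minus the preStop sleep and the readiness-fail wait (60s - 10s - 5s = 45s).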

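Production failure drills

Case 1: 502 / 503 during rolling update

Symptoms: within 5 minutes after every deploy the LB / ingress log shows a wave of 502 / 503, the application log shows "context canceled" and "connection closed by peer", and the new pods are already ready while the old pods still receive requests during their grace period.

Root cause: no preStop sleep is set. The container calls server.Shutdown() as soon as it receives SIGTERM, but kube-proxy has not yet removed the old pod from iptables, so new requests keep being routed to the old pod, which already refuses them.

Fix: add a preStop sleep 10 so that endpoint propagation finishes before the SIGTERM flow begins.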

Case 2: Connection drain race, long-running requests interrupted

Symptoms: after a deploy, the application log shows many context canceled errors on long-running endpoints (for example report generation or file upload); users see failed transactions, while short requests are unaffected.

Root cause: the long-running endpoint takes longer than terminationGracePeriodSeconds, or the ctx timeout passed to server.Shutdown(ctx) is too short, so in-flight requests are forcibly interrupted.

Fix:

Make the long-running endpoint async (background job + status endpoint); the HTTP request returns a job ID immediately
Short term: raise terminationGracePeriodSeconds to the long-running 99th percentile + buffer
Application-side ctx timeout = grace period - preStop - readiness-fail wait
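The async pattern from the first fix can be sketched as an in-memory job store: submitting returns an ID right away, a status lookup replaces the long-held HTTP connection, and shutdown waits for running jobs. JobStore and its methods are hypothetical names for illustration, HTTP wiring is omitted, and a real service would persist job state outside the pod.

```go
package main

import (
	"fmt"
	"sync"
)

// JobStore tracks background jobs in memory (illustration only).
type JobStore struct {
	mu     sync.Mutex
	wg     sync.WaitGroup
	nextID int
	status map[string]string
}

func NewJobStore() *JobStore {
	return &JobStore{status: make(map[string]string)}
}

// Submit registers the job, starts it in the background, and returns its ID
// immediately, so the HTTP handler can respond without waiting for the work.
func (s *JobStore) Submit(work func() error) string {
	s.mu.Lock()
	s.nextID++
	id := fmt.Sprintf("job-%d", s.nextID)
	s.status[id] = "running"
	s.mu.Unlock()

	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		result := "done"
		if err := work(); err != nil {
			result = "failed"
		}
		s.mu.Lock()
		s.status[id] = result
		s.mu.Unlock()
	}()
	return id
}

// Status is what a status endpoint would return for a job ID.
func (s *JobStore) Status(id string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if st, ok := s.status[id]; ok {
		return st
	}
	return "unknown"
}

// Drain blocks until every submitted job has finished; call it during shutdown.
func (s *JobStore) Drain() { s.wg.Wait() }

func main() {
	store := NewJobStore()
	id := store.Submit(func() error { return nil })
	store.Drain()
	fmt.Println(id, store.Status(id)) // prints "job-1 done"
}
```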

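Case 3: Init container restarts during the grace period, SIGTERM never reaches main

Symptoms: the pod shows Terminating but its phase stays Running, the main container's restart count goes up by 1, and the application log never shows "SIGTERM received".

Root cause: an init container uses restartPolicy: Always (the K8s 1.28+ sidecar pattern), or the main container crashes before SIGTERM and triggers a restart. After the restart the kubelet does not resend SIGTERM, so the main container runs until the grace period ends and is SIGKILLed directly.

Fix:

Give sidecar containers (restartPolicy: Always) a preStop sleep as well, so they share the main container's lifecycle
Keep the main container from being restarted during shutdown: a failing readinessProbe never restarts a container, but a failing livenessProbe or a crash does, so keep the liveness check passing while draining and watch for CrashLoopBackOff
Check the events in kubectl describe pod: if SIGTERM was never sent, the Killing container event is missing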

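Case 4: StatefulSet terminates serially, total time = pod count × grace period

Symptoms: a StatefulSet rolling update / scale-down is N times slower than a Deployment (N = replica count); deploying a 5-replica StatefulSet takes more than 5 minutes.

Root cause: StatefulSet defaults to podManagementPolicy: OrderedReady, so pods are terminated and created one at a time, and each pod has to finish terminating (up to its full grace period) before the next one is touched. A Deployment with RollingUpdate terminates pods in parallel, with a default maxUnavailable of 25%.

Fix:

Switch the StatefulSet to podManagementPolicy: Parallel (if the application does not require strict ordering); note that this setting applies to scaling operations, while rolling updates still replace pods one at a time
For strict-ordering workloads (Cassandra / Kafka / etcd), keep OrderedReady, but set the grace period to what a single pod needs, not to the total time you can tolerate
Accept the serialization cost and schedule deploys in low-traffic windows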

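Case 5: Job / CronJob not graceful, SIGTERM goes straight to SIGKILL

Symptoms: a CronJob does not shut down gracefully on Job timeout or pod eviction; half-written files remain on the PVC and turn out corrupt on the next run; the application log has no "SIGTERM received" and simply stops.

Root cause: when a Job's activeDeadlineSeconds expires or a node eviction fires, K8s still sends SIGTERM to the Job pod, but many batch frameworks (Spring Batch / Argo Workflow worker) do not handle SIGTERM, and the application does not checkpoint on its own.

Fix:

Have the batch application handle SIGTERM, write a progress checkpoint to storage, and resume on the next run
For batches that cannot checkpoint, guarantee idempotent re-runs so that a re-run after SIGKILL does not corrupt data
Add terminationGracePeriodSeconds to the Job's pod template (default 30; batch usually needs 60-300)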
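The checkpoint-and-resume fix can be sketched as follows. This is a minimal illustration under assumptions: processBatch, saveCheckpoint, loadCheckpoint, and demo are hypothetical names, the checkpoint is a single integer in a local file, and cancellation is simulated. In a real job the context would come from signal.NotifyContext with syscall.SIGTERM and the file would live on the PVC.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// saveCheckpoint writes the next index to process using write-then-rename,
// so a SIGKILL mid-write cannot leave a half-written checkpoint file.
func saveCheckpoint(path string, next int) error {
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, []byte(strconv.Itoa(next)), 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

// loadCheckpoint returns 0 when no checkpoint exists yet.
func loadCheckpoint(path string) (int, error) {
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return 0, nil
	}
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(strings.TrimSpace(string(data)))
}

// processBatch resumes from the stored checkpoint, checkpoints after each
// item, and stops between items once ctx is cancelled. It returns the index
// the next run should start from.
func processBatch(ctx context.Context, path string, items []string, handle func(string)) (int, error) {
	next, err := loadCheckpoint(path)
	if err != nil {
		return 0, err
	}
	for next < len(items) {
		if ctx.Err() != nil {
			return next, nil
		}
		handle(items[next])
		next++
		if err := saveCheckpoint(path, next); err != nil {
			return next, err
		}
	}
	return next, nil
}

// demo interrupts the first run during item 2, then resumes in a second run.
func demo() (int, int, error) {
	dir, err := os.MkdirTemp("", "ckpt")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "checkpoint")
	items := []string{"a", "b", "c", "d", "e"}

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	handled := 0
	first, err := processBatch(ctx, path, items, func(string) {
		handled++
		if handled == 2 {
			cancel() // simulate SIGTERM arriving while item 2 is processed
		}
	})
	if err != nil {
		return first, 0, err
	}
	second, err := processBatch(context.Background(), path, items, func(string) {})
	return first, second, err
}

func main() {
	first, second, err := demo()
	fmt.Println(first, second, err) // prints "2 5 <nil>"
}
```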

Impact at scale

The cost of graceful shutdown shows up mainly in deploy time and capacity buffer:

Scale factor	Impact
terminationGracePeriod 60s	Single-pod deploy ~70-80s worst case (grace period, which already includes preStop, plus new pod startup)
Deployment with 100 replicas + maxSurge 25%	Full deploy ~5-10 minutes; needs 25% extra capacity (25-replica buffer)
StatefulSet serial + 60s grace	10 replicas take about 10-12 minutes; the deploy window should fall in low-traffic hours
HPA scale-down running alongside graceful shutdown	scale-down triggers → preStop + grace + new metrics → next scaling decision; average reaction cycle ≈ 3-5 minutes

Practical defaults:

Web service: terminationGracePeriodSeconds: 60, preStop sleep 10, application graceful shutdown 45s
Backend worker (queue consumer): terminationGracePeriodSeconds: 120, no preStop sleep (controlled via readiness), application finishes the current message + commits the offset
Batch job: terminationGracePeriodSeconds: 300, checkpoint pattern
StatefulSet (DB / queue): align the grace period with vendor recommendations (Kafka 90s, PostgreSQL 60s)
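The backend worker default ("finish the current message + commit the offset") can be sketched as a consume loop that only checks for cancellation between messages. Message, consume, and demoConsume are hypothetical names; a real consumer would use its broker client's own receive and commit calls.

```go
package main

import (
	"context"
	"fmt"
)

// Message is a stand-in for a broker message (illustration only).
type Message struct {
	Offset int
	Body   string
}

// consume processes messages until ctx is cancelled or the channel closes.
// Cancellation is only acted on between messages, so the message in hand is
// always handled and committed before the loop exits.
func consume(ctx context.Context, msgs <-chan Message, handle func(Message) error, commit func(offset int)) int {
	processed := 0
	for {
		if ctx.Err() != nil {
			return processed
		}
		select {
		case <-ctx.Done():
			return processed
		case m, ok := <-msgs:
			if !ok {
				return processed
			}
			if err := handle(m); err != nil {
				return processed // left uncommitted so it can be redelivered
			}
			commit(m.Offset)
			processed++
		}
	}
}

// demoConsume cancels the context while the message at offset 1 is handled.
func demoConsume() (processed, lastCommitted int) {
	msgs := make(chan Message, 5)
	for i := 0; i < 5; i++ {
		msgs <- Message{Offset: i, Body: fmt.Sprintf("msg-%d", i)}
	}
	close(msgs)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	lastCommitted = -1
	processed = consume(ctx, msgs,
		func(m Message) error {
			if m.Offset == 1 {
				cancel() // simulate SIGTERM arriving mid-message
			}
			return nil
		},
		func(offset int) { lastCommitted = offset },
	)
	return processed, lastCommitted
}

func main() {
	fmt.Println(demoConsume()) // prints "2 1"
}
```

The message being handled when the signal arrives is still committed, and the remaining three stay in the queue for another consumer.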

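Integration with other components

Service mesh (Istio / Linkerd)

The service mesh sidecar (envoy / linkerd-proxy) has its own termination too, and it should usually shut down slightly later than the main container. Configuration principles:

terminationGracePeriodSeconds is a pod-level field shared by all containers, so give the mesh sidecar a drain window 5-10s longer than the main container's shutdown through the mesh's own setting (for example Istio's terminationDrainDuration); the sidecar stops only after main has finished
Istio 1.12+ controls startup order with holdApplicationUntilProxyStarts in the proxy.istio.io/config annotation; shutdown order needs matching treatment
mTLS adds one more step to graceful shutdown: after SIGTERM, let the mesh proxy close its connections (and finish any in-progress certificate rotation) on its own instead of cutting it off

Readiness probe and mesh-aware traffic

Plain K8s Service (kube-proxy iptables): after endpoint removal, established connections still run to completion and no new connections arrive. Mesh-aware traffic (service mesh / external LB with health check): it stops only once readiness fails.

Fix: the first step of application graceful shutdown is ready.Store(false), then wait for the readiness probe to fail at least once (5-10s), and only then call server.Shutdown.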
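The readiness flip needs a handler that reads the same flag. A minimal sketch, assuming the /healthz/ready path from the Deployment spec; readyHandler and probe are hypothetical names, and probe uses net/http/httptest to stand in for the kubelet's HTTP check.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// ready is flipped to false as the first step of shutdown.
var ready atomic.Bool

// readyHandler backs the readiness probe path: 200 while serving,
// 503 once the application has started draining.
func readyHandler(w http.ResponseWriter, _ *http.Request) {
	if ready.Load() {
		w.WriteHeader(http.StatusOK)
		return
	}
	w.WriteHeader(http.StatusServiceUnavailable)
}

// probe simulates a single readiness check and returns the status code.
func probe() int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/healthz/ready", nil)
	readyHandler(rec, req)
	return rec.Code
}

func main() {
	ready.Store(true)
	fmt.Println(probe()) // prints "200"

	ready.Store(false) // first step after SIGTERM
	fmt.Println(probe()) // prints "503"
}
```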

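Conflict with Pod Disruption Budget (PDB)

During node drain, the PDB limits how many pods may be unavailable at once, and a long graceful shutdown can stall the drain. Countermeasures:

Emergency drain (node hardware failure): kubectl drain <node> --grace-period=30 --force, accepting a short burst of 502s
Normal drain (upgrade / maintenance): set minAvailable: on the PDB and let each pod shut down gracefully at its own pace
Do not set maxUnavailable: 0; it blocks the drain completely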
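Why maxUnavailable: 0 blocks a drain follows from how many evictions a PDB allows at a given moment: healthy pods minus the number that must stay healthy. The sketch below models that arithmetic with absolute numbers only; allowedDisruptions and allowedWithMaxUnavailable are hypothetical helpers, and percentage values are not handled.

```go
package main

import "fmt"

// allowedDisruptions returns how many more pods may be evicted right now
// under a PDB that requires minAvailable healthy pods.
func allowedDisruptions(healthy, minAvailable int) int {
	if d := healthy - minAvailable; d > 0 {
		return d
	}
	return 0
}

// allowedWithMaxUnavailable expresses the same budget for maxUnavailable:
// the number that must stay healthy is desired - maxUnavailable.
func allowedWithMaxUnavailable(desired, healthy, maxUnavailable int) int {
	return allowedDisruptions(healthy, desired-maxUnavailable)
}

func main() {
	fmt.Println(allowedDisruptions(5, 4))           // prints "1": drain proceeds one pod at a time
	fmt.Println(allowedWithMaxUnavailable(5, 5, 0)) // prints "0": every eviction is refused
}
```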

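Next steps

Writing graceful applications: the disposability chapter of the 12-factor app gives a framework-agnostic template; for each language's SDK, see the corresponding framework
Graceful queue consumers: message ack / offset commit must complete within the SIGTERM window, otherwise messages are duplicated; see the consumer-design section of the 03 message queue module
Graceful shutdown across regions / clusters: during traffic shifts, a multi-cluster service mesh (Istio multicluster / Linkerd multicluster) behaves differently from a single cluster and needs matching mesh configuration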

Docker Swarm → Kubernetes: Where 5 Swarm Production Clusters Hit the Wall

Tue, 19 May 2026 00:00:00 +0000

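This article is a cross-vendor migration playbook that cross-links Docker Swarm and Kubernetes. Running the 6-dimension audit from migration-playbook-methodology maps Paradigm = High (Swarm's simple container orchestration → the K8s declarative resource model), which makes this a Type E paradigm shift.

Where 5 Swarm production clusters hit the wall

Observing the Swarm production cluster lifecycle of 5 mid-sized organizations from 2020 to 2024, the typical points where they hit the wall were:

Cluster	Scale (peak)	Wall hit	Year migration was triggered
A (SaaS startup)	80 services / 12 nodes	Rising service discovery latency, no sidecar mesh	2022
B (E-commerce)	150 services / 25 nodes	Hand-written rolling update + canary logic grew complex	2023
C (Fintech)	60 services / 15 nodes	Self-managed secret rotation + RBAC, compliance difficult	2023
D (Media)	200 services / 40 nodes	Hand-written autoscaling, traffic prediction failed	2024
E (Logistics)	100 services / 20 nodes	No multi-region support	2024

Patterns common to the 5:

Swarm is simple, but its ceiling is 100-200 services / 20-40 nodes
Cross-service governance (mesh / RBAC / secret / autoscale) requires add-on tools, and the resulting complexity exceeds K8s
No native multi-region, which limits disaster recovery
The ecosystem is shrinking, community activity is low, and new features are slow

The wall is not "Swarm can't run it" but "Swarm will not solve cross-service governance for you; you have to write it yourself". Kubernetes is not simpler; it brings the governance problems into the framework.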

Why migrate: three drivers, ceiling / ecosystem / multi-region

Driver	Trigger
Ceiling	Past 100-200 services on Swarm, service discovery latency / scheduling cannot keep up
Ecosystem	The K8s ecosystem (Helm / Operator / mesh / GitOps) is mature; Swarm lacks equivalent tools
Multi-region	Not supported by Swarm; K8s multi-cluster federation is mature

Reverse drivers (K8s → Swarm):

Purely internal tools / small scale (< 30 services), where K8s is overly complex
Edge / IoT scenarios, where Swarm's small footprint helps

6-dimension audit

Dimension	Level
Schema / API	High (docker-compose stack.yml → K8s YAML, completely different syntax)
Operational	Medium (self-managed Swarm → K8s self-hosted or managed)
Paradigm	High (simple container orchestration → declarative resource model)
Components	Low (still a single orchestration system)
Application change	Low (container image unchanged)
Data topology	Low

Schema + Paradigm both High → primarily a Type E paradigm shift, with the high Schema dimension covered in its own section.

Paradigm mapping

Concept	Swarm	K8s
Workload unit	Service	Deployment + Pod + Service
Stack definition	stack.yml (docker-compose format)	YAML manifest (multiple resources)
Networking	Overlay network (built-in)	CNI plugin (Calico / Cilium / etc)
Service discovery	DNS-based built-in	DNS-based (CoreDNS) + Service object
Load balancing	Built-in routing mesh	Service + Ingress + LoadBalancer
Secret management	Docker secrets	K8s Secret + external Vault / Secrets Manager
Rolling update	`docker service update --image ...`	Deployment + rolling update + readiness probe
Autoscaling	Manual scale	HPA (Horizontal Pod Autoscaler)
RBAC	Limited (Swarm enterprise)	First-class (Role / RoleBinding / ServiceAccount)
Persistent storage	Volume + driver plugin	PV / PVC + CSI driver
Service mesh	None (needs an add-on such as Traefik)	Istio / Linkerd / Cilium
GitOps	None native	Argo CD / Flux (first-class)

Schema gap: docker-compose vs K8s YAML

# Docker Swarm stack.yml
version: '3.8'
services:
  webapp:
    image: myapp:1.0
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
      restart_policy:
        condition: on-failure
    networks:
      - frontend
    ports:
      - "8080:8080"
networks:
  frontend:
    driver: overlay

# K8s equivalent (Deployment + Service; Ingress omitted)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels: { app: webapp }
  template:
    metadata:
      labels: { app: webapp }
    spec:
      containers:
        - name: webapp
          image: myapp:1.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  name: webapp
spec:
  selector: { app: webapp }
  ports:
    - port: 8080
      targetPort: 8080

1 Swarm service → 2-3 K8s resources (Deployment + Service + possibly Ingress / HPA); the application does not change, but the deployment-side workload is 5-10x.
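For the audit step it can help to count the K8s resources each Swarm service will expand into. The sketch below is a rule-of-thumb mapping, not a converter: SwarmService and its fields are a hypothetical simplification of a stack.yml entry, and ExternalHost / Autoscale are assumed inputs that stack.yml itself does not carry.

```go
package main

import "fmt"

// SwarmService is a simplified, hypothetical view of one stack.yml service.
type SwarmService struct {
	Name           string
	Replicas       int
	PublishedPorts []int
	ExternalHost   string // assumed input: hostname that needs HTTP routing
	Autoscale      bool   // assumed input: replaces a hand-written scaling script
}

// k8sKinds lists the K8s resource kinds one Swarm service expands into.
func k8sKinds(s SwarmService) []string {
	kinds := []string{"Deployment"}
	if len(s.PublishedPorts) > 0 {
		kinds = append(kinds, "Service")
	}
	if s.ExternalHost != "" {
		kinds = append(kinds, "Ingress")
	}
	if s.Autoscale {
		kinds = append(kinds, "HorizontalPodAutoscaler")
	}
	return kinds
}

func main() {
	webapp := SwarmService{Name: "webapp", Replicas: 3, PublishedPorts: []int{8080}}
	fmt.Println(k8sKinds(webapp)) // prints "[Deployment Service]"
}
```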

Migration flow

Partial migration + hybrid architecture

Same Type E pattern as Kafka ↔ NATS / etcd → Consul:

1. Audit applications: list all Swarm stacks + services
2. Classify and plan:
   - Simple stateless: move to K8s first (low risk)
   - Stateful (DB / queue): evaluate a K8s operator or keep on Swarm
   - Critical service: run both in parallel to confirm K8s behaves equivalently
3. Build the K8s cluster:
   - Managed (EKS / GKE / AKS) vs self-host (kubeadm)
   - Set up ingress controller / cert-manager / monitoring
4. Migrate applications (per stack)
   - Write K8s YAML / Helm chart
   - Configure readiness/liveness probes / resource requests
   - Map networking + secrets
5. Cutover + Swarm decommission
   - Once some stacks are cut over, decide whether to keep Swarm (legacy / edge)
   - Most organizations decommission Swarm entirely

3-6 months overall, depending on the number of stacks and application complexity.

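Production failure drills

Case 1: Networking model differs, cross-service connectivity breaks

Symptoms: after cutover, service A fails to connect to service B; the Swarm-side DNS name tasks.service_b does not resolve once the service lives at service-b.namespace.svc.cluster.local on the K8s side.

Root cause: inside a Swarm overlay network, service-to-service calls use the short name (service_b), while K8s uses service-b.namespace.svc.cluster.local (Service names cannot contain underscores); the service URL is hard-coded in the application.

Fix:

Use the short name + cluster DNS search domain in the application
On K8s, keep the default dnsPolicy: ClusterFirst and verify the names against kubectl get svc -A
If NetworkPolicy defaults to deny-all, add explicit allow rules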
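While hard-coded URLs are being cleaned up, the name mapping can be made mechanical. A minimal sketch, assuming the default cluster domain cluster.local; k8sServiceHost is a hypothetical helper that only rewrites the name and does not reproduce the per-task resolution that Swarm's tasks. prefix provides.

```go
package main

import (
	"fmt"
	"strings"
)

// k8sServiceHost maps a Swarm service name to the K8s Service DNS name:
// it drops the Swarm-only "tasks." prefix, replaces underscores (not allowed
// in K8s Service names) with hyphens, and appends the namespace and the
// assumed default cluster domain.
func k8sServiceHost(swarmName, namespace string) string {
	name := strings.TrimPrefix(swarmName, "tasks.")
	name = strings.ToLower(strings.ReplaceAll(name, "_", "-"))
	return name + "." + namespace + ".svc.cluster.local"
}

func main() {
	fmt.Println(k8sServiceHost("tasks.service_b", "shop")) // prints "service-b.shop.svc.cluster.local"
}
```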

Case 2: Secret rotation moves from Swarm secrets to Vault / Secrets Manager

Symptoms: Swarm rotated secrets with docker secret; after moving to K8s, a K8s Secret is a static value and rotation is not automatic.

Root cause: K8s Secret is K8s-native but not auto-rotated; rotation needs an external Vault / Secrets Manager plus an agent (vault-agent-injector / external-secrets-operator).

Fix:

Deploy external-secrets-operator on K8s, integrated with AWS Secrets Manager / Vault
Have the application read secrets from a mounted file or env variable instead of hard-coding them
Rotation happens on the vendor side; a sidecar on the K8s side reloads automatically

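Case 3: No readiness probe, traffic loss during rolling update

Symptoms: after cutover, 5-10% of requests fail during deploys; pods turn out to receive traffic before startup completes.

Root cause: Swarm's simple restart_policy has no equivalent of probes. Without a readiness probe, K8s treats a pod as ready as soon as its containers start, so an application with a long startup receives traffic before it is ready.

Fix:

Always add a readiness probe: HTTP / TCP / exec check
Configure an initial delay: reserve 30-60s for JVM applications
Configure minReadySeconds: set 30s on the Deployment to ensure stability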

Case 4: HPA is not enabled by default, autoscaling stops working

Symptoms: the Swarm side had a cron-based autoscale script; after moving to K8s the script no longer works and traffic peaks do not scale up.

Root cause: K8s HPA is not enabled by default; it needs explicit configuration plus a metrics-server install.

Fix:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Install metrics-server / KEDA (event-driven autoscaling) and configure an HPA per Deployment.
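To sanity-check what the HPA above will do at a given load, the core scaling rule can be sketched as desired = ceil(current × currentUtilization / targetUtilization), clamped to minReplicas and maxReplicas. desiredReplicas is a hypothetical helper; the sketch ignores the controller's tolerance band and stabilization windows.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the basic HPA ratio and clamps the result to the
// configured bounds. Utilization values are percentages, e.g. 70 for 70%.
func desiredReplicas(current int, currentUtil, targetUtil float64, minReplicas, maxReplicas int) int {
	desired := int(math.Ceil(float64(current) * currentUtil / targetUtil))
	if desired < minReplicas {
		return minReplicas
	}
	if desired > maxReplicas {
		return maxReplicas
	}
	return desired
}

func main() {
	fmt.Println(desiredReplicas(3, 140, 70, 3, 20))  // prints "6": CPU at twice the target doubles the pods
	fmt.Println(desiredReplicas(10, 210, 70, 3, 20)) // prints "20": capped at maxReplicas
}
```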

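Case 5: YAML maintenance hell, Helm / Kustomize adopted too late

Symptoms: after cutover, 5 files (Swarm stacks) become 50+ K8s manifests; changing a single application config means touching N files.

Root cause: K8s YAML is very verbose, unlike the concise docker-compose; templating and environment abstraction are missing.

Fix:

Helm chart: package each application as a chart and abstract environment differences in values.yaml
Kustomize: base + overlay pattern, no templating required
GitOps with Argo CD / Flux: declarative deployment, fewer manual kubectl operations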

Capacity / cost

Dimension	Docker Swarm	Kubernetes (managed)
Cluster cost (mid-tier)	$300-800 / mo	$500-1500 / mo (EKS/GKE/AKS control plane + nodes)
Operational FTE	0.3-0.8	0.5-1.5 (drops to 0.3-0.7 when managed)
Ecosystem maturity	Low, declining	High, active growth
Multi-region	Not supported	Multi-cluster federation is mature
Migration cost	-	2-4 FTE × 3-6 months
Long-term ROI	Negative (shrinking community)	Positive (feature growth)

Reading: small organizations with < 30 services can stay on Swarm; at 50+ services you start hitting the Swarm ceiling and an evaluation is worthwhile; at 100+ services / multi-region, migration is a must.

Integration / next steps

Service mesh integration

After cutover, evaluate an Istio / Linkerd / Cilium service mesh to cover mTLS / observability / traffic policy, but do not roll out a mesh immediately after the Swarm migration; do it in phases.

GitOps integration

K8s + Argo CD / Flux is a natural pair; go straight to GitOps during migration to avoid accumulating manual kubectl operations.

Alignment with Vault → AWS Secrets Manager

Swarm secrets → K8s Secret → external secrets management is a 3-step evolution, not 1 step; use K8s Secret first during migration, then switch to Vault / Secrets Manager.
