Paradigm-Shift on Tarragon

從 Firestore 遷往自建 relational：撞牆驅動的 Type E 重建模、存取模型反轉與並行期

Tue, 16 Jun 2026 00:00:00 +0000

本文是 Firestore overview 的 migration playbook。寫作參照 Migration Playbook 寫作方法論。BaaS 託管平台整場遷出的資產線盤點與並行期總覽見 10.3 託管形態遷出；本文聚焦資料層的跨 paradigm 重建模。

「我們把 Firestore 整包匯出，匯進 PostgreSQL 就好。」這句話低估了遷移的真正內容 — Firestore 遷往自建 relational 的難點是反轉整個存取模型，搬資料只是其中最容易的一條線。Firestore 是 client 用 SDK 直連資料庫、授權寫在 Security Rules；自建 relational 是 client 打自己的後端 API、授權在後端中介層。資料可以匯出，但反正規化的 document 形狀、沿查詢限制長出來的資料模型、realtime listener 與 offline 同步能力，都沒有 1:1 的對應物。字面意義的「匯出再匯入」只搬走了最容易的那部分。本文走 paradigm shift 結構：先講為何字面遷移不成立、再講哪些該遷哪些先留、最後才是階段化執行。

遷移的 driver：三面牆，不是「relational 比較好」

Firestore 遷往自建很少因為「relational 比較好」這種空泛動機，而是撞到 0.21 BaaS 段描述的三面具體的牆。先確認 driver 真的成立、再啟動遷移：

Driver	撞牆訊號	遷移要解的問題
報表 / 分析查詢	跨 collection 報表查不出來、已經在維護資料複製管線	把資料放回支援 JOIN / aggregation 的 relational
成本曲線轉折	read / write 計費隨流量線性成長、超過自建 + cache 的成本	用自管資料庫 + 應用層快取壓低單位成本
授權控制面失控	Security Rules 長到難以測試 / review、授權邏輯沒有版本治理	把授權拉回後端 API 中介層、可測試可審查

No-go condition：產品仍以多裝置 realtime 同步與 offline-first 為核心賣點、且查詢需求簡單、成本仍在舒適區 → 先不要遷。這些正是 Firestore 的主場，硬遷會把 realtime / offline 這層平台白送的能力變成自己要重建的工程。遷移前先問「撞的是哪面牆」，三面牆都沒撞到就是 0.22 講的偽自建。

逐能力遷出是常態而非整包搬離：0.22 的「成長期 SaaS」例子就是只把撞牆的資料層搬到自管 PostgreSQL、認證留在原平台。本文預設的也是這種逐能力遷出 — 遷的是資料層，不一定連認證、儲存一起搬。

6 維 diff audit：主導維度是 paradigm + application change

遷移前先盤點 source 跟 target 的差異落在哪幾維、決定 playbook 結構：

維度	Firestore → 自建 relational	程度
Schema / API	document / collection → 正規 table、SDK query → 後端 API + SQL	High
Operational model	serverless 全託管 → 自管 / managed 資料庫、自己擔 backup / failover	High
Paradigm	client 直連 + 規則授權 → API 中介 + 後端授權	High
Components 數量	單一平台 → 新增一層自建後端服務 + 資料庫	High
Application change	前端拔 SDK 改打 API、realtime / offline 要重建	High
Data topology	平台複製 → 自己設計 replica / 多 region / DR	Medium

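The dominant dimensions are paradigm and application change; five of the six dimensions rate High. That fixes the playbook structure as a Type E paradigm shift, which rules out both a Type A schema translation and a Type B drop-in replacement: the access model is inverted, some capabilities have to be rebuilt, and the end state may be a long-lived hybrid (self-hosted data layer, authentication still on the platform).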

為什麼字面遷移不成立：存取模型反轉

Firestore 的存取模型是 前端即客戶端、資料庫直接面向公網、授權在規則層；自建 relational 是 前端打後端、後端面向資料庫、授權在服務層。這個反轉是遷移的核心難點，不在資料搬運。

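Denormalized documents → normalized schema:

To work around its query limits, Firestore data models often copy related data redundantly into the same document (one fact stored in several places)
Moving to relational means splitting those redundant copies back into normalized tables and rebuilding the foreign-key relationships. This is reverse engineering: you first have to understand why the data was stored that way
Conversely, some nested document structures are cheaper to keep as JSONB in the relational database (see PostgreSQL jsonb); not every document has to be split into tables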
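A minimal sketch of that split, assuming a hypothetical order document that embeds a copy of its customer plus a list of items. The field names (customer, items, metadata) and the three table names are illustrative and not taken from any real schema. The second helper shows why the redundant copies need a conflict check before they are collapsed into one row.

```python
from typing import Any


def normalize_order(doc_id: str, doc: dict[str, Any]) -> dict[str, list[dict[str, Any]]]:
    """Split one denormalized order document into rows for three tables."""
    customer = doc["customer"]  # embedded (redundant) copy of the customer
    return {
        "customers": [{"id": customer["id"], "name": customer["name"]}],
        "orders": [{
            "id": doc_id,
            "customer_id": customer["id"],        # foreign key replaces the copy
            "metadata": doc.get("metadata", {}),  # nested blob kept JSONB-style
        }],
        "order_items": [
            {"order_id": doc_id, "position": i, "sku": item["sku"], "qty": item["qty"]}
            for i, item in enumerate(doc.get("items", []))
        ],
    }


def merge_customers(batches):
    """Deduplicate customers and report ids whose redundant copies disagree."""
    merged, conflicts = {}, set()
    for rows in batches:
        for customer in rows["customers"]:
            previous = merged.setdefault(customer["id"], customer)
            if previous != customer:
                conflicts.add(customer["id"])  # needs a human or a rule to resolve
    return merged, conflicts
```

Conflicting copies are exactly the cases where "why was it stored this way" has to be answered before backfill; the sketch only surfaces them and keeps the first copy seen.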

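Security Rules authorization → backend authorization:

Firestore's authorization logic is scattered across the Security Rules DSL; the migration has to translate every rule into a permission check in the backend API
This translation is security-sensitive: one missed rule opens an unauthorized-query hole (see 1.5 data-layer red team)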
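As an illustration of what translating one rule can look like, the sketch below assumes a hypothetical owner-only read rule on a notes collection and restates it as a service-layer check. Principal, Forbidden, get_note and the ownerId field are made-up names for the example; a real backend would plug the same predicate into its own framework.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical source rule being translated:
#   match /notes/{noteId} {
#     allow read: if request.auth != null && resource.data.ownerId == request.auth.uid;
#   }


@dataclass(frozen=True)
class Principal:
    uid: Optional[str]  # None models an unauthenticated request


class Forbidden(Exception):
    pass


def can_read_note(principal: Principal, note: dict) -> bool:
    # Default deny: only an authenticated owner passes
    return principal.uid is not None and note.get("ownerId") == principal.uid


def get_note(principal: Principal, note_id: str, store: dict) -> dict:
    note = store.get(note_id)
    # A missing note and a forbidden note raise the same error,
    # so callers cannot probe which ids exist
    if note is None or not can_read_note(principal, note):
        raise Forbidden(note_id)
    return note
```

Keeping each translated rule as a small pure predicate makes the rule-by-rule comparison reviewable and lets the red-team pass test the predicates directly.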

SDK 直連 → API 中介：

前端原本用 Firestore SDK 直接讀寫，遷移後要拔掉 SDK、改打自建 API
這是 application 層的大改，不是資料庫換連線字串

realtime listener / offline persistence → 自己重建：

snapshot listener 的即時推送、offline 讀寫快取，是平台白送的能力
自建要用 WebSocket / SSE 重建即時層（見 03 訊息佇列與 presence 設計）、用前端本地儲存重建 offline — 這是遷移最容易被漏估的工作量

所以遷移的第一步不是匯資料，是盤點 application 對 Firestore 的所有依賴面：查詢路徑、授權規則、realtime 訂閱、offline 行為。這份清單決定哪些能直接遷、哪些要重建、哪些先留在平台。

哪些該遷、哪些先留（逐能力混合）

Type E 的本質是不收斂 — 不必把所有 Firebase 能力一次搬完。判讀標準：

Workload / 能力特徵	去向
需要報表 / JOIN / aggregation 的資料	遷自建 relational
讀取量大、成本敏感、access pattern 穩定的資料	遷自建 + 應用層快取
仍以 realtime 同步為核心、查詢簡單的資料	先留 Firestore / 或最後再遷
認證（Firebase Auth）	可留平台、逐能力決定（見 0.22）
檔案儲存（Firebase Storage）	可留平台、與資料層解耦後再評估

0.22 的成長期 SaaS 是這個判讀的 case anchor：撞牆的是資料層的 query 複雜度與成本，遷的就是資料層，認證留在原地。混合不是過渡失敗，是逐能力選型的穩態。

Phase plan：存取模型反轉的階段化

paradigm shift 的階段化把不可逆動作放到最後、每階段有獨立驗證門檻：

Phase 1：依賴面盤點

列出 application 對 Firestore 的所有讀寫路徑、Security Rules 授權條件、realtime 訂閱點、offline 行為。標每項的頻率、安全敏感度、是否可重建。這份清單不完整不進下一階段。

Phase 2：relational 重建模

把反正規化 document 設計回正規 schema、決定哪些巢狀結構用 JSONB 保留。同步設計後端 API 的端點與授權檢查、把 Security Rules 逐條翻譯成服務層權限。對應 1.2 schema design 與 1.5 資料層紅隊。

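Phase 3: self-hosted backend + dual-write

Stand up the self-hosted backend API and database, and have the critical front-end write paths write to both Firestore and the new backend. Firestore remains the source of truth while the new database accumulates data. Dual-write must compensate when one side fails (see 1.9 Reconciliation).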
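A minimal sketch of the compensation idea, assuming the primary write (Firestore) must succeed and a failed secondary write is retried and then queued for reconciliation. DualWriter, repair_queue and the two write callables are hypothetical names; real Firestore and database clients would be injected in their place.

```python
import time
from typing import Callable


class DualWriter:
    """Primary write must succeed; secondary failures are queued for repair."""

    def __init__(self, write_primary: Callable, write_secondary: Callable, retries: int = 2):
        self.write_primary = write_primary
        self.write_secondary = write_secondary
        self.retries = retries
        self.repair_queue: list[dict] = []

    def write(self, key: str, value: dict) -> None:
        # Source of truth first; if this raises, the caller sees the error
        self.write_primary(key, value)
        last_error = ""
        for _ in range(self.retries + 1):
            try:
                self.write_secondary(key, value)
                return
            except Exception as exc:  # secondary must never break the user path
                last_error = repr(exc)
        # Retries exhausted: record enough to replay or reconcile by hand
        self.repair_queue.append(
            {"key": key, "value": value, "error": last_error, "ts": time.time()}
        )
```

The in-memory list stands in for a durable queue or table; the point is that a divergence is always recorded rather than silently dropped.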

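Phase 4: backfill historical data

Convert the existing Firestore documents to the new schema and write them into the new database. When backfill runs in parallel with dual-write, write ordering matters: backfill must not overwrite newer values written by dual-write. Record checksums / row counts during conversion for comparison.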
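One way to keep backfill from clobbering dual-write is a version-guarded upsert. The sketch below uses SQLite from the standard library so it runs as-is; the users table and the updated_at column are assumptions for the example. PostgreSQL accepts the same ON CONFLICT ... DO UPDATE ... WHERE shape.

```python
import sqlite3

# The WHERE clause makes the update a no-op unless the incoming row is newer
UPSERT_IF_NEWER = """
INSERT INTO users (id, name, updated_at) VALUES (?, ?, ?)
ON CONFLICT(id) DO UPDATE SET
    name = excluded.name,
    updated_at = excluded.updated_at
WHERE excluded.updated_at > users.updated_at
"""


def upsert_if_newer(conn: sqlite3.Connection, row: tuple) -> None:
    conn.execute(UPSERT_IF_NEWER, row)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT, updated_at INTEGER)")

# Dual-write lands first with a newer timestamp...
upsert_if_newer(conn, ("u1", "new-from-dual-write", 200))
# ...then backfill arrives with the older historical value and is ignored
upsert_if_newer(conn, ("u1", "old-from-backfill", 100))
# Rows that only exist in history are inserted normally
upsert_if_newer(conn, ("u2", "only-in-backfill", 100))
```

If both writers go through the same guarded statement, the arrival order of backfill and dual-write stops mattering, provided every document carries a trustworthy update timestamp or version.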

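Phase 5: shadow-read verification

Read paths query both Firestore and the new backend, compare the results, and log differences, while users are still served from Firestore. Cutover starts only once the diff rate drops to an acceptable level. See the evidence method in 1.7 Schema Migration Rollout.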
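A rough sketch of the shadow-read wrapper, assuming two injected read functions and an optional normalize step that hides intentional modeling differences. ShadowReader and its counters are hypothetical names for the example.

```python
from collections import Counter
from typing import Any, Callable


class ShadowReader:
    """Serve reads from the old store while comparing against the new one."""

    def __init__(self, read_old: Callable[[str], Any], read_new: Callable[[str], Any],
                 normalize: Callable[[Any], Any] = lambda r: r):
        self.read_old = read_old
        self.read_new = read_new
        self.normalize = normalize
        self.total = 0
        self.diffs: Counter = Counter()

    def read(self, key: str) -> Any:
        old = self.read_old(key)
        self.total += 1
        try:
            new = self.read_new(key)
            if self.normalize(new) != self.normalize(old):
                self.diffs["mismatch"] += 1
        except Exception:
            # A failing new backend is evidence, never a user-facing error
            self.diffs["new_backend_error"] += 1
        return old  # the user always gets the old store's answer

    def diff_rate(self) -> float:
        return sum(self.diffs.values()) / self.total if self.total else 0.0
```

In production the comparison would usually run off the request path; it is inline here only to keep the example short.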

Phase 6：漸進 cutover + 重建即時層

前端逐步把讀寫從 Firestore SDK 切到自建 API（按比例 / 按功能模組），保留切回能力。若產品需要 realtime，這階段要把 snapshot listener 換成自建即時層（WebSocket / SSE）並驗證延遲與斷線重連。cutover 完成後資料層的 source of truth 轉到自建；未遷的能力（認證、儲存）仍在平台 — 混合架構成立。

Evidence：每階段的前進依據

每個階段用資料證明可前進、不靠感覺：

階段	Evidence
dual-write	雙寫成功率、寫入失敗補償紀錄、兩邊 document / row 數差異
backfill	已轉換比例、轉換錯誤數、checksum 對照、反正規化還原正確性抽查
shadow read	新舊結果差異率、差異分類（建模差異 vs 真錯誤）、授權翻譯漏洞掃描
cutover	切流比例、新 API latency p99、error rate、realtime 推送延遲、rollback 是否觸發

這些 evidence 對齊 4.20 Observability Evidence Package（Source / Time range / Query link / Owner / Data quality）與 6.8 release gate。授權翻譯這項要特別當成 gate 條件 — 它是安全邊界、不只是功能正確性。

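Cutover and rollback decisions

A failed database cutover is expensive, and here authorization correctness is also at stake, so decision rights must be written down:

cutover window: pick a low-traffic period and a defined traffic ladder (e.g. 1% → 10% → 50% → 100%); cutting over per feature module is safer than cutting over the whole site
rollback condition: new API error rate / latency above threshold, abnormal shadow-read diff rate, or a discovered authorization-translation gap → switch back to Firestore
decision owner: who can call a stop, on what evidence, recorded in the 8.19 incident decision log
realtime continuity: if the realtime layer switches at the same time, verify that subscriptions are not interrupted during the switch, or announce the brief degradation explicitly

See rollback window and rollback condition.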
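The traffic ladder and the rollback condition can be sketched as two small functions. The hashing scheme, metric names and limits below are assumptions for illustration; the actual thresholds belong to the decision owner.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Stable bucket in [0, 100) so a user stays on one side between requests."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def use_new_backend(user_id: str, rollout_percent: int) -> bool:
    # Raising the percentage only adds users; nobody flips back and forth
    return bucket(user_id) < rollout_percent


def should_roll_back(metrics: dict, limits: dict) -> bool:
    # An authorization gap is a hard stop regardless of the other numbers
    if metrics.get("authz_gap_found"):
        return True
    return any(metrics[name] > limits[name] for name in limits)
```

Because the bucket is derived from the user id, moving from 10% to 50% keeps the original 10% on the new path, which keeps per-user behavior consistent across ladder steps.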

Cleanup 與長期混合

Type E 的 cleanup 通常不是「關掉整個 Firebase」— 多數情況認證、儲存仍留平台：

已遷資料路徑的 Firestore collection、Security Rules、dual-write code path 退役
shadow read 比對 code 移除
前端殘留的 Firestore SDK 依賴清掉（資料層已不走它）
但 Firebase Auth / Storage 若仍在用，保留；明確標示哪條資料路徑的 source of truth 是自建庫、哪條仍在平台
Firestore 的資料匯出備份保留到確認新庫穩定，對應 10.3 的並行期退役判準

混合架構不是遷移失敗、是逐能力選型的穩態 — 撞牆的資料層自建、沒撞牆的認證 / 儲存留在平台。

失敗模式

production 常見的 5 個踩雷：

Case 1：只匯資料、漏了存取模型反轉

把 Firestore 匯出匯進 PostgreSQL 就以為遷完、忘了前端還在打 SDK、授權還在 Security Rules。修法：依賴面盤點是 Phase 1、資料搬運只是其中一條線，存取模型反轉才是主體。

Case 2：Security Rules 翻譯漏洞

把規則翻成後端授權時漏一條、開了越權查詢的洞、上線後資料外洩。修法：授權翻譯要逐條對照 + 紅隊驗證（1.5）、當成 cutover gate 條件、不是功能 bug。

Case 3：反正規化還原錯誤

document 的冗餘副本拆回 table 時還原錯關係、新庫資料關聯接錯。修法：Phase 2 先讀懂當初為何反正規化、backfill 後抽查還原正確性、shadow read 比對抓出建模差異。

Case 4：低估 realtime / offline 重建工作量

以為遷資料庫就好、上線才發現 snapshot listener 與 offline 同步整層要自己重建、進度爆炸。修法：依賴面盤點就把 realtime 訂閱點與 offline 行為標出來、列入工作量、必要時這層最後遷或先保留。

Case 5：dual-write 一邊失敗沒補償

dual-write 時新庫寫成功 Firestore 失敗（或反之）、兩邊分歧、cutover 後資料不完整。修法：dual-write 要有失敗補償（記錄、重試、標記人工對帳），對應 1.9 Reconciliation。

Anti-recommendation：產品仍重度依賴 realtime / offline、或團隊還沒有自建後端與資料庫的營運能力（backup、failover、授權設計）→ 先不要遷。可先把一塊撞牆最明顯、realtime 需求最低的資料（例如報表來源資料）試點、累積自建營運經驗再擴大。

容量與成本：crossover 判讀

遷移的成本判讀關鍵是 遷移後的總帳、不是只看 Firestore 帳單：

遷移當下：高 read 流量下，自管資料庫 + 應用層快取的單位成本常低於 Firestore 的 per-read 計費
但要加回自建的隱性成本：後端服務的開發與維運、資料庫的 backup / failover / 擴容、realtime 層的重建與維護、團隊人力
判讀分層：撞到成本牆且已有後端團隊 → 自建總帳通常划算；仍是小團隊、realtime 是核心、流量不大 → Firestore 的「平台白送能力」可能仍比自建總帳便宜

Scope warning：crossover 隨流量形狀、region pricing、團隊成本結構變動、無通用閾值。遷移省下的 Firestore 帳單要扣掉自建後端 + 資料庫 + 即時層的維運成本後再比，不是直接拿兩邊資料庫帳單對照。

接回 0.6 成本、風險與選型取捨、1.10 KV / Document DB 容量規劃。

邊界與整合

跟其他遷移路徑的關係

保留 document model：若只是要逃離 Firestore 的查詢限制、但 document 形狀仍適合，遷 MongoDB 比遷 relational 的 paradigm 跨度小、不必反正規化還原
整包託管遷出：若連認證、儲存一起搬離 Firebase，整場資產線盤點與並行期走 10.3 託管形態遷出、本文是其中資料層那一條
反向視角：哪些資料當初就不該進 Firestore（報表來源、強一致交易），見 Firestore overview 的不適用場景

Sibling 與 cross-link

Firestore overview — 服務定位與查詢邊界
1.6 資料庫轉換實作 — 通用 dual-write / shadow read / cutover 框架
1.5 資料層紅隊 — Security Rules 授權翻譯的安全驗證
1.9 Reconciliation 與 Data Repair — dual-write 失敗補償與資料對帳
從 RDS / MongoDB 遷往 DynamoDB — 同為 Type E paradigm shift 的對照（方向相反：遷入 NoSQL vs 遷出 BaaS）
0.21 交付形態選型 / 0.22 能力級買 vs 建 — 遷移 driver 的選型層背景

Docker Swarm → Kubernetes：5 個 Swarm production cluster 撞牆數據

Tue, 19 May 2026 00:00:00 +0000

本文是跨 vendor migration playbook、cross-link Docker Swarm 跟 Kubernetes。跑 migration-playbook-methodology 6 維 audit 後對映 Paradigm = High（Swarm 簡單 container orchestration → K8s declarative resource model）→ Type E paradigm shift。

5 個 Swarm production cluster 撞牆數據

從 2020-2024 觀察 5 個中型 organization 的 Swarm production cluster lifecycle、典型撞牆點：

Cluster	規模 (peak)	撞牆點	觸發遷移時間
A (SaaS startup)	80 service / 12 node	service discovery latency 升、無 sidecar mesh	2022
B (E-commerce)	150 service / 25 node	rolling update + canary 邏輯自寫複雜	2023
C (Fintech)	60 service / 15 node	secret rotation + RBAC 自管、合規難	2023
D (Media)	200 service / 40 node	autoscaling 自寫、預測流量失敗	2024
E (Logistics)	100 service / 20 node	multi-region 不支援	2024

5 個共同 pattern：

Swarm 簡單但 ceiling 100-200 service / 20-40 node
跨 service 治理（mesh / RBAC / secret / autoscale）需要外掛工具、複雜度反超 K8s
無 multi-region native、災備受限
生態縮、社群活躍度低、新 feature 緩

撞牆點不是「Swarm 跑不動」、是「Swarm 不會幫你解 跨 service 治理 問題、要自寫」。Kubernetes 不是 simpler、是 把治理問題納入框架。

為什麼遷：ceiling / ecosystem / multi-region 三條 driver

Driver	觸發
Ceiling	Swarm 跑 100-200 service 後 service discovery latency / scheduling 跟不上
Ecosystem	K8s ecosystem (Helm / Operator / mesh / GitOps) 成熟、Swarm 對等工具缺
Multi-region	Swarm 不支援、K8s 多 cluster federation 成熟

反向 driver（K8s → Swarm）：

純 internal tool / 小規模（< 30 service）、K8s 過度複雜
Edge / IoT scenario、Swarm footprint 小

6 維 audit

維度	等級
Schema / API	High（docker-compose stack.yml → K8s YAML、syntax 完全不同）
Operational	Medium（Swarm 自管 → K8s self-host or managed）
Paradigm	High（簡單 container orchestration → declarative resource model）
Components	Low（同 1 個 orchestration 系統）
Application change	Low（container image 不變）
Data topology	Low

Schema + Paradigm 雙 High → Type E paradigm shift 為主、Schema 高維獨立段。

Paradigm 對位

概念	Swarm	K8s
Workload unit	Service	Deployment + Pod + Service
Stack 定義	stack.yml (docker-compose 格式)	YAML manifest (multiple resources)
Networking	Overlay network (built-in)	CNI plugin (Calico / Cilium / etc)
Service discovery	DNS-based built-in	DNS-based (CoreDNS) + Service object
Load balancing	Built-in routing mesh	Service + Ingress + LoadBalancer
Secret management	Docker secrets	K8s Secret + 外部 Vault / Secrets Manager
Rolling update	`docker service update --image ...`	Deployment + rolling update + readiness probe
Autoscaling	手動 scale	HPA (Horizontal Pod Autoscaler)
RBAC	Limited (Swarm enterprise)	First-class (Role / RoleBinding / ServiceAccount)
Persistent storage	Volume + driver plugin	PV / PVC + CSI driver
Service mesh	無 (要外掛 Traefik)	Istio / Linkerd / Cilium
GitOps	無 native	Argo CD / Flux (first-class)

Schema gap：docker-compose vs K8s YAML

# Docker Swarm stack.yml
version: '3.8'
services:
  webapp:
    image: myapp:1.0
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
      restart_policy:
        condition: on-failure
    networks:
      - frontend
    ports:
      - "8080:8080"
# Top-level definition for the network referenced by the service
networks:
  frontend:
    driver: overlay

# K8s equivalent (Deployment + Service shown; an Ingress, if needed, is a further resource)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels: { app: webapp }
  template:
    metadata:
      labels: { app: webapp }
    spec:
      containers:
        - name: webapp
          image: myapp:1.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  name: webapp
spec:
  selector: { app: webapp }
  ports:
    - port: 8080
      targetPort: 8080

One Swarm service becomes 2-3 K8s resources (Deployment + Service, plus possibly Ingress / HPA). The application itself does not change, but the deployment-side work grows by roughly 5-10x.

Migration 流程

Partial migration + 混合架構

跟 Kafka ↔ NATS / etcd → Consul 同 Type E pattern：

1. Audit application: list every Swarm stack and service
2. Classification plan:
   - Simple stateless: move to K8s first (low risk)
   - Stateful (DB / queue): evaluate a K8s operator, or keep on Swarm
   - Critical service: dual-run period to confirm K8s behaves equivalently
3. K8s cluster build:
   - Managed (EKS / GKE / AKS) vs self-hosted (kubeadm)
   - Set up ingress controller / cert-manager / monitoring
4. Application migration (per stack)
   - Write K8s YAML / Helm chart
   - Configure readiness/liveness probes and resource requests
   - Map networking and secrets
5. Cutover + Swarm decommission
   - Once some stacks are cut over, evaluate whether Swarm stays (legacy / edge)
   - Most organizations decommission Swarm entirely

整體 3-6 個月、依 stack 數量跟 application 複雜度。

Production 故障演練

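Case 1: networking model differences break cross-service connectivity

Symptom: after cutover, service A cannot reach service B. The Swarm-side DNS name tasks.service_b has no working counterpart at service-b.namespace.svc.cluster.local on the K8s side.

Root cause: inside a Swarm overlay network, services address each other by short name (service_b), while K8s resolves Services through cluster DNS (short name within the same namespace, FQDN across namespaces). K8s Service names must also be valid DNS labels, so the underscore in service_b cannot carry over. The application had the service URL hard-coded.

Fix:

On the application side, use the short name and rely on the cluster DNS search domain
On the K8s side, keep the default dnsPolicy: ClusterFirst and confirm the mapping with kubectl get svc -A
Set NetworkPolicy to default deny-all and add explicit allow rules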
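The name mapping in this case can be made mechanical during the audit. The helper below is a sketch with a hypothetical name (swarm_to_k8s_host); it applies a conservative DNS-label rule and the default cluster.local domain, both of which should be checked against the target cluster.

```python
import re

# Conservative DNS label: lowercase, starts with a letter, no underscores
_DNS_LABEL = re.compile(r"^[a-z]([-a-z0-9]*[a-z0-9])?$")


def swarm_to_k8s_host(swarm_name: str, namespace: str = "default",
                      cluster_domain: str = "cluster.local") -> str:
    """Map a Swarm service DNS name to the FQDN of a K8s Service."""
    name = swarm_name
    # tasks.<service> returns per-task IPs in Swarm; a plain Service load
    # balances instead, so callers relying on that need a headless Service
    if name.startswith("tasks."):
        name = name[len("tasks."):]
    label = re.sub(r"[^a-z0-9-]", "-", name.lower()).strip("-")
    if len(label) > 63 or not _DNS_LABEL.match(label):
        raise ValueError(f"cannot derive a valid Service name from {swarm_name!r}")
    return f"{label}.{namespace}.svc.{cluster_domain}"
```

Running every hard-coded service URL found in the audit through a function like this gives a reviewable old-name to new-name table before cutover.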

Case 2：Secret rotation 從 Swarm secrets 換 Vault / Secrets Manager

徵兆：原本 Swarm 用 docker secret 旋轉 secret、切 K8s 後 K8s Secret 是 static value、rotation 不自動。

根因：K8s Secret 是 K8s-native 但 not auto-rotated、需要外部 Vault / Secrets Manager + agent (vault-agent-injector / external-secrets-operator)。

修法：

K8s 端 deploy external-secrets-operator + AWS Secrets Manager / Vault integration
Application 端 mount file or env variable、不在 code 寫死
Rotation 走 vendor-side、K8s 端 sidecar 自動 reload

Case 3：Readiness probe 沒設、rolling update 期間 traffic loss

徵兆：cutover 後 deploy 期間 application 5-10% request 失敗；發現 pod startup 完成前就接 traffic。

根因：Swarm 簡單 restart_policy 沒對等 probe 概念；K8s 預設 deploy 後 immediate ready、若沒 readiness probe、startup 時間長的 application 會在未 ready 時接流量。

修法：

必加 readiness probe：HTTP / TCP / exec check
配 initial delay：JVM application 預留 30-60s
配 minReadySeconds：deployment 端設 30s 確保 stable

Case 4：HPA 預設不啟、autoscaling 失效

徵兆：Swarm 端寫了 cron-based autoscale script、切 K8s 後 script 失效、流量高峰沒 scale up。

根因：K8s HPA 不是預設啟動、需要 明示配置 + metrics-server install。

修法：

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Install metrics-server (or KEDA for event-driven autoscaling) and configure an HPA for each Deployment.
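To reason about what the HPA above will do under load, it helps to work through the scaling rule from the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default) and clamped to min/max. The function below is a simplified sketch of that rule; it ignores stabilization windows, unready pods and missing metrics.

```python
import math


def desired_replicas(current: int, current_util: float, target_util: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        wanted = current  # close enough to target: no scaling
    else:
        wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))
```

With the manifest's numbers (target 70%, min 3, max 20), three pods at 140% CPU scale to six, and any spike that would need more than 20 pods is capped at 20, which is the point where the old cron-based script and the HPA can behave differently.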

Case 5：YAML 維護地獄、Helm / Kustomize 配置遲

徵兆：cutover 後 K8s YAML 從 5 個檔（Swarm stack）變 50+ 個 K8s manifest；每個 application 端要改一個 config 都要動 N 個 file。

根因：K8s YAML 是 very verbose、不像 docker-compose 簡潔；缺 templating 跟 environment 抽象。

修法：

Helm chart：對 application 包成 chart、用 values.yaml 抽象環境差異
Kustomize：base + overlay pattern、不靠 templating
GitOps with Argo CD / Flux：宣告式部署、降 manual kubectl 操作

Capacity / cost

維度	Docker Swarm	Kubernetes (managed)
Cluster cost (mid-tier)	$300-800 / mo	$500-1500 / mo（EKS/GKE/AKS control plane + nodes）
Operational FTE	0.3-0.8	0.5-1.5（除非 managed、降到 0.3-0.7）
Ecosystem maturity	低、衰退	高、active growth
Multi-region	不支援	多 cluster federation 成熟
Migration cost	-	2-4 FTE × 3-6 個月
Long-term ROI	Negative（社群縮）	Positive（feature growth）

判讀：< 30 service 小 organization 可不切；50+ service 開始撞 Swarm ceiling、值得評估；100+ service / multi-region 必切。

整合 / 下一步

跟 Service mesh 整合

Cutover 後順便評估 Istio / Linkerd / Cilium service mesh、cover mTLS / observability / traffic policy；不要在 Swarm migration 後立刻上 mesh、分階段。

跟 GitOps 整合

K8s + Argo CD / Flux 是 natural pair；migration 時直接走 GitOps、避免 manual kubectl 操作累積。

跟 Vault → AWS Secrets Manager 對齊

Swarm secrets → K8s Secret → external secrets management 是 3-step 演進、不是 1-step；migration 期間先用 K8s Secret、之後切 Vault / Secrets Manager。

Sentry → Honeycomb：trace 不是 error、是不同 observability paradigm

Tue, 19 May 2026 00:00:00 +0000

本文是跨 vendor migration playbook、cross-link Sentry 跟 Honeycomb。跑 migration-playbook-methodology 6 維 audit 後對映 Paradigm = High（error tracking ↔ wide-event observability）→ Type E paradigm shift。

Trace 不是 error、是不同 paradigm

把 Sentry → Honeycomb 當「trace tool 替換」是最常見的誤判 — Sentry trace 是 error 上下文、Honeycomb trace 是 observability 第一性：

概念	Sentry	Honeycomb
核心 paradigm	Error tracking + transaction trace	High-cardinality wide-event observability
第一性 unit	Error event	Wide event (span with N fields)
Trace 角色	Error 的「附帶 context」	Observability 主軸、每 event 是 trace span
Sampling	Error 全收 + transaction sample	Adaptive sampling、保留 anomaly
Query model	Filter + group by + aggregation	High-cardinality 多維 query (BubbleUp / heatmap)
User base	Developer (debug error)	SRE + Platform (debug system behavior)
Cost model	Per-error event + transaction	Per-event (wide event volume)

核心差異不在「Honeycomb 是 better Sentry」、在「兩者是不同 observability paradigm」：

Sentry 適合 application-level error debug — 拿到 error stack trace + minimal context、快速 fix
Honeycomb 適合 system-level behavior debug — 看流量分佈 / 多維 correlation / 異常 outlier、找 為什麼這個 user 在這個時段在這個 endpoint 慢

Migration scope 包含 paradigm reset — 不是 SDK 換、是 SRE / Dev team 對 observability 的心智模型重設。

為什麼遷：observability 成熟度 / cardinality / cost 三條 driver

Driver	觸發
Observability 成熟度	Application 規模到跨多 service / multi-tenant、Sentry error tracking 不夠細、SRE 要看 high-cardinality 多維 query
High-cardinality	Sentry tag system 限制 cardinality（~1000 unique value）、Honeycomb native 支援 millions cardinality
Cost	Per-error pricing 對 high-error volume 場景爆、Honeycomb per-event 在 wide event 場景更可預測

反向 driver（Honeycomb → Sentry）：

Pure error tracking 場景、Honeycomb wide-event 過度設計
Frontend / mobile 客戶端 error tracking、Sentry 對 web/mobile/desktop SDK 成熟度高

6 維 audit

維度	等級
Schema / API	Medium（event schema 概念不同、SDK 完全換）
Operational	Low（兩者都 SaaS、operational 對等）
Paradigm	High（error tracking ↔ wide-event observability）
Components	Low（同 1 個 observability vendor）
Application change	High（SDK 換 + instrumentation 重設計）
Data topology	Low

Paradigm = High（其他 Low-Medium）→ Type E paradigm shift；application change 雖 High 但是 paradigm 的 downstream。

結構：partial migration + 混合架構是 long-term default

跟 Kafka ↔ NATS / Redis → Memcached 同 Type E pattern：

不存在 complete migration：Sentry 對 frontend error tracking 強項、Honeycomb 對 backend system observability 強項
長期混合架構：frontend / mobile 保留 Sentry、backend / SRE 走 Honeycomb
Application 重設計：instrumentation 用 OpenTelemetry、避免 vendor SDK lock-in

Application 重設計範例

# Before: Sentry SDK
import sentry_sdk
sentry_sdk.init(dsn='https://x@sentry.io/y')

try:
    process_order(order_id)
except Exception as e:
    sentry_sdk.capture_exception(e)
    raise

# After: OpenTelemetry + Honeycomb
# (process_order, order_id, user_id, order and region are assumed to be in scope)
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(
        endpoint='https://api.honeycomb.io',
        headers={'x-honeycomb-team': 'YOUR_API_KEY'},
    ))
)
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span('process_order') as span:
    span.set_attribute('order.id', order_id)
    span.set_attribute('user.id', user_id)
    span.set_attribute('order.amount', order.amount)  # high-cardinality fields are natural here
    span.set_attribute('order.region', region)
    try:
        process_order(order_id)
        span.set_status(trace.Status(trace.StatusCode.OK))
    except Exception as e:
        span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
        span.record_exception(e)
        raise

差異：

Sentry 只 capture exception + 簡 context
Honeycomb 對每 operation 寫 wide event 含 high-cardinality field（user.id / order.amount / order.region）
SRE 端能跑 WHERE order.region = "us-west-2" AND duration > 5000 的 multi-dim query

Migration 流程

1. Audit application: list every Sentry SDK usage and capture pattern
2. Classification plan:
   - Pure error tracking (frontend): keep Sentry
   - Backend system trace: move to Honeycomb / OTel
   - Error + context (mixed): evaluate during the dual-write period
3. OpenTelemetry instrumentation:
   - Replace the vendor SDK with the OTel SDK
   - Honeycomb is an OTLP target, which decouples from vendor lock-in
4. Move backend applications to Honeycomb (3-6 months)
5. Frontend / mobile stays on Sentry
6. SRE training: Honeycomb BubbleUp / heatmap / multi-dim query

Production 故障演練

Case 1：Event schema 對位失敗、SRE 不會用 BubbleUp

徵兆：切 Honeycomb 後 SRE 用 Sentry 思維 — 找 error → fix；Honeycomb BubbleUp / heatmap 沒人會用、observability 退化到 只看 error count。

根因：Sentry → Honeycomb migration 不只是 tool 換、是 observability mindset 換；SRE 沒培訓 wide-event query / BubbleUp anomaly detection。

修法：

SRE training：1-2 週 hands-on Honeycomb BubbleUp + heatmap + multi-dim query
Migration scope 含 sample query playbook：每個 incident type 對應 Honeycomb query 寫成 runbook
保留 Sentry frontend / mobile：不要逼 SRE 全切、保留 paradigm fit 的部分

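Case 2: sampling behaves differently and production cost spikes

Symptom: in the first month on Honeycomb, event volume is 100x what Sentry received and the bill balloons.

Root cause: Sentry samples transactions (typically configured at a low rate such as 10%) and keeps every error. On the Honeycomb side every span is a wide event; with no sampling configured in the application, everything is sent and event volume explodes.

Fix:

Honeycomb Refinery (sampling proxy): deploy Refinery between the application and Honeycomb for tail-based sampling
Sample rule: keep anomalies (error / slow / outlier), drop 90%+ of boring successes
Dense cost monitoring in the first week: cardinality + event volume + cost dashboard, to catch unexpected spikes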
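The sample rule can be expressed as a small decision function. This is only an illustration of the rule's logic, not Refinery's configuration format; the function name, the 1000 ms slow threshold and the 10% keep rate are assumptions.

```python
import random
from typing import Callable, Tuple


def sample_decision(has_error: bool, duration_ms: float,
                    slow_ms: float = 1000.0, boring_keep_rate: float = 0.1,
                    rng: Callable[[], float] = random.random) -> Tuple[bool, int]:
    """Return (keep, sample_rate), where sample_rate means '1 in N was kept'."""
    if has_error or duration_ms >= slow_ms:
        return True, 1  # anomalies are always kept and count once
    keep = rng() < boring_keep_rate
    # Each kept boring trace stands in for roughly 1 / keep_rate real ones
    return keep, round(1 / boring_keep_rate)
```

Carrying the sample rate alongside each kept event is what lets counts and rates be re-weighted later, so dropping 90% of successes does not distort the dashboards used for cost monitoring.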

Case 3: error grouping stops working

Symptom: after moving to Honeycomb, similar errors are no longer grouped into one issue; SREs see each event in isolation and the failure pattern drowns in noise.

Root cause: Sentry groups errors automatically (by stack-trace fingerprint). Honeycomb has no equivalent: the wide event is first-class, and grouping requires the application to set an explicit error.type field.

Fix:

Set an error type field in the application: span.set_attribute('error.type', exception_class)
Honeycomb derived column: compute an error fingerprint with a derived column
Keep Sentry for error tracking: pure error grouping is Sentry's strength, do not force the switch
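If the fingerprint is computed in the application rather than in a derived column, a sketch could look like the following. It is not Sentry's grouping algorithm: it simply hashes the exception type with the innermost frames, and deliberately leaves out line numbers and messages so that small code moves or varying ids do not split a group. error_fingerprint is a hypothetical helper name.

```python
import hashlib
import traceback


def error_fingerprint(exc: BaseException, max_frames: int = 5) -> str:
    """Stable short id for 'the same kind of failure from the same code path'."""
    frames = traceback.extract_tb(exc.__traceback__)[-max_frames:]
    parts = [type(exc).__name__]
    parts += [f"{frame.filename}:{frame.name}" for frame in frames]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()[:16]
```

The value would then be attached next to error.type, for example as a span attribute named error.fingerprint (an assumed attribute name), so that a GROUP BY on it approximates issue grouping.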

Case 4：Cost 模型差、預估錯

徵兆：切 Honeycomb 後預估 50% cost saving、實際只省 10-15%。

根因：Sentry per-error pricing 對 error-heavy application 貴；Honeycomb per-event pricing 對 wide event volume application 貴；如果 application 是 event volume 高但 error 少、Honeycomb 反而貴。

修法：

Pre-migration 估：用 OTel pilot 跑 1-2 週、估真實 event volume
Retention design: 7 days hot + 30 days cold + 1 year archive, to lower cost
混合架構保留：frontend / mobile 走 Sentry、backend 走 Honeycomb、避免一邊 cost 爆

Case 5：Alert paradigm 不對等

Symptom: Sentry alerts are simple (error rate / latency p99 threshold), while Honeycomb trigger configuration is complex (SLO + burn rate + BubbleUp); the learning curve for the SRE / on-call team is 1-2 months.

修法：

Migration 含 alert rebuild scope：Honeycomb trigger 不直接對位 Sentry alert、要重寫
SLO-driven alert：用 Honeycomb SLO 取代 Sentry threshold alert、降 alert fatigue
PagerDuty integration：兩家都支援、routing rule 跟 dedup 要 review

Capacity / cost

維度	Sentry	Honeycomb
Pricing model	Per-error + transaction	Per-event (wide event)
Cost (mid-tier)	$500-2000 / mo	$400-3000 / mo (依 event volume)
Sampling	Built-in transaction sampling	Refinery (additional component)
Cardinality	~1000 unique value / tag	Millions / field
Application complexity	Low (SDK + capture exception)	Medium (OTel + wide event instrument)
Migration cost	-	2-4 FTE × 2-3 個月

整合 / 下一步

跟 OpenTelemetry 整合

OTel 是 vendor-neutral instrumentation、Honeycomb 是 OTLP backend；application 端 OTel 化後可以同時 ship 到多個 backend（dev 端 Jaeger / production 端 Honeycomb / fallback 端 Tempo）。

跟 Datadog → Grafana Stack 對位

兩條 observability 路線：

Grafana Stack (Mimir / Loki / Tempo)：self-host or Grafana Cloud、open source baseline
Honeycomb：SaaS-only、focus wide-event observability

選擇取決於 observability paradigm：trace-heavy 走 Tempo / Honeycomb、metric-heavy 走 Mimir / Datadog。

etcd → Consul：KV + N 個 extras feature matrix

Tue, 19 May 2026 00:00:00 +0000

本文是跨 vendor migration playbook、cross-link etcd 跟 Consul。跑 migration-playbook-methodology 6 維 audit 後對映 Paradigm = High（pure KV → service mesh paradigm）→ Type E paradigm shift；跟 Redis → Memcached（paradigm reduction）對偶、本文是 paradigm expansion（upgrade）方向。

KV + N 個 extras：feature matrix

概念	etcd	Consul
核心 paradigm	Pure KV with Raft consensus	Service mesh（KV + 6 個其他）
Data store	KV with versioned values + watch	KV + service catalog + health checks + sessions
API style	gRPC + HTTP/REST	HTTP/REST + gRPC（Connect）+ DNS
Service discovery	無（application 自管）	Built-in（DNS / HTTP API）
Health check	無	Built-in（HTTP / TCP / script / TTL）
Service mesh	無	Connect（mTLS + intentions + service-to-service）
Multi-DC	不支援（per-cluster only）	Built-in WAN federation
ACL system	RBAC (etcd 3.5+)	Token-based ACL + namespaces (Enterprise)
Lock primitive	Lease + transaction	Session + KV check-and-set
Watch event model	Event stream（gRPC stream）	Long-polling blocking query (X-Consul-Index)
Distributed config	KV + watch	KV + watch + template rendering (consul-template)
Use case 對映	K8s control plane / 純 distributed KV	Service mesh + service discovery + config + KV

核心差異不在「Consul 多功能」、在「Consul 是 service mesh paradigm」：service discovery / health check / Connect mTLS 是 first-class、KV 只是其中一個 sub-feature。

跑 6 維 diff dimension audit：

維度	評估	等級
Schema / API	KV API 對位 + 多 N 個 extra API	Medium
Operational model	兩者 Raft-based、ops similar	Low
Paradigm	Pure KV → service mesh	High
Components	同 1 cluster	Low
Application change	KV API 改 + 新增 service registration / health	Medium
Data topology	單 DC → multi-DC（如果用 federation）	Low-Medium

Paradigm = High（其他 Low-Medium）→ Type E paradigm shift；KV 是 sub-feature、不是 migration scope 全部。

為什麼遷：3 條 expansion driver

Service mesh adoption：本來用 etcd 跑 K8s control plane、現在 application 端要 service mesh（mTLS / intentions / 流量切換）、Consul 一站式 cover
Multi-DC strategy：etcd 不支援跨 DC、要 active-passive failover；Consul WAN federation 支援 active-active 多 DC
Configuration management：consul-template + envconsul 比 etcd watch + 自寫 reloader 簡單

反向 driver（Consul → etcd）：

純 K8s control plane scenario、不需要 service discovery / health check / mesh、etcd 簡單足夠
Resource constraint：Consul agent 比 etcd 更吃資源、low-end VM 上不夠

Paradigm expansion 路線

跟 Redis → Memcached paradigm reduction（移除 features）對偶、Consul 是 補進 features：

etcd KV pattern         → Consul KV API (1:1 mapping)
etcd watch              → Consul blocking query / consul-template
etcd lease + lock       → Consul session + KV CAS

(added on top)
(none)                  → Consul service registration (services.json / API)
(none)                  → Consul health check (HTTP / TCP / TTL)
(none)                  → Consul service discovery (DNS / HTTP)
(none)                  → Consul Connect (mTLS + intentions)
(none)                  → Consul WAN federation (multi-DC)
(none)                  → Consul ACL token + policy

Migration 不只是 KV API 對位、是 application 增能。

API 對位

# etcd basic KV
etcdctl put /myapp/config/db_url 'postgres://...'
etcdctl get /myapp/config/db_url

# Consul KV (counterpart)
consul kv put myapp/config/db_url 'postgres://...'
consul kv get myapp/config/db_url

# etcd watch
etcdctl watch --prefix /myapp/config/

# Consul blocking query (long polling)
curl 'http://consul:8500/v1/kv/myapp/config?recurse&index=5&wait=10s'
# The X-Consul-Index response header is the watch cursor: pass it as index= on the next call

# etcd transaction (multi-key atomic)
# Non-interactive input: compares, blank line, success requests, blank line,
# failure requests (none here), blank line
etcdctl txn <<'EOF'
mod("/myapp/lock") = "0"

put /myapp/lock "owner1"


EOF

# Consul session + KV CAS (counterpart)
SESSION_ID=$(curl -s -X PUT 'http://consul:8500/v1/session/create' | jq -r .ID)
curl -s -X PUT "http://consul:8500/v1/kv/myapp/lock?acquire=${SESSION_ID}" -d 'owner1'
# Prints false if the lock is already held by another session

Application 重設計

# Before: etcd
import etcd3
etcd = etcd3.client(host='etcd', port=2379)
etcd.put('/myapp/config/db_url', 'postgres://...')
db_url = etcd.get('/myapp/config/db_url')[0].decode()  # get() returns (value_bytes, metadata)

# After: Consul (KV-only)
import consul
c = consul.Consul(host='consul', port=8500)
c.kv.put('myapp/config/db_url', 'postgres://...')
_, kv = c.kv.get('myapp/config/db_url')
db_url = kv['Value'].decode()  # Value comes back as bytes

# (added on top) After: Consul service discovery
c.agent.service.register(
    name='myapp',
    service_id='myapp-1',
    address='10.0.0.10',
    port=8080,
    # interval 10s, timeout 5s, deregister after 30s in critical state
    check=consul.Check.http('http://10.0.0.10:8080/health', '10s', '5s', '30s')
)

# DNS-based discovery (how other services find myapp)
# dig +short myapp.service.consul SRV

Migration 流程

1. Pre-migration audit
   - List every application that uses etcd
   - Assess whether each application actually *needs* the Consul extras (service discovery / health / mesh)
   - Tag pure KV use cases as *low-effort migration*, and those that benefit from extras as *value-add migration*

2. Consul cluster build
   - Cross-DC design (WAN federation planning)
   - ACL system configuration (do not leave it default open)
   - Performance sizing (the Consul agent is heavier than etcd)

3. Application migration (per-app)
   - Pure KV: swap the SDK, map the API, cut over
   - Service discovery: add registration + health check + DNS lookup
   - Service mesh: add Connect proxy + intentions

4. Dual-run period
   - etcd keeps running while applications move to Consul gradually
   - Verify after each application's cutover

5. etcd decommission
   - Confirm every application has moved
   - K8s control plane (if it is the only remaining etcd user) stays on etcd

整體 2-4 個月、依 application 數量跟 extras 採用程度。

Production 故障演練

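Case 1: the KV API looks 1:1, but the watch event model differs

Symptom: after the application switches from etcd watch to Consul blocking queries, event-handling latency rises from 50ms to 1-5s. The application assumed events were pushed in real time; in practice it is now polling.

Root cause: etcd watch is a gRPC stream that pushes events immediately. A Consul blocking query is long polling with a wait timeout: a change is delivered promptly only while a request is outstanding, and anything that happens between requests is seen only after the client re-issues the query.

Fix:

Lower the wait timeout to match business needs (default 5min, can be set to 10s)
Poll concurrently from multiple instances: each of the N application instances polls on its own, reducing single-point event delay
Architecture: send critical events through the Consul event API (PUT /v1/event/fire/) plus a blocking query on the event endpoint, separate from KV changes
Keep etcd for critical watches: leave mission-critical watches on etcd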
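The client-side loop that replaces an etcd watch can be sketched as below. To stay runnable without a Consul agent it takes a fetch callable; in a real client that callable would wrap a blocking KV read such as python-consul's c.kv.get(key, index=index, wait=wait) (an assumption about how it would be wired up). The index handling follows the guidance in the Consul documentation: re-issue immediately, and reset to 0 if the index ever goes backwards.

```python
from typing import Any, Callable, Iterator, Optional, Tuple


def watch(fetch: Callable[[int, str], Tuple[int, Any]], wait: str = "10s",
          max_rounds: Optional[int] = None) -> Iterator[Any]:
    """Yield a value each time the blocking query reports a newer index."""
    index = 0
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        rounds += 1
        new_index, value = fetch(index, wait)  # blocks up to `wait` server-side
        if new_index <= 0 or new_index < index:
            index = 0  # index went backwards (e.g. restore): start over
            continue
        if new_index == index:
            continue   # wait expired with no change: just re-issue
        index = new_index
        yield value
```

The gap between one fetch returning and the next being issued is where the extra latency in this case comes from, so the handler that consumes the yielded values should stay short or hand work off to another thread.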

Case 2：Session-based lock 跟 etcd lease 差

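Symptom: with an etcd lease on a 5s TTL, the lock was released automatically within 5s of the holder going away. After switching to Consul sessions the session TTL still applies, but the interaction with health checks is complex and locks are occasionally not released.

Root cause: a Consul session has two invalidation behaviors: release (the default: locks are released but the keys stay) and delete (keys held by the session are deleted). Consul also treats the TTL as a lower bound and applies a lock-delay (15s by default) after a session is invalidated, so a lock can look stuck for longer than the TTL alone suggests. Behavior gets harder to reason about when the TTL is combined with health checks.

Fix:

# Make the session behavior explicit
session_id = c.session.create(
    name='myapp-lock',
    ttl=15,            # 15s TTL
    behavior='delete'  # when the session is invalidated, the lock key is deleted
)
c.kv.put('myapp/lock', 'owner1', acquire=session_id)

Session TTL must be between 10s and 86400s and cannot go below 10s (etcd leases can be as short as 1s); Consul is not a fit for critical low-latency locks.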

Case 3：Multi-DC failover、KV 寫到 wrong DC

徵兆：跨 DC 部署後、某 application 寫 KV、但 read 不到；發現 application 端 hardcode 一個 DC 端點、write 到 us-east 但 read 來自 us-west。

根因：Consul WAN federation 跨 DC 不自動同步 KV；KV 是 per-DC、跨 DC sync 需要 Consul Enterprise license 或自管 consul-replicate。

修法：

每 application instance 連 local DC Consul：write/read 同 DC
KV replication 跨 DC：用 consul-replicate 自管、或升 Enterprise
Architecture：跨 DC 共享 config 改用 DB-backed config（持久 + 跨 DC）+ Consul KV 只存 DC-local config

Case 4：ACL system 預設 open、cutover 後曝險

徵兆：Consul cluster 上線 1 個月後 SOC 跑 audit、發現任何 application 都能 read 任何 KV；ACL 沒設、所有 token 都全權限。

根因：Consul ACL 預設 disabled、需要 bootstrap；很多 setup tutorial 簡化跳過 ACL、cutover 後沒補。

修法：

# Bootstrap ACL system
consul acl bootstrap
# Generates the management token; keep it as the root credential

# Create a policy
consul acl policy create -name 'myapp-readonly' \
  -rules 'key_prefix "myapp/" { policy = "read" }'

# Create a token for the application
consul acl token create -policy-name 'myapp-readonly'

Production setup 第一步就 bootstrap ACL、不可以延後。

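Case 5: a health check failure cascades and service discovery breaks

Symptom: an application instance fails to respond to a health check during a 5-second GC pause and Consul marks it failed. DNS queries stop returning the instance and traffic moves away. After the GC ends the instance is healthy again, but Consul still shows it as failed and recovery takes minutes.

Root cause: once a check goes critical, the instance stays out of discovery results until the check passes again, and if success_before_passing is set it must pass that many consecutive times. Both success_before_passing and failures_before_critical default to 0 (a single result flips the status), so the actual recovery time is governed by the check interval.

Fix:

Keep success_before_passing low (1) for fast recovery
Set failures_before_critical higher (3-5) to tolerate transient failures
Multi-check strategy: HTTP + TCP + script checks on three axes, instead of relying on a single check
Application-side hint: for JVM applications, tune MaxGCPauseMillis (a pause-time goal, not a hard limit) so that GC pauses stay below the health check interval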
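The effect of the two thresholds is easier to see in a small simulation. The class below is a simplified model of consecutive-result counting, not Consul's implementation; CheckState is a hypothetical name and a threshold of 0 is treated the same as 1.

```python
class CheckState:
    """Track check status with consecutive-failure / consecutive-success thresholds."""

    def __init__(self, failures_before_critical: int = 0, success_before_passing: int = 0):
        self.fail_needed = max(1, failures_before_critical)
        self.ok_needed = max(1, success_before_passing)
        self.status = "passing"
        self.fails = 0
        self.oks = 0

    def observe(self, ok: bool) -> str:
        if ok:
            self.oks += 1
            self.fails = 0  # any success resets the failure streak
            if self.oks >= self.ok_needed:
                self.status = "passing"
        else:
            self.fails += 1
            self.oks = 0    # any failure resets the success streak
            if self.fails >= self.fail_needed:
                self.status = "critical"
        return self.status
```

With a 10s interval and failures_before_critical=3, a single 5-second GC pause costs at most one failed result and never reaches critical, whereas with the defaults the first missed check removes the instance from DNS.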

Capacity / cost

維度	etcd	Consul
Cluster baseline	3-5 node Raft cluster	3-5 server + N agent (per host)
Memory per node	2-8GB	4-16GB（含 agent）
Operational FTE	0.2-0.5	0.5-1.0（多 features 多運維）
Feature surface	Pure KV	KV + service mesh + multi-DC + ACL
Setup complexity	Low	Medium-High
Multi-DC support	不支援	Built-in WAN federation
License	Apache 2.0 (open)	BUSL 1.1 (community, since HashiCorp's 2023 relicensing; earlier releases were MPL 2.0) / commercial (enterprise)
Migration cost	-	1-3 FTE × 2-4 個月

判讀：純 KV use case 走 etcd；service mesh / multi-DC / discovery 需求大走 Consul；混合 deployment 是 long-term default（K8s control plane 仍跑 etcd、service mesh 跑 Consul）。

整合 / 下一步

跟 Kubernetes 對位

K8s control plane 永遠用 etcd、不切 Consul；Consul 是 K8s 外的 service mesh + 跨 cluster discovery。兩者並存、不互斥。

跟 Vault 整合

Consul + Vault 是 HashiCorp 同生態、Consul 跑 service discovery / mesh、Vault 跑 secrets；Consul ACL token 可從 Vault dynamic engine 取得。

跟 Istio / Linkerd 對位

Consul Connect 是 service mesh paradigm、跟 Istio / Linkerd 並列；多數 K8s-native organization 用 Istio / Linkerd、Consul 強項在 跨 K8s + VM + multi-DC mesh。

反向 migration（Consul → etcd）

少數 organization 簡化 stack 時做、流程鏡像對稱、但 退掉 service mesh / multi-DC 是有意識降級、不能假裝功能等價。

下一步議題

Consul Connect production rollout：mesh adoption 是 incremental、per-service intentions 漸進
Multi-DC topology 設計：active-active vs active-passive、依 RPO/RTO 跟 cost trade-off
跟 Kubernetes Gateway API 整合：service mesh paradigm 在 K8s 內 vs 外整合策略

Redis → Memcached：Memcached 不是 simpler Redis、是 cache paradigm

Tue, 19 May 2026 00:00:00 +0000

本文是跨 vendor migration playbook、cross-link Redis 跟 Memcached。跑 migration-playbook-methodology 6 維 audit 後對映 Paradigm = High（multi-paradigm → pure cache）→ Type E paradigm shift；本文是 paradigm reduction（downgrade 方向）的 dogfood。

Memcached 不是 simpler Redis、是 cache paradigm

把 Redis → Memcached 當「移除 Redis 功能」是最常見的誤判：

概念	Redis	Memcached
核心 paradigm	Multi-paradigm（KV + 資料結構 + pub/sub + script）	Pure cache（KV + TTL）
Value 類型	String / Hash / List / Set / Sorted Set / Stream / Bitmap / HyperLogLog	byte string only
Atomic operations	100+（INCR / LPUSH / ZADD / …）	INCR / DECR / APPEND / CAS
Server-side scripting	Lua scripts (`EVAL`)	無
Pub/Sub	Native	無
Persistence	RDB / AOF	無（restart 全失）
Replication	Async / sync replication	無
Cluster	Redis Cluster + Sentinel HA	Memcached cluster（client-side sharding）
Eviction policy	8 種（LRU / LFU / random / …）	LRU only
Expiration accuracy	TTL 精確到 ms	TTL 精確到 second、lazy expiration

核心差異不在「Memcached 少了 Redis 功能」、在「Memcached 是不同的 cache paradigm」。 Redis 的 features（hash / sorted set / pub/sub）多數 不該移除、是 重新分配到對應 specialized service：

Hash / sorted set → application 端用 JSON + 自管 index
Pub/Sub → message queue（NATS / Redis Streams / Kafka）
Lua scripts → application code
Persistence → 真正需要的 data 該存 DB、不是 cache
Replication / cluster → Memcached 自己 cluster strategy

為什麼遷：simplification / cost / ops 三條 driver

Operational simplification：Memcached 沒 persistence / replication / cluster mode、ops surface 縮小、團隊不用懂 Redis 25+ command family
Cost：對 純 cache use case 而言、Memcached 每 GB 比 Redis 便宜（memory efficiency 略勝 + 無 persistence overhead）
Strict cache discipline：Memcached 逼 application code 把「真正的 cache」跟「半 persistent state」分開、避免 Redis 變 poor man’s database

反向 driver（Memcached → Redis）：

Application 寫到 Memcached 後發現需要 atomic counter / leaderboard / queue / lock、應該升 Redis（不是繼續 wrap Memcached）

跑 6 維 audit

維度	評估	等級
Schema / API	Redis 命令集 → Memcached 命令集、相容度 < 20%	High
Operational model	兩者都簡單、Memcached 略簡單	Low
Paradigm	Multi-paradigm → pure cache	High
Components	同 1 個 cache service	Low
Application change	必改（任何 hash / list / sorted set / pubsub 用法）	High
Data topology	同 single instance / cluster	Low

3 維 High（Schema / Paradigm / Application change）多軸高、主導維度 = Paradigm → Type E paradigm shift；Schema + Application change 抽獨立段補充。

結構：類 Type E + paradigm reduction 分配路線

1. Memcached is not a simpler Redis (concept-reversal opening)
2. Why migrate
3. 6-dimension audit
4. Paradigm reduction route (the specialized service each Redis feature maps to)
5. Schema differences (Redis vs Memcached command set)
6. Application redesign (per-call-site refactor)
7. Migration flow (incremental, cut over some use cases first)
8. Production failure drills
9. Capacity / cost
10. Integration / next steps

10 章節、220-260 行。比 Type E（Kafka ↔ NATS）多 paradigm reduction 路線 段。

Paradigm reduction 路線

Redis features 對應的 specialized service：

Redis Hash             → application-side JSON.stringify + Memcached SET
                         (or store in the DB with a Memcached cache layer)

Redis List (queue)     → NATS / Kafka / RabbitMQ / SQS

Redis List (stack)     → application-side array with self-managed LIFO

Redis Set              → application-side array + dedup, or a DB unique index

Redis Sorted Set       → application-side ordered list + comparator,
                         or PostgreSQL + index

Redis Stream           → Kafka / Redis Streams (kept) / NATS JetStream

Redis Pub/Sub          → NATS Core / Redis Streams / Kafka

Redis Lua script       → application code (do not assume atomicity)

Redis distributed lock → Consul / etcd / DB advisory lock / Redis (kept)

Redis Bitmap           → DB bit column / application-side bitset

Redis HyperLogLog      → DB approx_count_distinct / application-side cardinality estimator

Migration scope 包含 每個 Redis-specific feature use case 對應的 service 評估；不是「移除」、是「重新分配」。

Application 重設計

import json

# Before: Redis hash (each field can be written independently)
redis.hset('user:123', 'email', 'a@b.com')
redis.hset('user:123', 'name', 'Alice')
user = redis.hgetall('user:123')

# After: Memcached + JSON (the whole object is one opaque value)
user_data = {'email': 'a@b.com', 'name': 'Alice'}
mc.set('user:123', json.dumps(user_data))
user = json.loads(mc.get('user:123') or '{}')  # a cache miss yields an empty dict

# Before: Redis sorted set (leaderboard)
redis.zadd('leaderboard', {'alice': 100, 'bob': 95})
top_10 = redis.zrevrange('leaderboard', 0, 9, withscores=True)

# After: PostgreSQL + index + Memcached cache
# Persistent: write scores to the DB
# Cache: compute the top 10 with a DB query, then cache the result in Memcached
# Note: "user" is quoted because an unquoted user means current_user in PostgreSQL
top_10 = db.query('SELECT "user", score FROM scores ORDER BY score DESC LIMIT 10')
mc.set('leaderboard:top10', json.dumps(top_10))

# Before: Redis distributed lock
with redis.lock('resource:1', timeout=10):
    process_resource()

# After: PostgreSQL advisory lock OR Consul session
with db.advisory_lock(resource_id):
    process_resource()

每個 Redis-specific pattern 都要 per-call-site refactor、不是 SDK 換。

Migration 流程

跟 Kafka ↔ NATS 同 partial migration：

1. Audit application code: list every Redis call site and the feature it uses
2. Plan per feature category:
   - Pure KV (GET/SET/DEL/TTL): switch to Memcached directly
   - Hash → JSON + Memcached: per-call-site refactor
   - List / Sorted Set: decide whether it is a queue, a leaderboard, or something else, then map it to the matching service
   - Pub/Sub: move to a message queue
   - Lock: move to the DB or keep Redis
3. Cut over part of the application first (pure KV use cases)
4. Refactor complex use cases step by step onto the matching service
5. Once Memcached runs in production, Redis can shrink to a *narrow scope* (only the remaining Redis-specific features)
   or be retired entirely (if the application has been refactored cleanly)
6. Long-term hybrid architecture: Memcached cache layer + DB persistent state + optional Redis (locks / specialty)

整體 3-12 個月、依 Redis-specific feature 使用深度。

Production 故障演練

Case 1：Hash → JSON 後 GET/SET round-trip 變 N+1

徵兆：cutover 後 application latency p99 從 5ms 漲到 50ms；profiling 顯示「為了改 user.email、要先 GET user object → modify → SET」、原本 Redis HSET 1 個 round-trip 現在 2 個。

根因：JSON-encoded value 不能 partial update、每次改一欄都要 read-modify-write。

修法：

Application 端 cache JSON object in memory：read-modify-write 仍 1 個 SET、但 read 是 memory
Compare-and-swap (CAS)：Memcached CAS 防止 concurrent update lost
Field-level cache key：把 hash 拆成 N 個 Memcached key（user:123:email / user:123:name）、避開 JSON
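
The CAS option can be sketched as a read-modify-write loop with retry. This is a minimal sketch under stated assumptions: `FakeCasClient` is a hypothetical in-memory stand-in that mimics the `gets` / `cas` pair Memcached clients expose, so the loop can run without a server; the retry count is arbitrary.

```python
import json


class FakeCasClient:
    """Hypothetical in-memory stand-in for a Memcached client with gets/cas."""

    def __init__(self):
        self._data = {}  # key -> (value, version token)

    def set(self, key, value):
        _, version = self._data.get(key, (None, 0))
        self._data[key] = (value, version + 1)

    def gets(self, key):
        # Returns (value, token), or (None, None) on a miss.
        return self._data.get(key, (None, None))

    def cas(self, key, value, token):
        # Store only if nobody wrote the key since the token was read.
        current = self._data.get(key)
        if current is None or current[1] != token:
            return False
        self._data[key] = (value, token + 1)
        return True


def update_field(client, key, field, new_value, max_retries=5):
    """Update one field of a JSON value without losing concurrent writes."""
    for _ in range(max_retries):
        raw, token = client.gets(key)
        if raw is None:
            return False  # cache miss: the caller reloads from the DB
        obj = json.loads(raw)
        obj[field] = new_value
        if client.cas(key, json.dumps(obj), token):
            return True
        # Another writer won the race; re-read and try again.
    return False
```

The loop still costs two round-trips per update, so CAS addresses lost updates, not the latency regression; the in-memory object cache or field-level keys address latency.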

Case 2：Sorted set leaderboard 退化、recomputation cost 爆

徵兆：原本 Redis leaderboard ZADD + ZREVRANGE < 1ms；切 DB-backed leaderboard 後 SELECT ... ORDER BY ... LIMIT 10 在 1M+ row 跑 100-500ms。

根因：Memcached 不支援 sorted set、leaderboard 必須在 DB 算、N 大時 sort 慢。

修法：

Cache pre-computed top N：DB scheduled job 每分鐘算 top 100、寫 Memcached、application 讀 cache 不直查 DB
Materialized view + index：DB 端用 materialized view + index、毫秒級 query
保留 Redis sorted set：leaderboard 是 Redis 強項、不該退到 Memcached、走混合架構

Case 3：Pub/Sub 移除、缺 fan-out 機制

徵兆：原本 Redis Pub/Sub 跑 cache invalidation broadcast、N 個 application instance 都收 invalidation msg；切 Memcached 後失去 broadcast、cache stale。

根因：Memcached 沒 Pub/Sub；application 需要外部 fan-out 機制。

修法：

NATS / Redis Streams + consumer group：each application instance 是 consumer、收 invalidation
Database trigger + LISTEN/NOTIFY：PostgreSQL LISTEN/NOTIFY 對中型 fan-out 足夠
Architecture rethink：是否真需要 broadcast invalidation？通常用 TTL-based cache + cache key versioning 就能 cover 多數 invalidation use case
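
Cache key versioning from the last option can be sketched as follows. This is a minimal sketch, assuming a dict-like `store` as a stand-in for Memcached; the `VersionedCache` class and the `ver:<namespace>` key layout are hypothetical names for illustration. On a real Memcached the version bump would be an atomic `incr` on the version key.

```python
class VersionedCache:
    """Invalidate a whole namespace by bumping its version, not by broadcast."""

    def __init__(self, store):
        self.store = store  # dict-like stand-in for a Memcached client

    def _version(self, namespace):
        # The version lives in the cache itself, shared by every instance.
        return self.store.setdefault(f"ver:{namespace}", 1)

    def key(self, namespace, item_id):
        return f"{namespace}:v{self._version(namespace)}:{item_id}"

    def get(self, namespace, item_id):
        return self.store.get(self.key(namespace, item_id))

    def set(self, namespace, item_id, value):
        self.store[self.key(namespace, item_id)] = value

    def invalidate(self, namespace):
        # Old entries become unreachable and age out through TTL / eviction.
        self.store[f"ver:{namespace}"] = self._version(namespace) + 1
```

Every instance reads the version before building a key, so a single bump invalidates the namespace for all of them without any fan-out channel. The cost is one extra lookup per read.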

Case 4：Atomic INCR 沒對等、race condition

徵兆：rate limiter / counter pattern 切 Memcached、mc.incr(key) 在 key 不存在時 return None（不 auto-init 為 0）；application 端 if None: mc.set(key, 1) race condition、低機率 counter reset。

根因：Memcached INCR 對 missing key 不像 Redis 自動 init；application 端 init logic 容易 race。

修法：

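# Use ADD (atomic put-if-absent) to initialise the counter
mc.add(key, 0)    # stores 0 only if the key is missing
mc.incr(key, 1)   # explicit delta; some clients require it

ADD and INCR are each atomic, so together they remove the init race. Still treat a None result from incr as a miss, because the key can expire or be evicted between the two calls.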

Case 5：Eviction policy 差異、production cache hit rate 降

徵兆：cutover 後 cache hit rate 從 95% 降到 80%；profiling 發現「重要 key 沒在 cache」、新 key 一直擠走熱 key。

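Root cause: the Redis deployment ran with maxmemory-policy allkeys-lfu (least frequently used; note that the out-of-the-box Redis default is noeviction), so long-lived hot keys were not evicted. Memcached only has LRU, which goes purely by access time, so a burst of accesses to cold keys pushes out long-tail hot keys.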

修法：

Memory headroom：Memcached memory 限制拉高 30-50%、避免 eviction pressure
Application-side cache priority：critical key 用 no-expiration set + 主動 refresh
保留 Redis for LFU workload：long-tail hot key 場景 Redis LFU 更合適、不該退 Memcached

Capacity / cost

維度	Redis	Memcached
Memory efficiency	baseline	+10-20%（無 metadata overhead）
Throughput	~100K ops/s single-thread	~500K-1M ops/s multi-threaded
Latency p99	1-3ms	0.5-1ms
Persistence overhead	5-15% CPU	0
Operational FTE	0.3-0.8	0.1-0.3
Application complexity	Low（feature 豐富）	Higher（feature 移到 application）
Cost per GB memory	baseline	略低（無 persistence I/O / replication overhead）

判讀：純 cache use case 走 Memcached 省 ops + 略省 cost；application 已用 Redis-specific feature 不該切；混合架構是 long-term default。

整合 / 下一步

跟 Redis → DragonflyDB 對比

兩條路：

DragonflyDB：保留 Redis paradigm、優化 throughput + memory；application 不用改
Memcached：退到 pure cache paradigm、application 必須改、但 ops 簡化

選擇取決於 是否真的需要 Redis multi-paradigm features：用得到就 DragonflyDB / Redis、用不到就 Memcached。

跟 NATS 整合

Redis Pub/Sub 移除後、應用端 fan-out / messaging 需求轉到 NATS / Redis Streams / Kafka；本文 cross-link migration playbook Kafka ↔ NATS 有 paradigm shift 流程參考。

下一步議題

Memcached Cluster strategy：client-side consistent hashing vs server-side cluster mode、ops 簡化 vs scalability 取捨
Long-term mixed architecture：80% Memcached + 20% Redis 是常見 stable state、不一定要完全消除 Redis

MySQL 5.7 → 8.0 Major Version Upgrade：character set / authentication / atomic DDL 三條 paradigm 同時換軌

Tue, 19 May 2026 00:00:00 +0000

本文是 MySQL 內 version upgrade migration playbook、走 Migration playbook methodology Type E paradigm shift 結構。

5.7 → 8.0 看起來是 minor bump（從 5.7.40 升到 8.0.36）、但不是。Oracle 把這個 release boundary 當成 清庫存的機會 — 同時推出 3 個 behavioral paradigm shift：

Paradigm	5.7 default	8.0 default	影響
Character set	latin1 / utf8（=utf8mb3）	utf8mb4	string column 儲存 + emoji / 4-byte UTF-8
Authentication plugin	mysql_native_password	caching_sha2_password	client / library 需要支援新 plugin
DDL atomicity	Non-atomic（crash 留 orphan）	Atomic（crash recovery 乾淨）	開發信心、crash recovery 行為

對應 任意一個 paradigm 升級失誤、production 都會 down。三條同時換、必須 三條都規劃。

這條 upgrade 比 PostgreSQL major-version-upgrade 工作量大 — PG major upgrade 主要是 pg_upgrade 工具流程、MySQL 是 behavioral compatibility audit + ecosystem 全 review。

為什麼是 Type E（不是 minor upgrade）

跑 6 維 diff dimension audit：

維度	評	說明
Schema	Medium	SQL 一致、reserved keyword 新增、collation 預設變
Operational	Medium-High	binary upgrade flow 簡單、但 ecosystem 工具兼容性 audit 工作量大
Paradigm	High	3 條 default paradigm shift（charset / auth / atomic DDL）
Components	Low	同 MySQL 引擎、不引新 component
App change	Medium-High	client library / driver / connection string 都可能要改
Topology	Low	部署 topology 不變

Paradigm = High + App change = Medium-High → Type E paradigm shift。

雖然是 同一個 vendor 的 major version、實際的 application 行為差異 跨越多個 paradigm、6 type 框架仍適用、結構走 partial migration 收斂。

4-phase upgrade

Phase 1：Pre-check audit

8.0 升級前用 MySQL Shell upgrade checker + 手動 audit：

1mysqlsh root@5.7-primary.example.com -- util check-for-server-upgrade

Upgrade checker 報告：

Reserved keyword 衝突（5.7 不是 keyword 但 8.0 是、例如 WINDOW / RANK / LATERAL）
舊 character set / collation 使用點（latin1 / utf8mb3）
Deprecated feature 使用（GROUP BY 隱含 ORDER BY 等）
Datatype 變動（DATETIME 行為微差）

手動 audit：

Application driver / library 版本是否支援 caching_sha2_password
Connection string 內 default-authentication-plugin 設定
ORM / framework 是否假設 utf8 而非 utf8mb4

完成標準：寫出 blocker list（必須在升級前修） + warning list（可在升級後處理）。
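
The reserved-keyword part of the manual audit can be sketched as a small filter over column names. This is a minimal sketch under stated assumptions: `NEW_RESERVED_80` is a partial, hand-picked list of words that became reserved in 8.0 and is not exhaustive, and the `(table, column)` input is assumed to come from a query against information_schema.COLUMNS.

```python
# Partial list of words that became reserved in MySQL 8.0 (NOT exhaustive).
NEW_RESERVED_80 = {
    "CUME_DIST", "DENSE_RANK", "GROUPS", "LATERAL", "NTILE",
    "OVER", "RANK", "RECURSIVE", "ROW_NUMBER", "WINDOW",
}


def find_reserved_conflicts(columns):
    """Return the (table, column) pairs whose column name is newly reserved.

    columns: iterable of (table_name, column_name) tuples.
    """
    return sorted(
        (table, column)
        for table, column in columns
        if column.upper() in NEW_RESERVED_80
    )


if __name__ == "__main__":
    sample = [("events", "window"), ("events", "rank"), ("events", "id")]
    for table, column in find_reserved_conflicts(sample):
        print(f"blocker: {table}.{column} needs backticks or a rename")
```

The upgrade checker remains the authoritative source; a script like this is mainly useful for turning its findings into a per-team blocker list.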

Phase 2：Shadow upgrade — Replica 升 8.0

從 non-critical replica 升起。先升一個 replica、跑 production traffic（read-only）2-4 週：

# 1. Backup while the 5.7 replica is still running (XtraBackup needs a live server)
xtrabackup --backup --target-dir=/backup/pre-upgrade

# 2. Stop the replica
systemctl stop mysql

# 3. Install the MySQL 8.0 binaries (apt / yum upgrade)
apt-get install mysql-server-8.0

# 4. Start 8.0; the data dictionary is upgraded automatically
systemctl start mysql

# 5. From 8.0.16 the server runs the upgrade steps itself (the mysql_upgrade utility is deprecated)
#    Only when the target is older than 8.0.16, run manually: mysql_upgrade -u root -p

# 6. Re-attach as a replica of the 5.7 primary (an 8.0 replica can follow a 5.7 primary)
mysql -u root -p -e "CHANGE MASTER TO MASTER_AUTO_POSITION=1; START SLAVE;"

跑 production read traffic 觀察：

Query result 是否跟 5.7 一致（特別 character set 相關）
Replication lag 是否在 baseline 範圍
8.0-specific feature 是否需要（hash join / window function 等）

Phase 3：Promote 8.0 為 primary

確認 shadow replica 穩定後：

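# 1. Upgrade the remaining replicas to 8.0
#    (run the Phase 2 procedure per replica)

# 2. Move applications to an 8.0-compatible driver
#    Either let the driver authenticate with caching_sha2_password
#    (if it exposes a default-authentication-plugin option, set it there),
#    or keep the account on mysql_native_password (set per user on the server)

# 3. Failover: promote an 8.0 replica to primary
#    Use Orchestrator or your own failover procedure

# 4. Upgrade the old 5.7 primary to 8.0, then rejoin it as a replica
#    (a 5.7 server cannot replicate from an 8.0 primary)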

完成標準：所有 server 都是 8.0、application 連 8.0 endpoint 無 error。

Phase 4：Decommission 5.7 + 套用 8.0 paradigm

完成 binary upgrade 不是真正完成 — 還要逐步遷移 paradigm：

Character set 升級：歷史 latin1 / utf8 table 改 utf8mb4
```
ALTER TABLE orders CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
```
每張 table 走 gh-ost / pt-osc（避免 production 阻塞）
Authentication 升級：逐步把 user 從 mysql_native_password 改 caching_sha2_password
```
ALTER USER 'app'@'%' IDENTIFIED WITH caching_sha2_password BY 'new_password';
```
需確認 application driver 已支援新 plugin（多數 modern driver OK、legacy 可能要升級）
Reserved keyword 處理：column / table 名稱跟新 reserved word 衝突的、改名
```
ALTER TABLE events RENAME COLUMN `window` TO event_window;
```

多數 org 在 Phase 3 停留更久 — paradigm 升級不是一次 big bang、是漸進。

5 個 Production 踩雷

1. Authentication plugin — Application 突然連不上

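After the upgrade, newly created users default to caching_sha2_password. Older application drivers (releases that predate MySQL 8.0, roughly five or more years old) do not support it and fail to connect with: Authentication plugin 'caching_sha2_password' cannot be loaded.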

修法：

先升 driver：每個 application 升級 mysql-connector-* 到支援 caching_sha2 的版本（多數 modern release 已支援）
短期 workaround：用 mysql_native_password（new user 顯式 create with IDENTIFIED WITH mysql_native_password）
設 default_authentication_plugin=mysql_native_password、強制保留舊 default

2. Character set 4-byte UTF-8 — Emoji 進不去

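Columns created under 5.7 as latin1 or utf8 (= utf8mb3) keep their original character set after the 8.0 upgrade; nothing is converted to utf8mb4 automatically. When the application writes an emoji (4-byte UTF-8), the value is rejected under strict SQL mode and truncated otherwise.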

修法：

逐 table CONVERT：gh-ost / pt-osc 跑 ALTER TABLE ... CONVERT TO CHARACTER SET utf8mb4
新建 table 預設用 utf8mb4（character_set_server=utf8mb4 設定）
Application 連線 charset 設定一致（character_set_client / connection / results）

3. Reserved keyword — Application query 突然 syntax error

5.7 跑得好的 query：

SELECT window, rank FROM events;

8.0 報錯：window 跟 rank 都是 reserved keyword、必須 backtick：

SELECT `window`, `rank` FROM events;

修法：

Phase 1 upgrade checker 已抓出來、Application code review 改 SQL
Prefer a backtick policy for table / column names (always quote identifiers, so future reserved words cannot collide)
ORM 多數會自動 backtick、raw SQL 容易踩

4. Group Replication / 新 feature 開了就不能 rollback

8.0 升級後 誘惑使用 8.0-only feature：

Group Replication（5.7 也有但 8.0 更穩）
Resource Group（5.7 沒有）
Histograms（5.7 沒有）
CTE / window function（5.7 沒有）

一旦 application 用了這些 feature、不能 rollback 5.7（feature 不存在、query 失敗）。

修法：

Phase 1-3 期間禁用 8.0-only feature、保留 rollback option
Phase 4 完成 且穩定運作 30+ 天後、才開始 evaluate 8.0-only feature
加 8.0-only feature 時 明確記錄不可 rollback

5. Collation default 變動 — Sort order 跟 unique 行為改變

5.7 utf8mb4 預設 collation = utf8mb4_general_ci、8.0 預設 = utf8mb4_0900_ai_ci。兩者排序行為不一致：

utf8mb4_general_ci：簡化 collation、不嚴格遵循 Unicode
utf8mb4_0900_ai_ci：Unicode 9.0 compliance、accent-insensitive

對 已存在的 table、collation 不會被 8.0 升級改變（保留 5.7 設定）。但 新建 table 預設用 0900_ai_ci、UNION / JOIN 跨不同 collation 的 column 可能 error: Illegal mix of collations。

修法：

統一 collation：要麼 所有 table 改 0900_ai_ci、要麼 所有 table 保留 general_ci
Schema migration 走 OSC 工具
Application 內 sort-dependent logic（leaderboard / search ranking）要驗證新 collation 結果

Capability gap：5.7 有但 8.0 沒有

少數 8.0 拿走的能力：

Query Cache: built into 5.7 (already deprecated there) and removed entirely in 8.0. Under high concurrency the query cache actually slowed servers down, so its removal is a net win
InnoDB MEMORY engine：5.7 部分支援、8.0 限制更多
Some MyISAM optimizations：8.0 強制 InnoDB-first、MyISAM-specific 工作流 broken

對 Query Cache user：升 8.0 前評估是否依賴、考慮改 application-side cache（Redis）。

容量與成本對照

項目	5.7	8.0
Cost	Free (CE) / Enterprise	Free (CE) / Enterprise
升級 hosts × 時間	-	per-instance ~30 分鐘 binary upgrade
Application 改動	-	driver upgrade + SQL review
Character set conversion	-	per-table OSC、大表小時級
Ops headcount	-	1-2 個 DBA × 2-4 週
對 production 影響	-	Phase 2-3 漸進升級、無大 downtime

5.7 → 8.0 upgrade 整體成本是 1-2 個 FTE 月 規模。對中型 deployment（100+ DB）可能更多。

何時不升

App 用 Query Cache 重度：8.0 沒了、要 application 改造
Old driver 不能升：legacy enterprise application 用 10 年前 driver、driver vendor 已倒、無法升 8.0-compatible
Compliance freeze：某些金融 / 醫療場景 freeze technology 多年、升級需要重 audit + recertification
5.7 已 EOL（2023-10）後仍堅持不升：security risk 高、應該 優先升

跟 PostgreSQL Major Version Upgrade 對比

維度	MySQL 5.7 → 8.0	PostgreSQL N → N+1
Tool	binary upgrade + 自動 server-upgrade（8.0.16+；舊版用 mysql_upgrade）	pg_upgrade（in-place）
Downtime	< 5 分鐘 per instance（binary + DD upgrade）	< 1 分鐘 per instance（pg_upgrade）
Paradigm shift	3 條（charset / auth / atomic DDL）	一般 0-1 條（PG major 多保 compat）
App 必須改	多（driver + query）	少（多數 query 兼容）
Risk	高（paradigm 多）	中-低
Rollback	不可（一旦 atomic DDL data 寫入、5.7 不認）	不可（pg_upgrade 不可逆）

PG major upgrade 比 MySQL 簡單。MySQL 5.7 → 8.0 是特例 — Oracle 把多年 deprecated 一次清。8.0 → 8.4 / 9.x 預期更平順。

跟其他模組整合

跟 Replication topology

8.0 replica 可 attach 5.7 primary（向下兼容）、但 5.7 replica 不能 attach 8.0 primary（向上不兼容）。Upgrade 順序必須 replica 先升、primary 後升。詳見 Replication Topology。

跟 InnoDB Tuning

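8.0 reworked the InnoDB redo log. From 8.0.30, redo capacity is controlled by innodb_redo_log_capacity, which can be resized online without downtime; innodb_log_file_size itself is not dynamic and is deprecated from that release. See InnoDB Tuning.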

跟 Modern SQL Features

8.0 補 CTE / window / JSON_TABLE / hash join — 是 為什麼要升 8.0 的 driver。詳見 Modern SQL Features。

跟 Group Replication

GR 在 5.7 有、但 8.0 才成熟。Group Replication 的 MySQL Shell + Router 整套 stack 主要在 8.0 才完整。詳見 Group Replication。

跟 Aurora / PlanetScale 等 managed

從 5.7 升 8.0 是個好時機 同時評估 是否要遷 Aurora / PlanetScale — 既然要做 paradigm shift、不如一次到位。詳見 migrate-to-aurora / migrate-to-planetscale。

MySQL → PlanetScale：managed Vitess + branch-based schema workflow 的 hybrid shift

Tue, 19 May 2026 00:00:00 +0000

本文是跨 vendor migration playbook、cross-link 到 MySQL 跟 PlanetScale。走 Migration playbook methodology Type E paradigm shift 結構。

維度	自管 MySQL	PlanetScale
Sharding	自己配 Vitess 或不 shard	Vitess 透明（即使單 keyspace 也走 Vitess）
Schema migration	gh-ost / pt-osc 跑 ALTER	Branch + Deploy Request workflow
Failover	Orchestrator 自管	PlanetScale 自動
Branching	不存在概念	DB branch（git-like）+ revert
Connection limit	max_connections 自己設	PlanetScale connection pool / per-plan limit
Foreign key	支援	有限支援（Vitess 18+ / 2023 起、需明確啟用）
`SUPER` privilege	自己有	無
Multi-region	Self-configured binlog shipping	Built into PlanetScale (read-only regions; Boost is a separate query cache)
Per-month cost	EC2 + EBS + ops	per-row-read + per-row-written + storage

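From the application connection perspective, the change is as small as an Aurora MySQL migration: swap the connection string and you are done. From the schema management perspective, PlanetScale pushes a branch-based workflow hard: a schema change is no longer "run gh-ost" but "open a branch → Deploy Request → review → merge". The whole schema change flow has the same shape as git and the same workflow as application code review.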

這是 workflow + schema-tooling shift — Aurora 是「同 workflow + managed」、PlanetScale 是「同 protocol + 不同 schema workflow + branch tooling」。Database paradigm（OLTP relational）跟 application change 都 Low、主要 shift 在 DBA / dev 操作介面。

為什麼是 Type E（Paradigm + Operational + Schema 多軸）

跑 6 維 diff dimension audit：

維度	評	說明
Schema	Medium-High	MySQL wire protocol 一致、FK 有限支援（Vitess 18+）、部分 INSTANT DDL 行為差
Operational	High	branch lifecycle、Deploy Request workflow、connection pooler 不同
Paradigm	High	branch-based schema management、跟自管 gh-ost / pt-osc 思維完全不同
Components	Medium	PlanetScale CLI / Console / API / connection pooler 都進團隊工具
App change	Low	connection string + application-level handling for cross-shard FK
Topology	Low-Medium	Vitess 透明 sharding 即使單 keyspace

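Paradigm and Operational are High, and Schema is Medium-High. By the priority order Schema > Paradigm > Operational, the default pick would be Type A. But what readers care about most is the shift in schema workflow paradigm, not schema field translation, so the Type E structure fits the real migration flow better: no convergence, partial adoption.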

→ Type E paradigm shift、4-phase partial migration（多數 org 停 Phase 2-3 hybrid）。

Driver：Branch-based workflow + Vitess transparent sharding + zero DBA

從自管 MySQL 遷 PlanetScale 的核心 driver 有三條：

Branch-based schema workflow：

改 schema 開 branch（pscale branch create）、在 branch 上跑 ALTER、跑 application code 改、merge 進 main 前 Deploy Request review
Deploy Request 顯示 schema diff、跟 GitHub PR 同概念
Merge 後 PlanetScale 自動跑 no-downtime schema migration（內部 VReplication）
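If something goes wrong, the deploy can be reverted within a limited window (PlanetScale documents 30 minutes after the deploy completes), using Vitess VReplication to ship data back in the opposite direction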

這條 workflow 對 developer ergonomic 拉力大 — schema change 不再是「DBA 工作」、是「dev 自己處理、跟 code review 同流程」。

Vitess transparent sharding：

PlanetScale 強制每個 cluster 走 Vitess（即使單 keyspace 看似 unsharded）
寫吞吐成長到需要 shard 時、加 shard 是 PlanetScale internal 操作、application 看不到
不用養 Vitess SRE 團隊

Zero DBA：

PlanetScale 接管所有 ops（failover / backup / parameter / scaling）
跟 Aurora 同等級「managed」、加上 branch workflow

FK 處理：早期 Vitess（< 18）不支援 FK、PlanetScale 對應期間建議全 drop FK + 改 application enforcement。Vitess 18（2023 末）後加 FK 支援、PlanetScale 在合適 plan 內可啟用、但 cross-shard FK 仍受限。Phase 1 audit 重點不再是「全 drop FK」、而是「驗證 FK 行為（特別 cascade / cross-shard）跟自管 MySQL 預期一致」。

4-phase partial migration（不收斂）

Phase 1：FK 行為驗證 + schema audit、PlanetScale shadow cluster 起來

第一步是 FK 行為驗證 + schema layout audit。Vitess 18+ / PlanetScale 已支援 FK、但行為跟自管 MySQL 有差異：

列所有 FK：SELECT * FROM information_schema.KEY_COLUMN_USAGE WHERE REFERENCED_TABLE_NAME IS NOT NULL
對每個 FK 評估：
- Cross-shard FK：PlanetScale 不允許 FK 跨 shard、parent 跟 child 必須同 shard（透過 Vindex 設計）
- Cascade 行為：cross-shard DELETE cascade 在 PlanetScale 不執行、改 application 層處理
- Native FK 啟用 vs application enforcement：依 Vitess 18+ 行為決定保留 FK 或改 app-level
PlanetScale shadow cluster 起來、跑 application schema、用 Vitess Connector 從自管 binlog ship 資料

工作主要塊：

FK 行為 audit + 改 cross-shard cascade（依 FK 數量、weeks 工作量）
Schema dump → PlanetScale import（用 pscale shell）
Vitess Connector 設定 binlog stream

完成標準：PlanetScale shadow cluster 有完整 production schema、cross-shard FK 已處理、binlog stream lag < 1 秒。
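
The FK audit can be sketched as a classification pass over the FK list. This is a minimal sketch and a conservative heuristic, not Vitess's actual routing logic: the `ForeignKey` tuple and the `shard_key` mapping (table name to sharding column) are hypothetical inputs, and the check assumes both columns use the same Vindex type.

```python
from typing import NamedTuple


class ForeignKey(NamedTuple):
    child_table: str
    child_column: str
    parent_table: str
    parent_column: str


def classify_foreign_keys(fks, shard_key):
    """Split FKs into (same_shard, cross_shard).

    shard_key: dict of table name -> sharding column; tables that are
    missing from the dict are treated as living in an unsharded keyspace.
    """
    same_shard, cross_shard = [], []
    for fk in fks:
        both_unsharded = (fk.child_table not in shard_key
                          and fk.parent_table not in shard_key)
        # Co-located only when the FK columns are the sharding columns.
        keyed_together = (shard_key.get(fk.child_table) == fk.child_column
                          and shard_key.get(fk.parent_table) == fk.parent_column)
        if both_unsharded or keyed_together:
            same_shard.append(fk)
        else:
            cross_shard.append(fk)
    return same_shard, cross_shard
```

Everything that lands in `cross_shard` needs a decision before Phase 2: re-key the child onto the parent's sharding column, or move the cascade into application code.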

Phase 2：Read traffic 切 PlanetScale

跟 Aurora migration Phase 2 同概念：read query 切 PlanetScale connection string、寫入仍自管 MySQL。

差異：

PlanetScale connection 有 per-plan rate limit（Scaler Plan: 10K connections、Enterprise: 100K）
必須走 PlanetScale connection pool（不是直接連、有 SSL handshake overhead）
監控 pscale_io_read_query_throttled_total 確認沒撞 plan limit

跑 2-4 週、確認：

PlanetScale read latency 跟自管 replica latency 接近（PlanetScale Boost cache 可能比自管快）
Vitess Connector stream 穩定
Application 對 PlanetScale row read 量符合 cost forecast

Phase 3：Schema workflow 切 PlanetScale + write cutover

關鍵 paradigm shift：停 gh-ost / pt-osc、改用 PlanetScale branch workflow。

訓練步驟：

第一個 small schema change 用 PlanetScale branch + Deploy Request 跑
開發團隊熟悉 pscale branch create / pscale deploy-request create CLI
CI integration：把 PlanetScale CLI 加進 deploy pipeline
退役 gh-ost / pt-osc CI integration

完成 schema workflow 訓練後 write cutover：

# 1. Promote the PlanetScale shadow cluster to primary (PlanetScale console / API)
#    Enable production writes in the PlanetScale Console, or use the matching `pscale` promotion command
#    (CLI command names change between pscale versions; check pscale --help)

# 2. Point the application connection string at the PlanetScale writer
#    Self-managed → mysql://primary.example.com:3306/production
#    PlanetScale  → mysql://...@xxx.connect.psdb.cloud/production?sslaccept=strict

# 3. Reverse the Vitess Connector (PlanetScale → self-managed) as rollback insurance

完成標準：寫入流量 100% 進 PlanetScale、自管 MySQL 接 PlanetScale binlog（rollback buffer）。

Phase 4：自管 MySQL 退役 / 保留作 rollback buffer

跟 Aurora migration Phase 4 同模式：

自管保留 30-90 天作 cold buffer
確認 PlanetScale cost forecast 跟 actual 一致（per-row read / write 計費可能超預期）
確認 branch workflow 在 production team 內 adopt（不是「PlanetScale 在用、但團隊還是用 gh-ost on staging」這種 stuck 狀態）

多數 org 在 Phase 3 停留更久（半年-一年）— Vitess Connector 反向 binlog ship 是穩定 rollback path、Phase 4 不急。

5 個 Production 踩雷

1. Cross-shard FK — PlanetScale 跟 native MySQL 行為不同

Vitess 18+ / PlanetScale 已支援 FK、但 cross-shard cascade 不執行。同 shard 內 FK 跟 native MySQL 一致；parent 跟 child 跨 shard 時、ON DELETE CASCADE 在 PlanetScale 不會跨 shard 觸發 child delete、結果 application 看到 orphan row。

修法：

Phase 1 audit 出哪些 FK 跨 shard（Vindex 設計決定 parent / child 是否同 shard）
同 shard FK：直接保留、行為跟自管 MySQL 一致
Cross-shard cascade：改 application 層 transaction 內 explicit DELETE child、或 background reconciliation job（定期掃 orphan）
把 parent / child 強制同 shard（用相同 Vindex column）是預防 cross-shard FK 議題的根本解

2. Deploy Request 思維轉換不到位 — 團隊仍用「跑 ALTER」心智模型

DBA / SRE 習慣 直接連 PlanetScale 跑 ALTER —但 PlanetScale 在 production branch 上 禁止 DDL（必須走 Deploy Request）。失敗訊息 not actionable（ERROR: not authorized）、DBA 找不到原因、production maintenance 卡住。

修法：

Phase 3 訓練步驟 不能跳：找一個 small schema change 在 staging 走完整 branch workflow、團隊每個 DBA / SRE 都 hands-on 過
在 ops runbook 寫明 production schema change must go through Deploy Request、列 CLI 命令模板
緊急 schema change（事故中）也走 branch + Deploy Request、PlanetScale 可加速 Deploy（不能 bypass workflow）

3. Schema diff 邊界 — PlanetScale 看不到 application-level INSERT changes

Deploy Request 顯示 schema-level diff（CREATE / ALTER / DROP）、不顯示 data diff。如果 branch 上有 INSERT 進去（測試資料 / seed data）、merge 進 main 時 資料不會搬（只搬 schema）、application 預期有資料但 production 沒。

修法：

把 seed data INSERT 放 application migration / fixture、不在 PlanetScale branch 內
用 PlanetScale CLI export branch data 跟 import to main（手動操作）作為 escape hatch
教育團隊：PlanetScale branch = schema branch、不是 git-like data branch

4. Branch lifecycle ops cost — 100 個 stale branch

每個 PR 都開一個 PlanetScale branch、PR merge 後忘記刪、累積 100 個 stale branch。每個 branch 佔 storage cost、PlanetScale plan limit 也限制 branch 數量。

修法：

CI integration：PR close 自動 pscale branch delete
設 branch retention policy（30 天無活動自動刪）
監控 pscale branch list | wc -l 數量、超 threshold alert
把 branch lifecycle 寫進 team playbook（不是 PlanetScale 教、是團隊內部規範）
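
The retention policy can be sketched as a filter over branch metadata. This is a minimal sketch: the input shape (a list of dicts with `name` and `last_activity`) is a hypothetical structure, not the output format of the pscale CLI, and the 30-day threshold mirrors the policy in the list above.

```python
from datetime import datetime, timedelta, timezone


def find_stale_branches(branches, now, max_idle_days=30, protected=("main",)):
    """Return names of branches idle longer than max_idle_days.

    branches: list of {"name": str, "last_activity": datetime} (assumed shape).
    protected: branch names that must never be proposed for deletion.
    """
    cutoff = now - timedelta(days=max_idle_days)
    return sorted(
        b["name"]
        for b in branches
        if b["name"] not in protected and b["last_activity"] < cutoff
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    branches = [
        {"name": "main", "last_activity": now - timedelta(days=400)},
        {"name": "add-index", "last_activity": now - timedelta(days=45)},
        {"name": "wip", "last_activity": now - timedelta(days=2)},
    ]
    print(find_stale_branches(branches, now))
```

Running the filter in report-only mode first, and deleting only after a review period, avoids removing a branch that still backs an open Deploy Request.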

5. 無 `SUPER` privilege — 部分操作不可行

PlanetScale connection 拿到的 MySQL user 沒有 SUPER privilege。需要 SUPER 的操作直接失敗：

SET GLOBAL（不能改 runtime variable）
KILL 別人的 query（PlanetScale console 提供 alt 介面）
SHOW MASTER STATUS / SHOW SLAVE STATUS（PlanetScale 抽象掉、不暴露）
INSTALL PLUGIN（managed、不允許）
STOP SLAVE / START SLAVE（Vitess 內部）

修法：

評估 application 跟 ops tool 是否依賴 SUPER privilege
改用 PlanetScale console / API 等價操作
部分監控 query（SHOW SLAVE STATUS）用 PlanetScale 內建 dashboard 代替

Schema translation 主要工作量塊

雖然 Type E 結構不以 schema translation 為主、但 schema diff 在 Phase 1 仍佔多數時間：

自管 MySQL	PlanetScale (Vitess)	翻譯難度
FOREIGN KEY constraint	Limited support (Vitess 18+, must be enabled explicitly; cross-shard FK restricted) + application enforcement where needed	High
INSTANT DDL	部分支援、其他走 Vitess online DDL	低-中
Stored procedure	支援	低
Trigger	支援	低
User-defined function	受限	中
INSERT 跨表（CTE）	支援	低
Cross-shard JOIN	必須用 Vindex（user_id 等 shard key 同表）	中-高
`SUPER` 行為	不支援	中（ops tool 改）
`RELOAD` privilege	不支援	中

容量與成本對照

PlanetScale 計費 很不同：

項目	自管 MySQL（EC2）	PlanetScale Scaler Pro
Per-row read	不計費	按量計費、$1 per 1B row read
Per-row written	不計費	按量計費、$1.50 per 1M row write
Storage	EBS、$0.10/GB-month	$1.50/GB-month + replication overhead
Connection limit	max_connections 自己設	per-plan limit、可加 Connection pooler
Branch	不適用	每 branch 含 storage cost
Boost cache	不適用	additional cost
Ops headcount	1-2 FTE	< 0.2 FTE

PlanetScale 適合 小-中規模 + high developer productivity priority：

流量 < 10K WPS：cost 接近自管、developer productivity 顯著提升
流量 10-50K WPS：cost 開始貴、但 ops saving 仍大於 cost increase
流量 > 100K WPS：PlanetScale Enterprise 議價、要 commit pricing

對 high-traffic 場景 cost forecast 必須跑 真實 workload trace — PlanetScale 提供 pscale analytics 預估 read / write 量、用 production binlog replay 在 staging 跑、估算 row read / write 計費。
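
The forecast arithmetic can be sketched with the unit prices from the table above. This is a rough sketch only: it takes the table's list prices at face value ($1 per 1B rows read, $1.50 per 1M rows written, $1.50 per GB-month), ignores plan minimums, replication overhead, branch storage, and Boost, and the function names are hypothetical.

```python
PRICE_PER_BILLION_ROWS_READ = 1.00     # USD, from the table above
PRICE_PER_MILLION_ROWS_WRITTEN = 1.50  # USD, from the table above
PRICE_PER_GB_MONTH = 1.50              # USD, from the table above


def rows_per_month(rows_per_second, days=30):
    """Convert a sustained per-second rate into rows per billing month."""
    return rows_per_second * 86_400 * days


def monthly_cost(rows_read, rows_written, storage_gb):
    """Estimate the usage-based monthly bill in USD."""
    read_cost = rows_read / 1e9 * PRICE_PER_BILLION_ROWS_READ
    write_cost = rows_written / 1e6 * PRICE_PER_MILLION_ROWS_WRITTEN
    storage_cost = storage_gb * PRICE_PER_GB_MONTH
    return read_cost + write_cost + storage_cost


if __name__ == "__main__":
    # Feed in rates measured from a replayed production trace.
    reads = rows_per_month(50_000)   # rows read per second (assumed figure)
    writes = rows_per_month(500)     # rows written per second (assumed figure)
    print(f"estimated: ${monthly_cost(reads, writes, storage_gb=200):,.2f}/month")
```

Note that billing counts rows, not queries: a query that scans 10,000 rows to return 10 is billed for the scan, which is why the trace replay matters more than the query count.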

何時不要遷

FK 是 application core constraint：cascade DELETE / SET NULL 廣泛使用、application 改不動
大量 SUPER-required ops 自動化：DBA tools / monitoring 寫死 SUPER、改不動
OS-level customization 需求：跟 Aurora 一樣、PlanetScale 完全 managed
流量極大 + 預算敏感：> 100K WPS row read 計費可能比 EC2 貴 5x、需要 Enterprise commit pricing
跨雲 portability 是 requirement：PlanetScale 跑在自家 cloud（背後 AWS / GCP）、不像自管 Vitess 可跨雲

跟 Aurora MySQL 對比（同 batch 的選擇）

維度	Aurora MySQL	PlanetScale
Type	C operational hybrid	E paradigm shift
工作量主軸	parameter group + IAM + endpoint	FK audit + branch workflow
Sharding	不 shard、single-region scaling	Vitess 透明 sharding
Schema workflow	仍用 gh-ost / pt-osc	Branch + Deploy Request
FK	Supported	Limited support (Vitess 18+, cross-shard restricted)
Cost model	per-hour instance + per-GB storage	per-row read / write + per-GB storage
適合規模	100 GB - 50 TB	100 GB - 1 PB
跨雲	AWS-only	PlanetScale 背後 AWS / GCP

選擇邏輯：

AWS-heavy ecosystem + 不想 schema workflow paradigm shift → Aurora
Developer-first culture + 想 branch-based schema workflow + 接受 FK 限制 → PlanetScale

兩者不互斥、有 org 用 Aurora 給 OLTP core、PlanetScale 給 newer microservices（branch workflow 帶價值）。

從 RDS / MongoDB 遷移到 DynamoDB：access-pattern-first 重建模、混合架構與 cost crossover

Tue, 02 Jun 2026 00:00:00 +0000

本文是 DynamoDB overview 的 migration playbook。寫作參照 Migration Playbook 寫作方法論。

「我們要把 RDS 整個搬到 DynamoDB。」這句話本身就藏著最大的誤解 — DynamoDB 遷移不是把 table schema 1:1 搬過去。RDS 的 normalized schema、JOIN、ad-hoc query 在 DynamoDB 沒有對應物；MongoDB 的彈性 document、二級索引、aggregation pipeline 也不能直接映射。字面意義的「遷移」不成立 — 遷移的動作是 從 access pattern 重新設計資料模型、搬資料只是最後一步。能不能遷、該遷多少，取決於 workload 的查詢形狀是否固定、一致性需求是否能放寬。本文走 paradigm shift 結構：先講為何字面遷移不成立、再講哪些該遷哪些該留、最後才是階段化執行。

6 維 diff audit：主導維度是 paradigm

遷移前先盤點 source 跟 target 的差異落在哪幾維、決定 playbook 結構：

維度	RDS / MongoDB → DynamoDB	程度
Schema / API	SQL / document query → KV `GetItem` / `Query`、無 JOIN	High
Operational model	self-managed / RDS-managed → fully managed serverless	Medium
Paradigm	relational / document model → access-pattern-first KV	High
Components 數量	單 DB → 單 DB（不拆分）	Low
Application change	ORM / query layer 全改、access pattern 先行	High
Data topology	partition key 設計、無跨 region transaction	Medium

主導維度是 paradigm（其次 schema / application change）。這定義了結構 — Type E paradigm shift（排除 schema 翻譯 Type A 和 drop-in Type B）：部分遷移、長期混合架構、不收斂到「全部搬完」。

No-go condition：workload 需要 ad-hoc 分析查詢、跨實體 JOIN、頻繁 schema 變動下的彈性查詢、或複雜多表交易 → 不該遷 DynamoDB。這些是 relational / document 的主場、硬遷會把複雜度推給 application 層（自己做 JOIN、自己維護冗餘）。

為什麼字面遷移不成立：paradigm gap

RDS / MongoDB 是 先有資料模型、再支援任意查詢；DynamoDB 是 先有查詢、才設計資料模型。這個順序顛倒是遷移的核心難點。

relational → DynamoDB 的斷層：

JOIN 消失：relational 用 JOIN 組合多表、DynamoDB 要嘛預先反正規化（把關聯資料寫在同一 item / 同一 partition）、要嘛 application 多次查詢自己組
ad-hoc query 消失：RDS 可以對任意欄位下 WHERE、DynamoDB 只能用 PK/SK 或預建 GSI 查（對應 gsi-lsi-design）
強一致交易縮窄：relational 任意多表交易 → DynamoDB 有限的 TransactWriteItems（對應 transactions-conditional-writes）

document（MongoDB）→ DynamoDB 的斷層：

看似接近（都是 NoSQL / document-ish）、實際 MongoDB 的二級索引彈性、aggregation pipeline、彈性 query 在 DynamoDB 都沒有對應
MongoDB 可以「先存進去、之後再想怎麼查」；DynamoDB 不行、access pattern 沒想清楚就建表、後面要重做

所以遷移的第一步不是匯資料、是 窮舉 access pattern：列出 application 對這份資料的所有讀寫路徑、每條路徑對應 DynamoDB 的 PK/SK/GSI 設計。access pattern 列不完整、就還不能開始遷。
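
The access pattern inventory can be sketched as a small data structure with a completeness gate. This is a minimal sketch: the `AccessPattern` fields and the example key templates (`USER#<id>`, `ORDER#<ts>`) are hypothetical illustrations, not a prescribed format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AccessPattern:
    name: str                  # e.g. "list orders for a user"
    operation: str             # "GetItem", "Query", "PutItem", ...
    key_design: Optional[str]  # e.g. "PK=USER#<id>, SK=ORDER#<ts>"; None = undecided
    needs_strong_read: bool = False


def unresolved(patterns):
    """Names of access patterns that do not have a key design yet."""
    return [p.name for p in patterns if not p.key_design]


def ready_to_model(patterns):
    """Gate for leaving Phase 1: a non-empty inventory with no open items."""
    return bool(patterns) and not unresolved(patterns)


if __name__ == "__main__":
    inventory = [
        AccessPattern("get user by id", "GetItem", "PK=USER#<id>, SK=PROFILE"),
        AccessPattern("list orders for a user", "Query", "PK=USER#<id>, SK=ORDER#<ts>"),
        AccessPattern("find orders by status", "Query", None),  # probably needs a GSI
    ]
    print("open items:", unresolved(inventory))
```

Keeping the inventory in code or a reviewed file means the Phase 1 exit criterion is checkable, not a matter of opinion.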

哪些 workload 該遷、哪些該留（混合架構）

Type E 的本質是 不收斂 — 不是所有資料都該進 DynamoDB、混合架構會長期存在。判讀標準：

Workload 特徵	去向
access pattern 固定、key-based 查詢、高吞吐	遷 DynamoDB
可接受 eventually consistent	遷 DynamoDB
需要 ad-hoc 分析 / 報表 / JOIN	留 RDS / 或進 analytics 系統
需要強一致複雜交易	留 RDS
schema 頻繁演進、查詢需求不穩	留 MongoDB / RDS

9.C20 Zomato 是這個判讀的 case anchor：Zomato 遷的是 billing platform（帳單事件、access pattern 固定、可接受 eventually consistent）、不是把整家公司的資料庫都搬。帳單系統從 TiDB 遷到 DynamoDB 後吞吐 2,000 → 8,000 RPM（4x）、延遲降 90%、成本降 50%；動機是 TiDB 必須為突發流量峰值預先 over-provision、DynamoDB on-demand「pay only for what we use」避免常態浪費。

Scope warning：Zomato 的「成本降 50%」是 當下流量 下的對照、不是永久結論；「延遲降 90%」可能主要是 p50、p99/p999 改善幅度通常較小。這兩點 case 原文已標明、引用時不可升級成「DynamoDB 永遠更便宜更快」。crossover 判讀見下方容量段。

Phase plan：access-pattern-first 階段化

paradigm shift 的階段化把不可逆動作放到最後、每階段有獨立驗證門檻：

Phase 1：access pattern 窮舉

列出 application 對目標資料的所有讀寫路徑、標每條的頻率、一致性需求、是否可放寬。這份清單是後續所有設計的輸入、不完整不進下一階段。

Phase 2：DynamoDB 資料建模

依 access pattern 設計 PK/SK、single-table 結構、需要的 GSI、capacity mode。對應 single-table-design-pattern、partition-key-antipatterns。

Phase 3：dual-write

application 同時寫舊（RDS / MongoDB）跟新（DynamoDB）。舊系統仍是 source of truth、DynamoDB 累積資料。dual-write 要處理寫入失敗一致性（其中一邊失敗如何補償）。

Phase 4：backfill 歷史資料

把舊系統既有資料按新模型轉換寫入 DynamoDB。backfill 跟 dual-write 並行時要處理覆蓋順序（backfill 不能覆蓋掉 dual-write 的新值）。
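The overwrite-ordering rule can be expressed as a conditional write guarded by a version. The sketch below uses an in-memory `Map` as the target table and assumes every row carries a monotonically increasing `version`; that attribute and the function names are assumptions for illustration. In DynamoDB the same guard would typically be a condition expression on the put, which is not shown here.

```javascript
// Write only if the target has no item for this id, or holds an older version.
function putIfNewer(table, id, item) {
  const existing = table.get(id);
  if (existing && existing.version >= item.version) return false; // keep the newer value
  table.set(id, item);
  return true;
}

// Backfill a snapshot of legacy rows without clobbering dual-written values.
function backfill(table, legacyRows) {
  let written = 0;
  let skipped = 0;
  for (const row of legacyRows) {
    const applied = putIfNewer(table, row.id, { version: row.version, data: row.data });
    if (applied) written++;
    else skipped++;
  }
  return { written, skipped };
}
```

The skipped count is worth recording: it shows how many rows the live dual-write path had already updated by the time the backfill reached them.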

Phase 5：shadow read 驗證

讀路徑同時打舊跟新、比對結果、記錄差異但仍以舊系統回應用戶。shadow read 是 cutover 前的信心來源 — 差異率降到可接受才進 cutover。對應 1.7 Schema Migration Rollout 證據的 evidence 方法。
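A shadow read wrapper can be as small as the following sketch. The reader functions and the diff log are hypothetical stand-ins; comparing with `JSON.stringify` is a simplification that is sensitive to property order, so a real comparison would normalize both results first.

```javascript
// Read from both systems, log any difference, and always answer from the old one.
function shadowRead(readOld, readNew, diffLog, key) {
  const oldResult = readOld(key);
  try {
    const newResult = readNew(key);
    // Simplified comparison: property order matters with JSON.stringify.
    if (JSON.stringify(oldResult) !== JSON.stringify(newResult)) {
      diffLog.push({ key, old: oldResult, new: newResult });
    }
  } catch (err) {
    // A failing shadow path must never affect the user-facing response.
    diffLog.push({ key, error: err.message });
  }
  return oldResult;
}

// Fraction of shadowed reads that produced a difference or an error.
function diffRate(diffLog, totalReads) {
  return totalReads === 0 ? 0 : diffLog.length / totalReads;
}
```

Classifying each logged entry afterwards (acceptable eventual-consistency lag versus a real modeling error) is what turns the raw diff rate into a cutover decision.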

Phase 6：漸進 cutover

讀流量逐步從舊切到新（按比例 / 按 user segment）、保留隨時切回的能力。cutover 完成後 DynamoDB 成為該 workload 的 source of truth；但其他未遷 workload 仍在 RDS / MongoDB — 混合架構成立。

Evidence：每階段的前進依據

每個階段用資料證明可前進、不靠感覺：

階段	Evidence
dual-write	雙寫成功率、寫入失敗補償紀錄、兩邊 row count 差異
backfill	已 backfill 比例、轉換錯誤數、checksum 對照
shadow read	新舊結果差異率、差異分類（可接受的 eventual vs 真錯誤）
cutover	切流比例、新系統 latency p99、error rate、rollback 是否觸發

這些 evidence 對齊 4.20 Observability Evidence Package（Source / Time range / Query link / Owner / Data quality）與 6.8 release gate 的 gate 決策。

Cutover 與 rollback 決策

資料庫切流失敗代價高、決策權責要寫清楚：

cutover window：選低流量時段、明確切流比例階梯（如 1% → 10% → 50% → 100%）
rollback condition：新系統 error rate / latency 超過閾值、或 shadow read 差異率異常 → 切回舊系統
decision owner：誰有權喊停、依據什麼 evidence、記錄在 8.19 incident decision log（Timestamp / Decision / Context / Evidence / Owner / Rollback condition）
資料凍結策略：cutover 期間若需要凍結寫入、明確凍結範圍與時長

對應 rollback window、rollback condition。
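Writing the ladder and the rollback condition down as code removes ambiguity about who decides what. The sketch below uses the 1% → 10% → 50% → 100% ladder from the list above; the metric names and threshold fields are hypothetical, and a real gate would also record the decision owner and evidence links.

```javascript
const LADDER = [1, 10, 50, 100]; // traffic percentages, as in the example ladder

// Decide the next step from current metrics; any breached threshold means rollback.
function nextCutoverStep(currentPercent, metrics, thresholds) {
  const breached =
    metrics.errorRate > thresholds.maxErrorRate ||
    metrics.p99Ms > thresholds.maxP99Ms ||
    metrics.shadowDiffRate > thresholds.maxShadowDiffRate;
  if (breached) return { action: "rollback", percent: 0 };

  const index = LADDER.indexOf(currentPercent);
  if (index === -1) throw new Error(`unknown ladder step: ${currentPercent}`);
  if (index === LADDER.length - 1) return { action: "done", percent: 100 };
  return { action: "advance", percent: LADDER[index + 1] };
}
```

Keeping the thresholds as an explicit argument means they can be reviewed and agreed before the cutover window rather than improvised during it.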

Cleanup 與長期混合

Type E 的 cleanup 不一定是「退役舊系統」— 多數情況舊系統仍服務未遷 workload：

已遷 workload 的舊 schema / 舊 writer / dual-write code path 退役
shadow read 比對 code 移除
但 RDS / MongoDB 本身保留（服務 analytics / 強一致 / 彈性查詢 workload）
明確標示哪條資料路徑的 source of truth 是 DynamoDB、哪條仍是 RDS / MongoDB、避免「到底哪個是真的」混亂

混合架構不是過渡失敗、是 paradigm shift 的穩態 — 每個 workload 待在最適合它的儲存層。

失敗模式

production 常見的 5 個踩雷：

Case 1：先匯資料才想 access pattern

把 RDS table 結構直接搬成 DynamoDB item、上線後發現查不出要的資料、要重建表。修法：access pattern 窮舉是 Phase 1、資料建模是 Phase 2；順序不能顛倒。

Case 2：把 JOIN 邏輯推給 application 卻沒評估成本

遷了關聯資料、application 每次查詢做 N 次 DynamoDB 呼叫自己組 JOIN、latency 跟成本爆炸。修法：關聯資料在建模階段反正規化（同 partition / 同 item）；無法反正規化的關聯查詢、該 workload 可能不適合遷。

Case 3：dual-write 一邊失敗沒補償

dual-write 時 DynamoDB 寫成功 RDS 失敗（或反之）、兩邊資料分歧、cutover 後發現新系統資料不完整。修法：dual-write 要有失敗補償（記錄失敗、重試、或標記該筆需人工對帳）；對應 1.9 Reconciliation 與 Data Repair。

Case 4：跳過 shadow read 直接 cutover

對自己的建模有信心、省掉 shadow read、cutover 後才發現 access pattern 漏了某個查詢路徑、生產出錯。修法：shadow read 是 cutover 前唯一能在真實流量下驗證新模型的階段、不能省。

Case 5：只看當下成本忽略 crossover

遷移時算出成本降 50% 就下決策、未來流量成長後 DynamoDB cost-per-request 累積超過自管 cluster、反而更貴。修法：算 12-24 個月在預期流量下的成本曲線、不是當下 snapshot（見容量段）。

Anti-recommendation：workload 查詢需求還在快速變化、或團隊對 access-pattern-first 建模沒經驗 → 先不要遷；用一個低風險、access pattern 已穩定的 workload 試點（如 Zomato 的 billing platform）、累積經驗再擴大。

容量與成本：crossover 判讀

DynamoDB 成本判讀的關鍵是 未來流量曲線、不是遷移當下的 snapshot：

遷移當下：相對 over-provisioned 的自管 cluster、DynamoDB on-demand 常更便宜（Zomato -50%）
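After traffic grows: DynamoDB on-demand spend rises linearly with request volume (the per-request price stays flat), while a self-managed cluster behaves closer to a fixed cost. Under high and predictable traffic there is a crossover point beyond which the self-managed cluster can become the cheaper option.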
判讀分層：小/中流量或流量不可預測 → DynamoDB 划算；大且可預測流量 + 已有 DBA 團隊 → 算自管 crossover

這條 vendor-level 成本軸主寫於 on-demand-vs-provisioned 軸 6；本篇從遷移決策角度引用、不重複展開 6 軸。

Scope warning：crossover 點隨 region pricing、workload shape、團隊成本結構變動、無通用閾值；Zomato 的具體百分比是單一 case 當下對照、不可外推。
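The crossover reasoning reduces to simple arithmetic once the inputs are fixed. The sketch below treats on-demand spend as linear in request volume and the self-managed cluster as a flat monthly cost; both are simplifying assumptions, and every price and volume in the usage note is a made-up placeholder, not real pricing. Storage, data transfer, and staffing costs are deliberately left out.

```javascript
// Linear on-demand model: spend is proportional to request volume.
function onDemandMonthlyCost(requestsPerMonth, pricePerMillionRequests) {
  return (requestsPerMonth / 1e6) * pricePerMillionRequests;
}

// Request volume at which on-demand spend equals a flat cluster cost.
function crossoverRequestsPerMonth(clusterMonthlyCost, pricePerMillionRequests) {
  return (clusterMonthlyCost / pricePerMillionRequests) * 1e6;
}

// First month (0-based) in which on-demand spend exceeds the cluster cost,
// given compound monthly traffic growth; null if it never happens in the window.
function firstMonthOnDemandIsDearer(plan) {
  let requests = plan.startRequests;
  for (let month = 0; month <= plan.months; month++) {
    const cost = onDemandMonthlyCost(requests, plan.pricePerMillionRequests);
    if (cost > plan.clusterMonthlyCost) return month;
    requests *= 1 + plan.monthlyGrowth;
  }
  return null;
}
```

With placeholder inputs of 1 billion requests per month growing 10% monthly, a price of 1 unit per million requests, and a flat cluster cost of 3,000 units, the crossover volume is 3 billion requests per month and the function returns month 12. A snapshot taken at month 0 would show on-demand at one third of the cluster cost, which is exactly the trap Case 5 describes.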

接回 9.7 成本邊界與 efficiency、1.10 KV / Document DB 容量規劃。

邊界與整合

跟其他遷移路徑的關係

DynamoDB → SQL / search / analytics split（遷出方向）：當 DynamoDB workload 長出 ad-hoc 查詢需求、把分析部分拆到 OpenSearch / 數倉、是反向路徑、屬另一篇 playbook scope
MongoDB → Atlas：若只是要 managed MongoDB 而非換 paradigm、走 MongoDB → Atlas、不必遷 DynamoDB（保留 document paradigm）
跨平台等效：RDS → Aurora（保留 relational）、MongoDB → Cosmos DB（保留 document）、都比遷 DynamoDB 的 paradigm 跨度小；先確認真的需要換 paradigm

Sibling 與 cross-link

single-table-design-pattern — 遷移 Phase 2 資料建模的核心
partition-key-antipatterns — 建模時 PK 均勻度判讀
transactions-conditional-writes — 遷移後寫一致性如何在 DynamoDB 重建
on-demand-vs-provisioned — cost crossover 軸 6 SSoT
1.6 資料庫轉換實作 — 通用 dual-write / shadow read / cutover 框架
跟 Zomato 9.C20 互引：billing platform 遷移的可量化對照與 cost crossover 警示

PostgreSQL → CockroachDB：三維皆 High 的多重歸類 migration

Tue, 19 May 2026 00:00:00 +0000

本文是跨 vendor migration playbook、cross-link 到 PostgreSQL 跟 CockroachDB。本文是 #127 多重歸類跟 tie-breaking 規則的實證 — 三維皆 High 配對的處理方式不是「選 type A 或 type C 或 type E」、是 主導維度走 Type E、其他高維度獨立加段。每階段切換用 migration gate 把關。

三維皆 High：決策矩陣

跑 diff dimension audit 對 PostgreSQL → CockroachDB：

維度	評估	等級
Schema / API	PostgreSQL wire protocol 兼容、但 SQL feature set 部分缺（CTE recursive 部分 / window function 部分 / extension 完全缺）	High
Operational model	Single-node + Patroni → distributed Raft + 自動 rebalance；HA / backup / topology 全換	High
Abstraction / paradigm	Single-node MVCC + transaction → distributed Serializable Snapshot Isolation (SSI)	High
Number of components	同 1 個 DB cluster	Low
Application change	Transaction retry pattern 必須改、ORM 可能需 patch	Medium

3 維 High + 1 維 Medium。按 methodology audit Step 5 的多重歸類處理規則：

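Dominant-dimension rule (priority order): Schema > Paradigm > Operational > Components

Applied here: Schema High + Paradigm High + Operational High
- Schema is High, but CRDB offers PostgreSQL wire protocol compatibility
- Paradigm is High: a fundamental shift from *single-node → distributed*, and what readers care about most
- Operational is High, but largely downstream of Paradigm

→ Main structure follows Paradigm (Type E); Schema and Operational get their own supplementary sections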

不強迫單一 type 標籤 — 本文是 Type E 為主 + Type A / C 高維度增補 的 multi-axis 形態。

結構 differentiator：Type E 主結構 + 多軸增補段

跟前批 5 個 migration playbook 對照：

結構元素	Type A Splunk → Elastic	Type B Redis → DragonflyDB	Type C PostgreSQL → Aurora	Type D Datadog → Grafana	Type E Kafka ↔ NATS	本文（三維 High）
Phased translation	yes	-	-	-	-	partial
Compatibility audit	-	yes	-	-	-	yes
Operational redesign 對位	-	-	yes	-	-	yes（獨立段）
Schema gap 對位	-	-	-	-	-	yes（獨立段）
Parallel streams	-	-	-	yes	-	-
Paradigm contrast	-	-	-	-	yes	yes
Application 重設計	-	-	-	-	yes	yes
混合架構 long-term	-	-	-	-	yes	partial（部分 workload）

本文是「Type E 為主 + Type A schema gap 段 + Type C operational redesign 段」混合形態、9-10 章節、260-300 行。

維度 1：Paradigm shift（主導）

CRDB 是 distributed SQL DB、不是「PostgreSQL 多節點版」。核心差異：

概念	PostgreSQL	CockroachDB
Transaction isolation	MVCC、Read Committed default	Serializable Snapshot Isolation (SSI)、強一致
Transaction conflict	First writer wins	Retry-on-conflict、application 必須處理 `40001` retry code
Replication	Streaming replication + standby	Raft consensus、每筆寫 quorum + 自動 rebalance
Partition	Declarative partitioning（手動）	Automatic range-based + locality-aware
Latency p99	1-10ms（單 region）	5-50ms（cross-AZ Raft quorum）
Throughput limit	單 primary 上限 ~10-50K TPS	Linear scale by adding node、~5K TPS / node

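The key paradigm change: a transaction can be rejected with a retryable error at any point, including at commit. Atomicity still holds (a transaction commits fully or not at all), but the application owns the retry. All transaction code needs a retry loop (CRDB offers the cockroach_restart savepoint for client-side retries).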

維度 2：Schema gap（PostgreSQL features CRDB 不支援）

CRDB advertises PostgreSQL compatibility, but feature coverage is roughly 80-90%; common gaps:

PostgreSQL feature	CRDB 狀態	影響
Stored procedure / function (PL/pgSQL)	Limited（CRDB 22.2+ 部分支援）	Migration scope 內必須 audit + 改寫
Common Table Expression (CTE) recursive	Limited (depth + structure)	複雜 CTE 可能跑不通、必須 query refactor
Window function 全集	Partial	報表 query 需逐 case 驗證
Extensions (pg_repack / pgaudit / TimescaleDB)	不支援	用 CRDB 自家 alternative 或自管 application 層
Triggers	Limited	Audit / data integrity 邏輯遷到 application 層
Custom types / domain	Partial	用 CHECK constraint 替代
Geographic types (PostGIS)	CRDB native geo support（語法不同）	Spatial query 改寫
`SELECT FOR UPDATE` semantics	對等但底層機制不同（distributed lock）	注意 deadlock pattern 差異
Advisory locks	不支援	Application 端用其他 distributed lock（Redis / Consul）

Migration 必須 先 audit 完整 SQL feature 使用、列出 gap、評估解法或退役。

維度 3：Operational redesign

CRDB operational model 完全不同：

Operational concept	PostgreSQL self-managed	CRDB
Cluster bootstrap	Patroni / Stolon + manual	`cockroach init` + 自動 Raft formation
HA	Patroni + DCS + watchdog	內建 Raft、無 single primary
Failover	Patroni-managed、15-60s	透明 Raft re-election、< 5s
Backup	pgBackRest + WAL archive	`BACKUP TO` (incremental + full)
Restore	`pgBackRest restore` + PITR	`RESTORE FROM`
Replication	Streaming + logical	Built-in、無 logical replication 對等概念
Schema migration	`pg_dump` / Flyway / Liquibase	`cockroach sql` + online schema change（無 lock）
Monitoring	pg_stat_* views + Prometheus exporter	CRDB admin UI + Prometheus（schema 不同）
Sizing	Vertical scale（單 node big spec）	Horizontal scale（多 node 小 spec）

SRE 心智模型完全重訓：無 primary 概念 / 無 streaming lag 概念 / 無 standby promote 概念。

Migration 流程（混合形態）

不是線性 phased、是 phased + parallel + partial 混合：

Phase 0: scope assessment
  - List applications; separate "fits CRDB" from "stays on PostgreSQL"
  - SQL feature audit
  - Application transaction pattern audit

Phase 1: schema port + application rewrite
  - Convert DDL to CRDB syntax
  - Find alternatives for unsupported extensions
  - Add retry loops to application transaction code

Phase 2: dual-write period (some applications start using CRDB)
  - New applications use CRDB
  - Existing applications stay on PostgreSQL
  - CDC bridge (Debezium → Kafka → CRDB consumer)

Phase 3: cut over the applications that fit
  - Each application cuts over independently
  - Not "switch the whole DB at once"

Phase 4: long-term hybrid architecture
  - Some workloads stay on PostgreSQL permanently (poor fit for distributed)
  - CRDB runs the workloads that suit distribution

整體 3-6 個月、不收斂到全 CRDB。

Production 故障演練

Case 1：Transaction retry 沒處理、application 大量 `40001` error

徵兆：cutover 後 application 5-10% transaction 報 restart transaction: TransactionRetryWithProtoRefreshError、業務 fail。

根因：PostgreSQL Read Committed 不要求 application 處理 conflict、CRDB Serializable Isolation 必須 retry-on-conflict；application code 沒 retry loop。

修法：

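// CRDB transaction with retry on SQLSTATE 40001 (assumes database/sql + the pgx stdlib driver's pgconn.PgError)
var err error
for retries := 0; retries < 10; retries++ {
    var tx *sql.Tx
    if tx, err = db.Begin(); err != nil {
        break // could not start a transaction
    }
    // ... transaction logic; assign any failure to err ...
    if err == nil {
        err = tx.Commit()
    }
    if err == nil {
        break // committed
    }
    _ = tx.Rollback() // release the failed attempt
    var pgErr *pgconn.PgError
    if !errors.As(err, &pgErr) || pgErr.Code != "40001" {
        break // not a retryable serialization failure
    }
    time.Sleep(backoff(retries))
}
// err is nil on success; otherwise it holds the last failure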

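Framework level: prefer a client library that ships a retry helper. For Go, the cockroach-go library provides a retry wrapper (crdb.ExecuteTx); for JDBC, check whether your driver or framework offers an equivalent before hand-rolling one.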

Case 2：Extension 缺位、application feature 整段掉

徵兆：cutover 後 application 某個地理計算功能直接報錯、PostGIS 函數不存在；migrate 計畫漏看。

根因：CRDB native geo 不同 syntax / API、PostGIS extension 不能直接搬。

修法：

Pre-migration 必跑 extension audit：列所有 pg_extension、找對應 CRDB feature 或退役
PostGIS 替代：CRDB native ST_* functions、部分 syntax 對齊但 spatial index 不同
退役不能換的 feature：評估保留 PostgreSQL（混合架構）

Case 3：Sequential PK 撞 Raft quorum 瓶頸

徵兆：cutover 後寫入吞吐量 / latency 不如預期、CRDB cluster CPU < 30% 但 write latency p99 high。

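Root cause: the application uses a sequential PK (SERIAL / IDENTITY in PostgreSQL, or AUTO_INCREMENT on schemas ported from MySQL). CRDB stores adjacent keys in the same range, so consecutive inserts all land on one range and one Raft group; writes serialize there, and adding nodes does not raise throughput.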

修法：

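Switch to random UUIDs (gen_random_uuid()) or a hash-sharded index: random keys spread inserts across ranges, and a hash-sharded index keeps a time-ordered column while prefixing it with a hash bucket. Time-ordered IDs such as UUID v7 or unique_rowid() still concentrate inserts near the newest keys, so on their own they do not remove the hotspot.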
PRIMARY KEY (region, id)：multi-region 場景 multi-tenancy 自然拆分
不適合的 workload 留 PostgreSQL：不是所有 schema 都適合 distributed

Case 4：Long transaction 對 Raft 衝擊

徵兆：跨 1 分鐘+ 的 transaction（batch processing / 大 ETL）大量 retry、最後失敗；同期間其他短 transaction 也 retry rate 上升。

根因：CRDB long transaction holds intent on touched ranges、阻塞其他 transaction；SSI conflict 機率隨 transaction 時間平方增長。

修法：

Long transaction 拆短：batch 用多個 short transaction、checkpoint 在 application 層
Heavy ETL 不跑 CRDB：用 CRDB CDC export 到 OLAP（Snowflake / BigQuery）跑 batch
Read-only long transaction 用 follower read：AS OF SYSTEM TIME 不 hold intent、適合 reporting

Case 5：Backup / restore 行為跟 PostgreSQL 不同、SRE runbook 失效

徵兆：DBA 嘗試 pg_restore 失敗、CRDB 端 backup format 完全不同；incident response 卡關 1-2 小時。

根因：CRDB backup 是 cluster-internal format、不能用 PostgreSQL tooling；SRE runbook 仍是 PostgreSQL world、應急時心智模型錯位。

修法：

Runbook 重寫：CRDB-specific backup / restore 流程、SRE training
DR drill：cutover 前跑完整 DR drill、用 CRDB tooling 完成、不依賴 PostgreSQL 經驗
Multi-region backup：CRDB 跨 region backup 配置、避免單 region 故障

Capacity 規劃

維度	PostgreSQL self-managed	CockroachDB
Single-node 上限	~10-50K TPS（vertical scale 到 32-128 vCPU）	~5K TPS / node（horizontal scale by adding node）
跨 region	高 latency 跨區 streaming	設計 native、Locality-aware queries
Sharding	手動 partition / pg_partman	自動 range-based
Storage / TPS ratio	不變	Storage 跨 node 3x（Raft quorum 3-replica default）
Total cost (10TB)	$2-4K USD / month（self-managed）	$5-10K USD / month（CRDB Cloud + 3x storage）

判讀：CRDB cost 顯著高、選 CRDB 必須是 paradigm 需求（distributed transaction / multi-region / linear scale）；單純成本 / availability 改善走 Aurora 更划算。

整合 / 下一步

跟 PostgreSQL → Aurora migration 對比

兩條 PostgreSQL 出路：

Aurora：operational simplification、protocol drop-in、cost 中等漲；適合 不需 distributed transaction 的 production
CRDB：distributed paradigm shift、application 必須改、cost 顯著漲；適合 真的需要 distributed 的 workload

多數 application 不需要 distributed transaction、Aurora 更合理；真正需要 cross-region 強一致 / linear scale by adding node 才走 CRDB。

跟 application transaction pattern 重設計

CRDB 強制 application 改 transaction code、retry loop 必加。團隊心智模型轉換是 migration 主要 effort、技術部分相對少。

下一步議題

CRDB → PostgreSQL reverse migration：當業務 simplify 後 distributed 不必要、reverse migration cost 高、實務上 CRDB 是 single-direction lock-in
CRDB Serverless：cost 起點低、burst workload 適合；steady workload 仍是 dedicated cluster
Multi-region active-active：CRDB 真正強項、但網路成本爆、僅金融 / 政府客戶 ROI 合理

JMeter → k6：k6 不是 JMeter 的「script 版本」、是 VU model 取代 thread model

Tue, 19 May 2026 00:00:00 +0000

k6 不是 JMeter 的 「script 版本」。

這個誤解是 JMeter → k6 migration 第一週最常見的事故來源。Migration 啟動會議常聽到「JMeter 的 thread group 翻成 k6 的 VU 就好了吧」、然後團隊把 .jmx 內 100 thread → k6 vus: 100、跑下去發現 RPS 差三倍、p95 延遲表完全不同形狀、以為 k6 壞了。

實際上 k6 的 Virtual User (VU) 跟 JMeter 的 Thread 是 兩種不同的使用者行為建模方式：

JMeter Thread：一個 OS thread = 一個 user、numThreads=100 就 固定 100 個 concurrent 使用者一直跑、ramp-up period 控制怎麼啟動、無 explicit arrival rate 概念
k6 VU：一個 goroutine-like execution context、預設 vus 是 concurrent VU pool、但 k6 更推薦用 arrival-rate executor — 直接表達 每秒進來幾個 request、VU 是 為了達到 arrival rate 動態起的 worker

差別在 測量視角：JMeter 預設視角是 「我有 100 個使用者在用系統」、k6 預設視角是 「我每秒有 N 個請求進來」。兩種視角下 同一個系統的瓶頸結果完全不同：100 concurrent user 模型在 server 慢時 throughput 會自動降（user 等回應）、100 RPS arrival rate 模型在 server 慢時 queue 會累積、暴露 真實 production behavior（user 不會體諒、會繼續送請求）。
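The two viewpoints can be compared with two textbook relations: in a closed model, throughput is users / (response time + think time); in an open model, the number of in-flight requests is arrival rate × response time (Little's law). The sketch below is idealized steady-state arithmetic, not a simulator, and the function names are hypothetical.

```javascript
// Closed model: a fixed number of users, each waiting for a response before the next request.
// Returns requests per second.
function closedModelThroughput(users, responseTimeMs, thinkTimeMs = 0) {
  return (users * 1000) / (responseTimeMs + thinkTimeMs);
}

// Open model: requests arrive at a fixed rate regardless of how slow the server is.
// Returns the average number of requests in flight (Little's law).
function openModelInFlight(arrivalRatePerSec, responseTimeMs) {
  return (arrivalRatePerSec * responseTimeMs) / 1000;
}

// Server slows from 100 ms to 500 ms per request:
console.log(closedModelThroughput(100, 100)); // 1000
console.log(closedModelThroughput(100, 500)); // 200
console.log(openModelInFlight(100, 100)); // 10
console.log(openModelInFlight(100, 500)); // 50
```

In the closed model the offered load drops fivefold when the server slows down, which hides the problem; in the open model the load stays at 100 requests per second and the in-flight count grows fivefold, which is where queues build up.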

這篇 migration playbook 不是 schema translation 文（.jmx 翻成 .js 只是表面）、是 paradigm shift — 從 closed-system model（thread）到 open-system model（arrival rate）的視角轉換。

為什麼是 Type E（schema + paradigm 同 High）

跑 6 維 diff dimension audit：

維度	評	說明
Schema	High	`.jmx` XML vs JavaScript scenario、test plan 完全不同 file format / DSL
Operational	Medium	CLI / distributed run 接近、CI integration 差別大、distributed runner 模型不同
Paradigm	High	thread group closed model → arrival rate open model、測試思維不同
Components	Low	都是 load test runner、no multi-tool decomposition
App change	N/A	是 test code、不是 production code
Topology	Low	都是 CLI / runner 跑、無 sharding

Schema High + Paradigm High 兩軸 High。按優先序 Schema > Paradigm、預設選 Type A。但對 JMeter → k6 的讀者來說、paradigm shift 才是難關 — schema translation 是工作量、但搞錯 paradigm 會讓 migration 後的測試結果 跟 production 不對應。所以選 Type E paradigm shift 結構、schema translation 抽出 Phase 1-2 補充。

Driver：developer ergonomic + CI gate friendly

從 JMeter 遷出 k6 的核心拉力是 developer ergonomic + CI 友善：

.jmx XML 在 git 內 diff 不可讀：兩個 .jmx PR 的 diff 是 XML attribute reorder noise、reviewer 看不出來實際邏輯改了什麼；JavaScript 是純文字 + AST、PR diff 直接可讀
GUI 學習曲線：JMeter GUI 不是現代 IDE、不熟的工程師寫一個 scenario 要花半天找對的 sampler 跟 listener；JavaScript 用既有 IDE（VS Code / IntelliJ）、autocomplete + lint + format 全有
CI integration 步驟差：JMeter 在 CI 跑要 packaging plugin + non-GUI mode + result XML parser；k6 直接 k6 run script.js、result 是 JSON / Prometheus metrics、threshold pass/fail 直接 exit code
單機 VU 容量：JMeter 單機通常 ~500-1000 thread（受 JVM 跟 OS thread limit）、k6 單機可跑 30K-50K VU（Go runtime + goroutine）、distributed runner 需求降低
Workload model expressiveness：k6 arrival-rate executor + ramping-vus + constant-vus 三種 executor 直接對應 open system / ramping / closed system 三種測量視角、不像 JMeter 需要組合 Constant Throughput Timer + Synchronizing Timer + thread group 才達到

這條 driver 在 QA 團隊 GUI 維護 .jmx asset 的 org 沒拉力（GUI 反而是優勢）、但對 dev / SRE 寫 performance test 進 CI 的 org 是強拉力。Audience 不同、migration value 完全不同。

4-phase partial migration（不收斂）

Type E 的特徵是 不收斂 — 多數 org 不會把 .jmx 全退役、會停在某個 phase 變成 hybrid：

Phase 1：學會 k6 paradigm（不寫實際 test）

寫一個 throwaway script 跑當前 production-like API、不為了 migrate、為了搞清楚 k6 paradigm：

import http from 'k6/http';
import { check } from 'k6';

export const options = {
  // Use an arrival rate instead of vus: 100
  scenarios: {
    open_model: {
      executor: 'constant-arrival-rate',
      rate: 100,           // 100 requests per second
      timeUnit: '1s',
      duration: '5m',
      preAllocatedVUs: 200, // VUs prepared up front
      maxVUs: 500,          // upper bound
    },
  },
  thresholds: {
    http_req_duration: ['p(95)<500'], // p95 < 500ms
    http_req_failed: ['rate<0.01'],   // failure rate < 1%
  },
};

export default function () {
  const res = http.get('https://api.example.com/orders');
  check(res, { 'status 200': (r) => r.status === 200 });
}

對比同一個 test 用 .jmx 寫的形狀、思考 為什麼 arrival rate 跟 thread group 測出來不一樣。這 phase 的目標是 paradigm internalization、不是產出 migration artifact。團隊每個寫 performance test 的人都要過這一關、不能跳。

完成標準：寫的人能講清楚「arrival rate 100 / 5 分鐘」跟「100 thread / 5 分鐘 ramp-up」的 production behavior 差異。

Phase 2：高價值 critical path 改 k6（GUI 留 JMeter）

選 最常跑 + 最重要 的 1-3 條 scenario 改寫 k6、不全部一次轉。典型候選：

Pre-release smoke test（核心 API 的 baseline check）
Nightly regression（per-commit performance gate）
Peak readiness rehearsal scenario（活動前 T-7 跑的 stress test）

GUI / QA 團隊維護的 .jmx 不動 — 那些通常是 multi-protocol（JDBC / JMS / FTP）、不在 k6 適合 scope。

工作主要塊：

.jmx thread group → k6 scenario executor 的 paradigm-correct 翻譯（不是欄位翻譯）
HTTP request 跟 assertion 翻譯（payload / header / cookies）
CSV data source（JMeter CSV Data Set Config）→ k6 SharedArray from JSON
結果輸出 schema 改變（XML / JTL → JSON / Prometheus / k6 Cloud）
CI integration 重做（GitHub Actions / GitLab CI 直接 k6 run、不需要 packaging）

完成標準：critical path 的 k6 baseline 跟 .jmx baseline 數據對比一致（p50 / p95 / throughput 在 10% 誤差內、行為不一致時知道是 paradigm 差還是 bug）。
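The 10% criterion can be checked mechanically once both tools have produced a baseline. A minimal sketch follows, assuming each baseline has already been reduced to three numbers; the object shape and function name are hypothetical, and extracting those numbers from JTL or k6 output is out of scope here.

```javascript
// Compare k6 results against the JMeter baseline, metric by metric.
function compareBaselines(jmeterBaseline, k6Baseline, tolerance = 0.1) {
  const report = { metrics: {}, pass: true };
  for (const metric of ["p50", "p95", "throughput"]) {
    const reference = jmeterBaseline[metric];
    const relativeDiff = Math.abs(k6Baseline[metric] - reference) / reference;
    const ok = relativeDiff <= tolerance;
    report.metrics[metric] = { relativeDiff, ok };
    if (!ok) report.pass = false;
  }
  return report;
}
```

A failing metric is a prompt for investigation, not an automatic bug: the next question is whether the gap comes from pacing differences between the two models or from a translation mistake.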

Phase 3：QA 團隊雙工具技能（hybrid 穩定形態）

很多 org 停在這個 phase：QA 團隊用 GUI 維護 multi-protocol .jmx（covering JDBC / JMS / LDAP / SOAP / FTP）、dev / SRE 用 k6 維護 HTTP / gRPC / WebSocket performance test in CI。Two-tool stack 不是 broken state、是 not-converged-by-design。

這個 phase 的工作主要塊：

文件化：哪類 test 用 k6、哪類用 JMeter、決策樹寫在 team handbook
結果整合：兩個工具的 metrics 都進同一個 Grafana dashboard（k6 → Prometheus 直接、JMeter → InfluxDB / Prometheus exporter）
Release gate 用 k6 為主（CI 整合直接）、JMeter 用於 manual QA campaign / multi-protocol 場景

多數 org 不進 Phase 4。

Phase 4：JMeter 退役（少見）

只有當 所有 protocol 都換到 k6 extension 或 捨棄了 multi-protocol coverage 時、才 fully 退役 JMeter。常見路徑：

用 k6 xk6 extensions 補 protocol（xk6-sql for JDBC、xk6-kafka for Kafka、xk6-amqp for RabbitMQ、xk6-mqtt for MQTT）
評估每個 extension 的 maturity / community support — xk6 ecosystem 比 JMeter plugin 小很多
接受 part of legacy .jmx test 直接 deprecate（covered by integration test 而非 load test）

完成標準：所有 protocol 都在 k6 + xk6 內可表達、.jmx 全部 archive。

5 個 production 踩雷

1. Thread group → VU 直接翻譯（最常見、Phase 2 必踩）

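Translating numThreads=100 to vus: 100 and stopping there produces an RPS that does not match JMeter and a p95 with a completely different shape. The reason: both are closed models (each worker waits for the response before sending the next request), but the pacing differs. A .jmx usually carries timers (think time) between samplers, while a k6 default function without sleep() starts its next iteration immediately. Per-worker throughput is 1 / (response time + think time), so dropping the think time changes the load the server actually sees.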

修法：

不用 vus: N、用 constant-arrival-rate 或 ramping-arrival-rate、直接表達 每秒幾個請求
如果一定要 closed model（pre-existing JMeter scenario 對比）、在 default function 內加 sleep(thinkTime) 模擬 JMeter Think Time

2. Arrival rate vs concurrent VU 混淆

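In an arrival-rate executor, rate: 100 means 100 iterations start per second, and preAllocatedVUs: 200 is the size of the VU worker pool prepared up front. The number of VUs needed is roughly rate × response time (Little's law): if the service slows down and response time drifts from 100ms to 500ms, the need jumps from 100/sec * 0.1s = 10 to 100/sec * 0.5s = 50. When the need exceeds preAllocatedVUs, k6 allocates more VUs up to maxVUs; past that limit it warns about insufficient VUs, drops iterations, and the actual arrival rate falls below spec.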

修法：

preAllocatedVUs 設為 maxVUs / 2
maxVUs 設為 rate * worst_case_response_time_seconds * 5（5x safety margin）
Monitor dropped_iterations metric — 不該 > 0、> 0 表示 worker pool 不夠
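The sizing rule in the list above can be turned into a small helper that emits the scenario block. This is a sketch of that rule only (5x safety margin, half the pool pre-allocated); the helper name is hypothetical, and the returned field names are the ones used in the Phase 1 script.

```javascript
// Size the VU pool for a constant-arrival-rate scenario.
// maxVUs = rate * worst-case response time (seconds) * 5; preAllocatedVUs = maxVUs / 2.
function sizeArrivalRateScenario(ratePerSec, worstCaseResponseMs, duration = "5m") {
  const maxVUs = Math.ceil((ratePerSec * worstCaseResponseMs * 5) / 1000);
  const preAllocatedVUs = Math.ceil(maxVUs / 2);
  return {
    executor: "constant-arrival-rate",
    rate: ratePerSec,
    timeUnit: "1s",
    duration,
    preAllocatedVUs,
    maxVUs,
  };
}

// 100 requests per second with a worst-case response time of 500 ms:
console.log(sizeArrivalRateScenario(100, 500));
```

For 100 requests per second and a 500 ms worst case, the helper yields maxVUs of 250 and preAllocatedVUs of 125. The worst-case response time is the input that matters most, so it should come from observed tail latency rather than from the average.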

3. Protocol gap（k6 沒原生對應 JMeter 的部分）

k6 natively supports HTTP/1.1, HTTP/2, gRPC, and WebSocket; SSE requires the xk6-sse extension (see the protocol gap table). There is no native support for:

JDBC（要 xk6-sql extension）
JMS（要 xk6-amqp / xk6-kafka extension）
LDAP（無 extension、要外接 LDAP client）
FTP（無 extension）
SMTP / IMAP / POP3（無 extension）
SOAP（HTTP module 內手寫 XML body、無 helper）

如果 .jmx 用了這些 protocol、評估 xk6 extension 成熟度（GitHub stars、recent commit、issue volume）、不成熟就把這些 test 留在 JMeter。

4. 結果輸出 schema 改變（result post-processing 全部要重寫）

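JMeter writes per-sample results to a JTL file (CSV by default, XML optional) and post-processes them with listeners. k6 prints a summary to stdout by default, with optional JSON / CSV / Prometheus / k6 Cloud outputs. If a result analysis pipeline already exists (pulling data from JTL into a BI tool, producing trend charts), it has to be rewritten in Phase 2.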

修法：

評估直接接 Prometheus + Grafana（k6 native）取代既有 BI dashboard
或寫 k6 JSON output → 自家 BI 的 transformation script

5. CI integration 重做（distributed runner 模型不同）

JMeter 在 CI 跑要：JVM provision、plugin install、.jmx upload、non-GUI mode 跑、JTL 結果 parse、exit code 對應 threshold。k6 在 CI 跑：k6 run script.js、threshold pass / fail 直接 exit code、result 進 Prometheus / k6 Cloud。

看起來 k6 簡單、但有踩雷：

Distributed run model 不同：JMeter 用 master-slave、k6 OSS 不內建 distributed、要 Grafana Cloud k6 或自建 k6-operator on Kubernetes
大規模負載（> 50K VU）必須 distributed、Phase 2 評估時要先確認 distributed setup 不是 blocker
CI runner 資源：k6 是 native binary、CPU / memory 用量比 JMeter（JVM）低、但 runner spec 要按 max VU 估

Protocol gap 詳表

Protocol	JMeter sampler	k6 對應	成熟度 / 替代方案
HTTP/1.1	HTTP Request	`k6/http`	原生、成熟
HTTP/2	HTTP/2 sampler	`k6/http`（auto）	原生、成熟
gRPC	（無原生、要 plugin）	`k6/net/grpc`	原生、成熟
WebSocket	WebSocket sampler（plugin）	`k6/ws`	原生、成熟
SSE	（無原生）	xk6-sse	extension、中等
JDBC	JDBC Request	xk6-sql	extension、不成熟、留 JMeter
JMS	JMS sampler	xk6-amqp / xk6-kafka	extension、protocol-specific
LDAP	LDAP Request	（無）	外接 / 留 JMeter
FTP	FTP Request	（無）	留 JMeter
SMTP / IMAP	Mail sampler	（無）	留 JMeter
SOAP / XML-RPC	SOAP / XML-RPC Request	`k6/http` 手寫 XML body	工作量大、留 JMeter
TCP socket	TCP sampler	(none in k6 core; community xk6 extension)	extension; keep complex protocols on JMeter

容量與成本對照

項目	JMeter	k6 OSS	Grafana Cloud k6
Cost	Free (Apache)	Free (Apache 2.0)	$49+ / mo (Pro)
單機 VU 容量	~500-1000 thread	30K-50K VU	unlimited（cloud runner）
Distributed	master-slave 內建	不內建、需 k6-operator	cloud-native
Result store	JTL XML（local）	stdout / JSON / Prom	cloud retained
CI integration	需 packaging	native CLI	native + cloud
Multi-protocol coverage	廣	窄（HTTP/gRPC/WS）+ xk6	同 OSS

對 dev-driven CI gate use case：k6 OSS 已經夠用、Grafana Cloud k6 在 跨 region runner + result retention + dashboard 整合 時才有 ROI。對既有 multi-protocol .jmx asset：考慮 Phase 3 hybrid stable state、不要強推 Phase 4。

何時不要切

multi-protocol coverage 是核心需求：JDBC + JMS + LDAP + FTP 必要、xk6 extension 不夠成熟、留 JMeter
QA 團隊維護 GUI .jmx：QA 不寫 code、.jmx GUI 是團隊資產、貿然轉 k6 等於 throwaway QA team
既有 multi-year .jmx asset 大量：500+ scenario 全部翻譯成本 > k6 ergonomic 收益、考慮 Phase 3 stable hybrid
Distributed run 需求極大（> 100K VU）但 ops budget 緊：k6-operator on Kubernetes 不便宜、Grafana Cloud k6 對應 tier 也不便宜、JMeter master-slave 仍是 cost-effective 選項

下一步路由

平行 batch：Pyroscope → Datadog Profiler（Type C operational hybrid）
同 batch Type E：PagerDuty → incident.io（IR paradigm shift）
上游：9.3 壓測工具選型 / 9.2 Workload Modeling
下游：6.13 Performance Regression Gate（CI gate integration）
vendor 對照：JMeter / k6 / Gatling / Locust
方法論：Migration Playbook Methodology（Type E paradigm shift 結構說明）

PagerDuty → incident.io：「On-call」是個 retconned word、同名不同 contract

Tue, 19 May 2026 00:00:00 +0000

「On-call」是個被 retconned 的詞。PagerDuty 用了十年定義它為 alert routing + schedule + escalation — 重點是「誰會被叫醒」。incident.io 2024 年推出 On-call 模組時保留了同一個詞、但 contract 變了：On-call 在 incident.io 是 IR coordination + Slack-native workflow + retrospective integration 的 paging 入口 — 重點是「被叫醒之後做什麼」。

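This retcon is what this migration playbook has to make clear first. A reader opening a comparison table sees "PagerDuty has schedules, incident.io has schedules; PagerDuty has escalation policies, incident.io has escalation policies" and assumes this is a schema translation piece. In practice schema translation is only one block of the work; the harder part is that the org's incident behavior changes from "wait for PagerDuty to page" to "run the lifecycle inside a Slack channel".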

為什麼是 Type E（不是 Type A）

跑 6 維 diff dimension audit：

維度	評	說明
Schema	High	service / escalation policy / schedule / integration 跟 incident / role / action / catalog 沒 1:1 對應
Operational	High	alert routing → Slack-native IR coordination + retrospective workflow
Paradigm	High	「alert someone」 → 「coordinate full incident lifecycle from declare to retro」
Components	Medium	incident.io 整合 Slack / Linear / Jira / Confluence 變 multi-component
App change	Medium	webhook / integration key / IaC 都要改
Topology	Low	都是 cloud SaaS、無 sharding / region 議題

三軸 High（schema / operational / paradigm）。按優先序 schema > paradigm > operational、預設會選 Type A。但這條優先序是 audience-dependent heuristic — 對「我要把 PagerDuty config 翻譯成 incident.io」的讀者選 Type A、對「我要把事故管理 paradigm 從 paging-first 變成 Slack-first」的讀者選 Type E。

決定因素是 讀者最關心什麼。從 PagerDuty 出發評估 incident.io 的 org 通常 已經有 Slack channel 跑 IR 的痛感（雙系統 state drift / context switching cost / Slack bot 補 PagerDuty 的能力斷裂）、進來找的是 paradigm 統一、不是欄位翻譯。schema translation 是工作量、但不是讀者來找答案的問題。所以選 Type E paradigm shift 結構、schema translation 抽出獨立段補充。

為什麼遷：IM-native coordination 的拉力

事故反應在已經 Slack 中心的 org 是 從 Slack 自然發生 的 — 觀測 alert 進 Slack、SRE 開 thread、PM 跳進來問影響、customer-facing team 在 incident channel 看通報、所有上下文都在 IM 內。PagerDuty 在這個 reality 下變成 第二個 system of record：incident 開在 PagerDuty 也開在 Slack、PagerDuty timeline 跟 Slack scroll 是兩條時間線、status update 要 mirror 兩次、責任分派在 Slack 講但要在 PagerDuty 點。

PagerDuty 注意到這個問題、後加了 Status Updates / Slack integration / Postmortem 模組想把 Slack 拉回 PagerDuty。但結構性還是 PagerDuty 是主、Slack 是 mirror — incident object 的 source of truth 在 PagerDuty、Slack 的訊息只是 attachment。對 Slack-first 的 org 來說這個 ownership 反了：Slack channel 才是事故進行中的 ground truth、PagerDuty incident 應該是 paging 入口的 artifact。

incident.io 設計上把這個關係翻過來：Slack channel 是 IR ground truth、incident object 是 channel 的 metadata 投影。declare incident 在 Slack、role 指派在 Slack bot prompt、status update 在 channel reply、retrospective 從 channel 訊息自動 stitch — incident.io dashboard 是 管理視圖、不是事故 進行視圖。On-call 模組加進來後、連 paging 入口也跟 IR coordination 收斂到同一個 system of record。

這個 pull 是這條 migration 的 driver。schema 翻譯只是把這條 pull 落地的工作。

4-phase partial migration（不收斂）

Type E paradigm shift 的特徵是 不收斂 — 多數 org 不會把 PagerDuty 全退役、會停在某個 phase 變成穩定的 hybrid。下面 4 phase 是 常見演進路徑、不是 必要完成步驟：

Phase 1：Slack-first response（paging 留 PagerDuty）

incident.io 接 PagerDuty incident webhook、PagerDuty 開 incident → incident.io 自動開 Slack channel、跑 response lifecycle（declare / role / status / close / retro）。PagerDuty 仍管 paging schedule + escalation、incident.io 管 response coordination。

這個 phase 的工作主要塊是：

Wire webhooks in both directions between incident.io and PagerDuty (PD incident.trigger → IO opens a channel, IO incident.resolved → PD resolves the incident)
Slack workspace 整合（permissions、channel naming、stakeholder broadcast channel）
Severity 對應表（PagerDuty P1-P5 對 incident.io SEV1-SEV4、語意 reconcile）
跑 2-4 週 dual ops、訓練 SRE 在 Slack 內跑 lifecycle、不要回 PagerDuty 點 timeline

完成標準：incident commander 不再需要進 PagerDuty UI、status update / role 指派 / action item 都在 Slack。

Phase 2：Catalog + service ownership migrate

把 PagerDuty 的 service registry（service / team / escalation policy 關聯）抽出進 incident.io 的 Catalog。Catalog 是 incident.io 的 service metadata source of truth、把 service 跟 team / Slack channel / Linear project / runbook URL 綁在一起、incident 發生時自動推薦 role 跟通知 stakeholder。

工作主要塊：

從 PagerDuty API export service / team / escalation policy（REST endpoint /services、/teams、/escalation_policies）
Schema mapping：PagerDuty service → incident.io catalog entry、escalation policy → 暫時不動（留在 PagerDuty）
補 PagerDuty 沒有的欄位：Slack channel、Linear project、runbook URL、tier（catalog 比 PagerDuty service 多 metadata 維度）
Service ownership reconcile（PagerDuty 的 team grant 通常跟 GitHub team / IAM group 不一致、Catalog 是重新對齊機會）

完成標準：incident 發生時自動知道 owner team 跟對應 Slack channel、不需要人查。

Phase 3：Schedule + escalation 移到 incident.io On-call

PagerDuty 的 schedule + escalation policy 改進 incident.io On-call。這是 paging 入口的 ownership 轉移 — Phase 1 是 PD 觸發 IO response、Phase 3 是 IO 直接收 alert source 觸發 paging。

工作主要塊：

Alert source 改線：Splunk / Datadog / Cloudflare WAF / cloud control plane 的 webhook 從 PagerDuty Event API 改成 incident.io webhook endpoint、deduplication key / severity mapping 重做
Schedule 重建：PagerDuty schedule layer model（多 layer 疊加 + restriction + override）跟 incident.io schedule rule（單純 weekly rotation + override）不是 1:1、複雜 schedule 要重新設計
Escalation policy 重建：PagerDuty 的 multi-step escalation + level-based timeout 對應 incident.io 的 escalation path、policy 比 PagerDuty 簡單但要重新測 failover 行為
Mobile app 切換：on-call 人員裝 incident.io app、PagerDuty app 保留作為 backup paging（Phase 4 才完全捨棄）

完成標準：日常 paging 全走 incident.io、PagerDuty 留作 fallback 或退役。

Phase 4：Retrospective + 完全退役 PagerDuty

把 retrospective workflow 切到 incident.io 內建的 post-incident flow、捨棄 PagerDuty Postmortems / Jeli 整合。incident.io 的 retro template 從 Slack channel 訊息自動 stitch timeline、action item 推 Linear / Jira、learning review 結構化。

工作主要塊：

既有 Jeli / PagerDuty Postmortems 歷史 export（PagerDuty REST 不直接給 postmortem export、要從 Jeli web app 手動 export）
Retrospective template 對應到 org 既有的 post-incident review 結構
Action item lifecycle 整合（incident.io 推 Linear / Jira → close → retrospective 自動標 done）

多數 org 停在 Phase 2 或 Phase 3。完整 Phase 4 退役 PagerDuty 不是必要、且常見的選擇是 PagerDuty 留作 backup paging route 或 特定 integration 持續用（見下一段 capability gap）。

5 個 production 踩雷

實際遷過程踩過的 5 個典型問題：

1. 雙系統 state drift（Phase 1 最常見）

PagerDuty incident.trigger → incident.io 開 channel、但 PagerDuty 上 incident 被自動 resolve（例如 monitoring tool 認為 issue cleared）後、incident.io 沒收到對應 webhook、Slack channel 還 active 顯示 in-progress。修法是雙向 webhook 都要接（PD resolved → IO 自動 close channel），但 webhook 失序的場景仍要有 nightly reconcile job 對比兩邊狀態。
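The nightly reconcile job amounts to a comparison of two status maps. The sketch below assumes both sides have already been exported into `Map` objects keyed by a shared incident id; the PagerDuty statuses follow its triggered / acknowledged / resolved lifecycle, while the incident.io status values (`active`, `closed`) are placeholders, since those are configurable per org.

```javascript
// Compare incident states from both systems and list every disagreement.
function findStateDrift(pdIncidents, ioIncidents) {
  const drift = [];
  for (const [id, pdStatus] of pdIncidents) {
    const ioStatus = ioIncidents.get(id);
    if (ioStatus === undefined) {
      drift.push({ id, issue: "missing in incident.io", pdStatus });
      continue;
    }
    // Only the finished / not-finished distinction is compared here.
    const pdDone = pdStatus === "resolved";
    const ioDone = ioStatus === "closed";
    if (pdDone !== ioDone) {
      drift.push({ id, issue: "status mismatch", pdStatus, ioStatus });
    }
  }
  for (const [id, ioStatus] of ioIncidents) {
    if (!pdIncidents.has(id)) {
      drift.push({ id, issue: "missing in PagerDuty", ioStatus });
    }
  }
  return drift;
}
```

Whether the job auto-closes a drifted channel or only reports it is a policy choice; reporting first is the safer default while the team is still learning the new lifecycle.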

2. Severity 翻譯失真

PagerDuty 的 P1-P5 跟 incident.io 的 SEV1-SEV4 不是 5:4 對應、是兩個獨立 schema。同一個事故在 PagerDuty 是 P2（高優先但非全面 outage）、進 incident.io 可能變 SEV2（部分服務影響）或 SEV1（依 incident.io custom severity 定義）。Phase 1 雙系統並行時 SRE 在 Slack 看到 SEV1 跑進 war room mode、PagerDuty 同 incident 是 P2 沒拉 stakeholder bridge — 同事故兩邊嚴重度不同步、回應節奏錯亂。修法是事先寫死 mapping table（PD P1 → IO SEV1、PD P2 → IO SEV2、不 case-by-case 判斷），並在 Phase 3 後讓 incident.io severity 變唯一 source of truth。
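A hard-coded mapping table leaves no room for case-by-case judgment. In the sketch below only the P1 → SEV1 and P2 → SEV2 rows come from the text above; the P3, P4, and P5 rows are assumptions to be replaced by whatever the org agrees on, and the function name is hypothetical.

```javascript
// Fixed severity mapping; P3-P5 rows are illustrative assumptions.
const SEVERITY_MAP = new Map([
  ["P1", "SEV1"],
  ["P2", "SEV2"],
  ["P3", "SEV3"], // assumption
  ["P4", "SEV4"], // assumption
  ["P5", "SEV4"], // assumption: five priorities collapse into four severities
]);

// Translate a PagerDuty priority; unknown input fails loudly instead of guessing.
function toIncidentIoSeverity(pdPriority) {
  const severity = SEVERITY_MAP.get(pdPriority);
  if (severity === undefined) {
    throw new Error(`no severity mapping for PagerDuty priority "${pdPriority}"`);
  }
  return severity;
}
```

Throwing on an unmapped priority is deliberate: a silent default would reintroduce exactly the per-incident guessing the table is meant to remove.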

3. Schedule layer 漏 holiday override / restriction layer

PagerDuty schedule 是 layer model — primary rotation（layer 1） + secondary rotation（layer 2） + holiday override（layer 3） + restriction（每層 time-of-day 限制）可以疊加。Export 出來只看 layer 1 通常會漏 holiday override 跟 restriction layer、incident.io schedule rule 是單一 rotation + override list、不 cover 多 layer 疊加。修法是 export 時用 PagerDuty API /schedules/{id} 的完整 layer + final_schedule 一起拉、用 incident.io schedule 的 override list 模擬 layer 疊加、複雜 schedule（例如 follow-the-sun + 4 region + holiday override）可能要拆成多個 incident.io schedule 用 escalation chain 串。

4. Slack channel 過載

incident.io 預設每個 incident 開一個 channel。Phase 1 啟用後 SRE 一週收 50+ channel notification、即使 P3 / P4 也開 channel、Slack sidebar 被淹沒。修法是 incident type 設計時把低 severity（SEV3 / SEV4）改成 don’t auto-create channel 或 use shared low-severity channel、只 SEV1 / SEV2 開獨立 channel。incident.io 有這個 configuration、但預設不開、要主動設定。

5. Retrospective 切換時歷史 learning 斷層

從 Jeli / PagerDuty Postmortems 切到 incident.io retro 後、過去 2 年 postmortem 留在原系統、search 跨不到、新 retro template 跟舊的結構不同、learning review 的 trend analysis 斷層。修法是 Phase 4 前先 export 既有 postmortem 為 markdown 進 GitHub Wiki / Confluence 集中保存、incident.io retro 自動 export 到同位置、retro search 不依賴 vendor lock-in。

Schema translation 主要工作量塊

雖然 Type E 結構不以 schema translation 為主、但 translation 工作量塊在 Phase 2-3 仍佔多數時間：

來源（PagerDuty）	目標（incident.io）	註
Service	Catalog entry	增加 Slack channel / Linear project metadata
Team	Catalog team	多對應 GitHub team / IAM group
Escalation policy	Escalation path	比 PD 簡單、複雜 escalation 要拆
Schedule（multi-layer）	Schedule + override list	不是 1:1、複雜 schedule 要拆多個
Integration（webhook）	Webhook endpoint	全部 alert source 要重 wire
Incident workflow	Incident type + role	重新設計、不直接翻譯
Event Orchestration rule	Workflows	incident.io workflows 比 EO 簡單、複雜 routing 要外接
AIOps / Process Automation	（無對應）	見 capability gap 段
Postmortem / Jeli	Post-incident flow	template 重寫、歷史保存獨立

Capability gap：PagerDuty 有但 incident.io 沒有

不是所有功能 incident.io 都有對應。Phase 3-4 推進前要先確認這些能力是否在用、是否願意捨棄或外接：

AIOps（intelligent grouping / noise reduction）：PagerDuty Enterprise tier 用 ML 自動 group alert、incident.io 沒對應、grouping 靠 alert source 端 deduplication key
Process Automation（runbook automation）：PagerDuty 收購 Rundeck、提供 automated remediation step、incident.io 沒對應、要外接 Tines / n8n / 自製 Lambda
Status Page 整合（PagerDuty 內建）：PagerDuty 提供 Status Page 模組、incident.io status page 是 separate product、定價跟 feature 不同
Multi-region / 強合規（FedRAMP / IL5）：PagerDuty 在金融 / 政府 / 高合規 deploy 成熟度高、incident.io SOC 2 + ISO 27001 但 FedRAMP 還在追

如果在用 AIOps + Process Automation 而且重要、不要做這個 migration、或保留 PagerDuty 作為 AIOps + Automation 後端、incident.io 處理 response coordination — Phase 1 永久 hybrid。

容量與成本對照

Item	PagerDuty	incident.io
Billing model	Per-user / month, tier-based (Pro / Business / Enterprise)	Per-user / month, On-call module billed separately
Hidden capacity limit	API rate limit (10K / minute)	Slack workspace seat limit (IR participants ≤ workspace users)
AIOps surcharge	Enterprise tier + AIOps add-on	Not applicable
Status page	Built in (Business tier+)	Separate product
Process Automation	Rundeck-based, separate pricing	Not applicable

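A real cost comparison requires an RFP. For a 50-person SRE org, rough figures are PD Business + AIOps at ~$30-40 / user / mo and incident.io Pro + On-call at ~$25-35 / user / mo. The cost gap is usually not the main reason to migrate (paradigm fit and the Slack-native workflow are).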
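Multiplying the rough per-user figures above through a 50-person org shows why cost rarely decides the question. The sketch uses only the approximate ranges quoted in the text, not vendor list prices; the function name is an assumption for this example.

```python
def monthly_cost_range(headcount: int, low_per_user: int, high_per_user: int):
    """Return the (low, high) monthly spend in dollars for a per-user price range."""
    return headcount * low_per_user, headcount * high_per_user


HEADCOUNT = 50

# Approximate per-user ranges quoted in the text; real numbers need an RFP.
pd_low, pd_high = monthly_cost_range(HEADCOUNT, 30, 40)
io_low, io_high = monthly_cost_range(HEADCOUNT, 25, 35)

# The two ranges overlap when the cheaper vendor's high end reaches
# the more expensive vendor's low end.
ranges_overlap = io_high >= pd_low

print(f"PagerDuty:   ${pd_low}-{pd_high} / month")    # $1500-2000 / month
print(f"incident.io: ${io_low}-{io_high} / month")    # $1250-1750 / month
print(f"Ranges overlap: {ranges_overlap}")            # True
```

The two ranges overlap between $1,500 and $1,750 per month, so depending on negotiated pricing either vendor could come out cheaper. The estimate cannot separate them, which is consistent with cost not being the main driver.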

When not to do this migration

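Slack is not the IR ground truth: in orgs where Discord / Teams is primary or the ticket system is the main channel, incident.io's Slack-first design cannot take hold
AIOps + Process Automation is a core capability: you use PD AIOps to group alerts automatically and Rundeck for automated remediation, and this chain matters; incident.io has no equivalent
Scale < 20 SRE / 50 eng: incident.io's catalog + opinionated workflow are designed for mid-size to large orgs; for small teams, PagerDuty Lite or Grafana OnCall is already enough
Strict compliance scenarios (FedRAMP / IL5 / financial SOC 1 Type II): PagerDuty's compliance maturity is high and incident.io is still catching up, so the compliance team will not sign off
No intent to change incident behavior: if the org only wants to switch vendors without adopting the working model where the incident lifecycle runs in Slack, this migration loses half its value; PagerDuty → Opsgenie (Type A schema translation, same paradigm) is the better route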

Next-step routing

Parallel batch: PagerDuty → Opsgenie (Type A, same-paradigm vendor swap) / Atlassian Statuspage → Instatus (Type B drop-in)
Same-batch Type E: JMeter → k6 (scripting paradigm shift)
Upstream: 8.10 Incident Workflow Automation Boundary (automation handoff)
Downstream: 8.18 Post-Incident Review (incident.io retrospective workflow)
Vendor comparison: PagerDuty / incident.io
Methodology: Migration Playbook Methodology (explains the Type E paradigm shift structure)

Paradigm-Shift on Tarragon

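Migrating from Firestore to self-hosted relational: wall-driven Type E remodeling, access-model inversion, and the parallel-run period

Migration drivers: three walls, not "relational is better"

6-dimension diff audit: the dominant dimensions are paradigm + application change

Why a literal migration does not hold: access-model inversion

What to migrate and what to keep for now (per-capability hybrid)

Phase plan: staging the access-model inversion

Phase 1: dependency surface inventory

Phase 2: relational remodeling

Phase 3: self-hosted backend + dual-write

Phase 4: backfill historical data

Phase 5: shadow read validation

Phase 6: gradual cutover + rebuilding the realtime layer

Evidence: the basis for advancing each phase

Cutover and rollback decisions

Cleanup and long-term hybrid

Failure modes

Case 1: only the data was moved, the access-model inversion was missed

Case 2: gaps in translating Security Rules

Case 3: errors in undoing denormalization

Case 4: underestimating the realtime / offline rebuild effort

Case 5: one side of the dual-write fails with no compensation

Capacity and cost: reading the crossover

Boundaries and integration

Relationship to other migration paths

Siblings and cross-links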

Docker Swarm → Kubernetes: data from 5 Swarm production clusters that hit the wall

Data from 5 Swarm production clusters that hit the wall

Why migrate: three drivers (ceiling / ecosystem / multi-region)

6-dimension audit

Paradigm mapping

Schema gap: docker-compose vs K8s YAML

Migration process

Partial migration + hybrid architecture

Production failure drills

Case 1: networking model differs, cross-service connectivity breaks

Case 2: secret rotation moves from Swarm secrets to Vault / Secrets Manager

Case 3: readiness probe not set, traffic loss during rolling updates

Case 4: HPA is not enabled by default, autoscaling does not work

Case 5: YAML maintenance hell, Helm / Kustomize adopted late

Capacity / cost

Integration / next steps

Integration with service mesh

Integration with GitOps

Alignment with Vault → AWS Secrets Manager

Related links

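Sentry → Honeycomb: a trace is not an error, it is a different observability paradigm

A trace is not an error, it is a different paradigm

Why migrate: three drivers (observability maturity / cardinality / cost)

6-dimension audit

Structure: partial migration + hybrid architecture is the long-term default

Application redesign examples

Migration process

Production failure drills

Case 1: event schema mapping fails, SREs cannot use BubbleUp

Case 2: sampling behaves differently, production cost spikes

Case 3: error grouping breaks

Case 4: cost model differs, estimates are wrong

Case 5: alert paradigms are not equivalent

Capacity / cost

Integration / next steps

Integration with OpenTelemetry

Mapping to Datadog → Grafana Stack

Related links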

etcd → Consul: KV + N extras feature matrix

KV + N extras: feature matrix

Why migrate: 3 expansion drivers

Paradigm expansion route

API mapping

Application redesign

Migration process

Production failure drills

Case 1: KV API mapping looks 1:1, but the watch event model differs

Case 2: session-based locks differ from etcd leases

Case 3: multi-DC failover, KV written to the wrong DC

Case 4: ACL system is open by default, exposure after cutover

Case 5: cascading health check failures, service discovery breaks

Capacity / cost

Integration / next steps

Mapping to Kubernetes

5. No `SUPER` privilege: some operations are not feasible