This article is a cross-vendor migration playbook that cross-links etcd and Consul. Running the 6-dimension audit from migration-playbook-methodology gives Paradigm = High (pure KV → service mesh paradigm), which maps to a Type E paradigm shift. It is the dual of Redis → Memcached (paradigm reduction): this article covers the paradigm expansion (upgrade) direction.

KV + N extras: feature matrix

| Concept | etcd | Consul |
| --- | --- | --- |
| Core paradigm | Pure KV with Raft consensus | Service mesh (KV + 6 other features) |
| Data store | KV with versioned values + watch | KV + service catalog + health checks + sessions |
| API style | gRPC + HTTP/REST | HTTP/REST + gRPC (Connect) + DNS |
| Service discovery | None (managed by the application) | Built-in (DNS / HTTP API) |
| Health check | None | Built-in (HTTP / TCP / script / TTL) |
| Service mesh | None | Connect (mTLS + intentions + service-to-service) |
| Multi-DC | Not supported (per-cluster only) | Built-in WAN federation |
| ACL system | RBAC (etcd v3 auth) | Token-based ACL + namespaces (Enterprise) |
| Lock primitive | Lease + transaction | Session + KV check-and-set |
| Watch event model | Event stream (gRPC stream) | Long-polling blocking query (X-Consul-Index) |
| Distributed config | KV + watch | KV + watch + template rendering (consul-template) |
| Use case fit | K8s control plane / pure distributed KV | Service mesh + service discovery + config + KV |

The core difference is not that Consul has more features; it is that Consul follows a service mesh paradigm. Service discovery, health checks, and Connect mTLS are first-class, and KV is just one sub-feature.

6-dimension diff audit

| Dimension | Assessment | Level |
| --- | --- | --- |
| Schema / API | KV API maps across + N extra APIs | Medium |
| Operational model | Both Raft-based, similar ops | Low |
| Paradigm | Pure KV → service mesh | High |
| Components | Same single cluster | Low |
| Application change | KV API changes + new service registration / health checks | Medium |
| Data topology | Single DC → multi-DC (if federation is used) | Low-Medium |

Paradigm = High (the rest Low to Medium) → Type E paradigm shift. KV is a sub-feature, not the whole migration scope.

Why migrate: 3 expansion drivers

  • Service mesh adoption: etcd was originally there to run the K8s control plane; now the application side needs a service mesh (mTLS / intentions / traffic shifting), and Consul covers all of it in one stack
  • Multi-DC strategy: etcd has no cross-DC support and needs active-passive failover; Consul WAN federation supports active-active multi-DC
  • Configuration management: consul-template + envconsul is simpler than etcd watch plus a hand-written reloader

Reverse drivers (Consul → etcd):

  • Pure K8s control plane scenario: no need for service discovery / health checks / mesh, so etcd is simple and sufficient
  • Resource constraints: the Consul agent uses more resources than etcd and may not fit on low-end VMs

Paradigm expansion path

Where Redis → Memcached is a paradigm reduction (removing features), moving to Consul adds features:

    etcd KV pattern         → Consul KV API (1:1 mapping)
    etcd watch              → Consul blocking query / consul-template
    etcd lease + lock       → Consul session + KV CAS

    (added on top)
    none                    → Consul service registration (services.json / API)
    none                    → Consul health check (HTTP / TCP / TTL)
    none                    → Consul service discovery (DNS / HTTP)
    none                    → Consul Connect (mTLS + intentions)
    none                    → Consul WAN federation (multi-DC)
    none                    → Consul ACL token + policy

The migration is not just a KV API mapping; it extends what the application can do.

API mapping

    # etcd basic KV
    etcdctl put /myapp/config/db_url 'postgres://...'
    etcdctl get /myapp/config/db_url

    # Consul KV (equivalent; note the key has no leading slash)
    consul kv put myapp/config/db_url 'postgres://...'
    consul kv get myapp/config/db_url
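
The two CLI examples above differ in one small detail that is easy to miss during a bulk copy: the etcd key starts with a slash and the Consul key does not. A minimal sketch of a key translation helper follows; the function name `etcd_key_to_consul` is hypothetical and only illustrates the mapping shown above.

```python
def etcd_key_to_consul(etcd_key: str) -> str:
    """Map an etcd-style key to the Consul KV form used in this article.

    '/myapp/config/db_url' becomes 'myapp/config/db_url'.
    """
    consul_key = etcd_key.lstrip("/")
    if not consul_key:
        # A key made only of slashes has no usable Consul equivalent.
        raise ValueError(f"cannot map etcd key {etcd_key!r} to a Consul key")
    return consul_key


if __name__ == "__main__":
    print(etcd_key_to_consul("/myapp/config/db_url"))
```

Raising on an empty result, rather than returning an empty string, keeps a malformed source key from silently turning into a write against the KV root.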

    # etcd watch
    etcdctl watch --prefix /myapp/config/

    # Consul blocking query (long polling)
    curl 'http://consul:8500/v1/kv/myapp/config?recurse&index=5&wait=10s'
    # The X-Consul-Index response header is the watch cursor: pass it back as index= on the next request
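
The KV endpoint queried above returns a JSON array in which each `Value` is base64-encoded (and `null` for an empty value), which differs from what `etcdctl get` prints. A minimal sketch of decoding such a response body follows; the helper name `decode_kv_response` is hypothetical, and the sketch assumes values are UTF-8 text.

```python
import base64
import json


def decode_kv_response(body: str) -> dict:
    """Decode a Consul /v1/kv JSON body into {key: (value, modify_index)}."""
    decoded = {}
    for entry in json.loads(body):
        raw = entry.get("Value")
        # Consul returns null for keys that were written with an empty value.
        value = base64.b64decode(raw).decode("utf-8") if raw is not None else ""
        decoded[entry["Key"]] = (value, entry["ModifyIndex"])
    return decoded


if __name__ == "__main__":
    sample = json.dumps([{
        "Key": "myapp/config/db_url",
        "Value": base64.b64encode(b"postgres://db").decode("ascii"),
        "ModifyIndex": 7,
    }])
    print(decode_kv_response(sample))
```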

    # etcd transaction (multi-key atomic)
    # Non-interactive input: compares, blank line, success requests, blank line, failure requests, blank line
    etcdctl txn <<'EOF'
    mod("/myapp/lock") = "0"

    put /myapp/lock "owner1"


    EOF

    # Consul session + KV acquire (equivalent)
    SESSION_ID=$(curl -s -X PUT 'http://consul:8500/v1/session/create' | jq -r .ID)
    curl -s -X PUT "http://consul:8500/v1/kv/myapp/lock?acquire=${SESSION_ID}" -d 'owner1'
    # The response body is true on success; false means someone else already holds the lock

Application redesign

    # Before: etcd (python-etcd3)
    import etcd3
    etcd = etcd3.client(host='etcd', port=2379)
    etcd.put('/myapp/config/db_url', 'postgres://...')
    value, _meta = etcd.get('/myapp/config/db_url')  # value is bytes, or None if the key is missing
    db_url = value.decode()

    # After: Consul (KV-only, python-consul)
    import consul
    c = consul.Consul(host='consul', port=8500)
    c.kv.put('myapp/config/db_url', 'postgres://...')
    _, kv = c.kv.get('myapp/config/db_url')          # kv is None if the key is missing
    db_url = kv['Value'].decode()                    # Value comes back as bytes

    # (added on top) After: Consul service discovery
    c.agent.service.register(
        name='myapp',
        service_id='myapp-1',
        address='10.0.0.10',
        port=8080,
        # Check every 10s with a 5s timeout; deregister after 1m in critical state (Consul's minimum)
        check=consul.Check.http('http://10.0.0.10:8080/health',
                                interval='10s', timeout='5s', deregister='1m'),
    )

    # DNS-based discovery (how other services find myapp)
    # dig +short myapp.service.consul SRV

Migration flow

    1. Pre-migration audit
       - List every application that uses etcd
       - Assess whether each application actually *needs* Consul extras (service discovery / health / mesh)
       - Mark pure KV use cases as *low-effort migration* and those that benefit from extras as *value-add migration*

    2. Consul cluster build
       - Cross-DC design (WAN federation planning)
       - ACL system configuration (do not leave it default open)
       - Performance sizing (the Consul agent is heavier than etcd)

    3. Application migration (per app)
       - Pure KV: swap the SDK, map the API, cut over
       - Service discovery: add registration + health check + DNS lookup
       - Service mesh: add Connect proxy + intentions

    4. Dual-run period
       - etcd keeps running while applications move to Consul gradually
       - Verify after each application's cutover

    5. etcd decommission
       - Confirm every application has been cut over
       - Keep etcd for the K8s control plane (if that is its only remaining user); do not migrate it

Overall this takes 2-4 months, depending on the number of applications and how many extras are adopted.
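
The per-application verification during the dual-run period can be sketched as a comparison of two KV snapshots. This is only an illustration under stated assumptions: both snapshots are already loaded into plain dicts, etcd keys carry a leading slash while Consul keys do not, and the function name `diff_kv_snapshots` is hypothetical.

```python
def diff_kv_snapshots(etcd_items: dict, consul_items: dict) -> dict:
    """Compare an etcd snapshot against a Consul snapshot after cutover.

    Returns the keys that are missing in Consul, present with a different
    value, or present only in Consul.
    """
    # Normalize etcd keys to the Consul form (no leading slash).
    expected = {key.lstrip("/"): value for key, value in etcd_items.items()}
    missing = sorted(k for k in expected if k not in consul_items)
    mismatched = sorted(
        k for k in expected
        if k in consul_items and consul_items[k] != expected[k]
    )
    extra = sorted(k for k in consul_items if k not in expected)
    return {"missing": missing, "mismatched": mismatched, "extra": extra}


if __name__ == "__main__":
    etcd_snapshot = {"/myapp/config/db_url": "postgres://db", "/myapp/config/mode": "a"}
    consul_snapshot = {"myapp/config/db_url": "postgres://db", "myapp/config/mode": "b"}
    print(diff_kv_snapshots(etcd_snapshot, consul_snapshot))
```

An empty result in all three lists is a reasonable gate before marking an application as cut over; keys under `extra` are expected once the application starts writing new data to Consul only.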

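Production failure drills

Case 1: KV API mapping looks 1:1, but the watch event model differs

Symptom: after the application switched from etcd watch to Consul blocking queries, event handling latency rose from 50ms to 1-5s. The application assumed events were pushed in real time, but it was effectively polling.

Root cause: etcd watch is a gRPC stream that pushes events immediately. A Consul blocking query is long polling with a wait timeout: a change is delivered promptly only if it arrives while a query is outstanding, so any gap between one query returning and the next being issued adds latency.

Fix

  1. Align the wait timeout with business needs (default 5min; it can be set to 10s)
  2. Concurrent polling from multiple instances: each of the N application instances polls on its own, which lowers single-point event latency
  3. Architecture: send critical events through the Consul event API (PUT /v1/event/fire/<name>) plus a blocking query on the event endpoint, separate from KV changes
  4. Keep etcd for critical watches: mission-critical watches stay on etcd and are not migrated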
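
A blocking-query loop that avoids the added latency re-issues the query immediately after each return and feeds the last X-Consul-Index back as the cursor. The sketch below is a minimal illustration, not a production client: `http_fetcher` and `watch_changes` are hypothetical names, the Consul address is whatever the caller passes in, and the 404 branch assumes Consul still returns the index header when no key exists.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Iterator, List, Optional, Tuple

# A fetcher runs one blocking query and returns (X-Consul-Index, entries).
Fetch = Callable[[int, str], Tuple[int, List[dict]]]


def http_fetcher(base_url: str, prefix: str, timeout_s: float = 15.0) -> Fetch:
    """Build a fetcher for GET /v1/kv/<prefix>?recurse (hypothetical helper)."""

    def fetch(index: int, wait: str) -> Tuple[int, List[dict]]:
        url = f"{base_url}/v1/kv/{prefix}?recurse&index={index}&wait={wait}"
        try:
            # timeout_s should be longer than wait, or the socket gives up first.
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                return int(resp.headers["X-Consul-Index"]), json.loads(resp.read())
        except urllib.error.HTTPError as err:
            if err.code == 404:  # no keys under the prefix yet
                fallback = max(index, 1)
                return int(err.headers.get("X-Consul-Index") or fallback), []
            raise

    return fetch


def watch_changes(fetch: Fetch, wait: str = "10s",
                  max_rounds: Optional[int] = None) -> Iterator[List[dict]]:
    """Yield the KV entries each time the index advances."""
    index = 0  # index=0 makes the first query return immediately
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        rounds += 1
        new_index, entries = fetch(index, wait)
        if new_index < 1:
            raise RuntimeError("unexpected X-Consul-Index below 1")
        if new_index < index:
            index = 0  # index went backwards: start over from a fresh read
            continue
        if new_index == index:
            continue  # wait expired without a change: re-issue right away
        index = new_index
        yield entries
```

Usage would look like `for entries in watch_changes(http_fetcher("http://consul:8500", "myapp/config")): ...`. Injecting the fetcher keeps the loop testable without a running Consul agent, and the first yielded item is the current state rather than a change.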

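Case 2: Session-based locks differ from etcd leases

Symptom: with etcd, a lease with a 5s TTL released the lock automatically within 5s of the holder losing contact. After switching to Consul sessions, the session TTL still applies, but the health check integration is complex and locks are occasionally not released.

Root cause: a Consul session has two behaviors: release (the default: when the session is invalidated the lock is released but the KV entry stays) and delete (the keys held by the session are deleted). Combining a TTL with health checks makes the behavior harder to reason about.

Fix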

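    # Set the session behavior explicitly (python-consul; c is the consul.Consul client)
    session_id = c.session.create(
        name='myapp-lock',
        ttl=15,            # 15s TTL
        behavior='delete'  # when the session expires, the lock key is deleted, which frees the lock
    )
    c.kv.put('myapp/lock', 'owner1', acquire=session_id)  # returns False if the lock is already held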

The session TTL must be between 10s and 86400s and cannot go below 10s (etcd allows shorter leases, such as the 5s TTL above), so Consul is not a fit for critical low-latency locks.
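
When lease TTLs are carried over from etcd configuration, the 10s to 86400s range is worth checking before the session is created. A minimal sketch follows; the function name `consul_session_ttl` is hypothetical, and the duration string format (`"15s"`) is what the Consul HTTP session API accepts.

```python
MIN_SESSION_TTL_S = 10
MAX_SESSION_TTL_S = 86400


def consul_session_ttl(etcd_lease_ttl_s: int) -> str:
    """Translate an etcd lease TTL in seconds into a Consul session TTL string."""
    if etcd_lease_ttl_s < MIN_SESSION_TTL_S:
        # Do not clamp silently: a 5s lease that becomes 10s changes failover timing.
        raise ValueError(
            f"TTL {etcd_lease_ttl_s}s is below the Consul minimum of {MIN_SESSION_TTL_S}s"
        )
    if etcd_lease_ttl_s > MAX_SESSION_TTL_S:
        raise ValueError(
            f"TTL {etcd_lease_ttl_s}s is above the Consul maximum of {MAX_SESSION_TTL_S}s"
        )
    return f"{etcd_lease_ttl_s}s"


if __name__ == "__main__":
    print(consul_session_ttl(15))
```

Failing loudly instead of rounding up to 10s forces an explicit decision for each lock: either accept the slower release or keep that lock on etcd.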

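Case 3: Multi-DC failover, KV written to the wrong DC

Symptom: after a cross-DC deployment, an application wrote a KV entry but could not read it back. It turned out the application hardcoded one DC's endpoint, so writes went to us-east while reads came from us-west.

Root cause: Consul WAN federation does not sync KV across DCs automatically. KV is per-DC, and cross-DC sync needs a Consul Enterprise license or a self-managed consul-replicate.

Fix

  1. Each application instance connects to Consul in its local DC: writes and reads stay in the same DC
  2. Cross-DC KV replication: self-manage consul-replicate, or upgrade to Enterprise
  3. Architecture: move config shared across DCs to a DB-backed config store (durable + cross-DC) and keep only DC-local config in Consul KV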
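
One way to keep reads and writes in the same DC is to make the datacenter explicit in every request instead of relying on whichever endpoint was hardcoded. The Consul HTTP API accepts a `dc` query parameter for this; the sketch below only builds the URL, and the helper name `kv_url` plus the agent address are assumptions for illustration.

```python
from urllib.parse import quote, urlencode


def kv_url(agent_url: str, key: str, dc: str) -> str:
    """Build a KV URL that pins the request to one datacenter."""
    if not dc:
        raise ValueError("datacenter must be set explicitly")
    # quote() leaves '/' intact, so nested keys keep their path form.
    return f"{agent_url.rstrip('/')}/v1/kv/{quote(key)}?{urlencode({'dc': dc})}"


if __name__ == "__main__":
    # The same helper is used for both the write and the read path.
    print(kv_url("http://127.0.0.1:8500", "myapp/config/db_url", "us-east"))
```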

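Case 4: ACL system open by default, exposure after cutover

Symptom: one month after the Consul cluster went live, the SOC ran an audit and found that any application could read any KV entry. ACLs had never been configured, so every token had full access.

Root cause: Consul ACLs are disabled by default and must be bootstrapped. Many setup tutorials skip ACLs for simplicity, and nobody added them after cutover.

Fix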

    # Bootstrap the ACL system (ACLs must already be enabled in the agent configuration)
    consul acl bootstrap
    # This prints the management token; store it as the root credential

    # Create a policy
    consul acl policy create -name 'myapp-readonly' \
      -rules 'key_prefix "myapp/" { policy = "read" }'

    # Create a token for the application
    consul acl token create -policy-name 'myapp-readonly'

In a production setup, bootstrap ACLs as the very first step; do not defer it.
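
With many applications to onboard, the per-application rule string passed to `consul acl policy create -rules` can be generated rather than typed by hand. A minimal sketch follows, assuming one key prefix per application as in the example above; the function name `key_prefix_rule` is hypothetical and only the `read`, `write`, and `deny` dispositions are covered.

```python
ALLOWED_POLICIES = {"read", "write", "deny"}


def key_prefix_rule(prefix: str, policy: str = "read") -> str:
    """Render one key_prefix rule in the form used by the CLI example."""
    if policy not in ALLOWED_POLICIES:
        raise ValueError(f"unsupported policy {policy!r}")
    if '"' in prefix:
        # Keep the generated rule well-formed; quoting is out of scope here.
        raise ValueError("prefix must not contain a double quote")
    return f'key_prefix "{prefix}" {{ policy = "{policy}" }}'


if __name__ == "__main__":
    for app in ("myapp", "billing"):
        print(key_prefix_rule(f"{app}/"))
```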

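Case 5: Health check failures cascade, service discovery breaks

Symptom: an application instance did not respond to health checks for 5 seconds during a GC pause, and Consul marked it failed. DNS queries stopped returning the instance and traffic moved away. After the GC finished the instance was healthy again, but Consul still showed it as failed and recovery took minutes.

Root cause: once a health check fails, it enters the critical state and needs N consecutive successes to return to passing. With default settings a single success is enough, but the actual recovery time depends on the check interval.

Fix

  1. Set success_before_passing low (1) for fast recovery
  2. Set failures_before_critical high (3-5) to tolerate transient failures
  3. Multi-check strategy: HTTP + TCP + script checks on three axes, rather than relying on a single check
  4. Application-side hint: for JVM applications, set MaxGCPauseMillis to keep GC pauses shorter than the health check interval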
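
The effect of the two thresholds can be explored with a small simulation. This is a simplified model, not Consul's implementation: it assumes a threshold of N means N consecutive results are needed to flip the status, and the function name `simulate_check` is hypothetical.

```python
from typing import Iterable, List


def simulate_check(results: Iterable[bool],
                   success_before_passing: int = 1,
                   failures_before_critical: int = 3,
                   initial: str = "passing") -> List[str]:
    """Return the check status after each probe result (True = probe succeeded)."""
    status = initial
    successes = 0
    failures = 0
    timeline = []
    for ok in results:
        if ok:
            successes += 1
            failures = 0
            if successes >= success_before_passing:
                status = "passing"
        else:
            failures += 1
            successes = 0
            if failures >= failures_before_critical:
                status = "critical"
        timeline.append(status)
    return timeline


if __name__ == "__main__":
    gc_pause = [True, False, False, True]  # two probes missed during a GC pause
    print(simulate_check(gc_pause, failures_before_critical=3))
    print(simulate_check(gc_pause, failures_before_critical=1))
```

With a threshold of 3 the two missed probes never reach critical, so the instance stays in DNS; with a threshold of 1 the first miss removes it until the next successful probe, one check interval later.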

Capacity / cost

| Dimension | etcd | Consul |
| --- | --- | --- |
| Cluster baseline | 3-5 node Raft cluster | 3-5 servers + N agents (one per host) |
| Memory per node | 2-8GB | 4-16GB (including agent) |
| Operational FTE | 0.2-0.5 | 0.5-1.0 (more features, more ops) |
| Feature surface | Pure KV | KV + service mesh + multi-DC + ACL |
| Setup complexity | Low | Medium-High |
| Multi-DC support | Not supported | Built-in WAN federation |
| License | Apache 2.0 (open) | MPL 2.0 up to 1.16, BUSL 1.1 from 1.17 (community) / commercial (Enterprise) |
| Migration cost | - | 1-3 FTE × 2-4 months |

Reading: pure KV use cases stay on etcd; strong needs for service mesh / multi-DC / discovery point to Consul; a mixed deployment is the long-term default (the K8s control plane stays on etcd, the service mesh runs on Consul).

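Integration / next steps

Relation to Kubernetes

The K8s control plane always uses etcd and does not move to Consul; Consul serves as the service mesh + cross-cluster discovery layer for K8s. The two coexist and are not mutually exclusive.

Vault integration

Consul and Vault belong to the same HashiCorp ecosystem: Consul handles service discovery / mesh and Vault handles secrets. Consul ACL tokens can be obtained from Vault's dynamic secrets engine.

Comparison with Istio / Linkerd

Consul Connect follows the service mesh paradigm, alongside Istio / Linkerd. Most K8s-native organizations use Istio / Linkerd, while Consul's strength is a mesh that spans K8s + VMs + multiple DCs.

Reverse migration (Consul → etcd)

A few organizations do this when simplifying their stack. The flow is a mirror image, but dropping service mesh / multi-DC is a deliberate downgrade and should not be presented as functionally equivalent.

Next topics

  • Consul Connect production rollout: mesh adoption is incremental, with per-service intentions introduced gradually
  • Multi-DC topology design: active-active vs active-passive, a trade-off between RPO/RTO and cost
  • Integration with the Kubernetes Gateway API: integration strategy for the service mesh paradigm inside vs outside K8s

Related links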