etcd → Consul: KV + N extras feature matrix
This is a cross-vendor migration playbook that cross-links etcd and Consul. Running the six-dimension audit from the migration-playbook methodology gives Paradigm = High (pure KV → service mesh paradigm), which maps to a Type E paradigm shift. It is the dual of Redis → Memcached (paradigm reduction): this playbook covers the paradigm expansion (upgrade) direction.
KV + N extras: feature matrix
| Concept | etcd | Consul |
|---|---|---|
| Core paradigm | Pure KV with Raft consensus | Service mesh (KV + 6 other features) |
| Data store | KV with versioned values + watch | KV + service catalog + health checks + sessions |
| API style | gRPC + HTTP/REST | HTTP/REST + gRPC (Connect) + DNS |
| Service discovery | None (application-managed) | Built-in (DNS / HTTP API) |
| Health check | None | Built-in (HTTP / TCP / script / TTL) |
| Service mesh | None | Connect (mTLS + intentions + service-to-service) |
| Multi-DC | Not supported (per-cluster only) | Built-in WAN federation |
| ACL system | RBAC (users and roles, v3 auth) | Token-based ACL + namespaces (Enterprise) |
| Lock primitive | Lease + transaction | Session + KV check-and-set |
| Watch event model | Event stream (gRPC stream) | Long-polling blocking query (X-Consul-Index) |
| Distributed config | KV + watch | KV + watch + template rendering (consul-template) |
| Use case fit | K8s control plane / pure distributed KV | Service mesh + service discovery + config + KV |
The core difference is not that Consul has more features; it is that Consul follows a service mesh paradigm. Service discovery, health checks, and Connect mTLS are first-class, and KV is only one sub-feature among them.
| Dimension | Assessment | Level |
|---|---|---|
| Schema / API | KV API maps across, plus N extra APIs | Medium |
| Operational model | Both are Raft-based; ops are similar | Low |
| Paradigm | Pure KV → service mesh | High |
| Components | Still one cluster | Low |
| Application change | KV API changes + new service registration / health checks | Medium |
| Data topology | Single DC → multi-DC (if federation is used) | Low-Medium |
Paradigm = High (everything else Low-Medium) → Type E paradigm shift. KV is a sub-feature, not the whole migration scope.
Why migrate: 3 expansion drivers
- Service mesh adoption: etcd was originally there to run the K8s control plane; applications now need a service mesh (mTLS / intentions / traffic shifting), and Consul covers that in one place
- Multi-DC strategy: etcd does not support cross-DC clusters, which forces active-passive failover; Consul WAN federation supports active-active multi-DC
- Configuration management: consul-template + envconsul is simpler than etcd watch + a hand-written reloader
Reverse drivers (Consul → etcd):
- Pure K8s control plane scenario: no service discovery / health check / mesh is needed, and etcd is simple and sufficient
- Resource constraint: the Consul agent uses more resources than etcd and may not fit on low-end VMs
Paradigm expansion path
Where the Redis → Memcached paradigm reduction removes features, Consul adds them:
    etcd KV pattern   → Consul KV API (1:1 mapping)
    etcd watch        → Consul blocking query / consul-template
    etcd lease + lock → Consul session + KV CAS

    (added on top)
    none → Consul service registration (services.json / API)
    none → Consul health check (HTTP / TCP / TTL)
    none → Consul service discovery (DNS / HTTP)
    none → Consul Connect (mTLS + intentions)
    none → Consul WAN federation (multi-DC)
    none → Consul ACL token + policy

Migration is not just a KV API mapping; it adds capabilities to the application.
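The first row of the mapping is close to 1:1, but the examples in this playbook write etcd keys with a leading slash (`/myapp/config/db_url`) and Consul keys without one (`myapp/config/db_url`). A minimal sketch of that key translation, assuming the helper names `etcd_key_to_consul_key` and `consul_kv_url` (both hypothetical, not part of any SDK):

```python
from urllib.parse import quote


def etcd_key_to_consul_key(etcd_key: str) -> str:
    """Map an etcd-style key ('/myapp/config/db_url') to a Consul KV key."""
    key = etcd_key.lstrip("/")
    if not key:
        raise ValueError(f"etcd key {etcd_key!r} has no Consul equivalent")
    return key


def consul_kv_url(base_url: str, etcd_key: str) -> str:
    """Build the Consul HTTP KV endpoint for a key that used to live in etcd."""
    consul_key = quote(etcd_key_to_consul_key(etcd_key))  # '/' is kept as-is
    return f"{base_url.rstrip('/')}/v1/kv/{consul_key}"
```

Running every migrated key through one function like this keeps the convention in a single place instead of scattering string edits across call sites.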
API mapping
    # etcd basic KV
    etcdctl put /myapp/config/db_url 'postgres://...'
    etcdctl get /myapp/config/db_url

    # Consul KV (equivalent)
    consul kv put myapp/config/db_url 'postgres://...'
    consul kv get myapp/config/db_url

    # etcd watch
    etcdctl watch --prefix /myapp/config/

    # Consul blocking query (long polling)
    curl 'http://consul:8500/v1/kv/myapp/config?recurse&index=5&wait=10s'
    # The X-Consul-Index response header is the watch cursor

    # etcd transaction (multi-key atomic); stdin is compares, success ops,
    # failure ops, each section terminated by a blank line
    etcdctl txn <<'EOF'
    mod("/myapp/lock") = "0"

    put /myapp/lock "owner1"

    get /myapp/lock

    EOF

    # Consul session + KV acquire (equivalent)
    SESSION_ID=$(curl -s -X PUT 'http://consul:8500/v1/session/create' | jq -r .ID)
    curl -s -X PUT "http://consul:8500/v1/kv/myapp/lock?acquire=$SESSION_ID" -d 'owner1'
    # Prints false if the lock is already held by someone else

Application redesign
    # Before: etcd
    import etcd3
    etcd = etcd3.client(host='etcd', port=2379)
    etcd.put('/myapp/config/db_url', 'postgres://...')
    value, _meta = etcd.get('/myapp/config/db_url')  # value is bytes
    db_url = value.decode('utf-8')

    # After: Consul (KV-only)
    import consul
    c = consul.Consul(host='consul', port=8500)
    c.kv.put('myapp/config/db_url', 'postgres://...')
    _index, kv = c.kv.get('myapp/config/db_url')  # kv is None if the key is missing
    db_url = kv['Value'].decode('utf-8')

    # (Added on top) After: Consul service discovery
    c.agent.service.register(
        name='myapp',
        service_id='myapp-1',
        address='10.0.0.10',
        port=8080,
        check=consul.Check.http(
            'http://10.0.0.10:8080/health',
            interval='10s', timeout='5s', deregister='30s',
        ),
    )

    # DNS-based discovery (how other services find myapp)
    # dig +short myapp.service.consul SRV

Migration flow
    1. Pre-migration audit
       - List every application that uses etcd
       - Assess whether each application actually needs Consul extras (service discovery / health / mesh)
       - Tag pure KV use cases as low-effort migration and those that use extras as value-add migration

    2. Consul cluster build
       - Cross-DC design (WAN federation planning)
       - ACL system configuration (do not leave it default open)
       - Performance sizing (Consul agents are heavier than etcd)

    3. Application migration (per app)
       - Pure KV: swap the SDK, map the API calls, cut over
       - Service discovery: add registration + health check + DNS lookup
       - Service mesh: add Connect proxy + intentions

    4. Dual-run period
       - etcd keeps running while applications move to Consul gradually
       - Verify after each application cutover

    5. etcd decommission
       - Confirm every application has been cut over
       - K8s control plane (if it is the only remaining etcd user) stays on etcd

Overall this takes 2-4 months, depending on the number of applications and how many extras they adopt.
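Steps 1 and 3 of the flow can be captured as data so that the audit output drives the per-app work list. The following is a rough sketch under stated assumptions: the `AppAudit` type, the need labels, and the step wording are all hypothetical and only paraphrase the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Per-app work for pure KV and for each Consul extra (paraphrased from the flow).
KV_STEPS: List[str] = ["swap SDK", "map KV API calls", "cut over"]
EXTRA_STEPS: Dict[str, List[str]] = {
    "service_discovery": ["add service registration", "add health check", "switch lookups to DNS"],
    "health_check": ["add health check"],
    "service_mesh": ["add Connect proxy", "define intentions"],
}


@dataclass
class AppAudit:
    name: str
    needs: Set[str] = field(default_factory=set)  # extras the app actually needs


def tag(app: AppAudit) -> str:
    """Pure KV apps are low-effort; apps that use any extra are value-add."""
    return "value-add migration" if app.needs & set(EXTRA_STEPS) else "low-effort migration"


def steps(app: AppAudit) -> List[str]:
    """Ordered, de-duplicated work list for one application."""
    unknown = app.needs - set(EXTRA_STEPS)
    if unknown:
        raise ValueError(f"unknown extras for {app.name}: {sorted(unknown)}")
    plan = list(KV_STEPS)
    for extra, extra_steps in EXTRA_STEPS.items():  # dict order keeps output stable
        if extra in app.needs:
            plan += [s for s in extra_steps if s not in plan]
    return plan
```

For example, `tag(AppAudit("billing"))` yields the low-effort tag, while an app with `{"service_mesh"}` in `needs` is tagged value-add and gets the Connect steps appended to its plan.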
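Production failure drills
Case 1: the KV API looks 1:1, but the watch event model differs
Symptom: after the application switched from etcd watch to Consul blocking queries, event-handling latency rose from 50ms to 1-5s. The application assumed events were pushed in real time, but it was effectively polling.
Root cause: etcd watch is a gRPC stream that pushes events as they happen. A Consul blocking query is long polling with a wait timeout, and an event is delivered promptly only if it arrives within that timeout.
Fix:
- Lower the `wait` timeout to match business needs (default 5 min; it can be set to 10s)
- Concurrent polling across instances: each of the N application instances polls on its own, which reduces single-point event delay
- Architecture: send critical events through the Consul event API (`PUT /v1/event/fire/<name>`) plus a blocking query on the event endpoint, separate from KV changes
- Keep etcd for critical watches: mission-critical watches stay on etcd and are not migrated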
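To make the polling model concrete, here is a minimal sketch of a blocking-query loop that feeds each returned `X-Consul-Index` back as the next `index`. It assumes a hypothetical `watch_prefix` helper and an injectable `fetch` function (so the loop can run without a live cluster); the reset-on-backwards-index and base64 decoding follow the HTTP API's documented behavior, and the 30s socket timeout is an arbitrary choice that just needs to exceed `wait`.

```python
import base64
import json
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterator, List, Optional, Tuple

QueryResult = Tuple[int, Optional[List[dict]]]


def http_fetch(url: str, timeout: float = 30.0) -> QueryResult:
    """Run one blocking query and return (X-Consul-Index, parsed JSON body)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return int(resp.headers["X-Consul-Index"]), json.loads(resp.read())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no keys under the prefix yet
            return int(err.headers["X-Consul-Index"]), None
        raise


def watch_prefix(
    base_url: str,
    prefix: str,
    fetch: Callable[[str], QueryResult] = http_fetch,
    wait: str = "10s",
    max_rounds: Optional[int] = None,
) -> Iterator[Dict[str, str]]:
    """Yield a {key: value} snapshot each time the index under `prefix` moves."""
    index, rounds = 0, 0
    while max_rounds is None or rounds < max_rounds:
        rounds += 1
        url = f"{base_url}/v1/kv/{prefix}?recurse&index={index}&wait={wait}"
        new_index, entries = fetch(url)
        if new_index < index:
            index = 0  # index went backwards (e.g. after a restore): start over
            continue
        if new_index == index:
            continue  # wait expired with no change; re-issue the query
        index = new_index
        yield {
            e["Key"]: base64.b64decode(e["Value"] or "").decode("utf-8")
            for e in entries or []
        }
```

A caller would iterate `for snapshot in watch_prefix("http://consul:8500", "myapp/config"):` and reload configuration on each snapshot. Unlike an etcd watch, each iteration delivers the current state under the prefix rather than an individual event.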
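Case 2: session-based locks differ from etcd leases
Symptom: with an etcd lease at 5s TTL, the lock was released automatically within 5s when the lease holder lost contact. After switching to Consul sessions, the session TTL still applies, but the health check integration is complex and locks were occasionally not released.
Root cause: a Consul session has two behaviors: release (the default; the lock is released but the KV entry stays) and delete (the KV entries held by the session are deleted when it is invalidated). The behavior gets complicated when a TTL is combined with health checks.
Fix:
    # Make the session behavior explicit
    session_id = c.session.create(
        name='myapp-lock',
        ttl=15,             # 15s TTL
        behavior='delete',  # when the session is invalidated, keys it holds are deleted
    )
    c.kv.put('myapp/lock', 'owner1', acquire=session_id)  # returns False if someone else holds the lock

Session TTL must be between 10s and 86400s and cannot go below 10s (etcd leases can be as short as 1s), so Consul is not suitable for critical low-latency locks.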
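Because of that 10s floor, an etcd lease TTL cannot always be carried over unchanged. A small sketch of the translation, assuming a hypothetical `session_payload` helper that builds the JSON body for `PUT /v1/session/create` and clamps the TTL into the range stated above:

```python
CONSUL_MIN_TTL = 10      # seconds, lower bound stated above
CONSUL_MAX_TTL = 86400   # seconds, upper bound stated above
BEHAVIORS = ("release", "delete")


def consul_session_ttl(etcd_lease_ttl: int) -> int:
    """Clamp an etcd lease TTL (seconds) into Consul's allowed session range."""
    return max(CONSUL_MIN_TTL, min(CONSUL_MAX_TTL, etcd_lease_ttl))


def ttl_was_widened(etcd_lease_ttl: int) -> bool:
    """True when the Consul session will outlive the original etcd lease TTL."""
    return consul_session_ttl(etcd_lease_ttl) > etcd_lease_ttl


def session_payload(name: str, etcd_lease_ttl: int, behavior: str = "delete") -> dict:
    """Build the request body for session creation from an etcd lease setting."""
    if behavior not in BEHAVIORS:
        raise ValueError(f"behavior must be one of {BEHAVIORS}")
    return {
        "Name": name,
        "TTL": f"{consul_session_ttl(etcd_lease_ttl)}s",
        "Behavior": behavior,
    }
```

Any lock for which `ttl_was_widened` returns true (for example the 5s lease in this case) will take longer to free after a holder disappears than it did on etcd, which is worth flagging during the audit.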
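Case 3: multi-DC failover, KV written to the wrong DC
Symptom: after a cross-DC deployment, an application wrote a KV entry but could not read it back. The application had a hardcoded endpoint for one DC: writes went to us-east while reads came from us-west.
Root cause: Consul WAN federation does not automatically synchronize KV across DCs. KV is per-DC, and cross-DC sync requires a Consul Enterprise license or a self-managed consul-replicate.
Fix:
- Each application instance connects to the Consul in its local DC: writes and reads stay in the same DC
- Cross-DC KV replication: run consul-replicate yourself, or upgrade to Enterprise
- Architecture: move config shared across DCs to a DB-backed config store (durable and cross-DC), and keep only DC-local config in Consul KV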
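One way to avoid the hardcoded-endpoint mistake is to build every KV URL against the local agent and pin the datacenter explicitly with the `dc` query parameter of the HTTP API. The sketch below assumes a local agent at `http://127.0.0.1:8500` and hypothetical helper names; it only builds and compares URLs and performs no network calls.

```python
from typing import Optional
from urllib.parse import parse_qs, quote, urlencode, urlsplit


def kv_url(key: str, local_dc: str, agent_url: str = "http://127.0.0.1:8500") -> str:
    """KV endpoint on the local agent, pinned to the caller's own datacenter."""
    query = urlencode({"dc": local_dc})
    return f"{agent_url.rstrip('/')}/v1/kv/{quote(key)}?{query}"


def datacenter_of(url: str) -> Optional[str]:
    """Extract the pinned datacenter from a KV URL, or None if it is not pinned."""
    values = parse_qs(urlsplit(url).query).get("dc", [])
    return values[0] if values else None


def same_datacenter(write_url: str, read_url: str) -> bool:
    """Guard for tests: a write and its read-back must target the same DC."""
    dc = datacenter_of(write_url)
    return dc is not None and dc == datacenter_of(read_url)
```

A check like `same_datacenter(...)` in integration tests would have caught the us-east write / us-west read split described in the symptom.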
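Case 4: ACL system is open by default, exposure after cutover
Symptom: one month after the Consul cluster went live, the SOC ran an audit and found that any application could read any KV entry. ACLs were never configured, so every token had full access.
Root cause: Consul ACLs are disabled by default and need to be bootstrapped. Many setup tutorials skip ACLs for simplicity, and they were never added after cutover.
Fix:
    # Bootstrap the ACL system
    consul acl bootstrap
    # Generates the management token; keep it as the root credential

    # Create a policy
    consul acl policy create -name 'myapp-readonly' \
      -rules 'key_prefix "myapp/" { policy = "read" }'

    # Create a token for the application
    consul acl token create -policy-name 'myapp-readonly'

In a production setup, bootstrapping ACLs is the very first step and must not be postponed.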
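`consul acl bootstrap` only succeeds once ACLs are enabled in the agent configuration, so that fragment has to exist before the commands above are run. The following sketch renders such a fragment and the per-application rule string; the function names are hypothetical, and the choice of `default_policy = "deny"` is an assumption that matches the "do not leave it default open" guidance rather than something the commands above require.

```python
import json

KEY_POLICIES = ("read", "write", "list", "deny")


def acl_agent_config(default_policy: str = "deny") -> dict:
    """Agent config fragment (JSON form) that turns the ACL system on."""
    if default_policy not in ("allow", "deny"):
        raise ValueError("default_policy must be 'allow' or 'deny'")
    return {
        "acl": {
            "enabled": True,
            "default_policy": default_policy,
            "enable_token_persistence": True,
        }
    }


def key_prefix_rule(prefix: str, policy: str = "read") -> str:
    """Rule text for `consul acl policy create -rules ...`, one prefix per app."""
    if policy not in KEY_POLICIES:
        raise ValueError(f"policy must be one of {KEY_POLICIES}")
    return f'key_prefix "{prefix}" {{ policy = "{policy}" }}'


if __name__ == "__main__":
    print(json.dumps(acl_agent_config(), indent=2))
    print(key_prefix_rule("myapp/"))
```

Generating one rule per application prefix keeps tokens scoped to their own subtree, which addresses the "any application can read any KV" finding from the audit.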
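Case 5: health check failures cascade and service discovery breaks
Symptom: an application instance stopped responding to health checks for 5 seconds because of a GC pause, and Consul marked it failed. DNS queries stopped returning the instance and traffic shifted away. After the GC finished the instance was healthy again, but Consul still showed it as failed and recovery took minutes.
Root cause: once a Consul health check fails it enters the critical state and needs N consecutive successes before returning to passing. With default settings one success is enough, but the wall-clock recovery time depends on the check interval.
Fix:
- Set `success_before_passing` low (1) so recovery is fast
- Set `failures_before_critical` high (3-5) to tolerate transient failures
- Multi-check strategy: combine HTTP + TCP + script checks instead of relying on a single check
- Application-side hint: for JVM applications, set `MaxGCPauseMillis` (a pause-time goal, not a hard guarantee) to keep GC pauses below the health check interval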
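The effect of the two thresholds is easier to see on a timeline. The sketch below is a simplified model of consecutive-result counting, not Consul's actual implementation: the function name and the exact threshold semantics (status flips once the run of consecutive results reaches the threshold) are assumptions made for illustration.

```python
from typing import Iterable, List


def simulate_check(
    results: Iterable[bool],
    failures_before_critical: int = 1,
    success_before_passing: int = 1,
) -> List[str]:
    """Return the reported status after each probe result (True = probe ok)."""
    status = "passing"
    failures = successes = 0
    timeline: List[str] = []
    for ok in results:
        if ok:
            successes += 1
            failures = 0
            if status != "passing" and successes >= success_before_passing:
                status = "passing"
        else:
            failures += 1
            successes = 0
            if status != "critical" and failures >= failures_before_critical:
                status = "critical"
        timeline.append(status)
    return timeline
```

With both thresholds at 1, a single probe missed during a GC pause flips the instance to critical; with `failures_before_critical=3` the same probe sequence never leaves passing, at the cost of detecting a real outage two intervals later.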
Capacity / cost
| Dimension | etcd | Consul |
|---|---|---|
| Cluster baseline | 3-5 node Raft cluster | 3-5 servers + N agents (one per host) |
| Memory per node | 2-8GB | 4-16GB (including agent) |
| Operational FTE | 0.2-0.5 | 0.5-1.0 (more features, more ops) |
| Feature surface | Pure KV | KV + service mesh + multi-DC + ACL |
| Setup complexity | Low | Medium-High |
| Multi-DC support | Not supported | Built-in WAN federation |
| License | Apache 2.0 (open) | MPL 2.0 (community) / commercial (enterprise) |
| Migration cost | - | 1-3 FTE × 2-4 months |
Reading: pure KV use cases stay on etcd; heavy service mesh / multi-DC / discovery needs go to Consul; a mixed deployment is the long-term default (the K8s control plane stays on etcd, the service mesh runs on Consul).
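Integration / next steps
Relationship to Kubernetes
The K8s control plane always uses etcd and is not moved to Consul. Consul is the service mesh and cross-cluster discovery layer outside K8s. The two coexist and are not mutually exclusive.
Integration with Vault
Consul and Vault belong to the same HashiCorp ecosystem: Consul handles service discovery / mesh, Vault handles secrets. Consul ACL tokens can be issued by a Vault dynamic secrets engine.
Relationship to Istio / Linkerd
Consul Connect follows the service mesh paradigm and sits alongside Istio / Linkerd. Most K8s-native organizations use Istio / Linkerd; Consul's strength is a mesh that spans K8s + VMs + multiple DCs.
Reverse migration (Consul → etcd)
A few organizations do this when simplifying their stack. The flow mirrors this one, but dropping service mesh / multi-DC is a deliberate downgrade; do not pretend the feature sets are equivalent.
Next topics
- Consul Connect production rollout: mesh adoption is incremental, with per-service intentions introduced gradually
- Multi-DC topology design: active-active vs active-passive, based on the RPO/RTO and cost trade-off
- Integration with the Kubernetes Gateway API: how to integrate the service mesh paradigm inside vs outside K8s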
Related links
- Target vendor: Consul
- Parallel migration playbooks (Type E): Redis → Memcached (the paradigm reduction dual) / Kafka ↔ NATS
- Parallel integration: HashiCorp Vault
- Methodology: Migration playbook methodology
#backend #deployment-platform #etcd #consul #paradigm-shift #migration #type-e