Query-Optimization on Tarragon

MySQL Query Optimization: from EXPLAIN to actual execution, the anatomy of 5 queries going from 5 s to 50 ms

Tue, 19 May 2026 00:00:00 +0000

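This article is the implementation-layer deep dive that accompanies the MySQL overview. The overview covers where MySQL sits in the OLTP landscape; this article focuses on query optimization: the three tool layers (EXPLAIN, optimizer trace, hints) and 5 real cases.

5 common production cases

When a query is slow in production, the root cause is very often the optimizer picking the wrong plan. The following 5 cases are a way into query optimization: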

Case 1: 5 s → 50 ms (wrong JOIN order)

-- Slow (5 s): the optimizer picks customers as the outer table and scans all 1M rows
SELECT o.id, o.amount, c.name
FROM orders o JOIN customers c ON o.customer_id = c.id
WHERE o.created_at > '2026-05-01' AND c.region = 'TW';

EXPLAIN shows:

+----+-------------+-------+------+---------------+---------+
| id | select_type | table | type | possible_keys | rows    |
+----+-------------+-------+------+---------------+---------+
|  1 | SIMPLE      | c     | ALL  | NULL          | 1000000 |
|  1 | SIMPLE      | o     | ref  | idx_cust_id   | 100     |
+----+-------------+-------+------+---------------+---------+

Table c has type=ALL (full scan) with rows=1M. The problem: customers has no index on region, so the optimizer has no efficient way to apply the region = 'TW' filter and falls back to a full scan, even though region = 'TW' matches only 10% of the rows (100K rows).

Fix:

ALTER TABLE customers ADD INDEX idx_region (region);
ANALYZE TABLE customers;  -- refresh statistics

With the index in place the optimizer switches plans: it reads customers through idx_region to get the 100K matching rows, then joins orders. Latency drops from 5 s to 50 ms.

Case 2: 30 s → 200 ms (range scan degrades to ALL)

SELECT * FROM events
WHERE created_at BETWEEN '2026-05-01' AND '2026-05-02'
AND user_id = 12345;

events has two indexes, idx_user_id and idx_created_at. The optimizer should pick one and apply the other predicate as a secondary filter, but the actual plan is type=ALL (full scan).

EXPLAIN ANALYZE shows:

-> Filter: ((events.user_id = 12345) and (events.created_at between ...))  (cost=2M rows=100)
    -> Table scan on events  (cost=2M rows=10000000)  (actual time=0.1..30s ...)

The problem: the optimizer estimated rows=100, but its cardinality estimate was thrown off by a skewed distribution, so it chose ALL.

Fix:

-- A composite index covers both conditions directly
ALTER TABLE events ADD INDEX idx_user_created (user_id, created_at);

The composite index lets the optimizer satisfy both predicates with a single index, using a range scan plus index condition pushdown. Latency drops from 30 s to 200 ms.

Case 3: 8 s → 30 ms (subquery not unnested)

SELECT * FROM orders
WHERE customer_id IN (
    SELECT id FROM customers WHERE region = 'TW' AND vip_level >= 3
);

Before 5.6, MySQL executed IN (subquery) as a correlated subquery, re-running the subquery for every row of the outer table, which was extremely slow. 5.6+ added subquery unnesting (a semi-join transformation into a JOIN), but in some situations the unnesting does not happen.

EXPLAIN shows:

+----+--------------------+-----------+-----------------+
| id | select_type        | table     | type            |
+----+--------------------+-----------+-----------------+
|  1 | PRIMARY            | orders    | ALL             |
|  2 | DEPENDENT SUBQUERY | customers | unique_subquery |
+----+--------------------+-----------+-----------------+

DEPENDENT SUBQUERY is a warning sign. Fix:

-- Rewrite manually as a JOIN
SELECT o.* FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.region = 'TW' AND c.vip_level >= 3;

Or use EXISTS (in some cases it yields a better plan than IN):

SELECT * FROM orders o
WHERE EXISTS (
    SELECT 1 FROM customers c
    WHERE c.id = o.customer_id AND c.region = 'TW' AND c.vip_level >= 3
);

Plans differ between these forms and must be verified with EXPLAIN; do not assume that a JOIN is always faster than IN.

Case 4: 2 s → 100 ms (derived table materialization)

SELECT * FROM orders o
JOIN (
    SELECT customer_id, COUNT(*) AS order_count
    FROM orders
    GROUP BY customer_id
) AS counts ON o.customer_id = counts.customer_id
WHERE counts.order_count > 10;

Before 5.7, a derived table (a subquery in FROM) was always materialized into an internal temporary table. 5.7+ can instead merge it into the outer query (derived_merge), but a derived table with GROUP BY cannot be merged, and the optimizer does not always pick the cheaper strategy.

EXPLAIN shows:

+----+-------------+--------+------+
| id | select_type | table  | type |
+----+-------------+--------+------+
|  1 | PRIMARY     | o      | ALL  |
|  2 | DERIVED     | orders | ALL  |  -- full scan of orders to build the derived table
+----+-------------+--------+------+

Fix:

-- Rewrite explicitly with a CTE
WITH counts AS (
    SELECT customer_id, COUNT(*) AS order_count
    FROM orders GROUP BY customer_id
)
SELECT o.* FROM orders o
JOIN counts ON o.customer_id = counts.customer_id
WHERE counts.order_count > 10;

Keep in mind that MySQL handles a non-recursive CTE much like a derived table (merged where possible, otherwise materialized), so the CTE rewrite alone may not change the plan. An explicit temporary table gives full control over when the result is computed and reused:

CREATE TEMPORARY TABLE counts AS
SELECT customer_id, COUNT(*) AS order_count FROM orders GROUP BY customer_id;
SELECT o.* FROM orders o JOIN counts ON o.customer_id = counts.customer_id
WHERE counts.order_count > 10;
DROP TEMPORARY TABLE counts;

Case 5: 10 s → 100 ms (optimizer picks the wrong index)

SELECT * FROM users WHERE age > 30 AND active = 1;

users has idx_active (high selectivity) and idx_age (low selectivity). The optimizer picks idx_age, scans 60% of the rows, and is slow.

EXPLAIN shows key: idx_age, even though fewer than 5% of the rows remain after the active = 1 filter.

Fixes (pick one):

Force it with an index hint:

SELECT * FROM users USE INDEX (idx_active)
WHERE age > 30 AND active = 1;

Replace with a composite index:

ALTER TABLE users ADD INDEX idx_active_age (active, age);
DROP INDEX idx_age ON users;

Optimizer hint (8.0+; the INDEX hint needs 8.0.20+):

SELECT /*+ INDEX(users idx_active) */ * FROM users
WHERE age > 30 AND active = 1;

The composite index is the most durable fix because it does not depend on hints. An index hint is a quick fix, but it is fragile against future schema changes.

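The three EXPLAIN tool layers

Tool 1: EXPLAIN (query plan preview)

EXPLAIN SELECT ...;

Prints the estimated cost, row count, and key used for each step. Use it for a quick check of the plan shape.

Key columns:

type: access type, from worst to best ALL < index < range < ref < eq_ref < const; ALL and index are warning signs
key: the index actually chosen, which may differ from possible_keys
rows: estimated number of rows scanned
Extra: behavior flags such as Using filesort, Using temporary, Using index condition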

Tool 2: EXPLAIN ANALYZE (actual execution statistics)

Added in 8.0 (8.0.18). The difference: it actually runs the query and reports actual row counts and timings next to the estimates.

EXPLAIN ANALYZE SELECT ...;

Output format (tree format):

-> Nested loop inner join  (cost=2.4e6 rows=100000) (actual time=0.05..3.2 rows=10000 loops=1)
    -> Index range scan on orders using idx_created (cost=2.4e6 rows=10000) (actual time=0.04..3.0 rows=10000 loops=1)
    -> Single-row index lookup on customers using PRIMARY (cost=1 rows=1) (actual time=0.0001..0.0001 rows=1 loops=10000)

The key is to compare cost / rows (estimates) against actual time / rows. If the estimate is 100K rows and the actual is 10M, the optimizer is badly underestimating and may have chosen the wrong plan.
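The estimate-versus-actual comparison can be automated for quick triage. The following is a minimal sketch that assumes plan lines shaped like the tree output above; the helper name misestimate_ratio and the regular expression are illustrative and not part of any MySQL tooling, and real output may contain node formats this pattern does not cover.

```python
import re

# One plan node: "(cost=... rows=EST) (actual time=... rows=ACT loops=N)".
# The pattern is an assumption based on the sample output shown above.
NODE_STATS = re.compile(
    r"\(cost=\S+ rows=(?P<est>[\d.e+]+)\)\s*"
    r"\(actual time=\S+ rows=(?P<act>[\d.e+]+) loops=\d+\)"
)


def misestimate_ratio(line):
    """Return actual rows / estimated rows for one plan line, or None if it has no stats."""
    match = NODE_STATS.search(line)
    if match is None:
        return None
    estimated = max(float(match.group("est")), 1.0)  # avoid dividing by zero
    return float(match.group("act")) / estimated


plan = [
    "-> Nested loop inner join  (cost=2.4e6 rows=100000) (actual time=0.05..3.2 rows=10000 loops=1)",
    "    -> Index range scan on orders using idx_created (cost=2.4e6 rows=10000) (actual time=0.04..3.0 rows=10000 loops=1)",
]
for line in plan:
    print(line.strip()[:40], misestimate_ratio(line))
```

A ratio far from 1 in either direction marks the node where the estimate and reality diverge, which is usually the place to start looking at statistics or indexes.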

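Tool 3: Optimizer Trace (why the optimizer chose this plan)

SET optimizer_trace='enabled=on';
SELECT ...;
SELECT * FROM information_schema.optimizer_trace;

The output is JSON listing, for each step, the plans the optimizer considered, their cost estimates, and why the final plan won. Use it to debug cases where the optimizer's behavior does not match your expectations.

The optimizer trace for a complex query can exceed 100 KB, and reading it takes familiarity with the JSON structure. It is a tool for debugging production issues, not for routine use.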

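Optimizer hint vs Index hint

There are two kinds of hints, with different syntax and different behavior:

Index hint (available since 5.x)

SELECT ... FROM table USE INDEX (idx_name) WHERE ...;
SELECT ... FROM table FORCE INDEX (idx_name) WHERE ...;
SELECT ... FROM table IGNORE INDEX (idx_name) WHERE ...;

USE INDEX: restricts the optimizer to the listed indexes; it may still choose a full table scan instead
FORCE INDEX: like USE INDEX, but a table scan is treated as very expensive and is used only when the index cannot be used at all
IGNORE INDEX: forbids the listed indexes

Problems:

The index name is hard-coded in the query, so a refactor or partitioning change that renames or drops the index breaks it
FORCE is heavy-handed and can make the query slower than with no hint at all (the forced index is not the best plan)

Optimizer hint (8.0+)

SELECT /*+ INDEX(table_name idx_name) */ ... FROM table WHERE ...;
SELECT /*+ JOIN_ORDER(t1, t2, t3) */ ... FROM t1, t2, t3 WHERE ...;
SELECT /*+ HASH_JOIN(t1 t2) */ ... FROM t1 JOIN t2 ...;  -- honored in 8.0.18 only, ignored from 8.0.19
SELECT /*+ NO_INDEX_MERGE(table) */ ... FROM table WHERE ...;

Finer-grained (join order, join method, and index choice are controlled separately)
Embedded in a query comment, so the SQL syntax itself stays clean
Safer than index hints: the optimizer takes the hint into account but still searches the plan space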

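5 production pitfalls

1. Stale statistics: the optimizer misestimates row counts

information_schema.STATISTICS records the cardinality of every index. If ANALYZE has not run for over a month, the statistics can drift far from the actual data distribution and the optimizer's estimates go wrong.

Fix:

Run ANALYZE TABLE regularly (a nightly cron job for large tables)
innodb_stats_auto_recalc=ON is the default in 8.0+, but it triggers only after more than 10% of the rows have changed
Keep innodb_stats_persistent=ON (the default, stores statistics on disk) and raise innodb_stats_persistent_sample_pages above its default of 20 to improve sampling precision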

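2. Forced index used wrongly: the hint is slower than no hint

FORCE INDEX (idx) forces the optimizer to use idx, and when idx is not the best choice the query gets slower. A common scenario: FORCE INDEX tests well in development or staging, but the production data distribution differs and the forced index ends up slower.

Fix:

Use USE INDEX instead of FORCE INDEX (the optimizer can still fall back to a table scan)
Avoid depending on hints; reach the goal with a composite index or a query rewrite
Put queries that already carry hints through a staging review process to confirm the plan is still reasonable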

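3. Hash join does not trigger: the equality is on an expression

SELECT ... FROM a JOIN b ON a.id = b.parent_id + 1;

b.parent_id + 1 is an expression rather than a raw column, so the optimizer does not choose a hash join and uses a nested loop instead.

Fix:

Schema change: turn parent_id + 1 into a generated column
Query change: precompute the expression into a temp table before the JOIN
Or request it explicitly with /*+ HASH_JOIN(a b) */ (honored in 8.0.18 only and ignored from 8.0.19 on; the optimizer may still decline)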

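4. Range scan degrades to ALL: cardinality estimated too low

SELECT ... FROM t WHERE col IN (1, 2, 3, ..., 1000);

With 1000 values in the IN list, the optimizer estimates that a range scan needs too many lookups and that ALL is cheaper, so it chooses a full table scan. For a medium-sized table (1M rows) the IN lookup is usually still faster, but the optimizer gets the estimate wrong.

Fix:

Split the IN into a temp table JOIN:

CREATE TEMPORARY TABLE in_values (val INT);
INSERT INTO in_values VALUES (1), (2), ..., (1000);
SELECT t.* FROM t JOIN in_values iv ON t.col = iv.val;

Or set optimizer_switch='index_merge=on' (a multi-value IN may then use index merge)
Or, for a large IN, split it into batched queries at the application layer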
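Splitting a large IN list at the application layer can look like the sketch below. The batch size of 200 is a hypothetical value to tune against your own EXPLAIN output, the table t and column col come from the example above, and the %s placeholder style is the one used by common Python MySQL drivers; the helper names are illustrative.

```python
def chunked(values, size):
    """Yield successive lists of at most `size` items from `values`."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(values), size):
        yield values[start:start + size]


def in_queries(values, size=200):
    """Build one (sql, params) pair per batch, with one %s placeholder per value."""
    for batch in chunked(list(values), size):
        placeholders = ", ".join(["%s"] * len(batch))
        yield f"SELECT * FROM t WHERE col IN ({placeholders})", batch


# 1000 values become 5 parameterized queries of 200 values each.
for sql, params in in_queries(range(1, 1001)):
    print(len(params), sql[:40])
```

Each pair can be passed to a driver's execute call and the results concatenated in the application. Values are bound as parameters rather than formatted into the SQL string, which keeps the batching safe against injection.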

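5. Derived table materialization off: repeated scans

optimizer_switch='derived_merge=on' (ON by default, derived tables are merged inline automatically) makes some queries slower because the plan becomes more complex after merging. The opposite problem also occurs: a derived table is not materialized and is re-run every time.

Fix:

Check whether EXPLAIN has a DERIVED row to confirm the materialization behavior
Set optimizer_switch='derived_merge=off' to force materialization (this affects the whole connection, so use it with care)
Replace large derived tables with an explicit temporary table for full control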

Comparison with PostgreSQL EXPLAIN

Tool	MySQL	PostgreSQL
Query plan preview	`EXPLAIN`	`EXPLAIN`
Actual execution statistics	`EXPLAIN ANALYZE` (8.0+)	`EXPLAIN ANALYZE`
Optimizer internals trace	optimizer_trace (JSON)	`auto_explain` extension
Format	TABLE / JSON / TREE	TEXT / JSON / XML / YAML
Parallel query plan	Limited (8.0 has no general parallel query execution)	Full (PG 9.6+ parallel scan / aggregate / join)
Index merge	Yes	Yes (`bitmap index scan`)
Genetic Query Optimizer	No	Yes (suited to JOINs of more than 12 tables)
Cost estimate accuracy	Medium (histograms in 8.0+)	High (mature statistics)

The PG optimizer is more mature overall, and its plans for complex OLAP-style queries are more stable. MySQL 8.0 closed much of the gap (histograms, hash join, derived table merge): simple OLTP queries are fine, but complex queries remain a weak spot.

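Integration with other modules

With Modern SQL Features

CTEs, window functions, lateral joins, and hash joins all change the query plan space, and the optimizer has to recognize the new patterns. The 8.0 optimizer still has room to improve on plans for the newer SQL features. See Modern SQL Features.

With InnoDB Tuning

Query plans are affected by the buffer pool hit rate: the optimizer assumes a random I/O cost, while data already in the buffer pool is much cheaper to read. When the buffer pool is too small, plan estimates become distorted. See InnoDB Tuning.

With ProxySQL

ProxySQL query rules do not affect the optimizer's plan, but they can rewrite queries (the rule engine's replace_pattern). This turns poorly written application queries into an optimizer-friendly form without changing the application. See ProxySQL configuration.

With Lock Contention

A slow query holds locks longer, other queries wait, and lock contention spreads across the cluster. Query optimization is therefore not only a latency issue but also a question of how far lock effects reach. See the Lock Contention deep dive (to be written).

With Partitioning

Partition pruning is decided by the optimizer; the partitions column of EXPLAIN shows which partitions are hit (the separate EXPLAIN PARTITIONS syntax was removed in 8.0). A partition + index combination can be slower than a single big table + index because of cross-partition query overhead. See the Partitioning article (to be written).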

Metrics to observe

Monitor continuously in production:

performance_schema.events_statements_summary_by_digest: cumulative time, rows examined, and rows sent per query digest
slow_query_log: slow queries are written to the log file (long_query_time=1)
sys.statements_with_full_table_scans: history of queries that used full scans
sys.schema_unused_indexes: indexes that have never been used, which can be dropped to save write cost

Feed these into Datadog or Percona Monitoring & Management for trend analysis.

PostgreSQL Query Optimization: the three tool layers (EXPLAIN ANALYZE / pg_hint_plan / auto_explain) and 4 cases

Tue, 19 May 2026 00:00:00 +0000

This article is the implementation-layer deep dive that accompanies the PostgreSQL overview. The overview covers where PG sits in the OLTP landscape; this article focuses on query optimization: the three tool layers (EXPLAIN ANALYZE, auto_explain, pg_hint_plan) and 4 real cases.

4 common production cases

When a PG query is slow, the root cause is in most cases the planner picking the wrong plan. The following 4 cases are a way into query optimization:

Case 1: 5 s → 50 ms (seq scan vs index)

-- Slow (5 s)
SELECT o.id, o.amount, c.name
FROM orders o JOIN customers c ON o.customer_id = c.id
WHERE c.region = 'TW' AND o.created_at > '2026-05-01';

EXPLAIN (ANALYZE, BUFFERS):

Hash Join  (cost=20000..50000 rows=100 width=...) (actual time=4900..5000 rows=10000)
  ->  Seq Scan on customers c  (cost=0..20000 rows=1000000 width=...)
      Filter: (region = 'TW')
      Rows Removed by Filter: 900000
  ->  Hash  (cost=...)
      ->  Index Scan on orders_created_idx

The problem: customers.region has no index, so the planner chooses a seq scan even though region = 'TW' matches only 10% of the rows. Fix:

CREATE INDEX CONCURRENTLY idx_customers_region ON customers(region);
ANALYZE customers;  -- refresh statistics so the planner's row estimates are current

After that, 5 s drops to 50 ms.

Case 2: 30 s → 200 ms (hash join does not trigger, nested loop used)

SELECT u.name, count(o.id)
FROM users u LEFT JOIN orders o ON o.user_id = u.id
GROUP BY u.name;

EXPLAIN ANALYZE shows a Nested Loop running the inner loop 1M times, taking 30 seconds. The planner misestimated the row count and chose a nested loop. A hash join should take under 200 ms.

Fix:

ANALYZE users;
ANALYZE orders;
-- Raise the statistics target for the critical column (overrides default_statistics_target)
ALTER TABLE orders ALTER COLUMN user_id SET STATISTICS 1000;
ANALYZE orders;

With more precise statistics the planner's row estimates become accurate and it switches to a hash join on its own.

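Case 3: 8 s → 100 ms (missing multi-column statistics)

SELECT * FROM orders WHERE status = 'pending' AND region = 'TW';

status = 'pending' matches 5% of the rows and region = 'TW' matches 10%. The planner assumes the two columns are independent and estimates 0.5% (5K rows of a 1M-row table). In reality status = 'pending' and region = 'TW' are strongly correlated (TW orders are often pending) and the actual figure is 4% (40K rows). The planner is off by 8x and picks the wrong plan.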
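The arithmetic behind the 8x error is easy to reproduce. The sketch below multiplies per-column selectivities the way an independence assumption does; the function name and the 1M-row table size are illustrative, and this is not the planner's actual code.

```python
def independent_estimate(total_rows, *selectivities):
    """Estimate matching rows by multiplying selectivities, assuming independent columns."""
    estimate = float(total_rows)
    for selectivity in selectivities:
        estimate *= selectivity
    return estimate


total_rows = 1_000_000
estimated = independent_estimate(total_rows, 0.05, 0.10)  # status, then region
actual = total_rows * 0.04                                # correlated reality from the case
error_factor = actual / estimated

print(round(estimated), round(actual), round(error_factor))
```

The more correlated predicates a query combines, the further the multiplied estimate drifts from reality, which is why a case like this needs statistics that describe the columns together.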

Fix (PG 10+; the mcv kind requires PG 12+):

CREATE STATISTICS stats_orders_status_region (dependencies, ndistinct, mcv)
ON status, region FROM orders;
ANALYZE orders;
-- The planner now knows how status and region correlate and estimates accurately

Case 4: 20 s → 5 s (parallel query does not trigger)

SELECT region, count(*), sum(amount) FROM orders GROUP BY region;

orders has 100M rows. The expectation is a PG parallel scan plus parallel aggregate, but a single worker actually runs for 20 seconds.

EXPLAIN: Workers Planned: 0.

Fix:

# postgresql.conf
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
max_worker_processes = 16
parallel_setup_cost = 100        # default 1000; lowering it makes the planner more willing to go parallel
parallel_tuple_cost = 0.01       # default 0.1

With parallelism, 5 seconds.

The three EXPLAIN tool layers

Tool 1: EXPLAIN (plan preview)

EXPLAIN SELECT ...;

Prints the estimated cost, row count, and width of each node. Use it for a quick plan check.

Key fields:

Plan node types: Seq Scan < Index Scan < Index Only Scan; the warning sign is an unexpected node type
cost=START..END: the planner's estimated cost; START is the startup cost, END is the total cost
rows: estimated number of output rows
width: average bytes per row (affects sort / hash memory)

Tool 2: EXPLAIN ANALYZE (actual execution compared with estimates)

EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT ...;

The difference: it actually runs the query and prints actual row counts and timings for comparison with the estimates:

Hash Join  (cost=20000..50000 rows=100) (actual time=400..500 rows=10000 loops=1)

rows=100 (estimate) vs rows=10000 (actual): off by 100x, so the planner may have picked the wrong plan. BUFFERS shows disk reads versus buffer cache hits.

Note: EXPLAIN ANALYZE really executes the query, so a modifying query (UPDATE / DELETE) really changes data. Read queries are safe. Wrap modifying queries in a transaction:

BEGIN;
EXPLAIN ANALYZE UPDATE orders SET status = 'x' WHERE ...;
ROLLBACK;

Tool 3: auto_explain (automatic capture of production queries)

The auto_explain extension automatically logs the plans of slow queries:

# postgresql.conf
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '1s'    # log the plan of anything slower than 1 second
auto_explain.log_analyze = on            # include ANALYZE statistics
auto_explain.log_buffers = on
auto_explain.log_format = 'json'         # JSON format for tools to consume

Slow production queries land in the log automatically, with no manual EXPLAIN needed. The combination of pg_stat_statements and auto_explain is the standard query observability setup for PG.

pg_hint_plan vs Planner GUC

PG offers two ways to nudge the planner:

Planner GUC (global)

In postgresql.conf:

enable_seqscan = off: discourages seq scans (pushes the planner toward indexes)
enable_nestloop = off: discourages nested loops (pushes the planner toward hash / merge joins)
random_page_cost = 1.1: set low on SSD (the default of 4 assumes HDD)
effective_cache_size = '16GB': estimate of shared_buffers + OS cache, influences the planner

A GUC set in postgresql.conf is global and affects every query. For a single query, use a hint:

pg_hint_plan extension (per-query hint)

-- Force a specific plan
/*+ IndexScan(orders idx_orders_status) NestLoop(orders customers) */
SELECT ... FROM orders JOIN customers ON ...;

Hint forms:

IndexScan(t1 idx_name): force an index scan
SeqScan(t1): force a seq scan
HashJoin(t1 t2) / NestLoop(t1 t2) / MergeJoin(t1 t2)
Leading(t1 t2 t3): force the join order
Rows(t1 t2 #100): force the row estimate

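5 production pitfalls

1. Stale statistics: the planner misestimates row counts

ANALYZE runs as part of autovacuum, and the default autovacuum_analyze_scale_factor=0.1 means a table is analyzed only after about 10% of its rows have changed. For fast-growing tables (logs, events) ANALYZE cannot keep up and the planner works from stale statistics.

Fix:

Set a more aggressive autovacuum_analyze_scale_factor on critical tables:

ALTER TABLE events SET (autovacuum_analyze_scale_factor = 0.02);

After a large batch write, run ANALYZE events; manually
Monitor pg_stat_user_tables.last_analyze and compare it against the row count to decide whether a manual trigger is needed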
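How many changes a table must accumulate before autoanalyze fires follows from the documented formula, autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples, where the base threshold defaults to 50. The sketch below evaluates it for a hypothetical 100M-row table; the function name and the table size are illustrative.

```python
def analyze_trigger_changes(reltuples, scale_factor=0.1, base_threshold=50):
    """Rows that must change before autovacuum schedules an ANALYZE of the table."""
    return base_threshold + scale_factor * reltuples


rows = 100_000_000  # hypothetical size of a fast-growing events table
default_trigger = analyze_trigger_changes(rows)                   # scale factor 0.1
tuned_trigger = analyze_trigger_changes(rows, scale_factor=0.02)  # per-table override

print(round(default_trigger), round(tuned_trigger))
```

With the default setting, roughly ten million rows have to change before the statistics refresh, which is why large append-heavy tables usually need a per-table override or a manual ANALYZE after bulk loads.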

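2. Multi-column statistics: the planner assumes columns are independent

As in Case 3, single-column statistics misestimate correlated columns.

Fix:

For column combinations that are often queried together, create extended statistics with CREATE STATISTICS (PG 10+)
Three kinds: dependencies (functional dependencies), ndistinct (multi-column distinct counts), mcv (most common value combinations, PG 12+)
After creating them you must run ANALYZE for them to take effect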

3. Cost settings not aligned with hardware: the planner leans toward seq scans

The defaults random_page_cost = 4 and seq_page_cost = 1 reflect an HDD assumption (random I/O 4x slower than sequential). On SSD / NVMe the gap between random and sequential I/O is small, and the planner should not penalize random access by 4x.

Fix:

-- SSD
ALTER SYSTEM SET random_page_cost = 1.1;

-- NVMe
ALTER SYSTEM SET random_page_cost = 1.0;

SELECT pg_reload_conf();

After changing random_page_cost the planner's cost estimates for index scans are more accurate, and it chooses indexes more readily.

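4. `effective_cache_size` not aligned with actual RAM

effective_cache_size defaults to 4 GB, so the planner assumes shared_buffers plus the OS cache total 4 GB. On a real server with 64 GB RAM, shared_buffers = 16GB, and roughly 30 GB of OS page cache, the cache actually available is about 46 GB.

Fix:

ALTER SYSTEM SET effective_cache_size = '46GB';  -- estimate of shared_buffers + OS cache

With the higher value the planner assumes most pages of a query are in cache, lowers its estimated random I/O cost, and chooses indexes more readily.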

5. Parallel query does not trigger

The default max_parallel_workers_per_gather = 2 is not enough for some workloads. Alternatively the table is too small: min_parallel_table_scan_size defaults to 8MB, and smaller tables are not scanned in parallel.

Fix:

ALTER SYSTEM SET max_parallel_workers_per_gather = 4;
ALTER SYSTEM SET parallel_setup_cost = 100;
ALTER SYSTEM SET parallel_tuple_cost = 0.01;
ALTER SYSTEM SET min_parallel_table_scan_size = '0';  -- consider tables of any size for parallel scans

Watch the Workers Planned count in EXPLAIN to see whether the query really runs in parallel.

Metrics to observe

Monitor continuously in production:

pg_stat_statements: cumulative calls / time / rows / I/O per query digest
auto_explain log: the actual plan and ANALYZE statistics of slow queries
pg_stat_user_tables.last_analyze / last_autoanalyze: freshness of statistics
pg_stat_user_indexes.idx_scan: number of scans per index; 0 means unused and a candidate for dropping

Feed these into Datadog or Prometheus (via postgres_exporter / pg_exporter) for trend analysis.

Comparison with MySQL Query Optimization

Dimension	PG	MySQL
Query plan preview	`EXPLAIN`	`EXPLAIN`
Actual execution statistics	`EXPLAIN ANALYZE`	`EXPLAIN ANALYZE` (8.0+)
Auto-capture	`auto_explain` extension	`slow_query_log` + `pt-query-digest`
Optimizer trace	log_planner_stats / log_executor_stats	`optimizer_trace` (JSON)
Per-query hint	`pg_hint_plan` extension	optimizer hint comment (`/*+ ... */`)
Multi-column statistics	`CREATE STATISTICS`	None natively (relies on index statistics)
Parallel query	Full (scan / agg / join, PG 9.6+)	Limited (8.0 has no general parallel query execution)
Cost settings	random_page_cost / effective_cache_size	Implicit, optimizer defaults

PG's planner is mature overall and handles complex OLAP-style queries better. MySQL 8.0 closed much of the gap (histograms / hash join) but is still weaker than PG on complex queries. See MySQL Query Optimization.

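Integration with other modules

With Autovacuum Tuning

ANALYZE runs as part of autovacuum: when autovacuum cannot keep up, statistics go stale and the planner misestimates. See Autovacuum Tuning.

With Replication Topology

Queries on a standby use the same statistics (streaming replication copies the entire system catalog), so planner behavior is consistent. However, hot_standby_feedback on the standby affects autovacuum on the primary by delaying dead-row cleanup. See Replication Topology.

With Partitioning

Partition pruning is tightly coupled to the query plan; check EXPLAIN to see whether the right partitions are pruned. See Declarative Partitioning.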

When to use pg_hint_plan vs GUC

Scenario	Choice
Cluster-wide behavior (such as random_page_cost for SSD)	GUC
Forcing a specific plan for a single critical query	pg_hint_plan
Temporarily disabling a class of plan for debugging	`SET enable_xxx=off` per-session
Stable production use	GUC + multi-column statistics first, hints as a last resort

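MongoDB Aggregation Pipeline Optimization: stage order, index usage, and memory limits

Wed, 27 May 2026 00:00:00 +0000

The MongoDB aggregation pipeline is the main interface for analytical queries on the document model. Its stage-stream design is intuitive, but it is easy to trip over in production: 200 ms at launch, 8 s half a year later once the data volume has doubled, and adding an index does not help; the profiler shows hundreds of MB of temp data accumulating in memory between stages. Optimizing an aggregation pipeline follows a completely different logic from an RDBMS SQL planner: an RDBMS relies on the planner to reorder joins and filters automatically, while MongoDB relies largely on the query author to order the stages by hand. This article explains the stage mechanics, index usage, memory limits, and cross-shard restrictions, and gives a remediation path for the common anti-pattern of a report dashboard overloading the primary.

This article does not repeat the aggregation introduction from the MongoDB vendor overview; it is a hands-on guide to production tuning and failure recovery.

Prerequisite reading: for judging whether a workload fits MongoDB (is it driven by document shape, where should the contract layer live, is cross-cloud hedging needed), see the three-axis assessment at the start of schema-design-pattern. This article focuses on operating the aggregation pipeline, a query-layer engineering topic for teams that have already chosen MongoDB, and does not repeat that assessment.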

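Problem scenario: the anti-pattern of aggregation on the hot path

Typical trigger: a reporting pipeline takes 200 ms at launch and 8 s half a year later after the data volume has doubled, and adding an index does not help; the profiler shows hundreds of MB of temp data accumulating in memory between stages.

Further symptoms:

Mixed workload that runs analytical queries on an OLTP collection: $group + $lookup + $sort chained into a long pipeline, with the aggregation evicting the entire working set from cache
Cross-shard aggregation on a sharded cluster: $group / $sort must be merged on mongos, which becomes a single-point bottleneck
$lookup on the hot path: every input doc triggers a query against another collection, which is strictly speaking an N+1 pattern
db.serverStatus().metrics.aggStageCounters spikes, and executionStats.executionTimeMillis grows linearly with the number of docs
The profiler reports usedDisk: true, or the aggregation fails on memory with QueryExceededMemoryLimitNoDiskUseAllowed

Case anchor: the concrete incident details of a report dashboard overloading the primary are left for a future case write-up. This article treats it as a common anti-pattern and does not invent incident numbers. As a side reference, see 9.C30 Microsoft 365, on what drove separating analytics out of MongoDB.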

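Core mechanics

An aggregation pipeline is a sequence of stages: each stage consumes a stream of documents and emits a stream of documents. Stage order directly determines how much work later stages do. Whether the first stage is an IXSCAN or a COLLSCAN, whether $match is pushed to the front or left at the back, and whether $project drops fields early or late all amplify or shrink the downstream cost.

Optimizer rewrite: MongoDB automatically moves $match / $project forward and coalesces $sort + $limit into a top-K sort, but not in every case. Use explain("executionStats") to see the effective pipeline after rewriting; do not infer the actual execution order from the pipeline as written.

Index usage: if the first stage of the pipeline is a $match or $sort that can use an index, it runs as an IXSCAN. The stages in between operate on an in-memory stream and have no notion of an index. So $match should always come first, backed by a matching index.

Memory limit: each aggregation stage has a default 100 MB memory limit; beyond that it needs allowDiskUse: true (the default from 6.0 on). Once disk spill kicks in, I/O slows the aggregation severely, by 50-100x.

$lookup on a sharded cluster: the foreign collection could not be sharded at all before 5.1, with the restriction relaxed from 5.1 on; $lookup is essentially a nested loop join with no hash join or merge join, which makes it unusable for large collections.

$facet runs multiple pipelines side by side, but all facets share one 100 MB limit, so complex facets hit the memory ceiling easily.

$merge / $out: write the result back to a collection (pre-computed view / materialized view). Moving hot analytical queries off the read path this way is the main tool for fixing the anti-pattern.

Related knowledge cards: hot-partition (a side effect of aggregations concentrating reads on one shard), document-store, stale-read (the trade-off of running aggregations from a secondary).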

Operating procedure

Step 0: put the bad pipeline and the good pipeline side by side. A simplified but typical optimization:

// Bad: lookup before match, sort without limit, project at the end
db.orders.aggregate([
  { $lookup: { from: "users", localField: "userId", foreignField: "_id", as: "user" } },
  { $match: { status: "completed", "user.region": "ap-tokyo" } },
  { $sort: { createdAt: -1 } },
  { $project: { _id: 1, total: 1, createdAt: 1 } }
])

// Good: every match that can move forward goes first, sort is paired with limit, project is written early
db.orders.aggregate([
  { $match: { status: "completed" } },
  { $sort: { createdAt: -1 } },
  { $limit: 100 },
  { $lookup: { from: "users", localField: "userId", foreignField: "_id", as: "user" } },
  { $match: { "user.region": "ap-tokyo" } },
  { $project: { _id: 1, total: 1, createdAt: 1, "user.name": 1 } }
])

The difference: the bad pipeline runs the lookup across all of orders and only then filters; the good pipeline filters and takes the top 100 first, runs the lookup on only those 100 docs, and then filters the lookup result. Note that the two are not strictly equivalent, because the good pipeline applies $limit before the region filter and can return fewer than 100 docs. On a large collection the two differ by 50-100x.
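The effect of stage order on lookup count can be reproduced outside MongoDB. The following is a toy Python simulation that assumes in-memory lists stand in for the two collections; it counts lookups only and does not model MongoDB's execution engine, and all data is synthetic.

```python
def bad_pipeline(orders, users):
    """$lookup first: every order triggers a lookup, then filter and sort."""
    lookups = 0
    joined = []
    for order in orders:
        lookups += 1
        joined.append((order, users[order["userId"]]))
    result = [o for o, u in joined
              if o["status"] == "completed" and u["region"] == "ap-tokyo"]
    result.sort(key=lambda o: o["createdAt"], reverse=True)
    return result, lookups


def good_pipeline(orders, users, limit=100):
    """$match, $sort, $limit first: only the top `limit` orders trigger a lookup."""
    matched = [o for o in orders if o["status"] == "completed"]
    matched.sort(key=lambda o: o["createdAt"], reverse=True)
    lookups = 0
    result = []
    for order in matched[:limit]:
        lookups += 1
        if users[order["userId"]]["region"] == "ap-tokyo":
            result.append(order)
    return result, lookups


# Synthetic data: 10,000 orders, 50 users, every fifth user in ap-tokyo.
users = {uid: {"region": "ap-tokyo" if uid % 5 == 0 else "other"} for uid in range(50)}
orders = [{"_id": i, "userId": i % 50, "createdAt": i,
           "status": "completed" if i % 2 == 0 else "pending"} for i in range(10_000)]

bad_result, bad_lookups = bad_pipeline(orders, users)
good_result, good_lookups = good_pipeline(orders, users)
print(bad_lookups, good_lookups, len(bad_result), len(good_result))
```

On this data the first version performs one lookup per order and the second one lookup per limited row. The result sizes also differ, which makes the semantic change of limiting before the region filter visible: the second version returns a subset of the first.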

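Step 1: get the explain plan.

db.coll.explain("executionStats").aggregate([...])

Check stages[] for the effective pipeline after rewriting, executionTimeMillis, the totalDocsExamined / nReturned ratio, and whether usedDisk is set.

Step 2: push $match to the front. The earlier the filtering, the less work for later stages. The optimizer usually does this itself, but a $match that follows a $lookup is not moved ahead of the $lookup automatically, because it depends on fields that only exist after the lookup. When writing the query, put every $match that can go first at the front.

Step 3: build a compound index on the $match fields. Make sure executionStages shows IXSCAN rather than COLLSCAN. Compound index order matters: { status: 1, createdAt: -1 } is efficient for { status: ..., createdAt: { $gte: ... } } but cannot serve { createdAt: { $gte: ... } } on its own.

Step 4: write $sort and $limit together. Only then can the optimizer use top-K (a heap instead of a full sort). A $sort without a $limit does a full sort and easily hits the memory limit.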
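Why top-K is cheaper than a full sort can be shown with Python's standard library. This is an analogy under the assumption that docs are plain dicts with a createdAt field; it illustrates the heap idea and is not MongoDB's implementation.

```python
import heapq


def top_k_newest(docs, k):
    """Return the k newest docs, newest first, holding only k candidates at a time."""
    return heapq.nlargest(k, docs, key=lambda d: d["createdAt"])


def full_sort_newest(docs, k):
    """Sort every doc, then keep k: same answer, but the whole input is held and ordered."""
    return sorted(docs, key=lambda d: d["createdAt"], reverse=True)[:k]


# 10,007 docs with unique, scrambled createdAt values (10007 is prime, so no collisions).
docs = [{"_id": i, "createdAt": (i * 7919) % 10007} for i in range(10007)]
print([d["createdAt"] for d in top_k_newest(docs, 3)])
```

Both functions return the same docs, but the heap version needs memory proportional to k rather than to the input size. That is the property that keeps a $sort followed by $limit well under a per-stage memory limit.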

Step 5: write $project early. Dropping unneeded fields early reduces the doc size later stages have to handle. This is especially effective for large documents.

Step 6: turn the hot analytical pipeline into a materialized view.

db.orders.aggregate([
  { $match: { createdAt: { $gte: ISODate("2026-05-01") } } },
  { $group: { _id: "$customerId", total: { $sum: "$amount" } } },
  { $merge: {
      into: "monthly_customer_summary",
      on: "_id",
      whenMatched: "merge",
      whenNotMatched: "insert"
  }}
])

Refresh it on a schedule (cron, for example every 5 minutes); the application reads the materialized view instead of running the aggregation live.

Step 7: handling sharded clusters. Avoid cross-shard $lookup / $group on the hot path, or route such queries to an analytical replica (tag set + read preference); see replica set read preference.

Verification points:

executionTimeMillis is within the expected budget
The totalDocsExamined / nReturned ratio is close to 1 (efficient filtering)
No usedDisk: true
No stage shows inMemory > 50MB
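These checks can be scripted once the numbers have been pulled out of the explain output. The sketch below assumes a flat dict holding totalDocsExamined, nReturned, and an optional usedDisk flag; where those fields sit in the real explain document varies by server version and pipeline shape, and both the function name and the ratio threshold of 10 are hypothetical.

```python
def check_execution_stats(stats, max_ratio=10.0):
    """Return a list of warnings for one set of explain numbers (empty list means healthy)."""
    warnings = []
    # Guard against nReturned == 0 so an empty result does not divide by zero.
    ratio = stats["totalDocsExamined"] / max(stats["nReturned"], 1)
    if ratio > max_ratio:
        warnings.append(f"examined/returned ratio {ratio:.1f} exceeds {max_ratio}")
    if stats.get("usedDisk", False):
        warnings.append("a stage spilled to disk (usedDisk: true)")
    return warnings


healthy = {"totalDocsExamined": 120, "nReturned": 100}
unhealthy = {"totalDocsExamined": 2_000_000, "nReturned": 100, "usedDisk": True}
print(check_execution_stats(healthy))
print(check_execution_stats(unhealthy))
```

Run against a saved explain result in CI or a scheduled job, a check like this flags a pipeline when data growth pushes it back toward a collection scan or a disk spill.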

Rollback boundary: a pipeline rewrite is an application code change and can be rolled out gradually; a materialized view ($merge) requires a backup of the target collection to be restorable.

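Typical tuning path (200ms → 8s → 250ms)

A common evolution path for a production pipeline:

200 ms at launch: the collection has 100K docs, $match filters out 95%, $lookup runs only 5K times, and the in-memory $sort handles 5K rows within 100 MB
8 s half a year later: the collection has grown to 2M docs, $match still filters out 95% but that now leaves 100K rows, $lookup runs 100K times (5K → 100K is 20x), and the in-memory $sort hits 100 MB and starts spilling to disk, with a 100x I/O slowdown
Adding a compound index does not help: the index serves $match, but the stages after $match ($lookup / $sort) run in the in-memory pipeline, where an index cannot help
The fixes that get to 250 ms: (a) pair $sort with $limit so the optimizer uses top-K and avoids a full sort (b) change the schema to embed and remove the $lookup (see schema design pattern) (c) turn the hot pipeline into a $merge materialized view so the application reads the view instead of running the aggregation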
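The growth arithmetic in this path can be checked with a few lines. The sketch below assumes an average document size of 2 KB, a hypothetical figure chosen only to show when a sort crosses the 100 MB per-stage limit; the function name is illustrative.

```python
STAGE_MEMORY_LIMIT = 100 * 1024 * 1024  # the 100 MB per-stage limit, in bytes


def sort_stage_estimate(total_docs, kept_fraction, avg_doc_bytes):
    """Return (rows after $match, bytes to sort, whether the sort exceeds the limit)."""
    rows = round(total_docs * kept_fraction)
    sort_bytes = rows * avg_doc_bytes
    return rows, sort_bytes, sort_bytes > STAGE_MEMORY_LIMIT


AVG_DOC_BYTES = 2048  # assumption: 2 KB per document

at_launch = sort_stage_estimate(100_000, 0.05, AVG_DOC_BYTES)
half_year = sort_stage_estimate(2_000_000, 0.05, AVG_DOC_BYTES)
print(at_launch, half_year)
```

Under this assumption the launch-time sort handles about 10 MB while the later one handles about 200 MB, so the same pipeline crosses the limit purely because the collection grew, with no change to the query.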

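Key lesson: the cause of a slow aggregation is not the query itself but how the shape of the data evolves. An index is the first lever on the hot path, but it only helps a $match / $sort in the first stage; later stages have to be rescued by stage order, materialized views, and schema denormalization.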

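Failure modes

$lookup on the hot path: a list page queries another collection for every row, and p99 grows linearly with page size. Denormalize at schema design time and embed read-together data back into the aggregate root (see schema design pattern).

$sort without a limit and without an index: a full in-memory sort of the collection hits the 100 MB limit and either fails on memory or spills to disk. allowDiskUse: true avoids the memory failure, but I/O degrades by 100x. The fix is to build a matching index so the sort runs off an IXSCAN, or to add a limit so it runs as top-K.

Cross-shard aggregation on a sharded cluster: in the $group stage all partial results flow to mongos for merging, exhausting mongos memory and CPU. The fix is to include a shard key prefix in the group key (so the grouping completes within each shard), or to route the query to an analytical replica.

Wrong stage order: putting $lookup before $match means running the lookup across the whole collection before filtering, so every input doc triggers a lookup. $match should always come first.

Aggregation evicting the working set: the OLTP hot pages are pushed out of cache by the aggregation's cold scan, and overall query latency degrades together. The fix is to isolate the analytical workload from OLTP reads (read preference tags), or to move the analytical workload elsewhere (see the anti-recommendations below).

$facet overload: four facets each running a large pipeline blow through the shared 100 MB limit immediately. The fix is to split them into independent queries rather than cramming them into one $facet.

Anti-recommendations:

Running reporting / BI / analytics workloads on the MongoDB primary is an anti-pattern. Instead: (a) set up an analytical secondary with a read preference tag (b) use $merge to write into a reporting collection (c) at the next level, use BI Connector or a data lake, or move the analytical workload wholesale to ClickHouse / BigQuery
The typical "report dashboard overloading the primary" anti-pattern: a BI tool connects directly to the MongoDB primary and runs long pipelines, cache eviction pushes out the OLTP working set, and p99 latency rises across the board during reporting hours. No concrete incident numbers were available, so none are invented here; this is described as a common anti-pattern with a pointer to the remediation path
Aggregation does not solve read scaling: aggregation complements OLTP and is not the main route to read scaling. At large OLTP scale, read scaling goes through cache + freshness tokens (see connection management and cache layer), not through overloading a secondary with aggregations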

Capacity and observability

Key metrics:

Distribution of aggregation operation time
Number of disk spills
Share of aggregate within opcounters.command
Change in cache eviction rate during aggregation peaks

Mongo commands:

db.currentOp({ "command.aggregate": { $exists: true } }): aggregations currently running
db.serverStatus().metrics.aggStageCounters: stage-level counters
explain("executionStats"): detailed analysis of a single query

Profiler: db.setProfilingLevel(1, {slowms: 200}); watch the usedDisk flag and numYield.

Back to 4.20 observability evidence: the aggregation slow log, cache hit ratio, and disk spill rate together form the three pieces of evidence for analytical pressure.

Back to 9.5 bottleneck localization: use explain executionStats to map pipeline stages to the bottleneck (IXSCAN or COLLSCAN, in-memory or disk spill, shard-local or mongos merge).

Boundaries and integration

Sibling deep articles:

schema design pattern: an embedded design can eliminate most $lookup stages
shard key selection: determines whether an aggregation is shard-local or cross-shard
replica set read preference: the stale-read trade-off of running aggregations on a secondary
connection management and cache layer: the main cache + read scaling route when a report dashboard overloads the primary

Migration playbook: when the analytical workload is too large to keep mixed into MongoDB, split it out to Cosmos DB MongoDB API + Synapse, or to DynamoDB + Athena/Glue (with the access pattern redesigned).

Cross-references with 1.x: 1.10 KV / Document DB capacity planning lists aggregation as a cost dimension of read shape; 1.1 high-concurrency data access covers the anti-pattern of OLTP and analytical workloads sharing one cluster.
