#database
# 1. Overview
- SQL 是 declarative。需要 DBMS 的优化器来决定最优的执行方式。
- IBM System R
- 1970s **首次实现** query optimizer
- 它的许多概念和设计今天仍然在沿用
- 两种查询优化的方式:
- **Heuristics/Rules**
匹配我们已知的模式,再应用规则
- rewrite query 来移除低效的东西
- 实施时可能需要查看 catalog,但不需要查看数据
- **Const-based Search**
- 使用一个模型来估算执行一个 plan 的 cost
- 评估一个查询的多个等价的 plan,选择 cost 最低的
- 需要以某种方式查看数据
- **Architecture overview**
```mermaid
flowchart TD
subgraph s[" "]
direction TB
应用程序 -- SQL Query --> Rewriter["SQL Rewriter(可选)"] -- SQL Query --> Parser;
Parser -- AST --> Binder -- LogicalPlan --> TreeRewriter["Tree Rewriter(可选)"]
TreeRewriter -- LogicalPlan --> Optimizer -- PhysicalPlan --> Execution
end
subgraph s1[" "]
direction RL
Catalog -.-> Binder
Catalog -. SchemaInfos .-> TreeRewriter
Catalog -. SchemaInfos .-> Optimizer
end
subgraph s2[" "]
direction RL
CostModel["Cost Model"] -. Estimates .-> Optimizer
end
classDef node fill:#f96
```
- **SQL rewriter**: 如分布式数据库
- **Binder**:将 SQL 中 named objects referenced 转换到内部 ID,需要查阅 catalog
- **Tree Writer:** 需要查看 catalog 场景示例:[[14-optimization-1#^ead4c8 ]] 删除冗余谓词(如 WHERE a IS NOT NULL,在 a 是主键时,条件是冗余的)
- **Logical vs. Physical Plans**
- optimizer 生成从 **逻辑关系表达式** 到 **优化的等价的物理关系表达式** 之间的映射。
- physical operator 定义了特定的执行策略
- 取决于数据的物理格式(sorting,compression)
- logical 到 physical 之间不总是一一对应的 (比如 join + order-by 对应 sort-merge-join)
---------
# 2. Relational Algebra Equivalences
- 两个关系代数表达式是等价的,如果它们生成 the same **set** of tuples。
> relation model is unordered
- query rewriting:DBMS生成更好的 query plans,不需要 cost model。
## 2.1 Predicate pushdown
**SQL:**
```
SELECT s.name, e.cid
FROM student AS s, enrolled AS e
WHERE s.sid = e.sid
AND e.grade = 'A'
```
**关系代数:**
$\large \Pi_{name,cid}(\ \sigma_{grade='A'}(student \bowtie enrolled)\ )$
**重写为:**
$\large \Pi_{name,cid}((student \bowtie (\sigma_{grade='A'}(enrolled)))$
> pushdown 不总是好的,比如 predicate 计算很耗时(如UDF),可能等 JOIN 完数据变小再应用谓词会更好。
## 2.2 Selections
- 尽可能早地应用 filters
- **Reorder predicates**,应用 the most-selective one。
- 拆开复杂的 predicate,然后再下推
- $\Large\sigma_{p1\wedge p2 \cdots \wedge pn}(R) = \sigma_{p1}(\sigma_{p2}(\cdots\sigma_{pn}(R)))$
- 拆开的原因:如有的 predicate 计算代价高、列存
- 简化复杂的 predicate
- `(X = Y AND Y = 3)` $\rightarrow$ `X=3 AND Y=3`
## 2.3 Projects
- 尽早执行来使用 tuples 更小,并且减少中间结果
- 减少 operators 之间拷贝的数据量,尤其对于行存系统
- 需要保留必要的 attributes(如 join keys)
- 示例:
- 关系代数 :$\large \Pi_{name,cid}((student \bowtie (\sigma_{grade='A'}(enrolled)))$
- 重写为:$\large \Pi_{s.name,e.cid} (\ \Pi_{sid,name}(student)) \ \bowtie_{s.sid=e.sid} \ \Pi_{sid,cid}( \sigma_{grade='A'}(enrolled))\ )$
## 2.4 More Examples
### Impossible/Unnecessary Predicates
^ead4c8
- **`SELECT * FROM A WHERE 1=0;`**
impossible predicate,可以跳过 scan,直接输出空结果
- **`SELECT * FROM A WHERE 1=1;`**
unnecessary predicate,可以直接忽略 predicate。
重写为 `SELECT * FROM A;`
### Join Elimination
```SQL
SELECT A1.*
FROM A AS A1 JOIN A AS A2
ON A1.id = A2.id;
```
重写为:
```SQL
SELECT * FROM A;
```
### Ignoring Projections
```SQL
SELECT * FROM A AS A1
WHERE EXISTS(SELECT val FROM A AS A2
WHERE A1.id = A2.id);
```
针对 A1 中的每一条记录,检查子查询是否能查出记录,能查出则返回 TRUE,然后 emit A1 的该条记录。
重写为:
```SQL
SELECT * FROM A;
```
### Merging Predicates
```SQL
SELECT * FROM A
WHERE val BETWEEN 1 AND 100
OR val BETWEEN 50 AND 150;
```
重写为:
```SQL
SELECT * FROM A
WHERE val BETWEEN 1 AND 150;
```
## 2.5 Joins
- Join 满足交换律和结合律
- $R \ \bowtie \ S = S\ \bowtie \ R$
- $(R\bowtie S)\bowtie T = R \bowtie (S \bowtie T)$
- N-way Join 有多少种顺序
- [Catalan number](https://en.wikipedia.org/wiki/Catalan_number)
- $\large \approx 4^n$