orthogonal-optimize-subqueries-and-aggregation

# Abstract - 生成可以高效执行的 small primitives，分为正交的两大领域： - correlation removal - outer joins and group-by ---------- # 1. Introduction - **背景**：我们观察到，在子查询执行技术和 GroupBy 求值等其他技术之间存在显著的重叠。 - **改进**： more primitive, 可以独立地进行优化 ## 1.1 Standard subquery execution strategies - SQL 中两种形式的聚合，区别是处理 empty input 时行为不一样： 1. Vector aggregation：按某列分组，如果组为空，则结果为空 - 记为 $\mathcal{G}_{A,F}$，其中 A 为分组列，F 为聚合计算 2. Scalar aggregation：不指定分组列，始终返回一行 - 记为 $\mathcal{G}_F^1$ ```sql -- Q1: 查找总订购金额超过1,000,000的客户 SELECT C_CUSTKEY FROM CUSTOMER WHERE 1000000 < (SELECT SUM(O_TOTALPRICE) FROM ORDERS WHERE O_CUSTKEY = C_CUSTKEY) ``` ### Correlated execution - 最接近 SQL 语句的执行方式： - 针对每一个 customer 计算 total amount，然后过滤指定金额的客户 - 通常被认为是**低效**的方式，因为需要对 customers 表进行逐行处理，而非面向 set - 但是如果 outer table 足够小，或者存在合适的索引，实际上会是最优的策略 ### Outerjoin, then aggregate - 最初由 Dayal 在 A unified approach to processing queries that contain nested subqueries, aggregates, and quantifiers 中提出。 - 为了使用 set-oriented 算法，我们先收集每一个 customer 的所有 orders，然后按 customer 分组进行聚合，最后对聚合结果进行过滤。 ```SQL SELECT C_CUSTKEY FROM CUSTOMER LEFT OUTER JOIN ORDERS ON O_CUSTKEY = C_CUSTKEY GROUP BY C_CUSTKEY HAVING 1000000 < SUM（O_TOTALPRICE） ``` - 使用 outerjoin，可以保留没有订单的 customer，其聚合结果为 NULL。 ### Aggregate, then join - 最初由 Kim 在 [On optimizing an SQL-like nested query](https://dl.acm.org/doi/pdf/10.1145/319732.319745) 中提出 - 直接在 ORDERS 表上聚合，算出每个 customer 的总金额，然后再跟 CUSTOMER 表 join。 - 这允许将聚合条件 push 到 join 下面。 ``` SELECT C_CUSTKEY FROM CUSTOMER, (SELECT O_CUSTKEY FROM ORDERS GROUP BY O_CUSTKEY HAVING 1000000 < SUM(O_TOTALPRICE)) AS AGGRESULT WHERE O_CUSTKEY = C_CUSTKEY ``` - 该 SQL 语句使用了一个 *derived table*，而非 subquery。 ## 1.2 Outer technique: Use primitive, orthogonal pieces - 从上面的策略列表来看，最优的策略取决于表的大小、过滤条件的 selectivity 以及聚合的 reduction factor。 - 我们使用正交的、可重用的 primitives 来导出上述的策略列表，然后使用成本估算从中挑选。 - 图 1 展示了 primitive optimizations 如何生成不同的子查询策略 - 上述的标准策略只是图中可能的两个方框之一。 - 实现一定程度的 syntax-independence <img src="https://img.jonahgao.com/oss/note/2025p1/orthogonal_subquery_f1.png" width="500" /> ^2fb6b2 ### Algebrize into initial operator tree - 相关性的代数表示基于 Apply 算子。 > Apply is a second-order relational construct that abstracts parameterized execution of subexpressions. ### Remove correlations - 包括将 Apply 重写为其他算子如 outerjoin。 - 针对 example query 使用的是 Dayal 提出的策略。 ### Simplify outerjoin - 基于 [Outerjoin simplification and reordering for query optimization](https://dl.acm.org/doi/pdf/10.1145/244810.244812) - 将 outerjoin 简化为 join，满足 null-rejecting 条件下 - 如 1000000 < x 拒绝了 x 的 NULL 值，满足优化条件 - GroupBy 算子添加了 null-rejection 的派生。 ### Reorder GroupBy - [Including Group-By in query optimization](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=30795447cec18753254edbbd7839f0afa58b2a39) - [Eager Aggregation and Lazy Aggregation](https://dl.acm.org/doi/10.5555/645921.673154) - 我们的实现： - adding reordering around outerjoins - 基于聚合函数的抽象属性进行操作，而非只考虑五种标准的 SQL 聚合。 ## 1.3 A useful tool: Represent parameterized execution algebraically ### Apply Construct - 类似 LISP 的 APPLY 或者 MAPCAR 操作符： - Evaluate an expression on a collection of items, and gather the results. - Apply 接收一个 relational input $R$ 和一个 parameterized expression $E(r)$, 对每一个 row $r ∈ R$ 计算表达式 $E(r)$，然后收集结果： - $\large R\ A^⊗ E = \bigcup_{r∈R}(\{r\} ⊗ E(r))$ - $⊗$ 是 joins 中的任意一种（cross product、left outerjoin、left semijoin 等） - 其中 union 操作是 UNION ALL，不去重 ```sql -- Q1: 查找总订购金额超过1,000,000的客户 SELECT C_CUSTKEY FROM CUSTOMER WHERE 1000000 < (SELECT SUM(O_TOTALPRICE) FROM ORDERS WHERE O_CUSTKEY = C_CUSTKEY) ``` - Q1 对应的代数形式： - <img src="https://img.jonahgao.com/oss/note/2025p1/orthogonal_subquery_f2.png" width="450" /> - 每次 parameterized expression 的调用都仅返回一个行，所以 Apply 后仍然保留跟CUSTOMER 一样的基数。 - 通常，output rows 的数量取决于 $E(r)$ 的基数，同时也取决于把结果与 r 合并的 join variant。 - 至少有三个研究/开发小组独立探索了 Apply 在 (sub-)query 处理中的作用： - [Tandem](https://ieeexplore.ieee.org/document/583381)：称为 tuple substituion join. - [d-join](https://dl.acm.org/doi/10.5555/890812) - [Parameterized queries and nesting equivalences. Technical report, Microsoft](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2000-31.pdf) ### SegmentApply Construct - **Apply** 只接受 scalar / row-valued parameters。 - **SegmentApply** 可以接收 table-valued parameters - 输入：关系 $R$ - parameterized expression：$E(S)$ - 来自 R 的 a set of segmenting columns：A - 使用 A 创建 R 的 segments，然后类似 GroupBy 对每一个 segment 执行 $E(S)$ - $\large R\ \,\mathcal{S} \mathcal{A}_{A}\ \ E=\bigcup_{a}(\{a\}\times E(\sigma_{A=a}R))$ - 其中 a 代表了 A 域的所有值 - SegmentApply 可以重写为 Apply，然后任何包含标准 operators 加 Apply 的表达式都可以重写为只包含标准表达式。 ---------- # 2. Representing And Normalizing Subqueries - 本章讨论：将 SQL 子查询生成不使用 correlated execution 的等价 operator tree，其结果将不再包含子查询。 ## 2.1 Direct algebraic representation with mutual recursion - 第一步：parser/algebrizer 将 SQL 语句生成一个 operator tree（同时包含了 relational 和 scalar operators）。 - **Q1** 生成的 operator tree： - <img src="https://img.jonahgao.com/oss/note/2025p1/orthogonal_subquery_f3.png" width="450"/> - 粗体的是 relational nodes。 - 这种表示方式下，scalar operator 可能存在 relational subexpressions 作为其 children。 - 直接执行该 operator tree 意味着 subquery 会采用 nested loops style，同时 relational 和 scalar 两个 execution components 会**相互递归**： 1. relational select 逐行执行 2. 调用谓词的 scalar evalutation 3. 而谓词又调用 subquery 的 relational execution ## 2.2 Algebraic representation with Apply - 通过引入 Apply operators 可以**移除** scalar 和 relational nodes 之间的**递归**。 - 通用的机制：在 scalar expressions 需要子查询的结果之前，显式地执行子查询 - $\odot$ 为 input R 上的一个 relational operator，e 为它的一个 scalar 参数，其中 e 会使用子查询 Q - 我们先使用 Apply 执行 subquery，子查询的结果就会变得可用（一个新列 q），然后替换对子查询的使用： - $\large \odot_{e(q)} \ R\ \rightsquigarrow \ \odot_{e(q)} (R\ \mathcal{A}^⊗ \ Q)$ - 将图3 转换为图2，图2 不再出现 scalar 表达式。 - 在 relational select 的下面引入 $\mathcal{A}^×$ operator 来计算子查询，结果存储在列 X 中。 - 我们展示了如何从一个 scalar expression 中移除一个子查询，但是该技术也适用于多个子查询。 - 总结： - 仍然是基于 nested loops 来执行 - 但移除了 scalar 和 relational execution 之间的递归调用，可以提升性能和简化实现。 > Scalar evaluation never needs to call back into the relational engine. ## 2.3 Removal of Apply > The process consists of pushing down Apply in the operator tree, towards the leaves, until the right child of Apply is no longer parameterized off the left child. <img src="https://img.jonahgao.com/oss/note/2025p1/orthogonal_subquery_f4.png" width="600"/> ## 2.4 All SQL subqueries - 对于布尔类型的子查询，如 EXISTS，NOT EXISTS，IN subquery 和 quantified comparsions，子查询可以重写为一个 scalar COUNT 聚合。 > From the utilization context of the aggregate result, either equal to zero or greater than zero, it is possible for the aggregate operator to stop requesting rows as soon as one has been found, since additional rows do not affect the result of the comparison.