13-query-execution-2 - JonahGao's Notes

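#database

# 1. Background

- **Benefits of parallel query execution:**
    - Performance (throughput, latency)
    - Responsiveness and availability (a single thread blocks on I/O)
    - Lower TCO (hardware, software licenses, power consumption, etc.)
- **Parallel vs. Distributed**

**What they have in common:**
- The database is spread across multiple resources to improve parallelism
- Both appear to the user as a single database

**Parallel DBMSs:**
- Resources are physically close to each other
- Resources communicate over a high-speed interconnect
- Communication is cheap and reliable

**Distributed DBMSs:**
- Communication cost and communication failures cannot be ignored

--------

# 2. Process Models

Process model: how the system is architected to serve concurrent requests from multiple users.

## 2.1 Process per DBMS Worker

Each worker is a separate OS process.
- Relies on the OS scheduler
- Global data structures (such as the buffer pool) must live in shared memory
- A crash in one process does not take down the whole system

Products: IBM DB2, Postgres, Oracle

## 2.2 Process Pool

A worker uses any free process from the pool.

Products: IBM DB2, Postgres (2015)

## 2.3 Thread per DBMS Worker

A single process with multiple worker threads.
- The DBMS manages its own scheduling
- There may or may not be a dispatcher thread
- A thread crash can take down the whole system

Advantages:
- Lower context-switch overhead
- No need to manage shared memory

Products: IBM DB2, MSSQL, MySQL, Oracle (2014)

---------

# 3. Execution Parallelism ^e287a2

## 3.1 Inter-Query

- Different queries execute concurrently
- Can increase throughput and reduce latency

## 3.2 Intra-Query

- The operations of a single query execute in parallel
- Can reduce the latency of long-running queries

Three approaches:
- intra-operator (horizontal)
- inter-operator (vertical)
- bushy

The approaches can also be combined.

### Intra-Operator

- **How:**
    - Multiple threads access centralized data structures
    - Use partitioning to split the input data, and process the different parts in parallel
- **Exchange operator**

Types of exchange operator:
- **Gather**: combines the results from multiple workers into a single output stream
- **Repartition**: many-to-many
- **Distribute**: one-to-many

Example: insert an exchange operator into the query plan to combine the results of its child operators (Gather).

### Inter-Operator

> The DBMS overlaps operators in order to pipeline data from one stage to the next **without materialization**.

Also called pipelined parallelism.

### Bushy

- An extension of inter-operator parallelism
- Workers simultaneously execute multiple operators from different segments of one query plan (different workers execute **different parts** of the same query plan).
- Exchange operators are still needed to combine the intermediate results of the different segments.
- Example: A JOIN B JOIN C JOIN D can be split into A JOIN B and C JOIN D, which run in parallel.

-----------------

# 4. I/O Parallelism

- If the disk is the main bottleneck, adding more processes/threads to execute in parallel will not help.
- Split the DBMS installation across multiple storage devices:
    - Multiple disks per database
    - One database per disk
    - One table per disk
    - Split a table across multiple disks
- Spread the DBMS files across multiple storage devices, transparently to the DBMS (the DBMS still sees what looks like a single disk)
    - Storage appliances
    - RAID
- Database Partitioning
    - Some DBMSs let you specify the disk location of each individual database (the buffer manager must know which disk location each page maps to)
    - This is easier to do at the file-system level (use a separate directory per database); the log file may need to be shared
- Table Partitioning

Partitioning should ideally be transparent to the user (some distributed databases do not achieve this). Two approaches:
- Vertical partitioning: split by column
- Horizontal partitioning: split by row
    - Hash partitioning
    - Range partitioning
    - ==Predicate partitioning==

-------------

# Conclusion

Parallel execution is hard to implement correctly:
- Coordination overhead
- Scheduling
- Concurrency issues
- Resource contention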
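
As a supplement to the Intra-Operator notes in section 3.2 and the hash partitioning mentioned in section 4, here is a minimal Python sketch of a partitioned `GROUP BY ... COUNT(*)` with a Gather step. It is only an illustration under stated assumptions, not how any particular DBMS implements it: the function names (`hash_partition`, `partial_count`, `gather`, `parallel_group_count`) and the sample rows are hypothetical. Because of the CPython GIL, the threads here show the structure of the plan rather than a real CPU speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def hash_partition(rows, key, n_parts):
    """Distribute (one-to-many): split one input stream by hashing the key."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts


def partial_count(rows, key):
    """The operator instance that each worker runs on its own partition."""
    return Counter(row[key] for row in rows)


def gather(partials):
    """Gather: combine the workers' results into a single output."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


def parallel_group_count(rows, key, n_workers=4):
    """Hypothetical plan: Distribute -> per-worker count -> Gather."""
    parts = hash_partition(rows, key, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(lambda part: partial_count(part, key), parts))
    return gather(partials)


# Made-up sample data, for illustration only.
rows = [{"dept": d} for d in ["db", "os", "db", "ml", "os", "db"]]
result = parallel_group_count(rows, "dept")
print(result)
```

Hashing on the grouping key sends all rows with the same key to the same worker, so each partial count is already final for its keys and the Gather step only has to merge them. Which worker a given key lands on can differ between runs (Python randomizes string hashes per process), but the merged counts do not depend on that assignment.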