fast-tree-search-paper - JonahGao's Notes

- Paper: [FAST: fast architecture sensitive tree search on modern CPUs and GPUs](https://dl.acm.org/doi/10.1145/1807167.1807206) # 1 Intro - 目标：In-memory tree structred index search - binary tree - 优化点：page size、cache line size、SIMD width。 - 结构上提升cache利用率 - 如何消除 latency 的负面影响： latency hiding - hierarchically blocked tree、software pipeling、prefetches。 - 利用多核并发和 SIMD - fast bulk updates： > we can re-build our index trees in less than 0.1 seconds for 64M keys on both CPU and GPU - CPU search becomes close to memory bandwidth bound as the tree size grows. - 解决方式：prefix compression，使用 SSE SIMD加速 # 2 RELATED WORK - 一个 tree node 多大： - cache line - 更大（比如大于512字节）：TLB miss会小 - 利用prefetch减少cache miss - 从 node 到它的 children 会承受 a full cache miss，prefetch 所有children不高效，因为 fallout 会很大。 - 因此等到知道需要移动到那个 child，只 prefetch 这一个 child。 - 为了增加 preftech distance（从发起 prefetch 到访问数据之间的指令数），使用 software pipelining。 - 使用 SIMD P-ary 加速 search 操作。 - memory 和 processor core 之间的传输单元是一个 cache line。 - compression技术可以缩短key的长度，增加tree的fanout，减小树高，从而较少cache miss提升search 性能。 # 3 ARCHITECTURE OPTIMIZATIONS - ILP: instruction-level parallelism - TLP: thread-level parallelism - DLP: data-level parallelism ## 3.1 ILP prefetch: 减少内存访问的延迟 ## 3.2 TLP/DLP 不同的线程并行执行不同查询 SIMD ## 3.3 Memory Bandwidth - 需要确保一个cache line在被淘汰前能够被充分利用 - cache line没被充分利用就意味着需要频繁访问内存，结果就是 bandwidth-bound - 数据压缩可以在一个cache line里打包更多数据，避免 bandwidth-bound # 4 ARCHITCTURE SENSITVIE TREE ## 4.1 Motivation - search 操作时树的每层只会访问一个 node，导致 cache line的使用不高效。 - 每次下降，都可能访问不同的内存page，引发TLB miss。 ## 4.2 Hierarchical Blocking ## 4.3 Compute And Bandwidth Analysis - 优化技术-software pipeline: - 跨cache line block访问时，先执行prefetch把下一个要访问的cache line block加载到cache中， - 然后在等待prefetch生效期间转去执行其他查询请求的比较计算，充分利用资源。 - 减少节点的大小：4字节keys，4字节rid # 5 CPU SEARCH VS GPU SEARCH ## 5.1 CPU Implementation - 每个CPU core 128位 SIMD（SSE）计算单元，每个SSE指令可以并行操作4个32位数据。 - Building The Tree： - 构建过程可以并行 - 构建速度：64M tuples i7 0.1seconds - 如何更新：缓存修改，到一定程度整体重建整个索引结构 - Traversing The Tree： - SIMD friendly tree traversal algorithm 对SIMD友好的遍历算法 ---- # 其他 - 代码实现：[https://db.in.tum.de/~leis/index/fast.cpp](https://db.in.tum.de/~leis/index/fast.cpp)