- Paper: [Are You Sure You Want to Use MMAP in Your Database Management System?](https://db.cs.cmu.edu/mmap-cidr2022/)
# ABSTRACT
- Advantages of mmap:
  - No copying from kernel space to user space
> From a performance perspective, mmap should also have much lower overhead than a traditional buffer pool. Specifically, mmap does not incur the cost of explicit system calls (i.e., read/write) and avoids redundant copying to a buffer in user space because the DBMS can access pages directly from the OS page cache
- mmap can cause two kinds of problems:
  - Correctness (data safety) and performance
> these problems involve both data safety and system performance concerns.
----
# 1 INTRODUCTION
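- TLB shootdown:
  - When a page is evicted, its mapping must be removed from both the page table and the TLB of every core.
  - Current CPUs do not provide coherence for remote TLBs, so the OS must issue an expensive inter-processor interrupt to flush them.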
> When evicting pages, the OS also removes their mappings from both the page table and each CPU core’s TLB. Flushing the local TLB of the initiating core is straightforward, but the OS must ensure that no stale entries remain in the TLBs of remote cores. Since current CPUs do not provide coherence for remote TLBs, the OS has to issue an expensive inter-processor interrupt to flush them, which is called a TLB shootdown [11]
-------
# 2 BACKGROUND
## POSIX API
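- madvise:
  - MADV_NORMAL: the default hint. On a page access the OS also fetches the next 16 and previous 15 pages, so with 4 KB pages a single access reads 128 KB (32 pages).
  - MADV_RANDOM: reads only the page that is actually needed.
  - MADV_SEQUENTIAL: suits sequential scans in OLAP workloads.
- mlock:
  - Pins pages in memory so that the OS does not evict them.
  - In the Linux implementation, however, the OS may still flush dirty pages to the backing file even when the pages are pinned.
- msync:
  - Flushes the specified memory range to secondary storage, which guarantees that the updates are persisted.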
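A minimal sketch of how these calls fit together, written in Python because its `mmap` module wraps the same POSIX calls (`mm.madvise` maps to `madvise`, `mm.flush` to `msync`). The helper names `readahead_bytes`, `update_first_page`, and `demo` are made up for these notes, and `madvise` is only available on Python 3.8+ on platforms that support it, so the code guards for that.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE  # OS page size; commonly 4096 bytes


def readahead_bytes(page_size=4096, next_pages=16, prev_pages=15):
    """Bytes fetched under MADV_NORMAL: the accessed page plus its neighbours."""
    return (1 + next_pages + prev_pages) * page_size


def update_first_page(path, payload):
    """Map a file, hint random access, modify page 0, and flush it."""
    with open(path, "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0)  # shared read-write mapping of the whole file
        try:
            # madvise() and the MADV_* constants depend on Python version and OS.
            if hasattr(mm, "madvise") and hasattr(mmap, "MADV_RANDOM"):
                mm.madvise(mmap.MADV_RANDOM)  # ask for no read-ahead
            mm[0:len(payload)] = payload      # dirties page 0 in the page cache
            mm.flush(0, PAGE)                 # msync on POSIX: write page 0 back
        finally:
            mm.close()


def demo():
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\0" * (4 * PAGE))  # four zeroed pages as the backing file
        os.close(fd)
        update_first_page(path, b"hello")
        with open(path, "rb") as f:
            return f.read(5)
    finally:
        os.remove(path)
```

Note that nothing here stops the OS from writing the dirty page back before `flush` is called; `flush` only guarantees that the write-back has happened by the time it returns.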
## MMAP Gone Wrong
- Compression cannot be used:
  - With mmap, the data layout on secondary storage must be identical to the layout in memory.
> the in-memory data layout needed to match the physical representation on secondary storage
------
# 3 PROBLEMS WITH MMAP
## 3.1 Problem 1: Transactional Safety
- The OS may flush a dirty page at any time, even if the write transaction that modified the page has not yet committed.
- mmap-based DBMSs must therefore employ complicated protocols to ensure that transparent paging does not violate transactional safety guarantees.
- Three classes of approaches:
  - OS copy-on-write
  - User-space copy-on-write
  - Shadow paging
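A rough sketch of the user-space copy-on-write idea as I understand it from the paper: pages are copied out of mmap-backed memory into a private buffer, updates touch only the copies and are logged, and the copies are written back only after commit. The class `UserSpaceCow`, the list standing in for the WAL, and the anonymous mapping are all simplifications of my own, not the paper's or any DBMS's actual implementation.

```python
import mmap

PAGE = 4096  # assumed page size for this sketch


class UserSpaceCow:
    """Hypothetical helper: stage page updates in private copies until commit."""

    def __init__(self, mapped, wal):
        self.mapped = mapped  # mmap-backed memory (anonymous here for simplicity)
        self.wal = wal        # stand-in for a durable write-ahead log
        self.copies = {}      # page number -> private bytearray copy

    def write(self, page_no, offset, data):
        # Assumes offset + len(data) stays within one page.
        if page_no not in self.copies:
            start = page_no * PAGE
            # First write to a page copies it out of the mapped region.
            self.copies[page_no] = bytearray(self.mapped[start:start + PAGE])
        self.copies[page_no][offset:offset + len(data)] = data
        self.wal.append((page_no, offset, bytes(data)))  # log the change

    def commit(self):
        # A real system would force the WAL to storage before this point.
        for page_no, copy in self.copies.items():
            start = page_no * PAGE
            self.mapped[start:start + PAGE] = bytes(copy)
        self.copies.clear()

    def abort(self):
        self.copies.clear()  # the mapped pages were never modified


def demo():
    mapped = mmap.mmap(-1, 2 * PAGE)  # anonymous, zero-filled mapping
    txn = UserSpaceCow(mapped, wal=[])
    txn.write(0, 0, b"abc")
    before = bytes(mapped[0:3])  # mapped page is unchanged before commit
    txn.commit()
    after = bytes(mapped[0:3])
    mapped.close()
    return before, after
```

Because uncommitted changes live only in the private copies, an early flush by the OS can never write uncommitted data to the backing file. The cost is that the DBMS now keeps a second copy of every page it modifies.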
## 3.2 Problem 2: IO Stalls
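- A DBMS that manages its own I/O can issue asynchronous reads (for example, prefetching upcoming leaf nodes during a B+ tree scan); mmap does not support asynchronous reads.
- mmap evicts pages transparently, so the DBMS cannot know whether a page is in memory, and any read may hit an unexpected I/O stall.
- Limits of mlock: the number of pages each process may pin is limited, and the DBMS must remember to unpin pages.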
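To make the mlock limit concrete, here is a small sketch that reads the per-process locked-memory limit (`RLIMIT_MEMLOCK`, reported in bytes) and converts it to a page count. It assumes a Unix-like system where Python's `resource` module exposes `RLIMIT_MEMLOCK`; the function names are made up for these notes.

```python
import mmap
import resource


def max_pinnable_pages(soft_limit, page_size=mmap.PAGESIZE):
    """Pages that fit under a locked-memory soft limit; None means unlimited."""
    if soft_limit == resource.RLIM_INFINITY:
        return None
    return soft_limit // page_size


def current_mlock_budget():
    # RLIMIT_MEMLOCK caps how many bytes this process may lock into RAM.
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return max_pinnable_pages(soft)
```

For example, a 64 KB soft limit with 4 KB pages leaves room for only 16 pinned pages, far too few to rely on pinning as a general way to avoid I/O stalls.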
## 3.3 Problem 3: Error Handling
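- With mmap, the checksum must be verified on every page access, because the OS may have evicted the page since the last access.
- Page contents cannot be validated before they are written to storage (code in a memory-unsafe language may have modified a page by accident).
- I/O errors on mmap-backed memory must be handled through SIGBUS signals, which is indirect and cumbersome.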
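A sketch of what "verify on every access" could look like. The page layout is an assumption made up for these notes (the last 4 bytes of each page hold a CRC32 of the rest), and `seal_page` and `read_page` are hypothetical helpers, not an API from the paper.

```python
import mmap
import struct
import zlib

PAGE = 4096      # assumed page size
BODY = PAGE - 4  # hypothetical layout: last 4 bytes hold a CRC32 of the body


def seal_page(body):
    """Pad the body (assumed to fit in BODY bytes) and append its CRC32."""
    padded = body.ljust(BODY, b"\0")
    return padded + struct.pack("<I", zlib.crc32(padded))


def read_page(mapped, page_no):
    """Return the page body, verifying the checksum on every access."""
    start = page_no * PAGE
    page = mapped[start:start + PAGE]
    (stored,) = struct.unpack("<I", page[BODY:])
    if zlib.crc32(page[:BODY]) != stored:
        raise IOError(f"checksum mismatch on page {page_no}")
    return page[:BODY]


def demo():
    mapped = mmap.mmap(-1, 2 * PAGE)  # anonymous mapping stands in for a file
    mapped[0:PAGE] = seal_page(b"tuple data")
    ok = read_page(mapped, 0).rstrip(b"\0")
    mapped[10:11] = b"X"  # simulate corruption inside page 0
    try:
        read_page(mapped, 0)
        detected = False
    except IOError:
        detected = True
    mapped.close()
    return ok, detected
```

A buffer pool can run this check once, when it reads the page from storage. With mmap the DBMS cannot tell when a page was re-read from storage, so the check has to run on every access.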
## 3.4 Problem 4: Performance Issues
- The OS's page eviction mechanism scales poorly on high-bandwidth storage devices such as SSDs.
> we have found that the OS’s page eviction mechanisms cannot scale beyond a few threads for larger-than memory DBMS workloads on high-bandwidth secondary storage devices
- Three bottlenecks of mmap-based file I/O:
  1. Page table contention
  2. Single-threaded page eviction
  3. TLB shootdowns