cache - JonahGao's Notes

# Cache 类 - cache 公共接口类。 - cache 维护了 key(slice) 和 cache object 之间的映射。 - 按 key 插入，按 key 查询。 - **`Priority`** 枚举：优先级越高，淘汰的可能性越低 - `HIGH`：设置 `cache_index_and_filter_blocks_with_high_priority`后，适用于 SST 的 metablocks（如 index 和 filter blocks） - `LOW`：适用于普通的 data blocks - `BOTTOM`：使用于 BlobDB blob values. - **CacheItemHelper** - 跟 cache object 绑定的操作或属性。 - 关键成员： - **`del_cb`**：`DeleterFn`，object 被淘汰后如何删除它 - **`role`**：`CacheEntryRole`，object 的类型，如 `kDataBlock`，`kFilterBlock`等 - **Handle** - 一个 object 存储在 cache 中的句柄，通过它可以对 object 进行操作，如释放、获取 value，一般以指针形式出现 - 定义：`struct Handle {}`，实现类中会继承它，例如 `LRUHandle` ## API 函数 ### Insert() - 参数： - **`key`**: `const Slice&`， cache key - **`obj`**: `ObjectPtr`，cache object （`void *`指针） - 不能为 nullptr，nullptr 预留用于 secondary cache。 - **`helper`**：`const CacheItemHelper *`，object 关联的 helper - **`charge`**：`size_t`，object 占据的容量，例如 data block 的大小 - **`handle`**：`Handle **`，返回的 cache handle - **`priority`**：`Priority`，优先级，默认 LOW - 将 key->obj 映射插入到 cache 中 - 插入成功，则将 obj 的所有权转移给 cache，由 cache 负责释放； - 插入失败，则 obj 的所有权仍由 caller 管理。 ### Lookup() - 参数： - **`key`**: `const Slice&`， cache key - 返回值： `Handle *` - 查找 key 对应的 cache handle。 - 存在的话会对 cache handle 的引用计数加 1。 ### Release() - 参数： - **`handle`**: `Hanlde *` - **`erase_if_last_ref`**: `bool`，默认 false - 释放 cache handle（引用计数减一） - 释放后，可能仍然会在 cache 中存在，后续还能查到； - 如果设置了 `erase_if_last_ref`，当无引用时会主动删除 ### Value(handle) - 获取 handle 关联的 ObjectPtr。 ### Erase(key) - 删除 key 对应的 cache。 - 如果 object 还有引用则会等到现存的 cache handles 都 Relase() 后才能删除。 ### DisownData() - 析构时不释放 cache 分配的内存，用于加速 shutdown。 --- # SharedCache - 按 key hash 到多个 cache shard/分片，减少锁竞争。 - 每个分片是个 cache ，均分 capacity。 - 每个 shard 如 LRUCacheShard 内部有锁 - 分片数量取决于参数：`ShardedCacheOptions::num_shard_bits` - 数量等于 $2^{num\_shard\_bits}$ - 如果 `num_shard_bits` 小于 0 （默认），则根据 capacity 计算分片数量 - 分片数量等于 capacity / 512 KB，但不超过 64 个。 - 见 GetDefaultCacheShardBits() 函数 - 使用 key hash 的低 `num_shard_bits` 位，定位 shard。 - 核心成员： - **`shards_`**：`CacheShard *`，所有的分片，构造时初始化 - CacheShard 是模版类，对应 shard 的具体实现，例如 LRUCacheShard - SharedCache 类主要就是将操作路由到 cache shard。 --- # LRUCache ## LRUHandle - 代表一个 cache entry，继承了 Cache::Handle。 - 主要成员： - **`value`**：`Cache::ObjectPtr`，cache 的对象 - **`helper`**：`const CacheItemHelper`，关联的 helper，Insert 时指定 - **`next_hash`** ：`LRUHandle *`，Hashtable 内部冲突排解链的单链表后继 - **`next`**, **`prev`**：`LRUHandle *`，LRU 双链表指针 - **`total_charge`**：`size_t`， cache object 自身的 charge 加上元数据的大小 - 包含 LRUHandle 自身大小、key 的大小，即实际内存占用 - **`key_length`**：`size_t`，key 的长度 - **`key_data`**：`char []`， key 的数据 - **`hash`**：`uint32_t`，key 的 hash 值，缓存避免重复计算 hash 值 - **`refs`**：`uint32_t`，引用计数（外部， cache 自身不算入） - **`m_flags`**：`uint8_t`，可变的 flags，修改需要加锁，标志列表： - `M_IN_CACHE`：位于 hashtable 中 - `M_HAS_HIT`：被 Loopup 访问过 - `M_IN_HIGH_PRI_POOL`：位于在 high-pri pool 中 - `M_IN_LOW_PRI_POOL`：位于 low-pri pool 中 - **`im_flags`**: `uint8_t`，不可变的 flags，标志列表： - `IM_IS_HIGH_PRI`：high-pri entry - `IM_IS_LOW_PRI`：low-pri entry - `IM_IS_STANDALONE`：特殊 entry，不实际插入，只占用 charge - LRUHandle 的三种状态： 1. `refs >= 1 && in_cache == true` - 有外部引用，在 hash table 中。此时 entry 一定==不在== LRU list 中。 2. `refs == 0 && in_cache == true` - 无外部引用，在 hash table 中，且也在 LRU list 中，可以被淘汰释放。 3. `refs >= 1 && in_cache == false` - 有外部引用，不在 hash table 中，也不在 LRU list 中。 - 引用变成 0 后需要释放。 > [!summary] state > - `in_cache` 表示是否在 hashtable 中，即 Lookup 能否找到； > - LRU 链表仅存放 hashtable 中引用为 0 的 entries（可以被淘汰释放）； ## LRUHandleTable - 存储 LRUHandle 的 hashtable。 - 开散列结构，负载因子为 1。 ## LRUCacheShard - SharedCache 的一个 LRUCache 的分片。由 hashtable 和 LRU 链表两部分组成。 - 核心成员： - **`lru_`**：`LRUHandle`，LRU 链表的 dummy head - 它的 prev 是 newest entry - next 是 oldest entry（淘汰位置） - **`lru_low_pri_`**: `LRUHandle *`，指向 LRU 链表中 low-pri pool 的 head - **`lru_bottom_pri_`**: `LRUHandle *`，指向 LRU 链表中 bottom-pri pool 的 head - **`table_`**：`LRUHandleTable`，hash table，按 key 索引 handle - **`lru_usage_`**：`size_t`，LRU 链表中 entries 占用的内存大小 - 这些 entries 位于 cache （hashtable）中，但是引用计数都为 0 ，可以被淘汰释放 - **`usage_`**：`size_t`，cache 中所有 entries 占用的内存大小 - 除了 `lru_usage_`，还包含了 pinned usage 即引用计数不为 0 的 entries - **`high_pri_pool_usage_`**: `size_t`，high-pri pool 当前大小 - **`low_pri_pool_usage_`**: `size_t`，low-pri pool 当前大小 - **`mutex_`**: cache 的操作基本都需要加锁 - entries 的分布及 usage 的计算 - ![](https://img.jonahgao.com/oss/note/2025p2/rocksdb_lru_sets.svg) - 只淘汰 LRU 链表中元素（即无引用计数的） - 元素从 H -> L：Release 操作 - 元素从 H->R：Erase操作或者Insert时key相同被替换 - 元素从 L -> H-L：Lookup 操作增加引用 - LRU 链表结构： - ![](https://img.jonahgao.com/oss/note/2025p2/rocksdb_lru_list.svg) - 仅展示了 next 指针，每个 node 是一个 LRUHandle - high-pri：`lru_`之前的 node 为最新 - low-pri 和 bottom-pri head 为最新 ### MaintainPoolSize() - 确保 `high_pri_pool_usage_` 不大于 `high_pri_pool_capacity_` - 确保 `low_pri_pool_usage_` 不大于 `low_pri_pool_capacity_` - 各个 pool 的容量：总 capacity 乘以 LRUCacheOptions 中的 ratio 参数，默认 - `high_pri_pool_ratio` 为 0.5 - `low_pri_pool_ratio` 为 0.0 - 当 `high_pri_pool_usage_` 过大时，不断将末端（oldest）entries 溢出到 low-pri pool。 - ![](https://img.jonahgao.com/oss/note/2025p2/rocksdb_lru_maintain.svg) - 即前移 `lru_low_pri_` head node，扩张 low-pri pool。 - `lru_low_pri_ = lru_low_pri_->next` - low-pri pool 的溢出也类似。 ### LRU_Insert(e) - 将 `LRUHanle *e` 插入到 LRU 链表中。 - 目标 pool 的选择： - high-pri：`high_pri_pool_ratio_ > 0 && (e->IsHighPri() || e->HasHit())` - low-pri：`low_pri_pool_ratio_ > 0 && (e->IsHighPri() || e->IsLowPri() || e->HasHit())` - bottom-pri：其余情况 - high-pri pool 的插入：插入到 `lru_` 之前 - ![450](https://img.jonahgao.com/oss/note/2025p2/rocksdb_lru_insert_high.svg) - low-pri pool 和 bottom-pri pool：插入到 head 之前，新节点成为新 head - ![450](https://img.jonahgao.com/oss/note/2025p2/rocksdb_lru_insert_low.svg) - high-pri 和 low-pri 插入后，需要调用 MaintainPoolSize() 避免 pool 太大超过其 capacity。 ### LRU_Remove(e) - 从 LRU 链表中删除一个节点。 - 如果 e 是 low-pri 或 bottom-pri 的 head，则需要更新 head 为前驱节点。 - 并更新相应的 usage（总的和 pool 级别的） ### EvictFromLRU(charge) - 从 LRU 链表中淘汰节点，直到 `usage_ + charge <= capacity_` 或者链表为空。 - 淘汰位置：`lru_` 后驱节点 - 并从 `table_` 中删除被淘汰 entry - 如果 pinned usage 过多，LRU 链表可能会全部淘汰完。 ### Insert() ```mermaid stateDiagram-v2 [*] --> EvictFromLRU EvictFromLRU: **EvictFromLRU()** 淘汰 LRU 直到能容纳 charge 或者全部淘汰完 state HasCapacity <<choice>> EvictFromLRU --> CheckCacacity HasCapacity --> InsertEntry: 是 HasCapacity --> HandlerChoice: 否 CheckCacacity: 容量是否足够容纳 charge CheckCacacity --> HasCapacity state HandlerChoice <<choice>> HandlerChoice --> NoHandler: 不引用 HandlerChoice --> CheckStrict: 引用 HandlerChoice: Insert是否需要引用Handler NoHandler: 不插入直接释放 handler，返回 Ok（即使插入也因为无引用马上被淘汰） NoHandler --> [*] state CheckStrict <<choice>> CheckStrict --> YesStrict: true CheckStrict --> InsertEntry: false CheckStrict: 检测 strict_capacity_limit_ YesStrict: 失败，返回 MemoryLimit 错误 YesStrict --> [*] State InsertEntry { InsertHash: 插入到 Hashtable 中 InsertHash --> CheckOverwrite CheckOverwrite: 是否是覆盖写入 state OverwriteChoice <<choice>> OverwriteChoice --> FreeOld: 是 OverwriteChoice --> FinsihInsert: 否 CheckOverwrite --> OverwriteChoice FreeOld: 取消设置老 entry 的 in_cache 标记，如果无引用则从 LRU 链表删除并释放 FreeOld --> FinsihInsert FinsihInsert: 如果Insert不引用Handler，则将新 entry 放入 LRU，否则增加引用并返回Handler。返回 ok FinsihInsert --> [*] } ``` ### Lookup() - 从 cache 中查找 key 对应的 handle 1. 从 hash table 中按 key 查找； 2. 找到后如果之前无引用，则说明在 LRU 链表中，需要从中删除； 3. 增加引用，设置 Hit 标记，返回； ### Release(handle) - 释放引用计数 ```mermaid stateDiagram-v2 [*] --> Unref Unref: handle 引用计数减一 state CheckRef <<choice>> Unref --> CheckRef CheckRef --> WasInCache: 引用计数变为 0 CheckRef --> NotFree: 引用计数大于 0 NotFree: 不释放 handle，返回 false NotFree --> [*] state WasInCache <<choice>> WasInCache --> DoFree: 不在cache中 WasInCache --> RemoveChoice: 在cache中 WasInCache: 检查 e->InCache() DoFree: 释放 handle，返回 true DoFree --> [*] state RemoveChoice <<choice>> RemoveChoice: 容量满了或者 erase_if_last_ref==true RemoveChoice --> RemoveEntry: 是 RemoveChoice --> PutLRU: 否 RemoveEntry: 从 hashtable 中删除，设置 in_cache 为 false RemoveEntry --> DoFree PutLRU: 不从 cache 中删除，放入 LRU 淘汰链表 PutLRU --> NotFree ``` ### Erase(key) - 从 cache 中删除 key 对应的 entry 1. 从 hashtable 中删除 key 对应的 handle 2. 如果 handle 存在且无引用，说明存在 LRU 链表中，则从中删除 ### 总结 - 只能淘汰 cache 中无引用的 entries（位于 LRU 链表中的） > [!NOTE] 优先级 > - 只保证较高优先级的占用不超过预定的比例，超过则溢出到较低优先级； > - 最低优先级占用的容量则无比例限制，只是高优先级插入时如果容量不够，会优先淘汰它们； > [!warning] strict_capacity_limit = true > - LRU 淘汰完后，总的 usage_ + charge 仍然超出容量限制，则 Insert 会失败； > - 比较极端的场景，pinned usage + charge 超过总容量限制 > - 返回失败后，会导致上层的**读操作也失败** --- # block cache - [Clarify caching behavior for index and filter partitions](https://github.com/facebook/rocksdb/pull/9068) - index/filter partition blocks 始终使用 block cache，不受 `cache_index_and_filter_blocks` 影响，但受 `cache_index_and_filter_blocks_with_high_priority` 影响。 - `BlockBasedTable::GetCachePriority()` - top-level index / filter block 的管理和读取： ```mermaid stateDiagram-v2 [*] --> cached cached: cache_index_and_filter_blocks state IsCached <<choice>> cached --> IsCached IsCached --> CachedYes: true IsCached --> CachedNo: false CachedYes: pin_top_level_index_and_filter state IsPinned <<choice>> CachedYes --> IsPinned IsPinned --> PinYes: true IsPinned --> PinNo: false PinYes: TableReader 持有 cache handle，常驻 block cache，每次直接读取该 handle PinNo: 可能存在于 block cache，不存在（被淘汰）会读取磁盘 CachedNo: 文件打开时是否 prefetch top-level filter/index state IsPrefetch <<choice>> CachedNo --> IsPrefetch IsPrefetch --> PrefetchYes: true IsPrefetch --> PrefetchNo: false PrefetchYes: TableReader 直接持有 block，占用内存（总量取决于 max_open_files)，每次读取不涉及 IO 和 cache PrefetchNo: 每次使用都要从磁盘读取 ``` - **`ReadOption::fill_cache`** - 为 false时，读取过程中加载的 blocks 不放入 cache - 典型应用：迭代 compaction 的输入 - **`pin_l0_filter_and_index_blocks_in_cache`** - 为 true 时将 L0 的 filter 和 index blocks pin 到 cache 中 - ==限制==：只针对小于 `1.5 * write_buffer_size` 的 L0 文件，避免 pin memory 过大 - [make L0 index/filter pinned memory usage predictable #6911](https://github.com/facebook/rocksdb/pull/6911/) - 例如 intra-L0 compaction 产生的大 L0 文件或者注入的 ---- # TODO - [x] pin l0 如何实现的 - TableReader CreateIndexReader() 时设定 pin top-level index - CacheDependencies() 中让 reader 持有 partitions 的 cache handles - [ ] PinnableSlice - [ ] 多个线程同时 insert cache（虽然有锁），是否可能重复插入 - [ ] cache key 冲突： - Issue: [SST files overlap at a level and keys are not ordered in a SST file #7405](https://github.com/facebook/rocksdb/issues/7405) - PR: [New stable, fixed-length cache keys #9126](https://github.com/facebook/rocksdb/pull/9126) - [ ] [Support pro-actively erasing obsolete block cache entries #12694](https://github.com/facebook/rocksdb/pull/12694) - [ ] hyper clock cache - [ ] secondary cache - [ ] `pin_top_level_index_and_filter` 的影响 - 估算单个 sst 的 top-level block 极限能有多大，比如一个 key 只有10 bit，单 SST 128MB - [x] `strict_capacity_limit` - [x] `high_pri_pool_ratio`