block based - JonahGao's Notes

# BlockBased Format - RocksDB wiki： - https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format - https://github.com/facebook/rocksdb/wiki/Index-Block-Format - https://github.com/facebook/rocksdb/wiki/Partitioned-Index-Filters - <img src="https://img.jonahgao.com/oss/note/2025p2/rocksdb_sst_format2.svg" width="650" height="auto" /> - Top-Level Filter Block：索引 filter partition blocks - Full Filter Block：全量 filter 数据 - Top-Level Index Block：索引 index partition blocks - Full Index Block：直接索引 data blocks - **BlockHandler**: 指向文件内的 Block（Data Block 或者 Meta Block） - **offset**：varint64 - **size**: varint64 - **Meta Block** 包括 - Filter block - Index block - Compresion dicationary block - Range deletion block - Stats/Properties block - Meta Block 的格式与 Data Block **一样**，压缩可选。 ## internal key 编码 | user_key | seq_and_type( 固定 64 bits，小端编码) | | -------- | ------------------------------ | - seq_and_type：低 8 bits 为 type，高 56 bits 为 seqno ```cpp // Pack a sequence number and a ValueType into a uint64_t inline uint64_t PackSequenceAndType(uint64_t seq, ValueType t) { return (seq << 8) | t; } ``` ----- # Data Block ## 格式 - data block 内存储**有序**的KV 对。 - **前缀压缩**（prefix-compressed） - 存储 key 时，丢弃跟==前一个== key 的共同前缀 - 每个 N 个 Key，存储一份全量 Key。称为 restart point。 - Block 尾部存储所有 restart points 的 offsets - 方便在 Block 内二分搜索某个 key。 - 开关：`BlockBasedTableOptions::use_delta_encoding` - 一个 KV entry 的存储结构： - **shared_bytes**: `varint32` - **unshared_bytes**: `varint32` - **value_length**: `varint32` ^958fdf - **key_delta**: `char[unshared_bytes]` - **value**: `char[value_length]` - Block 尾部结构： - **restarts**：`uint32[num_restarts]` // 每个 restart point 的在 block 内的偏移 - **num_restarts**: `uint32` ![](https://img.jonahgao.com/oss/note/2025p2/rocksdb_block_builder.svg) ### Hash 索引 - 当 `data_block_index_type` 为 `kDataBlockBinaryAndHash` 时生效（默认不启用）。 - 在原来 restarts array 二分索引的基础上，构建==附加==的 hash 索引（索引一个 block 的所有 keys）。 - 索引条目：user key hash 值 <-> restart index - 即根据 key 的 hash 值可以快速定位到该 key 位于 block 内的哪个 restart interval 中。 - 结构： - 根据 `data_block_hash_table_util_ratio` 确定 buckets 数量 - `num_buckets = num_keys_in_block / ratio` - ratio 越小，buckets 越多，冲突概率越低 - `buckets[key_hash % num_buckets] = restart_index` - 冲突：两个 key 位于相同的 bucket ，但它们的 `restart_idex` 不同 - 存储一个特殊值 kCollision（相当于索引失效了） - 存储：每个 buckets 的 `restart_index` 加上 buckets 的数量 - 每个 bucket 占用一个字节 - 存储在 block restart array 之后 > [!summary] > - 优势：可以提升点查性能； > - 劣势： > - 空间占用（磁盘+内存，大小取决于 keys 量级） > - 占用 block cache，加大 miss。 > - 限制： > - 最多支持 253 个 restart intervals > - 对自定义 Comparator 有[限制](https://github.com/facebook/rocksdb/wiki/Data-Block-Hash-Index#things-that-need-attention) ## BlockBuilder ^897012 - 用于写一个 block - 文件：*block_builder.cc* - 成员变量： - **`last_key_`**: `string`, 保存上一个写入的 Key，用于计算共同前缀 - **`buffer_`**： `string` , block 写入buffer - **`restarts_`**：`std::vector<uint32_t>` , restart point 在 buffer_ 内的偏移 - 初始为 `{0}` - **`use_delta_encoding_`**: `bool`，是否启用 key 前缀压缩 - **`counter_`**：添加过的 Key 计数，用于判断何时插入 restart point - **AddWithLastKey()** - 参数 last_key_param：上一个 key - 可以使用 BlockBuilder 内保存的 last_key_ - 也可以由外部直接提供，就不需要维护 last_key_ 了（避免拷贝）。 - value delta encoding - 当 `use_value_delta_encoding_` 打开时，适用于 index block - 这种格式不用编码 [[block based#^958fdf|value_length]]，index value 内容固定，自解释 - 当 key shared_bytes 不为空时，编码 delta value （BlockHandle 中的 size delta）；否则编码完整的 value - **Add()** - 写入一个 Key Value，使用内部维护的 last_key_ 调用 AddWithLastKey - 如果启用了 key 前缀压缩，更新 last_key_ - 注意：需要保证 key 大于之前添加的 key - Add 和 AddWithLastKey 不能穿插调用，除非调用了 Reset - **Finish()** - block 末尾写入 restart array - 写入 footer，包含了 - 索引类型 DataBlockIndexType - restart array 的长度 ## Block 读取相关 ### BlockContents - 代表从 SST 中读出的一个 block。 - 成员： - **`data`**：`Slice`，指向 block 数据 - **`allocation`**：`unique_ptr<char[], CustomDeleter>`，数据的物理分配，如果不为空表示由当前类持有 block 数据的所有权。 ### BlockIter - 用于迭代一个 Block 中的 entries，继承 `InternalIteratorBase<TValue>` - TValue: 迭代器返回的 value 类型 - 主要实现了 InternalIteratorBase 的`Seek*`/`Prev`/`Next` 等方法： 1. 调用子类相应的 Impl 方法 2. UpdateKey：设置 `key()` 的返回值，统一==应用 `global_seqno_`逻辑==。 - 迭代返回的 keys 可以是 ikey 或者 ukey，取决于子类的配置。 - 成员： - **`data_`**：`const char *`，目标 block 数据，包含了 entries 和 restarts 数据 - **`num_restarts_`**：`uint32_t`，restart points 个数 - **`restarts_`**：restarts array 在 `data_`中的 offset - 即第一个 restart point 位于 `data_ + restarts_` - **`restart_index_`**：当前 entry 位于第几个 restart block 内 - **`current_`**：当前 entry 在 `data_` 中的 offset - 大于 `restarts_`时视为非 Valid（即迭代到了末尾） - **`raw_key_`**：`IterKey`，block 内存储的原始 key，格式可能为 ikey 或 ukey。 - 取决于子类初始化时的设置，通过 `raw_key_.SetIsUserKey(bool)`。 - **`key_`**：`Slice`，迭代器 `key()` 函数的返回值，格式取决于 `raw_key_` - **`global_seqno_`**：`SequenceNumber`，SST 注入相关，值不为 `kDisableGlobalSequenceNumber`起作用，返回的 ikey 将都使用该 seqno - **`value_`**：`Slice`，当前 entry 的 value。 - *`cur_entry_idx_`*：当前 entry 是 block 内的第几个 - 仅当开启 `block_protection_bytes_per_key`时有效 - 计算依赖 `block_restart_interval_`（不开启 `block_protection_bytes_per_key` 时该值为 0） - PR：[Block per key-value checksum #11287](https://github.com/facebook/rocksdb/pull/11287) - **InitializeBase()** - 初始化类的各个数据成员。 - **UpdateKey()** - 根据 `raw_key_` 更新 `key_` - 如果 `raw_key_` 为 user key，则更新为 `raw_key_.GetUserKey()` - 如果 `raw_key_` 为 internal key，尝试应用 `global_seqno_` - **ParseNextKey()** - 参数： - **`is_shared`**: `bool *`，函数内设置，表示当前 key 跟前一个 key 是否有 shared bytes。 - **`DecodeEntryFunc`**：模版参数，用于解码 entry，子类指定。 - 作用：解析下一个 entry - 更新 `current_`，调用 DecodeEntryFunc 解码 entry，结果保存到 `raw_key_` 和 `value_` 中 - `current_` 更新为上一个 `value_`的后一字节 - 当前 key 没有 shared bytes 时，则尝试更新 `restart_index_`： - 不变性：下一个 restart point interval 的起始一定大于当前 entry 的起始 offset - *NOTE：没有 shared bytes 并不一定意味着是新的 restart point，restart point 的分布只取决于 `block_restart_interval` 参数* - **SeekToRestartPoint()** - 参数： - **`index`**： `uint32_t` - 移动到第 `index` 个 restart point 的起始处 - 主要是更新 `restart_index_` 和 `value_`（ ParseNextKey 依赖 `value_`） - **GetRestartInterval()** - 获取 `block_restart_interval` ，计算方式： - SeekToFirstImpl, 再调用 NextImpl 遍历，计算到下一个 restart point 之前有多少 entries。 - **BinarySeek()** ^9e6711 - 二分搜索 restarts array。 - 参数： - **`target`**：`const Slice &`，搜索目标 - **`index`**：`uint32_t *`，返回值，目标位于第几个 restart interval 内。 - **`is_index_key_result`**：`bool *`，返回值，index 处的 restart key 正好是目标 key，不用继续在 restart interval 中线性扫描。 - **FindKeyAfterBinarySeek()** ^471e60 - 参数： - **`target`**：`const Slice &`，搜索目标 - **`index`**：`uint32_t`，BinarySeek() 的结果，搜索第几个 restart interval - **`is_index_key_result`**：`bool`，BinarySeek() 的结果 - 在目标 restart interval 中线性扫描第一个大于 target。 - 如果没找到，则指向第 `index+1` restart inteval 的首个 key。 ### DataBlockIter - 继承自 `BlockIter<Slice>`，迭代普通的 data block。 - **SeekToFirstImpl()**： - 移动到首个 restart point，然后解析 entry - **SeekToLastImpl()**： - 收到到最后一个 restart point，依次解析 entries，直到 restart interval 的最后一个 entry。 - **SeekImpl(target)**： 1. 调用 BinnarySeek() 定位 restart interval 2. 调用 FindKeyAfterBinarySeek() 在 restart interval 中顺序查找 - **NextImpl()** - 调用 ParseNextDataKey() 移动到并解析下一个 entry - **SeekForGetImpl(target)** - 为点查而优化的 seek，跟 SeekImpl() 区别在于会使用 hash index 如果存在。 - 如果 target 不可能存在（包括下一个 block 内），则返回 false。 - **PrevImpl()** - 基础实现： 1. 移动到当前 entry offset (`current_`）之前的 restart point 2. 顺序解析 restart interval 的 entries 并跳过，直到当前 entry 的前一个 entry - 即 `NextEntryOffset() < current` - cache 优化： - 顺序解析 restart interval 中的 entries 时将解析结果缓存到一个entry 数组，下次调用 PrevImpl() 直接读取缓存（从数组后面开始） - 由此可知 PrevImpl 比 NextImpl 要耗时，需要移动 restart point 并解析跳过，NextImpl 按 entry 依次往下解析即可。 - **SeekForPrevImpl(target)** 1. 类似 SeekImpl，跳转到大于等于 target 的位置 2. 循环调用 PrevImpl()，直到当前 key 小于等于 target > [!summary] > - Seek类实现先定位、移动到目标 restart interval； >- PrevImpl 需要先移动到当前 entry 之前的 restart interval，再顺序解析，并跳过当前 entry 之前的。 > - 缓存跳过的 entries，下次 PrevImpl 可以直接读取缓存（只要 prev 范围还在当前 restart interval 内）。 > - data block 的访问会记录入统计信息（读放大） > - 依赖 `read_amp_bytes_per_bit`选项 ### MetaBlockIter - 继承自 `BlockIter<Slice>`，迭代 meta block。 - 与 DataBlockIter 的区别： - meta block 存储的是 user keys（无 seqno），固定使用 BytewiseComparator。 - 通过 `raw_key_.SetIsUserKey(true)` 设置 - PrevImpl 无缓存优化。 - 相比 data block，调用不频繁？ - 对 meta block 的访问不记录统计信息。 ### Block 类 - Block 在内存中的表示，已解压缩且解析过。 - 涵盖 data block、index block、meta block 等。 - 核心成员： - **`contents_`**：`BlockContents`, block 底层数据 - **`data_`**：`const char *`，即 `contents_.data.data()` - **`size_`**：`size_t`，即 `contents_.data.size()` - **`restart_offset_`**：`uint32_t`，restart array 在 `data_` 中的偏移 - **`num_restarts_`**：`uint32_t`，restart points 的数量 - **`data_block_hash_index_`**：`DataBlockHashIndex`，block 的 hash 索引 - 在构造函数中解析 block，初始化 `restart_offset_`，`num_restarts_`，`data_block_hash_index_`等成员。 - 提供了方法创建各类 BlockIter（如 DataBlockIter、MetaBlockIter、IndexBlockIter）。 ------- # Index Block - index block 一个索引条目对应一个 data block，参数 `BlockBasedTableOptions::index_type` 设置不同的索引类型，默认 `kBinarySearch`。 - 一个索引条目包含 Key 和 Value，所有的条目按照 data block 的格式存储到一个或多个 blocks（partitioned）中。 - key：满足大于等于 block last key 且小于下一个 block first key - value：block 在文件内的偏移和大小等信息 - ![450](https://img.jonahgao.com/oss/note/2025p2/rocksdb_index_block.svg) - **IndexType**： - **`kBinarySearch`**：默认类型，一个 sst 中存在一个 index block。 - 一条 index entry： - key：基于 block last key - value：block handle - **`kHashSearch`**：需要配置 `Options.prefix_extractor` - **`kTwoLevelIndexSearch`**：两层索引，存在多个 index blocks（partitions）以及在其上的一个 top-level index block。 - second level index blocks 始终使用 block cache。 - **`kBinarySearchWithFirstKey`**：类似 `kBinarySearch`，但索引内容里包含了每个 block 的首个 key。 - 针对 short range scans 场景可以减少读放大。 - 否则迭代器 seek 时，每个 L0 文件和 L0之外的每层，都至少要读取一个 block（MergingIterator MinHeap 逻辑） - 可能会导致索引数据很大，特别当 keys 很长时。 - PR: [Add an option to put first key of each sst block in the index #5289](https://github.com/facebook/rocksdb/pull/5289) ## IndexValue - IndexValue： ^fbbace - **`handle`**：`BlockHandle` - **`offset_`**：`uint64_t`，block 在文件内的偏移 - **`size_`**：`uint64_t`，block 的大小 - **`first_internal_key`**：`Slice`，IndexType 为 `kBinarySearchWithFirstKey` 时启用 ### 编码 - IndexValue::EncodeTo() - 不使用 `value_delta_encoding`时： - block first key：仅当 `BlockBasedTableOptions::IndexType` 为 `kBinarySearchWithFirstKey` 才包含，格式为 Internal key。 | block offset(Varint64) | block size(Varint64) | block first key (可选) | | ---------------------- | -------------------- | -------------------- | - 使用 `value_delta_encoding`： - 条件：`table_opt.format_version >= 4 && !table_opt.block_align` - 仅在同一个 restart interval 中第二个 entry 开始才使用该格式（==隐式条件==） - 不需要编码 BlockHandle 的 offset - 解码时根据上一个 block 的 offset 和 size，可以计算出来 - block size delta：当前 block size 跟上一个 block 的差值 | block size delta(Varsignedint64) | block first key可选） | | -------------------------------- | ------------------ | ```cpp // IndexValue::DecodeFrom() handle = BlockHandle(previous_handle->offset() + previous_handle->size() + BlockBasedTable::kBlockTrailerSize, previous_handle->size() + delta); ``` ``` restart_point 0: k, v (off, sz), k, v (delta-sz), ..., k, v (delta-sz) restart_point 1: k, v (off, sz), k, v (delta-sz), ..., k, v (delta-sz) ... restart_point n-1: k, v (off, sz), k, v (delta-sz), ..., k, v (delta-sz) where, k is key, v is value, and its encoding is in parenthesis. ``` ## IndexBuilder - builder 的基类，主要是定义了统一的 virtual 方法列表。 - **IndexBlocks** - **`index_block_contents`**: `Slice` - index block 数据 - **`meta_blocks`**：`map<string, Slice>` - 可选地，primary index block 的 metadata，hash index 专用 - **virtual AddIndexEntry() = 0** - 添加一个 block index entry。 - 参数： - **`last_key_in_current_block`**：`std::string *` - 当前 block 的最后一个 key，用作 index key，可以会被具体的实现优化修改。 - **`first_key_in_next_block`**：`const Slice *` - 下一个 block 的首 key，可以为空（当前为最后一个 block）。 - **`block_handle`**: `const BlockHandle&` - block 的地址和大小，index entry 的 value。 - **virtual OnKeyAdd(key) {}** - 每当 BlockBasedTableBuilder 添加了一个 key 后调用，用以通知，默认无操作。 - **virtual Status Finish() = 0** - 完成索引构建。 - 可能会返回 `InComplete`，表示是 partitioned index，调用者需要继续调用 Finish 直到返回 `Status::OK`。 - 参数： - **`index_blocks`**：`IndexBlock *` - 填充构建结果。 - **`last_partition_block_handle`**：`const BlockHandle&` - 上一个构建完成的 index partition 的 block handle，用于构建 2nd level index。 - **ShouldUseKeyPlusSeqAsSeparator()** - 给出当前 block 的 last key 和下一个 block 的 first key，返回 index key 是否需要添加 seq（即使用 internal key）。 - 只有当两个 key 的 user key 相同时才需要（此时需要添加 seq 才能做出区分） ## ShortenedIndexBuilder - 优化： - 默认 `index_block_restart_interval` 为 1，即没有前缀压缩，加快搜索（二分即可，不要额外线性扫描）。 - 使用缩短后的 index key（不使用 full key，只需要足够分割两个 block 即可）。 - **IndexShorteningMode** 配置： - `kNoShortening`：使用 full key - `kShortenSeparators`：除文件的最后一个 block 外，其他 blocks 使用 shorten。默认。 - `kShortenSeparatorsAndSuccessor`：所有 blocks 都使用 shorten。 - 核心成员： - **`seperator_is_key_plus_seq_`**: `bool` - index key 是否需要包含 seqno，默认为 false - 当 ShouldUseKeyPlusSeqAsSeparator() 返回 true，该成员不可逆地改为 true。 - 当两个 block 边界的 user key 相同时，必须添加 seqno 才能区分。 - 每次 AddIndexEntry() 进行检查更新 - **`index_block_builder_`**：`BlockBuilder` - 用于构建 index block，index key 格式为 internal - **`index_block_builder_without_seq_`**：`BlockBuilder` - 用于构建 index block，index key 格式为 user，仅在`seperator_is_key_plus_seq_` 为 true 之前才更新 - Finish() 时，两个 builders 只使用其中的一个，取决于 `seperator_is_key_plus_seq_`的最终值。 - **`last_encoded_handle_`**：`BlockHandler` - 上一个 block 的 handler，用于实现 index value（offset）的 delta encoding - PR：[Index value delta encoding #3983](https://github.com/facebook/rocksdb/pull/3983) - **FindShortestInternalKeySeparator(start, limit)** - 查找位于`[start, limit)` 范围内的最短 key，更新回 start 中。 - 如果 start 确实缩短（更新了），会重设 start 的 type 和 seqno 为排序上最小的。 - 即 `(kMaxSequenceNumber, kValueTypeForSeek)` - **AddIndexEntry()** 1. 根据是否为最后一个 block、IndexShorteningMode 确定实际的 index key； 2. 编码 [[block based#^fbbace|IndexValue]]； 3. 将 index key 和 index value 写入 BlockBuilder > [!summary] ShortenedIndexBuilder > - 使用缩短后的 index key，并尽量不携带 seqno。 > - 优势：减少索引大小 > - 劣势：block的范围变得不再精准，可能引发不必要的读取 > - 例如两个 block: `[ ... AAA] B [C ...]`，B 为缩短后的索引 key，则 seek `(AAA, B]` 范围内的 keys 时，会读取第一个 block（这是不必要的） > - 特例：文件末尾的 block > - 默认配置使用 kShortenSeparators，最后一个 data block 的 index key 使用 full key，针对 last block 修复该问题。 ## PartitionedIndexBuilder - 两级索引，避免单个 index block 过大。 ``` [index block - 1st level] [index block - 1st level] ... [index block - 1st level] [index block - 2nd level] ``` - 每级均使用 ShortenedIndexBuilder 构建。 - 核心成员： - **`entries_`**: `list<Entry>`， partition builders 列表 - Entry：表示一个 partition - **`key`**: `string`，parition 的 last key - **value**：`unique_ptr<ShortendIndexBuilder>`，partition 对应的 builder - **`index_block_builder_`**, **`index_block_builder_without_seq_`** - top-level builder，两个 builder 作用类似 `ShortendIndexBuilder` - **`seperator_is_key_plus_seq_`**：`bool` - 作用同 `ShortendIndexBuilder`，每次 AddIndexEntry 时更新，其值为所有 sub indexes `seperator_is_key_plus_seq_` 的==或==。 - 即只要有一个为 true，两级索引都将为 true。整个文件统一，体现在 `TableProperties::index_key_is_user_key` 中。 - **`sub_index_builder_`**：`ShortenedIndexBuilder *` - 当前构建中的 partition index builder - **`partition_cut_requested_`**：`bool` - 外部请求切换到新的 index partition - **`cut_filter_block`**：`bool` - 标记，通知外部（PartitionedFilterBlockBuilder）切换到新 filter partition - **`flush_policy_`**：`unique_ptr<FlushBlockPolicy>` - block 的 flush/切换策略，默认按 `metadata_block_size` 进行flush - Partition flush： - index partition 和 filter parition 的切换会进行对齐，由 index builder 做决定，然后通过`cut_filter_block`同步给 filter builder。切换原因由两个： - 来自外部请求 (`partition_cut_requested_`标记，filter partition keys 到达固定数量后设置) - 内部的 block flush policy（block 大小达到 `metadata_block_size`限制） - **AddIndexEntry()** 1. 检查当前 partition 是否需要 flush（条件见上述），需要则进行 flush - flush：将当前 `sub_index_builder_` 追加到 `entries_` 后面，并设置 `cut_filter_block` 通知 filter builder 2. 按需创建新的 `sub_index_builder_` （首次调用或者刚 flush 完会为 nullptr) 3. 调用 `sub_index_builder_->AddIndexEntry()` 4. 从 `sub_index_builder_`同步 `seperator_is_key_plus_seq_` - 如果有变化，则需要更新 `flush_policy_`，让其使用最新的 BlockBuilder，来正确计算当前 block 的大小 - **Finish()** - 每次调用只 Build/Finish 一个 partition（`entries_`的头部），返回 Incomplete - 需要向 sub-builder 同步 `seperator_is_key_plus_seq_`，保证全文件统一 - 将对上一个 partition block 的索引写入 top-level buidler - key 为该 partition 的 last key - value 为 partition 的 BlockHandle，并使用 value delta encoding - 调用 BlockBuilder::Add() 直接写（因为 key 已经 shorten 过） - 最后一次调用 Finish top-level index builder（`entries_`为空），并返回 Status::OK ## HashIndexBuilder - 对应 `IndexType::kHashSearch`，前置要求：设置 `prefix_extractor`。 - 索引由两部分组成： - primary index：ShortenedIndex，单 block，等同于 `IndexType::kBinarySearch` - prefix index： - index key：文件内所有不相同的 key prefixes - index value：prefix 覆盖的 data blocks 范围 - 从文件内第几个 data block 开始及覆盖的 blocks 数量 - 该 index 数据存储在两个 meta blocks 中。 - 一个 meta block (`kHashIndexPrefixesBlock`) 记录所有的 prefixes - 另一个 `kHashIndexPrefixesMetadataBlock` 存储每个 prefix 的大小及 index value。 - 核心成员： - **`primary_index_builder_`**：`ShortenedIndexBuilder`，构建 primary index - **`current_restart_index_`**：`uint64_t`，当前是第几个 data block - 每次 AddIndexEntry() 时加一即可 - **`pending_entry_index_`**：`uint64_t`，出现新 prefix 时位于第几个 data block。 - 在 OnKeyAdded 检测 Prefix 变化 - **`pending_block_num_`**：`uint64_t`，当前 prefix 占据了多少 data blocks。 - **AddIndexEntry()** - 更新 primary index - 更新 `current_restart_index_` - **OnKeyAdded()** - 检测 prefix 变化，变化后则更新 hash index；不变化则尝试累加 `pending_block_num_`（需要检测`current_restart_index_`变化） > [!summary] Prefix Hash Index > - 加速 prefix seek（快速定位某个 prefix 位于文件的哪些 data blocks 内），需要配合 `prefix_extractor` 使用 > - primary index 单 block，有过大的风险 - 如何根据 block index 快速定位 block handle： - hash index 的 `index_block_restart_interval` 只能为 1，查找 restart array 即可。 ```cpp // BlockBasedTableFactory::InitializeOptions() if (table_options_.index_type == BlockBasedTableOptions::kHashSearch && table_options_.index_block_restart_interval != 1) { // Currently kHashSearch is incompatible with // index_block_restart_interval > 1 table_options_.index_block_restart_interval = 1; } ``` --- # Index Reader ## IndexBlockIter - 继承自 `BlockIter<IndexValue>`，即 value 类型为 [[block based#^fbbace|IndexValue]]，用来迭代一个 index block 中的 index entries。 - key 的格式（是否包含 seqno）取决于 `TableProperties::index_key_is_user_key` - Initialize函数的 `key_includes_seq` 参数 - 核心成员： - **`value_delta_encoded_`**：`bool`，是否使用了 value delta 编码，来自 `TableProperties::index_value_is_delta_encoded` - **`have_first_key_`**：`bool`，index value 是否包含了 block 首 key，解码时需要 - 当 `IndexType == kBinarySearchWithFirstKey` 时为true - **`prefix_index_`**：`BlockPrefixIndex`，hash index - 当 `IndexType == kHashSearch`不为空 - 查询 key prefix 可能存在哪些 data blocks 中 - **`global_seqno_state_`** ：`unique_ptr<GlobalSeqnoState>` - 文件存在 global seqno 且 `have_first_key_` 为 true 时有效，用于实现对 `first_internal_key`的 seqno 进行重写。 - state 中的 `first_internal_key` 成员保存每次重写后的结果，解码后的 IndexValue 中的 `first_internal_key` Slice 再指向它。 - **`decoded_value_`**：`IndexValue` - 当前 IndexValue，当 value 解码逻辑比较特殊时使用 - 比如使用了 value delta 编码、需要对 `first_internal_key` 应用 global seqno 等。 - **CompareBlockKey(block_index, target)** - 第 `block_index` 个 data block 的 last key 与 target 比较 - 仅 hash index 使用，restart interval 只能为 1，通过 restart array 定位该 block 的 index entry。 - **BinaryBlockIndexSeek()** - 对一组 `block_ids` 进行二分搜索，找到首个大于等于 target 的 block - 仅 hash index 使用 - **PrefixSeek()** - 通过 hash index 获取目标 prefix 可能存在的 blocks 范围，再调用 BinaryBlockIndexSeek 二分搜索，返回 target 所属的 block index - **SeekImpl(target)** 1. 如果有 hash index，则使用 PrefixSeek，否则使用 [[block based#^9e6711|BinarySeek]] 2. 调用 [[block based#^471e60|FindKeyAfterBinarySeek]] 移动到目标 entry > [!summary] IndexBlockIter Summary > - seek 会使用 index block 特有的 hash index > - 不存在则退化为二分搜索 restart array > - 解码特殊（ delta value encoding） > - 同 MetaBlockIter，没有 prev cache。 > - 对 `first_internal_key` 应用 global seqno > - index key 部分由 BlockIter 基类负责 ## PartitionedIndexIterator ^d722d4 - 迭代 partitioned index，继承自 `InternalIteratorBase<IndexValue>` - 核心成员： - **`index_iter_`**：`unique_ptr<InternalIteratorBase<IndexValue>>` - top-level index block 的 IndexBlockIter - **`block_iter_`**：`IndexBlockIter` - 当前的 second-level index block iter - **`block_iter_points_to_real_block_`**：`bool` - `block_iter_` 是否已经初始化 - **`prev_block_offset_`**：`uint64_t`，`block_iter`所属 partition block 的 offset - 用途：当初始化 `block_iter_` 时，如果上次的 offset 跟本次的目标 block offset 一样，则不用重复初始化 - **InitPartitionedIndexBlock()** - 读取 `index_iter_` （top-level）当前指向的 partition，读取该 parition block 并初始化`block_iter_` - **FindBlockForward()** - 当前 `block_iter_`（即 partition) 已经迭代完成，切换到下一个==非空的== parttition - 切换 parition：对`index_iter_`调用 Next，并初始化 `block_iter_`执行新 partition - **FindKeyForward()** - 检测当前 partition 是否已经迭代结束，结束的话调用 FindBlockForward() 切换到下一个partition - **SeekImpl(target)** 1. 两层 seek： - 先对`index_iter_`进行 seek，初始化 `block_iter_`指向当前 partition； - 再对 `block_iter_`调用 seek 2. 调用 FindKeyForward 跳过空 partition（迭代不出数据的） - 如果 target 为 nullptr，则 SeekToFirst - **Next()** 1. 对 `block_iter_`调用 Next() 2. 调用 FindKeyForward() ，跳过空 partition > [!summary] PartitionedIndexIterator > - Seek 依次对两层 index 进行 seek > - Next 操作需要处理 partition 切换 > - 切换新 partition 有 IO 操作（ InitPartitionedIndexBlock 函数中） ## TwoLevelIndexIterator - 功能上等同于 [[block based#^d722d4|PartitionedIndexIterator]]。 - 区别在于所有 partition blocks 都已经从磁盘读取并缓存（partition map 中），切换 partition 时不需要 IO（从 partition map 中查找即可）。 - partition map: `Map<uint64_t, CachableEntry<Block>>`，key 是 partition block offset - 典型应用场景：`pin_l0_filter_and_index_blocks_in_cache` ❓是否可以将 PartitionedIndexIterator 实现为一个 TwoLevelIndexIterator ## IndexReader - 接口类，定义了访问 index 的方法列表。 - **NewIterator()** - 构建 index 迭代器。返回 `InternalIterBase<IndexValue> *` - **ApproximateMemoryUsage()** - 报告 reader 占用了多少内存（不含 block cache 中分配的） ## IndexReaderCommon ^f01308 - 继承了 IndexReader，主要管理 top-level index block，包含 cache、pin 等逻辑。 - 核心成员： - **`index_block_`**：`CachableEntry<Block>`，可能为空 - **GetOrReadIndexBlock()** - 如果 `index_block_` 为空，则进行读取（如果开启 `cache_index_and_filter_blocks`则会走 block cache）。 - 不为空，直接返回其引用，无 IO 和 cache 操作 - `index_block_` 的几种情况： - `cache_index_and_filter_blocks` 为 false，如果不为空则直接持有 block； - 会显著影响每个 TableReader（打开的 SST 文件）的内存占用 - `cache_index_and_filter_blocks` 为 true 且 `pin_top_level_index_and_filter` 也为 true，则持有 cache handle（实现了 reader pin 住 cache） ## PartitionIndexReader - table/block_based/partitioned_index_reader.cc - 继承自 `IndexReaderCommon`。 - **CacheDependencies()** - 读取并将所有 partition blocks 加载到 block cache 中 - 先读取并迭代 top-level index block - 如果 pin 参数为 true，则持有 cache handle （保存到 `partition_map_`中） - `partition_map_`要么为空要么包含所有 partitions - **NewIterator()** - 构建 index 迭代器，根据 `partition_map_`是否为空，创建 TwoLevelIndexIterator 或者 PartitionedIndexIterator ## BinarySearchIndexReader - **NewIterator()** - 单 index block，直接在 index block 之上创建 IndexBlockIter ## HashIndexReader - **NewIterator()** - 同 BinarySearchIndexReader，单 index block，直接在 index block 之上调用 NewIndexIterator 创建 IndexBlockIter，但携带了 prefix_index ## 总结 ```mermaid classDiagram direction TD class IndexReader { <<interface>> +InternalIteratorBase~IndexValue~* NewIterator() +CacheDependencies() } IndexReader <|-- IndexReaderCommon: 实现 class IndexReaderCommon { #ReadIndexBlock() #GetOrReadIndexBlock() -const BlockBasedTable* table_ -CachableEntry~Block~ index_block_ } IndexReaderCommon <|-- PartitionIndexReader: 继承 IndexReaderCommon <|-- BinarySearchIndexReader: 继承 IndexReaderCommon <|-- HashIndexReader: 继承 class PartitionIndexReader { - UnorderedMap~uint64_t, CachableEntry~Block~~ partition_map_; } class HashIndexReader { - std::unique_ptr~BlockPrefixIndex~ prefix_index_; } PartitionIndexReader --> PartitionedIndexIterator: 创建 PartitionIndexReader --> TwoLevelIndexIterator: 创建 BinarySearchIndexReader --> IndexBlockIter: 创建 HashIndexReader --> IndexBlockIter: 创建 ``` - TableReader 根据 index_type 创建不同的 IndexReader - index_type 存储在 sst 的 table_properties 中 ```c++ const std::string BlockBasedTablePropertyNames::kIndexType = "rocksdb.block.based.table.index.type"; ``` - PartitionIndexReader 根据 partition blocks 是否已全部 pin 在缓存中，来创建不同的 index iterator（TwoLevelIndexIterator 或者 PartitionedIndexIterator） - BinarySearchIndexReader 和 HashIndexReader 都是单 index block，创建 index iterator 时区别在于有没有 hash prefix index ---------------- # Filter Block ## FilterBlockBuilder - 抽象类，提供了构建 filter 的相关方法。 - **void Add(const Slice& key) = 0** - 添加一个 key - **Slice Finish() = 0** - 完成构建，并返回构建完的数据 - 参数： - **`tmp`**: `const BlockHandle& tmp` - 仅对 partitioned filter 有效，表示上一个 filter partition 的 block handle，用于构建 filter partitions 的索引。 - **`status`**：`Status *`，状态码 - **`filter_data`**：`std::unique_ptr<const char[]>*` - 如果不为空，将构建完的数据转移给 filter_data 即 caller - 返回值：构建完后的数据的 Slice（即数据的引用） - **bool isEmpty() = 0** - empty：即没有添加过 key - **Status MaybePostVerifyFilter(const Slice& filter_content)** - 构建完后对 filter 数据进行校验是否损坏 - 当 `BlockBasedTableOptions::detect_filter_construct_corruption` 为 true 时启用，默认不启用 - PR: [Detect (new) Bloom/Ribbon Filter construction corruption #9342](https://github.com/facebook/rocksdb/pull/9342) ## FullFilterBlockBuilder - FilterBlockBuilder 的实现类，主要用途： - 构建一个 SST 的单个 filter block。 - 启用 partitioned filter 后，用于构建一个 partition。 > [!note] SeekForPrev 在 prefix mode 下的问题 > - 为了保证 SeekForPrev 在 prefix mode + partitioned filter 下能正确工作，每个 filter partition 需要包含上一个 partition last key 的 prefix （*each filter partition should include the bloom of the prefix of the last key in the previous partition*） > >- 原因： > - prefix seek 时可能会定位到下一个 partition，例如 filter partition 最后一个 key 是 `abc`，Seek `abcd`时，先查找 top-level index，`abcd > abc`，因此会定位到下一个 partition，PrefixMayMatch 可能就会返回 false。 > > - PRs： > - [Fix a bug for SeekForPrev with partitioned filter and prefix #8137](https://github.com/facebook/rocksdb/pull/8137) > - [Fix major bug with prefixes, SeekForPrev, and partitioned filters ](https://github.com/facebook/rocksdb/pull/12872) > - [Reduce cases of impacted performance from bug fix #12874](https://github.com/facebook/rocksdb/pull/12874) - 核心成员： - **`filter_bits_builder_`**：`unique_ptr<FilterBitsBuilder>`，内存中的 filter 构建类 - 通常为 [[bloom filter#^34417e|FastLocalBloomBitsBuilder]] - **`prefix_extractor_`**：`const SliceTransform *`，来自 Options，不为空的话会将 key 的前缀加入 filter - **`whole_key_filtering_`**：`bool`，来自 Options，将完整 key 加入 filter（而不仅仅是 key 的前缀），默认 true - **`last_prefix_recorded_`**：`bool`，确保一个 partition 只记录一次 `last_prefix` - **`last_prefix_str_`**：`bool`，上一个 partition 的最后一个 prefix，每次 AddPrefix() 时更新 - **`last_whole_key_recorded_`**, **`last_whole_key_str_`**: - whole key 去重，只有跟上次 whole key 不同才添加到 filter - 这是因为如果 whole keys 之间穿插了 prefix，则 FilterBitsBuilder 无法准确去重（依赖判断上一次的 hash 值） - PR：[Skip duplicate bloom keys when whole_key and prefix are mixed #3764](https://github.com/facebook/rocksdb/pull/3764) - **void AddKey(const slice& key)** - 调用 `filter_bits_builder_->AddKey(key)`，将完整的 key 加入 filter - **void AddPrefix(const Slice& key)** - 添加 key 的前缀加入 - **void Add(const Slice& key)** - 分别判断和添加 prefix 和 whole key - 添加 prefix 的前提：`prefix_extractor_ && prefix_extractor_->InDomain(key_without_ts)` - 添加 whole key 前提：`whole_key_filtering_` 为 true 且与上次 whole key 不同（去重） - **void Reset()** - 设置 `last_whole_key_recorded_` 和 `last_prefix_recorded_` 为 false - PartitionedFilterBlockBuilder 会重复利用同一个 FullFilterBlockBuilder 来构建不同的 partition，切换 partitition 时会调用 Reset ## PartitionedFilterBlockBuilder - 继承了 FullFilterBlockBuilder，利用它构建 per partition 的 filter。 - 一个 SST 的 filter 数据组织为多个 filter partition blocks 加一个 filter index block - ![](https://img.jonahgao.com/oss/note/2025p2/rocksdb_partitioned_filter.svg) - 读取时先根据 top-level index block 查询目标 key 位于哪个 filter partition，再读取该 partition。 - 核心成员： - **`index_on_filter_block_builder_`**, **`index_on_filter_block_builder_without_seq_`** - filter 的 top-level index builder - **`p_index_builder_`**：`PartitionedIndexBuilder *` - 关联的 PartitionedIndexBuilder，保持两类 builder 的 partition 数量相同 - **`filters`**：`deque<FilterEntry>` - FilterEntry: `<index_key, filter_data, filter>` - 每个 partition 的构建结果 - `index_key`来自跟其同步构建的 `p_index_builder_`（Shorten 后的），用于构建 top-level filter index - **`keys_per_partition_`**：`uint32_t` - 每个 filter partition 预期包含多少个 key，超出后会切换到新 partition - 计算方式：`partition_size * 8 / bits_per_key` - `partition_size`：近似于 `metadata_block_size` 配置 - **MaybeCutAFilterBlock()** - 每次添加 key 时都会调用，检测满足 `keys_per_partition_`，则向 `p_index_builder_` 请求切换 partition - `p_index_builder_`切换 partition 后，构建当前 filter partition，放入 `filters` 中缓存（内存中），并且 Reset 当前 parition 的 filter builder。 - **Add(key)** - 调用 MaybeCutAFilterBlock()，执行 partition 切换逻辑 - 添加到当前 parition （调用 FullFilterBlockBuilder::Add） - **Slice Finish()** - 类似于IndexBuilder，每次调用 Finish 仅返回一个 partition 的构建结果，且 status 返回 InComplete，最后一次调用时构建 top-level index block，并返回 OK。 - 构建 filter partition 时传入上一个 partition 的 handler，写入 index builder > [!summary] PartitionedFilterBlockBuilder > - partition 切换需要跟 index builder 进行同步 > - top-level index 可能基于 internal key 构建 ## FilterBlockReader - 所有 filter reader 的基类，用于读取一个 SST 的 filter。 - **bool KeyMayMatch(key)** - 判断 user key 是否可能存在。 - 对于 partitioned filter 需要额外提供 internal key（查询 top-level index）。 - **void KeysMayMatch(range)** - 检查 MGet 对应的多个 keys 是否存在，对于确定不存在的从 range 中进行剔除 - 提供了默认实现：迭代每个 key 并依次调用 KeyMayMatch - **PrefixMayMatch(prefix)**，**PrefixesMayMatch(range)** - 类似 KeyMayMatch 和 KeysMayMatch，检查 prefix 或者 range 内 keys 的前缀是否可能存在 ## FilterBlockReaderCommon - 类似于 [[block based#^f01308|IndexReaderCommon]]，管理 full filter block 或者 filter 的 top-level index block（启用 partitioned filter 后）。 - 核心成员： - **`filter_block_`**：`CachableEntry<TBlocklike>` - TBlocklike 模版类，`ParsedFullFilterBlock` 或者是 `Block_kFilterPartitionIndex` ## FullFilterBlockReader - 继承 FilterBlockReaderCommon，适用于 full filter。 - 类似于 FullFilterBlockBuilder，相比 FilterBitsReader 多了 whole key 和 prefix 的相关逻辑。 - 例如 `whole_key_filtering` 为 false，则 KeyMayMatch 始终返回 true（无 filter）。 ## PartitionedFilterBlockReader - 读取某个 SST 的 partitioned filter。 - 核心成员： - **`filter_map_`**: `UnorderedMap<uint64_t, CachableEntry<ParsedFullFilterBlock>> - 实现 pin filter partitions，例如开启 `pin_l0_filter_and_index_blocks_in_cache` - CacheDependencies() 中填充 - **GetFilterPartitionHandle(ikey)** - 从 top-level index 中搜索 ikey，返回对应 filter partition 的 BlockHandle - `filter_block->NewIndexIterator()->Seek(ikey)` - **GetFilterPartitionBlock(handle)** - 根据 filter partition 的 BlockHandle 读取对应 block，返回 `CachableEntry<ParsedFullFilterBlock>` - 先尝试从 `filter_map_` 中读取 - **MayMatch()** 1. 获取 top-level block：`FilterBlockReaderCommon::GetOrReadFilterBlock()` 2. 迭代 Seek top-level block，获取 partition BlockHandle：`GetFilterPartitionHandle()` 3. 读取 partition block：`GetFilterPartitionBlock()` 4. 查询 parition block: `FullFilterBlockReader::KeyMayMatch()` > [!summary] PartitionedFilterBlockReader > - 两级查询：先从 top-level 按 ikey 搜索 partition，再从 partition 中查询 > - 每个 partition 复用 FullFilterBlockReader 查询 ## 总结 ### Builders ```mermaid classDiagram direction TD class FilterBlockBuilder { <<interface>> +Add() +Finish() +IsEmpty() +ResetFilterBitsBuilder() +MaybePostVerifyFilter() } FilterBlockBuilder <|-- FullFilterBlockBuilder: 实现 FullFilterBlockBuilder <|-- PartitionedFilterBlockBuilder: 继承 class FullFilterBlockBuilder { -FilterBitsBuilder* filter_bits_builder_ } class PartitionedFilterBlockBuilder { -BlockBuilder index_on_filter_block_builder_ -std::deque<FilterEntry> filters } ``` - FullFilterBlockBuilder 主要处理 whole key 和 prefix key 相关逻辑； - PartitionedFilterBlockBuilder 利用父类的 FullFilterBlockBuilder 构建 per-partition，每个 partition 构建完都先缓存到内存中（data blocks 还没写完，无法确定 filter 的写入位置）。 ### Readers ```mermaid classDiagram direction TD class FilterBlockReader { <<interface>> +KeyMayMatch() +KeysMayMatch() +PrefixMayMatch() +PrefixesMayMatch() +RangeMayExist() } FilterBlockReader <|-- FilterBlockReaderCommon: 实现 class FilterBlockReaderCommon { #ReadFilterBlock() #GetOrReadFilterBlock() -const BlockBasedTable* table_ -CachableEntry~Block~ filter_block_ } FilterBlockReaderCommon <|-- FullFilterBlockReader: 继承 FilterBlockReaderCommon <|-- PartitionedFilterBlockReader: 继承 class PartitionedFilterBlockReader { -UnorderedMap~uint64_t, CachableEntry~ParsedFullFilterBlock~~ filter_map_; } ``` ----- # Meta Index block - 用于按名字索引 meta block - meta index block 里的每条 index entry 对应一个 meta block - Index key：meta block 的名字 - Index value：block handle，meta block 的位置和大小 ![500](https://img.jonahgao.com/oss/note/2025p2/rocksdb_meta_index.svg) ## MetaIndexBuilder - 用于构建 meta index block，封装了 [[block based#^897012|BlockBuilder]]，每条 entry 为 `<key, encoded_block_handle>`。 - key 为 meta block 的名字 - `block_restart_interval`固定为 1。 ---- # BlockBasedTableBuilder - 用于构建一个 SST 文件 - table/block_based/block_based_table_builder.h ----- # BlockBasedTable - Reader 类，用于读取一个 SST 文件。 - table/block_based/block_based_table_reader.h ## ApproximateMemoryUsage() ^8a19fc - 包含以下内存占用： - Rep 类大小 - FilterBlockReader 和 IndexReader 类大小 - UncompressionDictReader 类大小 - TableProperties 占用大小 -------- # TODO - [ ] BlockBasedTableOptions - [ ] BuildTable - [x] block index 使用 ikey 还是 ukey - 尽量使用 ukey，按需使用 ikey - [ ] kMaxBlockSizeSupportedByHashIndex 为什么有这个限制 - [ ] BlockReadAmpBitmap - [ ] `protection_bytes_per_key` - [Add WriteOptions protection_bytes_per_key #10037](https://github.com/facebook/rocksdb/pull/10037) - [Block per key-value checksum #11287](https://github.com/facebook/rocksdb/pull/11287) - [ ] BlockPrefetcher - [ ] RangeMayExist - [ ] 自定义 table property - [ ] PrefetchTail - [x] 是否包含 filter/index partitions - [ ] prefetch 的数据是否过大？ ---- # 实验 - [x] 验证 `value_delta_encoding` - 默认 `index_block_restart_interval = 1` 是否不起作用 - [x] 启用 PartitionedIndex 后，如果 data block 数量很少，是否也一定会有两级 index - 是的，最坏情况一个 top-level index block 和一个 second-level index block ❓build table 时是否能识别这种情况，退化为非 partitioned？以及是否有必要