--------
# Write
## Write Rate Limiter
### Wiki
- [Rate Limiter](https://github.com/facebook/rocksdb/wiki/Rate-Limiter)
---------
# Compaction
## Bytes Pending Compaction(BPC)
- **Computed by**
- `EstimateCompactionBytesNeeded()` // VersionStorageInfo
- Called after every compaction or flush, i.e. whenever the version changes (via `ComputeCompactionScore`)
- **Related options**
- `soft_pending_compaction_bytes_limit` // AdvancedColumnFamilyOptions, default 64 GB
- `hard_pending_compaction_bytes_limit` // AdvancedColumnFamilyOptions, default 256 GB
- When BPC reaches the soft limit, writes are slowed down; each delay is typically 1 ms or longer
- When BPC reaches the hard limit, writes are stopped and resume only after BPC drops below the hard limit
- **Calculation logic**
L0:
```
If num_files(L0) >= level0_file_num_compaction_trigger or size(L0) > target_size(L1):
  increment by size(L0) + size(L1)
```
Ln, for n > 0:
```
If size(Ln) > target_size(Ln):
increment by [ size(Ln) - target_size(Ln) ] *
[ (size(Ln+1) / size(Ln)) + 1 ]
```
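- In other words: the bytes by which this level exceeds its target, plus the next-level bytes the compaction is likely to touch (the excess multiplied by the size ratio between the two levels)
- **How to inspect**
- GetProperty key: `rocksdb.estimate-pending-compaction-bytes`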
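
The two rules above can be sketched as a small Python model. This is only an illustration of the pseudocode in these notes, not the RocksDB implementation: the function name, the list-based inputs, and the choice to skip the last level (it has no next level to compact into) are assumptions made for the example.

```python
def estimate_pending_compaction_bytes(level_sizes, target_sizes, num_l0_files,
                                      l0_file_trigger=4):
    """Toy model of the BPC rules in these notes (hypothetical helper).

    level_sizes[n]  : current bytes in level n (index 0 is L0)
    target_sizes[n] : target bytes for level n (index 0 is unused)
    """
    debt = 0.0
    # L0 rule: once L0 is due for compaction, all of L0 plus all of L1 counts.
    if num_l0_files >= l0_file_trigger or level_sizes[0] > target_sizes[1]:
        debt += level_sizes[0] + level_sizes[1]
    # Ln rule (n > 0): excess bytes scaled by the fan-out into the next level.
    # The last level is skipped because there is no L(n+1) below it.
    for n in range(1, len(level_sizes) - 1):
        excess = level_sizes[n] - target_sizes[n]
        if excess > 0:
            ratio = level_sizes[n + 1] / level_sizes[n]
            debt += excess * (ratio + 1)
    return debt


# Sizes in GB: L0=1, L1=20 (target 10), L2=100 (target 100), L3=1000.
sizes = [1, 20, 100, 1000]
targets = [0, 10, 100, 1000]
below_trigger = estimate_pending_compaction_bytes(sizes, targets, num_l0_files=2)
at_trigger = estimate_pending_compaction_bytes(sizes, targets, num_l0_files=4)
```

With these numbers only L1 is over target while L0 is below the trigger, so the estimate is `10 * (100 / 20 + 1)`; once the L0 file count reaches the trigger, all of L0 and L1 is added on top.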
### Challenges
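- BPC can change abruptly (it is not continuous)
- The L0 contribution only appears, all at once, when the L0 file count reaches `level0_file_num_compaction_trigger` (4 files).
- All of the debt can be spent at one level
- The goal of throttling writes based on BPC is to keep the LSM tree pyramid-shaped
- BPC is computed globally (it covers all levels)
- For example, with a 64 GB soft limit, L1 can accumulate roughly 64 GB of compaction debt before writes are throttled (all of it caused by L0 rather than spread across levels), so it is unclear whether the limit can actually keep the LSM tree healthy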
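
To make the discontinuity concrete, here is a minimal sketch of the L0 term alone as the L0 file count grows. The helper name and the sizes (64 MB L0 files, a 256 MB L1 with a 256 MB target) are assumptions chosen for illustration.

```python
MB = 1 << 20


def l0_debt(num_l0_files, l0_file_bytes, l1_bytes, l1_target, trigger=4):
    """L0 contribution to BPC under the rule in these notes (toy model)."""
    l0_bytes = num_l0_files * l0_file_bytes
    if num_l0_files >= trigger or l0_bytes > l1_target:
        return l0_bytes + l1_bytes
    return 0


# L0 debt as files are flushed one by one into L0.
series = [l0_debt(n, 64 * MB, 256 * MB, 256 * MB) for n in range(1, 6)]
```

In this model the contribution stays at zero for the first three files and then jumps by the full size of L0 plus L1 when the fourth file arrives.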
### Issue #9423
- [https://github.com/facebook/rocksdb/issues/9423](https://github.com/facebook/rocksdb/issues/9423)
- There are too many write stalls because the write slowdown stage is frequently skipped
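- **CalculateBaseBytes**: when computing `level_max_bytes_` (the maximum size allowed for each level):
- First, the L1 target size is set to `max(sizeof(L0), options.max_bytes_for_level_base)`
- Then the targets for $L_1$ through $L_{max-1}$ are derived by multiplying the L1 target size by the level multiplier ($L_{max}$ needs no target)
- **Problem:**
- When an L0 -> L1 compaction completes, `sizeof(L0)` suddenly shrinks, so the L1 target size drops back to `max_bytes_for_level_base`
- Because BPC grows with (sizeof(level) - target_size(level)), the smaller targets indirectly make BPC jump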
- **POC diff**
- computes level target from largest to smallest
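
The effect described in this issue can be reproduced with a toy model of the target calculation. The function names and all sizes below are assumptions for illustration; the real logic lives in `CalculateBaseBytes` and is more involved.

```python
def level_targets(l0_bytes, base_bytes, multiplier, num_levels):
    """Targets for L1..L(max-1) as described in these notes (toy model)."""
    l1_target = max(l0_bytes, base_bytes)
    # num_levels counts L0..Lmax, so there are num_levels - 2 targets.
    return [l1_target * multiplier ** i for i in range(num_levels - 2)]


def total_excess(level_bytes, targets):
    """Sum of bytes above target; zip stops at the shorter input list."""
    return sum(max(0, size - target)
               for size, target in zip(level_bytes, targets))


# Sizes in MB. L1 and L2 hold 1000 MB and 9000 MB throughout.
l1_l2_bytes = [1000, 9000]
before = level_targets(l0_bytes=1024, base_bytes=256, multiplier=10, num_levels=7)
after = level_targets(l0_bytes=0, base_bytes=256, multiplier=10, num_levels=7)
excess_before = total_excess(l1_l2_bytes, before)
excess_after = total_excess(l1_l2_bytes, after)
```

While L0 is large, both levels sit under their inflated targets; as soon as L0 drains, the same data is suddenly far over target, which is the kind of jump that lets writes skip the slowdown stage.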
#### Fixed #10057
- [Change The Way Level Target And Compaction Score Are Calculated](https://github.com/facebook/rocksdb/pull/10057)
- **Summary**
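- After an L0->L1 compaction completes, the level target sizes change drastically
- When L0 is too large, compactions of the lower levels are delayed, but they resume once the L0->L1 compaction completes, so the expected write-amplification benefit may never materialize
- Proposal: revert the level target size adjustment and instead adjust each level's score to prioritize the levels that most need compaction
- **Basic idea**
- Do not adjust the target level size; adjust the score instead
- Score formula: actual level size / (target size + estimated upper bytes coming down)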
- So when calculating target score, we can adjust it by adding estimated downward bytes to the target level size.
- **Two Changes:**
- Adjust the score instead of the target level size, so that the BPC calculation is more stable
- Change the priority order in which levels are compacted
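
A minimal sketch of the score formula above. How the PR estimates the bytes coming down from upper levels is not covered in these notes, so the sketch takes that estimate as a plain input; the function name and the numbers are assumptions.

```python
def adjusted_scores(level_bytes, targets, incoming_bytes):
    """score = actual size / (target size + estimated bytes coming down).

    incoming_bytes[i] is a caller-supplied estimate (hypothetical input).
    """
    return [size / (target + incoming)
            for size, target, incoming in zip(level_bytes, targets, incoming_bytes)]


# Sizes in MB for two adjacent levels: the upper one is 256 MB over target.
level_bytes = [512, 2560]
targets = [256, 2560]
plain = adjusted_scores(level_bytes, targets, [0, 0])
adjusted = adjusted_scores(level_bytes, targets, [0, 256])
```

Without the adjustment the lower level already scores 1.0; once the 256 MB expected from above is added to its denominator, its score falls below 1.0, so the upper level is compacted first while the targets themselves stay fixed.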
## Compaction Stall Counters
```
Stalls(count): 713 level0_slowdown, 706 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 1160 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 351 total count
Cumulative stall: 00:03:31.992 H:M:S, 65.1 percent
```
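- Printed by: `InternalStats::DumpCFStatsNoFileHistogram()`
- **Stall cause**: `ColumnFamilyData::GetWriteStallConditionAndCause()`
- Checks several conditions in order and returns as soon as one of them holds
- **Recomputed by**: `ColumnFamilyData::RecalculateWriteStallConditions()` // whenever the version changes
- The result is cached in the super_version
- It also updates the stall-related statistics internally
- Write delay: `DBImpl::DelayWrite()`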
### Stop Write Counters
- **memtable_compaction**
- Condition: unflushed_memtables >= max_write_buffer_number
- Log message: "Stopping writes because we have %d immutable memtables ..."
- **level0_numfiles**
- Condition: level0_numfiles >= level0_stop_writes_trigger
- Log message: "Stopping writes because we have %d level-0 files ..."
- **level0_numfiles_with_compaction**
- Same condition as above, and an L0 compaction is in progress
- **pending_compaction_bytes**
- Condition: BPC >= `hard_pending_compaction_bytes_limit`
- Log message: "Stopping writes because of estimated pending compaction bytes ..."
### Stall Write Counters
- **memtable_slowdown**
- All three conditions must hold:
- options.max_write_buffer_number > 3
- num_unflushed_memtables >= max_write_buffer_number - 1
- num_unflushed_memtables - 1 >= min_write_buffer_number_to_merge
- Log message: "Stalling writes because we have %d immutable memtables (waiting for flush) ..."
- **level0_slowdown**
- Condition: number of L0 files >= level0_slowdown_writes_trigger
- Log message: "Stalling writes because we have %d level-0 files rate ..."
- **level0_slowdown_with_compaction**
- Same condition as above, and an L0 compaction is in progress
- **slowdown for pending_compaction_bytes**
- Condition: BPC >= soft_pending_compaction_bytes_limit
- Log message: "Stalling writes because of estimated pending compaction bytes ..."
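
The stop and slowdown conditions listed above can be combined into one first-match decision function. This is a sketch under stated assumptions rather than a port of `GetWriteStallConditionAndCause()`: the check order (all stop conditions before all slowdown conditions), the `Options` dataclass, the return strings, and the default values are assumptions, and the `_with_compaction` variants are left out.

```python
from dataclasses import dataclass

GB = 1 << 30


@dataclass
class Options:
    # Values believed to match RocksDB defaults; verify against your version.
    max_write_buffer_number: int = 2
    min_write_buffer_number_to_merge: int = 1
    level0_slowdown_writes_trigger: int = 20
    level0_stop_writes_trigger: int = 36
    soft_pending_compaction_bytes_limit: int = 64 * GB
    hard_pending_compaction_bytes_limit: int = 256 * GB


def write_stall_condition(unflushed_memtables, l0_files, bpc, opts):
    """Return (condition, cause); the first matching rule wins."""
    # Stop conditions.
    if unflushed_memtables >= opts.max_write_buffer_number:
        return "stopped", "memtable_compaction"
    if l0_files >= opts.level0_stop_writes_trigger:
        return "stopped", "level0_numfiles"
    if bpc >= opts.hard_pending_compaction_bytes_limit:
        return "stopped", "pending_compaction_bytes"
    # Slowdown conditions.
    if (opts.max_write_buffer_number > 3
            and unflushed_memtables >= opts.max_write_buffer_number - 1
            and unflushed_memtables - 1 >= opts.min_write_buffer_number_to_merge):
        return "delayed", "memtable_slowdown"
    if l0_files >= opts.level0_slowdown_writes_trigger:
        return "delayed", "level0_slowdown"
    if bpc >= opts.soft_pending_compaction_bytes_limit:
        return "delayed", "slowdown for pending_compaction_bytes"
    return "normal", None


opts = Options()
example = write_stall_condition(unflushed_memtables=0, l0_files=25,
                                bpc=70 * GB, opts=opts)
```

In the example both the L0 slowdown and the soft BPC limit are exceeded, but only the first matching cause is reported, which mirrors the "return at the first condition that holds" behaviour noted earlier.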
## Intra-L0 Compaction
-------
# References
- [RocksDB internals: write rate limiter](http://smalldatum.blogspot.com/2022/01/rocksdb-internals-write-rate-limiter.html)
- [RocksDB internals: compaction stall counters](http://smalldatum.blogspot.com/2022/01/rocksdb-internals-compaction-stall.html)
- [RocksDB internals: bytes pending compaction](http://smalldatum.blogspot.com/2022/01/rocksdb-internals-bytes-pending.html)
- [RocksDB internals: intra-L0 compaction](http://smalldatum.blogspot.com/2022/01/rocksdb-internals-intra-l0-compaction.html)