hudi-rfc-42 - JonahGao's Notes

Source: [rfc-42](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md)

# Abstract

- **Scenario**
  Upsert with deduplication relies on an index to support point lookups.
- **Current state**
  The bucket index ([RFC-29](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index)) gives a 3x upsert throughput improvement over the bloom filter.
  Problems with the bucket index:
  - It supports only a preset number of buckets, which cannot be changed afterwards.
  - Hash buckets and file groups map one-to-one, which causes data skew and leaves file group size unbounded. The bucket index has no way to fix either.
- **Solution**
  Introduce a **Consistent Hashing Index** that supports bucket resizing (by splitting or merging) while leaving the data of most file groups untouched. The bucket number can then be adjusted dynamically **in the background**, with as little impact on downstream consumers as possible.

# Background

- Bucket Index
  Buckets and file groups correspond one-to-one.
  ![](https://github.com/apache/hudi/raw/master/rfc/rfc-42/basic_bucket_hashing.png)
  Fix: allow one bucket to contain multiple file groups.
- Consistent Hash
  Re-hashing only needs to change a few local buckets:
  ![](https://github.com/apache/hudi/raw/master/rfc/rfc-42/consistent_hashing.png)
  Bucket #2 is split into two buckets, so the total number of buckets grows by 1. This requires an extra range mapping layer that connects hash values to buckets.

# Implementation

![](https://github.com/apache/hudi/blob/master/rfc/rfc-42/basic_bucket_hashing.png)

## Hashing Metadata

## Bucket Resizing (Splitting & Merging)

## Concurrent Writer & Reader

## Performance
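
Returning to the range mapping layer from the Background section: the sketch below is one minimal way to picture it, not Hudi's actual implementation. The class `RangeMapping`, the constant `HASH_SPACE`, the use of `zlib.crc32` as the hash function, and the midpoint split policy are all assumptions made for illustration. Each bucket owns a contiguous range of hash values, and splitting one bucket only inserts a new boundary inside that bucket's range.

```python
import bisect
import zlib

HASH_SPACE = 2**32  # assumed size of the hash value space (hypothetical choice)


class RangeMapping:
    """Maps hash values to buckets through sorted, inclusive range ends."""

    def __init__(self, num_buckets):
        step = HASH_SPACE // num_buckets
        # Bucket at position i owns hash values in (ends[i-1], ends[i]].
        self.ends = [step * (i + 1) - 1 for i in range(num_buckets - 1)]
        self.ends.append(HASH_SPACE - 1)
        self.ids = list(range(num_buckets))
        self.next_id = num_buckets

    def bucket_for(self, key):
        # crc32 stands in for whatever hash function the index really uses.
        h = zlib.crc32(key.encode("utf-8")) % HASH_SPACE
        # First range whose inclusive end is >= h.
        idx = bisect.bisect_left(self.ends, h)
        return self.ids[idx]

    def split(self, bucket_id):
        """Split one bucket at the midpoint of its range; return the new bucket id."""
        idx = self.ids.index(bucket_id)
        start = self.ends[idx - 1] + 1 if idx > 0 else 0
        if start == self.ends[idx]:
            raise ValueError("range too small to split")
        mid = (start + self.ends[idx]) // 2
        new_id = self.next_id
        self.next_id += 1
        # The new bucket takes the lower half; the old bucket keeps the upper half.
        self.ends.insert(idx, mid)
        self.ids.insert(idx, new_id)
        return new_id


mapping = RangeMapping(4)
keys = [f"key-{i}" for i in range(1000)]
before = {k: mapping.bucket_for(k) for k in keys}
new_bucket = mapping.split(2)
after = {k: mapping.bucket_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all from bucket 2 to bucket {new_bucket}")
```

Only keys that previously hashed into bucket 2 can change buckets here, and every other bucket keeps its keys. That is the property that lets resizing leave most file groups alone.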