criterion-book - JonahGao's Notes

#rust

# 1. Criterion.rs

- Ported from [Haskell's Criterion](https://hackage.haskell.org/package/criterion)
- **Github**: [bheisler/criterion.rs](https://github.com/bheisler/criterion.rs)
- **API Doc**: http://bheisler.github.io/criterion.rs/criterion/
- Enable debug output:

  ````bash
  CRITERION_DEBUG=1 cargo bench
  ````

## 1.1. Getting Started

- Cargo compiles each benchmark as a separate crate, independent of the main crate (the crate under test).
  - The library crate has to be imported as an external crate.
  - Only public functions can be benchmarked.

```rust
use bench::fibonacci;
use criterion::{black_box, criterion_group, criterion_main, Criterion};

pub fn criterion_benchmark(c: &mut Criterion) {
    // "fib 20" is the benchmark ID; the closure is the code being measured.
    c.bench_function("fib 20", |b| b.iter(|| fibonacci(black_box(20))));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
```

- `criterion_main` macro: expands to a `main` function that runs every benchmark in the given groups
- `criterion_group` macro: defines a group
- `bench_function()`: takes the benchmark ID and the closure to run

-----------

# 2. User Guide

## 2.1. Migrating from libtest

## 2.2. Command-Line Output

Sample output:

```sh
$ cargo bench -- --verbose
Benchmarking alloc
Benchmarking alloc: Warming up for 1.0000 s
Benchmarking alloc: Collecting 100 samples in estimated 13.354 s (5050 iterations)
Benchmarking alloc: Analyzing
alloc                   time:   [2.5094 ms 2.5306 ms 2.5553 ms]
                        thrpt:  [391.34 MiB/s 395.17 MiB/s 398.51 MiB/s]
                        change: [-38.292% -37.342% -36.524%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe
slope  [2.5094 ms 2.5553 ms] R^2            [0.8660614 0.8640630]
mean   [2.5142 ms 2.5557 ms] std. dev.      [62.868 us 149.50 us]
median [2.5023 ms 2.5262 ms] med. abs. dev. [40.034 us 73.259 us]
```

### Warmup

- Criterion.rs automatically iterates the function under test for a configurable warmup period (3 seconds by default).
- Purpose: warm up the CPU caches and, if applicable, the file system caches.

### Collecting Samples

- Criterion.rs automatically iterates the function a varying number of times and records how long the iterations take.
- The sampling time is configurable.

### Time

```sh
time:   [2.5094 ms 2.5306 ms 2.5553 ms]
thrpt:  [391.34 MiB/s 395.17 MiB/s 398.51 MiB/s]
```

- Shows a confidence interval for the per-iteration time of this benchmark.
  - The left and right values are the lower and upper bounds.
  - The middle value is the best estimate.
- The confidence level is configurable.
  - A higher level (e.g. 99%) widens the interval and thus gives the user less information about the true slope.
  - A lower level (e.g. 90%) narrows the interval, but the user can be less confident that it contains the true slope.
  - 95% is usually a good balance.
- The confidence intervals are generated by [bootstrap resampling](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)).
  - The number of bootstrap samples is configurable and defaults to 100,000.
- Optionally, throughput can be reported as well (bytes or elements per second).

### Change

```sh
change: [-38.292% -37.342% -36.524%] (p = 0.00 < 0.05)
Performance has improved.
```

- Each benchmark run saves its statistics under `target/criterion`.
- Later runs load the previous data and compare it with the current sample.
- The second line is a one-sentence summary: performance has improved or regressed.
- Changes within `noise_threshold` are ignored; the default is `+-2%`.

### Detecting Outliers

```sh
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe
```

- Detects samples that are unusually high or low and reports them as outliers.
- A large number of outliers means the benchmark results are noisy and should be viewed with appropriate skepticism.
- Possible causes of many outliers:
  - changing load on the machine
  - thread/process scheduling
  - irregular run time of the target function itself
- Countermeasures:
  - increase the measurement time to reduce the influence of outliers
  - increase the warmup time

### Additional Statistics

```sh
slope  [2.5094 ms 2.5553 ms] R^2            [0.8660614 0.8640630]
mean   [2.5142 ms 2.5557 ms] std. dev.      [62.868 us 149.50 us]
median [2.5023 ms 2.5262 ms] med. abs. dev. [40.034 us 73.259 us]
```

- Criterion.rs uses linear regression to compute the time per iteration.
- The first line shows the confidence interval for the slope of the regression; the R^2 field gives the goodness-of-fit values for the lower and upper bounds of that interval.
  - A low R^2 may indicate that the benchmark does a different amount of work in each iteration.
- The second line shows confidence intervals for the mean and standard deviation of the per-iteration times.
  - A std. dev. that is large relative to the time values above indicates noise.
- The third line is like the second, except that it uses the median and the [median absolute deviation](https://en.wikipedia.org/wiki/Median_absolute_deviation).

## 2.3. Command-Line Options

- Run `cargo bench -- -h` to list all options.
- Filter benchmarks: `cargo bench -- <filter>`
  - `<filter>` is a regular expression matched against the benchmark ID.
  - For example, `cargo bench -- fib_\d+` matches `fib_20`.
- Print more detailed output: `cargo bench -- --verbose`
- Disable colored output: `cargo bench -- --color never`
- Disable plot generation: `cargo bench -- --noplot`
- Run for a fixed duration without saving, analyzing, or plotting the results: `cargo bench -- --profile-time <num_seconds>`
  - Use case: profiling
- Save a baseline: `cargo bench -- --save-baseline <name>`
- Compare against an existing baseline: `cargo bench -- --baseline <name>`
- Only check that the benchmarks run successfully, ignoring performance: `cargo test --benches`
- Set the default plotting backend: `cargo bench -- --plotting-backend gnuplot`
- Change the output format: `cargo bench -- --output-format <name>`. Supported formats:
  - `criterion`: the native format
  - `bencher`: similar to the bencher crate or libtest

### Baselines

- By default, each run is compared with the previous run.
- Custom baselines are supported:
  - `--save-baseline <name>` compares against the named baseline, then overwrites it
  - `--baseline <name>` only compares against the named baseline, without overwriting it
  - `--load-baseline <name>` loads the named baseline as the new data set

```sh
git checkout master
cargo bench -- --save-baseline master
git checkout feature
cargo bench -- --save-baseline feature
git checkout optimizations

# Some optimization work here

# Measure again
cargo bench
# Now compare against the stored baselines without overwriting it or re-running the measurements
cargo bench -- --load-baseline new --baseline master
cargo bench -- --load-baseline new --baseline feature
```

## 2.4. HTML Report

- Generated at `target/criterion/report/index.html`
- Plots are drawn with gnuplot by default (falling back to the [plotters](https://github.com/38/plotters) crate if gnuplot is not installed)

## 2.5. Plots & Graphs

### File Structure

- Plots and saved data are stored under `target/criterion/$BENCHMARK_NAME/`.

```
$BENCHMARK_NAME/
├── base/
│  ├── raw.csv
│  ├── estimates.json
│  ├── sample.json
│  └── tukey.json
├── change/
│  └── estimates.json
├── new/
│  ├── raw.csv
│  ├── estimates.json
│  ├── sample.json
│  └── tukey.json
└── report/
   ├── both/
   │  ├── pdf.svg
   │  ├── regression.svg
   │  └── iteration_times.svg
   ├── change/
   │  ├── mean.svg
   │  ├── median.svg
   │  └── t-test.svg
   ├── index.html
   ├── MAD.svg
   ├── mean.svg
   ├── median.svg
   ├── pdf.svg
   ├── pdf_small.svg
   ├── regression.svg (optional)
   ├── regression_small.svg (optional)
   ├── iteration_times.svg (optional)
   ├── iteration_times_small.svg (optional)
   ├── relative_pdf_small.svg
   ├── relative_regression_small.svg (optional)
   ├── relative_iteration_times_small.svg (optional)
   ├── SD.svg
   └── slope.svg
```

- The `new` folder contains the statistics of the most recent run.
- The `base` folder contains the most recent run of the baseline named `base`.
- Plots live in the `report` folder; only the data of the last run is kept.
  - `report/both`: both runs shown on one plot
  - `report/change`: the difference between the two runs

## 2.6. Benchmarking With Inputs

- Benchmarks can be run with one or more different input values to study how performance varies with the input.

### One Input

```rust
use criterion::BenchmarkId;
use criterion::Criterion;
use criterion::{criterion_group, criterion_main};

fn do_something(size: usize) {
    // Do something with the size
}

fn from_elem(c: &mut Criterion) {
    let size: usize = 1024;

    // The input is passed to the closure as its second argument.
    c.bench_with_input(BenchmarkId::new("input_example", size), &size, |b, &s| {
        b.iter(|| do_something(s));
    });
}

criterion_group!(benches, from_elem);
criterion_main!(benches);
```

- The input is passed through `black_box` automatically; there is no need to call it by hand.

### A Range Of Values

- A `BenchmarkGroup` can be used to compare the performance of a function over a range of inputs.

```rust
use std::iter;

use criterion::BenchmarkId;
use criterion::Criterion;
use criterion::Throughput;
use criterion::{criterion_group, criterion_main};

fn from_elem(c: &mut Criterion) {
    static KB: usize = 1024;

    let mut group = c.benchmark_group("from_elem");
    for size in [KB, 2 * KB, 4 * KB, 8 * KB, 16 * KB].iter() {
        // Each iteration processes `size` bytes.
        group.throughput(Throughput::Bytes(*size as u64));
        group.bench_with_input(BenchmarkId::from_parameter(size), size, |b, &size| {
            b.iter(|| iter::repeat(0u8).take(size).collect::<Vec<_>>());
        });
    }
    group.finish();
}

criterion_group!(benches, from_elem);
criterion_main!(benches);
```

- `throughput` function: declares that each iteration processes `size` bytes

## 2.7. Advanced Configuration

### Configuring Sample Count & Other Statistical Settings

- Users can adjust certain statistical parameters.
- Option 1: configure them on a `BenchmarkGroup`

```rust
use criterion::*;

fn my_function() {
    // ...
}

fn bench(c: &mut Criterion) {
    let mut group = c.benchmark_group("sample-size-example");
    // Configure Criterion.rs to detect smaller differences and increase sample size to improve
    // precision and counteract the resulting noise.
    group.significance_level(0.1).sample_size(500);
    group.bench_function("my-function", |b| b.iter(|| my_function()));
    group.finish();
}

criterion_group!(benches, bench);
criterion_main!(benches);
```

- Option 2: configure them with the full form of the `criterion_group` macro

```rust
criterion_group!{
    name = benches;
    // This can be any expression that returns a `Criterion` object.
    config = Criterion::default().significance_level(0.1).sample_size(500);
    targets = bench
}
```

### Throughput Measurements

### Sampling Mode

## 2.8. Comparing Functions

- Criterion.rs can automatically benchmark several implementations of a function and compare them.

## 2.9. CSV Output

- Raw measurements are written to `raw.csv`; combined with [lolbench](https://github.com/anp/lolbench), this can be used to track performance over time.

## 2.10. Known Limitations

- Benchmarks must provide their own `main` function (via the `criterion_main` macro), which leads to some limitations:
  - Benchmarks cannot live in the `src` directory.
  - Non-`pub` functions cannot be benchmarked.
  - Functions in binary crates cannot be benchmarked (a binary crate cannot be a dependency of another crate).
  - Functions in crates that do not provide an rlib cannot be benchmarked.
- The `black_box` function is less reliable than the official version and may still allow dead-code elimination.
  - On nightly Rust, the `real_blackbox` feature can be used:

```toml
criterion = { version = '...', features=['real_blackbox'] }
```

## 2.11. Bencher Compatibility Layer

- A small crate, `criterion_bencher_compat`, provides compatibility with `bencher`.

## 2.12. Timing Loops

- The [`Bencher`](https://bheisler.github.io/criterion.rs/criterion/struct.Bencher.html) struct provides several functions that implement different timing loops for measuring the performance of a function.

### iter

- Runs the benchmark N times in a tight loop and records the time taken by the whole loop.
- Lightweight: time is only recorded before and after the loop.
- Problems:
  - If a returned value implements `Drop`, the time spent in `drop` is included in the measurement.
  - No per-iteration setup. For example, a sorting function needs unsorted data to be prepared first, but the preparation should not be measured.

### iter_with_large_drop

- Solves the drop problem.
- Collects the benchmark results into a `Vec` and drops them after the measurement has finished.
- Problem: memory usage

### iter_batched/iter_batched_ref

- Take two closures:
  - the first generates the setup data
  - the second is the function being benchmarked
- Generate a batch of inputs, then measure the time taken to process that batch.
- Like `iter_with_large_drop`, they also collect the results into a `Vec` to avoid the drop problem.
- Suited to inputs that must be generated for each iteration; if the input is fixed, `iter` can be used directly.
- Take a parameter that controls the batch size.

## 2.13. Custom Measurements

## 2.14. Profiling

### Note on running benchmark executables directly

- Pass the `--bench` argument when running a benchmark executable directly.

### --profile-time

- Runs each benchmark for the given duration without analyzing or saving results, so that Criterion.rs's own code stays out of the profile.

### Implementing In-Process Profiling Hooks

```rust
use std::path::Path;

pub trait Profiler {
    // Called before the profiled benchmark loop starts.
    fn start_profiling(&mut self, benchmark_id: &str, benchmark_dir: &Path);
    // Called after the profiled benchmark loop ends.
    fn stop_profiling(&mut self, benchmark_id: &str, benchmark_dir: &Path);
}
```

- These hooks are called automatically in `--profile-time` mode.

## 2.15. Custom Test Framework

- Requires a nightly compiler
- Uses the `#[criterion]` attribute

## 2.16. Benchmarking async functions

- An async runtime has to be provided.
- Async functions carry extra overhead; for small functions, benchmarking synchronously is recommended.

--------------

# 3. cargo-criterion

- cargo-criterion is an experimental Cargo extension that can replace `cargo bench`.

----------

# FAQ

## Unrecognized Options

- Run only the Criterion benchmark: `cargo bench --bench my_benchmark -- --verbose`
- Or disable benchmarks for the lib / app crate, for example:

```
[lib]
bench = false
```

## When Should I Use criterion::black_box

- `black_box` is a function that prevents certain compiler optimizations.
- For example, benchmarks often call a function with constant parameters, and rustc may optimize the call away (replacing the function call with a constant).

-----------
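The `black_box` point in the FAQ can be sketched without Criterion.rs at all. The example below is a minimal illustration, not the library's implementation: it assumes the standard library's `std::hint::black_box` as a stand-in for `criterion::black_box`, and both `fibonacci` and `time_loop` are hypothetical helpers written only for this sketch.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Naive recursive Fibonacci, used only as a workload for this sketch.
fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 0,
        1 => 1,
        _ => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

/// Hypothetical helper (not part of Criterion.rs): calls `f` `iters` times
/// and returns the total elapsed time.
fn time_loop<F: FnMut() -> u64>(iters: u32, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        // Routing the result through black_box keeps the call from being
        // discarded as dead code.
        black_box(f());
    }
    start.elapsed()
}

fn main() {
    // The argument is a visible constant, so an optimizing build is free to
    // fold the whole call into a precomputed value.
    let constant_arg = time_loop(1_000, || fibonacci(20));

    // The argument is hidden behind black_box, so the compiler has to treat
    // it as unknown and actually perform the call.
    let opaque_arg = time_loop(1_000, || fibonacci(black_box(20)));

    println!("constant argument:  {:?}", constant_arg);
    println!("black_box argument: {:?}", opaque_arg);
}
```

Whether the two timings actually differ depends on the optimization level and compiler version; in an unoptimized build they are likely to be close, which is why the comparison is only meaningful in release mode.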