Add complexity estimation of iterating over HashSet and HashMap #97215

AngelicosPhosphoros · 2022-05-20T12:57:18Z

It is not obvious (at least to me) that the complexity of iterating over a hash table depends on its capacity rather than its length, especially compared with other containers such as Vec or String. I think this behaviour is worth mentioning.

I ran a benchmark that measures iteration time for maps of length 50 with different capacities and got these results:

capacity - time
64       - 203.87 ns
256      - 351.78 ns
1024     - 607.87 ns
4096     - 965.82 ns
16384    - 3.1188 us

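If you want to dig into why it behaves this way, you can look at the current implementation in the [hashbrown code](https://github.com/rust-lang/hashbrown/blob/f3a9f211d06f78c5beb81ac22ea08fdc269e068f/src/raw/mod.rs#L1933).

The benchmark code will be posted in the PR related to this commit.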
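As a rough illustration of the practical consequence, here is a minimal sketch using only the standard library. The helper names (`sparse_map`, `sum_values`, `compact`) are hypothetical and not part of this PR. The sketch assumes that `shrink_to_fit` releases most of the unused capacity for a map this small, so a later iteration has fewer buckets to walk.

```rust
use std::collections::HashMap;

/// Builds a map with `len` entries but room for at least `capacity` entries.
/// (Hypothetical helper for illustration only.)
fn sparse_map(len: usize, capacity: usize) -> HashMap<usize, usize> {
    let mut map = HashMap::with_capacity(capacity);
    map.extend((0..len).map(|i| (i, i * 2)));
    map
}

/// Sums all values. The iterator walks the table's buckets, so the cost
/// of this call follows the capacity, not the number of entries.
fn sum_values(map: &HashMap<usize, usize>) -> usize {
    map.values().copied().sum()
}

/// Drops unused capacity, which may make later iterations cheaper
/// when the map is much larger than its contents.
fn compact(mut map: HashMap<usize, usize>) -> HashMap<usize, usize> {
    map.shrink_to_fit();
    map
}
```

Calling `compact(sparse_map(50, 16384))` keeps all 50 entries and yields the same sum, but leaves far fewer buckets to scan. Whether the reallocation pays off depends on how often the map is iterated afterwards.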

rust-highfive · 2022-05-20T12:57:20Z

Hey! It looks like you've submitted a new PR for the library teams!

If this PR contains changes to any rust-lang/rust public library APIs then please comment with r? rust-lang/libs-api @rustbot label +T-libs-api -T-libs to request review from a libs-api team reviewer. If you're unsure where your change falls no worries, just leave it as is and the reviewer will take a look and make a decision to forward on if necessary.

Examples of T-libs-api changes:

Stabilizing library features
Introducing insta-stable changes such as new implementations of existing stable traits on existing stable types
Introducing new or changing existing unstable library APIs (excluding permanently unstable features / features without a tracking issue)
Changing public documentation in ways that create new stability guarantees
Changing observable runtime behavior of library APIs

rust-highfive · 2022-05-20T12:57:22Z

r? @thomcc

(rust-highfive has picked a reviewer for you, use r? to override)

AngelicosPhosphoros · 2022-05-20T12:57:28Z

I also tested what happens if the number of iterations is limited to the length of the hash set, and found that this slightly improves performance, because hashbrown keeps searching the remaining buckets for items even after it has already yielded every element.

I don't think this should be mentioned in the docs, because:

the difference is really small;
I am planning to add this check to the hashbrown crate later.

Benchmark code

use std::collections::HashSet;

use criterion::{black_box, criterion_group, criterion_main, BatchSize, BenchmarkId, Criterion};

use rand::{Rng, SeedableRng};
use rand_chacha::ChaCha20Rng;

fn get_random_ints(amount: usize) -> Vec<usize> {
    let mut v = vec![0; amount];
    ChaCha20Rng::seed_from_u64(500).fill(v.as_mut_slice());
    v
}

pub fn bench_iterations(c: &mut Criterion) {
    let random_ints = get_random_ints(50);
    let make_hash_set = |capacity: usize| -> HashSet<usize> {
        let mut set = HashSet::with_capacity(capacity);
        set.extend(random_ints.iter());
        set
    };

    let mut group = c.benchmark_group("HashTableIteration");
    let caps = [64, 256, 1024, 4096, 16384];
    for capacity in caps {
        group.bench_with_input(
            BenchmarkId::new("HashTableIterationFull", capacity),
            &capacity,
            |b, &capacity| {
                b.iter_batched(
                    || make_hash_set(capacity),
                    |set| {
                        let sum: usize = set.iter().copied().sum();
                        (black_box(set), black_box(sum))
                    },
                    BatchSize::LargeInput,
                )
            },
        );
    }
    for capacity in caps {
        group.bench_with_input(
            BenchmarkId::new("HashTableIterationLimited", capacity),
            &capacity,
            |b, &capacity| {
                b.iter_batched(
                    || make_hash_set(capacity),
                    |set| {
                        let sum: usize = set.iter().copied().take(set.len()).sum();
                        (black_box(set), black_box(sum))
                    },
                    BatchSize::LargeInput,
                )
            },
        );
    }
    // Finish the group so that Criterion can generate its summary.
    group.finish();
}

criterion_group!(benches, bench_iterations);
criterion_main!(benches);

Criterion output

HashTableIteration/HashTableIterationFull/64
                        time:   [203.04 ns 203.87 ns 204.99 ns]
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
HashTableIteration/HashTableIterationFull/256
                        time:   [351.44 ns 351.78 ns 352.16 ns]
Found 16 outliers among 100 measurements (16.00%)
  8 (8.00%) high mild
  8 (8.00%) high severe
HashTableIteration/HashTableIterationFull/1024
                        time:   [606.89 ns 607.87 ns 608.93 ns]
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe
HashTableIteration/HashTableIterationFull/4096
                        time:   [963.44 ns 965.82 ns 968.33 ns]
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild
HashTableIteration/HashTableIterationFull/16384
                        time:   [3.1129 us 3.1188 us 3.1250 us]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe
HashTableIteration/HashTableIterationLimited/64
                        time:   [193.94 ns 194.34 ns 194.76 ns]
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe
HashTableIteration/HashTableIterationLimited/256
                        time:   [343.02 ns 343.62 ns 344.36 ns]
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe
HashTableIteration/HashTableIterationLimited/1024
                        time:   [603.05 ns 604.44 ns 605.92 ns]
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
HashTableIteration/HashTableIterationLimited/4096
                        time:   [951.03 ns 952.52 ns 954.32 ns]
HashTableIteration/HashTableIterationLimited/16384
                        time:   [3.0661 us 3.0733 us 3.0806 us]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

I don't think there is any reason to include these benchmarks in our codebase, because they don't have much value on their own.


thomcc · 2022-05-20T17:28:07Z

@bors r+ rollup

bors · 2022-05-20T17:28:10Z

📌 Commit de97d73 has been approved by thomcc

…askrgr

Rollup of 7 pull requests

Successful merges:

- rust-lang#97109 (Fix misleading `cannot infer type for type parameter` error)
- rust-lang#97187 (Reverse condition in Vec::retain_mut doctest)
- rust-lang#97201 (Fix typo)
- rust-lang#97203 (Minor tweaks to rustc book summary formatting.)
- rust-lang#97208 (Do not emit the lint `unused_attributes` for *inherent* `#[doc(hidden)]` associated items)
- rust-lang#97215 (Add complexity estimation of iterating over HashSet and HashMap)
- rust-lang#97220 (Add regression test for #81827)

Failed merges:

r? `@ghost`
`@rustbot` modify labels: rollup

Add shortcircuit in iteration if we yielded all elements

The current implementation is slightly slower than `set.iter().take(set.len())`. See my comment [here](rust-lang/rust#97215 (comment)). So why not avoid the extra integer added by `Iterator::take` by building the limiting logic into our iterator itself?

I don't really know how this change affects [reflect_toggle_full](https://github.com/rust-lang/hashbrown/blob/f3a9f211d06f78c5beb81ac22ea08fdc269e068f/src/raw/mod.rs#L2019) and the implementation of [FusedIterator](https://github.com/rust-lang/hashbrown/blob/f3a9f211d06f78c5beb81ac22ea08fdc269e068f/src/raw/mod.rs#L2150). Maybe I should make the inner iterator "jump" to the end of its memory block?
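The short-circuit idea can be sketched with a toy model. This is not hashbrown's actual layout or iterator: `SlotIter` and its fields are hypothetical names, and the table is modelled as a plain slice of `Option<T>` in which `None` marks an empty slot. The `probed` counter exists only to make the saved work visible.

```rust
/// Toy stand-in for a sparse bucket array; `None` marks an empty slot.
/// Illustrative model only, not hashbrown's real data structure.
struct SlotIter<'a, T> {
    slots: std::slice::Iter<'a, Option<T>>,
    remaining: usize, // occupied slots not yet yielded
    probed: usize,    // slots inspected so far (demonstration only)
}

impl<'a, T> SlotIter<'a, T> {
    fn new(slots: &'a [Option<T>]) -> Self {
        // A real table would store its length; the toy model counts it here.
        let remaining = slots.iter().filter(|s| s.is_some()).count();
        SlotIter { slots: slots.iter(), remaining, probed: 0 }
    }
}

impl<'a, T> Iterator for SlotIter<'a, T> {
    type Item = &'a T;

    fn next(&mut self) -> Option<&'a T> {
        // Short-circuit: once every item has been yielded, do not scan
        // the trailing empty slots at all.
        if self.remaining == 0 {
            return None;
        }
        for slot in self.slots.by_ref() {
            self.probed += 1;
            if let Some(item) = slot {
                self.remaining -= 1;
                return Some(item);
            }
        }
        None
    }
}
```

With three items stored in a 1024-slot table and the last one at index 40, this iterator inspects 41 slots instead of all 1024. The counter takes the place of the extra integer that `Iterator::take` would otherwise carry.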

rust-highfive assigned thomcc May 20, 2022

rustbot added the T-libs Relevant to the library team, which will review and decide on the PR/issue. label May 20, 2022

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label May 20, 2022

This was referenced May 20, 2022

Add shortcircuit in iteration if we yielded all elements rust-lang/hashbrown#338

Merged

What is complexity of iteration over indexmap? indexmap-rs/indexmap#227

Closed

est31 requested changes May 20, 2022

library/std/src/collections/hash/map.rs (outdated)

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 20, 2022

matthiaskrgr mentioned this pull request May 20, 2022

Rollup of 7 pull requests #97224

Merged

bors merged commit ac634bc into rust-lang:master May 20, 2022

rustbot added this to the 1.63.0 milestone May 20, 2022

AngelicosPhosphoros deleted the add_hashtable_iteration_complexity_note branch May 20, 2022 21:13
