Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: collect_columns quadratic complexity #11843

Merged
merged 1 commit into from
Aug 6, 2024

Conversation

crepererum
Copy link
Contributor

Which issue does this PR close?

-

Rationale for this change

Fix accidental $O(n^2)$ in collect_columns.

What changes are included in this PR?

There are the following ways to insert a clone into a hash set:

  • clone before check: Basically set.insert(x.clone()). That's rather expensive if you have a lot of duplicates.
  • iter & clone: That's what we do right now, but that's in $O(n^2)$.
  • check & insert: Basically if !set.contains(x) {set.insert(x.clone())} That requires two hash probes though.
  • entry-based API: Sadly the stdlib doesn't really offer any handle/entry-based APIs yet (see Tracking issue for HashSet entry APIs rust-lang/rust#60896 ), but hashbrown does, so we can use the nice set.get_or_insert_owned(x) which will only clone the reference if it doesn't exists yet and only hashes once.

We now use the last approach.

Are these changes tested?

Functional tests still pass, performance change is only visible for large inputs (like 100+ columns).

Are there any user-facing changes?

⚠️ Breaking change: collect_columns now returns hashbrown::HashSet instead of std::collections::HashSet!

Fix accidental $O(n^2)$ in `collect_columns`.

There are the following ways to insert a clone into a hash set:

- **clone before check:** Basically `set.insert(x.clone())`. That's
  rather expensive if you have a lot of duplicates.
- **iter & clone:** That's what we do right now, but that's in $O(n^2)$.
- **check & insert:** Basically `if !set.contains(x) {set.insert(x.clone())}`
  That requires two hash probes though.
- **entry-based API:** Sadly the stdlib doesn't really offer any
  handle/entry-based APIs yet (see rust-lang/rust#60896 ),
  but `hashbrown` does, so we can use the nice
  `set.get_or_insert_owned(x)` which will only clone the reference if it
  doesn't exists yet and only hashes once.

We now use the last approach.
@github-actions github-actions bot added the physical-expr Physical Expressions label Aug 6, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thank you @crepererum

I'll also run the planning benchmarks to see if I can see any improvement

@alamb
Copy link
Contributor

alamb commented Aug 6, 2024

I ran the benchmarks and don't see any negative (or positive) direct effect

Details

crepererum/fix_collect_columns_o2

Benchmarking logical_select_one_from_700: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.5s, enable flat sampling, or reduce sample count to 60.
logical_select_one_from_700
                        time:   [1.0805 ms 1.0827 ms 1.0849 ms]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

physical_select_one_from_700
                        time:   [3.4578 ms 3.4601 ms 3.4625 ms]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild

logical_select_all_from_1000
                        time:   [17.965 ms 17.977 ms 17.990 ms]
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

physical_select_all_from_1000
                        time:   [43.469 ms 43.504 ms 43.547 ms]
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

Then main

Benchmarking logical_select_one_from_700: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.5s, enable flat sampling, or reduce sample count to 60.
logical_select_one_from_700
                        time:   [1.0891 ms 1.0908 ms 1.0926 ms]
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high mild

physical_select_one_from_700
                        time:   [3.5458 ms 3.5485 ms 3.5515 ms]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

logical_select_all_from_1000
                        time:   [18.000 ms 18.012 ms 18.025 ms]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

physical_select_all_from_1000
                        time:   [43.609 ms 43.639 ms 43.671 ms]
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

@alamb alamb merged commit 16a3557 into apache:main Aug 6, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants