Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fast counts #168

Merged
merged 19 commits into from
Feb 11, 2022
Merged

Fast counts #168

merged 19 commits into from
Feb 11, 2022

Conversation

saik0
Copy link
Contributor

@saik0 saik0 commented Jan 31, 2022

See #167

Add methods to compute the cardinality of two bitmaps without materializing them. Once the naming conventions and overall structure for union is approved I will add tests and the remaining ops

TODO:

  • Bitmap
  • Treemap
    • union
    • intersection
    • difference
    • symmetric difference
    • tests

Copy link
Member

@Kerollmops Kerollmops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, it would be indeed faster to compute the cardinality of operations 😃

@saik0
Copy link
Contributor Author

saik0 commented Feb 6, 2022

dd70be9 greatly simplified the implementation, though it is slower without cached len. This will be much faster once #45 is fixed

Cardinality of all ops can all be trivially implemented in terms of the cardinality of the intersection |A ∩ B|.

|A \ B| = |A| − |A ∩ B|
|A ∪ B| = |A| + |B| − |A ∩ B|
|A △ B| = |A| + |B| − 2|A ∩ B|

Intersection cardinality is cheapest to compute, so it's both simpler to implement and has better runtime performance once we have the len cached.

@saik0
Copy link
Contributor Author

saik0 commented Feb 10, 2022

Benchmarks done now. It's about 0.5x compared to materializing container. This is good enough to ship even without #45 it's still a win

@saik0 saik0 marked this pull request as ready for review February 11, 2022 03:45
@saik0 saik0 changed the title Draft: Fast counts Fast counts Feb 11, 2022
@saik0
Copy link
Contributor Author

saik0 commented Feb 11, 2022

Curious how much this can improve milli. I would think this would be used a lot for any document distance metrics.

@Kerollmops
Copy link
Member

Kerollmops commented Feb 11, 2022

Curious how much this can improve milli. I would think this would be used a lot for any document distance metrics.

I'm not sure about that, the most used features of roaring are obviously set operations and multi-ops (unions of a lot of bitmaps). It can take up to 70-80% (on ~35-50ms) of the search time sometimes!

flamegraph after.svg.zip

@saik0
Copy link
Contributor Author

saik0 commented Feb 11, 2022

Curious how much this can improve milli. I would think this would be used a lot for any document distance metrics.

I'm not sure about that, the most used features of roaring are obviously set operations and multi-ops (unions of a lot of bitmaps). It can take up to 70-80% (on ~35-50ms) of the search time sometimes!

flamegraph after.svg.zip

Without seeing the use site I cant tell if you need to materialize the containers or not

Copy link
Member

@Kerollmops Kerollmops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thank you!

src/bitmap/ops.rs Outdated Show resolved Hide resolved
src/treemap/ops.rs Outdated Show resolved Hide resolved
src/treemap/ops.rs Outdated Show resolved Hide resolved
src/treemap/ops.rs Outdated Show resolved Hide resolved
Comment on lines 47 to 54
fn intersection_len_impl(lhs: &RoaringTreemap, rhs: &RoaringTreemap) -> u64 {
let mut len = 0;
for (key, right_rb) in &rhs.map {
if let Some(left_rb) = lhs.map.get(key) {
len += left_rb.intersection_len(right_rb);
}
}
len
}

// We make sure that we apply the intersection operation on the biggest map.
if self.map.len() < other.map.len() {
intersection_len_impl(self, other)
} else {
intersection_len_impl(other, self)
}
self.pairs(other)
.map(|pair| match pair {
(Some(..), None) => 0,
(None, Some(..)) => 0,
(Some(lhs), Some(rhs)) => lhs.intersection_len(rhs),
(None, None) => 0,
})
.sum()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay, was thinking about that but forget a second after, nice change here! ❤️

@Kerollmops
Copy link
Member

Kerollmops commented Feb 11, 2022

Without seeing the use site I cant tell if you need to materialize the containers or not

I assure you that those bitmaps are computed then intersected to reduce the number of candidates 😄
Milli doesn't only return the counts of them, however, most of the time a lot of operations are applied in unspecified orders and could probably be optimized by using some kind of lazy API.

Copy link
Member

@Kerollmops Kerollmops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thank you again!
bors merge

@bors
Copy link
Contributor

bors bot commented Feb 11, 2022

Build succeeded:

@bors bors bot merged commit 100ce5f into RoaringBitmap:master Feb 11, 2022
not-jan pushed a commit to not-jan/roaring-rs that referenced this pull request Aug 31, 2022
168: Fast counts r=Kerollmops a=saik0

See RoaringBitmap#167

Add methods to compute the cardinality of two bitmaps without materializing them. Once the naming conventions and overall structure for union is approved I will add tests and the remaining ops

TODO:
 - Bitmap
   - [x] union
   - [x] intersection
   - [x] difference
   - [x] symmetric difference
   - [x] tests
   - [x] benchmarks
     - Blocked by RoaringBitmap#129
 - Treemap
   - [x] union
   - [x] intersection
   - [x] difference
   - [x] symmetric difference
   - [x] tests

Co-authored-by: saik0 <github@saik0.net>
Co-authored-by: Joel Pedraza <github@saik0.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants