Feature Request: RoaringBitmap::from_bitmap_bytes #288

Open: 8 commits to merge into base branch main

@lemolatoon lemolatoon commented Aug 21, 2024

The Feature Explanation

  • Adds a new constructor for RoaringBitmap:
 pub fn from_bitmap_bytes(offset: u32, bytes: &[u8]) -> RoaringBitmap

Function Behavior

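  • Interprets bytes as a bitmap in which the bytes are taken in order and the bits within each byte are numbered from the least significant bit, and constructs a RoaringBitmap from it
  • offset shifts the index of every bit in the passed bitmap
  • If offset is not aligned to the number of bits in a Container's Store::Bitmap (65,536, the number of bits in a Box<[u64; 1024]>), this function panics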
use roaring::RoaringBitmap;

let bytes = [0b00000101, 0b00000010, 0b00000000, 0b10000000];
//             ^^^^^^^^    ^^^^^^^^    ^^^^^^^^    ^^^^^^^^
//             76543210          98
let rb = RoaringBitmap::from_bitmap_bytes(0, &bytes);
assert!(rb.contains(0));
assert!(!rb.contains(1));
assert!(rb.contains(2));
assert!(rb.contains(9));
assert!(rb.contains(31));

let rb = RoaringBitmap::from_bitmap_bytes(8, &bytes);
assert!(rb.contains(8));
assert!(!rb.contains(9));
assert!(rb.contains(10));
assert!(rb.contains(17));
assert!(rb.contains(39));
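
To make the bit numbering concrete, here is a minimal std-only sketch of the mapping used in the example above. The helper `lsb0_bytes_to_indices` is a hypothetical name for illustration and is not part of the roaring crate; it simply lists the values that the resulting bitmap is expected to contain.

```rust
/// Hypothetical helper (not part of the roaring crate): lists the values
/// encoded by `bytes`, where bit `i` of byte `k` stands for `offset + 8 * k + i`.
/// Overflow of the resulting index is not handled in this sketch.
fn lsb0_bytes_to_indices(offset: u32, bytes: &[u8]) -> Vec<u32> {
    let mut indices = Vec::new();
    for (k, &byte) in bytes.iter().enumerate() {
        for i in 0..8u32 {
            // Bit 0 is the least significant bit of the byte.
            if (byte & (1u8 << i)) != 0 {
                indices.push(offset + 8 * (k as u32) + i);
            }
        }
    }
    indices
}
```

For the four bytes used above, this yields 0, 2, 9, 31 with offset 0 and 8, 10, 17, 39 with offset 8, matching the assertions.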

Motivation

Sometimes a bitmap is computed with SIMD instructions. The result of such an instruction is usually already a bitmask, not a series of bitmap indices.

With the current implementation, if you want to build a RoaringBitmap from a bitmask produced by SIMD instructions, you have to convert it to indices and use RoaringBitmap::from_sorted_iter, or insert the values one by one.

To solve this problem, I implemented RoaringBitmap::from_bitmap_bytes, which constructs a RoaringBitmap directly from a bitmask.
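
For context, the index-based workaround looks roughly like the sketch below: every set bit of the mask is turned back into an index, and the resulting ascending sequence is what one would then hand to `RoaringBitmap::from_sorted_iter`. `set_bit_indices` is a hypothetical helper, not part of the roaring crate.

```rust
/// Hypothetical helper: yields the positions of the set bits of `mask`
/// in ascending order, each shifted by `offset`.
fn set_bit_indices(offset: u32, mut mask: u32) -> impl Iterator<Item = u32> {
    std::iter::from_fn(move || {
        if mask == 0 {
            return None;
        }
        let bit = mask.trailing_zeros(); // position of the lowest set bit
        mask &= mask - 1; // clear that bit
        Some(offset + bit)
    })
}
```

This per-bit loop is the work that a constructor taking the bitmask directly can avoid.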

Example of Production of Bitmask by SIMD instructions

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=40ddd13554c171be31fe53893401d40f

use std::arch::x86_64::*;

#[target_feature(enable = "avx2")]
unsafe fn compare_u8_avx2(a: &[u8], b: &[u8]) -> u32 {
    assert!(
        a.len() == 32 && b.len() == 32,
        "Inputs must have a length of 32."
    );

    // Load the data into 256-bit AVX2 registers
    let a_vec = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
    let b_vec = _mm256_loadu_si256(b.as_ptr() as *const __m256i);

    // Perform comparison (a == b)
    let cmp_result = _mm256_cmpeq_epi8(a_vec, b_vec);

    // Extract the comparison result as a bitmask
    let mask = _mm256_movemask_epi8(cmp_result);

    mask as u32
}

fn main() {
    let a: [u8; 32] = [
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
        26, 27, 28, 29, 30, 31, 32,
    ];
    let b: [u8; 32] = [
        1, 0, 3, 4, 5, 6, 7, 8, 0, 10, 11, 12, 13, 0, 15, 16, 0, 18, 19, 20, 21, 0, 23, 24, 0, 26,
        27, 0, 29, 0, 31, 32,
    ];

    let mask = unsafe {
        compare_u8_avx2(&a, &b)
    };
    println!("Bitmask: {:#034b}", mask);
    // Bitmask: 0b11010110110111101101111011111101
    print!("Bitmask (little endian u8): ");
    for b in mask.to_le_bytes() {
        print!("{:08b} ", b);
    }
    println!();
    // Bitmask (little endian u8): 11111101 11011110 11011110 11010110 
    
    let n = 2;
    println!("Bitmask at {n}: {}", mask & (1 << n) != 0);
    // Bitmask at 2: true
}
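
For readers without an AVX2 machine, a portable scalar counterpart of the comparison above can be sketched as follows. `compare_u8_scalar` is a hypothetical name; it is meant to produce the same bitmask as `compare_u8_avx2`, only more slowly.

```rust
/// Scalar counterpart of the AVX2 comparison: bit `i` of the result is set
/// when `a[i] == b[i]`, mirroring what the movemask step extracts.
fn compare_u8_scalar(a: &[u8; 32], b: &[u8; 32]) -> u32 {
    let mut mask = 0u32;
    for i in 0..32 {
        if a[i] == b[i] {
            mask |= 1u32 << i;
        }
    }
    mask
}
```

The little-endian bytes of the returned mask (`mask.to_le_bytes()`) are in the layout that `from_bitmap_bytes` takes as input.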

Benchmark Result

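On my laptop (Apple M3 MacBook Air, macOS Sonoma 14.3, 16 GB memory), from_bitmap_bytes is much faster than from_sorted_iter in most cases. On census-income_srt, for example, it takes about 0.99 ms versus about 23.4 ms, roughly 24 times faster.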

Part of Results

creation/from_bitmap_bytes/census-income_srt                                                                             
                        time:   [984.25 µs 987.00 µs 990.37 µs]
                        thrpt:  [6.1521 Gelem/s 6.1731 Gelem/s 6.1904 Gelem/s]
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe
creation/from_sorted_iter/census-income_srt                                                                            
                        time:   [23.383 ms 23.397 ms 23.413 ms]
                        thrpt:  [260.24 Melem/s 260.41 Melem/s 260.57 Melem/s]

Member

@Kerollmops Kerollmops left a comment

Hey @lemolatoon 👋

Thank you very much for these changes. The results look very good, indeed. However, could you:

  • Write a better explanation of what offset means. I understand it, but it needs to be clearer. Maybe mention that the internal containers are aligned to groups of 64k integer values?
  • Explain what kind of input is expected in plain text in the function description (endianness, size, alignment).
  • Move this function and the test into the serialization module.

Thank you very much for the work!
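
As background for the first point: in the Roaring format, each `u32` value is split into a high 16-bit container key and a low 16-bit position inside that container, so every container covers a group of 65,536 consecutive values. A minimal sketch, with helper names made up for illustration:

```rust
/// Number of values covered by one container (2^16).
const CONTAINER_SPAN: u32 = 1 << 16;

/// Splits a value into (container key, position within the container).
fn split_value(value: u32) -> (u16, u16) {
    ((value >> 16) as u16, value as u16)
}

/// True when `offset` falls exactly on a container boundary.
fn is_container_aligned(offset: u32) -> bool {
    offset % CONTAINER_SPAN == 0
}
```

An offset of 65,536 starts exactly at the second container (key 1, position 0), whereas an offset of 8 lands inside the first one.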

@lemolatoon
Author

Hi @Kerollmops 👋

Thank you for your quick reply. I have just made changes based on your review.

Basically I did:

  • Moved RoaringBitmap::from_bitmap_bytes and its tests to the serialization module.
  • Added detailed documentation to RoaringBitmap::from_bitmap_bytes, including explanations of offset and bytes.
  • Relaxed the alignment requirement for offset.
    • Thanks to #[inline], I believe the compiler can easily optimize the case where offset is actually aligned to 8, or even to 64Ki.

/// - `offset: u32` - The starting position in the bitmap where the byte slice will be applied, specified in bits.
/// This means that if `offset` is `n`, the first byte in the slice will correspond to the `n`th bit (0-indexed) in the bitmap.
/// Must be a multiple of 8.
/// - `bytes: &[u8]` - The byte slice containing the bitmap data. The bytes are interpreted in little-endian order.
Member

Endianness doesn't apply within a byte; it refers to the order of bytes within a word.

The bitvec crate refers to this order as "Least-Significant-First" bit order, and I think that's a more accurate description.
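
To illustrate the distinction, bit order within a single byte can be sketched as follows. The two helpers are hypothetical and only borrow the Lsb0/Msb0 terminology; they do not use the bitvec crate.

```rust
/// Least-significant-first: index 0 is the bit with value 1.
fn bit_lsb0(byte: u8, index: u32) -> bool {
    (byte & (1u8 << index)) != 0
}

/// Most-significant-first: index 0 is the bit with value 128.
fn bit_msb0(byte: u8, index: u32) -> bool {
    (byte & (0x80u8 >> index)) != 0
}
```

With the byte 0b00000101, the least-significant-first reading finds bits 0 and 2 set, while the most-significant-first reading finds bits 5 and 7 set. Neither reading depends on the byte order of the machine.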

Comment on lines 138 to 147
debug_assert!(first_container_bytes.len() <= number_of_bytes_in_first_container);
// Safety:
// * `first_container_bytes` is a slice of `bytes` and is guaranteed to be smaller than or equal to `number_of_bytes_in_first_container`
unsafe {
    core::ptr::copy_nonoverlapping(
        first_container_bytes.as_ptr(),
        bits_ptr,
        first_container_bytes.len(),
    )
};
Member

Same below, but I don't think this is correct on Big Endian systems. E.g. I've got the bytes [0x01, 0, 0, 0, 0, 0, 0, 0]. If I plop those down into the u64s of the bitmap, I've set the least significant byte to 1 on a little endian system, but I've set the most significant byte to 1 on a big endian system. So the least significant bit of the u64 is set for a little endian system, but unset for a big endian system?
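
The concern can be reproduced without a big-endian machine by comparing the explicit byte-order conversions in the standard library. This is only an illustration of the issue with hypothetical helper names, not code from this PR.

```rust
/// Interprets 8 bytes as a u64 so that bit 0 of `bytes[0]` always becomes
/// bit 0 of the word, independent of the host's endianness.
fn word_from_lsb0_bytes(bytes: [u8; 8]) -> u64 {
    u64::from_le_bytes(bytes)
}

/// The value a raw memory copy of the same bytes would have on a
/// big-endian host.
fn word_as_seen_on_big_endian(bytes: [u8; 8]) -> u64 {
    u64::from_be_bytes(bytes)
}
```

For the bytes [0x01, 0, 0, 0, 0, 0, 0, 0], the first function returns 1 (bit 0 set), while the second returns 1 << 56 (bit 56 set), which is the mismatch described above.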

Author

@lemolatoon lemolatoon Aug 22, 2024

I will add tests using Miri, an interpreter for Rust's mid-level intermediate representation (MIR) that can also emulate a big-endian target. I found how to do this in Miri's README.

I'll push the additional tests for big-endian systems (by modifying CI) and a patch (if needed) later, probably in about 2 weeks.

Member

I am not sure miri will detect algorithm correctness issues.

Author

At least, miri can emulate a big-endian system, allowing us to ensure that tests also pass on big-endian systems.

@lemolatoon
Author

lemolatoon commented Sep 5, 2024

@Dr-Emann I've just fixed the documentation and made the implementation endian-aware.
You can try it on a big-endian target by running cargo +nightly miri test --target s390x-unknown-linux-gnu --package roaring --lib -- bitmap::serialization::test::test_from_bitmap_bytes.
I also added this big endian test to CI.

* Directly create an array/bitmap store based on the count of bits
* Use u64 words to count bits and enumerate bits
* Store in little endian, then just fixup any touched words
* Only use unsafe to reinterpret a whole bitmap container as bytes, everything
  else is safe
* Allow adding bytes up to the last possible offset
Dr-Emann and others added 2 commits September 7, 2024 11:08
we can setting an initial value in that case
Speed up from_bitmap_bytes, and use less unsafe
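
The approach outlined in these commit notes might look roughly like the sketch below: bytes are packed into `u64` words, set bits are counted per word, and the count decides which kind of store to build. All names here are hypothetical. The sketch uses an explicit `from_le_bytes` conversion rather than the copy-then-fix-up strategy the notes mention, and the 4,096 threshold is the usual Roaring limit for array containers, so the actual patch may differ in detail.

```rust
/// Roaring array containers hold at most this many values.
const ARRAY_LIMIT: u64 = 4096;

#[derive(Debug, PartialEq)]
enum StoreKind {
    Array,
    Bitmap,
}

/// Packs LSB-first bytes into u64 words; a trailing partial word is
/// zero-padded. `from_le_bytes` keeps the result endian-independent.
fn bytes_to_words(bytes: &[u8]) -> Vec<u64> {
    bytes
        .chunks(8)
        .map(|chunk| {
            let mut buf = [0u8; 8];
            buf[..chunk.len()].copy_from_slice(chunk);
            u64::from_le_bytes(buf)
        })
        .collect()
}

/// Counts the set bits and picks the store kind a container would use.
fn choose_store(words: &[u64]) -> (u64, StoreKind) {
    let count: u64 = words.iter().map(|w| u64::from(w.count_ones())).sum();
    let kind = if count <= ARRAY_LIMIT {
        StoreKind::Array
    } else {
        StoreKind::Bitmap
    };
    (count, kind)
}
```

Counting first means the right store can be allocated once, instead of building a bitmap and converting it afterwards.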
@lemolatoon
Author

lemolatoon commented Sep 12, 2024

I have just merged the patch from @Dr-Emann (thank you so much). If the merge commit aace6b8 is unnecessary, I'll remove it with a force push.

/// assert!(rb.contains(17));
/// assert!(rb.contains(39));
/// ```
pub fn from_bitmap_bytes(offset: u32, mut bytes: &[u8]) -> RoaringBitmap {
Member

I'm not sure if I'd rather the name from_lsb0_bytes be used here, or use from_bitmap_bytes everywhere instead, but it would be nice to keep them consistent.

Author

IMO, from_lsb0_bytes is the more concise name. I chose from_bitmap_bytes on the assumption that the bytes in Bitmap containers are stored in least-significant-bit-first order regardless of endianness, but that is not actually the case (on big-endian systems we have to reorder internally).
