Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bitarray related bugs #1

Merged
merged 14 commits into from
Apr 7, 2020
Merged

Fix bitarray related bugs #1

merged 14 commits into from
Apr 7, 2020

Conversation

Zhicheng-Liu
Copy link

@Zhicheng-Liu Zhicheng-Liu commented Apr 4, 2020

This pull request contains changes to fix the following issues:

  1. Since bitarray 0.9.0, the numpy.where(bitarray) call does not return
    the correct non zero positions for bits. Instead, it returns the indices
    of non zero bytes. For example, for bitarray("0000000001000000"),
    the function call returns non zero positions [1], instead of [9], since
    bitarray release 0.9.0. This pull request fixes that issue by unpacking the
    bitarray first before getting non zero positions.

  2. The bitarray constructor does not initialise values. For example, if you
    want to set the initial values in the constructed bitarray when you call
    bitarray(size) you will have to call setAll(0) to set all bits to 0. In our
    BloomFilter class we call the bitarray consturctor that way, but did not
    initialise the bitarray with zeros. This pull request fixes that issue.

  3. Also fixes the Travis build.

Co-authored-by: leandro

Zhicheng-Liu added 6 commits April 4, 2020 21:21
Since bitarray 0.9.0, the `numpy.where(bitarray)` call does not return
the correct non zero positions for bits. Instead, it returns the index
of non zero bytes. For example, for `bitarray("0000000001000000")`, the
function call returns non zero positions `[1]`, instead of `[9]`, since
bitarray release 0.9.0.

This change fixes this issue by iterating over the bitarray and returning
the list of indices at which the element is `True`.
@iqbal-lab
Copy link

Any chance of a code review @mbhall88 @martinghunt or @leoisl?

@mbhall88
Copy link
Member

mbhall88 commented Apr 6, 2020

I think I found a much faster solution (thanks SO). Benchmarking done in jupyter notebook

from bitarray import bitarray
import numpy as np
from itertools import compress, count

arr = bitarray("0000000001000000")

def f1(arr):
    """Taken from https://stackoverflow.com/a/4111521/5299417"""
    gen = compress(count(), arr)
    return gen

>>> list(f1(arr))
[9]

%%timeit
f1(arr)

# 338 ns ± 7.86 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

## the above is slightly unfair as it returns a generator
## a version which evaluates the whole generator

def f1a(arr):
    """Taken from https://stackoverflow.com/a/4111521/5299417"""
    gen = compress(count(), arr)
    return list(gen)

>>> f1a(arr)
[9]

%%timeit f1a(arr)
# 672 ns ± 10.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

def f2(arr):
    return [index for index in range(len(arr)) if arr[index]]

>>> f2(arr)
[9]

%%timeit
f2(arr)

# 2.14 µs ± 58.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Note: I tried a bunch of others but nothing came close to the itertools magic.

And on a massive bit array

arr = bitarray('001101100110101001001010') * 2000000

%%timeit
f1(arr)
# 333 ns ± 8.49 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%%timeit
f1a(arr)
# 1.6 s ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
f2(arr)
# 6.17 s ± 57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I guess I don't really know how the output of this function will be used but generators are great and will save lots of time and memory.

Zhicheng-Liu and others added 4 commits April 6, 2020 16:24
`numpy.where(numpy.unpackbits(bitarray))[0].tolist()` seems to perform
better than `[index for index, bit in enumerate(bitarray) if bit]` and
`list(itertools.compress(itertools.count(), bitarray))` for a `bitarray`
that has large number (>1000s) of elements.
-If we declare a bit array of size m, the bits are initialiased with random values (that is why that sometimes tests work, sometimes they do not);
-we actually have to run self.bitarray.setall(0) after this command to ensure that the bitarray is built properly;
-this is also written in the docs of bitarray: https://github.com/ilanschnell/bitarray/blob/master/bitarray/__init__.py#L24-L25
-this fixes this issue
@Zhicheng-Liu
Copy link
Author

Hi @mbhall88 , that's interesting and prompted me to do some profiling on this particular case with four different implementations:

  1. [index for index in range(len(bitarray)) if bitarray[index]]

  2. [index for index, bit in enumerate(bitarray) if bit]

  3. numpy.where(numpy.unpackbits(bitarray))[0].tolist()

  4. [itertools.compress(itertools.count(), bitarray)]

And I profiled them with these length categories of bitarrays:

  • 1 - 800
  • 800 - 4,000
  • 4,000 - 16,000
  • 16,000 - 80,000

The results are interesting:

  1. impl 1 is slowest in most categories, except for the first categories.
  2. impl 2 is fast and comparable to itertools one in lower categories, but slow down in longer categories.
  3. impl 3 is slowest in the first category, but it is the fastest one in the last category.
  4. impl 4 is consistently fast across all categories compared to others, but slower than numpy implementation in the last category.

So, after seeing this, I have decided to revert back to use numpy in the implementation, because it is only with larger bitarray that the performance could become a concern. I will use the implementation 2 with the builtin enumerate function in the test for readability. I did not use itertools a lot before, it took me a little while to figure what implementation 4 is doing.

This has been updated in b508e22.

Thank you very much for pointing this out. The itertools is really good, very useful in a lot of cases I believe. As you rightly pointed out, the generator is performant, time and space wise.

Copy link

@leoisl leoisl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work @Zhicheng-Liu ! The changes and tests are essential. I just have a minor comment.

optional-requirements.txt Outdated Show resolved Hide resolved
@leoisl leoisl mentioned this pull request Apr 7, 2020
Copy link

@leoisl leoisl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All fine for me!

@Zhicheng-Liu Zhicheng-Liu changed the title Fix a bug where non zero positions are incorrect for bitarray Fix bitarray related bugs Apr 7, 2020
@Zhicheng-Liu Zhicheng-Liu merged commit 5374d86 into iqbal-lab-org:master Apr 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants