Fix bitarray related bugs #1
Since bitarray 0.9.0, the `numpy.where(bitarray)` call no longer returns the correct positions of nonzero bits. Instead, it returns the indices of nonzero bytes. For example, for `bitarray("0000000001000000")` the call returns `[1]` instead of `[9]`. This change fixes the issue by iterating over the bitarray and returning the list of indices at which the element is `True`.
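To make the byte-versus-bit mismatch concrete, here is a minimal standard-library sketch. It does not use bitarray or numpy; it assumes the 16 bits are packed most-significant-bit first into two bytes, and all variable names are illustrative only.

```python
bits = "0000000001000000"  # same pattern as the example above

# Pack the 16 bits into 2 bytes, most significant bit first (assumed layout).
packed = int(bits, 2).to_bytes(len(bits) // 8, "big")

# Scanning the packed buffer byte by byte finds the nonzero *byte*.
nonzero_byte_indices = [i for i, byte in enumerate(packed) if byte]

# Unpacking every byte back into 8 bits first recovers the *bit* positions.
unpacked = [(byte >> shift) & 1 for byte in packed for shift in range(7, -1, -1)]
nonzero_bit_indices = [i for i, bit in enumerate(unpacked) if bit]

print(nonzero_byte_indices)  # [1]
print(nonzero_bit_indices)   # [9]
```

The second byte (`0x40`) is the only nonzero one, which is where the misleading `[1]` comes from; the set bit sits at offset 1 inside that byte, giving position 8 + 1 = 9.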
Any chance of a code review @mbhall88 @martinghunt or @leoisl?
I think I found a much faster solution (thanks SO). Benchmarking done in a Jupyter notebook:
from bitarray import bitarray
import numpy as np
from itertools import compress, count
arr = bitarray("0000000001000000")
def f1(arr):
"""Taken from https://stackoverflow.com/a/4111521/5299417"""
gen = compress(count(), arr)
return gen
>>> list(f1(arr))
[9]
%%timeit
f1(arr)
# 338 ns ± 7.86 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
## the above is slightly unfair as it returns a generator
## a version which evaluates the whole generator
def f1a(arr):
"""Taken from https://stackoverflow.com/a/4111521/5299417"""
gen = compress(count(), arr)
return list(gen)
>>> f1a(arr)
[9]
%%timeit
f1a(arr)
# 672 ns ± 10.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
def f2(arr):
return [index for index in range(len(arr)) if arr[index]]
>>> f2(arr)
[9]
%%timeit
f2(arr)
# 2.14 µs ± 58.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Note: I tried a bunch of others but nothing came close to the
And on a massive bit array:
arr = bitarray('001101100110101001001010') * 2000000
%%timeit
f1(arr)
# 333 ns ± 8.49 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%%timeit
f1a(arr)
# 1.6 s ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
f2(arr)
# 6.17 s ± 57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I guess I don't really know how the output of this function will be used, but generators are great and will save lots of time and memory.
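The near-identical timings for `f1` on the small and the massive array suggest that only generator creation is being measured; the cost is paid when the generator is consumed. A minimal sketch of that laziness, using a plain list as a stand-in for a bitarray (the name `set_bit_positions` is hypothetical):

```python
from itertools import compress, count, islice

# A plain list of 0/1 values standing in for a bitarray.
bits = [0, 0, 1, 1, 0, 1, 1, 0] * 3


def set_bit_positions(bits):
    """Lazily yield the indices of truthy elements; no work happens until iterated."""
    return compress(count(), bits)


# Only as many elements are scanned as needed to produce three results.
first_three = list(islice(set_bit_positions(bits), 3))

# Consuming the whole generator scans every element, like f1a above.
all_positions = list(set_bit_positions(bits))

print(first_three)    # [2, 3, 5]
print(all_positions)  # [2, 3, 5, 6, 10, 11, 13, 14, 18, 19, 21, 22]
```

Whether the generator is a win depends on the caller: if every position is needed as a list anyway, the full scan still has to happen.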
`numpy.where(numpy.unpackbits(bitarray))[0].tolist()` seems to perform better than `[index for index, bit in enumerate(bitarray) if bit]` and `list(itertools.compress(itertools.count(), bitarray))` for a `bitarray` that has a large number (>1000s) of elements.
- If we declare a bit array of size m, the bits are initialised with random (uninitialised) values, which is why the tests sometimes pass and sometimes fail;
- we actually have to run `self.bitarray.setall(0)` after this command to ensure that the bitarray is built properly;
- this is also written in the docs of bitarray: https://github.com/ilanschnell/bitarray/blob/master/bitarray/__init__.py#L24-L25
- this fixes this issue
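The points above can be illustrated with a small standard-library sketch of why stale bits matter for a Bloom filter. This is not the project's `BloomFilter` class: the hashing scheme, function names, and the use of plain lists instead of a bitarray are all assumptions made for illustration, and the all-ones list is a worst-case stand-in for uninitialised memory.

```python
import hashlib


def bit_positions(item, size, num_hashes=3):
    """Hypothetical hashing scheme: derive num_hashes positions from SHA-256."""
    positions = []
    for seed in range(num_hashes):
        digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
        positions.append(int.from_bytes(digest[:8], "big") % size)
    return positions


def add(bits, item):
    """Set every bit position associated with item."""
    for pos in bit_positions(item, len(bits)):
        bits[pos] = 1


def might_contain(bits, item):
    """Report membership: True only if all of the item's positions are set."""
    return all(bits[pos] for pos in bit_positions(item, len(bits)))


size = 64
clean = [0] * size  # the state that setall(0) guarantees
dirty = [1] * size  # worst-case stand-in for uninitialised bits

print(might_contain(clean, "never-added"))  # False
print(might_contain(dirty, "never-added"))  # True: a spurious hit

add(clean, "kmer")
print(might_contain(clean, "kmer"))  # True
```

With real uninitialised memory the stale bits vary from run to run, which is consistent with tests passing on some runs and failing on others.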
Hi @mbhall88 , that's interesting and prompted me to do some profiling on this particular case with four different implementations:
And I profiled them with these length categories of bitarrays:
The results are interesting:
So, after seeing this, I have decided to revert back to use This has been updated in b508e22. Thank you very much for pointing this out. The |
Amazing work @Zhicheng-Liu ! The changes and tests are essential. I just have a minor comment.
All fine for me!
This pull request contains changes to fix the following issues:
- Since bitarray 0.9.0, the `numpy.where(bitarray)` call does not return the correct nonzero positions for bits. Instead, it returns the indices of nonzero bytes. For example, for `bitarray("0000000001000000")`, the call returns `[1]` instead of `[9]`. This pull request fixes that issue by unpacking the bitarray before getting the nonzero positions.
- The bitarray constructor does not initialise values. For example, after constructing a bitarray with `bitarray(size)`, you have to call `setall(0)` to set all bits to 0. In our `BloomFilter` class we call the `bitarray` constructor that way, but did not initialise the `bitarray` with zeros. This pull request fixes that issue.
- Also fixes the Travis build.
Co-authored-by: leandro