Bloom Filter #8615

Merged
merged 32 commits into from
Apr 8, 2023
173ab0e
Bloom filter with tests
isidroas Apr 6, 2023
08bc970
has functions constant
isidroas Apr 6, 2023
0448109
fix type
isidroas Apr 6, 2023
486dcbc
isort
isidroas Apr 6, 2023
4111807
passing ruff
isidroas Apr 6, 2023
e6ce098
type hints
isidroas Apr 6, 2023
e4d39db
type hints
isidroas Apr 6, 2023
7629686
from fail to erro
isidroas Apr 6, 2023
3926167
captital leter
isidroas Apr 6, 2023
280ffa0
type hints requested by boot
isidroas Apr 6, 2023
5d460aa
descriptive name for m
isidroas Apr 6, 2023
cc54095
more descriptibe arguments II
isidroas Apr 6, 2023
78d19fd
moved movies_test to doctest
isidroas Apr 7, 2023
8b1bec0
commented doctest
isidroas Apr 7, 2023
28e6691
removed test_probability
isidroas Apr 7, 2023
2fd7196
estimated error
isidroas Apr 7, 2023
314237d
added types
isidroas Apr 7, 2023
9b01472
again hash_
isidroas Apr 7, 2023
c132d50
Update data_structures/hashing/bloom_filter.py
isidroas Apr 8, 2023
313c80c
from b to bloom
isidroas Apr 8, 2023
18e0dde
Update data_structures/hashing/bloom_filter.py
isidroas Apr 8, 2023
54041ff
Update data_structures/hashing/bloom_filter.py
isidroas Apr 8, 2023
483a2a0
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 8, 2023
174ce08
syntax error in dict comprehension
isidroas Apr 8, 2023
00cc60e
from goodfather to godfather
isidroas Apr 8, 2023
35fa5f5
removed Interestellar
isidroas Apr 8, 2023
5cd20ea
forgot the last Godfather
isidroas Apr 8, 2023
7617143
Revert "removed Interestellar"
isidroas Apr 8, 2023
799171a
pretty dict
isidroas Apr 8, 2023
1a71f4c
Apply suggestions from code review
cclauss Apr 8, 2023
4e0263f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 8, 2023
e746746
Update bloom_filter.py
cclauss Apr 8, 2023
105 changes: 105 additions & 0 deletions data_structures/hashing/bloom_filter.py
@@ -0,0 +1,105 @@
"""
See https://en.wikipedia.org/wiki/Bloom_filter

The use of this data structure is to test membership in a set.
Compared to Python's built-in set() it is more space-efficient,
at the cost of occasional false positives.
In the following example, only 8 bits of memory will be used:
>>> bloom = Bloom(size=8)

Initially, the filter contains all zeros:
>>> bloom.bitstring
'00000000'

When an element is added, two bits are set to 1
since there are 2 hash functions in this implementation:
>>> "Titanic" in bloom
False
>>> bloom.add("Titanic")
>>> bloom.bitstring
'01100000'
>>> "Titanic" in bloom
True

However, sometimes only one bit is set
because both hash functions return the same position:
>>> bloom.add("Avatar")
>>> "Avatar" in bloom
True
>>> bloom.format_hash("Avatar")
'00000100'
>>> bloom.bitstring
'01100100'

Elements that have not been added should return False ...
>>> not_present_films = ("The Godfather", "Interstellar", "Parasite", "Pulp Fiction")
>>> {
... film: bloom.format_hash(film) for film in not_present_films
... } # doctest: +NORMALIZE_WHITESPACE
{'The Godfather': '00000101',
'Interstellar': '00000011',
'Parasite': '00010010',
'Pulp Fiction': '10000100'}
>>> any(film in bloom for film in not_present_films)
False

but sometimes there are false positives:
>>> "Ratatouille" in bloom
True
>>> bloom.format_hash("Ratatouille")
'01100000'

The false-positive probability increases with the number of elements added
and decreases with the number of bits in the bitarray:
>>> bloom.estimated_error_rate
0.140625
>>> bloom.add("The Godfather")
>>> bloom.estimated_error_rate
0.25
>>> bloom.bitstring
'01100101'
"""
from hashlib import md5, sha256

HASH_FUNCTIONS = (sha256, md5)


class Bloom:
    def __init__(self, size: int = 8) -> None:
        self.bitarray = 0b0
        self.size = size

    def add(self, value: str) -> None:
        h = self.hash_(value)
        self.bitarray |= h

    def exists(self, value: str) -> bool:
        h = self.hash_(value)
        return (h & self.bitarray) == h

    def __contains__(self, other: str) -> bool:
        return self.exists(other)

    def format_bin(self, bitarray: int) -> str:
        res = bin(bitarray)[2:]
        return res.zfill(self.size)

    @property
    def bitstring(self) -> str:
        return self.format_bin(self.bitarray)

    def hash_(self, value: str) -> int:
        res = 0b0
        for func in HASH_FUNCTIONS:
            position = (
                int.from_bytes(func(value.encode()).digest(), "little") % self.size
            )
            res |= 2**position
        return res

    def format_hash(self, value: str) -> str:
        return self.format_bin(self.hash_(value))

    @property
    def estimated_error_rate(self) -> float:
        n_ones = bin(self.bitarray).count("1")
        return (n_ones / self.size) ** len(HASH_FUNCTIONS)
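
The `estimated_error_rate` property above raises the fraction of set bits to the power of the number of hash functions. The following is a minimal sketch, not part of this PR, that reproduces that arithmetic as a standalone function and sets it beside the commonly cited approximation (1 - e^(-k*n/m))^k for n inserted items, m bits, and k hash functions. The function names `fill_ratio_estimate` and `textbook_estimate` are hypothetical and chosen only for illustration.

```python
from math import exp


def fill_ratio_estimate(n_ones: int, size: int, n_hashes: int = 2) -> float:
    """Same arithmetic as the property above: (set bits / total bits) ** k."""
    return (n_ones / size) ** n_hashes


def textbook_estimate(n_items: int, size: int, n_hashes: int = 2) -> float:
    """Common approximation (1 - e^(-k*n/m)) ** k, based on the item count."""
    return (1 - exp(-n_hashes * n_items / size)) ** n_hashes


if __name__ == "__main__":
    # 3 of 8 bits set, as in the doctest after adding "Titanic" and "Avatar".
    print(fill_ratio_estimate(3, 8))  # 0.140625, matching the doctest
    # The same filter described by its item count (2) instead of observed bits.
    print(textbook_estimate(2, 8))
```

The fill-ratio form only needs the current bit pattern, so it works without tracking how many items were added; the item-count form can be evaluated before any data is inserted, which may help when choosing `size`.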