Commit 4e29a04

isidroas, cclauss, and pre-commit-ci[bot] authored and committed
Bloom Filter (TheAlgorithms#8615)
* Bloom filter with tests
* has functions constant
* fix type
* isort
* passing ruff
* type hints
* type hints
* from fail to erro
* captital leter
* type hints requested by boot
* descriptive name for m
* more descriptibe arguments II
* moved movies_test to doctest
* commented doctest
* removed test_probability
* estimated error
* added types
* again hash_
* Update data_structures/hashing/bloom_filter.py
  Co-authored-by: Christian Clauss <cclauss@me.com>
* from b to bloom
* Update data_structures/hashing/bloom_filter.py
  Co-authored-by: Christian Clauss <cclauss@me.com>
* Update data_structures/hashing/bloom_filter.py
  Co-authored-by: Christian Clauss <cclauss@me.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* syntax error in dict comprehension
* from goodfather to godfather
* removed Interestellar
* forgot the last Godfather
* Revert "removed Interestellar"
  This reverts commit 35fa5f5.
* pretty dict
* Apply suggestions from code review
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Update bloom_filter.py

---------

Co-authored-by: Christian Clauss <cclauss@me.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 32e639b commit 4e29a04

1 file changed: +105 -0 lines changed

@@ -0,0 +1,105 @@
+"""
+See https://en.wikipedia.org/wiki/Bloom_filter
+
+The use of this data structure is to test membership in a set.
+Compared to Python's built-in set() it is more space-efficient.
+In the following example, only 8 bits of memory will be used:
+>>> bloom = Bloom(size=8)
+
+Initially, the filter contains all zeros:
+>>> bloom.bitstring
+'00000000'
+
+When an element is added, two bits are set to 1
+since there are 2 hash functions in this implementation:
+>>> "Titanic" in bloom
+False
+>>> bloom.add("Titanic")
+>>> bloom.bitstring
+'01100000'
+>>> "Titanic" in bloom
+True
+
+However, sometimes only one bit is added
+because both hash functions return the same value
+>>> bloom.add("Avatar")
+>>> "Avatar" in bloom
+True
+>>> bloom.format_hash("Avatar")
+'00000100'
+>>> bloom.bitstring
+'01100100'
+
+Not added elements should return False ...
+>>> not_present_films = ("The Godfather", "Interstellar", "Parasite", "Pulp Fiction")
+>>> {
+...   film: bloom.format_hash(film) for film in not_present_films
+... } # doctest: +NORMALIZE_WHITESPACE
+{'The Godfather': '00000101',
+ 'Interstellar': '00000011',
+ 'Parasite': '00010010',
+ 'Pulp Fiction': '10000100'}
+>>> any(film in bloom for film in not_present_films)
+False
+
+but sometimes there are false positives:
+>>> "Ratatouille" in bloom
+True
+>>> bloom.format_hash("Ratatouille")
+'01100000'
+
+The probability increases with the number of elements added.
+The probability decreases with the number of bits in the bitarray.
+>>> bloom.estimated_error_rate
+0.140625
+>>> bloom.add("The Godfather")
+>>> bloom.estimated_error_rate
+0.25
+>>> bloom.bitstring
+'01100101'
+"""
+from hashlib import md5, sha256
+
+HASH_FUNCTIONS = (sha256, md5)
+
+
+class Bloom:
+    def __init__(self, size: int = 8) -> None:
+        self.bitarray = 0b0
+        self.size = size
+
+    def add(self, value: str) -> None:
+        h = self.hash_(value)
+        self.bitarray |= h
+
+    def exists(self, value: str) -> bool:
+        h = self.hash_(value)
+        return (h & self.bitarray) == h
+
+    def __contains__(self, other: str) -> bool:
+        return self.exists(other)
+
+    def format_bin(self, bitarray: int) -> str:
+        res = bin(bitarray)[2:]
+        return res.zfill(self.size)
+
+    @property
+    def bitstring(self) -> str:
+        return self.format_bin(self.bitarray)
+
+    def hash_(self, value: str) -> int:
+        res = 0b0
+        for func in HASH_FUNCTIONS:
+            position = (
+                int.from_bytes(func(value.encode()).digest(), "little") % self.size
+            )
+            res |= 2**position
+        return res
+
+    def format_hash(self, value: str) -> str:
+        return self.format_bin(self.hash_(value))
+
+    @property
+    def estimated_error_rate(self) -> float:
+        n_ones = bin(self.bitarray).count("1")
+        return (n_ones / self.size) ** len(HASH_FUNCTIONS)
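
The estimated_error_rate property above derives its value from the fraction of set bits raised to the number of hash functions. The sketch below is not part of the commit. It reproduces the two docstring values by hand, then compares that fill-ratio estimate with the standard approximation (1 - e^(-k*n/m))^k and with a measured rate on a larger filter. The helper names (bit_mask, fill_ratio_estimate, textbook_estimate, measured_rate), the filter size of 1024, and the "item-"/"probe-" keys are assumptions chosen for illustration.

```python
from hashlib import md5, sha256
from math import exp

# Same two hash functions as the committed module.
HASH_FUNCTIONS = (sha256, md5)


def bit_mask(value: str, size: int) -> int:
    """One bit per hash function, mirroring Bloom.hash_ from the diff."""
    mask = 0
    for func in HASH_FUNCTIONS:
        digest = func(value.encode()).digest()
        mask |= 1 << (int.from_bytes(digest, "little") % size)
    return mask


def fill_ratio_estimate(bitarray: int, size: int) -> float:
    """The commit's estimate: (set bits / size) ** number of hash functions."""
    return (bin(bitarray).count("1") / size) ** len(HASH_FUNCTIONS)


def textbook_estimate(n_items: int, size: int) -> float:
    """Standard approximation (1 - e^(-k*n/m))^k. Not part of the commit."""
    k = len(HASH_FUNCTIONS)
    return (1 - exp(-k * n_items / size)) ** k


def measured_rate(bitarray: int, size: int, probes: list[str]) -> float:
    """Share of never-added probes that the filter reports as present."""
    hits = 0
    for probe in probes:
        mask = bit_mask(probe, size)
        hits += (mask & bitarray) == mask
    return hits / len(probes)


# Worked arithmetic from the docstring: 3 of 8 bits set, then 4 of 8.
print(fill_ratio_estimate(0b01100100, 8))  # 0.140625
print(fill_ratio_estimate(0b01100101, 8))  # 0.25

# Hypothetical larger experiment; the key names are made up.
SIZE, N_ITEMS = 1024, 100
bitarray = 0
for i in range(N_ITEMS):
    bitarray |= bit_mask(f"item-{i}", SIZE)
probes = [f"probe-{i}" for i in range(5000)]

print("fill-ratio estimate:", fill_ratio_estimate(bitarray, SIZE))
print("textbook estimate:  ", textbook_estimate(N_ITEMS, SIZE))
print("measured rate:      ", measured_rate(bitarray, SIZE, probes))
```

The three printed figures for the larger filter depend on the hash values, so no expected output is given for them. The fill-ratio estimate needs only the current bit pattern, whereas the textbook approximation needs the number of items added, which the Bloom class in this commit does not track.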
