---
author: [Joseph Oforkansi, Hank Gref, Coltin Colucci, Grant Anderson, Gabriel Salvatore, Javier Bejarano Jimenez]
title: Collision and Run Time Analysis of Hashing Algorithms
page-layout: full
categories: []
date: "2025-4-22"
date-format: long
toc: true
format:
html:
code-links:
- text: Github Repository
icon: github
href: https://github.com/ganderson03/algorithmology-all-hands-3-group-4
code-fold: true
code-summary: "Show the code"

---

## Introduction

We picked this project because we wanted to figure out how different ways of hashing data affect how often keys end up at the same "address" (collisions) and how long the hashing takes. Our main question is: **How do different hashing methods (like Python's built-in `hash()`, MurmurHash, DJB2, and modulo hashing) change the number of collisions and how quickly they run when you're storing data in a dictionary?**

It's really important to understand this because when you store lots of information, you want to find it quickly and not have different pieces of information end up in the same spot. That's what collisions are, and too many of them slow things down. So, by comparing these hashing methods, we hope to give people good information for choosing the best one for their needs.

To test this, we generated datasets with three different sizes: 5,000, 10,000, and 20,000 items. These datasets resemble the kind of information you might store in a dictionary. The keys are random strings, so we can see how each hashing method handles arbitrary inputs. Using different dataset sizes also lets us see whether some hashing methods scale better than others as the data grows.
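
The exact dataset generator lives in the GitHub repository linked above; a minimal sketch of how datasets like these could be produced (the file names, key length, and value range here are illustrative assumptions) looks like this:

```python
import json
import random
import string


def generate_dataset(num_items: int, key_length: int = 12) -> dict:
    """Create a dictionary with random string keys and integer values."""
    dataset = {}
    while len(dataset) < num_items:
        # Build a random alphanumeric key; duplicate keys are simply overwritten
        key = "".join(random.choices(string.ascii_letters + string.digits, k=key_length))
        dataset[key] = random.randint(0, 1_000_000)
    return dataset


# Write the three dataset sizes used in the experiment to JSON files
for size in (5_000, 10_000, 20_000):
    with open(f"dataset_{size // 1000}k.json", "w", encoding="utf-8") as f:
        json.dump(generate_dataset(size), f)
```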

Right now, we're looking at Python's built-in `hash()`, a simple math-based hashing method we made (modulo hashing), MurmurHash, and DJB2, along with SHA-256 and CityHash, which are more advanced. For now, we're counting how many times different keys end up with the same hash and timing how long it takes to hash all the keys. This should give us some solid answers to our main question.

## Implementation

During this experiment, we tested multiple different hashing algorithms and tracked both the number of collisions and the runtimes. We conducted a doubling experiment, increasing the dataset sizes from 5k to 10k and finally to 20k. This allowed us to identify the time complexities of each algorithm and gain a better understanding of how additional data entries impact the number of collisions.

### Python Hash
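
The built-in `hash()` runs follow the same bucketing pattern as the other algorithms shown below. A minimal sketch (assuming the same `dataset` dictionary used throughout; the exact driver lives in the repository) is:

```python
# Sketch of bucketing keys with Python's built-in hash() function
hashed_data = {}
for key, value in dataset.items():
    h = hash(key)  # CPython's built-in string hash (randomized per process by default)
    if h in hashed_data:
        hashed_data[h].append((key, value))  # Handle collisions
    else:
        hashed_data[h] = [(key, value)]
```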

### Modulo Hash
```{python}
def simple_modulo_hash(key, modulo=1000):
    # Calculate the sum of ASCII values of the characters in the key
    ascii_sum = sum(ord(char) for char in key)
    # Return the hash value as the modulo of the sum
    return ascii_sum % modulo
```

This function takes two inputs: `key`, a string, and `modulo`, an integer that determines the range of the hash values (default 1000). Each character in the string is converted to its ASCII value using the `ord` function, the ASCII values are summed, and the sum is reduced modulo the `modulo` value. This guarantees the hash value falls between 0 and `modulo - 1`, which is important for using it as a bucket index.
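
As a quick worked example of the formula, the key `"abc"` has an ASCII sum of 97 + 98 + 99 = 294:

```python
# ord('a') + ord('b') + ord('c') = 97 + 98 + 99 = 294
print(simple_modulo_hash("abc"))       # 294 (294 % 1000)
print(simple_modulo_hash("abc", 100))  # 94  (294 % 100)
```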

```{python}
# Inline implementation of hashing with the simple modulo-based hash function
hashed_data = {}
for key, value in dataset.items():
    h = simple_modulo_hash(key, modulo)  # Use the simple modulo-based hash function
    if h in hashed_data:
        hashed_data[h].append((key, value))  # Handle collisions
    else:
        hashed_data[h] = [(key, value)]
```

The keys in `hashed_data` are the hash values computed from the `simple_modulo_hash` function, and the values in `hashed_data` are lists of tuples. Each tuple contains the original key and its associated value from the dataset.

The algorithm iterates through all key-value pairs in the dataset and computes a hash value for each key using `simple_modulo_hash`. If a hash value already exists as a key in `hashed_data`, the key-value pair is appended to the list of items stored under that hash value (`h`). If the hash value does not exist, a new entry is created in `hashed_data` with the hash value as the key and the key-value pair as the first item in the list.

The Modulo Hashing algorithm has a time complexity of O(n), also known as linear time. We came to this conclusion by running a doubling experiment using 5k, 10k, and 20k-key dictionaries and timing how long it took to hash each dataset. As the number of keys doubled, so did the runtimes, confirming the linear time complexity.

This is shown by the runtimes measured for each dataset:

- 5k → 0.0129 seconds
- 10k → 0.0247 seconds
- 20k → 0.0526 seconds

This data shows the runtime roughly doubling as the number of entries doubles. The pattern is mostly linear, with a small extra uptick at each step that is likely due to the rise in collisions for the larger datasets.
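
A minimal sketch of this kind of doubling experiment (assuming the datasets are stored as JSON files named as in the table below; the actual benchmark script is in the repository) might look like:

```python
import json
import time


def time_modulo_hashing(file_path: str, modulo: int = 1000) -> float:
    """Time how long it takes to bucket every key in a dataset file."""
    with open(file_path, "r", encoding="utf-8") as f:
        dataset = json.load(f)
    start = time.perf_counter()
    hashed_data = {}
    for key, value in dataset.items():
        h = simple_modulo_hash(key, modulo)
        hashed_data.setdefault(h, []).append((key, value))
    return time.perf_counter() - start


# Doubling experiment over the three dataset sizes
for name in ("dataset_5k.json", "dataset_10k.json", "dataset_20k.json"):
    print(name, round(time_modulo_hashing(name), 4), "seconds")
```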

### Murmur Hash

```{python}
from typing import List


def murmurhash(key: str, seed: int = 0) -> int:
    """Compute the MurmurHash for a given string."""
    key_bytes = key.encode('utf-8')
    length = len(key_bytes)
    h = seed
    c1 = 0xcc9e2d51
    c2 = 0x1b873593
    r1 = 15
    r2 = 13
    m = 5
    n = 0xe6546b64

    # Process the input in 4-byte chunks
    for i in range(0, length // 4):
        k = int.from_bytes(key_bytes[i * 4:(i + 1) * 4], byteorder='little')
        k = (k * c1) & 0xFFFFFFFF
        k = (k << r1 | k >> (32 - r1)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF

        h ^= k
        h = (h << r2 | h >> (32 - r2)) & 0xFFFFFFFF
        h = (h * m + n) & 0xFFFFFFFF

    # Process the remaining bytes
    remaining_bytes = length & 3
    if remaining_bytes:
        k = int.from_bytes(key_bytes[-remaining_bytes:], byteorder='little')
        k = (k * c1) & 0xFFFFFFFF
        k = (k << r1 | k >> (32 - r1)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k

    # Finalize the hash
    h ^= length
    h ^= (h >> 16)
    h = (h * 0x85ebca6b) & 0xFFFFFFFF
    h ^= (h >> 13)
    h = (h * 0xc2b2ae35) & 0xFFFFFFFF
    h ^= (h >> 16)

    return h


def load_dataset(file_path: str) -> List[str]:
    """Load a dataset from a file, where each line is treated as a separate entry."""
    with open(file_path, 'r', encoding='utf-8') as file:
        return [line.strip() for line in file]


def hash_dataset(file_path: str, seed: int = 0) -> List[int]:
    """Hash each line of a dataset using MurmurHash."""
    dataset = load_dataset(file_path)
    return [murmurhash(line, seed) for line in dataset]


if __name__ == "__main__":
    # Example usage
    dataset_file = "dataset.txt"  # Replace with the path to your dataset file
    seed_value = 42

    try:
        hashes = hash_dataset(dataset_file, seed_value)
        for i, h in enumerate(hashes, start=1):
            print(f"Line {i}: Hash = {h}")
    except FileNotFoundError:
        print(f"Error: The file '{dataset_file}' was not found.")
```

The implementation of MurmurHash in the provided code is a 32-bit non-cryptographic hash function designed for efficiency and uniform distribution.

The function begins by encoding the input string into bytes and processing it in 4-byte chunks. Each chunk is transformed using a series of multiplications, bitwise rotations, and masking operations to ensure randomness and minimize collisions.

Any remaining bytes that do not fit into a 4-byte chunk are processed separately to ensure all input data contributes to the final hash. The hash value is further refined in a finalization step, where it undergoes additional bitwise operations and multiplications to improve distribution.

The function also incorporates a seed value, allowing for customizable hash outputs. This implementation is particularly effective for applications requiring fast and reliable hashing, such as hash tables or data indexing.
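
As a quick illustration of the seed parameter and of how the 32-bit output can be reduced to a smaller bucket range (the bucket count of 1000 below is just an illustrative choice):

```python
key = "example-key"

# Different seeds give different, but deterministic, hash values for the same key
print(murmurhash(key, seed=0))
print(murmurhash(key, seed=42))

# Reduce the 32-bit hash into a smaller bucket range for a hash table
bucket = murmurhash(key, seed=42) % 1000
print(bucket)
```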

### DJB2 Hash

```python
def djb2(key: str) -> int:
    """Hash function: DJB2 with 10-bit truncation for collision analysis."""
    h = 5381
    for c in key:
        h = ((h << 5) + h) + ord(c)  # h * 33 + ord(c)
    return h & 0x3FF  # 10-bit hash space (0–1023)
```

The `djb2` function is a well-known non-cryptographic hash function developed by Daniel J. Bernstein. It is widely used for hash tables due to its simplicity and decent distribution properties.

This version of DJB2 takes a single input `key` (a string) and computes a hash by starting with a base value (`5381`) and iteratively applying a simple formula to each character. The expression `((h << 5) + h)` is equivalent to multiplying the current hash value by 33, after which the ASCII value of the current character is added. The final result is truncated to 10 bits using `& 0x3FF`, forcing hash values into a small range (0–1023) and increasing the chance of collisions for analysis purposes.
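
As a short worked example of the formula and the truncation step:

```python
# Worked example for the key "ab":
#   start:      h = 5381
#   after 'a':  h = 5381 * 33 + 97   = 177670
#   after 'b':  h = 177670 * 33 + 98 = 5863208
#   truncate:   5863208 & 0x3FF      = 808
print(djb2("ab"))  # 808
```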

---

```python
# Inline implementation of DJB2-based hashing with collision tracking
hashed_data = {}
for key, value in dataset.items():
    h = djb2(key)  # Use the DJB2 hash function
    if h in hashed_data:
        hashed_data[h].append((key, value))  # Handle collisions
    else:
        hashed_data[h] = [(key, value)]
```

Here, `hashed_data` is a dictionary where:
- The keys are **hash values** generated by the `djb2` function.
- The values are **lists of key-value pairs** from the original dataset that share the same hash.

The algorithm iterates through each key-value pair in the dataset, hashes the key using `djb2`, and stores the pair in the appropriate bucket. If a hash value already exists in `hashed_data`, it appends the new item to the existing list — effectively handling collisions.

---

The DJB2 algorithm has a **linear time complexity of O(n)**, where `n` is the number of keys in the dataset. This is because:
- Each key is processed character by character in a single loop.
- Insertion into Python dictionaries is O(1) on average.
- The hashing step for each key is independent and does not depend on the dataset size.

We verified this through **empirical testing** using datasets of 5k, 10k, and 20k elements. As the number of keys doubled, so did the runtime, confirming the algorithm’s linear growth. This linearity held true even when we reduced the hash space to 10 bits to force more collisions — although the **collision rate increased**, the overall runtime remained proportional to input size.
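
The collision counts reported in the table below are simply the difference between the total number of keys and the number of distinct hash values produced, which can be computed directly from the bucket structure (a minimal sketch, assuming the `hashed_data` dictionary built above):

```python
# Collision count = total keys - number of distinct hash values (buckets)
total_keys = sum(len(items) for items in hashed_data.values())
unique_hashes = len(hashed_data)
total_collisions = total_keys - unique_hashes

print(f"Total Keys: {total_keys}")
print(f"Unique Hash Values: {unique_hashes}")
print(f"Total Collisions: {total_collisions}")
```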

## Data

|Hash Algorithm | Dataset | Run | Total Keys | Unique Hash Values | Total Collisions | Time Taken (s) | Seed Value | Modulo Value|
|---|---|---|---|---|---|---|---|---|
|builtin | dataset_5k | 1 | 5000 | 5000 | 0 | 0.0056 | -- | --|
|builtin | dataset_5k | 2 | 5000 | 5000 | 0 | 0.0042 | -- | --|
|builtin | dataset_5k | 3 | 5000 | 5000 | 0 | 0.0044 | -- | --|
|builtin | dataset_10k | 1 | 10000 | 10000 | 0 | 0.0109 | -- | --|
|builtin | dataset_10k | 2 | 10000 | 10000 | 0 | 0.0094 | -- | --|
|builtin | dataset_10k | 3 | 10000 | 10000 | 0 | 0.0087 | -- | --|
|builtin | dataset_20k | 1 | 20000 | 20000 | 0 | 0.0216 | -- | --|
|builtin | dataset_20k | 2 | 20000 | 20000 | 0 | 0.0196 | -- | --|
|builtin | dataset_20k | 3 | 20000 | 20000 | 0 | 0.0187 | -- | --|
|murmur | dataset_5k | 1 | 5000 | 1 | 4999 | 0.0540 | 1000 | --|
|murmur | dataset_5k | 2 | 5000 | 1 | 4999 | 0.0543 | 1000 | --|
|murmur | dataset_5k | 3 | 5000 | 1 | 4999 | 0.0547 | 1000 | --|
|murmur | dataset_10k | 1 | 10000 | 1 | 9999 | 0.1075 | 1000 | --|
|murmur | dataset_10k | 2 | 10000 | 1 | 9999 | 0.1066 | 1000 | --|
|murmur | dataset_10k | 3 | 10000 | 1 | 9999 | 0.1069 | 1000 | --|
|murmur | dataset_20k | 1 | 20000 | 1 | 19999 | 0.2160 | 1000 | --|
|murmur | dataset_20k | 2 | 20000 | 1 | 19999 | 0.2173 | 1000 | --|
|murmur | dataset_20k | 3 | 20000 | 1 | 19999 | 0.2174 | 1000 | --|
|modulo | dataset_5k | 1 | 5000 | 376 | 4624 | 0.0129 | -- | 1000|
|modulo | dataset_5k | 2 | 5000 | 376 | 4624 | 0.0115 | -- | 1000|
|modulo | dataset_5k | 3 | 5000 | 376 | 4624 | 0.0119 | -- | 1000|
|modulo | dataset_10k | 1 | 10000 | 406 | 9594 | 0.0247 | -- | 1000|
|modulo | dataset_10k | 2 | 10000 | 406 | 9594 | 0.0234 | -- | 1000|
|modulo | dataset_10k | 3 | 10000 | 406 | 9594 | 0.0234 | -- | 1000|
|modulo | dataset_20k | 1 | 20000 | 436 | 19564 | 0.0526 | -- | 1000|
|modulo | dataset_20k | 2 | 20000 | 436 | 19564 | 0.0491 | -- | 1000|
|modulo | dataset_20k | 3 | 20000 | 436 | 19564 | 0.0493 | -- | 1000|
|djb2 | dataset_5k | 1 | 5000 | 4826 | 174 | 0.0165 | -- | --|
|djb2 | dataset_5k | 2 | 5000 | 4826 | 174 | 0.0156 | -- | --|
|djb2 | dataset_5k | 3 | 5000 | 4826 | 174 | 0.0157 | -- | --|
|djb2 | dataset_10k | 1 | 10000 | 9270 | 730 | 0.0316 | -- | --|
|djb2 | dataset_10k | 2 | 10000 | 9270 | 730 | 0.0317 | -- | --|
|djb2 | dataset_10k | 3 | 10000 | 9270 | 730 | 0.0317 | -- | --|
|djb2 | dataset_20k | 1 | 20000 | 17293 | 2707 | 0.0675 | -- | --|
|djb2 | dataset_20k | 2 | 20000 | 17293 | 2707 | 0.0672 | -- | --|
|djb2 | dataset_20k | 3 | 20000 | 17293 | 2707 | 0.0634 | -- | --|

To test the hash algorithms consistently, each hash method was run three times for each dataset file, and 1000 was used as the seed/modulo value where applicable so that the outputs remain comparable across runs and across functions.

## Analysis

The four currently implemented hash algorithms take different approaches to computing hash values, which leads to different runtimes and different collision counts (with some algorithms producing no collisions whatsoever). For example, the builtin algorithm returns no collisions on any dataset, while the djb2 hash function returns 174 collisions when run on the dataset_5k.json file.

By far the fastest hash algorithm is the builtin hash function, which overall runs at roughly twice the speed of the next fastest algorithm (modulo) and about three times as fast as the one after that (djb2). This means that when prioritizing efficiency, users would likely opt for the builtin algorithm, while users who want to study collision behavior would be more likely to run murmur, which produces the most collisions out of any algorithm in this experiment.

## Conclusion