Improve buffer-accepting hashes and more (#84)
hajimes authored Sep 17, 2024
1 parent 04cb8ba commit 30da46e
Showing 14 changed files with 981 additions and 446 deletions.
45 changes: 29 additions & 16 deletions CHANGELOG.md
@@ -10,30 +10,43 @@ This project has adhered to

### Added

- Add `digest` functions that accept a non-immutable buffer as input
and process it without internal copying
([#75](https://github.com/hajimes/mmh3/issues/75)).
- Slightly improve the performance of the `hash_bytes` function.
- Add support for Python 3.13.
- Add `digest` functions that support the new buffer protocol
([PEP 688](https://peps.python.org/pep-0688/)) as input
([#75](https://github.com/hajimes/mmh3/pull/75)).
These functions are implemented with
[METH_FASTCALL](https://docs.python.org/3/c-api/structures.html#c.METH_FASTCALL),
offering improved performance over legacy functions.
- Slightly improve the performance of the `hash_bytes()` function.
- Add Read the Docs documentation
([#54](https://github.com/hajimes/mmh3/issues/54)).
- (planned: Document benchmark results
([#53](https://github.com/hajimes/mmh3/issues/53))).

### Changed

- **Backward-incompatible**: The `seed` argument is now strictly validated to
ensure it falls within the range [0, 0xFFFFFFFF]. A `ValueError` is raised
if the seed is out of range.
- **Backward-incompatible**: Change the constructors of hasher classes to
accept a buffer as the first argument
([#83](https://github.com/hajimes/mmh3/pull/83)).
- The type of flag arguments has been changed from `bool` to `Any`.
- Change the format of CHANGELOG.md to conform to the
[Keep a Changelog](https://keepachangelog.com/en/1.1.0/) standard
([#63](https://github.com/hajimes/mmh3/issues/63)).
- **Backward-incompatible**: Change the constructors of hasher classes to
accept a buffer as the first argument.
([#63](https://github.com/hajimes/mmh3/pull/63)).
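
The seed rule stated in this list can be mirrored on the caller side before a seed reaches the library. The following is a minimal pure-Python sketch. `check_seed` is a hypothetical helper, not part of mmh3. It only restates the documented range [0, 0xFFFFFFFF] and the documented `ValueError`.

```python
MAX_SEED = 0xFFFFFFFF  # upper bound of the documented seed range


def check_seed(seed: int) -> int:
    """Return seed unchanged if it lies in [0, 0xFFFFFFFF], else raise ValueError.

    Hypothetical helper for illustration; it is not part of mmh3.
    """
    if not 0 <= seed <= MAX_SEED:
        raise ValueError(f"seed must be in [0, 0xFFFFFFFF], got {seed}")
    return seed


print(check_seed(0))           # lowest accepted seed
print(check_seed(0xFFFFFFFF))  # highest accepted seed

for bad in (-1, 2**32):
    try:
        check_seed(bad)
    except ValueError as exc:
        print("rejected:", exc)
```

Callers that previously passed negative 32-bit seeds may need to convert them to the unsigned form first.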

### Deprecated

- Deprecate the `hash_from_buffer()` function.
Use `mmh3_32_sintdigest()` or `mmh3_32_uintdigest()` as alternatives.

### Fixed

- Fix a reference leak in the `hash_from_buffer()` function
([#75](https://github.com/hajimes/mmh3/issues/75)).
- Fix type hints ([#76](https://github.com/hajimes/mmh3/issues/76),
[#77](https://github.com/hajimes/mmh3/issues/77)).
([#75](https://github.com/hajimes/mmh3/pull/75)).
- Fix type hints ([#76](https://github.com/hajimes/mmh3/pull/76),
[#77](https://github.com/hajimes/mmh3/pull/77)).

## [4.1.0] - 2024-01-09

@@ -47,7 +60,7 @@ This project has adhered to
([#50](https://github.com/hajimes/mmh3/issues/50)).
- Fix incorrect type hints ([#51](https://github.com/hajimes/mmh3/issues/51)).
- Fix invalid results on s390x when the arg `x64arch` of `hash64` or
`hash_bytes` is set to `False`
`hash_bytes()` is set to `False`
([#52](https://github.com/hajimes/mmh3/issues/52)).

## [4.0.1] - 2023-07-14
@@ -97,8 +110,8 @@ This project has adhered to
[wouter bolsterlee](https://github.com/wbolster) and
[Dušan Nikolić](https://github.com/n-dusan)!
- Add support for 32-bit architectures such as `i686` and `armv7l`. From now on,
`hash` and `hash_from_buffer` on these architectures will generate the same
hash values as those on other environments. Thanks
`hash()` and `hash_from_buffer()` on these architectures will generate the
same hash values as those on other environments. Thanks
[Danil Shein](https://github.com/dshein-alt)!
- In relation to the above, `manylinux2014_i686` wheels are now available.
- Support for hashing huge data (>16GB). Thanks
@@ -134,13 +147,13 @@ This project has adhered to

### Fixed

- Bugfix for `hash_bytes`. Thanks [doozr](https://github.com/doozr)!
- Bugfix for `hash_bytes()`. Thanks [doozr](https://github.com/doozr)!

## [2.5] - 2017-10-28

### Added

- Add `hash_from_buffer`. Thanks [Dimitri Vorona](https://github.com/alendit)!
- Add `hash_from_buffer()`. Thanks [Dimitri Vorona](https://github.com/alendit)!
- Add a keyword argument `signed`.

## [2.4] - 2017-05-27
@@ -175,7 +188,7 @@ Thanks!

### Added

- Add `hash128`, which returns a 128-bit signed integer.
- Add `hash128()`, which returns a 128-bit signed integer.

### Fixed

60 changes: 25 additions & 35 deletions README.md
@@ -128,37 +128,50 @@ b'\x82_n\xdd \xac\xb6j\xef\x99\xb1e\xc4\n\xc9\xfd'

## Changelog

See [Changelog](https://mmh3.readthedocs.io/en/latest/changelog_link.html) for the
See [Changelog](https://mmh3.readthedocs.io/en/latest/changelog.html) for the
complete changelog.

### [Unreleased]

#### Added

- Add `digest` functions that accept a non-immutable buffer as input
and process it without internal copying
([#75](https://github.com/hajimes/mmh3/issues/75)).
- Slightly improve the performance of the `hash_bytes` function.
- Add support for Python 3.13.
- Add `digest` functions that support the new buffer protocol
([PEP 688](https://peps.python.org/pep-0688/)) as input
([#75](https://github.com/hajimes/mmh3/pull/75)).
These functions are implemented with
[METH_FASTCALL](https://docs.python.org/3/c-api/structures.html#c.METH_FASTCALL),
offering improved performance over legacy functions.
- Slightly improve the performance of the `hash_bytes()` function.
- Add Read the Docs documentation
([#54](https://github.com/hajimes/mmh3/issues/54)).
- (planned: Document benchmark results
([#53](https://github.com/hajimes/mmh3/issues/53))).

#### Changed

- **Backward-incompatible**: The `seed` argument is now strictly validated to
ensure it falls within the range [0, 0xFFFFFFFF]. A `ValueError` is raised
if the seed is out of range.
- **Backward-incompatible**: Change the constructors of hasher classes to
accept a buffer as the first argument
([#83](https://github.com/hajimes/mmh3/pull/83)).
- The type of flag arguments has been changed from `bool` to `Any`.
- Change the format of CHANGELOG.md to conform to the
[Keep a Changelog](https://keepachangelog.com/en/1.1.0/) standard
([#63](https://github.com/hajimes/mmh3/issues/63)).
- **Backward-incompatible**: Change the constructors of hasher classes to
accept a buffer as the first argument.
([#63](https://github.com/hajimes/mmh3/pull/63)).

#### Deprecated

- Deprecate the `hash_from_buffer()` function.
Use `mmh3_32_sintdigest()` or `mmh3_32_uintdigest()` as alternatives.

#### Fixed

- Fix a reference leak in the `hash_from_buffer()` function
([#75](https://github.com/hajimes/mmh3/issues/75)).
- Fix type hints ([#76](https://github.com/hajimes/mmh3/issues/76),
[#77](https://github.com/hajimes/mmh3/issues/77)).
([#75](https://github.com/hajimes/mmh3/pull/75)).
- Fix type hints ([#76](https://github.com/hajimes/mmh3/pull/76),
[#77](https://github.com/hajimes/mmh3/pull/77)).

### [4.1.0] - 2024-01-09

@@ -172,7 +185,7 @@ complete changelog.
([#50](https://github.com/hajimes/mmh3/issues/50)).
- Fix incorrect type hints ([#51](https://github.com/hajimes/mmh3/issues/51)).
- Fix invalid results on s390x when the arg `x64arch` of `hash64` or
`hash_bytes` is set to `False`
`hash_bytes()` is set to `False`
([#52](https://github.com/hajimes/mmh3/issues/52)).

## License
@@ -201,29 +214,6 @@ For compatibility with
[murmur3 (Go)](https://pkg.go.dev/github.com/spaolacci/murmur3), see
<https://github.com/hajimes/mmh3/issues/46>.

### Unexpected results when given non 32-bit seeds

In version 2.4, the type of a seed was changed from a signed 32-bit integer to
an unsigned 32-bit integer. However, the resulting values for signed seeds
remain unchanged from previous versions, as long as they are 32-bit.

```pycon
>>> mmh3.hash("aaaa", -1756908916) # signed representation for 0x9747b28c
1519878282
>>> mmh3.hash("aaaa", 2538058380) # unsigned representation for 0x9747b28c
1519878282
```

Make sure that seeds fit in 32 bits. Values outside that range may produce
unexpected results.

```pycon
>>> mmh3.hash("foo", 2 ** 33)
-156908512
>>> mmh3.hash("foo", 2 ** 34)
-156908512
```
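
The outputs above are consistent with the seed being reduced to its low 32 bits before hashing. That mechanism is an assumption inferred from the examples, not something the source states. Under that assumption, the arithmetic can be checked in pure Python without calling mmh3 (`low32` is a hypothetical helper):

```python
def low32(seed: int) -> int:
    """Keep only the low 32 bits of an integer (two's-complement wrap-around)."""
    return seed & 0xFFFFFFFF


# The signed and unsigned spellings of 0x9747b28c share one 32-bit pattern.
print(hex(low32(-1756908916)))  # 0x9747b28c
print(hex(low32(2538058380)))   # 0x9747b28c

# Seeds wider than 32 bits collapse to the same value, here 0.
print(low32(2**33), low32(2**34))  # 0 0
```

If the assumption holds, this would explain why `2 ** 33` and `2 ** 34` produced identical hashes.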

## Contributing Guidelines

See [Contributing](https://mmh3.readthedocs.io/en/latest/CONTRIBUTING.html).
4 changes: 2 additions & 2 deletions benchmark/plot_graph.py
@@ -124,8 +124,8 @@ def ordered_intersection(list1: list[T], list2: list[T]) -> list[T]:
plt.savefig(os.path.join(args.output_dir, BANDWIDTH_SMALL_FILE_NAME))

df_latency_all = df_latency * 1000
df_latency_all.index = df_latency_all.index / (1024 * 1024)
df_latency_all.plot(xlabel="Input size (MiB)", ylabel="Latency (ms)")
df_latency_all.index = df_latency_all.index / 1024
df_latency_all.plot(xlabel="Input size (KiB)", ylabel="Latency (ms)")
plt.savefig(os.path.join(args.output_dir, LATENCY_FILE_NAME))

df_latency_small = df_latency * 1000 * 1000 * 1000
15 changes: 6 additions & 9 deletions docs/CONTRIBUTING.md
@@ -129,13 +129,13 @@ The idea of the subproject directory loosely follows the

### Updating mmh3 core C code

Run `tox -e build-cfiles`. This will fetch Appleby's original SMHasher project
Run `tox -e build_cfiles`. This will fetch Appleby's original SMHasher project
as a git submodule and then generate PEP 7-compliant C code from the original
project.

To perform further edits, add transformation code to the `refresh.py` script,
instead of editing `murmurhash3.*` files manually.
Then, run `tox -e build-cfiles` again to update the `murmurhash3.*` files.
Then, run `tox -e build_cfiles` again to update the `murmurhash3.*` files.

### Local files

@@ -153,8 +153,7 @@ Then, run `tox -e build-cfiles` again to update the `murmurhash3.*` files.
To run benchmarks locally, try the following command:

```shell
pip install ".[benchmark]"
python benchmark/benchmark.py -o OUTPUT_FILE \
tox -e benchmark -- -o OUTPUT_FILE \
--test-hash HASH_NAME --test-buffer-size-max HASH_SIZE
```

@@ -165,9 +164,8 @@ in bytes.
For example,

```shell
pip install ".[benchmark]"
mkdir results
python benchmark/benchmark.py -o results/mmh3_128.json \
mkdir -p _results
tox -e benchmark -- -o _results/mmh3_128.json \
--test-hash mmh3_128 --test-buffer-size-max 262144
```

@@ -182,8 +180,7 @@ After obtaining the benchmark results, you can plot graphs by `plot_graph.py`.
The following is an example of how to run the script:

```shell
pip install ".[benchmark,plot]"
python benchmark/plot_graph.py --output-dir docs/_static RESULT_DIR/*.json
tox -e plot -- --output-dir docs/_static RESULT_DIR/*.json
```

where `RESULT_DIR` is the directory containing the benchmark results.
4 changes: 2 additions & 2 deletions docs/api.md
@@ -22,9 +22,8 @@ UTF-8 encoding before hashing.

The following functions are used to hash types that implement the buffer
protocol such as `bytes`, `bytearray`, `memoryview`, and `numpy` arrays.
String inputs are also supported and are automatically converted to `bytes`
using UTF-8 encoding before hashing.

```{seealso}
The buffer protocol,
[originally implemented as a part of Python/C API](https://docs.python.org/3/c-api/buffer.html),
was formally defined as a Python-level API in
@@ -37,6 +36,7 @@ type hint
which is itself an alias for
[typing_extensions.Buffer](https://typing-extensions.readthedocs.io/en/latest/#typing_extensions.Buffer),
the backported type hint for `collections.abc.Buffer`.
```
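
As a rough illustration of what "implements the buffer protocol" means, the sketch below uses only the standard library: an object supports the protocol if `memoryview()` accepts it. `supports_buffer` is a hypothetical helper for this example, not an mmh3 function.

```python
def supports_buffer(obj: object) -> bool:
    """Return True if obj exposes the buffer protocol (memoryview accepts it)."""
    try:
        with memoryview(obj):
            return True
    except TypeError:
        return False


data = bytearray(b"foo")
view = memoryview(data)  # a view over the same memory, no copy is made

print(supports_buffer(b"foo"))  # True
print(supports_buffer(data))    # True
print(supports_buffer(view))    # True
print(supports_buffer("foo"))   # False: str does not expose a buffer

data[0] = ord("b")              # mutate the underlying bytearray
print(bytes(view))              # b'boo': the view reflects the change
```

The last two lines show why a function that reads directly from a buffer can avoid copying: the view and the `bytearray` share the same memory.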

```{eval-rst}
.. autofunction:: mmh3.hash_from_buffer
File renamed without changes.
5 changes: 2 additions & 3 deletions docs/index.rst
@@ -6,9 +6,9 @@ mmh3 is a Python extension for `MurmurHash (MurmurHash3) <https://en.wikipedia.o
:maxdepth: 2
:caption: User Guideline

Quickstart<readme_link>
Quickstart<quickstart>
api
Changelog<changelog_link>
Changelog<changelog>

.. toctree::
:maxdepth: 2
@@ -21,5 +21,4 @@ Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
File renamed without changes.
14 changes: 6 additions & 8 deletions src/mmh3/__init__.pyi
@@ -2,26 +2,24 @@
from __future__ import annotations

import sys
from typing import Union, final
from typing import Any, Union, final

if sys.version_info >= (3, 12):
from collections.abc import Buffer
else:
from _typeshed import ReadableBuffer as Buffer

def hash(key: Union[bytes, str], seed: int = 0, signed: bool = True) -> int: ...
def hash(key: Union[bytes, str], seed: int = 0, signed: Any = True) -> int: ...
def hash_from_buffer(
key: Union[Buffer, str], seed: int = 0, signed: bool = True
key: Union[Buffer, str], seed: int = 0, signed: Any = True
) -> int: ...
def hash64(
key: Union[bytes, str], seed: int = 0, x64arch: bool = True, signed: bool = True
key: Union[bytes, str], seed: int = 0, x64arch: Any = True, signed: Any = True
) -> tuple[int, int]: ...
def hash128(
key: Union[bytes, str], seed: int = 0, x64arch: bool = True, signed: bool = False
key: Union[bytes, str], seed: int = 0, x64arch: Any = True, signed: Any = False
) -> int: ...
def hash_bytes(
key: Union[bytes, str], seed: int = 0, x64arch: bool = True
) -> bytes: ...
def hash_bytes(key: Union[bytes, str], seed: int = 0, x64arch: Any = True) -> bytes: ...
def mmh3_32_digest(key: Union[Buffer, str], seed: int = 0) -> bytes: ...
def mmh3_32_sintdigest(key: Union[Buffer, str], seed: int = 0) -> int: ...
def mmh3_32_uintdigest(key: Union[Buffer, str], seed: int = 0) -> int: ...