Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix quadratic time ByteBuffer operations #711

Merged
merged 3 commits into from
Aug 21, 2022

Conversation

Joshua-Landau-Anthropic
Copy link
Contributor

bytes is immutable, so repeated appends is quadratic-time.

bytearray is the mutable version of bytes.

@Joshua-Landau-Anthropic Joshua-Landau-Anthropic marked this pull request as ready for review July 29, 2022 10:50
@mpenkov
Copy link
Collaborator

mpenkov commented Jul 30, 2022

Cool! Are you able to run any benchmarks?

@Joshua-Landau-Anthropic
Copy link
Contributor Author

$ python benchmark/bytebuffer_bench.py
0.0012826919555664062
$ git stash; git checkout HEAD^; git stash pop
$ python benchmark/bytebuffer_bench.py
0.13568520545959473

Benchmark is

import time
from smart_open.bytebuffer import ByteBuffer

buffer = ByteBuffer()

start = time.time()
for _ in range(1000):
    assert buffer.fill([b"X" * 1000]) == 1000
print(time.time() - start)

In terms of production code, I found this is a significant speedup for large unpickles.

@piskvorky
Copy link
Owner

piskvorky commented Jul 31, 2022

A benchmark on some real smart_open use-case (closer to what users see / care about) would be great. I am myself curious of the difference.

@Joshua-Landau-Anthropic
Copy link
Contributor Author

$ python benchmark/bytebuffer_bench.py "$BENCHMARK_FILE"
Raw ByteBuffer benchmark: 0.012549877166748047
File read benchmark 1.083221435546875

$ git revert 8cc7654580b207dc7a684fab856ef1a708b106b1
[develop a5fc7a7] Revert "Fix quadratic time ByteBuffer operations"

$ python benchmark/bytebuffer_bench.py "$BENCHMARK_FILE"
Raw ByteBuffer benchmark: 21.479155778884888
File read benchmark 25.684982776641846

BENCHMARK_FILE is any large S3 file you have a fast connection to.

@mpenkov mpenkov merged commit 41467a9 into piskvorky:develop Aug 21, 2022
@mpenkov
Copy link
Collaborator

mpenkov commented Aug 21, 2022

Merged! Thank you.

@nelhage
Copy link

nelhage commented Sep 12, 2022

Would it be possible to get a release containing this fix so it's easier for us to upgrade? Thanks in advance!

@mpenkov
Copy link
Collaborator

mpenkov commented Sep 14, 2022

@nelhage
Copy link

nelhage commented Sep 14, 2022

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants