
gh-117151: optimize algorithm to grow the buffer size for readall() on files #131052

Open · morotti wants to merge 2 commits into main

Conversation

@morotti (Contributor) commented Mar 10, 2025

Continuing my PRs to optimize buffers.

File readall() sets the buffer size to the file size.

  • In the happy path, the file size estimate matches the file, and it can be read in one call (or multiple calls for a large file) without any buffer extension.
  • In the unhappy path, the file size has changed or is unavailable, and new_buffersize() is used to grow the buffer gradually. That latter case is what needs optimizing; a simplified sketch of this flow follows below.
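
For context, a simplified Python sketch of that flow is below. It only models the logic described above, not the actual C implementation; SMALLCHUNK and the new_buffersize argument are placeholders for this sketch.

import os

SMALLCHUNK = 8 * 1024  # assumed fallback chunk size, for this sketch only

def readall_sketch(fd, new_buffersize):
    # Happy path: size the buffer from fstat() and read the file in one call
    # (or a few calls for a large file). Unhappy path: the estimate is wrong
    # or unavailable, so the buffer is grown step by step with new_buffersize().
    try:
        pos = os.lseek(fd, 0, os.SEEK_CUR)
        end = os.fstat(fd).st_size
        bufsize = max(end - pos, SMALLCHUNK) + 1  # +1 so EOF is hit without growing
    except OSError:
        bufsize = SMALLCHUNK
    result = bytearray()
    while True:
        if len(result) >= bufsize:
            # The size estimate was too small: this is the path being optimized.
            bufsize = new_buffersize(len(result))
        chunk = os.read(fd, bufsize - len(result))
        if not chunk:  # EOF
            return bytes(result)
        result += chunk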

That new_buffersize() function was written 16 years ago and has had little optimization since.
It starts reading the file with an 8 kB buffer and grows it in steps of around 8 kB, so it does a lot of small, inefficient reads.
It only switches to growing in steps of 12.5% after 65 kB, which is still minuscule.
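
To make that concrete, here is an illustrative Python model of the current growth rule. The real function is C code in the _io module; the constants below follow the description above rather than being copied from the source.

SMALLCHUNK = 8 * 1024  # initial/minimum read size described above

def old_new_buffersize(currentsize):
    # Below ~64 kB the buffer grows by roughly the minimum chunk (~8 kB per
    # step); above that it grows by currentsize >> 3, i.e. 12.5% per step.
    if currentsize > 64 * 1024:
        addend = currentsize >> 3
    else:
        addend = SMALLCHUNK
    return currentsize + max(addend, SMALLCHUNK)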

I spent a few days looking into this function (as part of the attached ticket on optimizing buffers); the attached PR is what I could come up with to optimize it.

Considerations and gotchas:

  • I am seeing that buffers above a certain size (256 kB or 512 kB) carry a few microseconds of extra overhead, probably due to the OS allocating them differently, with large pages, and zeroing them (there was another ticket raised about that). The code therefore starts with a 128 kB buffer to avoid that overhead. This may vary with the operating system.
  • For small sizes, multiply the size by 4 on each step; we want to get into the MB range quickly to do larger I/O.
  • For larger sizes, increase the size by 12.5%, same as the existing code. This is required to avoid running out of memory when reading a very large file (for example, reading a 25 GB file on a system with 32 GB of memory will fail with an out-of-memory error if the buffer grows by 50% or even 25%). There is another old ticket about that from years ago. (A sketch of the proposed growth rule follows this list.)
  • The buffer is reallocated in place without copying the memory, so it is actually not expensive to call realloc() many times since there is no copy. This may vary with the operating system.
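
Here is a Python sketch of the proposed growth strategy. The 128 kB starting size and the 12.5% rule come from the bullets above; the 4 MB cutover is a placeholder of mine for illustration, not a value taken from the patch.

START_SIZE = 128 * 1024        # start large enough to avoid the tiny-read regime
LARGE_SIZE = 4 * 1024 * 1024   # assumed cutover to slow growth, illustration only

def proposed_new_buffersize(currentsize):
    # Small buffers: multiply by 4 to reach MB-sized I/O within a few steps.
    # Large buffers: grow by 12.5% (currentsize >> 3), like the existing code,
    # so reading a very large file does not over-allocate and run out of memory.
    if currentsize < START_SIZE:
        return START_SIZE
    if currentsize < LARGE_SIZE:
        return currentsize * 4
    return currentsize + (currentsize >> 3)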

Is it worth explaining all of that in comments?
The existing code is sparse on explanation and I had to go through a fair number of tickets and a fair amount of debugging to understand the history.

See the code below to benchmark with different file sizes on your machine.
This PR is a draft to discuss the fix before I spend more time on it, and to get a CI build passing.

import os
import time

# Create an empty file and open it, then append data to it with dd (Linux)
# so the file grows after it has been opened. Adjust or uncomment the dd
# lines to benchmark other file sizes.
os.system("touch file.txt")
os.system("truncate --size 0 file.txt")
f = open("file.txt", "rb")
os.system("dd if=/dev/urandom bs=1k count=1 >> file.txt")
os.system("dd if=/dev/urandom bs=50k count=1 >> file.txt")
os.system("dd if=/dev/urandom bs=150k count=1 >> file.txt")
os.system("dd if=/dev/urandom bs=1M count=1 >> file.txt")
#os.system("dd if=/dev/urandom bs=1k count=1 >> file.txt")
#os.system("dd if=/dev/urandom bs=2M count=1 >> file.txt")
#os.system("dd if=/dev/urandom bs=2M count=1 >> file.txt")
os.system("dd if=/dev/urandom bs=1k count=1 >> file.txt")
time.sleep(0.5)

# Time a single read() of the whole file.
start = time.perf_counter()
data = f.read()
end = time.perf_counter()
elapsed = end - start
f.close()
finalsize = os.path.getsize("file.txt")
print("read took {:3.09f} ms for {} bytes".format(elapsed * 1000.0, finalsize))
