
gh-117151: optimize algorithm to grow the buffer size for readall() on files #131052

Open · morotti wants to merge 2 commits into main

Conversation

@morotti (Contributor) commented Mar 10, 2025

Continuing my PRs to optimize buffers.

File readall() sets the buffer size to the file size.

  • In the happy path, the file size estimate matches the file, and it can be read in one call (or multiple calls for a large file) without any buffer extension.
  • In the unhappy path, the file size has changed or is unavailable, and new_buffersize() is used to grow the buffer gradually. That latter case is what needs optimizing; a simplified sketch of this flow follows below.
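
For context, a simplified Python sketch of that flow is below. It only models the logic described above, not the actual C implementation; SMALLCHUNK and the new_buffersize argument are placeholders for this sketch.

import os

SMALLCHUNK = 8 * 1024  # assumed fallback chunk size, for this sketch only

def readall_sketch(fd, new_buffersize):
    # Happy path: size the buffer from fstat() and read the file in one call
    # (or a few calls for a large file). Unhappy path: the estimate is wrong
    # or unavailable, so the buffer is grown step by step with new_buffersize().
    try:
        pos = os.lseek(fd, 0, os.SEEK_CUR)
        end = os.fstat(fd).st_size
        bufsize = max(end - pos, SMALLCHUNK) + 1  # +1 so EOF is hit without growing
    except OSError:
        bufsize = SMALLCHUNK
    result = bytearray()
    while True:
        if len(result) >= bufsize:
            # The size estimate was too small: this is the path being optimized.
            bufsize = new_buffersize(len(result))
        chunk = os.read(fd, bufsize - len(result))
        if not chunk:  # EOF
            return bytes(result)
        result += chunk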

That new_buffersize() function was written 16 years ago and has had little optimization since.
It starts reading the file with an 8 kB buffer and grows it in steps of around 8 kB, so it does a lot of small, inefficient reads.
It only switches to growing in steps of 12.5% after 65 kB, which is still minuscule.
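
To make that concrete, here is an illustrative Python model of the current growth rule. The real function is C code in the _io module; the constants below follow the description above rather than being copied from the source.

SMALLCHUNK = 8 * 1024  # initial/minimum read size described above

def old_new_buffersize(currentsize):
    # Below ~64 kB the buffer grows by roughly the minimum chunk (~8 kB per
    # step); above that it grows by currentsize >> 3, i.e. 12.5% per step.
    if currentsize > 64 * 1024:
        addend = currentsize >> 3
    else:
        addend = SMALLCHUNK
    return currentsize + max(addend, SMALLCHUNK)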

I spent a few days looking into this function (as part of the attached ticket on optimizing buffers); the attached PR is what I could come up with to optimize it.

Considerations and gotchas:

  • I am seeing that buffers above a certain size (256 kB or 512 kB) carry a few microseconds of extra overhead, probably due to the OS allocating them differently, with large pages, and zeroing them (there was another ticket raised about that). The code therefore starts with a 128 kB buffer to avoid that overhead. This may vary with the operating system.
  • For small sizes, multiply the size by 4 on each step; we want to get into the MB range quickly to do larger I/O.
  • For larger sizes, increase the size by 12.5%, same as the existing code. This is required to avoid running out of memory when reading a very large file (for example, reading a 25 GB file on a system with 32 GB of memory will fail with an out-of-memory error if the buffer grows by 50% or even 25%). There is another old ticket about that from years ago. (A sketch of the proposed growth rule follows this list.)
  • The buffer is reallocated in place without copying the memory, so it is actually not expensive to call realloc() many times since there is no copy. This may vary with the operating system.
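
Here is a Python sketch of the proposed growth strategy. The 128 kB starting size and the 12.5% rule come from the bullets above; the 4 MB cutover is a placeholder of mine for illustration, not a value taken from the patch.

START_SIZE = 128 * 1024        # start large enough to avoid the tiny-read regime
LARGE_SIZE = 4 * 1024 * 1024   # assumed cutover to slow growth, illustration only

def proposed_new_buffersize(currentsize):
    # Small buffers: multiply by 4 to reach MB-sized I/O within a few steps.
    # Large buffers: grow by 12.5% (currentsize >> 3), like the existing code,
    # so reading a very large file does not over-allocate and run out of memory.
    if currentsize < START_SIZE:
        return START_SIZE
    if currentsize < LARGE_SIZE:
        return currentsize * 4
    return currentsize + (currentsize >> 3)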

Is it worth explaining all of that in comments?
The existing code is sparse on explanation and I had to go through a fair number of tickets and a fair amount of debugging to understand the history.

See the code below to benchmark with different file sizes on your machine.
This PR is a draft to discuss the fix before I spend more time on it, and to get a CI build passing.

import os
import time

# Create an empty file and open it, then append data to it with dd (Linux)
# so the file grows after it has been opened. Adjust or uncomment the dd
# lines to benchmark other file sizes.
os.system("touch file.txt")
os.system("truncate --size 0 file.txt")
f = open("file.txt", "rb")
os.system("dd if=/dev/urandom bs=1k count=1 >> file.txt")
os.system("dd if=/dev/urandom bs=50k count=1 >> file.txt")
os.system("dd if=/dev/urandom bs=150k count=1 >> file.txt")
os.system("dd if=/dev/urandom bs=1M count=1 >> file.txt")
#os.system("dd if=/dev/urandom bs=1k count=1 >> file.txt")
#os.system("dd if=/dev/urandom bs=2M count=1 >> file.txt")
#os.system("dd if=/dev/urandom bs=2M count=1 >> file.txt")
os.system("dd if=/dev/urandom bs=1k count=1 >> file.txt")
time.sleep(0.5)

# Time a single read() of the whole file.
start = time.perf_counter()
data = f.read()
end = time.perf_counter()
elapsed = end - start
f.close()
finalsize = os.path.getsize("file.txt")
print("read took {:3.09f} ms for {} bytes".format(elapsed * 1000.0, finalsize))
