
performance: Update io.DEFAULT_BUFFER_SIZE to make python IO faster? #117151

Open · morotti opened this issue Mar 22, 2024 · 9 comments

Labels: performance (Performance or resource usage), stdlib (Python modules in the Lib dir), topic-IO

morotti (Contributor) commented Mar 22, 2024

Bug report

Bug description:

Hello,

I was doing some benchmarking of Python and package installation.
That got me down a rabbit hole of buffering optimizations between pip, requests, urllib and the CPython interpreter.

TL;DR: I would like to discuss updating the value of io.DEFAULT_BUFFER_SIZE. It has been set to 8192 for 16 years.
original commit: https://github.com/python/cpython/blame/main/Lib/_pyio.py#L27

It was a reasonable size given the hardware and operating systems of the time; it's far from optimal today.
Remember, in 2008 you'd run a 32-bit operating system with less than 2 GB of memory available, to be shared between all running applications.
Buffers had to be small, a few kB; it wasn't conceivable to have buffers measured in whole megabytes.

I will attach benchmarks in the next messages showing a 3 to 5 times write performance improvement when adjusting the buffer size.

I think the Python interpreter can adopt a buffer size somewhere between 64k and 256k by default.
I think 64k is the minimum for Python and it should be safe to adjust to.
Higher is better for performance in most cases, though there may be some cases where it's unwanted
(seeks and small reads/writes, unwanted triggering of write-ahead, slow devices with throughput measured in kB/s where you don't want to block for long).
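
For reference, callers who want a larger buffer today can already pass one explicitly to open(); a minimal sketch below (the file name and the 256 KiB value are purely illustrative):

import os

# Minimal sketch of the workaround available today: pass an explicit buffer size
# to open(). The file name and the 256 KiB value are illustrative only.
BUFFER_SIZE = 256 * 1024

with open("payload.bin", "wb", buffering=BUFFER_SIZE) as f:
    for _ in range(1024):
        f.write(os.urandom(8192))  # small writes are coalesced into 256 KiB flushes

os.remove("payload.bin")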

In addition, I think there is a bug in open() on Linux.
open() sets the buffer size to the device block size on Linux when available (st_blksize, 4k on most disks), instead of io.DEFAULT_BUFFER_SIZE=8k.
I believe this is unwanted behavior: the block size is the minimal size for I/O operations on the device, it's not the optimal size and it should not be preferred.
I think open() on Linux should be corrected to use a default buffer size of max(st_blksize, io.DEFAULT_BUFFER_SIZE) instead of st_blksize.
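
A rough pure-Python sketch of the selection logic I have in mind, approximating what _pyio.open() does today (the helper name is hypothetical, not an existing API):

import io
import os

def pick_buffer_size(raw):
    """Hypothetical helper: choose the buffer size for a freshly opened raw file.

    open() currently prefers st_blksize whenever it is positive; the proposal is
    to never go below io.DEFAULT_BUFFER_SIZE.
    """
    buffer_size = io.DEFAULT_BUFFER_SIZE
    try:
        blksize = os.fstat(raw.fileno()).st_blksize
    except (OSError, AttributeError):
        return buffer_size
    if blksize > 1:
        # proposed behavior: take the larger of the two instead of st_blksize alone
        buffer_size = max(blksize, io.DEFAULT_BUFFER_SIZE)
    return buffer_size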

Related, the doc might be misleading in saying st_blksize is the preferred size for efficient I/O. https://github.com/python/cpython/blob/main/Doc/library/os.rst#L3181
The GNU doc was updated to clarify: "This is not guaranteed to give optimum performance" https://www.gnu.org/software/gnulib/manual/html_node/stat_002dsize.html

Thoughts?

Annex: some historical context and technical considerations around buffering.

On the hardware side:

  • HDDs historically had 512-byte blocks, then moved to 4096-byte blocks in the 2010s.
  • SSDs have 4096-byte blocks as far as I know.

On filesystems:

  • the buffer size should never be smaller than the device and filesystem block size
  • I think ext3, ext4, xfs, ntfs, etc. follow the device block size of 4k as the default, though they can be configured for any block size.
  • NTFS is capped to a 16 TB maximum disk size with 4k blocks.
  • Microsoft recommends a 64k block size for Windows Server 2019+ and larger disks https://learn.microsoft.com/en-us/windows-server/storage/file-server/ntfs-overview
  • RAID setups and the likes of zfs/btrfs/xfs can have a custom block size, I think anywhere from 4 kB to 1 MB. I don't know if there is any consensus; anything from 16k, 32k, 64k to 128k can be seen in the wild.

On network filesystems:

  • shared network home directories are common on Linux (NFS shares) and Windows (SMB shares).
  • enterprise storage vendors like Pure/Vast/NetApp recommend 524488 or 1048576 bytes for IO.
  • see rsize and wsize in the mount settings:
  • host:path on path type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,acregmin=60,acdirmin=60,hard,proto=tcp,nconnect=8,mountproto=tcp, ...)
  • for Windows I cannot find documentation for network clients, though the Windows server should have an NTFS filesystem with at least a 64k block size as per the Microsoft recommendation above.

On pipes:

  • buffering is used by pipes and for interprocess communication, see subprocess.py
  • POSIX guarantees that writes to pipes are atomic up to PIPE_BUF: 4096 bytes on the Linux kernel, and guaranteed to be at least 512 bytes by POSIX.
  • Python has had a default of io.DEFAULT_BUFFER_SIZE=8192 so it never benefitted from that atomic property :D (see the sketch below)
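
A minimal sketch of how those defaults surface in practice (Unix-only because of select.PIPE_BUF; bufsize=-1 is simply the subprocess default spelled out):

import select
import subprocess
import sys

# Minimal sketch: PIPE_BUF is the POSIX atomic-write limit for pipes (4096 on Linux),
# while Popen's default bufsize=-1 wraps the pipes in io.DEFAULT_BUFFER_SIZE-sized buffers.
print("PIPE_BUF =", select.PIPE_BUF)  # Unix only

proc = subprocess.Popen(
    [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    bufsize=-1,  # default: buffered pipes sized by io.DEFAULT_BUFFER_SIZE
    text=True,
)
out, _ = proc.communicate("x" * 8192)
assert out == "x" * 8192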

On compression code (the buffer sizes there probably all need to be adjusted as well):

On network IO:

  • On Linux, TCP read and write buffers historically had a minimum of 16k. The read buffer was increased to 64k in kernel v4.20, in 2018.
  • the buffer is resized dynamically with the TCP window, up to 4 MB write and 6 MB read; let's not get into TCP. see sysctl_tcp_rmem and sysctl_tcp_wmem
  • linux code: https://github.com/torvalds/linux/blame/master/net/ipv4/tcp.c#L4775
  • commit Sep 2018: torvalds/linux@a337531
  • I think socket buffers are managed separately by the kernel; io.DEFAULT_BUFFER_SIZE matters when you read a file and write to the network, or read from the network and write to a file (see the sketch below).
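
A minimal sketch of that read-from-network / write-to-file path (the host, port and file name are illustrative assumptions):

import socket

# Minimal sketch: copy an HTTP response from a socket to a file.
# socket.makefile() wraps the socket in a BufferedReader sized by io.DEFAULT_BUFFER_SIZE;
# the buffering= argument to open() controls the write side.
COPY_CHUNK = 256 * 1024  # illustrative

with socket.create_connection(("example.com", 80)) as sock:
    sock.sendall(b"GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")
    with sock.makefile("rb") as src, open("response.raw", "wb", buffering=COPY_CHUNK) as dst:
        while chunk := src.read(COPY_CHUNK):
            dst.write(chunk)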

On HTTP, a large subset of networking:

(note to self: remember to publish code and results in the next message)

CPython versions tested on:

3.11

Operating systems tested on:

Other

Linked PRs

morotti added the type-bug (An unexpected behavior, bug, or error) label on Mar 22, 2024

morotti (Contributor, Author) commented Mar 22, 2024

Some benchmarking code I used to debug download and write performance:

import io
import os
import platform
import requests
import sys
import time


def download_file(run, url, filepath, chunksize, buffersize):
    if os.path.exists(filepath):
        os.remove(filepath)
    calls = 0
    start = time.perf_counter()
    write_duration = 0.0
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(filepath, 'wb', buffering=buffersize) as f:
            st_blksize = os.stat(filepath).st_blksize
            for chunk in r.iter_content(chunk_size=chunksize):
                calls = calls + 1
                t1 = time.perf_counter()
                f.write(chunk)
                t2 = time.perf_counter()
                write_duration = write_duration + (t2 - t1)
    end = time.perf_counter()
    function_duration = end - start
    print(
        "run={} filepath={} total_duration={} download_chunksize={} write_duration={} write_buffersize={} calls={} st_blksize={}".format(
            run, filepath, function_duration, chunksize, write_duration, buffersize, calls, st_blksize
        ))


def main():
    print("python {} running on {}".format(sys.version, platform.platform()))
    NUMPY_WHEEL = "https://example.com/numpy/1.21.6/numpy-1.21.6-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl"
    for run in range(0, 10):
        for download_directory in [os.path.abspath(".")]:
            for url in [NUMPY_WHEEL]:
                download_path = os.path.join(download_directory, url.rsplit("/", 1)[1])
                for download_chunksize in [512, 1024, 2048, 4096, 8192,
                                           10000, 16384, 32768, 65536, 131072, 262144, 524488,
                                           1048576, 2097152, 4194304, 8388608, 16777216]:
                    for file_buffersize in [0, 4096, 8192, 65536, 262144, 1048576]:
                        download_file(run, url, download_path, download_chunksize, file_buffersize)


if __name__ == "__main__":
    main()

morotti (Contributor, Author) commented Mar 22, 2024

Benchmark results, running on Python 3.11, on various OS and storage.

[attached images: benchmark_linux_write_var, benchmark_linux_write_homenfs, benchmark_windows_write_temp]

hugovk added the performance (Performance or resource usage) and stdlib (Python modules in the Lib dir) labels on Mar 22, 2024
Eclips4 added the topic-IO label and removed the type-bug label on Mar 22, 2024

Fidget-Spinner (Member) commented Mar 22, 2024

I think your argument makes sense; consumer RAM sizes have more than quadrupled in the past 16 years IIRC, so it shouldn't hurt to increase buffer sizes.

I cannot champion this though, because I am currently wrapped up in too many things. Sorry.

masklinn (Contributor) commented Mar 22, 2024

SSD have 4096 bytes blocks as far as I know.

AFAIK SSDs have 4 to 8k pages. An SSD block contains up to 256 pages. The NVMe capabilities of the drive are also a factor, as an NVM command can generally transfer a multiple of the page size.

morotti pushed a commit to man-group/cpython that referenced this issue Mar 26, 2024
… are equal to the buffer size. avoid extra memory copy.

BufferedWriter() was buffering calls that are the exact same size as the buffer. It's a very common case to read/write in blocks of the exact buffer size.

It's pointless to copy a full buffer: it costs an extra memory copy, and the full buffer will have to be written out on the next call anyway.
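
A minimal micro-benchmark sketch of that pattern, i.e. writes whose size exactly equals the BufferedWriter buffer (the file name and loop count are illustrative):

import io
import os
import time

# Minimal sketch: every write is exactly one buffer long, the pattern the commit targets.
BUF = io.DEFAULT_BUFFER_SIZE  # 8192 today
chunk = b"\0" * BUF

start = time.perf_counter()
with open("bench.bin", "wb", buffering=BUF) as f:
    for _ in range(50_000):
        f.write(chunk)  # previously copied into the buffer, only to be flushed on the next call
print("exact-buffer-size writes:", time.perf_counter() - start, "seconds")
os.remove("bench.bin")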
morotti pushed a commit to man-group/cpython that referenced this issue Apr 2, 2024
…ER_SIZE to 128k, fix open() to use max(st_blksize, io.DEFAULT_BUFFER_SIZE)

performance:
eendebakpt (Contributor) commented

I can confirm this improves performance. @morotti Could you open a PR?

serhiy-storchaka pushed a commit that referenced this issue Apr 23, 2024
…he buffer size (GH-118037)

BufferedWriter() was buffering calls that are the exact same size as the buffer. It's a very common case to read/write in blocks of the exact buffer size.

It's pointless to copy a full buffer: it costs an extra memory copy, and the full buffer will have to be written out on the next call anyway.

Co-authored-by: rmorotti <romain.morotti@man.com>
morotti (Contributor, Author) commented Apr 29, 2024

@eendebakpt I opened a PR, can you review?

#118144

eendebakpt (Contributor) commented

@eendebakpt I opened a PR, can you review?

#118144

Yes, I'll have a look in a couple of days.

morotti pushed a commit to man-group/cpython that referenced this issue Apr 30, 2024
… to 256k.

It was set to 16k in the 1990s.
It was raised to 64k in 2019; the discussion at the time mentioned another 5% improvement from raising to 128k and settled on a very conservative setting.

It's 2024 now; I think it should be revisited to match modern hardware. I am measuring a 0-15% performance improvement when raising to 256k on various types of disk. There is no downside as far as I can tell.

This function is only intended for sequential copies of full files (or file-like objects); it's the typical use case that benefits from larger operations.

For reference, I came across this function while trying to profile pip, which uses it to copy files when installing Python packages.
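
For reference, callers can already pass an explicit buffer size to shutil.copyfileobj(); a minimal sketch (the file names and the 256 KiB value are illustrative, and src.bin is assumed to exist):

import shutil

# Minimal sketch: override copyfileobj's internal chunk size explicitly.
COPY_BUFSIZE = 256 * 1024  # illustrative

with open("src.bin", "rb") as src, open("dst.bin", "wb") as dst:
    shutil.copyfileobj(src, dst, length=COPY_BUFSIZE)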
serhiy-storchaka (Member) commented
In what cases is st_blksize larger than 128 KiB?

morotti (Contributor, Author) commented Apr 30, 2024

st_blksize is the block size reported by the device.

I see it larger than 128 kB on NFS network filesystems, like in the benchmark I submitted above.
The value matches the rsize set in the NFS mount settings.
It is set to 524488 or 1048576 bytes for the two enterprise storage vendors I have hardware from, as per their recommended settings, which are optimal for their respective hardware. (Apologies, I'm not sure I have permission to name brands and benchmarks ^^.)

It can be seen on any filesystem where a larger block size was set; it's a free setting when the filesystem is created. I think most filesystems (XFS/ZFS/EXT4) allow setting any block size from 4k to 1M or so. I think more than 128k can be seen in some RAID setups with enough large disks.

Microsoft recommends a 64k block size for Windows Server 2019+; a 4k block size is limited to 16 TB volumes and a 64k block size to 256 TB volumes. The block size can be set up to 2M.
It should be possible to see it on Linux, if mounting a volume remotely and the mount is configured to expose the block size from the server or set to the same size.
https://learn.microsoft.com/en-us/windows-server/storage/file-server/ntfs-overview

I think it should be visible as well for S3 filesystems, but I don't have one to test anymore.
There are tools to mount S3 buckets directly as a filesystem on Linux; they should hint huge blocks because the HTTP overhead is huge.

Basically, anything involving large disks, storage appliances, network storage and specialized filesystems.
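
A minimal sketch for checking what the kernel reports (the paths are illustrative; st_blksize is not populated on Windows):

import os

# Minimal sketch: print the block size the kernel reports for a few mount points.
# On an NFS mount, st_blksize typically matches the rsize mount option.
for path in ("/tmp", os.path.expanduser("~")):
    print(path, "st_blksize =", os.stat(path).st_blksize)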

mxmlnkn added a commit to mxmlnkn/ratarmount that referenced this issue Sep 14, 2024
…at implementations

This is especially important for Lustre after disabling the buffering for calls through FUSE in the earlier commit. Without this, we would now have 8K reads to Lustre, which has huge latencies and should function more optimally with 4 MiB reads.

See also the proposal to increase the default buffer size of 8K:
python/cpython#117151
gpshead self-assigned this on Oct 3, 2024
gpshead changed the title from "performance: can we update io.DEFAULT_BUFFER_SIZE to make python IO 3 times faster? :)" to "performance: Update io.DEFAULT_BUFFER_SIZE to make python IO faster?" on Oct 3, 2024
gpshead pushed a commit that referenced this issue Oct 4, 2024
…6k. (GH-119783)

* gh-117151: increase default buffer size of shutil.copyfileobj() to 256k.

Co-authored-by: rmorotti <romain.morotti@man.com>