gh-120754: Refactor I/O modules to stash whole stat result rather than individual members #123412

cmaloney · 2024-08-28T00:24:19Z

As I was working on gh-120754 I noticed I kept adding more members and copying out individual members from the fstat call, and that it may be simpler / easier to just stash (and invalidate) the whole stat result rather than individual members. This is preparatory work for:

- Avoiding calling isatty on open for regular files (resolving "Avoid calling isatty() for most open() calls" #90102)
- Reducing system calls by making more members available (helping implement "Speed up open().read() pattern by reducing the number of system calls" #120754)

One important note, and the reason the member is called stat_atopen, is that the values should only be used as guidance / an estimate. This was also the case when individual members were copied out. While it's common for a file not to be modified while Python code is reading it, other processes could interact with it, and code needs to handle that. Two examples I've come across: it is possible to change an fd so that the isatty result changes (see gh-90102 and GH-121941), and an fd which is opened in blocking mode may have an ioctl used on it to change it to non-blocking (see gh-109523). This general class of bugs is commonly called time-of-check to time-of-use (TOCTOU, https://en.wikipedia.org/wiki/Time-of-check_to_time-of-use).
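To illustrate the "estimate only" point, here is a minimal sketch (not the code in this PR) of a whole-file read that uses the size from fstat at open purely as a hint for the first read, and still loops until EOF in case the file changed in the meantime. The names `read_all` and `FALLBACK_CHUNK` are hypothetical.

```python
import os
import tempfile

FALLBACK_CHUNK = 8192  # hypothetical chunk size used once the hint is exhausted


def read_all(path):
    """Read a whole file, treating the stat-at-open size only as a hint."""
    fd = os.open(path, os.O_RDONLY)
    try:
        stat_atopen = os.fstat(fd)
        # The size may be stale by the time we read, so it only picks the
        # size of the first read; correctness comes from looping until EOF.
        hint = stat_atopen.st_size if stat_atopen.st_size > 0 else FALLBACK_CHUNK
        chunks = []
        while True:
            chunk = os.read(fd, hint)
            if not chunk:  # empty bytes means EOF
                break
            chunks.append(chunk)
            hint = FALLBACK_CHUNK
        return b"".join(chunks)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"hello")
    print(read_all(f.name))
    os.remove(f.name)
```

If another process appends to or truncates the file after the fstat, the loop still returns whatever the reads actually produce; the hint only affects how many reads that takes.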

Given how common some specific patterns are (e.g. Path().read_bytes()), it is still worthwhile to optimize them (e.g. disabling buffering results in an over 10% speedup in that case, GH-122111). The existing code paths handled this correctly as far as I can tell.

This PR is a portion of GH-121593, which is being split up into smaller, hopefully easier to review chunks. Not calling isatty for regular files makes a small but measurable perf improvement for every "open and read whole regular file" Python does.
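As a rough illustration of how a stashed stat result could let the isatty call be skipped, consider the sketch below. It is not the change from GH-121593; the helper name `might_be_tty` and the choice to test `stat.S_ISCHR` (a terminal is a character device, so anything else cannot be a TTY) are assumptions made for this example.

```python
import os
import stat
import tempfile


def might_be_tty(fd, stat_atopen=None):
    """Return os.isatty(fd), skipping the call when stat rules a TTY out."""
    if stat_atopen is not None and not stat.S_ISCHR(stat_atopen.st_mode):
        # Not a character device at open time, so no isatty system call.
        return False
    # No stat available, or a character device: ask the OS.
    return os.isatty(fd)


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        print(might_be_tty(fd, os.fstat(fd)))
```

Because the stat result is only a snapshot from open time, a caller relying on this shortcut still needs to tolerate the fd changing underneath it.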

Multiple places in the I/O stack optimize common cases by using the information from stat. Currently individual members are extracted from the stat and stored into the fileio struct. Refactor the code to store the whole stat struct instead.

Parallels the changes to _io. The `stat` Python object doesn't allow changing members, so rather than modifying estimated_size, just clear the value.
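The idea of stashing the whole result and clearing it, rather than editing one field, can be sketched as follows. This is a simplified illustration, not the `_pyio` code: the class `RawFileSketch` and its method names are hypothetical, and only `os.fstat` and the read-only behaviour of its result are taken as given.

```python
import os
import tempfile


class RawFileSketch:
    """Hypothetical stand-in for a raw file object that stashes fstat()."""

    def __init__(self, path):
        self._fd = os.open(path, os.O_RDONLY)
        # Keep the whole result instead of copying out individual members.
        self._stat_atopen = os.fstat(self._fd)

    def estimated_size(self):
        # None means "no estimate available"; callers must cope with that.
        if self._stat_atopen is None:
            return None
        return self._stat_atopen.st_size

    def invalidate(self):
        # Members cannot be modified in place, so drop the whole result.
        self._stat_atopen = None

    def close(self):
        if self._fd >= 0:
            os.close(self._fd)
            self._fd = -1
        self._stat_atopen = None


def stat_is_read_only(st):
    """Return True if assigning to a member of the stat result fails."""
    try:
        st.st_size = 0
    except (AttributeError, TypeError):
        return True
    return False
```

One consequence of this design is that invalidation is all-or-nothing: once the estimate for one member is known to be stale, every other member is dropped with it.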

cmaloney · 2024-08-28T02:03:27Z

Could this get the no news tag? (This is changing / refactoring an implementation detail)


vstinner

LGTM.

@gpshead @serhiy-storchaka @pitrou: Would you mind having a look?


vstinner · 2024-09-18T15:48:21Z

Ok, I merged your change. Thanks for your contribution. Let's see how it goes :-)

cmaloney · 2024-09-18T20:37:52Z

Looking at individual buildbots, seeing some test_io refleaks failures (https://buildbot.python.org/#/builders/259/builds/1384, https://buildbot.python.org/#/builders/551/builds/78), digging in a bit.

vstinner · 2024-09-18T21:25:20Z

Using test.bisect_cmd, I identified the leaking test:

$ ./python -m test test_io -R 3:3 -m test.test_io.CIOTest.test_fileio_closefd
(...)
test_io leaked [1, 1, 1] memory blocks, sum=3
(...)

vstinner · 2024-09-18T21:28:52Z

> Looking at individual buildbots, seeing some test_io refleaks failures (https://buildbot.python.org/#/builders/259/builds/1384, https://buildbot.python.org/#/builders/551/builds/78), digging in a bit.

I wrote a fix: PR gh-124225.

gpshead · 2024-09-18T22:05:31Z

Modules/_io/fileio.c

+get_blksize(fileio *self, void *closure)
+{
+#ifdef HAVE_STRUCT_STAT_ST_BLKSIZE
+    if (self->stat_atopen != NULL && self->stat_atopen->st_blksize > 1) {


I do wonder how realistic the st_blksize values, when available, are for performance purposes. I guess we'll find out.

vstinner · 2024-09-18

This PR should not change the buffer size, does it?

cmaloney · 2024-09-18

#117151 (comment) investigated st_blksize a bit previously. In this PR I tried not to change the buffer size at all, and only change how it is accessed.

With the refactors + optimizations I have been watching for new issues, and am finding some as people test main (e.g. gh-113977, which I wrote a primary fix for in #122101, and I have more fix ideas on top of the stat_atopen changes).
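For readers following the st_blksize discussion, here is a small Python sketch of the accessor pattern in the quoted diff: use the block size from the stashed stat result when it is present and greater than 1, otherwise fall back to a default. The function name `blksize_hint` is hypothetical, and falling back to `io.DEFAULT_BUFFER_SIZE` is an assumption of this example rather than something shown in the diff excerpt.

```python
import io
from types import SimpleNamespace


def blksize_hint(stat_atopen):
    """Pick a buffer size from a stashed stat result, with a fallback."""
    # st_blksize is not available on every platform, hence getattr().
    blksize = getattr(stat_atopen, "st_blksize", 0) if stat_atopen is not None else 0
    if blksize > 1:
        return blksize
    return io.DEFAULT_BUFFER_SIZE


if __name__ == "__main__":
    # SimpleNamespace stands in for a real os.stat_result here.
    print(blksize_hint(SimpleNamespace(st_blksize=4096)))
    print(blksize_hint(None) == io.DEFAULT_BUFFER_SIZE)
```

Keeping the fallback inside the accessor means callers do not need to know whether the stat result has been cleared.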


cmaloney added 2 commits August 27, 2024 16:53

pythongh-120754: Refactor _pyio to stash whole stat

9d849ce

Parallels the changes to _io. The `stat` Python object doesn't allow changing members, so rather than modifying estimated_size, just clear the value.

bedevere-app bot mentioned this pull request Aug 28, 2024

Speed up open().read() pattern by reducing the number of system calls #120754

Closed

bedevere-app bot added the awaiting review label Aug 28, 2024

cmaloney mentioned this pull request Aug 28, 2024

GH-120754: Remove isatty call during regular open #121593

Closed

picnixz added the skip news label Aug 28, 2024

vstinner reviewed Aug 28, 2024


cmaloney and others added 4 commits August 28, 2024 17:20

Apply suggestions from code review

bfcfcf2

Co-authored-by: Victor Stinner <vstinner@python.org>

Add comments around stat_atopen and why bufsize is + 1

3122665

Apply review changes for _pyio _blkszie

8f5cfe4

Fix comment formatting

d18a82d

vstinner approved these changes Aug 29, 2024


bedevere-app bot added awaiting merge and removed awaiting review labels Aug 29, 2024

Add _pyiio +1 comment to fileio for better clarity

c55d10e

vstinner merged commit 8b6c7c7 into python:main Sep 18, 2024
35 checks passed

bedevere-app bot removed the awaiting merge label Sep 18, 2024

cmaloney deleted the cmaloney/stat_atopen branch September 18, 2024 19:04

zware mentioned this pull request Sep 18, 2024

Fix make htmllive target #124219

Merged

gpshead reviewed Sep 18, 2024


This was referenced Oct 3, 2024

gh-90102: Remove isatty call during regular open #124922

Merged

gh-117151: IO performance improvement, increase io.DEFAULT_BUFFER_SIZE to 128k #118144

Merged
