Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-120754: Refactor I/O modules to stash whole stat result rather than individual members #123412

Merged
merged 7 commits into from
Sep 18, 2024

Conversation

cmaloney
Copy link
Contributor

As I was working on gh-120754 I noticed I kept adding more members and copying out individual members from the fstat call, and that it may be simpler / easier to just stash (and invalidate) the whole stat result rather than individul members. This is preparatory work for

  1. Avoid calling isatty on open for regular files (Resolving Avoid calling isatty() for most open() calls #90102)
  2. Reduce system calls by making more members available (Helping implement Speed up open().read() pattern by reducing the number of system calls #120754)

One important note, and why the member is called stat_atopen is that the values should only be used as guidance / an estimate. With individual members copied out this was also the case. While it's common for a file to not be modified while python code is reading it, other processes could interact with it and code needs to handle that. Two examples of this that I've come across: It is possible to change an fd so isatty result changes (see: gh-90102 and GH-121941) and a fd which is opened in blocking mode may have an ioctl used on it to change it to non-blocking (see: gh-109523). The general class of bugs here are commonly called time-of-check to time of use (TOCTOU, https://en.wikipedia.org/wiki/Time-of-check_to_time-of-use)

Given how common some specific patterns are (ex. Path().read_bytes()) it is still worthwhile to optimize those (Ex. disabling buffering results in a over 10% speedup in that case, GH-122111). The existing codepaths treated this correctly as far as I can tell.

This PR is a portion of GH-121593 which is being split up into smaller, hopefully easier to review chunks. Not calling isatty for regular files makes a small but measurable perf improvement for every "open and read whole regular file" python does.

Multiple places in the I/O stack optimize common cases by using the
information from stat. Currently individual members are extracted from
the stat and stored into the fileio struct. Refactor the code to store
the whole stat struct instead.
Parallels the changes to _io. The `stat` Python object doesn't allow
changing members, so rather than modifying estimated_size, just clear
the value.
@cmaloney
Copy link
Contributor Author

Could this get the no news tag? (This is changing / refactoring an implementation detail)

Lib/_pyio.py Outdated Show resolved Hide resolved
Modules/_io/fileio.c Outdated Show resolved Hide resolved
Lib/_pyio.py Show resolved Hide resolved
Modules/_io/fileio.c Show resolved Hide resolved
Modules/_io/fileio.c Show resolved Hide resolved
Lib/_pyio.py Outdated Show resolved Hide resolved
Lib/_pyio.py Outdated Show resolved Hide resolved
Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@gpshead @serhiy-storchaka @pitrou: Would you mind to have a look?

Lib/_pyio.py Show resolved Hide resolved
@vstinner vstinner merged commit 8b6c7c7 into python:main Sep 18, 2024
35 checks passed
@vstinner
Copy link
Member

Ok, I merged your change. Thanks for your contribution. Let's see how it goes :-)

@cmaloney cmaloney deleted the cmaloney/stat_atopen branch September 18, 2024 19:04
@cmaloney
Copy link
Contributor Author

Looking at individual buildbots, seeing some test_io refleaks failures (https://buildbot.python.org/#/builders/259/builds/1384, https://buildbot.python.org/#/builders/551/builds/78), digging in a bit.

@vstinner
Copy link
Member

Using test.bisect_cmd, I identified the leaking test:

$ ./python -m test test_io -R 3:3 -m test.test_io.CIOTest.test_fileio_closefd
(...)
test_io leaked [1, 1, 1] memory blocks, sum=3
(...)

@vstinner
Copy link
Member

Looking at individual buildbots, seeing some test_io refleaks failures (https://buildbot.python.org/#/builders/259/builds/1384, https://buildbot.python.org/#/builders/551/builds/78), digging in a bit.

I wrote a fix: PR gh-124225.

get_blksize(fileio *self, void *closure)
{
#ifdef HAVE_STRUCT_STAT_ST_BLKSIZE
if (self->stat_atopen != NULL && self->stat_atopen->st_blksize > 1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do wonder how realistic the st_blksize values, when available, are for performance purposes, I guess we'll find out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR should not change the buffer size, does it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#117151 (comment) investigated st_blksize a bit previously. This PR I tried not to change buffer size at all / just change how it is accessed.

Have with the refactors + optimizations been watching for new issues. Are finding some as people test main (ex. gh-113977 which I wrote a primary fix for #122101, and have more fix ideas on top of the stat_atopen changes)

savannahostrowski pushed a commit to savannahostrowski/cpython that referenced this pull request Sep 22, 2024
…er than individual members (python#123412)

Multiple places in the I/O stack optimize common cases by using the
information from stat. Currently individual members are extracted from
the stat and stored into the fileio struct. Refactor the code to store
the whole stat struct instead.

Parallels the changes to _io. The `stat` Python object doesn't allow
changing members, so rather than modifying estimated_size, just clear
the value.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants