GH-116380: Speed up glob.glob() by removing some system calls #116392

Merged Feb 28, 2025 (101 commits)

barneygale
Contributor

@barneygale barneygale commented Mar 5, 2024

Speed up glob.glob() and glob.iglob() by reducing the number of system calls made.

This unifies the implementations of globbing in the glob and pathlib modules.

Depends on

Filtered recursive walk

Expanding a recursive ** segment entails walking the entire directory tree, so any subsequent pattern segments (except special segments) can be evaluated by filtering the expanded paths through a regex. For example, glob.glob("foo/**/*.py", recursive=True) recursively walks foo/ with os.scandir(), and then filters paths through a regex based on "**/*.py", with no further filesystem access needed.
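
To make the idea concrete, here is a minimal sketch of a walk-then-filter expansion. It is not the code from this PR: walk_relative(), glob_py() and the hand-written PY_FILES regex are hypothetical names chosen for illustration, and the sketch ignores hidden-file rules and symlink handling.

```python
import os
import re
import tempfile


def walk_relative(root, prefix=""):
    """Recursively yield '/'-joined relative paths of everything below root."""
    with os.scandir(os.path.join(root, prefix)) as entries:
        entries = list(entries)
    for entry in entries:
        rel = prefix + entry.name
        yield rel
        # The directory check comes from the scandir() entry itself.
        if entry.is_dir(follow_symlinks=False):
            yield from walk_relative(root, rel + "/")


# Hand-written stand-in for a regex derived from "**/*.py" (an assumption,
# not the regex the PR generates).
PY_FILES = re.compile(r"(?:.+/)?[^/]*\.py")


def glob_py(root):
    """Walk root once, then filter the paths purely with the regex."""
    return sorted(p for p in walk_relative(root) if PY_FILES.fullmatch(p))


def demo():
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "pkg", "sub"))
        for rel in ("a.py", "notes.txt", "pkg/b.py", "pkg/sub/c.py"):
            open(os.path.join(root, *rel.split("/")), "w").close()
        return glob_py(root)
```

After the single walk, matching "*.py" at any depth touches only strings, which is where the saved system calls come from.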

This solves #104269 as a side-effect.

Tracking path existence

We store a flag alongside each path indicating whether the path is guaranteed to exist. As we process the pattern:

  • Certain special pattern segments ("", "." and "..") leave the flag unchanged
  • Literal pattern segments (e.g. foo/bar) set the flag to false
  • Wildcard pattern segments (e.g. */*.py) set the flag to true (because children are found via os.scandir())
  • Recursive pattern segments (e.g. **) leave the flag unchanged for the root path, and set it to true for descendants discovered via os.scandir().

If the flag is false at the end, we call lstat() on each path to filter out missing paths.
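
The flag rules above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: expand() is a hypothetical helper, it understands only literal segments, ".", ".." and a bare "*", it assumes the starting directory exists, and it uses os.path.lexists() as a stand-in for the final lstat() call.

```python
import os
import tempfile


def expand(root, segments):
    """Expand a simplified pattern below root, tracking an 'exists' flag."""
    candidates = [(root, True)]  # assumption: the starting directory exists
    for segment in segments:
        expanded = []
        for path, exists in candidates:
            if segment in (".", ".."):
                # Special segment: flag is left unchanged.
                expanded.append((os.path.join(path, segment), exists))
            elif segment == "*":
                # Wildcard: children come from scandir(), so they exist.
                try:
                    with os.scandir(path) as entries:
                        names = sorted(entry.name for entry in entries)
                except OSError:
                    continue  # missing path or not a directory
                expanded.extend((os.path.join(path, n), True) for n in names)
            else:
                # Literal: appended blindly, so existence is unknown.
                expanded.append((os.path.join(path, segment), False))
        candidates = expanded
    # Only paths whose flag is false need a final existence check.
    return [p for p, exists in candidates if exists or os.path.lexists(p)]


def demo():
    with tempfile.TemporaryDirectory() as root:
        for name in ("a", "b"):
            os.mkdir(os.path.join(root, name))
        open(os.path.join(root, "a", "x.txt"), "w").close()
        found = expand(root, ["*", "x.txt"])
        return [os.path.relpath(path, root) for path in found]
```

In the demo, "*" yields a and b with the flag set, the literal "x.txt" clears it, and the final check drops b/x.txt. A pattern ending in "*" would skip the final check entirely.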

Minor speed-ups

We:

  • Exclude paths that don't match a non-terminal non-recursive wildcard pattern prior to calling is_dir().
  • Use a stack rather than recursion to implement recursive wildcards.
  • Pre-compile regular expressions and pre-join literal pattern segments.
  • Convert to/from bytes (a minor use-case) in iglob() rather than supporting bytes throughout. This particularly simplifies the code needed to handle relative bytes paths with dir_fd.
  • Avoid calling os.path.join(); instead we keep paths in a normalized form and append trailing slashes when needed.
  • Avoid calling os.path.normcase(); instead we use case-insensitive regex matching.
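
As an illustration of the last point, the sketch below pre-compiles a pattern segment once and relies on a regex flag for case-insensitivity instead of normalizing every name. It uses fnmatch.translate() as a readily available stand-in for the pattern translation; compile_segment() is a hypothetical helper and the PR's own translation logic may differ.

```python
import fnmatch
import re


def compile_segment(pattern, case_sensitive):
    """Pre-compile one pattern segment and return its match function."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.compile(fnmatch.translate(pattern), flags).match


names = ["setup.py", "README.md", "TOOLS.PY"]

# Compiled once, then reused for every directory entry.
match_insensitive = compile_segment("*.PY", case_sensitive=False)
match_sensitive = compile_segment("*.PY", case_sensitive=True)

insensitive_hits = [n for n in names if match_insensitive(n)]
sensitive_hits = [n for n in names if match_sensitive(n)]
```

Because the flag lives in the compiled regex, the original spelling of each name is preserved in the results without a separate normalization pass.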

Implementation notes

Much of this functionality is already present in pathlib's implementation of globbing. The specific additions we make are:

  1. Support for dir_fd
  2. Support for include_hidden
  3. Support for generating paths relative to root_dir

Results

Speedups via python -m timeit -s "from glob import glob" "glob(pattern, recursive=True, include_hidden=True)" from CPython source directory on Linux:

pattern                 speedup
Lib/*                   1.87x
Lib/*/                  1.85x
Lib/*.py                1.3x
Lib/**                  5.62x
Lib/**/                 1.23x
Lib/**/*                1.92x
Lib/**/**               17x
Lib/**/*/               2.15x
Lib/**/*.py             1.79x
Lib/**/__init__.py      1.03x
Lib/**/*/*.py           2.41x
Lib/**/*/__init__.py    1.76x

@barneygale
Contributor Author

barneygale commented Mar 5, 2024

Needs a fix for #116377 to land.

Member

@gpshead gpshead left a comment

Nice work!

@serhiy-storchaka serhiy-storchaka self-requested a review March 6, 2024 17:12
Member

@serhiy-storchaka serhiy-storchaka left a comment

Please do not hurry to merge. This is old code. The main advantage of the initial code was its simplicity, but since then it has been complicated by new features and optimizations. In particular, the use of os.scandir() instead of os.listdir() significantly improved performance. The new implementation should be benchmarked with different test cases: deep and wide trees, and trees dominated by files or by directories.

@barneygale
Contributor Author

barneygale commented Mar 7, 2024

Thanks Serhiy! We use os.scandir() if either:

  1. We're expanding a recursive wildcard (we need to distinguish directories in order to recurse)
  2. We're expanding a non-final non-recursive wildcard (we need to select only directories)

If neither of these is true, then we don't need to stat() the children, and so I think os.listdir() is actually a little faster. But I will test this on a few machines to be sure!

edit: to further illustrate what I mean, here's where os.listdir() is used:

                     non-recursive part    recursive part
non-terminal part    os.scandir()          os.scandir()
terminal part        os.listdir() <--      os.scandir()
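
The selection described in this comment could look roughly like the sketch below. children() is a hypothetical helper, not a function from the PR; it returns None for is_dir when the directory check was never needed.

```python
import os
import tempfile


def children(path, *, terminal, recursive):
    """Return (name, is_dir) pairs; is_dir is None when it was not needed."""
    if terminal and not recursive:
        # Final non-recursive wildcard: names alone are enough.
        return [(name, None) for name in os.listdir(path)]
    # Recursive or non-final wildcard: directories must be told apart.
    with os.scandir(path) as entries:
        return [(entry.name, entry.is_dir()) for entry in entries]


def demo():
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "d"))
        open(os.path.join(root, "f.txt"), "w").close()
        names_only = sorted(children(root, terminal=True, recursive=False))
        with_types = sorted(children(root, terminal=False, recursive=False))
        return names_only, with_types
```

Only the terminal, non-recursive cell of the table takes the os.listdir() branch; the other three cells all need the entry type.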

@barneygale
Contributor Author

The new implementation should be benchmarked with different test cases: deep and wide trees, and trees dominated by files or by directories.

I've been looking into this! The randomfiletree project is helpful: it can repeatedly walk a tree and create child files/folders according to a Gaussian distribution, which seems to me like a good approximation for an average "shallow and wide" filesystem structure, and it can be tweaked for the balance of files to folders.

It's difficult to produce "deep and narrow" trees this way, as the file/folder probability would need to change with the depth (I think?). I've been considering writing a tree generator that works this way, e.g.:

  • At depth==0, generate 100 subdirectories
  • At 0 < depth < 50, generate 1 subdirectory
  • At depth==50, generate 100 files

... but is that overly arbitrary? Is there a better way? Or do I just need to come up with a bunch of test cases along those lines?
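
A generator along those lines might be sketched as follows. make_tree() and count_entries() are hypothetical names, the directory and file names are arbitrary, and the defaults simply mirror the numbers in the list above.

```python
import os
import tempfile


def make_tree(root, fan_out=100, max_depth=50, files=100):
    """Create fan_out single-child directory chains ending in `files` files."""
    for i in range(fan_out):
        # depth == 0: the root gets fan_out subdirectories (each at depth 1).
        parts = [f"dir{i}"]
        # 0 < depth < max_depth: each directory gets exactly one subdirectory.
        parts += ["d"] * (max_depth - 1)
        leaf = os.path.join(root, *parts)
        os.makedirs(leaf)
        # depth == max_depth: the deepest directory gets the files.
        for j in range(files):
            open(os.path.join(leaf, f"file{j}"), "w").close()


def count_entries(root):
    """Return (directory count, file count) below root."""
    dirs = files = 0
    for _, dirnames, filenames in os.walk(root):
        dirs += len(dirnames)
        files += len(filenames)
    return dirs, files
```

With the defaults this creates 5,000 directories and 10,000 files, so smaller arguments are advisable when just trying it out.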

@barneygale
Contributor Author

A test of 100 nested directories named "deep" from my Linux machine:

pattern       speedup
deep/**       3.86x
deep/**/      4.03x
deep/**/*     4.92x
deep/**/*/    4.93x

@barneygale
Contributor Author

barneygale commented Feb 8, 2025

@serhiy-storchaka FYI, this PR changes some of the test expectations you recently added in 8b5c850. Specifically, redundant slashes in the pattern are always preserved, and pattern separators are always normalized. Hopefully that's a reasonable price to pay for the speedups, the support for deep hierarchies, and the elimination of duplicate results.

@barneygale
Contributor Author

I'll merge this in about a week if there's no further feedback. Thanks all.

@barneygale barneygale enabled auto-merge (squash) February 28, 2025 19:42
@barneygale barneygale merged commit da4899b into python:main Feb 28, 2025
39 checks passed
@eendebakpt
Contributor

Great!

@barneygale Thanks for all the work you put into this PR.

@bedevere-bot

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot aarch64 Fedora Stable Clang Installed 3.x has failed when building commit da4899b.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/#/builders/14/builds/7291) and take a look at the build logs.
  4. Check if the failure is related to this commit (da4899b) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/#/builders/14/builds/7291

Failed tests:

  • test_glob

Failed subtests:

  • test_iglob_iter_close - test.test_glob.GlobTests.test_iglob_iter_close

Summary of the results of the build (if available):

==

Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-aarch64.clang-installed/build/target/lib/python3.14/test/test_glob.py", line 424, in test_iglob_iter_close
    self.assertEqual(next(iter), 'deep/d')
                     ~~~~^^^^^^
StopIteration

@bedevere-bot

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot AMD64 Fedora Stable Clang Installed 3.x has failed when building commit da4899b.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/#/builders/350/builds/7433) and take a look at the build logs.
  4. Check if the failure is related to this commit (da4899b) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/#/builders/350/builds/7433

Failed tests:

  • test_glob

Failed subtests:

  • test_iglob_iter_close - test.test_glob.GlobTests.test_iglob_iter_close

Summary of the results of the build (if available):

==

Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-x86_64.clang-installed/build/target/lib/python3.14/test/test_glob.py", line 424, in test_iglob_iter_close
    self.assertEqual(next(iter), 'deep/d')
                     ~~~~^^^^^^
StopIteration

barneygale added a commit to barneygale/cpython that referenced this pull request Mar 1, 2025
…stem calls. (python#116392)"

This broke tests on the 'aarch64 Fedora Stable Clang Installed 3.x' and
'AMD64 Fedora Stable Clang Installed 3.x' build bots.

This reverts commit da4899b.
barneygale added a commit that referenced this pull request Mar 1, 2025
…alls. (#116392)" (#130743)

This broke tests on the 'aarch64 Fedora Stable Clang Installed 3.x' and
'AMD64 Fedora Stable Clang Installed 3.x' build bots.

This reverts commit da4899b.
Labels
performance Performance or resource usage

8 participants