GH-115060: Speed up pathlib.Path.glob() by skipping directory scanning #116152

Closed
wants to merge 6 commits into from


@barneygale barneygale commented Feb 29, 2024

For ordinary literal pattern segments (e.g. foo/bar in foo/bar/../**), skip calling scandir() on each segment, and instead call exists() or is_dir() as necessary to exclude missing paths. This only applies when case_sensitive is None (the default); otherwise we can't guarantee case sensitivity or realness with this approach. If follow_symlinks is False we also need to exclude symlinks from intermediate segments.

This restores an optimization that was removed in da1980a by some eejit. It's actually even faster because we don't stat() intermediate directories, and in some cases we can skip all filesystem access when expanding a literal part (e.g. when it's followed by a non-recursive wildcard segment).
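A rough illustration of the idea, not the code in this PR: `expand_literal()` and its `dir_only` / `defer_check` parameters are hypothetical names made up for this sketch. It resolves a literal part by joining it onto the parent and making at most one existence check, rather than scanning each directory along the way. It assumes the caller already knows what follows the literal part in the pattern.

```python
from pathlib import Path


def expand_literal(parent, literal, *, dir_only=False, defer_check=False,
                   follow_symlinks=True):
    """Hypothetical helper (not CPython's implementation): resolve a literal
    glob part such as 'foo/bar' without listing any directory."""
    parts = Path(literal).parts
    if not follow_symlinks:
        # Exclude matches that pass through a symlink in an intermediate segment.
        current = parent
        for part in parts[:-1]:
            current = current / part
            if current.is_symlink():
                return
    candidate = parent.joinpath(*parts)
    if defer_check:
        # Assumption: a non-recursive wildcard segment follows and will scan
        # this directory itself, so a missing path is weeded out there and no
        # filesystem access is needed here.
        yield candidate
    elif dir_only:
        # More segments follow (or the pattern ends with a slash).
        if candidate.is_dir():
            yield candidate
    elif candidate.exists():
        # Final segment: any existing path matches.
        yield candidate
```

For example, `list(expand_literal(Path(), 'Lib/pathlib/__init__.py'))` costs a single `exists()` call instead of three directory scans.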

barneygale commented Feb 29, 2024

Quite a lot faster:

$ ./python -m timeit -s "from pathlib import Path" "list(Path().glob('Lib/pathlib/__init__.py'))"
 2000 loops, best of 5: 195 usec per loop  # before
20000 loops, best of 5:  17 usec per loop  # after
$ ./python -m timeit -s "from pathlib import Path" "list(Path().glob('Lib/pathlib/*'))"
 2000 loops, best of 5: 197   usec per loop  # before
10000 loops, best of 5:  28.8 usec per loop  # after
$ ./python -m timeit -s "from pathlib import Path" "list(Path().glob('Lib/*/__init__.py'))"
 200 loops, best of 5:   1.24 msec per loop  # before
1000 loops, best of 5: 307    usec per loop  # after
$ ./python -m timeit -s "from pathlib import Path" "list(Path().glob('*/pathlib/__init__.py'))"
 200 loops, best of 5:   1.22 msec per loop  # before
1000 loops, best of 5: 261    usec per loop  # after

@barneygale barneygale marked this pull request as draft March 4, 2024 18:19
@barneygale barneygale closed this Apr 6, 2024
Labels: performance, topic-pathlib