Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up glob.glob() by reducing number of system calls made #116380

Open
barneygale opened this issue Mar 5, 2024 · 1 comment
Open

Speed up glob.glob() by reducing number of system calls made #116380

barneygale opened this issue Mar 5, 2024 · 1 comment
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir

Comments

@barneygale barneygale added the performance Performance or resource usage label Mar 5, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Mar 5, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Mar 6, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Mar 17, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Mar 18, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Mar 22, 2024
@barneygale barneygale changed the title Speed up glob.glob() Speed up glob.glob() by reducing number of system calls made Mar 22, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Mar 27, 2024
These undocumented functions expand a single literal or (non-recursive)
wildcard segment.
barneygale added a commit to barneygale/cpython that referenced this issue Apr 1, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Apr 5, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Apr 6, 2024
Move pathlib globbing implementation to a new module and class:
`pathlib._glob.Globber`. This class implements fast string-based globbing.
It's called by `pathlib.Path.glob()`, which then converts strings back to
path objects.

In the private pathlib ABCs, add a `pathlib._abc.Globber` subclass that
works with `PathBase` objects rather than strings, and calls user-defined
path methods like `PathBase.stat()` rather than `os.stat()`.

This sets the stage for two more improvements:

- pythonGH-115060: Query non-wildcard segments with `lstat()`
- pythonGH-116380: Move `pathlib._glob` to `glob` (unify implementations).
barneygale added a commit that referenced this issue Apr 10, 2024
…17589)

Move pathlib globbing implementation into a new private class: `glob._Globber`. This class implements fast string-based globbing. It's called by `pathlib.Path.glob()`, which then converts strings back to path objects.

In the private pathlib ABCs, add a `pathlib._abc.Globber` subclass that works with `PathBase` objects rather than strings, and calls user-defined path methods like `PathBase.stat()` rather than `os.stat()`.

This sets the stage for two more improvements:

- GH-115060: Query non-wildcard segments with `lstat()`
- GH-116380: Unify `pathlib` and `glob` implementations of globbing.

No change to the implementations of `glob.glob()` and `glob.iglob()`.
barneygale added a commit to barneygale/cpython that referenced this issue Apr 10, 2024
Add support for globbing relative to a file descriptor in our internal
`glob._Globber` class. This is necessary for unifying the `glob` and
`pathlib` globbing implementations.

The new functionality isn't exposed in `pathlib.Path.glob()`. It easily
could be.
barneygale added a commit to barneygale/cpython that referenced this issue Apr 10, 2024
Add support for excluding hidden files (with names beginning `.`) in our
internal `glob._Globber` class. This is necessary for unifying the `glob`
and `pathlib` globbing implementations.
barneygale added a commit to barneygale/cpython that referenced this issue Apr 10, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Apr 12, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Apr 12, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Apr 12, 2024
diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024
…gs (python#117589)

Move pathlib globbing implementation into a new private class: `glob._Globber`. This class implements fast string-based globbing. It's called by `pathlib.Path.glob()`, which then converts strings back to path objects.

In the private pathlib ABCs, add a `pathlib._abc.Globber` subclass that works with `PathBase` objects rather than strings, and calls user-defined path methods like `PathBase.stat()` rather than `os.stat()`.

This sets the stage for two more improvements:

- pythonGH-115060: Query non-wildcard segments with `lstat()`
- pythonGH-116380: Unify `pathlib` and `glob` implementations of globbing.

No change to the implementations of `glob.glob()` and `glob.iglob()`.
barneygale added a commit to barneygale/cpython that referenced this issue May 3, 2024
…glob`

Moving this code under the `pathlib` package makes it quite a lot easier
to backport in the `pathlib-abc` PyPI package. It was a bit foolish of me
to add it to `glob` in the first place.

Also add `translate()` to `__all__` in `glob`. This function is new in
3.13, so there's no NEWS needed.
barneygale added a commit that referenced this issue May 3, 2024
…118562)

Moving this code under the `pathlib` package makes it quite a lot easier
to backport in the `pathlib-abc` PyPI package. It was a bit foolish of me
to add it to `glob` in the first place.

Also add `translate()` to `__all__` in `glob`. This function is new in
3.13, so there's no NEWS needed.
barneygale added a commit to barneygale/cpython that referenced this issue Jun 7, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Jun 7, 2024
noahbkim pushed a commit to hudson-trading/cpython that referenced this issue Jul 11, 2024
…abc`. (python#120011)

In `glob._Globber`, move pathlib-specific methods to `pathlib._abc.PathGlobber` and replace them with abstract methods. Rename `glob._Globber` to `glob._GlobberBase`. As a result, the `glob` module is no longer befouled by code that can only ever apply to pathlib.

No change of behaviour.
estyxx pushed a commit to estyxx/cpython that referenced this issue Jul 17, 2024
…abc`. (python#120011)

In `glob._Globber`, move pathlib-specific methods to `pathlib._abc.PathGlobber` and replace them with abstract methods. Rename `glob._Globber` to `glob._GlobberBase`. As a result, the `glob` module is no longer befouled by code that can only ever apply to pathlib.

No change of behaviour.
barneygale added a commit to barneygale/cpython that referenced this issue Aug 26, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Oct 27, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Nov 1, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Nov 8, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Nov 27, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Jan 13, 2025
barneygale added a commit to barneygale/cpython that referenced this issue Jan 16, 2025
barneygale added a commit to barneygale/cpython that referenced this issue Feb 4, 2025
barneygale added a commit to barneygale/cpython that referenced this issue Feb 8, 2025
barneygale added a commit to barneygale/cpython that referenced this issue Feb 8, 2025
barneygale added a commit to barneygale/cpython that referenced this issue Feb 8, 2025
barneygale added a commit to barneygale/cpython that referenced this issue Feb 14, 2025
barneygale added a commit to barneygale/cpython that referenced this issue Feb 18, 2025
barneygale added a commit to barneygale/cpython that referenced this issue Feb 21, 2025
barneygale added a commit to barneygale/cpython that referenced this issue Feb 28, 2025
barneygale added a commit that referenced this issue Feb 28, 2025
…116392)

## Filtered recursive walk

Expanding a recursive `**` segment entails walking the entire directory
tree, and so any subsequent pattern segments (except special segments) can
be evaluated by filtering the expanded paths through a regex. For example,
`glob.glob("foo/**/*.py", recursive=True)` recursively walks `foo/` with
`os.scandir()`, and then filters paths through a regex based on "`**/*.py`,
with no further filesystem access needed.

This fixes an issue where `glob()` could return duplicate results.

## Tracking path existence

We store a flag alongside each path indicating whether the path is
guaranteed to exist. As we process the pattern:

- Certain special pattern segments (`""`, `"."` and `".."`) leave the flag
  unchanged
- Literal pattern segments (e.g. `foo/bar`) set the flag to false
- Wildcard pattern segments (e.g. `*/*.py`) set the flag to true (because
  children are found via `os.scandir()`)
- Recursive pattern segments (e.g. `**`) leave the flag unchanged for the
  root path, and set it to true for descendants discovered via
  `os.scandir()`.

If the flag is false at the end, we call `lstat()` on each path to filter
out missing paths.

## Minor speed-ups

- Exclude paths that don't match a non-terminal non-recursive wildcard
  pattern _prior_ to calling `is_dir()`.
- Use a stack rather than recursion to implement recursive wildcards.
  - This fixes a recursion error when globbing deep trees.
- Pre-compile regular expressions and pre-join literal pattern segments.
- Convert to/from `bytes` (a minor use-case) in `iglob()` rather than
  supporting `bytes` throughout. This particularly simplifies the code
  needed to handle relative bytes paths with `dir_fd`.
- Avoid calling `os.path.join()`; instead we keep paths in a normalized
  form and append trailing slashes when needed.
- Avoid calling `os.path.normcase()`; instead we use case-insensitive regex
  matching.

## Implementation notes

Much of this functionality is already present in pathlib's implementation
of globbing. The specific additions we make are:

1. Support for `dir_fd`
2. Support for `include_hidden`
3. Support for generating paths relative to `root_dir`

This unifies the implementations of globbing in the `glob` and `pathlib`
modules.

Co-authored-by: Pieter Eendebak <pieter.eendebak@gmail.com>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
@cmaloney
Copy link
Contributor

Given #116392 reduced system calls by changing design, would it be possible to add a test so that it doesn't accidentally revert? Tests are editable if a better / faster set of system calls is found, helps ensure changes are intentional. I improved the infrastructure to make this sort of testing possible a bit ago, happy to help implement:

system call checking infrastructure: https://github.com/python/cpython/blob/main/Lib/test/support/strace_helper.py
sample test: https://github.com/python/cpython/blob/main/Lib/test/test_fileio.py#L365-L498

@barneygale barneygale reopened this Feb 28, 2025
barneygale added a commit to barneygale/cpython that referenced this issue Feb 28, 2025
Remove `test.test_glob.GlobTests.test_iglob_iter_close()`, which was added
in da4899b. This fails on "Fedora Stable Clang Installed 3.x" build bots,
for reasons that aren't clear to me.
@picnixz picnixz added the stdlib Python modules in the Lib dir label Mar 1, 2025
barneygale added a commit to barneygale/cpython that referenced this issue Mar 1, 2025
…stem calls. (python#116392)"

This broke tests on the 'aarch64 Fedora Stable Clang Installed 3.x' and
'AMD64 Fedora Stable Clang Installed 3.x' build bots.

This reverts commit da4899b.
barneygale added a commit that referenced this issue Mar 1, 2025
…alls. (#116392)" (#130743)

This broke tests on the 'aarch64 Fedora Stable Clang Installed 3.x' and
'AMD64 Fedora Stable Clang Installed 3.x' build bots.

This reverts commit da4899b.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir
Projects
Development

No branches or pull requests

3 participants