Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-102613: Fast recursive globbing in pathlib.Path.glob() #104512

Merged
merged 17 commits into from
Jun 6, 2023

Conversation

barneygale
Copy link
Contributor

@barneygale barneygale commented May 15, 2023

This PR introduces a 'walk-and-match' strategy for handling glob patterns that include a non-terminal ** wildcard, such as **/*.py. For this example, the previous implementation recursively walked directories using os.scandir() when it expanded the ** component, and then scanned those same directories again when expanded the *.py component. This is wasteful.

In the new implementation, any components following a ** wildcard are used to build a re.Pattern object, which is used to filter the results of the recursive walk. A pattern like **/*.py uses half the number of os.scandir() calls; a pattern like **/*/*.py a third, etc.

This new algorithm does not apply if either:

  1. The follow_symlinks argument is set to None (its default), or
  2. The pattern contains .. components.

In these cases we fall back to the old implementation.

This PR also replaces selector classes with selector functions. These generators directly yield results rather calling through to their successors. A new internal Path._glob() method takes care to chain these generators together, which simplifies the lazy algorithm and slightly improves performance. It should also be easier to understand and maintain.

Performance for the original #102613 repro case, with 400 nested a/ directories, and matching treatment of symlinks and hidden files:

$ ../python -m timeit -s 'import glob' 'print(glob.glob("**/*", recursive=True, include_hidden=True))'
5 loops, best of 5: 66.2 msec per loop
$ ../python -m timeit -s 'from pathlib import Path' 'print(list(Path(".").rglob("**/*", follow_symlinks=True)))'
10 loops, best of 5: 22.7 msec per loop  # before this PR
10 loops, best of 5: 16.5 msec per loop  # after this PR

These results were from an SSD. The improvement will be greater for slow storage (e.g. network-mounted volumes).

This commit replaces selector classes with selector functions. These
generators directly yield results rather calling through to their
successor. A new internal `Path._glob()` takes care to chain these
generators together, which simplifies the lazy algorithm and slightly
improves performance.
@barneygale barneygale marked this pull request as draft May 30, 2023 17:39
@barneygale barneygale changed the title GH-102613: Simplify implementation of pathlib.Path.glob() GH-102613: Fast recursive globbing in pathlib.Path.glob() May 31, 2023
@barneygale barneygale marked this pull request as ready for review May 31, 2023 20:36
@barneygale
Copy link
Contributor Author

@zooba here's the promised walk-and-match implementation!

Copy link
Member

@zooba zooba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine, there's potentially a few simplifications, but it looks like a great improvement over the existing code.

Do we need any new tests to specifically trigger anything that behaves differently?

Lib/pathlib.py Outdated Show resolved Hide resolved
Lib/pathlib.py Show resolved Hide resolved
Lib/pathlib.py Show resolved Hide resolved
@barneygale
Copy link
Contributor Author

barneygale commented Jun 1, 2023

Thanks for the review! I've added a few more tests exercising .. and ** segments.

`..` components are resolved lexically, rather than after symlinks.
@barneygale barneygale requested a review from zooba June 6, 2023 18:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage topic-pathlib
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants