GH-102613: Fast recursive globbing in pathlib.Path.glob()
#104512
Merged
This PR introduces a 'walk-and-match' strategy for handling glob patterns that include a non-terminal `**` wildcard, such as `**/*.py`. For this example, the previous implementation recursively walked directories using `os.scandir()` when it expanded the `**` component, and then scanned those same directories again when it expanded the `*.py` component. This is wasteful.

In the new implementation, any components following a `**` wildcard are used to build a `re.Pattern` object, which is used to filter the results of the recursive walk. A pattern like `**/*.py` uses half the number of `os.scandir()` calls; a pattern like `**/*/*.py`
a third, etc.

This new algorithm does not apply if either:

1. The follow_symlinks argument is set to `None` (its default), or
2. The pattern contains `..` components.

In these cases we fall back to the old implementation.
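The difference in `os.scandir()` calls can be illustrated with a small standalone sketch. This is not the code from this PR: `CountingScanner`, `scan_twice`, `walk_and_match` and the hand-written `PY_PATTERN` regex are hypothetical names invented for the illustration, and symlinks are simply never followed. It only shows the idea of replacing a second scan with a regex filter over a single walk, for the pattern `**/*.py`.

```python
import os
import re
import tempfile


class CountingScanner:
    """Wraps os.scandir() and counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def scandir(self, path):
        self.calls += 1
        with os.scandir(path) as entries:
            return list(entries)


def walk_dirs(scanner, root):
    """Expand '**': yield root and every directory beneath it."""
    yield root
    for entry in scanner.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            yield from walk_dirs(scanner, entry.path)


def scan_twice(scanner, root):
    """Old-style sketch: expand '**', then rescan each directory for '*.py'."""
    for directory in walk_dirs(scanner, root):
        for entry in scanner.scandir(directory):
            if entry.name.endswith(".py"):
                yield entry.path


def walk_entries(scanner, root):
    """Yield every path beneath root, scanning each directory once."""
    for entry in scanner.scandir(root):
        yield entry.path
        if entry.is_dir(follow_symlinks=False):
            yield from walk_entries(scanner, entry.path)


def walk_and_match(scanner, root, pattern):
    """New-style sketch: one walk, filtered by a compiled re.Pattern."""
    for path in walk_entries(scanner, root):
        if pattern.fullmatch(os.path.relpath(path, root)):
            yield path


# Hand-written regex standing in for the components after '**' in '**/*.py'.
SEP = re.escape(os.sep)
PY_PATTERN = re.compile(rf"(?:.*{SEP})?[^{SEP}]*\.py")


def demo():
    """Run both sketches on a tiny tree; return (same results?, old calls, new calls)."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "a", "b"))
        files = [("top.py",), ("a", "mid.py"), ("a", "b", "low.py"), ("a", "note.txt")]
        for parts in files:
            open(os.path.join(root, *parts), "w").close()
        old, new = CountingScanner(), CountingScanner()
        old_hits = sorted(scan_twice(old, root))
        new_hits = sorted(walk_and_match(new, root, PY_PATTERN))
        return old_hits == new_hits, old.calls, new.calls
```

On the three-directory tree built by `demo()`, both sketches find the same files, but the rescanning version calls `os.scandir()` twice per directory while the walk-and-match version calls it once.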
This PR also replaces selector classes with selector functions. These generators directly yield results rather than calling through to their successors. A new internal `Path._glob()` method takes care to chain these generators together, which simplifies the lazy algorithm and slightly improves performance. It should also be easier to understand and maintain.

Performance for the original #102613 repro case, with 400 nested `a/` directories, and matching treatment of symlinks and hidden files:

These results were from an SSD. The improvement will be greater for slow storage (e.g. network-mounted volumes).
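As a rough illustration of the selector-function design described earlier, the sketch below chains plain generator functions, one per pattern component, so that each consumes the paths yielded by the previous one. It is an assumption-laden stand-in rather than the code in `Path._glob()`: `select_literal`, `select_wildcard` and `chain_selectors` are hypothetical names, and matching uses `fnmatch.fnmatchcase()` for simplicity.

```python
import fnmatch
import os
import tempfile
from functools import partial


def select_literal(paths, name):
    """Yield path/name for each incoming path where that child exists."""
    for path in paths:
        candidate = os.path.join(path, name)
        if os.path.lexists(candidate):
            yield candidate


def select_wildcard(paths, pattern):
    """Yield children of each incoming directory whose name matches pattern."""
    for path in paths:
        try:
            with os.scandir(path) as entries:
                names = sorted(entry.name for entry in entries)
        except OSError:
            continue  # not a directory, or unreadable: nothing to yield
        for name in names:
            if fnmatch.fnmatchcase(name, pattern):
                yield os.path.join(path, name)


def chain_selectors(root, selectors):
    """Feed each selector the output of the previous one, lazily."""
    paths = iter([root])
    for selector in selectors:
        paths = selector(paths)
    return paths


def demo():
    """Resolve the equivalent of 'src/*.py' against a throwaway tree."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "src"))
        os.makedirs(os.path.join(root, "other"))
        for parts in [("src", "a.py"), ("src", "b.txt"), ("other", "c.py")]:
            open(os.path.join(root, *parts), "w").close()
        selectors = [
            partial(select_literal, name="src"),
            partial(select_wildcard, pattern="*.py"),
        ]
        hits = chain_selectors(root, selectors)
        return [os.path.relpath(hit, root) for hit in hits]
```

Because every stage is a generator, nothing is scanned until the caller starts iterating, and no stage needs to know which stage comes next; the chaining function is the only place where the order of components is decided.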