Commit da4899b
barneygale, eendebakpt, and picnixz authored
GH-116380: Speed up glob.[i]glob() by making fewer system calls. (#116392)
## Filtered recursive walk

Expanding a recursive `**` segment entails walking the entire directory tree, so any subsequent pattern segments (except special segments) can be evaluated by filtering the expanded paths through a regex. For example, `glob.glob("foo/**/*.py", recursive=True)` recursively walks `foo/` with `os.scandir()`, and then filters paths through a regex based on `**/*.py`, with no further filesystem access needed.

This fixes an issue where `glob()` could return duplicate results.

## Tracking path existence

We store a flag alongside each path indicating whether the path is guaranteed to exist. As we process the pattern:

- Certain special pattern segments (`""`, `"."` and `".."`) leave the flag unchanged
- Literal pattern segments (e.g. `foo/bar`) set the flag to false
- Wildcard pattern segments (e.g. `*/*.py`) set the flag to true (because children are found via `os.scandir()`)
- Recursive pattern segments (e.g. `**`) leave the flag unchanged for the root path, and set it to true for descendants discovered via `os.scandir()`.

If the flag is false at the end, we call `lstat()` on each path to filter out missing paths.

## Minor speed-ups

- Exclude paths that don't match a non-terminal non-recursive wildcard pattern _prior_ to calling `is_dir()`.
- Use a stack rather than recursion to implement recursive wildcards.
  - This fixes a recursion error when globbing deep trees.
- Pre-compile regular expressions and pre-join literal pattern segments.
- Convert to/from `bytes` (a minor use-case) in `iglob()` rather than supporting `bytes` throughout. This particularly simplifies the code needed to handle relative bytes paths with `dir_fd`.
- Avoid calling `os.path.join()`; instead we keep paths in a normalized form and append trailing slashes when needed.
- Avoid calling `os.path.normcase()`; instead we use case-insensitive regex matching.

## Implementation notes

Much of this functionality is already present in pathlib's implementation of globbing. The specific additions we make are:

1. Support for `dir_fd`
2. Support for `include_hidden`
3. Support for generating paths relative to `root_dir`

This unifies the implementations of globbing in the `glob` and `pathlib` modules.

Co-authored-by: Pieter Eendebak <pieter.eendebak@gmail.com>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
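
As a rough illustration of the filtered recursive walk and the stack-based traversal described in the commit message, the sketch below walks a tree once with `os.scandir()` and filters the collected paths through a regex. It is not the code from this commit: `filtered_walk`, `PY_FILES` and `demo` are hypothetical names, the regex is hand-written rather than derived from the pattern, and hidden files are not skipped.

```python
import os
import re
import tempfile


def filtered_walk(root, regex):
    """Walk `root` once with os.scandir(), keeping relative paths matching `regex`."""
    pattern = re.compile(regex)
    matches = []
    stack = [root]  # explicit stack instead of recursion
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                # Compare paths in a "/"-separated form relative to root.
                relative = os.path.relpath(entry.path, root).replace(os.sep, "/")
                if pattern.fullmatch(relative):
                    matches.append(relative)
    return sorted(matches)


# Hand-written stand-in for a regex based on "**/*.py" (an assumption for
# illustration; unlike glob, it does not exclude hidden files).
PY_FILES = r"(?:.+/)?[^/]*\.py"


def demo():
    """Build a small temporary tree and return the filtered walk result."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "sub", "deep"))
        for name in ("a.py", "sub/b.py", "sub/deep/c.txt", "sub/deep/d.py"):
            with open(os.path.join(root, *name.split("/")), "w"):
                pass
        return filtered_walk(root, PY_FILES)
```

Once the walk has collected the candidate paths, matching needs no further filesystem access, which is the property the commit message relies on.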
1 parent b545450 commit da4899b
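
The path-existence flag from the commit message can be sketched as a small state update per pattern segment. This is a simplified model under stated assumptions, not the implementation in this commit: `next_exists_flag`, `filter_missing` and the `from_scandir` parameter are hypothetical, and `os.path.lexists()` stands in for the final `lstat()` check.

```python
import os
import re

_SPECIAL = {"", ".", ".."}
_MAGIC = re.compile(r"[*?\[]")  # wildcard characters, as in glob patterns


def next_exists_flag(exists, segment, from_scandir=False):
    """Return the updated 'guaranteed to exist' flag after one pattern segment."""
    if segment in _SPECIAL:
        return exists  # special segments leave the flag unchanged
    if segment == "**":
        # Unchanged for the root path, true for descendants found by scanning.
        return True if from_scandir else exists
    if _MAGIC.search(segment):
        return True  # wildcard children come from a directory scan
    return False  # literal segments are appended without checking the disk


def filter_missing(candidates):
    """Keep paths whose flag is true, or that an lstat-style check confirms."""
    return [path for path, exists in candidates if exists or os.path.lexists(path)]
```

Only paths whose flag is still false at the end cost a system call, which is where the saving comes from.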

7 files changed, +240 -229 lines changed
Doc/library/glob.rst

+10 -8

@@ -75,10 +75,6 @@ The :mod:`glob` module defines the following functions:
       Using the "``**``" pattern in large directory trees may consume
       an inordinate amount of time.
 
-   .. note::
-      This function may return duplicate path names if *pathname*
-      contains multiple "``**``" patterns and *recursive* is true.
-
    .. versionchanged:: 3.5
       Support for recursive globs using "``**``".
 
@@ -88,6 +84,11 @@ The :mod:`glob` module defines the following functions:
    .. versionchanged:: 3.11
       Added the *include_hidden* parameter.
 
+   .. versionchanged:: 3.14
+      Matching path names are returned only once. In previous versions, this
+      function may return duplicate path names if *pathname* contains multiple
+      "``**``" patterns and *recursive* is true.
+
 
 .. function:: iglob(pathname, *, root_dir=None, dir_fd=None, recursive=False, \
                     include_hidden=False)
@@ -98,10 +99,6 @@ The :mod:`glob` module defines the following functions:
    .. audit-event:: glob.glob pathname,recursive glob.iglob
    .. audit-event:: glob.glob/2 pathname,recursive,root_dir,dir_fd glob.iglob
 
-   .. note::
-      This function may return duplicate path names if *pathname*
-      contains multiple "``**``" patterns and *recursive* is true.
-
    .. versionchanged:: 3.5
       Support for recursive globs using "``**``".
 
@@ -111,6 +108,11 @@ The :mod:`glob` module defines the following functions:
    .. versionchanged:: 3.11
       Added the *include_hidden* parameter.
 
+   .. versionchanged:: 3.14
+      Matching path names are yielded only once. In previous versions, this
+      function may yield duplicate path names if *pathname* contains multiple
+      "``**``" patterns and *recursive* is true.
+
 
 .. function:: escape(pathname)
 

Doc/whatsnew/3.14.rst

+8

@@ -968,6 +968,14 @@ base64
   (Contributed by Bénédikt Tran, Chris Markiewicz, and Adam Turner in :gh:`118761`.)
 
 
+glob
+----
+
+* Reduce the number of system calls in :func:`glob.glob` and :func:`~glob.iglob`,
+  thereby improving the speed of globbing operations by 20-80%.
+  (Contributed by Barney Gale in :gh:`116380`.)
+
+
 io
 ---
 * :mod:`io` which provides the built-in :func:`open` makes less system calls
