GH-117586: Speed up `pathlib.Path.glob()` by working with strings #117589

barneygale · 2024-04-06T19:45:45Z

Move pathlib globbing implementation into a new private class: glob._Globber. This class implements fast string-based globbing. It's called by pathlib.Path.glob(), which then converts strings back to path objects.

In the private pathlib ABCs, add a pathlib._abc.Globber subclass that works with PathBase objects rather than strings, and calls user-defined path methods like PathBase.stat() rather than os.stat().
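To illustrate the structure described above, here is a minimal sketch; it is not the code from this PR. The class names `StringGlobber` and `PathObjectGlobber`, the hook names `list_names` and `join`, and the helper `glob_paths` are all hypothetical, and only a single pattern segment is handled. The matching logic is written once against overridable hooks: the base class uses `os` functions on strings, and a subclass swaps in methods of path objects.

```python
import fnmatch
import os
from pathlib import Path


class StringGlobber:
    """Hypothetical, simplified stand-in for a string-based globber."""

    # Hooks that a subclass can override to work with other path types.
    @staticmethod
    def list_names(path):
        with os.scandir(path) as entries:
            return sorted(entry.name for entry in entries)

    @staticmethod
    def join(path, name):
        return os.path.join(path, name)

    def select(self, root, pattern):
        """Yield children of root whose names match one pattern segment."""
        for name in self.list_names(root):
            if fnmatch.fnmatchcase(name, pattern):
                yield self.join(root, name)


class PathObjectGlobber(StringGlobber):
    """Same algorithm, but driven by methods on path objects."""

    @staticmethod
    def list_names(path):
        return sorted(child.name for child in path.iterdir())

    @staticmethod
    def join(path, name):
        return path / name


def glob_paths(root, pattern):
    """Glob with strings internally, then convert results to Path objects."""
    return map(Path, StringGlobber().select(str(root), pattern))
```

Under these assumptions, `glob_paths(Path.cwd(), "*.py")` does all of its matching on strings and only builds `Path` objects for the final results, while `PathObjectGlobber().select(some_path, "*.py")` never touches `os` directly.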

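This sets the stage for two more improvements:

- GH-115060: Query non-wildcard segments with lstat()
- GH-116380: Unify pathlib and glob implementations of globbing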

Timings (for each command, the first result is before this change and the second is after):

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/*'))"
1000 loops, best of 5: 392 usec per loop
1000 loops, best of 5: 365 usec per loop
# --> 1.07x faster

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/*.py'))"
1000 loops, best of 5: 393 usec per loop
1000 loops, best of 5: 371 usec per loop
# --> 1.06x faster

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/**'))"
50 loops, best of 5: 9.46 msec per loop
50 loops, best of 5: 9.06 msec per loop
# --> 1.04x faster

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/**/'))"
50 loops, best of 5: 4.98 msec per loop
50 loops, best of 5: 5.15 msec per loop
# --> 1.03x slower (!)

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/**/*'))"
20 loops, best of 5: 14 msec per loop
20 loops, best of 5: 12.9 msec per loop
# --> 1.09x faster

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/**/*.py'))"
20 loops, best of 5: 12.2 msec per loop
20 loops, best of 5: 11.4 msec per loop
# --> 1.07x faster
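The speed-up figures can be re-derived from the raw numbers. The sketch below assumes that the first result of each pair is the baseline and the second is this PR; the `describe` helper is hypothetical and exists only for this illustration.

```python
# (before, after) timings copied from the results above; units cancel out
# within each pair, so usec and msec values can be used as-is.
timings = {
    "Lib/*": (392, 365),
    "Lib/*.py": (393, 371),
    "Lib/**": (9.46, 9.06),
    "Lib/**/": (4.98, 5.15),
    "Lib/**/*": (14, 12.9),
    "Lib/**/*.py": (12.2, 11.4),
}


def describe(before, after):
    """Express the change as 'N.NNx faster' or 'N.NNx slower'."""
    if after <= before:
        return f"{before / after:.2f}x faster"
    return f"{after / before:.2f}x slower"


for pattern, (before, after) in timings.items():
    print(f"{pattern}: {describe(before, after)}")
```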

Issue: Speed up pathlib.Path.glob() by working with strings internally #117586

Move pathlib globbing implementation to a new module and class: `pathlib._glob.Globber`. This class implements fast string-based globbing. It's called by `pathlib.Path.glob()`, which then converts strings back to path objects.

In the private pathlib ABCs, add a `pathlib._abc.Globber` subclass that works with `PathBase` objects rather than strings, and calls user-defined path methods like `PathBase.stat()` rather than `os.stat()`.

This sets the stage for two more improvements:

- pythonGH-115060: Query non-wildcard segments with `lstat()`
- pythonGH-116380: Move `pathlib._glob` to `glob` (unify implementations).

barneygale · 2024-04-06T20:12:10Z

This is the first PR in a series that will hopefully unify the globbing implementations in the pathlib and glob modules, and speed both up in the process.

barneygale · 2024-04-07T15:30:02Z

Hey @serhiy-storchaka, does this PR look alright to you? Not requesting a detailed review, more of a sanity check, given you've looked after the glob module for the last few years.

This PR doesn't affect glob.[i]glob(), but it does move pathlib's globbing implementation into glob.py.

Thank you.

barneygale · 2024-04-10T19:37:44Z

I'll merge this now as it's important for #115060, which I'm hoping to get done in time for 3.13 beta 1.

But I'll leave glob.glob() and glob.iglob() unchanged in 3.13; any PRs I make will target 3.14.

Move `pathlib.Path.walk()` implementation into `glob._Globber`. The new `glob._Globber.walk()` classmethod works with strings internally, which is a little faster than generating `Path` objects and keeping them normalized. The `pathlib.Path.walk()` method converts the strings back to path objects. In the private pathlib ABCs, our existing subclass of `_Globber` ensures that `PathBase` instances are used throughout. Follow-up to python#117589.
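As a rough illustration of the idea in this commit message, and not the actual `glob._Globber.walk()` code, the sketch below walks a tree using plain strings and converts only the directory path to a `Path` when yielding. The function name `walk_as_paths` is hypothetical, and `os.walk()` is used here as a stand-in for the string-based walker.

```python
import os
from pathlib import Path


def walk_as_paths(root):
    """Walk with strings internally; convert the directory path on output."""
    # os.walk() does all of its work on plain strings.
    for dirpath, dirnames, filenames in os.walk(str(root)):
        # Convert once per directory rather than building and normalizing
        # Path objects at every step of the traversal.
        yield Path(dirpath), dirnames, filenames
```

The yielded tuples have the same shape as those from `pathlib.Path.walk()`: a `Path` for the directory, followed by lists of subdirectory names and file names.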

…gs (python#117589)

Move pathlib globbing implementation into a new private class: `glob._Globber`. This class implements fast string-based globbing. It's called by `pathlib.Path.glob()`, which then converts strings back to path objects.

In the private pathlib ABCs, add a `pathlib._abc.Globber` subclass that works with `PathBase` objects rather than strings, and calls user-defined path methods like `PathBase.stat()` rather than `os.stat()`.

This sets the stage for two more improvements:

- pythonGH-115060: Query non-wildcard segments with `lstat()`
- pythonGH-116380: Unify `pathlib` and `glob` implementations of globbing.

No change to the implementations of `glob.glob()` and `glob.iglob()`.

Since python/cpython#117589 (at least), `Path.glob` and `Path.rglob` return an `Iterator` rather than a `Generator`.

As of python/cpython#117589 (at least), `Path.glob` returns an `Iterator` rather than `Generator` (which inherits from `Iterator`). `convert_file` doesn't need to care about this distinction; it can reasonably accept both.

This previously caused a test failure along these lines:

    ________ TestPypandoc.test_basic_conversion_from_file_pattern_pathlib_glob ________

    self = <tests.TestPypandoc testMethod=test_basic_conversion_from_file_pattern_pathlib_glob>

        def test_basic_conversion_from_file_pattern_pathlib_glob(self):
            received_from_str_filename_input = pypandoc.convert_file("./*.md", 'html').lower()
    >       received_from_path_filename_input = pypandoc.convert_file(Path(".").glob("*.md"), 'html').lower()

    tests.py:654:
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    source_file = <map object at 0x7f83952c9420>, to = 'html', format = None, extra_args = (), encoding = 'utf-8', outputfile = None, filters = None, verify_format = True
    sandbox = False, cworkdir = '/home/cjwatson/src/python/pypandoc', sort_files = True

    [...]
            if not _identify_path(discovered_source_files):
    >           raise RuntimeError("source_file is not a valid path")
    E           RuntimeError: source_file is not a valid path

    pypandoc/__init__.py:201: RuntimeError

Co-authored-by: Jessica Tegner <jessica@jessicategner.com>
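The distinction behind this failure can be shown in a few lines. This is a sketch under stated assumptions, not pypandoc's actual fix: `collect_paths` and `as_generator` are hypothetical names. A `map` object (the type shown for `source_file` in the failure above) is an `Iterator` but not a `Generator`, so code that checks for the broader type accepts both.

```python
from collections.abc import Generator, Iterator
from pathlib import Path


def collect_paths(source):
    """Hypothetical helper: accept a single path or any iterator of paths."""
    if isinstance(source, (str, Path)):
        return [Path(source)]
    # Checking for Iterator covers generators, map objects, and anything
    # else that implements __iter__ and __next__.
    if isinstance(source, Iterator):
        return [Path(item) for item in source]
    raise TypeError(f"unsupported source type: {type(source).__name__}")


def as_generator():
    yield "a.md"


mapped = map(Path, ["a.md"])
print(isinstance(as_generator(), Generator))  # True
print(isinstance(mapped, Generator))          # False
print(isinstance(mapped, Iterator))           # True
```

Callers that only need to loop over the result of `Path.glob()` are unaffected; the change matters for code that inspects the concrete type or relies on generator-only methods such as `send()` or `close()`.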

barneygale added performance Performance or resource usage topic-pathlib labels Apr 6, 2024

bedevere-app bot mentioned this pull request Apr 6, 2024

Speed up pathlib.Path.glob() by working with strings internally #117586

Closed

bedevere-app bot added the awaiting core review label Apr 6, 2024

Move class into glob module.

26d2c03

barneygale mentioned this pull request Apr 6, 2024

GH-116380: Speed up glob.glob() by removing some system calls #116392

Merged

barneygale added 4 commits April 6, 2024 22:22

Fix handling of missing root path.

d6314ac

More precise error handling

8696ca0

Ensure results are normalized.

824f1f6

Speed up results normalization

60eb3d0

barneygale added 2 commits April 7, 2024 16:43

Define add_slash() in _Globber itself.

98dea96

Slightly speed up path renormalisation.

ebcd7fc

barneygale merged commit 6258844 into python:main Apr 10, 2024
33 checks passed

bedevere-app bot removed the awaiting core review label Apr 10, 2024

barneygale mentioned this pull request Apr 10, 2024

GH-117586: Speed up pathlib.Path.walk() by working with strings #117726

Merged

cjwatson mentioned this pull request Dec 8, 2024

Fix convert_file for Python 3.13 JessicaTegner/pypandoc#384

Merged

cjwatson added a commit to cjwatson/typeshed that referenced this pull request Dec 8, 2024

Weaken return type of Path.{glob,rglob} in 3.13

9674ec2

Since python/cpython#117589 (at least), `Path.glob` and `Path.rglob` return an `Iterator` rather than a `Generator`.

cjwatson added a commit to cjwatson/typeshed that referenced this pull request Dec 8, 2024

Weaken return type of Path.{glob,rglob} in 3.13

795336d

Since python/cpython#117589 (at least), `Path.glob` and `Path.rglob` return an `Iterator` rather than a `Generator`.

cjwatson mentioned this pull request Dec 8, 2024

Weaken return type of Path.{glob,rglob} in 3.13 python/typeshed#13223

Merged

srittau pushed a commit to python/typeshed that referenced this pull request Feb 28, 2025

Weaken return type of Path.{glob,rglob} in 3.13 (#13223)

8ebf8af

Since python/cpython#117589 (at least), `Path.glob` and `Path.rglob` return an `Iterator` rather than a `Generator`.

tungol mentioned this pull request Mar 7, 2025

add new pathlib base classes for 3.13 python/typeshed#12937

Draft
