GH-117586: Speed up pathlib.Path.glob() by working with strings #117589

Merged on Apr 10, 2024 (8 commits)

@barneygale barneygale commented Apr 6, 2024

Move pathlib globbing implementation into a new private class: glob._Globber. This class implements fast string-based globbing. It's called by pathlib.Path.glob(), which then converts strings back to path objects.
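The approach described above can be illustrated with a minimal sketch. This is not the code in `glob._Globber`; the helpers `glob_strings` and `glob_paths` are hypothetical names, and the sketch handles only a single pattern segment (no `**`, no multi-segment patterns). It only shows the shape of the idea: match and join with plain strings, then convert to `Path` objects once at the end.

```python
import fnmatch
import os
from pathlib import Path


def glob_strings(root, pattern):
    """Yield path strings directly under `root` whose names match `pattern`.

    Hypothetical helper for illustration; it works on strings only.
    """
    with os.scandir(root) as entries:
        for entry in entries:
            if fnmatch.fnmatchcase(entry.name, pattern):
                # Plain string join; no Path objects are created here.
                yield os.path.join(root, entry.name)


def glob_paths(root, pattern):
    """Run the string-based globber, then convert each result to a Path."""
    return map(Path, glob_strings(str(root), pattern))
```

In this sketch the cost of constructing path objects is paid once per result rather than at every intermediate step, which is the kind of saving the description attributes to working with strings.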

In the private pathlib ABCs, add a pathlib._abc.Globber subclass that works with PathBase objects rather than strings, and calls user-defined path methods like PathBase.stat() rather than os.stat().
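One way to picture the subclass arrangement is a base class whose filesystem hooks can be overridden. The sketch below is an assumption about the general pattern, not the actual `pathlib._abc` code; the class names `StringGlobber` and `ObjectGlobber` and the `exists` helper are hypothetical.

```python
import os


class StringGlobber:
    """Hypothetical base class: works on strings and calls os functions."""

    @staticmethod
    def stat(path):
        return os.stat(path)

    @classmethod
    def exists(cls, path):
        # Shared logic goes through the overridable hook.
        try:
            cls.stat(path)
        except OSError:
            return False
        return True


class ObjectGlobber(StringGlobber):
    """Hypothetical subclass: works on path objects with their own stat()."""

    @staticmethod
    def stat(path):
        # Delegate to the user-defined method instead of os.stat().
        return path.stat()
```

With this layout the shared logic in `exists` is written once, and only the hook decides whether a string or a path object is being queried.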

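This sets the stage for two more improvements:

- GH-115060: Query non-wildcard segments with lstat()
- GH-116380: Unify pathlib and glob implementations of globbing.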

Timings (in each pair, the first result is from before this change and the second is with it):

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/*'))"
1000 loops, best of 5: 392 usec per loop
1000 loops, best of 5: 365 usec per loop
# --> 1.07x faster

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/*.py'))"
1000 loops, best of 5: 393 usec per loop
1000 loops, best of 5: 371 usec per loop
# --> 1.06x faster

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/**'))"
50 loops, best of 5: 9.46 msec per loop
50 loops, best of 5: 9.06 msec per loop
# --> 1.04x faster

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/**/'))"
50 loops, best of 5: 4.98 msec per loop
50 loops, best of 5: 5.15 msec per loop
# --> 1.03x slower (!)

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/**/*'))"
20 loops, best of 5: 14 msec per loop
20 loops, best of 5: 12.9 msec per loop
# --> 1.09x faster

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/**/*.py'))"
20 loops, best of 5: 12.2 msec per loop
20 loops, best of 5: 11.4 msec per loop
# --> 1.07x faster
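The ratio annotations in the timings can be reproduced with a short calculation. This sketch assumes that the first result of each pair is the baseline and the second is with this change, which is inferred from the annotations rather than stated; the `describe` helper is a hypothetical name.

```python
# (before, after) timings copied from the benchmark output, in seconds.
# Assumption: the first result of each pair is the baseline.
TIMINGS = {
    "Lib/*": (392e-6, 365e-6),
    "Lib/*.py": (393e-6, 371e-6),
    "Lib/**": (9.46e-3, 9.06e-3),
    "Lib/**/": (4.98e-3, 5.15e-3),
    "Lib/**/*": (14e-3, 12.9e-3),
    "Lib/**/*.py": (12.2e-3, 11.4e-3),
}


def describe(before, after):
    """Return a ratio label such as '1.07x faster' or '1.03x slower'."""
    if after <= before:
        return f"{before / after:.2f}x faster"
    return f"{after / before:.2f}x slower"


for pattern, (before, after) in TIMINGS.items():
    print(f"{pattern:12} {describe(before, after)}")
```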

Move pathlib globbing implementation to a new module and class:
`pathlib._glob.Globber`. This class implements fast string-based globbing.
It's called by `pathlib.Path.glob()`, which then converts strings back to
path objects.

In the private pathlib ABCs, add a `pathlib._abc.Globber` subclass that
works with `PathBase` objects rather than strings, and calls user-defined
path methods like `PathBase.stat()` rather than `os.stat()`.

This sets the stage for two more improvements:

- pythonGH-115060: Query non-wildcard segments with `lstat()`
- pythonGH-116380: Move `pathlib._glob` to `glob` (unify implementations).
@barneygale

This is the first PR in a series that will hopefully unify the globbing implementations in the pathlib and glob modules, and speed both up in the process.

@barneygale

barneygale commented Apr 7, 2024

Hey @serhiy-storchaka, does this PR look alright to you? Not requesting a detailed review, more of a sanity check, given you've looked after the glob module for the last few years.

This PR doesn't affect glob.[i]glob(), but it does move pathlib's globbing implementation into glob.py.

Thank you.

@barneygale

I'll merge this now as it's important for #115060, which I'm hoping to get done in time for 3.13 beta 1.

But I'll leave glob.glob() and glob.iglob() unchanged in 3.13; any PRs I make will target 3.14.

@barneygale barneygale merged commit 6258844 into python:main Apr 10, 2024
33 checks passed
barneygale added a commit to barneygale/cpython that referenced this pull request Apr 10, 2024
Move `pathlib.Path.walk()` implementation into `glob._Globber`. The new
`glob._Globber.walk()` classmethod works with strings internally, which is
a little faster than generating `Path` objects and keeping them normalized.
The `pathlib.Path.walk()` method converts the strings back to path objects.

In the private pathlib ABCs, our existing subclass of `_Globber` ensures
that `PathBase` instances are used throughout.

Follow-up to python#117589.
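A string-based walk with a thin conversion layer might look like the sketch below. It is not the `glob._Globber.walk()` implementation; `walk_strings` and `walk_paths` are hypothetical names, and the sketch covers only a top-down walk that does not follow symlinks and skips unreadable directories.

```python
import os
from pathlib import Path


def walk_strings(top):
    """Yield (dirpath, dirnames, filenames) top-down, using plain strings."""
    stack = [top]
    while stack:
        dirpath = stack.pop()
        dirnames, filenames = [], []
        try:
            with os.scandir(dirpath) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        dirnames.append(entry.name)
                    else:
                        filenames.append(entry.name)
        except OSError:
            # Skip directories that cannot be read.
            continue
        yield dirpath, dirnames, filenames
        # Children are joined as strings; reversed() keeps scan order on pop.
        stack.extend(os.path.join(dirpath, name) for name in reversed(dirnames))


def walk_paths(top):
    """Convert each directory string back to a Path, one object per directory."""
    for dirpath, dirnames, filenames in walk_strings(str(top)):
        yield Path(dirpath), dirnames, filenames
```

Only one `Path` is built per directory visited here; all intermediate joins stay in string form.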
barneygale added a commit that referenced this pull request Apr 11, 2024

…17726)

diegorusso pushed a commit to diegorusso/cpython that referenced this pull request Apr 17, 2024
…gs (python#117589)

Move pathlib globbing implementation into a new private class: `glob._Globber`. This class implements fast string-based globbing. It's called by `pathlib.Path.glob()`, which then converts strings back to path objects.

In the private pathlib ABCs, add a `pathlib._abc.Globber` subclass that works with `PathBase` objects rather than strings, and calls user-defined path methods like `PathBase.stat()` rather than `os.stat()`.

This sets the stage for two more improvements:

- pythonGH-115060: Query non-wildcard segments with `lstat()`
- pythonGH-116380: Unify `pathlib` and `glob` implementations of globbing.

No change to the implementations of `glob.glob()` and `glob.iglob()`.
diegorusso pushed a commit to diegorusso/cpython that referenced this pull request Apr 17, 2024
…gs (python#117726)

cjwatson added a commit to cjwatson/pypandoc that referenced this pull request Dec 8, 2024
As of python/cpython#117589 (at least),
`Path.glob` returns an `Iterator` rather than `Generator` (which
inherits from `Iterator`).  `convert_file` doesn't need to care about
this distinction; it can reasonably accept both.

This previously caused a test failure along these lines:

  ______________________________________________________ TestPypandoc.test_basic_conversion_from_file_pattern_pathlib_glob _______________________________________________________

  self = <tests.TestPypandoc testMethod=test_basic_conversion_from_file_pattern_pathlib_glob>

      def test_basic_conversion_from_file_pattern_pathlib_glob(self):
          received_from_str_filename_input = pypandoc.convert_file("./*.md", 'html').lower()
  >       received_from_path_filename_input = pypandoc.convert_file(Path(".").glob("*.md"), 'html').lower()

  tests.py:654:
  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

  source_file = <map object at 0x7f83952c9420>, to = 'html', format = None, extra_args = (), encoding = 'utf-8', outputfile = None, filters = None, verify_format = True
  sandbox = False, cworkdir = '/home/cjwatson/src/python/pypandoc', sort_files = True
  [...]
          if not _identify_path(discovered_source_files):
  >           raise RuntimeError("source_file is not a valid path")
  E           RuntimeError: source_file is not a valid path

  pypandoc/__init__.py:201: RuntimeError
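The `Iterator` versus `Generator` distinction in the commit message above can be sketched in isolation. The actual check inside pypandoc is not shown in this thread, so the two functions below are hypothetical; they only contrast a generator-only test with one that accepts any iterator, such as the `map` object visible in the traceback.

```python
import types
from collections.abc import Iterator


def make_generator():
    """A generator function, standing in for the older return type."""
    yield "a.md"


def is_generator_only(obj):
    """Too strict: accepts generator objects and nothing else."""
    return isinstance(obj, types.GeneratorType)


def is_iterator(obj):
    """Looser: accepts any iterator, including generators and map objects."""
    return isinstance(obj, Iterator)
```

A caller that only needs to loop over results can rely on the looser check, since every generator is also an iterator.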
cjwatson added a commit to cjwatson/typeshed that referenced this pull request Dec 8, 2024
Since python/cpython#117589 (at least),
`Path.glob` and `Path.rglob` return an `Iterator` rather than a
`Generator`.
JessicaTegner added a commit to JessicaTegner/pypandoc that referenced this pull request Jan 8, 2025


Co-authored-by: Jessica Tegner <jessica@jessicategner.com>
srittau pushed a commit to python/typeshed that referenced this pull request Feb 28, 2025

Since python/cpython#117589 (at least),
`Path.glob` and `Path.rglob` return an `Iterator` rather than a
`Generator`.
Labels: performance, topic-pathlib