GH-117586: Speed up pathlib.Path.glob() by working with strings #117589
Move pathlib globbing implementation to a new module and class: `pathlib._glob.Globber`. This class implements fast string-based globbing. It's called by `pathlib.Path.glob()`, which then converts strings back to path objects.

In the private pathlib ABCs, add a `pathlib._abc.Globber` subclass that works with `PathBase` objects rather than strings, and calls user-defined path methods like `PathBase.stat()` rather than `os.stat()`.

This sets the stage for two more improvements:

- pythonGH-115060: Query non-wildcard segments with `lstat()`
- pythonGH-116380: Move `pathlib._glob` to `glob` (unify implementations).
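To illustrate the "work with strings, convert at the end" idea described above, here is a minimal sketch. It is not the code from this PR: `glob_strings` and `glob_paths` are hypothetical names, and the sketch only handles a single non-recursive pattern segment using `os.scandir()` and `fnmatch.fnmatchcase()`.

```python
import fnmatch
import os
from pathlib import Path


def glob_strings(root, pattern):
    """Yield string paths directly under `root` whose names match `pattern`.

    Hypothetical helper: one non-recursive segment only, no `**` support.
    """
    with os.scandir(root) as entries:
        names = sorted(entry.name for entry in entries)
    for name in names:
        if fnmatch.fnmatchcase(name, pattern):
            # All intermediate work stays on plain strings.
            yield os.path.join(root, name)


def glob_paths(root, pattern):
    """Convert the string results to Path objects once, at the very end."""
    return map(Path, glob_strings(os.fspath(root), pattern))
```

In this sketch `glob_paths()` returns a `map` object, which is an iterator but not a generator, so callers should not rely on generator-only methods such as `send()`.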
This is the first PR in a series that will hopefully unify the globbing implementations in the
Hey @serhiy-storchaka, does this PR look alright to you? Not requesting a detailed review, more of a sanity check, given you've looked after the This PR doesn't affect Thank you.
I'll merge this now as it's important for #115060, which I'm hoping to get done in time for 3.13 beta 1. But I'll leave
Move `pathlib.Path.walk()` implementation into `glob._Globber`. The new `glob._Globber.walk()` classmethod works with strings internally, which is a little faster than generating `Path` objects and keeping them normalized. The `pathlib.Path.walk()` method converts the strings back to path objects. In the private pathlib ABCs, our existing subclass of `_Globber` ensures that `PathBase` instances are used throughout. Follow-up to python#117589.
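A rough sketch of the same pattern applied to walking: traverse with strings, then wrap only the directory path in a path object. This assumes `os.walk()` as a stand-in for the string-based walker; `walk_paths` is a hypothetical name and not the `glob._Globber.walk()` implementation.

```python
import os
from pathlib import Path


def walk_paths(top):
    """Yield (dirpath, dirnames, filenames) with dirpath as a Path.

    The traversal itself runs on strings via os.walk(); only the yielded
    directory path is converted. dirnames and filenames stay lists of str.
    """
    for dirpath, dirnames, filenames in os.walk(os.fspath(top)):
        # dirnames is passed through unchanged, so callers can still prune
        # it in place to skip subdirectories (top-down walking).
        yield Path(dirpath), dirnames, filenames
```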
…gs (python#117589)

Move pathlib globbing implementation into a new private class: `glob._Globber`. This class implements fast string-based globbing. It's called by `pathlib.Path.glob()`, which then converts strings back to path objects.

In the private pathlib ABCs, add a `pathlib._abc.Globber` subclass that works with `PathBase` objects rather than strings, and calls user-defined path methods like `PathBase.stat()` rather than `os.stat()`.

This sets the stage for two more improvements:

- pythonGH-115060: Query non-wildcard segments with `lstat()`
- pythonGH-116380: Unify `pathlib` and `glob` implementations of globbing.

No change to the implementations of `glob.glob()` and `glob.iglob()`.
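The split between a string-based class and a subclass that calls user-defined path methods can be sketched roughly as follows. All class names here (`StringFilter`, `ObjectFilter`, `MemoryPath`) are hypothetical and exist only for illustration; the real classes do considerably more.

```python
import os


class StringFilter:
    """Hypothetical base class: paths are plain strings, queried via os.stat()."""

    stat = staticmethod(os.stat)

    @classmethod
    def existing(cls, paths):
        """Yield only the paths for which cls.stat() succeeds."""
        for path in paths:
            try:
                cls.stat(path)
            except OSError:
                continue
            yield path


class ObjectFilter(StringFilter):
    """Hypothetical subclass: paths are objects with their own stat() method."""

    @staticmethod
    def stat(path):
        # Delegate to the user-defined method instead of os.stat().
        return path.stat()


class MemoryPath:
    """Toy user-defined path backed by a set of known names, not the filesystem."""

    known = {"a.txt", "b.txt"}

    def __init__(self, name):
        self.name = name

    def stat(self):
        if self.name not in self.known:
            raise FileNotFoundError(self.name)
        return None  # a real implementation would return stat-like data
```

The shared `existing()` logic never needs to know which kind of path it is handling; only the `stat` hook differs between the two classes.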
Since python/cpython#117589 (at least), `Path.glob` and `Path.rglob` return an `Iterator` rather than a `Generator`.
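For downstream code affected by this, one way to stay compatible is to check against `collections.abc.Iterator` rather than `Generator`. The following is a sketch under that assumption; `collect_sources` is a hypothetical helper, not code from any of the projects referenced here.

```python
from collections.abc import Iterator
from pathlib import Path


def collect_sources(source):
    """Normalize a path, a sequence of paths, or any iterator of paths to a list.

    Checking for Iterator covers generators as well as other lazy iterators
    such as map objects, so the caller does not need to care which one a
    given Python version returns.
    """
    if isinstance(source, (str, Path)):
        return [Path(source)]
    if isinstance(source, (list, tuple, Iterator)):
        return [Path(item) for item in source]
    raise TypeError(f"unsupported source type: {type(source).__name__}")
```

Usage would look like `collect_sources(Path(".").glob("*.md"))`, which works whether `glob()` hands back a generator or another kind of iterator.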
As of python/cpython#117589 (at least), `Path.glob` returns an `Iterator` rather than `Generator` (which inherits from `Iterator`). `convert_file` doesn't need to care about this distinction; it can reasonably accept both.

This previously caused a test failure along these lines:

    ______ TestPypandoc.test_basic_conversion_from_file_pattern_pathlib_glob ______

    self = <tests.TestPypandoc testMethod=test_basic_conversion_from_file_pattern_pathlib_glob>

        def test_basic_conversion_from_file_pattern_pathlib_glob(self):
            received_from_str_filename_input = pypandoc.convert_file("./*.md", 'html').lower()
    >       received_from_path_filename_input = pypandoc.convert_file(Path(".").glob("*.md"), 'html').lower()

    tests.py:654:
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    source_file = <map object at 0x7f83952c9420>, to = 'html', format = None, extra_args = (), encoding = 'utf-8', outputfile = None, filters = None, verify_format = True
    sandbox = False, cworkdir = '/home/cjwatson/src/python/pypandoc', sort_files = True

    [...]
            if not _identify_path(discovered_source_files):
    >           raise RuntimeError("source_file is not a valid path")
    E           RuntimeError: source_file is not a valid path

    pypandoc/__init__.py:201: RuntimeError

Co-authored-by: Jessica Tegner <jessica@jessicategner.com>
Move pathlib globbing implementation into a new private class: `glob._Globber`. This class implements fast string-based globbing. It's called by `pathlib.Path.glob()`, which then converts strings back to path objects.

In the private pathlib ABCs, add a `pathlib._abc.Globber` subclass that works with `PathBase` objects rather than strings, and calls user-defined path methods like `PathBase.stat()` rather than `os.stat()`.

This sets the stage for two more improvements:

- `pathlib.Path.glob()` by removing redundant regex matching #115060
- `glob.glob()` by reducing number of system calls made #116380

Timings:

`pathlib.Path.glob()` by working with strings internally #117586