Optimize pathlib path construction #101362
Comments
I'd like to land #101363 before I put the first PR up for this issue.
`PurePath` now normalises and splits paths only when necessary, e.g. when `.name` or `.parent` is accessed. The result is cached. This speeds up path object construction by around 4x. `PurePath.__fspath__()` now returns an unnormalised path, which should be transparent to filesystem APIs (else pathlib's normalisation is broken!). This extends the earlier performance improvement to most impure `Path` methods, and also speeds up pickling, `p.joinpath('bar')` and `p / 'bar'`. This also fixes pythonGH-76846 and pythonGH-85281 by unifying path constructors and adding an `__init__()` method.
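To make the "parse only when necessary, then cache" idea concrete, here is a minimal sketch. `LazyPurePath` and its attributes are hypothetical names invented for illustration; this is not pathlib's implementation, and it only handles relative POSIX-style paths.

```python
import os
import posixpath


class LazyPurePath:
    """Hypothetical illustration of deferred parsing; not pathlib's code."""

    __slots__ = ("_raw", "_parts_cached")

    def __init__(self, *segments):
        # Construction only joins the segments; nothing is split or normalised yet.
        self._raw = posixpath.join(*segments) if segments else ""
        self._parts_cached = None

    def __fspath__(self):
        # Returns the unnormalised string; the filesystem API copes with it.
        return self._raw or "."

    def _parts(self):
        # Split and normalise on first use, then reuse the cached result.
        if self._parts_cached is None:
            self._parts_cached = [
                part for part in self._raw.split("/") if part and part != "."
            ]
        return self._parts_cached

    @property
    def name(self):
        parts = self._parts()
        return parts[-1] if parts else ""


p = LazyPurePath("foo//bar", "./baz")
print(os.fspath(p))  # raw, unnormalised string
print(p.name)        # triggers the one-off parse
```

Code that only constructs paths and hands them to the OS never pays for the split, which is where the construction speed-up would come from in a design like this.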
This saves a comparison in `pathlib.Path.__new__()` and reduces the time taken to run `Path()` by ~5%
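A rough sketch of what moving a check from call time to import time can look like. The class names here are hypothetical stand-ins, not pathlib's real classes, and the actual change may be structured differently. The point is that the "is this flavour supported on this OS?" decision is made once, when the class body runs, rather than on every instantiation.

```python
import os


class BasePath:
    """Hypothetical stand-in for a generic path class."""

    def __new__(cls, *args):
        # Only the generic class needs to pick a concrete subclass.
        if cls is BasePath:
            cls = WinLikePath if os.name == "nt" else PosixLikePath
        return object.__new__(cls)


class PosixLikePath(BasePath):
    if os.name == "nt":
        # Evaluated once at import time: on Windows this class gets a
        # __new__ that always refuses to instantiate.
        def __new__(cls, *args):
            raise NotImplementedError(f"cannot instantiate {cls.__name__!r} on your system")


class WinLikePath(BasePath):
    if os.name != "nt":
        # Evaluated once at import time on non-Windows systems.
        def __new__(cls, *args):
            raise NotImplementedError(f"cannot instantiate {cls.__name__!r} on your system")


p = BasePath("foo")
print(type(p).__name__)
```

On a supported platform the hot path through `BasePath.__new__()` contains no support check at all; instantiating the unsupported class still fails, just via a method that was installed up front.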
…b.PurePath This reduces the time taken to run `PurePath("foo")` by ~15%
The previous `_parse_args()` method pulled the `_parts` out of any supplied `PurePath` objects; these were subsequently joined in `_from_parts()` using `os.path.join()`. This is actually a slower form of joining than calling `fspath()` on the path object, because it doesn't take advantage of the fact that the contents of `_parts` are normalized! This reduces the time taken to run `PurePath("foo", "bar")` by ~20%, and the time taken to run `PurePath(p, "cheese")`, where `p = PurePath("/foo", "bar", "baz")`, by ~40%.
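For anyone who wants to reproduce numbers like these locally, a small `timeit` harness along the following lines should do. The three cases are the ones quoted above; the iteration counts and the `cases`/`results` names are arbitrary choices, and absolute timings will vary by machine and Python version.

```python
import timeit
from pathlib import PurePath

p = PurePath("/foo", "bar", "baz")

# The construction patterns mentioned in the PR descriptions.
cases = {
    'PurePath("foo")': lambda: PurePath("foo"),
    'PurePath("foo", "bar")': lambda: PurePath("foo", "bar"),
    'PurePath(p, "cheese")': lambda: PurePath(p, "cheese"),
}

NUMBER = 10_000  # calls per timing run (arbitrary)

# Take the best of three runs to reduce noise from other processes.
results = {
    label: min(timeit.repeat(fn, number=NUMBER, repeat=3))
    for label, fn in cases.items()
}

for label, seconds in results.items():
    print(f"{label}: {seconds / NUMBER * 1e6:.2f} us per call")
```

Running this on a build before and after each PR gives a like-for-like comparison of the three cases.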
Does the PR cope with
Could you clarify? The PRs maintain the behaviour that attempting to instantiate
This behaviour should be preserved:
Right! That will be broken by #101667 as things stand:

>>> from os import fspath
>>> from pathlib import *
>>> p = PureWindowsPath("a/b/c")
>>> p
PureWindowsPath('a/b/c')
>>> fspath(p)
'a\\b\\c'
>>> PurePosixPath(fspath(p))
PurePosixPath('a\\b\\c')
>>> PurePosixPath(p)
PurePosixPath('a\\b\\c')

It doesn't appear to be documented or tested behaviour, and it feels odd to me that
This feature doesn't really work with drives or roots:

>>> PurePosixPath(PureWindowsPath('//server/share/dir'))
PurePosixPath('\\\\server\\share\\/dir')
>>> PurePosixPath(PureWindowsPath('c:/dir'))
PurePosixPath('c:\\/dir')
>>> PurePosixPath(PureWindowsPath('/dir'))
PurePosixPath('\\/dir')

As far as I can tell, no one has ever logged a bug about it. However, using

>>> PurePosixPath(PureWindowsPath('//server/share/dir').as_posix())
PurePosixPath('//server/share/dir')
>>> PurePosixPath(PureWindowsPath('c:/dir').as_posix())
PurePosixPath('c:/dir')
>>> PurePosixPath(PureWindowsPath('/dir').as_posix())
PurePosixPath('/dir')

So I'm tempted to conclude that converting with
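The `as_posix()` round-trip from the session above can be wrapped in a small helper. `windows_to_posix` is a hypothetical name used here for illustration; it is not part of pathlib.

```python
from pathlib import PurePosixPath, PureWindowsPath


def windows_to_posix(path):
    """Hypothetical helper: convert a Windows-style path to a PurePosixPath.

    Going through as_posix() swaps backslashes for forward slashes first,
    so drives and roots are not left with stray backslashes.
    """
    return PurePosixPath(PureWindowsPath(path).as_posix())


for raw in ("//server/share/dir", "c:/dir", "/dir", r"a\b\c"):
    print(repr(windows_to_posix(raw)))
```

This sidesteps the question of what the cross-flavour constructor should do, since the conversion is explicit at the call site.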
I think the direct conversion should either be consistent with

Might be worth adding a short note to the docs warning that the constructor can't reliably convert from different

On another perf note, is it possible that parsing the path up front isn't necessary? Obviously it'll save the most time to keep a single string literal around and parse it later, but I don't personally have a good feel for whether that's common or not. (Obviously if it's available pre-parsed then keep it.)
That's my plan! We can return the unnormalized path from
…H-101664) This saves a comparison in `pathlib.Path.__new__()` and reduces the time taken to run `Path()` by ~5%. Automerge-Triggered-By: GH:AlexWaygood
The previous `_parse_args()` method pulled the `_parts` out of any supplied `PurePath` objects; these were subsequently joined in `_from_parts()` using `os.path.join()`. This is actually a slower form of joining than calling `fspath()` on the path object, because it doesn't take advantage of the fact that the contents of `_parts` is normalized! This reduces the time taken to run `PurePath("foo", "bar")` by ~20%, and the time taken to run `PurePath(p, "cheese")`, where `p = PurePath("/foo", "bar", "baz")`, by ~40%. Automerge-Triggered-By: GH:AlexWaygood
…b.PurePath() (python#101665) pythonGH-101362: Call join() only when >1 argument supplied to pathlib.PurePath This reduces the time taken to run `PurePath("foo")` by ~15%
Improve performance of path construction by skipping the addition of the path anchor (`drive + root`) to the internal `_parts` list. This change allows us to simplify the implementations of `joinpath()`, `name`, `parent`, and `parents` a little. The public `parts` tuple is unaffected.
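Since the change is to a private attribute, the public view of the anchor should look the same before and after. A quick check using only documented attributes (`drive`, `root`, `anchor`, `parts`, `name`, `parent`):

```python
from pathlib import PurePosixPath, PureWindowsPath

p = PureWindowsPath("c:/dir/file.txt")
q = PurePosixPath("/foo/bar")

# The anchor (drive + root) still appears as the first element of `parts`.
print(p.drive, p.root, p.anchor, p.parts)
print(q.drive, q.root, q.anchor, q.parts)

# Properties that were reimplemented on top of the internal list.
print(p.name, p.parent)
print(q.name, q.parent)
```

Anything that relied on the private `_parts` attribute directly would be affected, but that attribute was never part of the documented API.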
* main: (21 commits) pythongh-102192: Replace PyErr_Fetch/Restore etc by more efficient alternatives in sub interpreters module (python#102472) pythongh-95672: Fix versionadded indentation of get_pagesize in test.rst (pythongh-102455) pythongh-102416: Do not memoize incorrectly loop rules in the parser (python#102467) pythonGH-101362: Optimise PurePath(PurePath(...)) (pythonGH-101667) pythonGH-101362: Check pathlib.Path flavour compatibility at import time (pythonGH-101664) pythonGH-101362: Call join() only when >1 argument supplied to pathlib.PurePath() (python#101665) pythongh-102444: Fix minor bugs in `test_typing` highlighted by pyflakes (python#102445) pythonGH-102341: Improve the test function for pow (python#102342) Fix unused classes in a typing test (pythonGH-102437) pythongh-101979: argparse: fix a bug where parentheses in metavar argument of add_argument() were dropped (python#102318) pythongh-102356: Add thrashcan macros to filter object dealloc (python#102426) Move around example in to_bytes() to avoid confusion (python#101595) pythonGH-97546: fix flaky asyncio `test_wait_for_race_condition` test (python#102421) pythongh-96821: Add config option `--with-strict-overflow` (python#96823) pythongh-101992: update pstlib module documentation (python#102133) pythongh-63301: Set exit code when tabnanny CLI exits on error (python#7699) pythongh-101863: Fix wrong comments in EUC-KR codec (pythongh-102417) pythongh-102302 Micro-optimize `inspect.Parameter.__hash__` (python#102303) pythongh-102179: Fix `os.dup2` error reporting for negative fds (python#102180) pythongh-101892: Fix `SystemError` when a callable iterator call exhausts the iterator (python#101896) ...
For anyone following along, I think this ticket can be resolved if/when this PR lands:
Improve performance of path construction by skipping the addition of the path anchor (`drive + root`) to the internal `_parts` list. Rename this attribute to `_tail` for clarity.
Resolving this issue. It's now much cheaper to construct

Future work:
…ythonGH-102476) Improve performance of path construction by skipping the addition of the path anchor (`drive + root`) to the internal `_parts` list. Rename this attribute to `_tail` for clarity.
In #540 I found ~10% of the runtime to be attributable to the regex matching. This patch replaces the regex with two `split()` calls. In local benchmarks I see a ~10% speedup. One other major source of slowdown was (and is) the construction of `PurePosixPath` objects. It seems this is being taken care of by python/cpython#101362 and will be resolved eventually. Closes #540
Pathlib is slow. One of the most obvious symptoms is that `pathlib.PurePath` objects are slow to construct. We should be able to speed construction up without making other parts of pathlib slower.

Two possible approaches:
`__new__()`, `_from_parts()`, `_parse_parts()`, `_parse_args()`.

Linked PRs
pathlib.PurePath()._parts #102476