pathlib's deferred joining slows real workloads #113888

Closed
barneygale opened this issue Jan 10, 2024 · 2 comments
Labels
performance (Performance or resource usage), topic-pathlib

Comments

@barneygale
Contributor

In #104996 I made pathlib defer joining of arguments given to path initialisers (like PurePath('a', 'b')) and to joinpath(), __truediv__() and __rtruediv__().

This "optimisation" often results in more path joining. Consider:

test_path = pathlib.Path('/home', 'barney', 'projects', 'cpython', 'Lib', 'test')
print(test_path / 'test_abc.py')
print(test_path / 'test_pathlib')
print(test_path / 'test_zipfile')

(the print() could be any operation on the path object other than a further join)

Under the hood this results in the following calls:

os.path.join('/home', 'barney', 'projects', 'cpython', 'Lib', 'test', 'test_abc.py')  # cost=7
os.path.join('/home', 'barney', 'projects', 'cpython', 'Lib', 'test', 'test_pathlib')  # cost=7
os.path.join('/home', 'barney', 'projects', 'cpython', 'Lib', 'test', 'test_zipfile')  # cost=7
# total cost: 21

If we'd naively joined the paths, we'd instead have:

os.path.join('/home', 'barney', 'projects', 'cpython', 'Lib', 'test')  # cost=6
os.path.join('/home/barney/projects/cpython/Lib/test', 'test_abc.py')  # cost=2
os.path.join('/home/barney/projects/cpython/Lib/test', 'test_pathlib')  # cost=2
os.path.join('/home/barney/projects/cpython/Lib/test', 'test_zipfile')  # cost=2
# total cost: 12
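
The two tallies above can be reproduced with a small model. This is only a sketch: `JoinCounter`, `DeferredPath` and `EagerPath` are hypothetical stand-ins, not pathlib's actual implementation, and "cost" is assumed to be the number of arguments passed to `os.path.join()`, as in the comments above.

```python
import os.path


class JoinCounter:
    """Tallies how many arguments are handed to os.path.join()."""

    def __init__(self):
        self.cost = 0

    def join(self, *parts):
        self.cost += len(parts)
        return os.path.join(*parts)


class DeferredPath:
    """Hypothetical model: keep raw segments, join only when str() is called."""

    def __init__(self, counter, *segments):
        self._counter = counter
        self._segments = segments

    def __truediv__(self, other):
        # No join here; just remember one more segment.
        return DeferredPath(self._counter, *self._segments, other)

    def __str__(self):
        return self._counter.join(*self._segments)


class EagerPath:
    """Hypothetical model: join on construction and on every `/`."""

    def __init__(self, counter, *segments):
        self._counter = counter
        self._path = counter.join(*segments)

    def __truediv__(self, other):
        # Joins the already-joined string with the new segment (2 arguments).
        return EagerPath(self._counter, self._path, other)

    def __str__(self):
        return self._path


def shared_prefix_cost(path_cls):
    """Run the three-children workload from this issue and return its cost."""
    counter = JoinCounter()
    base = path_cls(counter, '/home', 'barney', 'projects', 'cpython', 'Lib', 'test')
    for name in ('test_abc.py', 'test_pathlib', 'test_zipfile'):
        str(base / name)
    return counter.cost


if __name__ == "__main__":
    print("deferred:", shared_prefix_cost(DeferredPath))  # 21
    print("eager:   ", shared_prefix_cost(EagerPath))     # 12
```

The model ignores everything except the join calls (parsing, normalisation, caching of the string form), so it says nothing about wall-clock time.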
@barneygale barneygale added the performance and topic-pathlib labels Jan 10, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Jan 10, 2024
Path modules provide a subset of the `os.path` API, specifically those
functions needed to provide `PurePathBase` functionality. Each
`PurePathBase` subclass references its path module via a `pathmod` class
attribute.

This commit adds a new `PathModuleBase` class, which provides abstract
methods that unconditionally raise `UnsupportedOperation`. An instance of
this class is assigned to `PurePathBase.pathmod`, replacing `posixpath`.
As a result, `PurePathBase` is no longer POSIX-y by default, and almost[^1]
all its methods raise `UnsupportedOperation` courtesy of `pathmod`.

Users who subclass `PurePathBase` or `PathBase` should choose the path
syntax by setting `pathmod` to `posixpath`, `ntpath`, `os.path`, or their
own subclass of `PathModuleBase`, as circumstances demand.

[^1] Except `joinpath()`, `__truediv__()`, `__rtruediv__()`. See pythonGH-113888.
@barneygale
Contributor Author

OTOH:

print(pathlib.Path('/home') / 'barney' / 'projects' / 'cpython' / 'Lib' / 'test')

With the current implementation this causes:

os.path.join('/home', 'barney', 'projects', 'cpython', 'Lib', 'test')  # cost=6
# total cost: 6

But if we joined up-front, it would cause:

os.path.join('/home', 'barney') # cost=2
os.path.join('/home/barney', 'projects') # cost=2
os.path.join('/home/barney/projects', 'cpython') # cost=2
os.path.join('/home/barney/projects/cpython', 'Lib') # cost=2
os.path.join('/home/barney/projects/cpython/Lib', 'test') # cost=2
# total cost: 10
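
Both workloads can be put side by side with closed-form costs. The function names below are hypothetical, and the formulas assume the same cost measure as the comments in this issue (one unit per argument passed to `os.path.join()`), with each resulting path used exactly once.

```python
def shared_prefix_costs(base_segments, children):
    """Return (deferred, eager) cost for one base path joined with `children` names.

    Deferred: every child re-joins all base segments plus its own name.
    Eager: the base is joined once, then each child is a 2-argument join.
    """
    deferred = children * (base_segments + 1)
    eager = base_segments + 2 * children
    return deferred, eager


def chained_costs(segments):
    """Return (deferred, eager) cost for a single chain of `/` joins.

    Deferred: one join over all segments when the path is finally used.
    Eager: one 2-argument join per `/` operator.
    """
    deferred = segments
    eager = 2 * (segments - 1)
    return deferred, eager


if __name__ == "__main__":
    print(shared_prefix_costs(6, 3))  # (21, 12): first example in this issue
    print(chained_costs(6))           # (6, 10): the chained example
```

Under these assumptions deferred joining wins for long chains that are used once, and eager joining wins when a multi-segment base is reused for several children.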

@barneygale
Contributor Author

I haven't been able to establish which is better, so I'll close this issue as not planned.
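
For anyone who wants to measure rather than count, a rough `timeit` sketch of the two workloads follows. The helper names (`shared_prefix`, `chained`, `measure`) are hypothetical, `str()` stands in for "any operation other than a further join", and the numbers will depend on the Python version and machine, so no results are claimed here.

```python
import timeit
from pathlib import PurePosixPath


def shared_prefix():
    """One multi-segment base path, three children, each used once."""
    base = PurePosixPath('/home', 'barney', 'projects', 'cpython', 'Lib', 'test')
    for name in ('test_abc.py', 'test_pathlib', 'test_zipfile'):
        str(base / name)


def chained():
    """One path built by chaining `/`, used once."""
    str(PurePosixPath('/home') / 'barney' / 'projects' / 'cpython' / 'Lib' / 'test')


def measure(func, number=10_000, repeat=5):
    """Return the best per-call time in seconds over `repeat` runs."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


if __name__ == "__main__":
    for func in (shared_prefix, chained):
        print(f"{func.__name__}: {measure(func) * 1e6:.2f} us per call")
```

Running this on builds with and without the deferred-joining change would compare the two strategies on these specific workloads only.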

@barneygale barneygale closed this as not planned Jan 13, 2024