Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop manually interning strings in pathlib #119518

Closed
barneygale opened this issue May 24, 2024 · 6 comments
Closed

Stop manually interning strings in pathlib #119518

barneygale opened this issue May 24, 2024 · 6 comments
Labels
performance Performance or resource usage topic-pathlib

Comments

@barneygale
Copy link
Contributor

barneygale commented May 24, 2024

When parsing and normalizing a path, pathlib calls sys.intern() on the string parts:

parsed = [sys.intern(str(x)) for x in rel.split(sep) if x and x != '.']

I've never been able to establish that this is a worthwhile thing to do. The implementation seems incomplete, because the path normalization only occurs when a user manually initialises a path object, and not in paths generated from path.iterdir(), path.walk(), etc. Drives/roots/anchors aren't interned despite most likely to be shared.

Previous discussion: #112856 (comment)

Linked PRs

@barneygale barneygale added performance Performance or resource usage topic-pathlib labels May 24, 2024
@barneygale
Copy link
Contributor Author

CC @pitrou

@pitrou
Copy link
Member

pitrou commented May 28, 2024

I've never been able to establish that this is a worthwhile thing to do.

That's a good question. The contention is that, if you keep a lot of related paths in memory, interning the path components would yield significant memory savings. But how useful it is would depend on the use case; and it's probably possible to construct use cases where it would be detrimental.

@barneygale
Copy link
Contributor Author

Thank you :)

it's probably possible to construct use cases where it would be detrimental.

Surprisingly, sys.intern(str(x)) accounts for ~10% of the cost of str(Path('foo', 'bar')) on my machine. i.e. this patch makes it ~10% faster:

diff --git a/Lib/pathlib/_local.py b/Lib/pathlib/_local.py
index 49d9f813c5..d20512bd9b 100644
--- a/Lib/pathlib/_local.py
+++ b/Lib/pathlib/_local.py
@@ -270,7 +270,7 @@ def _parse_path(cls, path):
             elif len(drv_parts) == 6:
                 # e.g. //?/unc/server/share
                 root = sep
-        parsed = [sys.intern(str(x)) for x in rel.split(sep) if x and x != '.']
+        parsed = [x for x in rel.split(sep) if x and x != '.']
         return drv, root, parsed
 
     @property

@pitrou
Copy link
Member

pitrou commented May 28, 2024

It can probably be micro-optimized if you really care. For example I'm not sure the str call is still useful.

>>> x = 'foo'
>>> %timeit str(x)
40.3 ns ± 0.169 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
>>> %timeit sys.intern(x)
46.9 ns ± 0.402 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
>>> %timeit sys.intern(str(x))
79.7 ns ± 0.635 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
>>> _intern = sys.intern
>>> %timeit _intern(x)
28.3 ns ± 0.11 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

@barneygale
Copy link
Contributor Author

Hum! That change used to trip up the test_str_subclass* tests, but it doesn't any more. I wonder if there's been a change in how string subclasses work elsewhere in the language.

@vstinner
Copy link
Member

vstinner commented Jun 3, 2024

The contention is that, if you keep a lot of related paths in memory, interning the path components would yield significant memory savings.

Can pathlib have its own "interned strings" cache with a limit on the cache size? Well, I don't know if it's worth it :-)

Pseudo-code with a limit on 3 entries:

cache = {}
def get(key):
    try:
        return cache[key]
    except KeyError:
        pass

    if len(cache) >= 3:
        return key

    cache[key] = key
    return key

abc = get(b'abc'.decode())
abc2 = get(b'abc'.decode())
assert abc2 is abc

d = get(b'd'.decode())
e = get(b'e'.decode())
e2 = get(b'e'.decode())
assert e2 is e

# cache no longer used
f = get(b'f'.decode())
print("cache size", len(cache))

Can pathlib remove entries from such cache?

barneygale added a commit to barneygale/cpython that referenced this issue Aug 26, 2024
Remove `sys.intern(str(x))` calls when normalizing a path in pathlib. This
speeds up `str(Path('foo/bar'))` by about 10%.
barneygale added a commit to barneygale/cpython that referenced this issue Aug 26, 2024
Remove `sys.intern(str(x))` calls when normalizing a path in pathlib. This
speeds up `str(Path('foo/bar'))` by about 10%.
encukou pushed a commit that referenced this issue Sep 2, 2024
Remove `sys.intern(str(x))` calls when normalizing a path in pathlib. This
speeds up `str(Path('foo/bar'))` by about 10%.
@encukou encukou closed this as completed Sep 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage topic-pathlib
Projects
None yet
Development

No branches or pull requests

4 participants