Make sure that the typing cache differentiates Union[int, str] and Union[str, int] #103749

Open
sobolevn opened this issue Apr 24, 2023 · 9 comments
Labels: topic-typing, type-feature (A feature request or enhancement)

@sobolevn
Member

Previous discussions:

It was decided (by @gvanrossum and @samuelcolvin) that there's no need to change __hash__ or __eq__ of Union objects; we can just change how the typing cache works.

In other words, we need to change this behavior:

>>> from typing import List, Union
>>> List[Union[str, int]] is List[Union[str, int]]
True
>>> List[Union[str, int]] is List[Union[int, str]]
True

To be:

>>> from typing import List, Union
>>> List[Union[str, int]] is List[Union[str, int]]   # still the same
True
>>> List[Union[str, int]] is List[Union[int, str]]   # change, because order of `Union.__args__` changed
False

This way we can fix this confusing behavior:

>>> List[Union[str, int]].__args__
(typing.Union[str, int],)

>>> List[Union[int, str]].__args__  # order of args changes
(typing.Union[str, int],)

Instead it would be:

>>> List[Union[str, int]].__args__
(typing.Union[str, int],)

>>> List[Union[int, str]].__args__   # order is the same 
(typing.Union[int, str],)
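
For context on why the current collapsing happens at all (not spelled out above, but observable on today's CPython): _UnionGenericAlias compares and hashes its arguments as a set, so the two spellings look "equal" to the functools.lru_cache inside typing's _tp_cache, and the first List[...] alias ever created wins:

>>> from typing import Union
>>> Union[str, int] == Union[int, str]             # order-insensitive equality
True
>>> hash(Union[str, int]) == hash(Union[int, str])
True
>>> Union[str, int].__args__                       # each spelling keeps its own order...
(str, int)
>>> Union[int, str].__args__                       # ...only the outer cache collapses them
(int, str)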

I am interested in working on this: typing.py is my favourite module, plus I have a highly related open PR about the types.GenericAlias cache.

If someone else is already working on it, please let me know :)

sobolevn added the type-feature (A feature request or enhancement) and topic-typing labels on Apr 24, 2023
sobolevn self-assigned this on Apr 24, 2023
@gvanrossum
Member

Thanks -- go for it! (Are we also considering adding a cache for list[int|str], or even for int|str?)
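
(For reference, the PEP 604 syntax has no such cache at all today: each evaluation of int | str builds a fresh types.UnionType, and list[int | str] builds a fresh types.GenericAlias, so identity checks there currently fail in both directions. A quick check on recent CPython:)

>>> (int | str) is (int | str)
False
>>> list[int | str] is list[int | str]
False
>>> (int | str) == (str | int)   # equality is still order-insensitive
True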

@adriangb
Contributor

I did a quick experiment. I’m sure you can come up with something better, but maybe it’s a good starting point: https://github.com/python/cpython/compare/main...adriangb:cpython:union-cache?expand=1

@sobolevn
Member Author

Are we also considering adding a cache for list[int|str], or even for int|str?

@gvanrossum, yes. I am working on it in #103541 (comment)

@adriangb yes, this is the same idea I had: tweaking the functools.lru_cache cache key generation. But I only change it for Union items. I would highly appreciate your review when the PR is ready 👍

@sobolevn
Member Author

sobolevn commented May 2, 2023

I had some time to work on this today. I have a partially working solution, but I am not a big fan of what I came up with.

So, here's what my new _tp_cache function looks like:

def _tp_cache(func=None, /, *, typed=False):
    """Internal wrapper caching __getitem__ of generic types with a fallback to
    original function for non-hashable arguments.
    """
    def decorator(func):
        # The callback 'inner' references the newly created lru_cache
        # indirectly by performing a lookup in the global '_caches' dictionary.
        # This breaks a reference that can be problematic when combined with
        # C API extensions that leak references to types. See GH-98253.

        def _tp_wrapper(func):
            def ignored_first_arg(tp_key, /, *args, **kwargs):
                # We just ignore the first `tp_key`, since it is a special
                # hack to distinguish unions from each other.
                return func(*args, **kwargs)
            return ignored_first_arg

        def _tp_cache_key(args):
            new_args = []
            for arg in args:
                if isinstance(arg, _UnionGenericAlias):
                    new_args.append((Union, *arg.__args__))
                else:
                    new_args.append(arg)
            return tuple(new_args)

        cache = functools.lru_cache(typed=typed)(_tp_wrapper(func))
        _caches[func] = cache
        _cleanups.append(cache.cache_clear)
        del cache

        @functools.wraps(func)
        def inner(*args, **kwds):
            try:
                return _caches[func](_tp_cache_key(args), *args, **kwds)
            except TypeError:
                pass  # All real errors (not unhashable args) are raised below.
            return func(*args, **kwds)
        return inner

    if func is not None:
        return decorator(func)

    return decorator
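
To see the trick in isolation: functools.lru_cache builds its key from all positional arguments, so prepending an extra argument that encodes the Union argument order forces otherwise-"equal" calls into separate cache entries. A minimal standalone sketch of that mechanism (illustrative names, not code from typing.py):

import functools

@functools.lru_cache
def _cached(discriminator, /, value):
    # `discriminator` is ignored by the body; it only widens the cache key.
    return object()

a = _cached("first", 1.0)
b = _cached("second", 1)   # 1 == 1.0 and they hash equal, but the different
assert a is not b          # discriminators keep the two cache entries separate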

It solves the problem partially. All tests pass. Plus, I have this extra test that also passes:

    @cpython_only
    def test_special_caching(self):
        types = [
            Union[int, str],
            Optional[int],
            List[Union[int, str]],
            List[List[Union[int, str]]],
        ]
        for typ in types:
            with self.subTest(typ=typ):
                self.assertIs(typ, typ)

        self.assertIsNot(Union[str, int], Union[int, str])
        self.assertIsNot(Union[str, None], Union[None, str])

        self.assertIsNot(Union[type(None), int], Optional[int])
        self.assertIs(Union[int, type(None)], Optional[int])

        self.assertIsNot(List[Union[str, int]],
                         List[Union[int, str]])

But there's still a problem with deeply nested Union types: for example, List[List[Union[str, int]]] is List[List[Union[int, str]]] is still True.

What can we do?

  1. Go through all arguments and their nested type args and convert them to a special Union type representation (a rough sketch of such a recursive key follows this list). I don't quite like this approach: it seems slow, complex, and hacky; basically, we would compute cache keys twice.
  2. Completely rewrite _tp_cache to not use functools.lru_cache and functools._make_key, and use its own _make_key that handles Union specially. In this case we would suffer a performance degradation, since lru_cache uses several C functions that are much faster than anything I can write in pure Python.
  3. Add an optional make_key_func argument to functools.lru_cache so it can use custom key logic (I like this approach more than the others). I think there might be other use cases for this in the wild, so the change won't be useless for other developers (unlike options 1 and 2).
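
For illustration only, a recursive key along the lines of option 1 could look roughly like this (hypothetical helper, not part of any patch):

def _order_sensitive_key(arg):
    # Recursively expand anything that has `__args__`, so that Union argument
    # order survives in the cache key even when nested several levels deep.
    sub_args = getattr(arg, '__args__', None)
    if sub_args is None:
        return arg
    origin = getattr(arg, '__origin__', type(arg))
    return (origin, tuple(_order_sensitive_key(a) for a in sub_args))

With something like this, List[List[Union[str, int]]] and List[List[Union[int, str]]] would produce different keys, but every subscription then pays for a full walk of the type tree on top of lru_cache's own key computation, which is the "compute cache keys twice" cost from option 1.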

What do others think?

CC @JelleZijlstra @AlexWaygood

@adriangb
Contributor

adriangb commented May 2, 2023

What happens if you remove if isinstance(arg, _UnionGenericAlias): and always use __args__ (e.g. getattr(obj, '__args__', None))?

@sobolevn
Member Author

sobolevn commented May 2, 2023

It won't change much, because it will translate (List, List[Union[int, str]]) to (List, (List, Union[str, int]))

@adriangb
Contributor

Is there custom Python logic that would work? How bad would the performance impact be? As far as I know the purpose of this cache is memory, not performance. If the difference of doing it in Python is 10% slower but the cache itself is 200% slower then maybe it’s okay because we’ve already accepted the performance trade off of having a cache?

@adriangb
Contributor

adriangb commented May 13, 2023

@sobolevn what if instead of using __args__ as part of the cache key we use tuple(id(a) for a in __args__)?

Edit: I tried it in the branch I mentioned above (https://github.com/python/cpython/compare/main...adriangb:cpython:union-cache?expand=1) and it seems to work, albeit with a probably less performant and less elegant version of _tp_cache.
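
For concreteness, the id-based key amounts to something like this sketch (illustrating the idea, not the exact code in that branch):

def _id_key(arg):
    # The args of Union[int, str] and Union[str, int] appear in different
    # orders, so a tuple of their ids distinguishes the two spellings even
    # though the union objects themselves compare equal.
    sub_args = getattr(arg, '__args__', None)
    if sub_args is None:
        return arg
    return tuple(id(a) for a in sub_args)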

@sobolevn
Member Author

Using id (or anything else) does not solve the fundamental problem of computing the cache key twice. I will return to this issue quite shortly; right now I am a bit busy with some other stuff ;)
