
[proforma] Use cache on modification resolvers #147

Closed
wants to merge 3 commits

Conversation

@RalfG (Contributor) commented Apr 16, 2024

This PR adds caching to proforma modification resolvers.

In one of our use cases - reading PSMs from a search engine and recalculating theoretical masses - the same modifications (but different object instances) are resolved many times. Adding cache decorators to the resolvers massively speeds up this operation.

I assume there are no drawbacks to adding this. If lru_cache cannot be imported (Python < 3.2), a no-op dummy decorator is defined instead. For now, I set maxsize to None, as the set of distinct modifications should be finite in most realistic use cases.
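For reference, the pattern would look roughly like this (a sketch of the idea, not the exact diff; SomeResolver is a hypothetical stand-in for the resolver classes):

try:
    from functools import lru_cache
except ImportError:  # Python < 3.2: fall back to a no-op decorator
    def lru_cache(maxsize=None):
        def decorator(func):
            return func
        return decorator

class SomeResolver(object):  # stand-in for a modification resolver class
    @lru_cache(maxsize=None)  # unbounded: the set of modifications is finite
    def resolve(self, name):
        print('expensive lookup for %s' % name)  # stands in for the database query
        return name.lower()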

@levitsky (Owner)

Thank you @RalfG! Do you think we can use the old utility function auxiliary.memoize instead of a no-op dummy cache?

@mobiusklein could you also take a look at this?

P.S. Another unnecessary optimization idea: should we wrap the resolve methods of all ModificationResolver subclasses automatically, rather than doing so in each class?

@RalfG (Contributor, Author) commented Apr 16, 2024

Did not realize there was a built-in equivalent of lru_cache in Pyteomics. I updated the code to always use that one. Just to be sure, I also set the maxsize to 10000; I don't expect anyone to ever reach that number.
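Roughly, the updated code would look like this (a sketch, assuming auxiliary.memoize is a decorator factory that accepts a maxsize argument, as implied above; SomeResolver is again a hypothetical stand-in):

from pyteomics.auxiliary import memoize

class SomeResolver(object):  # hypothetical stand-in for a resolver class
    @memoize(maxsize=10000)  # assumed factory signature, per the discussion above
    def resolve(self, name):
        return name.lower()  # stands in for the expensive lookup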

@RalfG (Contributor, Author) commented Apr 16, 2024

> P.S. Another unnecessary optimization idea: should we wrap the resolve methods of all ModificationResolver subclasses automatically, rather than doing so in each class?

Not sure how to implement it that way, but feel free to modify the PR.

@mobiusklein (Contributor) commented Apr 17, 2024

The auxiliary.memoize decorator probably isn't suitable for use on an instance method: the cache would be shared across instances, and you'd have self as part of the method signature being hashed, which only works because ModificationResolver implements __eq__ and __hash__. The same thing happens with the Python 3 functools.cache decorator (see https://docs.python.org/3/faq/programming.html#faq-cache-method-calls).
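A small illustration of that failure mode (hypothetical code mirroring the linked FAQ entry, using functools.lru_cache):

import functools

class Resolver(object):  # hypothetical example class
    @functools.lru_cache(maxsize=None)
    def resolve(self, name):
        # The cache lives on the function object, so it is shared by all
        # instances; `self` is part of every key (and must be hashable),
        # and cached entries keep their instances alive.
        return name.lower()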

Ideally, you'd have the instance manage the cache itself, with as much of the heavy lifting done by the base class.

class ModificationResolver(object):
    def __init__(self, name, **kwargs):
        self.name = name.lower()
        self.symbol = self.name[0]
        self._database = None
        self._cache = {}
    ...
    # Move existing implementations of `resolve` to `_resolve_impl`
    def _resolve_impl(self, name=None, id=None, **kwargs):
        raise NotImplementedError()

    # Manage the cache inside `resolve`
    def resolve(self, name=None, id=None, **kwargs):
        # A subclass (or user) can disable caching by setting `_cache` to None
        if self._cache is None:
            return self._resolve_impl(name, id, **kwargs)
        # Fold `kwargs` into a hashable key
        cache_key = (name, id, frozenset(kwargs.items()))
        if cache_key in self._cache:
            return self._cache[cache_key]
        value = self._resolve_impl(name, id, **kwargs)
        self._cache[cache_key] = value
        return value

This involves no metaprogramming and is just a bit more efficient than memoize too.
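For illustration, a hypothetical subclass using this scheme might look like the following (the class name and data source are invented for the example):

class DictResolver(ModificationResolver):
    # Hypothetical subclass: resolves modifications from a plain dict
    def __init__(self, definitions, **kwargs):
        super(DictResolver, self).__init__('dict', **kwargs)
        self.definitions = definitions

    def _resolve_impl(self, name=None, id=None, **kwargs):
        return self.definitions[name]  # the "expensive" lookup

resolver = DictResolver({'oxidation': 15.9949})
resolver.resolve(name='oxidation')  # computed and stored in self._cache
resolver.resolve(name='oxidation')  # served straight from self._cache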

Otherwise, we could add a metaclass to ModificationResolver that automatically wraps the resolve method object in a memoize during class body execution and deals with inheritance of the cache properly. That's a lot of code for not a lot of benefit, but makes it effortless forever after.
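A rough sketch of what such a metaclass could look like (hypothetical code, Python 3 syntax, keeping the cache per instance; not part of this PR):

import functools

class CachingResolverMeta(type):
    # Wrap any `resolve` defined in a class body with a per-instance cache
    def __new__(mcs, clsname, bases, namespace):
        original = namespace.get('resolve')
        if original is not None:
            @functools.wraps(original)
            def resolve(self, name=None, id=None, **kwargs):
                # Each instance gets its own cache dict, so nothing is
                # shared across instances and `self` is never hashed
                cache = self.__dict__.setdefault('_resolve_cache', {})
                key = (name, id, frozenset(kwargs.items()))
                if key not in cache:
                    cache[key] = original(self, name=name, id=id, **kwargs)
                return cache[key]
            namespace['resolve'] = resolve
        return super().__new__(mcs, clsname, bases, namespace)

class ModificationResolver(object, metaclass=CachingResolverMeta):
    def resolve(self, name=None, id=None, **kwargs):
        raise NotImplementedError()

A real implementation would also have to avoid double-wrapping when a subclass overrides resolve and calls super().resolve(), which is part of the "deals with inheritance" work mentioned above.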

A third way would be to split the difference with Python's name resolution system and just do this:

from pyteomics.auxiliary import memoize

class ModificationResolver(object):
    def __init__(self, name, **kwargs):
        self.name = name.lower()
        self.symbol = self.name[0]
        self._database = None
        # Shadow the class-level method with a caching wrapper around the
        # bound method (memoize is a decorator factory, hence the extra call)
        self.resolve = memoize()(self.resolve)

Take a reference to the bound instance method resolve on self, and replace it with an instance attribute that happens to be a callable object wrapping that method. Arguably less obscure than a metaclass, but hard to read. It also doesn't kick in if a subclass doesn't call the base class __init__ method. The cache keeps a reference to self through the bound method, but it isn't shared across instances.

In terms of compatibility, the three options differ:

  1. Not a breaking change per se, but subclasses which call the base class resolve instead of _resolve_impl may behave strangely.
  2. Not a breaking change. Everything is handled by the type system, but it is probably hard to get right on the first try.
  3. Theoretically a breaking change, but the "breaking" is just that no caching is used by a subclass.

Which road seems best? I think letting the instance manage the cache (option 1) is probably best, despite the caveat noted above. The third option is hacky but cheap. The second option is much more work than the first, and if we're doing that, we may as well implement some kind of instance-aware caching anyway.

@RalfG (Contributor, Author) commented Apr 17, 2024

Thanks for the detailed answer, @mobiusklein! I don't have a hard preference for any of the solutions. The first seems the most doable?

@levitsky (Owner)

Thank you from me as well, @mobiusklein; your analyses are always full of insight.

Although I would enjoy trying to get option 2 to work when I have time (which is next week), I can easily imagine shooting myself (or someone else) in the foot with it in the process. That said, I don't exactly understand what you mean by the cache inheritance that the metaclass code would have to handle. Wouldn't it essentially do what option 1 does, "renaming" (or "de-naming"?) the original resolve method and replacing it with a caching version operating on self._cache?

Overall I think option 1 is the most straightforward, and it allows creating subclasses that benefit from caching as well as subclasses that don't use it, so I would not object to just going with that if it seems to be the overall preference.

@mobiusklein (Contributor)

I had some other changes already in place for ProForma based upon the recent discussion in HUPO-PSI/ProForma#6. These changes and the first approach I proposed are now in #148.

@levitsky (Owner)

Thanks everyone, #148 is merged. Closing this one, but feel free to follow up as needed.

@levitsky levitsky closed this Apr 24, 2024