Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TypeError when loading model using methodcaller #283

Closed
merveenoyan opened this issue Jan 24, 2023 · 2 comments · Fixed by #287
Closed

TypeError when loading model using methodcaller #283

merveenoyan opened this issue Jan 24, 2023 · 2 comments · Fixed by #287
Labels
bug Something isn't working persistence Secure persistence feature

Comments

@merveenoyan
Copy link
Collaborator

Came across this error for only this model generated using the test case generation script in api-inference-community:
*** TypeError: methodcaller needs at least one argument, the method name

To reproduce:

from skops import hub_utils
hub_utils.download(repo_id = "skops-tests/textclassification-sklearn-latest-hist_gradient_boosting-with-config-skops", dst="model_dir")

import skops.io as sio
sio.load("/content/model_dir/skops-v588xnq6.skops", trusted=True)

At first I thought something might've been wrong with how I serialize it but I think it's weird how every test passes and this model fails for inference.

@BenjaminBossan BenjaminBossan added bug Something isn't working persistence Secure persistence feature labels Jan 25, 2023
@BenjaminBossan
Copy link
Collaborator

BenjaminBossan commented Jan 25, 2023

@E-Aho @adrinjalali This is a veritable bug, here is a minimal reproduction:

from sklearn.base import BaseEstimator
from operator import methodcaller

class MyEstimator(BaseEstimator):
   def __init__(self):
       pass
   def fit(self, X, y):
       self.meth_ = methodcaller("foo")
       return self

est = MyEstimator().fit(None, None)
sio.loads(sio.dumps(est), trusted=True)

Resulting in:

--> 397 instance = cls.__new__(cls)  # type: ignore
    399 if not self.children["attrs"]:
    400     # nothing more to do
    401     return instance

TypeError: methodcaller needs at least one argument, the method name

The offending line:

instance = cls.__new__(cls) # type: ignore

This could be an interesting one to resolve. Presumably, methodcaller is not the only type that fails like this, we should ensure that other similar types work too.

@merveenoyan methodcaller is used here:

https://github.com/huggingface/api-inference-community/blob/763efb856835098e7c9b91219b241ea7e524e695/docker_images/sklearn/tests/generators/generate.py#L96

The script could be rewritten to not use methodcaller and should then pass.

@BenjaminBossan BenjaminBossan changed the title TypeError when loading model TypeError when loading model using methodcaller Jan 25, 2023
@adrinjalali
Copy link
Member

I was confused for a while, since methodcaller is defined here (https://github.com/python/cpython/blob/498598e8c2d64232d26c075de87c513415176bbf/Lib/operator.py#L302) as:

class methodcaller:
    """
    Return a callable object that calls the given method on its operand.
    After f = methodcaller('name'), the call f(r) returns r.name().
    After g = methodcaller('name', 'date', foo=1), the call g(r) returns
    r.name('date', foo=1).
    """
    __slots__ = ('_name', '_args', '_kwargs')

    def __init__(self, name, /, *args, **kwargs):
        self._name = name
        if not isinstance(self._name, str):
            raise TypeError('method name must be a string')
        self._args = args
        self._kwargs = kwargs

    def __call__(self, obj):
        return getattr(obj, self._name)(*self._args, **self._kwargs)

    def __repr__(self):
        args = [repr(self._name)]
        args.extend(map(repr, self._args))
        args.extend('%s=%r' % (k, v) for k, v in self._kwargs.items())
        return '%s.%s(%s)' % (self.__class__.__module__,
                              self.__class__.__name__,
                              ', '.join(args))

    def __reduce__(self):
        if not self._kwargs:
            return self.__class__, (self._name,) + self._args
        else:
            from functools import partial
            return partial(self.__class__, self._name, **self._kwargs), self._args

which doesn't implement __new__. So it was puzzling why the error. Then I realized, like many other things in python, there's also a C implementation of it, and it implements __new__ here (https://github.com/python/cpython/blob/498598e8c2d64232d26c075de87c513415176bbf/Modules/_operator.c#L1563), which starts as:

static PyObject *
methodcaller_new(PyTypeObject *type, PyObject *args, PyObject *kwds)
{
    methodcallerobject *mc;
    PyObject *name;

    if (PyTuple_GET_SIZE(args) < 1) {
        PyErr_SetString(PyExc_TypeError, "methodcaller needs at least "
                        "one argument, the method name");
        return NULL;
    }

Anyhow, now we know it's happening because methodcaller has a custom __new__ method which requires an extra arg. Luckily not many objects implement __new__, and I always thought for these objects we'd have a dedicated loader, like we do for set, dict etc. Cause I really don't want to rely on their __reduce__ lol

As for the issue on the hub, @merveenoyan you can change the code to avoid using methodcaller, while we fix it here for this object.

BenjaminBossan added a commit to BenjaminBossan/skops that referenced this issue Jan 30, 2023
Fixes skops-dev#283

Description

Some of the operator functions can be quite useful sklearn users,
especially in conjunction with FunctionTransformer. This is because it
often allows to avoid writing small functions or using lambda, which
can't be persisted. However, since some of these functions are actually
classes defined in CPython that need special treatment, our code didn't
work with them.

This fix consists of two parts.

On the one hand, functions like methodcaller now have a special dispatch
type that correctly knows how to initialize them.

On the other hand, we had a problem with some funcs like operator.add.
Since they take 2 arguments, it would be useful for them to be passed
without initialization in conjunction with partial. However, their type
is builtin_function_or_method. By default, they would be handled by
object_get_state, but that doesn't work correctly with uninitialized
objects. This has now been fixed.

Open questions

To get the type of operator.add et al, we need
builtin_function_or_method. Right now, I just do
builtin_function_or_method = type(len). Is there a better way?

Regarding operators like attrgetter and methodcaller, could there be a
security implication by allowing them? Sure, users can decide not to
trust that type, but whether it is trustworthy or not depends on its
arguments. Right now, we just tell them that they need to trust
_operator.attrgetter, which could seem harmless.
adrinjalali pushed a commit that referenced this issue Mar 6, 2023
Fixes #283

## Description

Some of the operator functions can be quite useful sklearn users, especially in conjunction with `FunctionTransformer`. This is because it often allows to avoid writing small functions or using `lambda`, which can't be persisted. However, since some of these functions are actually classes defined in CPython that need special treatment, our code didn't work with them.

This fix consists of two parts.

On the one hand, functions like `methodcaller` now have a special dispatch type that knows how to correctly initialize them.

On the other hand, we had a problem with some types like `operator.add`. Since they take 2 arguments, it would be useful for them to be passed without initialization in conjunction with `partial`. However, their type is `builtin_function_or_method`. By default, they would be handled by `object_get_state`, but that doesn't work correctly with uninitialized objects. This has now been fixed by dispatching to `type_get_state`.

## Open questions

To get the type of `operator.add` et al., we need `builtin_function_or_method`. Right now, I just do `builtin_function_or_method = type(len)`. Is there a better way?

Regarding operators like `attrgetter` and `methodcaller`, could there be a security implication by allowing them? Sure, users can decide not to trust that type, but whether it is trustworthy or not depends on its arguments. Right now, we just tell them that they need to trust `_operator.attrgetter`, which could seem harmless. The same argument can of course be made for `eval` & co, but there, users will be much more likely to recognize their danger.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working persistence Secure persistence feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants