FIX: Support operators persistence #287

BenjaminBossan
Collaborator

Fixes #283

Description

Some of the operator functions can be quite useful to sklearn users, especially in conjunction with FunctionTransformer, because they often make it possible to avoid writing small functions or using lambda, which can't be persisted. However, since some of these functions are actually classes defined in CPython that need special treatment, our code didn't work with them.
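
As a rough illustration of the motivation, the sketch below uses the standard pickle module as a stand-in for persistence. It is not the skops format, and can_pickle is a hypothetical helper introduced only for this example. It shows that a lambda cannot be serialized, while the equivalent methodcaller object can.

```python
import operator
import pickle


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# A lambda has no importable name, so pickle cannot serialize it.
upper_lambda = lambda s: s.upper()

# methodcaller expresses the same operation as an importable object.
upper_caller = operator.methodcaller("upper")

lambda_ok = can_pickle(upper_lambda)
caller_ok = can_pickle(upper_caller)

# Round-trip the methodcaller and check it still behaves the same.
restored = pickle.loads(pickle.dumps(upper_caller))
```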

This fix consists of two parts.

On the one hand, functions like methodcaller now have a special dispatch type that knows how to correctly initialize them.

On the other hand, we had a problem with some types like operator.add. Since they take 2 arguments, it would be useful for them to be passed without initialization in conjunction with partial. However, their type is builtin_function_or_method. By default, they would be handled by object_get_state, but that doesn't work correctly with uninitialized objects. This has now been fixed by dispatching to type_get_state.
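
The distinction between the two cases can be checked with the standard library alone. This is a minimal sketch, assuming CPython with its C implementation of the operator module; the variable names are made up for the example.

```python
import operator
from functools import partial

# operator.add is a C-level function, not a class.
add_type_name = type(operator.add).__name__

# methodcaller, by contrast, is a class that must be called with arguments.
methodcaller_is_class = isinstance(operator.methodcaller, type)

# Binding one argument with partial yields a one-argument callable,
# which is the shape a FunctionTransformer-style API expects.
add_one = partial(operator.add, 1)
result = add_one(41)

# Creating a methodcaller without its required argument is expected to
# raise TypeError on CPython, which is why a generic cls.__new__(cls)
# path cannot rebuild it.
try:
    operator.methodcaller.__new__(operator.methodcaller)
    bare_new_failed = False
except TypeError:
    bare_new_failed = True
```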

Open questions

To get the type of operator.add et al., we need builtin_function_or_method. Right now, I just do builtin_function_or_method = type(len). Is there a better way?
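
One possible answer, offered as a suggestion rather than as what this PR does: the standard library types module exposes the same type under the name types.BuiltinFunctionType, which is itself defined as type(len). The sketch below only checks that the two spellings agree.

```python
import operator
import types

# The approach described above.
builtin_function_or_method = type(len)

# The named alias from the standard library refers to the same type object.
same_type = types.BuiltinFunctionType is builtin_function_or_method

# operator.add is an instance of that type on CPython.
add_matches = isinstance(operator.add, types.BuiltinFunctionType)
```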

Regarding operators like attrgetter and methodcaller, could there be a security implication by allowing them? Sure, users can decide not to trust that type, but whether it is trustworthy or not depends on its arguments. Right now, we just tell them that they need to trust _operator.attrgetter, which could seem harmless. The same argument can of course be made for eval & co, but there, users will be much more likely to recognize their danger.
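
To make the concern concrete, here is a small sketch using only the standard library. Two methodcaller objects share the same type, yet one only reads and the other mutates its target, so trusting the type says little about what a given instance will do. The variable names are invented for the example.

```python
import operator

data = [3, 1, 2]

# Same type, different arguments, very different effects.
harmless = operator.methodcaller("count", 1)
destructive = operator.methodcaller("clear")
same_type = type(harmless) is type(destructive)

count = harmless(data)  # only reads the list
destructive(data)       # empties the list in place

# attrgetter follows dotted paths, so it can reach nested attributes.
type_name = operator.attrgetter("__class__.__name__")(data)
```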

@BenjaminBossan
Collaborator Author

Ready for review @skops-dev/maintainers

Member

@adrinjalali adrinjalali left a comment

Another alternative is to call __new__ with the param we need.

@BenjaminBossan
Collaborator Author

Another alternative is to call __new__ with the param we need.

Do you mean changing the line

instance = cls.__new__(cls)

in ObjectNode? This was my first attempt (which also worked), but it required adding a bunch of if ... else branches to the generic object path, which made the resulting code harder to understand. Using a special dispatch function seemed to be the better way.

instead of this, did you test what happens if we use the reduce loader we already have?

I tried it and it can be made to work. Here is a list of changes:

  1. move the reduce_get_state and ReduceNode to _general.py
  2. write custom operator_reduce_get_state and OperatorReduceNode (since the generic versions are not intended to be used directly), which are about the same length as the current operator_func_get_state and OperatorFuncNode
  3. for operator.add et al, we still need to use the type approach

Overall, it's the same amount of code but with more indirection, and more to rewrite should we ever get rid of reduce_get_state and ReduceNode. Therefore, I would slightly prefer the existing suggestion. I have stashed the other solution, though, in case there are better arguments for making the switch.
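
For context, the reduce-based idea can be sketched with plain Python. This is a minimal illustration and not the skops ReduceNode code. It relies on methodcaller objects supporting __reduce__, which returns a constructor and the arguments needed to rebuild the object, and it assumes a methodcaller without keyword arguments.

```python
import operator

original = operator.methodcaller("upper")

# __reduce__ returns a tuple whose first two items are a callable and the
# positional arguments to pass to it in order to recreate the object.
reduced = original.__reduce__()
constructor, args = reduced[0], reduced[1]

# A reduce-style loader would store these two pieces and later call them.
rebuilt = constructor(*args)
```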

Member

@adrinjalali adrinjalali left a comment

One issue I noticed is that the operator module includes too much. It sounds like an innocent module, but then it has things like call, setitem, delitem, and in-place manipulators. I wanna make sure those are not supported out of the box, and certainly not as trusted.
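
A short sketch of what these functions do, using only the standard library and invented variable names. Each call below changes its first argument in place, which is the kind of side effect a loader should not enable silently.

```python
import operator

settings = {"debug": False, "token": "abc"}
values = [1, 2, 3]
alias = values  # second reference to the same list

# setitem and delitem modify the mapping they are given.
operator.setitem(settings, "debug", True)
operator.delitem(settings, "token")

# iadd on a list extends it in place, so every reference sees the change.
operator.iadd(values, [4])

# operator.call only exists on Python 3.11 and later.
has_call = hasattr(operator, "call")
```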

Collaborator Author

@BenjaminBossan BenjaminBossan left a comment

One issue I noticed is that the operator module includes too much. It sounds like an innocent module, but then it has things like call, setitem, delitem, and in-place manipulators. I wanna make sure those are not supported out of the box, and certainly not as trusted.

As is, users need to trust those explicitly; we haven't added them to the list of automatically trusted modules. Do you want me to write a test that checks that loading fails if they are not set as trusted by the user?

@adrinjalali
Member

A test to make sure those are not trusted would be nice. Generally, making sure things we know shouldn't be trusted are not trusted is a nice thing. I'm happy to do it here or in a separate PR, as you wish.

The bug was that the attrs of the operator were stored as is, instead of
storing their skops state. This worked in this instance because the
attrs were just a list of str, i.e. primitive types. But still we should
always store to (and restore from) the skops state.

Another unit test was added that tests that operators are not trusted by
default.
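
The idea behind the "not trusted by default" test can be sketched as an allow-list check. Everything below is hypothetical: DEFAULT_TRUSTED, find_untrusted, and the type names are placeholders chosen for the example and do not mirror the skops internals or its actual test.

```python
# Hypothetical default allow-list; the real one lives in skops and differs.
DEFAULT_TRUSTED = {"builtins.list", "builtins.str"}


def find_untrusted(used_types, trusted=()):
    """Return the type names that are neither default nor user-trusted."""
    allowed = DEFAULT_TRUSTED | set(trusted)
    return sorted(name for name in used_types if name not in allowed)


# Type names a loader might encounter in a stored object (illustrative).
used = ["builtins.str", "_operator.methodcaller"]

# Without an explicit opt-in, the operator type is reported as untrusted.
blocked = find_untrusted(used)

# After the user opts in, nothing is left to complain about.
remaining = find_untrusted(used, trusted=["_operator.methodcaller"])
```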
@BenjaminBossan
Collaborator Author

@adrinjalali I added tests as suggested. I also found a somewhat embarrassing bug; I'm not sure whether we could have added some checks to prevent this type of bug in the future.

Member

@adrinjalali adrinjalali left a comment

Otherwise LGTM.

As for the bug you found and the test for it, it's a good question.

i* operators and for Python 3.11+, call.
Member

@adrinjalali adrinjalali left a comment

Hope this doesn't blow up 😁

@adrinjalali adrinjalali merged commit 872931b into skops-dev:main Mar 6, 2023
@BenjaminBossan
Collaborator Author

Hope this doesn't blow up

What do you mean? Security-wise? Maybe we can better document that users shouldn't blindly trust every type and what builtin types to look out for.

@BenjaminBossan BenjaminBossan deleted the FIX-283-support-operators-persistence branch March 6, 2023 09:23
Development

Successfully merging this pull request may close these issues.

TypeError when loading model using methodcaller
2 participants