-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX: Support operators persistence #287
FIX: Support operators persistence #287
Conversation
Fixes skops-dev#283 Description Some of the operator functions can be quite useful sklearn users, especially in conjunction with FunctionTransformer. This is because it often allows to avoid writing small functions or using lambda, which can't be persisted. However, since some of these functions are actually classes defined in CPython that need special treatment, our code didn't work with them. This fix consists of two parts. On the one hand, functions like methodcaller now have a special dispatch type that correctly knows how to initialize them. On the other hand, we had a problem with some funcs like operator.add. Since they take 2 arguments, it would be useful for them to be passed without initialization in conjunction with partial. However, their type is builtin_function_or_method. By default, they would be handled by object_get_state, but that doesn't work correctly with uninitialized objects. This has now been fixed. Open questions To get the type of operator.add et al, we need builtin_function_or_method. Right now, I just do builtin_function_or_method = type(len). Is there a better way? Regarding operators like attrgetter and methodcaller, could there be a security implication by allowing them? Sure, users can decide not to trust that type, but whether it is trustworthy or not depends on its arguments. Right now, we just tell them that they need to trust _operator.attrgetter, which could seem harmless.
Ready for review @skops-dev/maintainers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another alternative is to call __new__
with the param we need.
Do you mean changing the line
in
I tried it and it can be made to work. Here are a list of changes:
Overall, it's the same amount of code but with more indirection, and more to rewrite should we ever get rid of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One issue I noticed, is that operator
module includes too much. It sounds like an innocent module, but then it has things like call
, setitem
, deltem
, and inplace manipulators. I wanna make sure those are not supported out of the box, and certainly not as trusted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One issue I noticed, is that
operator
module includes too much. It sounds like an innocent module, but then it has things likecall
,setitem
,delitem
, and inplace manipulators. I wanna make sure those are not supported out of the box, and certainly not as trusted.
As is, users need to trust those, we haven't added them to the list of automatically trusted modules. Do you want me to write a test that checks that loading fails if they are not set as trusted by the user?
A test to make sure those are not trusted would be nice. Generally, making sure things we know shouldn't be trusted are not trusted is a nice thing. I'm happy to do it here or in a separate PR, as you wish. |
The bug was that the attrs of the operator were stored as is, instead of storing their skops state. This worked in this instance because the attrs were just a list of str, i.e. primitive types. But still we should always store to (and restore from) the skops state. Another unit test was added that tests that operators are not trusted by default.
@adrinjalali I added tests as suggested. I also found a somewhat embarrassing bug, not sure if we could have added some checks to prevent this type of bug in the future. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise LGTM.
As for the bug you found and the test for it, it's a good question.
i* operators and for Python 3.11+, call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hope this doesn't blow up 😁
What do you mean? Security-wise? Maybe we can better document that users shouldn't blindly |
Fixes #283
Description
Some of the operator functions can be quite useful sklearn users, especially in conjunction with
FunctionTransformer
. This is because it often allows to avoid writing small functions or usinglambda
, which can't be persisted. However, since some of these functions are actually classes defined in CPython that need special treatment, our code didn't work with them.This fix consists of two parts.
On the one hand, functions like
methodcaller
now have a special dispatch type that knows how to correctly initialize them.On the other hand, we had a problem with some types like
operator.add
. Since they take 2 arguments, it would be useful for them to be passed without initialization in conjunction withpartial
. However, their type isbuiltin_function_or_method
. By default, they would be handled byobject_get_state
, but that doesn't work correctly with uninitialized objects. This has now been fixed by dispatching totype_get_state
.Open questions
To get the type of
operator.add
et al., we needbuiltin_function_or_method
. Right now, I just dobuiltin_function_or_method = type(len)
. Is there a better way?Regarding operators like
attrgetter
andmethodcaller
, could there be a security implication by allowing them? Sure, users can decide not to trust that type, but whether it is trustworthy or not depends on its arguments. Right now, we just tell them that they need to trust_operator.attrgetter
, which could seem harmless. The same argument can of course be made foreval
& co, but there, users will be much more likely to recognize their danger.