FIX: Support operators persistence #287

BenjaminBossan · 2023-01-30T14:32:33Z

Fixes #283

Description

Some of the operator functions can be quite useful to sklearn users, especially in conjunction with FunctionTransformer, because they often make it possible to avoid writing small functions or using lambda, which can't be persisted. However, since some of these functions are actually classes defined in CPython that need special treatment, our code didn't work with them.
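As a rough illustration of the persistence problem, here is a minimal sketch that uses the standard pickle module as a stand-in (an assumption for the sake of a self-contained example; it is not the skops format). The helper name can_pickle is hypothetical.

```python
import operator
import pickle


def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A lambda has no importable name, so pickle refuses to serialize it.
strip_lambda = lambda s: s.strip()

# methodcaller("strip") expresses the same call and can be serialized.
strip_op = operator.methodcaller("strip")

lambda_ok = can_pickle(strip_lambda)
op_ok = can_pickle(strip_op)

# Round trip: the restored object behaves like the original.
restored = pickle.loads(pickle.dumps(strip_op))
```

The same reasoning is what makes operator.methodcaller attractive as the func argument of FunctionTransformer.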

This fix consists of two parts.

On the one hand, functions like methodcaller now have a special dispatch type that knows how to correctly initialize them.

On the other hand, we had a problem with some types like operator.add. Since they take 2 arguments, it would be useful for them to be passed without initialization in conjunction with partial. However, their type is builtin_function_or_method. By default, they would be handled by object_get_state, but that doesn't work correctly with uninitialized objects. This has now been fixed by dispatching to type_get_state.
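The difference between the two kinds of operator objects can be checked directly. This is a small sketch under CPython; the variable names are only for illustration.

```python
import operator
from functools import partial

# methodcaller (like attrgetter) is a class: calling it creates an instance
# that stores its arguments.
upper = operator.methodcaller("upper")
methodcaller_is_class = isinstance(operator.methodcaller, type)

# operator.add is a plain builtin function, the same type as len.
add_type_name = type(operator.add).__name__
same_type_as_len = type(operator.add) is type(len)

# Binary operators are typically passed uninitialized and bound with partial.
add_ten = partial(operator.add, 10)
result = add_ten(5)
```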

Open questions

To get the type of operator.add et al., we need builtin_function_or_method. Right now, I just do builtin_function_or_method = type(len). Is there a better way?
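One possible answer, offered as a suggestion rather than as what the PR ended up using: the standard types module already exposes this type as types.BuiltinFunctionType, which is the same object as type(len).

```python
import operator
import types

# The approach described above.
builtin_function_or_method = type(len)

# The types module exposes the very same type object under a public name.
same_object = types.BuiltinFunctionType is builtin_function_or_method

# operator.add is an instance of it.
add_is_builtin = isinstance(operator.add, types.BuiltinFunctionType)
```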

Regarding operators like attrgetter and methodcaller, could there be a security implication by allowing them? Sure, users can decide not to trust that type, but whether it is trustworthy or not depends on its arguments. Right now, we just tell them that they need to trust _operator.attrgetter, which could seem harmless. The same argument can of course be made for eval & co, but there, users will be much more likely to recognize their danger.
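To make the concern concrete, here is a small sketch showing that two methodcaller instances share one type, so a type-level trust decision cannot tell them apart, while their effect depends entirely on the stored arguments. The example method names are chosen only for illustration.

```python
import operator

# Both objects have exactly the same type ...
harmless = operator.methodcaller("upper")
destructive = operator.methodcaller("clear")
same_type = type(harmless) is type(destructive)

# ... but one only reads its argument while the other mutates it.
shouted = harmless("abc")
data = [1, 2, 3]
destructive(data)  # empties the list in place
```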

BenjaminBossan · 2023-01-30T14:42:30Z

Ready for review @skops-dev/maintainers

adrinjalali

Another alternative is to call __new__ with the param we need.

skops/io/_general.py

BenjaminBossan · 2023-01-31T10:25:06Z

Another alternative is to call __new__ with the param we need.

Do you mean changing the line

instance = cls.__new__(cls)

in ObjectNode? This was my first attempt (which also worked), but it required adding a bunch of if ... else branches to the generic object path, which made the resulting code harder to understand. Using a special dispatch function seemed to be the better way.
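For context, this sketch shows why the generic cls.__new__(cls) path does not work for these classes under CPython, and what passing the parameters to __new__ would look like. The helper bare_new_fails and the class Plain are hypothetical names used only here; the behaviour relies on CPython's C implementation of the operator module.

```python
import operator


def bare_new_fails(cls):
    """Return True if cls.__new__(cls) without arguments raises TypeError."""
    try:
        cls.__new__(cls)
    except TypeError:
        return True
    return False


class Plain:
    """An ordinary Python class, for comparison."""


# Ordinary classes can be created empty and filled in afterwards.
plain_fails = bare_new_fails(Plain)

# The operator classes do all of their setup in __new__ and insist on
# receiving their arguments there.
methodcaller_fails = bare_new_fails(operator.methodcaller)
attrgetter_fails = bare_new_fails(operator.attrgetter)

# Passing the parameters to __new__ yields a working object.
rebuilt = operator.methodcaller.__new__(operator.methodcaller, "upper")
```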

instead of this, did you test what happens if we use the reduce loader we already have?

I tried it and it can be made to work. Here is a list of the required changes:

move reduce_get_state and ReduceNode to _general.py
write custom operator_reduce_get_state and OperatorReduceNode (since the generic versions are not intended to be used directly), which are about the same length as the current operator_func_get_state and OperatorFuncNode
for operator.add et al., we still need to use the type approach

Overall, it's the same amount of code but with more indirection, and there would be more to rewrite should we ever get rid of reduce_get_state and ReduceNode. Therefore, I would slightly prefer the existing suggestion. I have stashed the other solution, though, in case there are better arguments for making the switch.
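The reduce-based alternative rests on the fact that these objects already describe how to rebuild themselves through __reduce__. The following sketch shows only that standard-library mechanism under CPython; it does not reproduce the skops reduce_get_state or ReduceNode code.

```python
import operator

caller = operator.methodcaller("replace", "a", "b")

# __reduce__ returns a callable plus the arguments needed to rebuild the
# object (here, with no keyword arguments, the class itself and its
# constructor arguments).
constructor, args = caller.__reduce__()

# Calling the constructor with those arguments gives an equivalent object.
rebuilt = constructor(*args)
replaced = rebuilt("banana")
```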

adrinjalali

One issue I noticed is that the operator module includes too much. It sounds like an innocent module, but then it has things like call, setitem, delitem, and inplace manipulators. I wanna make sure those are not supported out of the box, and certainly not as trusted.
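A short sketch of why these particular functions deserve extra care: unlike add or attrgetter, they modify the objects they are given. The sample data is made up for illustration; operator.call is left out of the code because it only exists on Python 3.11+.

```python
import operator

settings = {"debug": False}

# setitem and delitem change the mapping that is passed in.
operator.setitem(settings, "debug", True)
operator.setitem(settings, "extra", 1)
operator.delitem(settings, "extra")

# In-place operators such as iadd mutate their first argument when it is
# mutable and return that same object.
items = [1]
returned = operator.iadd(items, [2])
```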

skops/io/_general.py

BenjaminBossan

One issue I noticed, is that operator module includes too much. It sounds like an innocent module, but then it has things like call, setitem, delitem, and inplace manipulators. I wanna make sure those are not supported out of the box, and certainly not as trusted.

As is, users need to trust those explicitly; we haven't added them to the list of automatically trusted modules. Do you want me to write a test that checks that loading fails if they are not set as trusted by the user?

skops/io/_general.py

adrinjalali · 2023-02-28T10:17:55Z

A test to make sure those are not trusted would be nice. Generally, making sure things we know shouldn't be trusted are not trusted is a nice thing. I'm happy to do it here or in a separate PR, as you wish.

The bug was that the attrs of the operator were stored as is, instead of storing their skops state. This worked in this instance because the attrs were just a list of str, i.e. primitive types. But still we should always store to (and restore from) the skops state. Another unit test was added that tests that operators are not trusted by default.

BenjaminBossan · 2023-03-02T13:37:27Z

@adrinjalali I added tests as suggested. I also found a somewhat embarrassing bug. I'm not sure whether we could have added some checks to prevent this type of bug in the future.

adrinjalali

Otherwise LGTM.

As for the bug you found and the test for it, it's a good question.

skops/io/tests/test_persist.py

adrinjalali

Hope this doesn't blow up 😁

BenjaminBossan · 2023-03-06T09:23:40Z

Hope this doesn't blow up

What do you mean? Security-wise? Maybe we can better document that users shouldn't blindly trust every type and which builtin types to look out for.

BenjaminBossan added 3 commits January 30, 2023 15:26

Simplify code

7b151ee

Add changes.rst entry

9719e6a

BenjaminBossan mentioned this pull request Jan 30, 2023

Integrate skops serialization format huggingface/api-inference-community#174

Merged

adrinjalali reviewed Jan 30, 2023

BenjaminBossan added 3 commits January 31, 2023 11:26

Remove bytes from default trusted types

4501fd8

Copy & paste error.

Merge branch 'main' into FIX-283-support-operators-persistence

7b505c8

Merge branch 'main' into FIX-283-support-operators-persistence

6dd5dea

adrinjalali reviewed Feb 27, 2023

BenjaminBossan commented Feb 27, 2023

BenjaminBossan added 2 commits March 2, 2023 13:51

Merge branch 'main' into FIX-283-support-operators-persistence

3a18a26

adrinjalali reviewed Mar 3, 2023

Add more operators to testing

f80b134

i* operators and for Python 3.11+, call.

adrinjalali approved these changes Mar 6, 2023

adrinjalali merged commit 872931b into skops-dev:main Mar 6, 2023

BenjaminBossan deleted the FIX-283-support-operators-persistence branch March 6, 2023 09:23
