-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
skops can't load TargetEncoder object #450
Comments
Seems the core of the issue is that we don't support pandas yet, and it's been in my mind to do that. I'll draw up a PR to fix this. Thanks for the perfectly nicely written report BTW. |
The issue with this is that there doesn't seem to be any way to save/load a pandas object preserving all the data, w/o an external dependency. Need to figure out what to do here. Any thoughts @jorisvandenbossche |
Can you explain what you mean with "without an external dependency"? What's the exact issue you are running into? |
Here we avoid using Looking at pandas' code, I see that in a lot of cases Generally speaking, we're trying to avoid relying on I was hoping for something along the lines of I'm contemplating about adding that as a dependency, since it's small. Also, WDYT @BenjaminBossan |
I could imagine that fully supporting pandas would open up a big can of worms. If it's only about However, when it comes to the provided example, I saw that it uses |
The rest of Benjamin's comment might make it no longer relevant, but indeed for supporting this use case, I would first check what TargetEncoder exactly needs in terms of pandas support. For example, if it just stores the unique values as an pd.Index object (the error traceback above was about the index), it might be sufficient to support that through the existing support of numpy arrays (if you get the numpy array of an Index and serialize that / reconstruct it from that, I think that should already cover quite some of the use cases)
I think the main reason for pandas doing that is because the actual class constructor is often doing too much (like type inference, which we don't want on unpickle) |
I'd argue that it is. We support xgboost, catboost, quantile_forest, etc. Basically, the sklearn ecosystem, and sklearn compatible estimators are certainly in that ecosystem.
That's a very good point. We'd need to check what exactly needs to be persisted. (Maybe @lazarust would be happy to check?)
One can use this pattern for instance, to avoid calling the constructor: obj = Obj.__new__(**new_args)
obj.__setstate__(**state_from__getstate__) |
Yes, the idea is to also support the sklearn ecosystem. But I think that when it comes to this, it's a matter of trade-offs, i.e. how popular is the package vs how difficult is it to support it. In the end, it's up to you to make the call, however, I would certainly not want to support all of pandas just for |
I think sooner or later we need to support pandas since it's used in the ecosystem quite a lot, and this However, I'm tempted to use |
I'd be happy to look into what needs to persist (it may take me a week or so due to the holidays)!
I think that if we want to support Pandas more broadly, it may make more sense to use |
SKOPS is unable to load sklearn-pipeline having TargetEncoder object.
The code below helps to re-produce the issue.
Simple code for a classifier using Target Encoder
Using SKOPS to store into a file and get the unknown data types
Output of
unknown_type
Use SKOPS to load the file
Obtained the error
ENV Python 3.10.14
The text was updated successfully, but these errors were encountered: