-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
load model in Card class #174
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for contributing 🧡 could you please also write tests making sure that it would work for both a model, a model path containing a pkl
file, and a model path containing .skops
format?
skops/card/_model_card.py
Outdated
model_path = Path(model) | ||
if not model_path.exists(): | ||
raise ValueError("Model file does not exist") | ||
model = skops.io.load(model) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As of now, we also support pickle files. For this to work, it would be good to add a line that loads a pickle. skops.io.load() loads a model file with skops extension, which is essentially a .zip.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just added a few changes for "skops" as well as "pkl" file suuport
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking this.
Not a full review but a couple of points:
- We need to have a test for this. If you need help implementing the test, let us know.
- The docstring of the
Card
class needs to be updated. - Right now, the model is eagerly loaded during
__init__
. Recently, we made a move to prevent this (see MNT Model card refactor to move plotting & table #163). However, lazy loading could be costly because it would be done for each call to__repr__
. Maybe we can store the actual model in a different attribute (say,_model
) when it's called for the first time?
But we need to make sure we invalidate that whenever the |
This could be done with a setter but maybe we can come up with an easier solution? |
Yes sure, there will be same test function but instead of model instance we have to pass a tmp model file (one for skops and one for pkl).
On this as the same argument is supporting both model and model_path so in representation and in _generate_card we will be doing model loading. |
@p-mishra1 Just to make sure, what is the state of this PR? Are you waiting for input from our side? |
Yes @BenjaminBossan i changed the functionality to load skops file as well as pkl file and also mentioned some of the things about thr design choice. |
So here we would like to see unit tests for the new feature being added, similar to what we have here. Would you please work on that? We can help if needed.
So I prefer the second solution. When the model is first accessed, if it's a path, it is loaded, if it's already a model, then we're good. It should be stored on a private attribute like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this @p-mishra1
This is progressing in the right direction, but there is still some work to do. Most significantly, the initial consideration is still not addressed:
Right now, the model is eagerly loaded during
__init__
. Recently, we made a move to prevent this. However, lazy loading could be costly because it would be done for each call to__repr__
. Maybe we can store the actual model in a different attribute (say,_model
) when it's called for the first time?
Why is this important? Say, a user creates a card instance with model_path0
and then later re-assigns the model to model_path1
, the _load_model
function will not be called. Then code further down the line will fail.
I created a minimal example that shows how this problem can be solved with @property
and setter
:
import pickle
from typing import Any
from pathlib import Path
import joblib
from sklearn.linear_model import LinearRegression, LogisticRegression
from skops.io import load
def _load_model(model: Any) -> Any:
"""Loads the model if provided a file path, if already a model instance,
return it unmodified.
Parameters
----------
model : pathlib.path, str, or sklearn estimator
Path/str or the actual model instance. If a Path or str, loads the model on first call.
Returns
-------
model : object
Model instance.
"""
if not isinstance(model, (Path, str)):
return model
model_path = Path(model)
if not model_path.exists():
raise ValueError("Model file does not exist")
if model_path.suffix in (".pkl", ".pickle"):
# I don't think there is any advantage of joblib.load over pickle.load
model = joblib.load(model_path)
elif model_path.suffix == ".skops":
model = load(model_path)
else:
msg = f"Cannot interpret model suffix {model_path.suffix}, should be '.pkl', '.pickle' or '.skops'"
raise ValueError(msg)
return model
class Card:
def __init__(self, model):
self._model = model
@property
def model(self):
model = _load_model(self._model)
if model is not self._model:
self._model = model
return model
@model.setter
def model(self, model):
self._model = model
class TestCardModelAttribute:
def test_model_estimator(self):
model0 = LinearRegression()
card = Card(model0)
assert card.model is model0
# re-assigning the model works
model1 = LogisticRegression()
card.model = model1
assert card.model is model1
# re-assigning back to original works
card.model = model0
assert card.model is model0
def test_model_is_str_pickle(self, tmp_path):
model0 = LinearRegression(n_jobs=123)
f_name0 = tmp_path / "lin_reg.pickle"
with open(f_name0, "wb") as f:
pickle.dump(model0, f)
card = Card(f_name0)
assert isinstance(card.model, LinearRegression)
assert card.model.n_jobs == 123
# re-assigning the model works
model1 = LogisticRegression(n_jobs=456)
f_name1 = tmp_path / "log_reg.pickle"
with open(f_name1, "wb") as f:
pickle.dump(model1, f)
card.model = f_name1
assert isinstance(card.model, LogisticRegression)
assert card.model.n_jobs == 456
# re-assigning back to original works
card.model = f_name0
assert isinstance(card.model, LinearRegression)
assert card.model.n_jobs == 123
This example should work as is. As you can see, I simplified a lot, but it should demonstrate the general idea. Do you think you could refactor your code to work like this?
Also, as you can see, I added some specific tests for this feature. This is preferred over modifying all the existing tests to check both models passed as str or as actual estimators. So could you please undo your changes to the existing tests and instead add tests that are specific to this problem?
PS: We should also add the possibility to pass a pathlib.Path
, not only str
, since we generally support that throughout skops.
thanks @BenjaminBossan will update the changes |
d6c9856
to
3c93add
Compare
thanks @BenjaminBossan for the code snippet and detailed explanation i have added required changes, i also added a deleter for model. have a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for reworking this PR. We're getting closer to the finish line.
I did a light review of your PR and there are a couple of things that need to be changed.
Formatting
Could you revert the formatting changes you made? It looks like some tool you use (your IDE maybe?) reformatted the files you worked on to use 79 or 80 characters per line. However, we use black
with the default settings, which comes down to ~88 characters per line. If you undo these changes, the code will be much easier to review because the formatting changes distract from the real changes.
The best way to achieve this would be if you could first undo your formatting changes, then add black
via a pre-commit hook. This step is explained in our contribution pipeline but the gist is that you should do something like:
pip install pre-commit
pre-commit install
from within your Python environment. After doing this, each time when you commit, the code will be checked for black
compliance.
Testing
If I understand correctly, you have duplicated a number of tests. Examples are test_plot_model_false_from_path
and test_add_from_path
. The change you introduced was to have the card initialized from a file instead of passing the model directly, the rest is identical.
IMO, it is not necessary to duplicate all those tests, since the actual thing they're testing is not related to how the model was passed. Instead, it is sufficient to have tests that show that passing a model or a str
or a Path
do the same thing, and that modifying the .model
attribute works (as in the last test class (which btw. should also test deleting, which you correctly added)). The duplicated tests can then be removed.
Another suggestion I have, which is not as clear cut, is to move _load_model
out to be a standalone function and not a method on Card
. Then add a few tests just for this function. Once we know that this function works well, testing the Card
class will be much easier because we know that the model loading part is covered.
Test fixtures
I think there is a misunderstanding regarding the use of pytest fixtures. Let's take this example:
@pytest.mark.parametrize("suffix", [".pkl", ".pickle", ".skops"])
def test_save_model_card_from_path(destination_path, model_card_from_path):
model_card_from_path.save(Path(destination_path) / "README.md")
assert (Path(destination_path) / "README.md").exists()
Here you use the fixture suffix
but your test doesn't make use of it. This means that the exact same test will be run 3 times, with no additional value. The correct way to use fixtures is to create the decorator (as you did), then add the fixture name as an argument to the function, and then to use that argument in the function (e.g. in test_plot_model_false_from_path
you use the fixture correctly).
I hope this helps and if you have more questions feel free to ask us.
I used the contribution setup before raising the PR but i think my local black setup formatted the code before the being formatted by pre-commit.
okay, I will remove the duplicate tests and make standalone _load_model function and write tests for funtion as well as passing path.
Fixture parameterisation |
Ah I see, I didn't know about this new feature. It's kinda cool but also a bit magic. I wonder if it's really preferable over using |
Oh nice, finally pytest supporting this. Been waiting for it for ages. I personally find it more intuitive than using |
Created a fresh PR in #205 for this. |
#96 added a function in the Card class to load the model if provided a file path.
sorry if the implementation looks naive