load model in Card class #174

p-mishra1 · 2022-10-05T14:54:34Z

#96 added a function in the Card class to load the model if provided a file path.
sorry if the implementation looks naive

merveenoyan

Thanks a lot for contributing 🧡 could you please also write tests making sure that it would work for both a model, a model path containing a pkl file, and a model path containing .skops format?

merveenoyan · 2022-10-05T21:37:28Z

skops/card/_model_card.py

+            model_path = Path(model)
+            if not model_path.exists():
+                raise ValueError("Model file does not exist")
+            model = skops.io.load(model)


As of now, we also support pickle files. For this to work, it would be good to add a line that loads a pickle. skops.io.load() loads a model file with skops extension, which is essentially a .zip.

just added a few changes for "skops" as well as "pkl" file suuport

BenjaminBossan

Thanks for taking this.

Not a full review but a couple of points:

We need to have a test for this. If you need help implementing the test, let us know.
The docstring of the Card class needs to be updated.
Right now, the model is eagerly loaded during __init__. Recently, we made a move to prevent this (see MNT Model card refactor to move plotting & table #163). However, lazy loading could be costly because it would be done for each call to __repr__. Maybe we can store the actual model in a different attribute (say, _model) when it's called for the first time?

adrinjalali · 2022-10-06T09:38:45Z

Right now, the model is eagerly loaded during init. Recently, we made a move to prevent this (see #163). However, lazy loading could be costly because it would be done for each call to repr. Maybe we can store the actual model in a different attribute (say, _model) when it's called for the first time?

But we need to make sure we invalidate that whenever the _model changes.

BenjaminBossan · 2022-10-06T11:23:41Z

But we need to make sure we invalidate that whenever the _model changes.

This could be done with a setter but maybe we can come up with an easier solution?

p-mishra1 · 2022-10-06T12:56:27Z

We need to have a test for this. If you need help implementing the test, let us know.

Yes sure, there will be same test function but instead of model instance we have to pass a tmp model file (one for skops and one for pkl).

Right now, the model is eagerly loaded during __init__. Recently, we made a move to prevent this (see MNT Model card refactor to move plotting & table #163). However, lazy loading could be costly because it would be done for each call to __repr__. Maybe we can store the actual model in a different attribute (say, _model) when it's called for the first time?

On this as the same argument is supporting both model and model_path so in representation and in _generate_card we will be doing model loading.
else we can do it in representation on 1st call and set _model to model instance but in that case both model and '_model' will have the instance when user will provide model instance to Card class.

BenjaminBossan · 2022-10-11T08:45:02Z

@p-mishra1 Just to make sure, what is the state of this PR? Are you waiting for input from our side?

p-mishra1 · 2022-10-11T15:25:29Z

Yes @BenjaminBossan i changed the functionality to load skops file as well as pkl file and also mentioned some of the things about thr design choice.
Waiting for input from you guys about that.

BenjaminBossan · 2022-10-12T08:32:08Z

We need to have a test for this. If you need help implementing the test, let us know.

Yes sure, there will be same test function but instead of model instance we have to pass a tmp model file (one for skops and one for pkl).

So here we would like to see unit tests for the new feature being added, similar to what we have here. Would you please work on that? We can help if needed.

On this as the same argument is supporting both model and model_path so in representation and in _generate_card we will be doing model loading.
else we can do it in representation on 1st call and set _model to model instance but in that case both model and '_model' will have the instance when user will provide model instance to Card class.

So I prefer the second solution. When the model is first accessed, if it's a path, it is loaded, if it's already a model, then we're good. It should be stored on a private attribute like _model. Then the next time, the model is already loaded and we don't need to load it again. When the user changes the model attribute, we need to invalidate _model, since it could now be a different model.

BenjaminBossan

Thank you for working on this @p-mishra1

This is progressing in the right direction, but there is still some work to do. Most significantly, the initial consideration is still not addressed:

Right now, the model is eagerly loaded during __init__. Recently, we made a move to prevent this. However, lazy loading could be costly because it would be done for each call to __repr__. Maybe we can store the actual model in a different attribute (say, _model) when it's called for the first time?

Why is this important? Say, a user creates a card instance with model_path0 and then later re-assigns the model to model_path1, the _load_model function will not be called. Then code further down the line will fail.

I created a minimal example that shows how this problem can be solved with @property and setter:

import pickle
from typing import Any
from pathlib import Path
import joblib
from sklearn.linear_model import LinearRegression, LogisticRegression
from skops.io import load


def _load_model(model: Any) -> Any:
    """Loads the model if provided a file path, if already a model instance,
    return it unmodified.

    Parameters
    ----------
    model : pathlib.path, str, or sklearn estimator
        Path/str or the actual model instance. If a Path or str, loads the model on first call.

    Returns
    -------
    model : object
        Model instance.

    """
    if not isinstance(model, (Path, str)):
        return model

    model_path = Path(model)
    if not model_path.exists():
        raise ValueError("Model file does not exist")

    if model_path.suffix in (".pkl", ".pickle"):
        # I don't think there is any advantage of joblib.load over pickle.load
        model = joblib.load(model_path)
    elif model_path.suffix == ".skops":
        model = load(model_path)
    else:
        msg = f"Cannot interpret model suffix {model_path.suffix}, should be '.pkl', '.pickle' or '.skops'"
        raise ValueError(msg)

    return model


class Card:
    def __init__(self, model):
        self._model = model

    @property
    def model(self):
        model = _load_model(self._model)
        if model is not self._model:
            self._model = model
        return model

    @model.setter
    def model(self, model):
        self._model = model


class TestCardModelAttribute:
    def test_model_estimator(self):
        model0 = LinearRegression()

        card = Card(model0)
        assert card.model is model0

        # re-assigning the model works
        model1 = LogisticRegression()
        card.model = model1
        assert card.model is model1

        # re-assigning back to original works
        card.model = model0
        assert card.model is model0

    def test_model_is_str_pickle(self, tmp_path):
        model0 = LinearRegression(n_jobs=123)
        f_name0 = tmp_path / "lin_reg.pickle"
        with open(f_name0, "wb") as f:
            pickle.dump(model0, f)

        card = Card(f_name0)
        assert isinstance(card.model, LinearRegression)
        assert card.model.n_jobs == 123

        # re-assigning the model works
        model1 = LogisticRegression(n_jobs=456)
        f_name1 = tmp_path / "log_reg.pickle"
        with open(f_name1, "wb") as f:
            pickle.dump(model1, f)

        card.model = f_name1
        assert isinstance(card.model, LogisticRegression)
        assert card.model.n_jobs == 456

        # re-assigning back to original works
        card.model = f_name0
        assert isinstance(card.model, LinearRegression)
        assert card.model.n_jobs == 123

This example should work as is. As you can see, I simplified a lot, but it should demonstrate the general idea. Do you think you could refactor your code to work like this?

Also, as you can see, I added some specific tests for this feature. This is preferred over modifying all the existing tests to check both models passed as str or as actual estimators. So could you please undo your changes to the existing tests and instead add tests that are specific to this problem?

PS: We should also add the possibility to pass a pathlib.Path, not only str, since we generally support that throughout skops.

p-mishra1 · 2022-10-25T06:50:18Z

thanks @BenjaminBossan will update the changes

This reverts commit d6c9856.

p-mishra1 · 2022-10-25T17:10:45Z

thanks @BenjaminBossan for the code snippet and detailed explanation i have added required changes, i also added a deleter for model. have a look.

BenjaminBossan

Thanks a lot for reworking this PR. We're getting closer to the finish line.

I did a light review of your PR and there are a couple of things that need to be changed.

Formatting

Could you revert the formatting changes you made? It looks like some tool you use (your IDE maybe?) reformatted the files you worked on to use 79 or 80 characters per line. However, we use black with the default settings, which comes down to ~88 characters per line. If you undo these changes, the code will be much easier to review because the formatting changes distract from the real changes.

The best way to achieve this would be if you could first undo your formatting changes, then add black via a pre-commit hook. This step is explained in our contribution pipeline but the gist is that you should do something like:

pip install pre-commit
pre-commit install

from within your Python environment. After doing this, each time when you commit, the code will be checked for black compliance.

Testing

If I understand correctly, you have duplicated a number of tests. Examples are test_plot_model_false_from_path and test_add_from_path. The change you introduced was to have the card initialized from a file instead of passing the model directly, the rest is identical.

IMO, it is not necessary to duplicate all those tests, since the actual thing they're testing is not related to how the model was passed. Instead, it is sufficient to have tests that show that passing a model or a str or a Path do the same thing, and that modifying the .model attribute works (as in the last test class (which btw. should also test deleting, which you correctly added)). The duplicated tests can then be removed.

Another suggestion I have, which is not as clear cut, is to move _load_model out to be a standalone function and not a method on Card. Then add a few tests just for this function. Once we know that this function works well, testing the Card class will be much easier because we know that the model loading part is covered.

Test fixtures

I think there is a misunderstanding regarding the use of pytest fixtures. Let's take this example:

@pytest.mark.parametrize("suffix", [".pkl", ".pickle", ".skops"])
def test_save_model_card_from_path(destination_path, model_card_from_path):
    model_card_from_path.save(Path(destination_path) / "README.md")
    assert (Path(destination_path) / "README.md").exists()

Here you use the fixture suffix but your test doesn't make use of it. This means that the exact same test will be run 3 times, with no additional value. The correct way to use fixtures is to create the decorator (as you did), then add the fixture name as an argument to the function, and then to use that argument in the function (e.g. in test_plot_model_false_from_path you use the fixture correctly).

I hope this helps and if you have more questions feel free to ask us.

p-mishra1 · 2022-10-26T14:28:18Z

Formatting

Could you revert the formatting changes you made? It looks like some tool you use (your IDE maybe?) reformatted the files you worked on to use 79 or 80 characters per line. However, we use black with the default settings, which comes down to ~88 characters per line. If you undo these changes, the code will be much easier to review because the formatting changes distract from the real changes.

The best way to achieve this would be if you could first undo your formatting changes, then add black via a pre-commit hook. This step is explained in our contribution pipeline but the gist is that you should do something like:
pip install pre-commit
pre-commit install
from within your Python environment. After doing this, each time when you commit, the code will be checked for black compliance.

I used the contribution setup before raising the PR but i think my local black setup formatted the code before the being formatted by pre-commit.

Testing

If I understand correctly, you have duplicated a number of tests. Examples are test_plot_model_false_from_path and test_add_from_path. The change you introduced was to have the card initialized from a file instead of passing the model directly, the rest is identical.

IMO, it is not necessary to duplicate all those tests, since the actual thing they're testing is not related to how the model was passed. Instead, it is sufficient to have tests that show that passing a model or a str or a Path do the same thing, and that modifying the .model attribute works (as in the last test class (which btw. should also test deleting, which you correctly added)). The duplicated tests can then be removed.

Another suggestion I have, which is not as clear cut, is to move _load_model out to be a standalone function and not a method on Card. Then add a few tests just for this function. Once we know that this function works well, testing the Card class will be much easier because we know that the model loading part is covered.

okay, I will remove the duplicate tests and make standalone _load_model function and write tests for funtion as well as passing path.

Test fixtures

I think there is a misunderstanding regarding the use of pytest fixtures. Let's take this example:
@pytest.mark.parametrize("suffix", [".pkl", ".pickle", ".skops"])
def test_save_model_card_from_path(destination_path, model_card_from_path):
    model_card_from_path.save(Path(destination_path) / "README.md")
    assert (Path(destination_path) / "README.md").exists()
Here you use the fixture suffix but your test doesn't make use of it. This means that the exact same test will be run 3 times, with no additional value. The correct way to use fixtures is to create the decorator (as you did), then add the fixture name as an argument to the function, and then to use that argument in the function (e.g. in test_plot_model_false_from_path you use the fixture correctly).

Fixture parameterisation
here the pytest fixture model_card_from_path has a param named suffix and pytest is implicitly passing the values. have a look at the above stackoverflow link, so the pytest behaviour is correct(i checked the test locally and verfied the output and behaviour before the PR ).

BenjaminBossan · 2022-10-26T14:40:42Z

here the pytest fixture model_card_from_path has a param named suffix and pytest is implicitly passing the values. have a look at the above stackoverflow link, so the pytest behaviour is correct(i checked the test locally and verfied the output and behaviour before the PR ).

Ah I see, I didn't know about this new feature. It's kinda cool but also a bit magic. I wonder if it's really preferable over using request.param to parametrize a fixture.

adrinjalali · 2022-10-28T10:07:59Z

Oh nice, finally pytest supporting this. Been waiting for it for ages. I personally find it more intuitive than using request.param. But everywhere this pattern is used, I'd leave a comment saying it's a parameter passed to the fixture for future developers to figure it out easily.

p-mishra1 · 2022-10-28T15:43:13Z

Created a fresh PR in #205 for this.

p-mishra1 mentioned this pull request Oct 5, 2022

Take model as argument to generate the card #96

Closed

p-mishra1 closed this Oct 5, 2022

p-mishra1 reopened this Oct 5, 2022

merveenoyan requested changes Oct 5, 2022

View reviewed changes

merveenoyan requested a review from adrinjalali October 5, 2022 21:40

BenjaminBossan reviewed Oct 6, 2022

View reviewed changes

p-mishra1 requested review from BenjaminBossan and removed request for adrinjalali October 20, 2022 16:11

BenjaminBossan requested changes Oct 24, 2022

View reviewed changes

p-mishra1 and others added 6 commits October 25, 2022 20:59

load model in Card class

8510d9e

doc failing resolution

4367c13

support for pkl model file

8f919a0

test addtion for card class

da8d302

Revert "test addtion for card class"

4051cff

This reverts commit d6c9856.

load model from path functionality and test addition

3c93add

p-mishra1 force-pushed the praj-model-loading branch from d6c9856 to 3c93add Compare October 25, 2022 17:04

p-mishra1 requested a review from BenjaminBossan October 25, 2022 17:07

BenjaminBossan requested changes Oct 26, 2022

View reviewed changes

p-mishra1 closed this Oct 28, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

load model in Card class #174

load model in Card class #174

p-mishra1 commented Oct 5, 2022

merveenoyan left a comment

merveenoyan Oct 5, 2022

p-mishra1 Oct 6, 2022

BenjaminBossan left a comment

adrinjalali commented Oct 6, 2022

BenjaminBossan commented Oct 6, 2022

p-mishra1 commented Oct 6, 2022 •

edited

Loading

BenjaminBossan commented Oct 11, 2022

p-mishra1 commented Oct 11, 2022 •

edited

Loading

BenjaminBossan commented Oct 12, 2022

BenjaminBossan left a comment

p-mishra1 commented Oct 25, 2022

p-mishra1 commented Oct 25, 2022

BenjaminBossan left a comment

p-mishra1 commented Oct 26, 2022

Formatting

Testing

Test fixtures

BenjaminBossan commented Oct 26, 2022

adrinjalali commented Oct 28, 2022

p-mishra1 commented Oct 28, 2022

load model in Card class #174

load model in Card class #174

Conversation

p-mishra1 commented Oct 5, 2022

merveenoyan left a comment

Choose a reason for hiding this comment

merveenoyan Oct 5, 2022

Choose a reason for hiding this comment

p-mishra1 Oct 6, 2022

Choose a reason for hiding this comment

BenjaminBossan left a comment

Choose a reason for hiding this comment

adrinjalali commented Oct 6, 2022

BenjaminBossan commented Oct 6, 2022

p-mishra1 commented Oct 6, 2022 • edited Loading

BenjaminBossan commented Oct 11, 2022

p-mishra1 commented Oct 11, 2022 • edited Loading

BenjaminBossan commented Oct 12, 2022

BenjaminBossan left a comment

Choose a reason for hiding this comment

p-mishra1 commented Oct 25, 2022

p-mishra1 commented Oct 25, 2022

BenjaminBossan left a comment

Choose a reason for hiding this comment

Formatting

Testing

Test fixtures

p-mishra1 commented Oct 26, 2022

Formatting

Testing

Test fixtures

BenjaminBossan commented Oct 26, 2022

adrinjalali commented Oct 28, 2022

p-mishra1 commented Oct 28, 2022

p-mishra1 commented Oct 6, 2022 •

edited

Loading

p-mishra1 commented Oct 11, 2022 •

edited

Loading