-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH permutation importance #142
ENH permutation importance #142
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some specific comments for this PR and then a more general discussion.
For this PR, I would like to see some tests added. Also, I have a few comments for some lines of code, please check.
For the more general discussion, I want to ask the following question: Do we want to design skops model cards such that we add a bunch of narrow methods such as add_feature_importances
or do we just want to have general methods like add_plot
and require the user to do the work inside the method themselves?
E.g. for adding cross-validation results, we have put the burden on the user to process the data and then use the generic add_table
method:
skops/examples/plot_model_card.py
Lines 151 to 166 in cfea7cb
cv_results = model.cv_results_ | |
clf_report = classification_report( | |
y_test, y_pred, output_dict=True, target_names=["malignant", "benign"] | |
) | |
# The classification report has to be transformed into a DataFrame first to have | |
# the correct format. This requires removing the "accuracy", which was added | |
# above anyway. | |
del clf_report["accuracy"] | |
clf_report = pd.DataFrame(clf_report).T.reset_index() | |
model_card.add_table( | |
folded=True, | |
**{ | |
"Hyperparameter search results": cv_results, | |
"Classification report": clf_report, | |
}, | |
) |
For consistency, this should either be its own method,add_cv_results
, or add_feature_importances
should not be its own method.
The arguments for not providing specific methods are:
- Less work for us (though we should still provide examples)
- More flexibility for the user. E.g. here, the user cannot change the parameters of the plot, e.g. the what the title should be or what the name of the plot should be.
- Don't need to worry about pandas dependency etc.
The argument for providing the specific methods:
- Easier for the user if they don't want to modify the defaults.
- Easier to discover than just providing examples.
skops/card/_model_card.py
Outdated
@@ -10,6 +10,8 @@ | |||
from reprlib import Repr | |||
from typing import Any, Optional, Union | |||
|
|||
import matplotlib.pyplot as plt | |||
import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here it is assumed that pandas is installed. However, pandas is not a strict dependency for skops. I think we specifically didn't want to add pandas because it is a "fat" dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I thought of it as well, I will try to plot without it.
@BenjaminBossan I had a quick huddle last week with @adrinjalali where we discussed this and he told me to create a method to add feature importance graphs. I can remove this and add it to model card example instead. |
If both of you agree that this is a good thing to have, I'm also fine with it. I just wanted to make sure that this decision is made deliberately, after taking into account the tradeoffs. Especially, it will probably result in quite a few very similar methods being added in the future (like the example of adding CV results). |
@BenjaminBossan I sort of agree and think the dependencies are a bit much if we want to move forward (I'll handle stuff with numpy instead). Would you like to wait for @adrinjalali to come back from his vacay to make a decision? |
If you can manage to remove the pandas dependency and if you already decided with Adrin that this would be a good feature to have, I think we can move forward. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @merveenoyan
skops/card/_model_card.py
Outdated
@@ -366,6 +368,33 @@ def add_metrics(self, **kwargs: str) -> "Card": | |||
self._eval_results[metric] = value | |||
return self | |||
|
|||
def add_feature_importances(self, feature_importances) -> "Card": | |||
"""Visualize permutation importance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in terms of API, it would make sense to accept importances in a certain way, like accepting both a dict and a pandas dataframe, and then add it to the modelcard.
It doesn't have to be permutation importance.
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a couple of comments, please take a look.
Also, I have a more general comment:
What do we do if a user wants to add more than 1 permutation importance graph? E.g. one for accuracy and one for recall? Right now, this would not work because the file name feature_importances.png
would be re-used, so we would get the same graph twice. Also, the section name is hard-coded, so the two sections would have the exact same name. I think we should have a solution for this.
Fixes #104 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work, by adding the new arguments, my concerns could be addressed. There are still a few improvements, please see my comments. Also, I think it would be a good idea to update the plot_model_card.py
and perhaps also add a sentence to docs/model_card.rst
.
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
@BenjaminBossan codecov fails 😅🥲 |
@merveenoyan Re-running solved the codecov issue. Not sure why codecov hates you specifically :D Could you please merge the current main branch, as have now added CI for Python 3.11? Then I'll review again. |
@BenjaminBossan merged |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, again I have a few small comments. But I swear we're getting really close :)
skops/conftest.py
Outdated
@@ -16,3 +18,30 @@ def mock_import(name, *args, **kwargs): | |||
|
|||
with patch("builtins.__import__", side_effect=mock_import): | |||
yield | |||
|
|||
import matplotlib # noqa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need this line. First of all, it's pandas here, not matplotlib, second, we don't delete pandas from the import cache, so re-importing is not necessary.
skops/io/_audit.py
Outdated
@@ -2,7 +2,7 @@ | |||
|
|||
import io | |||
from contextlib import contextmanager | |||
from typing import Any, Generator, Literal, Sequence, Type, Union | |||
from typing import Any, Generator, Literal, Sequence, Type, Union # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the # type: ignore
here really necessary?
skops/utils/importutils.py
Outdated
|
||
|
||
def import_or_raise(module, feature_name): | ||
"""Raise error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a better description here.
module = import_module(module) | ||
except ImportError as e: | ||
package = module.split(".")[0] | ||
raise ModuleNotFoundError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my understanding: Why not a simple ImportError
?
@merveenoyan Some tests are failing. This is mostly fixed after merging with current main branch. Could you please do that? |
@BenjaminBossan should work now (magically ✨) but I realized pytest doesn't collect tests of parser for some reason on mac (assuming it's same on windows) (that's why nothing failed on my local, I guess) I checked github workflow and looked for
do you know what's going on? |
Probably it's just because you haven't installed pandoc locally? This line makes it so that tests are skipped when pandoc is not found: skops/skops/card/tests/test_parser.py Lines 14 to 18 in d9b7c36
The reason why it runs on Ubuntu CI is: skops/.github/workflows/build-test.yml Lines 65 to 67 in d9b7c36
I haven't added pandoc to the other builds because it's less trivial to install it on Mac and Windows and, as you mentioned, Ubuntu is the fastest. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are still a few # type: ignore
s in here which I think can be removed, apart from that it's ready to be merged from my perspective.
@BenjaminBossan I re-ran for codecov but no chance, it's failing 🥲🥲🥲🥲 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think now we're good 🎉
I implemented permutation importance, will write tests if you like where it goes.
I have two problems:
matplotlib
being a dependency here? I'll look for something better. Same applies forpandas
.feature_names_in_
toget_feature_names_out()
.The example script looks like below (I'll change layout of plot to fit inside the plot itself):