-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alternative Card implementation #203
Alternative Card implementation #203
Conversation
This is a suggestion to address the issues discussed on skops-dev#72 Description The proposed model card implementation would allow to dynamically add sections or overwrite them. This is not a complete implementation but already covers most of the features we already have and then some. On top of these features, it would be possible to add more features like creating a default Card with placeholders, just like the exisint template, or the possibility to delete existing sections or to retrieve the result of a certain section. Implementation The underlying data structure consists of a dict and a Section dataclass. All data is stored in a _data attribute with the type dict[str, Section]. The dataclass hold the section contents, i.e. the section title, the section content, and subsections, which again have the same type. It's thus recursive data structure. Section title and dict key are identical, which is mostly for convenience. With this refactor, there are no separate data containers anymore for eval results, template sections, extra sections, etc. They are all treated the same. IMHO, this greatly simplifies the code overall. The only complex function that's left is the one needed to traverse the tree holding the data, and even that is just 14 LOC. Demo To see how the new class can be used, take a look at the main function. The resulting Card can be seen here: https://huggingface.co/skops-ci/hf_hub_example-fcc0d6fe-d072-4f94-8fdb-6bf3bb917bca
RFC @skops-dev/maintainers @E-Aho Don't hesitate to tell me if you think this totally sucks ;) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I think this is great! In my opinion it looks like it would be a lot more flexible than using .md templates.
I had a couple things I thought might be nice additions, but overall I really like this implementation :)
skops/card/_card_alternative.py
Outdated
# add arbitrary sections, overwrite them, etc. | ||
card.add(hi="howdy") | ||
card.add(**{"parent section/child section": "child content"}) | ||
card.add(**{"foo": "bar", "spam": "eggs"}) | ||
# change content of "hi" section | ||
card.add(**{"hi/german": "guten tag", "hi/french": "salut"}) | ||
card.add(**{"very/deeply/nested/section": "but why?"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love this usage pattern ❤️ feels much easier to add to programatically
skops/card/_card_alternative.py
Outdated
self.model_diagram = model_diagram | ||
self.metadata = metadata or CardData() | ||
|
||
self._data: dict[str, Section] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thought: Might be worth using an OrderedDict
for this (and for the subsections
field in the Section
class.)
That way it would allow us to have more control over the ordering of the output. I know that as of 3.7, the order of the keys in a dict
is just the order of insertion, but it might be nice to offer users the ability to insert between existing sections.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to support that? I see it as "the card is rendered in the order you're adding things" kinda API, especially since we don't expose this _data
to the users as a public API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, one weakness of the chosen data structure is that it's hard to change the order of sections once they are created. However, I'm not sure how much OrderedDict
would help. I haven't used it in quite a while, but from the docs, the only additional methods it seems to have are move_to_end
and popitem(last=...)
, both of which don't help that much? Ideally, we would have something like:
card = Card(...)
card.add(section_a="content a", section_c="content c")
card.insert(section_b="content b", after="section_a")
or something. Not sure if that's easier to implement with OrderedDict
than with normal dicts.
Alternatively, we could implement the underlying tree as a nested list, which gives us more flexibility to modify the order:
>>> card._data
[
("section_a", "content a"),
("section_b", [
"content b",
("section_b0", "content_b0"),
("section_b1", "content_b1"),
],
...,
]
The disadvantage is that selecting items will be more difficult an inefficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to support that? I see it as "the card is rendered in the order you're adding things" kinda API, especially since we don't expose this _data to the users as a public API.
I think it would be fine to not support changing ordering for an MVP, but it would be at least a nice-to-have down the road imo. Especially if the default Card
class users handle has a fixed order for the defaults. Users might want to insert things between other sections.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The disadvantage is that selecting items will be more difficult an inefficient.
I may have been remembering the abilities of OrderedDicts with rose tinted glasses...
Another alternative that might be a little simpler would be to hold a subsection_order
attribute in each Section
that is just a list with the names of children, and renders them in that order.
So if we rendered, as above:
("section_b", [
"content b",
("section_b0", "content_b0"),
("section_b1", "content_b1"),
],
)
Section b would just have an attribute like [__content__, section_b0, section_b1]
We could implement ease-of-use things like support for .add(..., after="foo")
as above as well
On efficiency, while it's always good to think about, I don't think we need to get too deep into optimizations with something like this. I'd be very surprised if anyone had even 30 sections total, and at that kind of scale it's not really a limitation imo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The disadvantage is that selecting items will be more difficult an inefficient.
We can always implement __getitem__
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another alternative that might be a little simpler would be to hold a subsection_order attribute in each Section that is just a list with the names of children, and renders them in that order.
This would be a kind of index. I think if possible, we should avoid it so that we don't leave room for possible bugs where the content and the "index" are out of sync.
On efficiency, while it's always good to think about, I don't think we need to get too deep into optimizations with something like this.
Probably you're right, the main concern should be usability and simple code. But you never know how users can surprise you. In skorch, we ran into an issue because a user was training a neural net for tens of thousands of epochs, which we didn't expect...
We can always implement
__getitem__
.
Yes, I thought about that, it just feels "wrong" to have that method when the access is O(n) and not O(1).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like nested list is a bit overkill (and selection should be less efficient as you said). I'm in favor of implementation as is with a regular dict.
skops/card/_card_alternative.py
Outdated
self._add_metrics(self._metrics) | ||
return self | ||
|
||
def _add_metrics(self, metrics: dict[str, float | int]) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thought I had (feel free to challenge me or tell me it's a bad idea):
If we have a lot of these helper _add_...
sections, it might be nice to split this class into two classes, one with the core functionality methods (e.g _strip_blank
or __repr__
) and a subclass with the default formatting/helper functions, which a user might want to tweak themselves.
That way, it would be a lot easier for a user to make their own MyCard
class that has different formatting/ extra methods for their own models, similar to how they might create their own template in the old .md world.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think instead of that they can pass something Formatable
to add
, and that's much easier to implement for them, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That way, it would be a lot easier for a user to make their own MyCard class that has different formatting/ extra methods for their own models, similar to how they might create their own template in the old .md world.
Let me try to understand this better. Right now, if a user would like to make small modifications to the existing templates, I think this can be easily accommodated with the added dynamism of the new class. If they want to have a new default template that they can share with others, it's not easily possible, they would have to share the code (a script) that generates this new template.
What you suggest is that it should be possible to create a subclass of Card
that starts with the new default. I like the idea. If that requires to split the class, I'm not sure, since pre-filling the card is not implemented yet (this is what I hoped to achieve with the make_default
class method). If splitting makes it easier for the user, then I'm all for it, I would have to see it in action.
Another issue is if we want to enable completely different output formats. Just as an example, what if a user wants to create an rst
or org
file instead of md
, or another md
flavor? Right now, the md
format is hard-coded. If we could split the class so that the "formatting engine" is decoupled, it would be a nice gain. But maybe we'll just never support other formats and users have to use pandoc
or similar tools to achieve this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I quite like the idea. I wonder how much work it'd be to load an .md
file back to this object.
skops/card/_card_alternative.py
Outdated
self.model_diagram = model_diagram | ||
self.metadata = metadata or CardData() | ||
|
||
self._data: dict[str, Section] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to support that? I see it as "the card is rendered in the order you're adding things" kinda API, especially since we don't expose this _data
to the users as a public API.
skops/card/_card_alternative.py
Outdated
self._metrics: dict[str, float | int] = {} | ||
self._reset() | ||
|
||
def _reset(self) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from the code I don't see a reset happening here. As in, nothing seems to be cleared.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, it's not used yet. The idea is that if I override, say, the model, we can internally call _reset
and more sure that everything that relates to the change is updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this going to simply erase README.md as it's still what we use to version models and create a new one? just to make sure we're on same page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this will not erase the README.md
. The intent was just that if a user re-assigns the model
, in the setter, we can call self._reset()
and every part of the model card that needs updating will be updated. I will do the proper implementation once #205 is merged, I hope it becomes more obvious then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like it makes sense to update the parts that weren't previously written by the user (hyperparameters, plot and so on) it was a pain point with modelcards
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the idea (though I wouldn't touch any written README.md
). So when we have prefilled=True
, those sections are auto-created and auto-reset.
Something that doesn't work, though, is if the user would choose a different initial template, because then we don't know which sections need to be rewritten or not. I have to see if I find a solution there.
skops/card/_card_alternative.py
Outdated
self._add_metrics(self._metrics) | ||
return self | ||
|
||
def _add_metrics(self, metrics: dict[str, float | int]) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think instead of that they can pass something Formatable
to add
, and that's much easier to implement for them, isn't it?
Added a test that shows that the new card produces the same output as the old card (except for a few non-deterministic parts). This includes most of the idiosyncrasies of the old card we might want to change in the future (e.g. inconsistent capitalization, use of empty lines). Some of the more problematic behaviors of the old card class were, however, fixed (e.g. creating an empty metrics table when there are no metrics). The other tests have been reworked to use the new card features to make them more precise. Often, that means that instead of having a very weak test like "assert 'foo' in card.render()", it is now possible to select the exact section and check that it equals the expected output. This work is still unfinished, specifically it still lacks tests for the card repr and for the newly added features.
Some refactoring to clean up things, rework repr, make repr tests pass.
Also some better type annotations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For very basic details I'm in favor of keeping this PR as is and not changing any data structures as it looks quite simple. It was quite overwhelming to catch up with this and review though, I will never take time off from work or go to far east again 😂😂
skops/card/_card_alternative.py
Outdated
self._metrics: dict[str, float | int] = {} | ||
self._reset() | ||
|
||
def _reset(self) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this going to simply erase README.md as it's still what we use to version models and create a new one? just to make sure we're on same page.
skops/card/_card_alternative.py
Outdated
self.model_diagram = model_diagram | ||
self.metadata = metadata or CardData() | ||
|
||
self._data: dict[str, Section] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like nested list is a bit overkill (and selection should be less efficient as you said). I'm in favor of implementation as is with a regular dict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't dig deep into everything here, but overall it looks great. Genrally:
- I think quite a few of the private methods could use a docstring to make it more clear what they do.
- The tests are changed quite a bit. I can't tell which ones are improvements to existing tests and which ones are due to breaking changes. I don't mind breaking changes when it comes to the rendered doc as a result of this PR, but it'd be nice to have a rather small-ish change in the tests compared to what we have now.
- It's not clear to me what the user can do when they
select
a section, and why they'd do that.
skops/card/_model_card.py
Outdated
if sys.version_info >= (3, 8): | ||
from typing import Protocol | ||
else: | ||
from typing_extensions import Protocol |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this go in fixes.py
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Funny that you should ask. That's where I put it initially and imported it here. This does not work. If my understanding is correct, there is some code in mypy that literally looks for this snippet in the module and then makes sure not to make a fuss. Putting it elsewhere and importing does not work -__-
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so it works if you put type: ignore
on it? I rather do that than having it here. To me, "doesn't work" would mean "python can't run it" rather than "mypy doesn't understand it" 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem when putting this into fixes.py
and importing it is that for some reason, mypy doesn't understand the protocol anymore. E.g. it'll claim:
skops/card/_model_card.py:694: error: Argument 2 to "_add_single" of "Card" has incompatible type "PlotSection"; expected "Union[Formattable, str]"
even though PlotSection
conforms to the Formattable
protocol. Now we could put a # type: ignore
on this method, but then we might as well not use a protocol at all, if it is never checked.
Yes, it is annoying that this doesn't work. However, in this case I'd be in favor of keeping it as is because it's a temporary fix anyway. As soon as we can stop supporting Python 3.7, all these gymnastics can go away (official EOL for Python 3.7 is 27 June 2023).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, then we can add a comment here and in fixes.py
to remember it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think quite a few of the private methods could use a docstring to make it more clear what they do.
Yeah, I focused on public methods, will add docstrings to the private ones too.
The tests are changed quite a bit. I can't tell which ones are improvements to existing tests and which ones are due to breaking changes. I don't mind breaking changes when it comes to the rendered doc as a result of this PR, but it'd be nice to have a rather small-ish change in the tests compared to what we have now.
None of the changes are breaking, except that we don't automatically put some stuff in specific sections anymore, instead giving users the control over the sections. In a previous version of this PR, I even had a test that showed that old and new implementation provide identical results:
However, as we decided to replace the old implementation, the test also had to go.
Regarding how the tests are changed, they test for the most part exactly what has been tested before, but they are more strict. Let's illustrate with an example. Below is the old test for adding a plot:
def test_add_plot(destination_path, model_card):
plt.plot([4, 5, 6, 7])
plt.savefig(Path(destination_path) / "fig1.png")
model_card = model_card.add_plot(fig1="fig1.png").render()
assert "![fig1](fig1.png)" in model_card
Here is the new test:
def test_add_plot(destination_path, model_card):
plt.plot([4, 5, 6, 7])
plt.savefig(Path(destination_path) / "fig1.png")
model_card = model_card.add_plot(fig1="fig1.png")
plot_content = model_card.select("fig1").content.format()
assert plot_content == "![fig1](fig1.png)"
As you can see, they are fairly similar. However, the old test would succeed if anywhere in the model card, there is a string "![fig1](fig1.png)"
. The new test specifically tests that this string is in the section we indicated. A similar pattern can be observed with the other tests.
I know that the diff view does not really help in identifying these changes, maybe it's better to put old and new tests side-by-side and going from top to bottom.
It's not clear to me what the user can do when they select a section, and why they'd do that.
The main reason I added select
and delete
was to be complete, so if a user wants to change or delete a section, they get the possibility (allowing all CRUD operations basically).
E.g. select allows you to do stuff like:
card = Card(...)
card.add_plot(**{"Plots/My awesome plot": "myfig.png"})
pl = card.select("Plots/My awesome plot")
pl.title = "My new title"
and it will be reflected in the model card. Without select, you'd need to recreate the whole section. In effect, this means that building the model card as a whole can be a lot more interactive, leaving room for mistakes and fixing them.
Another benefit of select
is that we can use it for testing more precisely (see comment above).
I will address some of your minor comments and then provide a new update once the permutation importances and model loading from str are merged, to integrate those features as well.
skops/card/_model_card.py
Outdated
if sys.version_info >= (3, 8): | ||
from typing import Protocol | ||
else: | ||
from typing_extensions import Protocol |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Funny that you should ask. That's where I put it initially and imported it here. This does not work. If my understanding is correct, there is some code in mypy that literally looks for this snippet in the module and then makes sure not to make a fuss. Putting it elsewhere and importing does not work -__-
- Remove noise from docstring example - Add the comma after model repr - Add docstrings to private methods
Users can now choose to use no template, skops template, hub template, or their own template. Using their own template disables a lot of prefilling (say, putting the model plot in the card) because we wouldn't know where to put it. Users will need to call card.add for the otherwise prefilled sections.
skops/card/_templates.py
Outdated
), | ||
} | ||
|
||
HUB_TEMPLATE = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spending more time on this, I don't see an easy way to make it suitable for usual tabular usecases. For instance, the environmental impact is pretty much zero for most models.
However, one can do NLP/image usecases and have massive models as a part of a scikit-learn pipeline. We can start with a minimal model card template and improve it then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what is your suggestion here? To trim the HUB_TEMPLATE
down to only contain those sections that make most sense for sklearn models? Or create a new template?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my suggestion is to go back to support only one template, the one we already have, and change it/improve it/add more templates in separate PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, so keep the existing code that would allow to use different templates but only keep the "skops" template as a valid template for the time being. Would you still keep support for custom templates (passed as a dict) and no template?
Also, I'd probably keep the HUB_TEMPLATE
in _template.py
but make it private, since it was a lot of work to implement it. Otherwise, if we squash and merge but later want to go back, we can't :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That all sounds good to me. Supporting custom templates via dict also makes sense.
This can be useful, because otherwise it takes a bit of effort to retrieve the latest section.
It's ugly, but there is no technical reason from prohibiting the addition of tables without rows. (Note, columns are still required). This allows us to use TableSection for formatting the metrics, instead of calling tabulate there directly. This is better, since we don't have 2 separate ways of creating metrics.
As discussed here: skops-dev#72 (comment) Description This feature adds a new function, skops.card.parse_modelcard. When passing it the path to a dumped model card, it parses it using pandoc and returns a Card object, which can be further modified by the user. In the end, this turned out easier than I initially thought it would. The main difficulty are the data structures returned by the pandoc parser, for which I couldn't find any documentation. I guess Haskell code is just self-documenting. For this reason, there are probably quite a few edge cases that I haven't covered yet. Just as an example, when parsing tables, pandoc tells us how the columns are aligned. This information is currently completely discarded (we let tabulate choose the alignment). If we want to preserve the table alignment, we would need to make some changes Implementation This feature requires the alternative card implementation from skops-dev#203 pandoc is used for the following reasons: - widely used and thus battle tested - can read many other formats, not just markdown, so in theory, we should be able to read, e.g., rst model cards without modifying any code The disadvantage is that pandoc is not a Python package, so users need to install it separately. But it is available on all common platforms. For calling pandoc, I chose to shell out using subprocess. I think this should be fine but LMK if there is a better way. There is a Python package that binds pandoc (https://github.com/boisgera/pandoc) but I don't think it's worth it for us to add it, just to avoid shelling out. The package seems to have low adoption and contains a bunch of stuff we don't need. I chose to implement this such that the parser that generates the Card object should not have to know anything about Markdown. Everything related to Markdown is moved to a separate class in _markup.py. In an ideal world, we would not have to know anything about markdown either. Instead the Card object shoud have methods (similar to what we already have for add_plot etc.) that handles all of that. But in practice, this is far from being true. E.g. if a user wants to add bold text, there is no special method for it, so they would need to add raw Markdown. The Card class is thus a leaky abstraction. TODOs This PR is not finished. Remaining TODOs that come to mind: 1. We need to merge the alternative card implementation 2. Documentation has to be updated in several places 3. Tests need to be more complex, right now only one Card is tested 4. CI needs to install pandoc so that the tests are actually run 5. There are some specifics here that won't work with all Python versions, like the use of TypedDict.
Most notably, dynamic model loading in Cards was broken.
1. Don't use Hub template for now 2. Document how to add new default templates in code 3. Nicer error message when using invalid template
@skops-dev/maintainers So I think this PR is finally ready. Please review (again). There will be some conflict with #142 but it shouldn't be too hard to merge (I can do the work for that). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't check the tests in detail, but they seem pretty good.
To me this seems pretty close to being mergeable.
docs/model_card.rst
Outdated
Templates | ||
--------- | ||
TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can remove this for the merge
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do
skops/card/_model_card.py
Outdated
if sys.version_info >= (3, 8): | ||
from typing import Literal, Protocol | ||
else: # TODO: remove when Python 3.7 is dropped | ||
from typing_extensions import Literal, Protocol | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see #245
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove this, assuming that #245 will be merged first.
skops/card/_model_card.py
Outdated
if is_skops_format: | ||
lines += [ | ||
"from skops.io import load", | ||
f'model = load("{file_name}")', | ||
] | ||
else: # pickle | ||
lines += [f'model = joblib.load("{file_name}")'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should probably either do from ... import load
or import xxx; xxx.load(...)
in both cases. I prefer the latter pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good opportunity to determine what the canonical way of importing the persistence functions should be. I opened a discord discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the consensus from discord seems to be import skops.io as sio
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh well, democracy has spoken. I changed it here, but would prefer to change the other occurrences (docs, examples) for another PR, since this one is already big as is.
skops/card/_model_card.py
Outdated
def _add_get_started_code(self, file_name: str, indent: str = " ") -> None: | ||
"""Add getting started code to the corresponding section""" | ||
if self.template not in VALID_TEMPLATES: | ||
# unknown template, cannot prefill |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably same here, users should have an easy way to add this to their templates
- Remove Python 3.7 compatibility code (requires skops-dev#246) - Removed empty section in docs - Don't add empty metrics table by default - Remove option to select with a list of str (e.g. card.select(['Foo', 'Bar'])) is no longer possible) - Add option to chain selections using select method (e.g. card.select('Foo').select('Bar'))
@adrinjalali I think I replied to all your comments, addressed most open points or created an issue for future us. Here are the things in particular that I changed:
Regarding the last point, it's not quite what you suggested, but I found that having the same method name, |
skops/card/_model_card.py
Outdated
if is_skops_format: | ||
lines += [ | ||
"from skops.io import load", | ||
f'model = load("{file_name}")', | ||
] | ||
else: # pickle | ||
lines += [f'model = joblib.load("{file_name}")'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the consensus from discord seems to be import skops.io as sio
skops/card/_model_card.py
Outdated
def _generate_content( | ||
self, data: dict[str, Section], depth: int = 1 | ||
) -> Iterator[str]: | ||
"""Yield title and (formatted) contents""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"""Yield title and (formatted) contents""" | |
"""Yield title and (formatted) contents. | |
It recursively goes through the data and yields the title | |
with the appropriate number of `#`s and the associated | |
content. | |
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added something similar, changed the wording a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I addresses your comments.
Tests for Python 3.7 still failing until we drop it.
skops/card/_model_card.py
Outdated
def _generate_content( | ||
self, data: dict[str, Section], depth: int = 1 | ||
) -> Iterator[str]: | ||
"""Yield title and (formatted) contents""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added something similar, changed the wording a bit.
skops/card/_model_card.py
Outdated
if is_skops_format: | ||
lines += [ | ||
"from skops.io import load", | ||
f'model = load("{file_name}")', | ||
] | ||
else: # pickle | ||
lines += [f'model = joblib.load("{file_name}")'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh well, democracy has spoken. I changed it here, but would prefer to change the other occurrences (docs, examples) for another PR, since this one is already big as is.
@adrinjalali CI is green. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be missing it, but I don't see this point resolved.
skops/card/_model_card.py
Outdated
|
||
return section[leaf_node_name] | ||
|
||
def _add_model_section(self, model) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we talked about exposing these somehow to users with custom templates, how is that done now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we had agreed that for now, users would have to use card.add(...)
? Or do you mean that this PR should create helper functions like get_model_repr
and expose them to users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing with these methods is tat they do more than a simple add
. So what we could do in this PR, is to make them public and change the signature to:
def add_model_section(self, model, section=None) -> None:
Then we'd only accept section=None
if template is the default template.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't remember that we discussed it like that, but I can take a look :)
Presumably, we don't need the model
argument and use self.model
instead? Should users be allowed to add the model reprs multiple times into different sections?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we can use self.model
, then this argument shouldn't be here in the first place.
I don't care for now if users end up adding it multiple times. If they do, their model card looks silly.
So far, if users had a custom template (including no template at all), they would lose the possibility to add some default sections: - metrics - model plot - hyperparamters - getting started code Now we expose methods to the users to add those sections with a simple call (no need to manually format the content and use card.add). For this to work, users have to indicate the desired section, since we would otherwise not know where to put the content. During the work on this, I also cleaned up the Card tests, which became messier and messier over time. Related tests are now all contained in a class, which makes it a little bit easier to see if for a certain method, all test cases have been covered.
@adrinjalali I added the feature you requested. Copied the So far, if users had a custom template (including no template at all),
Now we expose methods to the users to add those sections with a simple During the work on this, I also cleaned up the Edit: @E-Aho maybe also interesting for you to take another look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very happy with this now. I'll let @E-Aho have a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I love this new implementation, major kudos to @BenjaminBossan for this design.
It looks really extensible and user friendly, great documentation, and really solid usage pattern. 'll try to play with it properly over the weekend to get a better feel for it, but just looking at the code, it already looks awesome.
I had one minor comment that could potnentially help in an edge case for bugs, but in any case it LGTM 🚀.
The individual (sub)sections. | ||
|
||
""" | ||
placeholder = "$%!?" # arbitrary sting that never appears naturally |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is minor, but I don't love that this will introduce bugs if a user for some reason actually adds this into their string.
If there's not an easier way to do this, we could maybe use a rarer character, like one of the Unicode Control codes. Maybe the separator character \037
would fit better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a valid concern. I had a regex version that should achieve the same:
def split_subsection_names(key: str) -> list[str]:
# Use a negative lookbehind assert to make sure that the "/" character is
# not preceded by a backslash
parts = re.split(r"/(?<!\\/)", key)
res = [part.strip().replace("\\", "") for part in parts]
return res
Discussed with @adrinjalali and we found it harder to understand and performance-wise, it was the same.
I haven't thought through the implication of using control codes instead. Maybe we can change it later? This shouldn't really affect users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to change later, I'll merge this in and maybe raise a small PR to do the above over the weekend
In any case, happy for @adrinjalali to merge this in, the comment I made can easily be tweaked after merging. |
Woohoo, this is massive! 🥳 |
This is an RFC for a suggested solution for the issues discussed on #72
Description
The proposed model card implementation would allow to dynamically add sections or overwrite them.
This is not a complete implementation but already covers most of the features we already have and then some.
On top of these features, it would be possible to add more features like creating a default
Card
with placeholders, just like the existing template, or the possibility to delete existing sections or to retrieve the result of a certain section.Implementation
The underlying data structure consists of a dict and a
Section
dataclass
.All data is stored in a
_data
attribute with the typedict[str, Section]
. Thedataclass
hold the section contents, i.e. the section title, the section content, and subsections, which again have the same type. It's thus recursive data structure (basically a tree). Section title and dict key are identical, which is mostly for convenience.With this refactor, there are no separate data containers anymore for eval results, template sections, extra sections, etc. They are all treated the same.
IMHO, this greatly simplifies the code overall. The only complex function that's left is the one needed to traverse the tree holding the data, and even that is just 14 LOC.
Demo
updated
To see how the new class can be used, take a look at the main function. The resulting Card can be seen here:
https://huggingface.co/skops-ci/hf_hub_example-e02eee42-5c99-4924-879e-95a20d795d49
The README file looks like this:
https://huggingface.co/skops-ci/hf_hub_example-e02eee42-5c99-4924-879e-95a20d795d49/raw/main/README.md