Have attributes of training dataset in the repository #266

merveenoyan · 2023-01-16T17:27:13Z

The widget is cool and everything, but it's hard to see all the unique values of categorical variables, which variables are categorical, or the range for continuous columns. A couple of possible solutions:

Have attributes in the config or README file.
Have these in a separate file.

Ping @skops-dev/maintainers

BenjaminBossan · 2023-01-17T10:14:47Z

I agree it would be useful to have this information.

Some questions I would have:

How would this information be collected? I don't think it's feasible to automatically derive it from the training data. Even if it's a pandas df, there is still room for ambiguity. Therefore, it sounds like the user would have to indicate the information.
What are all the different types that can exist? Categorical, ordinal, cardinal. How about time (and at what resolution)? Text? Images? I don't think there is an agreed-upon standard for all feature types.
Is there a standard of how to represent these types? It would be good if we didn't have to invent something new.

Of course, we don't have to have everything right from the start, but we should have an idea of what this addition would entail. And to me, it looks like it's far from trivial.

adrinjalali · 2023-01-19T16:09:14Z

I think it'd make sense to have this in the README as part of the model card. We could have some method to generate as much info as we can from a given input dataframe, for example.
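As a rough illustration of what "generate as much info as we can" might look like, here is a minimal sketch. It is not an existing skops API: the function name `summarize_columns` and the output layout are assumptions made up for this example. It takes a plain mapping of column name to values (which a pandas DataFrame can provide via `df.to_dict("list")`) and guesses, per column, either a numeric range or a list of distinct values.

```python
from numbers import Real


def summarize_columns(columns):
    """Guess a per-column summary from a mapping of column name -> list of values.

    Hypothetical helper for illustration only; the output layout is an assumption.
    """
    summary = {}
    for name, values in columns.items():
        # Drop missing entries: None, and NaN (the only value where v != v).
        present = [v for v in values if v is not None and v == v]
        # bool is a subclass of int, so exclude it from the numeric check.
        is_numeric = bool(present) and all(
            isinstance(v, Real) and not isinstance(v, bool) for v in present
        )
        if is_numeric:
            summary[name] = {
                "type": "continuous",
                "min": min(present),
                "max": max(present),
            }
        else:
            summary[name] = {
                "type": "categorical",
                "values": sorted({str(v) for v in present}),
            }
    return summary


info = summarize_columns({"age": [22, 35, 58], "color": ["red", "blue", "red"]})
```

Note that this heuristic would label integer-coded categories as continuous, which is the kind of ambiguity raised earlier in the thread, so a user-supplied override would probably still be needed.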

BenjaminBossan · 2023-01-20T10:11:30Z

I think the reason Merve wanted to have them in the config.json or a separate file is that this information could be used to improve the UI on the Hub. E.g. in the inference widget, if we know the distinct values of a categorical feature, the widget could let the user choose the value from a list. If this information were added to the README, it would be more difficult to extract.

adrinjalali · 2023-01-20T13:18:43Z

I see. In that case, I'm happy for it to live in a data-info.yml/json kind of file. We probably don't want to make the config file too large, I guess?

merveenoyan · 2023-01-20T16:18:18Z

@adrinjalali I agree.

lazarust · 2023-09-03T01:19:32Z

@merveenoyan I'm happy to take this if it still needs to be done!

lazarust · 2023-09-08T00:20:13Z

@BenjaminBossan I'm happy to take this one but had a few thoughts/questions:

When should the file be generated?
Is there a list of data types that we want to support initially? You mentioned a couple above and I agree it would be pretty hard to have all of them since there isn't an agreed-upon standard.

BenjaminBossan · 2023-09-08T09:39:24Z

Thanks for taking an interest in the issue. I don't think there is a definite answer to your questions. The initial motivation is to know in advance what options exist for categorical data, in order to improve the widget. But I think Adrin made a good point about file size, which can easily get large if we just record all distinct values, so some kind of compromise would need to be found.
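One possible shape for such a compromise, sketched under assumptions: record the distinct values only when there are few of them, and otherwise store just the count. The helper name `cap_categories`, the `max_values` threshold, and the JSON layout are all hypothetical and not something skops or the Hub defines.

```python
import json


def cap_categories(values, max_values=20):
    """Describe a categorical column, listing values only if there are few.

    Hypothetical helper; max_values is an arbitrary threshold for illustration.
    """
    distinct = sorted({str(v) for v in values})
    truncated = len(distinct) > max_values
    return {
        "type": "categorical",
        "n_distinct": len(distinct),
        "truncated": truncated,
        # Omit the full list when it would bloat the file.
        "values": None if truncated else distinct,
    }


data_info = {
    "color": cap_categories(["red", "blue", "red"]),
    "user_id": cap_categories(range(1000)),
}
# The serialized text could be written to a data-info.json kind of file.
serialized = json.dumps(data_info, indent=2)
```

With a cap like this, the file size is bounded by the number of columns times the threshold, at the cost of the widget falling back to free-text input for high-cardinality columns.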

Also, for this feature to make sense, we would need to do work on the widget side as well, for which there is currently no capacity AFAIK, so I would rather not work on this feature right now.

lazarust · 2023-09-08T14:10:49Z

@BenjaminBossan Sounds good! Is there another issue I could help out with?

BenjaminBossan · 2023-09-08T14:38:55Z

If this is something you're willing to jump into, I think we have some room to improve the skops.io persistence format. For instance, support for more external libraries could be added, like scikeras (#388) or skorch :)
