Update kedro catalog list command to account for dataset factories #2793
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
kedro/framework/cli/catalog.py
  unused_by_type = _map_type_to_datasets(unused_ds, datasets_meta)
  used_by_type = _map_type_to_datasets(used_ds, datasets_meta)

  if default_ds:
      used_by_type["DefaultDataset"].extend(default_ds)

- data = ((not_mentioned, dict(unused_by_type)), (mentioned, dict(used_by_type)))
+ data = ((mentioned, dict(used_by_type)), (factories, dict(factory_ds_by_type)), (not_mentioned, dict(unused_by_type)))
It's worth noting that PyYAML automatically sorts mapping keys alphabetically in what it dumps to the terminal. To configure this (through the sort_keys argument of yaml.dump) we would need to raise our lower bound to PyYAML 5.1.
Currently, datasets generated from factories will be listed first, then those mentioned in the pipeline, and lastly those that aren't. A similar alphabetical ordering is applied to the pipelines themselves (as opposed to the order of input). I'm not the biggest fan of this, but I'm unsure it's worth bumping PyYAML for. In any case, I have re-ordered them, so that if/when we do update the PyYAML lower bound we can make the changes easily.
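To illustrate the ordering point, here is a minimal sketch, assuming PyYAML 5.1 or later is installed. The section headings come from this PR; the dataset type and dataset names are placeholders chosen for illustration, and this is not the code in kedro/framework/cli/catalog.py.

```python
import yaml

# Sections in the order we would like them printed (insertion order).
# The dataset type and names below are illustrative placeholders.
sections = {
    "Datasets mentioned in pipeline": {"CSVDataSet": ["companies"]},
    "Datasets generated from factories": {"CSVDataSet": ["preprocessed_companies"]},
    "Datasets not mentioned in pipeline": {"CSVDataSet": ["unused_table"]},
}

# Default behaviour: mapping keys are sorted alphabetically on dump.
sorted_text = yaml.dump(sections)

# With sort_keys=False (PyYAML >= 5.1) the insertion order is preserved.
ordered_text = yaml.dump(sections, sort_keys=False)

# Read the top-level keys back to compare the two orderings.
default_order = list(yaml.safe_load(sorted_text))
insertion_order = list(yaml.safe_load(ordered_text))

print(default_order)
print(insertion_order)
```

With the default settings the "generated from factories" heading sorts ahead of the other two, which matches the behaviour described above. Passing sort_keys=False would keep the tuple order used in the diff.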
QQ: any reason we're distinguishing between datasets defined explicitly in the catalog and those generated from factories?
The idea spawned from a conversation with @merelcht. It is probably useful to distinguish these, not just for clarity but also for debugging purposes.
This sounds good, but I think the heading for the "non-factory" datasets could then be changed to be consistent. Something like "Datasets generated from catalog files:"? What do you think? Also, another thing: I was trying it out with my dummy project, and I noticed that this wasn't working for "Datasets not mentioned in pipeline" when these datasets are factories. When all the datasets are explicitly mentioned:
vs. when all datasets are factories:
The "Datasets not mentioned in pipeline" looks at entries included in the catalog that aren't included in the pipeline. Because the factory datasets aren't included in the catalog, they wouldn't show up here - for them to be resolved they would have to be explicitly defined in the pipeline, which would then mean they show up in the "Datasets mentioned in pipeline" section instead.
I'm unsure the distinction here would be necessary as it is the default case, but it's worth considering 🤔
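The reasoning above can be sketched roughly as follows. This is a simplified illustration, not Kedro's implementation: the helper names pattern_to_regex and partition_datasets are hypothetical, the pattern matching uses the standard library re module as a stand-in for whatever the catalog actually uses, and the dataset names are made up.

```python
import re


def pattern_to_regex(pattern):
    """Compile a factory-style pattern such as "preprocessed_{name}" into a regex."""
    # Splitting on a capturing group alternates literal text and placeholder names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
        for i, part in enumerate(parts)
    )
    return re.compile(regex)


def partition_datasets(catalog_entries, factory_patterns, pipeline_datasets):
    """Split dataset names into the sections discussed in this thread."""
    explicit = set(catalog_entries)
    used = set(pipeline_datasets)
    compiled = [pattern_to_regex(p) for p in factory_patterns]

    # Pipeline datasets without an explicit entry may still match a factory pattern.
    unresolved = used - explicit
    from_factories = {
        name for name in unresolved if any(rx.fullmatch(name) for rx in compiled)
    }
    return {
        "mentioned": sorted(explicit & used),
        "factories": sorted(from_factories),
        # Only explicit catalog entries can end up here: a factory pattern
        # produces a dataset only when the pipeline asks for a matching name.
        "not_mentioned": sorted(explicit - used),
        "default": sorted(unresolved - from_factories),
    }


result = partition_datasets(
    catalog_entries=["companies", "shuttles", "reviews", "unused_table"],
    factory_patterns=["preprocessed_{name}"],
    pipeline_datasets=[
        "companies",
        "shuttles",
        "reviews",
        "preprocessed_companies",
        "preprocessed_shuttles",
        "model_input_table",
    ],
)
print(result)
```

Under these assumptions a factory-generated dataset can never land in the "not mentioned" group, because it only exists once a pipeline dataset name matches the pattern.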
Could this be resolved like we do in line 79 in c1822a2?
Like ->
@ankatiyar I see, you're suggesting any datasets that make use of factories within the default pipeline should be considered in the "Datasets not mentioned in pipeline" section? To be completely honest, I feel like it might be overkill. To me, "Datasets generated from factories" are generated by the pipeline, so there would be no need to list the ones generated by a different pipeline, but it would be good to see if @merelcht or anyone else has any thoughts on this.
I don't really understand this. If a dataset isn't mentioned in the pipeline it wouldn't be picked up by a dataset factory, so how could a dataset generated by a factory ever belong to the "not mentioned in pipeline" section?
I've left a couple of suggestions and questions, but overall it looks good. I think "datasets generated from factories" is a good way to describe this section.
This looks great! 🌟 This change makes a lot of sense and improves the kedro catalog list command by adding visibility into datasets generated from factories.
Nice work 👍
LGTM! ⭐
Description
With the introduction of dataset factories to the data catalog (#2635), our catalog CLI commands need to be updated to reflect the changes. This PR updates kedro catalog list to include datasets that make use of the dataset factories.
Development notes
I have tested this manually with a modified version of spaceflights. In the catalog, both entries for preprocessed_shuttles and preprocessed_companies have been replaced with the following:
Before the changes made in this PR, this project setup would yield the following output using kedro catalog list -p data_processing:
With the changes this now looks like:
Questions for reviewers
This is more of a nit, but I'm unsure "Datasets generated from factories" is the best way to describe this section. Any other ideas/comments?
Checklist
RELEASE.md
file