Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update kedro catalog list command to account for dataset factories #2793

Merged
merged 17 commits into from
Jul 27, 2023

Conversation

AhdraMeraliQB
Copy link
Contributor

@AhdraMeraliQB AhdraMeraliQB commented Jul 13, 2023

Description

With the introduction of dataset factories to the data catalog (#2635) our catalog CLI commands need to be updated to reflect the changes. This PR updates kedro catalog list to include datasets that make use of the dataset factories.

Development notes

I have tested this manually with a modified version of spaceflights. In the catalog, both entries for preprocessed_shuttles and preprocessed_companies have been replaced with the following:

preprocessed_{name}:
  type: pandas.ParquetDataSet
  filepath: data/02_intermediate/preprocessed_{name}.pq

Before the changes made in this PR, this project setup would yield the following output using kedro catalog list -p data_processing:

Datasets in 'data_processing' pipeline:
  Datasets mentioned in pipeline:
    CSVDataSet:
    - companies
    - reviews
    DefaultDataset:
    - preprocessed_shuttles
    - preprocessed_companies
    ExcelDataSet:
    - shuttles
    ParquetDataSet:
    - model_input_table
  Datasets not mentioned in pipeline:
    PickleDataSet:
    - regressor

With the changes this now looks like:

Datasets in 'data_processing' pipeline:
  Datasets generated from factories:
    pandas.ParquetDataSet:
    - preprocessed_shuttles
    - preprocessed_companies
  Datasets mentioned in pipeline:
    CSVDataSet:
    - companies
    - reviews
    ExcelDataSet:
    - shuttles
    ParquetDataSet:
    - model_input_table
  Datasets not mentioned in pipeline:
    PickleDataSet:
    - regressor

Questions for reviewers

This is more of a nit but I'm unsure "Datasets generated from factories" is the best way to describe this section, any other ideas/comments?

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

Ahdra Merali and others added 3 commits July 13, 2023 11:47
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
unused_by_type = _map_type_to_datasets(unused_ds, datasets_meta)
used_by_type = _map_type_to_datasets(used_ds, datasets_meta)

if default_ds:
used_by_type["DefaultDataset"].extend(default_ds)

data = ((not_mentioned, dict(unused_by_type)), (mentioned, dict(used_by_type)))
data = ((mentioned, dict(used_by_type)), (factories, dict(factory_ds_by_type)), (not_mentioned, dict(unused_by_type)))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's worth noting that PyYaml will automatically sort what it dumps to terminal. To configure this we would need to update to PyYAML 5.1.

Currently, generated datasets will be listed first, then those mentioned in the pipeline, and lastly those that aren't. A similar alphabetical ordering is applied to the pipelines themselves (as opposed to in order of input). I'm not the biggest fan of this, but I'm unsure it's worth bumping PyYAML for. In any case, I have re-ordered them, so that if/when we do update the PyYAML lower bound we can make the changes easily.

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Ahdra Merali added 2 commits July 13, 2023 12:14
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
@ankatiyar
Copy link
Contributor

QQ: any reason we're distinguishing between datasets explicitly in the catalog vs generated from factories?

@AhdraMeraliQB
Copy link
Contributor Author

QQ: any reason we're distinguishing between datasets explicitly in the catalog vs generated from factories?

@ankatiyar

The idea spawned from a conversation with @merelcht - it is probably useful to distinguish these not just for clarity but also debugging purposes.

Ahdra Merali added 3 commits July 14, 2023 08:54
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
@AhdraMeraliQB AhdraMeraliQB marked this pull request as ready for review July 14, 2023 08:09
@ankatiyar
Copy link
Contributor

The idea spawned from a conversation with @merelcht - it is probably useful to distinguish these not just for clarity but also debugging purposes.

This sounds good but I think then the heading for the "non factories" datasets could be changed to be consistent. Something like "Datasets generated from catalog files:"? What do you think'?

Also another thing, I was trying it out with my dummy project. But I noticed that this wasn't working for "Datasets not mentioned in pipeline" when these datasets are factories -

When all the datasets are explicitly mentioned :

Datasets in '__default__' pipeline:
  Datasets mentioned in pipeline:
    CSVDataSet:
    - france_companies
    - germany_companies
    - france_preprocessed_companies
    - germany_preprocessed_companies
    - switzerland_companies
    - switzerland_preprocessed_companies
    ParquetDataSet:
    - germany_final
    - switzerland_final
    - france_final
Datasets in 'pipe1' pipeline:
  Datasets mentioned in pipeline:
    CSVDataSet:
    - france_companies
    - germany_companies
    - france_preprocessed_companies
    - germany_preprocessed_companies
    - switzerland_companies
    - switzerland_preprocessed_companies
  Datasets not mentioned in pipeline:
    ParquetDataSet:
    - germany_final
    - switzerland_final
    - france_final

vs When all datasets are factories

Datasets in '__default__' pipeline:
  Datasets generated from factories:
    pandas.CSVDataSet:
    - france_preprocessed_companies
    - switzerland_companies
    - germany_companies
    - germany_preprocessed_companies
    - france_companies
    - switzerland_preprocessed_companies
    pandas.ParquetDataSet:
    - switzerland_final
    - france_final
    - germany_final
Datasets in 'pipe1' pipeline:
  Datasets generated from factories:
    pandas.CSVDataSet:
    - france_preprocessed_companies
    - switzerland_companies
    - germany_companies
    - germany_preprocessed_companies
    - france_companies
    - switzerland_preprocessed_companies

@AhdraMeraliQB
Copy link
Contributor Author

@ankatiyar

The "Datasets not mentioned in pipeline" looks at entries included in the catalog that aren't included in the pipeline. Because the factory datasets aren't included in the catalog, they wouldn't show up here - for them to be resolved they would have to be explicitly defined in the pipeline, which would then mean they show up in the "Datasets mentioned in pipeline" section instead.

This sounds good but I think then the heading for the "non factories" datasets could be changed to be consistent. Something like "Datasets generated from catalog files:"? What do you think'?

I'm unsure the distinction here would be necessary as it is the default case, but it's worth considering 🤔

@ankatiyar
Copy link
Contributor

The "Datasets not mentioned in pipeline" looks at entries included in the catalog that aren't included in the pipeline. Because the factory datasets aren't included in the catalog, they wouldn't show up here - for them to be resolved they would have to be explicitly defined in the pipeline, which would then mean they show up in the "Datasets mentioned in pipeline" section instead.

Could this be resolved like we do in runner.py -

registered_ds = [ds for ds in pipeline.data_sets() if ds in catalog]

Like -> catalog_ds = Union(set(catalog.list()), set(registered_ds))

@AhdraMeraliQB
Copy link
Contributor Author

@ankatiyar I see, you're suggesting any datasets that make use of factories within the default pipeline should be considered in the "Datasets not mentioned in pipeline" section?

To be completely honest, I feel like it might be overkill - to me "Datasets generated from factories" are generated by the pipeline therefore there would be no need to list the ones generated by a different pipeline, but it would be good to see if @merelcht / any one else have any thoughts on this.

@merelcht
Copy link
Member

merelcht commented Jul 14, 2023

@ankatiyar I see, you're suggesting any datasets that make use of factories within the default pipeline should be considered in the "Datasets not mentioned in pipeline" section?

To be completely honest, I feel like it might be overkill - to me "Datasets generated from factories" are generated by the pipeline therefore there would be no need to list the ones generated by a different pipeline, but it would be good to see if @merelcht / any one else have any thoughts on this.

I don't really understand this. If a dataset isn't mentioned in the pipeline it wouldn't be picked up by a dataset factory, so how could a dataset generated by a factory every belong to the "not mentioned in pipeline" section?

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a couple of suggestions and questions, but overall it looks good. I think "datasets generated from factories" is a good way to describe this section.

RELEASE.md Outdated Show resolved Hide resolved
kedro/framework/cli/catalog.py Show resolved Hide resolved
tests/framework/cli/test_catalog.py Outdated Show resolved Hide resolved
AhdraMeraliQB and others added 2 commits July 19, 2023 09:36
@AhdraMeraliQB AhdraMeraliQB requested review from noklam and SajidAlamQB and removed request for ankatiyar July 19, 2023 09:04
SajidAlamQB
SajidAlamQB previously approved these changes Jul 19, 2023
Copy link
Contributor

@SajidAlamQB SajidAlamQB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! 🌟 This change makes a lot of sense and improves the kedro catalog list to add visibility into datasets generated from factories well.

merelcht
merelcht previously approved these changes Jul 20, 2023
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work 👍

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! ⭐

@AhdraMeraliQB AhdraMeraliQB enabled auto-merge (squash) July 27, 2023 10:52
@AhdraMeraliQB AhdraMeraliQB merged commit 3bcf2b1 into main Jul 27, 2023
28 of 29 checks passed
@AhdraMeraliQB AhdraMeraliQB deleted the feat/add-factories-cli-commands branch July 27, 2023 12:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update kedro catalog list command to account for dataset factories
4 participants