Update `kedro catalog list` command to account for dataset factories #2793

AhdraMeraliQB · 2023-07-13T10:49:06Z

Description

With the introduction of dataset factories to the data catalog (#2635), our catalog CLI commands need to be updated to reflect the changes. This PR updates `kedro catalog list` to include datasets that are resolved through dataset factories.

Development notes

I have tested this manually with a modified version of spaceflights. In the catalog, the entries for `preprocessed_shuttles` and `preprocessed_companies` have both been replaced with the following pattern:

preprocessed_{name}:
  type: pandas.ParquetDataSet
  filepath: data/02_intermediate/preprocessed_{name}.pq
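
To illustrate how a pattern entry like the one above can stand in for several explicit entries, here is a minimal sketch using only the standard library. It is not Kedro's implementation; `pattern_to_regex` and `resolve` are hypothetical helper names introduced for this example.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn a factory pattern such as 'preprocessed_{name}' into a compiled regex."""
    # re.split with one capture group alternates literal text and placeholder names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        f"(?P<{part}>.+)" if index % 2 else re.escape(part)
        for index, part in enumerate(parts)
    )
    return re.compile(f"^{regex}$")


def resolve(dataset_name: str, pattern: str, config: dict):
    """Return the config with placeholders filled in, or None if the name does not match."""
    match = pattern_to_regex(pattern).match(dataset_name)
    if match is None:
        return None
    return {
        key: value.format(**match.groupdict()) if isinstance(value, str) else value
        for key, value in config.items()
    }


# The catalog entry from the description above, written as a Python dict.
factory_config = {
    "type": "pandas.ParquetDataSet",
    "filepath": "data/02_intermediate/preprocessed_{name}.pq",
}

resolved = resolve("preprocessed_shuttles", "preprocessed_{name}", factory_config)
unmatched = resolve("companies", "preprocessed_{name}", factory_config)
```

Here `resolved["filepath"]` ends in `preprocessed_shuttles.pq`, while `unmatched` is `None` because `companies` does not fit the pattern.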

Before the changes made in this PR, this project setup would yield the following output for `kedro catalog list -p data_processing`:

Datasets in 'data_processing' pipeline:
  Datasets mentioned in pipeline:
    CSVDataSet:
    - companies
    - reviews
    DefaultDataset:
    - preprocessed_shuttles
    - preprocessed_companies
    ExcelDataSet:
    - shuttles
    ParquetDataSet:
    - model_input_table
  Datasets not mentioned in pipeline:
    PickleDataSet:
    - regressor

With the changes this now looks like:

Datasets in 'data_processing' pipeline:
  Datasets generated from factories:
    pandas.ParquetDataSet:
    - preprocessed_shuttles
    - preprocessed_companies
  Datasets mentioned in pipeline:
    CSVDataSet:
    - companies
    - reviews
    ExcelDataSet:
    - shuttles
    ParquetDataSet:
    - model_input_table
  Datasets not mentioned in pipeline:
    PickleDataSet:
    - regressor
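
The grouping shown in the output above can be sketched roughly as follows. This is an illustrative approximation, not the code in `kedro/framework/cli/catalog.py`; the function name `group_for_listing` and its arguments are assumptions made for this example.

```python
from collections import defaultdict


def group_for_listing(pipeline_ds, explicit_types, factory_types):
    """Group dataset names by section and type.

    pipeline_ds:    names used by the pipeline
    explicit_types: name -> type for entries written out in the catalog
    factory_types:  name -> type for pipeline names that matched a factory pattern
    """
    mentioned = defaultdict(list)
    from_factories = defaultdict(list)
    not_mentioned = defaultdict(list)

    for name in sorted(pipeline_ds):
        if name in explicit_types:
            mentioned[explicit_types[name]].append(name)
        elif name in factory_types:
            from_factories[factory_types[name]].append(name)
        else:
            # Neither explicit nor matched by a pattern: falls back to the default type.
            mentioned["DefaultDataset"].append(name)

    # Only explicit catalog entries can be "not mentioned in pipeline".
    for name in sorted(set(explicit_types) - set(pipeline_ds)):
        not_mentioned[explicit_types[name]].append(name)

    sections = {
        "Datasets mentioned in pipeline": mentioned,
        "Datasets generated from factories": from_factories,
        "Datasets not mentioned in pipeline": not_mentioned,
    }
    return {title: dict(groups) for title, groups in sections.items() if groups}


listing = group_for_listing(
    pipeline_ds={
        "companies", "reviews", "shuttles", "model_input_table",
        "preprocessed_shuttles", "preprocessed_companies",
    },
    explicit_types={
        "companies": "CSVDataSet",
        "reviews": "CSVDataSet",
        "shuttles": "ExcelDataSet",
        "model_input_table": "ParquetDataSet",
        "regressor": "PickleDataSet",
    },
    factory_types={
        "preprocessed_shuttles": "pandas.ParquetDataSet",
        "preprocessed_companies": "pandas.ParquetDataSet",
    },
)
```

Without the `factory_types` lookup, the two `preprocessed_*` names would fall through to the `DefaultDataset` branch, which matches the "before" output.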

Questions for reviewers

This is more of a nit, but I'm unsure whether "Datasets generated from factories" is the best way to describe this section. Any other ideas or comments?

Checklist

Read the contributing guidelines
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

AhdraMeraliQB · 2023-07-13T10:57:02Z

kedro/framework/cli/catalog.py

        unused_by_type = _map_type_to_datasets(unused_ds, datasets_meta)
        used_by_type = _map_type_to_datasets(used_ds, datasets_meta)

        if default_ds:
            used_by_type["DefaultDataset"].extend(default_ds)

-        data = ((not_mentioned, dict(unused_by_type)), (mentioned, dict(used_by_type)))
+        data = ((mentioned, dict(used_by_type)), (factories, dict(factory_ds_by_type)), (not_mentioned, dict(unused_by_type)))


It's worth noting that PyYAML automatically sorts keys in whatever it dumps to the terminal. To configure this we would need to require PyYAML 5.1 or later, which added the `sort_keys` option.

Currently, generated datasets will be listed first, then those mentioned in the pipeline, and lastly those that aren't. A similar alphabetical ordering is applied to the pipelines themselves (as opposed to in order of input). I'm not the biggest fan of this, but I'm unsure it's worth bumping PyYAML for. In any case, I have re-ordered them, so that if/when we do update the PyYAML lower bound we can make the changes easily.
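
For reference, a small sketch of the ordering behaviour described above, assuming PyYAML 5.1 or later is installed. The `listing` dict is made-up sample data, not output from the command.

```python
import yaml  # PyYAML; sort_keys requires version 5.1 or later

# Sections in the order we would like them to appear.
listing = {
    "Datasets mentioned in pipeline": {"CSVDataSet": ["companies", "reviews"]},
    "Datasets generated from factories": {
        "pandas.ParquetDataSet": ["preprocessed_shuttles"]
    },
    "Datasets not mentioned in pipeline": {"PickleDataSet": ["regressor"]},
}

# Default behaviour: keys are sorted alphabetically, so "generated" comes first.
sorted_text = yaml.safe_dump(listing)

# With sort_keys=False the insertion order of the dict is preserved.
ordered_text = yaml.safe_dump(listing, sort_keys=False)
```

With the default settings the factories section is emitted first; with `sort_keys=False` the sections appear in the order they were inserted.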

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

ankatiyar · 2023-07-13T12:55:40Z

QQ: any reason we're distinguishing between datasets explicitly in the catalog vs generated from factories?

AhdraMeraliQB · 2023-07-14T07:19:30Z

> QQ: any reason we're distinguishing between datasets explicitly in the catalog vs generated from factories?

@ankatiyar

The idea spawned from a conversation with @merelcht - it is probably useful to distinguish these not just for clarity but also debugging purposes.

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

ankatiyar · 2023-07-14T09:27:55Z

> The idea spawned from a conversation with @merelcht - it is probably useful to distinguish these not just for clarity but also debugging purposes.

This sounds good, but then I think the heading for the "non-factory" datasets could be changed to be consistent. Something like "Datasets generated from catalog files:"? What do you think?

Another thing: I was trying it out with my dummy project and noticed that this doesn't work for "Datasets not mentioned in pipeline" when those datasets come from factories.

When all the datasets are explicitly mentioned:

Datasets in '__default__' pipeline:
  Datasets mentioned in pipeline:
    CSVDataSet:
    - france_companies
    - germany_companies
    - france_preprocessed_companies
    - germany_preprocessed_companies
    - switzerland_companies
    - switzerland_preprocessed_companies
    ParquetDataSet:
    - germany_final
    - switzerland_final
    - france_final
Datasets in 'pipe1' pipeline:
  Datasets mentioned in pipeline:
    CSVDataSet:
    - france_companies
    - germany_companies
    - france_preprocessed_companies
    - germany_preprocessed_companies
    - switzerland_companies
    - switzerland_preprocessed_companies
  Datasets not mentioned in pipeline:
    ParquetDataSet:
    - germany_final
    - switzerland_final
    - france_final

vs. when all datasets are factories:

Datasets in '__default__' pipeline:
  Datasets generated from factories:
    pandas.CSVDataSet:
    - france_preprocessed_companies
    - switzerland_companies
    - germany_companies
    - germany_preprocessed_companies
    - france_companies
    - switzerland_preprocessed_companies
    pandas.ParquetDataSet:
    - switzerland_final
    - france_final
    - germany_final
Datasets in 'pipe1' pipeline:
  Datasets generated from factories:
    pandas.CSVDataSet:
    - france_preprocessed_companies
    - switzerland_companies
    - germany_companies
    - germany_preprocessed_companies
    - france_companies
    - switzerland_preprocessed_companies

AhdraMeraliQB · 2023-07-14T09:39:00Z

@ankatiyar

The "Datasets not mentioned in pipeline" looks at entries included in the catalog that aren't included in the pipeline. Because the factory datasets aren't included in the catalog, they wouldn't show up here - for them to be resolved they would have to be explicitly defined in the pipeline, which would then mean they show up in the "Datasets mentioned in pipeline" section instead.

> This sounds good but I think then the heading for the "non factories" datasets could be changed to be consistent. Something like "Datasets generated from catalog files:"? What do you think?

I'm unsure the distinction here would be necessary as it is the default case, but it's worth considering 🤔

ankatiyar · 2023-07-14T09:58:18Z

> The "Datasets not mentioned in pipeline" looks at entries included in the catalog that aren't included in the pipeline. Because the factory datasets aren't included in the catalog, they wouldn't show up here - for them to be resolved they would have to be explicitly defined in the pipeline, which would then mean they show up in the "Datasets mentioned in pipeline" section instead.

Could this be resolved the way we do it in runner.py?

kedro/kedro/runner/runner.py

Line 79 in c1822a2

registered_ds = [ds for ds in pipeline.data_sets() if ds in catalog]

Something like: catalog_ds = set(catalog.list()) | set(registered_ds)

AhdraMeraliQB · 2023-07-14T10:19:05Z

@ankatiyar I see, you're suggesting any datasets that make use of factories within the default pipeline should be considered in the "Datasets not mentioned in pipeline" section?

To be completely honest, I feel like it might be overkill. To me, "Datasets generated from factories" are generated by the pipeline, so there would be no need to list the ones generated by a different pipeline. It would be good to see if @merelcht or anyone else has any thoughts on this.

merelcht · 2023-07-14T13:18:17Z

> @ankatiyar I see, you're suggesting any datasets that make use of factories within the default pipeline should be considered in the "Datasets not mentioned in pipeline" section?
>
> To be completely honest, I feel like it might be overkill. To me, "Datasets generated from factories" are generated by the pipeline, so there would be no need to list the ones generated by a different pipeline. It would be good to see if @merelcht or anyone else has any thoughts on this.

I don't really understand this. If a dataset isn't mentioned in the pipeline it wouldn't be picked up by a dataset factory, so how could a dataset generated by a factory ever belong to the "not mentioned in pipeline" section?
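
The two readings discussed in this thread can be compared with plain sets. This is only a sketch: `is_factory_match` and `unused_datasets` are hypothetical stand-ins, and the dataset names are taken from the example project output earlier in the conversation.

```python
import re

# Hypothetical stand-in for "this name matches some factory pattern in the catalog".
FACTORY_NAME = re.compile(r"[a-z]+_(?:(?:preprocessed_)?companies|final)")


def is_factory_match(name: str) -> bool:
    return FACTORY_NAME.fullmatch(name) is not None


def unused_datasets(target_pipeline, source_pipeline, explicit_catalog):
    """Datasets known to the catalog but not used by target_pipeline.

    source_pipeline is the pipeline whose dataset names are checked against
    the catalog (explicit entries or factory patterns) to build the known set.
    """
    registered = {
        ds for ds in source_pipeline
        if ds in explicit_catalog or is_factory_match(ds)
    }
    catalog_ds = explicit_catalog | registered
    return catalog_ds - target_pipeline


pipe1 = {
    "france_companies", "germany_companies", "switzerland_companies",
    "france_preprocessed_companies", "germany_preprocessed_companies",
    "switzerland_preprocessed_companies",
}
default = pipe1 | {"france_final", "germany_final", "switzerland_final"}
explicit_catalog = set()  # every dataset comes from a factory pattern

# Names resolved from the same pipeline can never end up "not mentioned".
unused_same_pipeline = unused_datasets(pipe1, pipe1, explicit_catalog)

# Resolving names from the default pipeline surfaces the other pipelines' datasets.
unused_via_default = unused_datasets(pipe1, default, explicit_catalog)
```

Resolving factory names from the listed pipeline itself leaves the "not mentioned" set empty, whereas resolving them from `__default__` would list the `*_final` datasets for `pipe1`.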

merelcht

I've left a couple of suggestions and questions, but overall it looks good. I think "datasets generated from factories" is a good way to describe this section.

RELEASE.md

kedro/framework/cli/catalog.py

tests/framework/cli/test_catalog.py

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

SajidAlamQB

This looks great! 🌟 This change makes a lot of sense and improves `kedro catalog list` by adding visibility into datasets generated from factories.

merelcht

Nice work 👍

kedro/framework/cli/catalog.py

merelcht

LGTM! ⭐

Ahdra Merali and others added 3 commits July 13, 2023 11:47

Resolve factory datasets when listing

366467f

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

Merge branch 'main' into feat/add-factories-cli-commands

00dc871

Reorder datasets

ef24efa

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

AhdraMeraliQB commented Jul 13, 2023

Remove leftover print and lint

1e7201e

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

AhdraMeraliQB requested review from astrojuanlu and ankatiyar July 13, 2023 11:06

AhdraMeraliQB linked an issue Jul 13, 2023 that may be closed by this pull request

Update kedro catalog list command to account for dataset factories #2789

Closed

Ahdra Merali added 2 commits July 13, 2023 12:14

More linting

3e31b7b

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

Even more linting

26a9a78

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

Ahdra Merali added 3 commits July 14, 2023 08:54

Add test

de6a36f

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

Update RELEASE.md

91b0c89

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

Update test file extension

1f31217

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

AhdraMeraliQB marked this pull request as ready for review July 14, 2023 08:09

AhdraMeraliQB requested a review from merelcht as a code owner July 14, 2023 08:09

AhdraMeraliQB mentioned this pull request Jul 14, 2023

Add kedro catalog factory list CLI command #2796

Closed

5 tasks

merelcht reviewed Jul 14, 2023

RELEASE.md (outdated, resolved)

kedro/framework/cli/catalog.py (resolved)

tests/framework/cli/test_catalog.py (outdated, resolved)

AhdraMeraliQB and others added 2 commits July 19, 2023 09:36

Merge branch 'main' into feat/add-factories-cli-commands

b6001ba

Add suggestions from code review

ca06fa8

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

AhdraMeraliQB requested review from noklam and SajidAlamQB and removed request for ankatiyar July 19, 2023 09:04

SajidAlamQB previously approved these changes Jul 19, 2023

merelcht previously approved these changes Jul 20, 2023

AhdraMeraliQB added 3 commits July 25, 2023 08:07

Merge branch 'main' into feat/add-factories-cli-commands

8c5e689

Merge branch 'main' into feat/add-factories-cli-commands

219729b

Merge branch 'main' into feat/add-factories-cli-commands

c346c17

AhdraMeraliQB dismissed stale reviews from SajidAlamQB and merelcht via c346c17 July 27, 2023 08:12

AhdraMeraliQB requested review from merelcht and SajidAlamQB July 27, 2023 08:40

SajidAlamQB approved these changes Jul 27, 2023

merelcht reviewed Jul 27, 2023

kedro/framework/cli/catalog.py (outdated, resolved)

Remove holdovers from merge

d629385

merelcht approved these changes Jul 27, 2023

Merge branch 'main' into feat/add-factories-cli-commands

b78c5c9

AhdraMeraliQB enabled auto-merge (squash) July 27, 2023 10:52

Merge branch 'main' into feat/add-factories-cli-commands

5007619

AhdraMeraliQB merged commit 3bcf2b1 into main Jul 27, 2023
28 of 29 checks passed

AhdraMeraliQB deleted the feat/add-factories-cli-commands branch July 27, 2023 12:40
