
Reduce the data used in the spaceflights tutorial and starters #3109

Closed
merelcht opened this issue Oct 3, 2023 · 8 comments · Fixed by #3246

@merelcht
Member

merelcht commented Oct 3, 2023

Description

The spaceflights starter gets used in demos and testing, but it takes a considerable amount of time to run the pipeline. For example, in the Kedro bootcamp, demoing catalog.load("shuttles") takes around 15-20 seconds, which is a bit awkward for demo purposes.

Context

See more details on the spaceflights project + tutorial in the docs: https://docs.kedro.org/en/stable/tutorial/spaceflights_tutorial.html

We're now creating all new starters based on spaceflights so all of these examples could benefit from a smaller dataset.

Related to: #2008

@astrojuanlu
Member

Now that we're reviewing the spaceflights data, I transferred #3110 to this repo

@laizaparizotto
Contributor

@merelcht You mean literally making the dataset smaller?

@merelcht
Member Author

> @merelcht You mean literally making the dataset smaller?

Yes 🙂 It's just an example project to demonstrate how Kedro works, so we don't care much about model accuracy. This ticket is to try and reduce the size of all input datasets: companies, reviews and shuttles.
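
A minimal sketch of one way such a reduction could be done while keeping the three tables joinable: sample a subset of shuttles, then keep only the reviews and companies that those shuttles reference. The column names (`id`, `company_id`, `shuttle_id`), the helper names, and the toy rows are assumptions for illustration; this is not the approach taken in the actual fix.

```python
import random


def sample_rows(rows, n, seed=0):
    """Return up to n rows chosen reproducibly, preserving original order."""
    if n >= len(rows):
        return list(rows)
    rng = random.Random(seed)
    keep = set(rng.sample(range(len(rows)), n))
    return [row for i, row in enumerate(rows) if i in keep]


def reduce_tables(companies, shuttles, reviews, n_shuttles, seed=0):
    """Shrink all three tables without leaving dangling references.

    Assumed (hypothetical) keys: shuttles have "id" and "company_id",
    reviews have "shuttle_id", companies have "id".
    """
    small_shuttles = sample_rows(shuttles, n_shuttles, seed)
    shuttle_ids = {s["id"] for s in small_shuttles}
    company_ids = {s["company_id"] for s in small_shuttles}
    small_reviews = [r for r in reviews if r["shuttle_id"] in shuttle_ids]
    small_companies = [c for c in companies if c["id"] in company_ids]
    return small_companies, small_shuttles, small_reviews


# Toy stand-ins for the real files (made-up rows, not the tutorial data).
companies = [{"id": c} for c in range(5)]
shuttles = [{"id": s, "company_id": s % 5} for s in range(100)]
reviews = [{"shuttle_id": s, "score": 90} for s in range(100)]

small_companies, small_shuttles, small_reviews = reduce_tables(
    companies, shuttles, reviews, n_shuttles=10
)
print(len(small_companies), len(small_shuttles), len(small_reviews))
```

Using a fixed seed keeps the reduced files reproducible, which makes the change easier to review. Reading and writing the real .csv and .xlsx files is left out here and would need a library such as pandas.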

@merelcht
Member Author

> Should this also be addressed at kedro-starters?

Yes this should indeed be done in kedro-starters.

> I saw the shuttles.xlsx file is located in 4 different paths:

> To check if my understanding is correct, should changes be made to these .xlsx files in all these locations?

I'd suggest making the change first for just spaceflights and then, once it's reviewed, it can be replicated for the other spaceflights-... repositories. Is that okay? It will be faster to get it reviewed that way.

> And is the problem related only to shuttles.xlsx or also companies.csv and reviews.csv?

Ideally all the datasets would be reduced in size if that's possible 🙂

@stichbury
Contributor

Hi @laizaparizotto Is this a ticket you are actively working on? Shall I assign it to you and mark as "in progress"?

@laizaparizotto
Contributor

Hi @stichbury, I will be able to work on it this weekend. If not a problem, yes, you can assign that to me :D.

@laizaparizotto
Contributor

laizaparizotto commented Oct 15, 2023

I left a comment in the PR asking whether I should also open a PR in this repo.

@merelcht merelcht moved this from In Progress to In Review in Kedro Framework Oct 30, 2023
@github-project-automation github-project-automation bot moved this from In Review to Done in Kedro Framework Oct 31, 2023