
Is Kedro a good fit for data warehousing? #360

Closed
flvndh opened this issue May 8, 2020 · 2 comments

Comments

@flvndh

flvndh commented May 8, 2020

Hello,

I'm in the process of evaluating the pros/cons of using Kedro to build out my data warehouse pipelines using Spark.

From reading the documentation, my first impression is that Kedro is very well suited to clearly scoped data projects. I'm wondering whether it is also a good fit for data warehousing, where the scope can grow quite large as new processes are added.

Has anyone ever used Kedro in this context? What would you recommend?

Thank you for your feedback.

@datajoely
Contributor

Hi @flvndh, the short answer is: yes, you can. Kedro is a framework for building out complex pipelines, and it provides abstractions for reading, transforming, and writing data in a neat and tidy way.
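
To make that concrete, here is a minimal sketch of those abstractions, assuming roughly the Kedro 0.16-era API (the function and dataset names are hypothetical, and the datasets themselves would be declared in `conf/base/catalog.yml`):

```python
from kedro.pipeline import Pipeline, node


def clean_orders(raw_orders):
    # Plain Python function: Kedro loads "raw_orders" from the catalog,
    # passes it in, and saves whatever is returned as "clean_orders".
    return raw_orders.dropna(subset=["order_id"])


def create_pipeline(**kwargs):
    # The pipeline wires functions together by dataset name only;
    # where the data actually lives is configured in the catalog.
    return Pipeline(
        [
            node(clean_orders, inputs="raw_orders", outputs="clean_orders"),
        ]
    )
```

Because nodes only declare dataset names, the same business logic runs unchanged whether the catalog points at local files or Spark tables.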

Since you're using Spark (and therefore PySpark) as your execution engine, very little of the pipeline's overhead will come from Kedro itself; the heavy lifting will be executed on the cluster via Spark API calls.
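
As a rough illustration, here is a sketch using the SparkDataSet that ships with Kedro 0.16+ (paths and dataset names are made up, and in a real project these entries would normally live in `conf/base/catalog.yml` rather than Python code):

```python
from kedro.extras.datasets.spark import SparkDataSet
from kedro.io import DataCatalog

# Each entry maps a dataset name used in the pipeline to a location and
# format; loads and saves go through Spark, so the work runs on the cluster.
catalog = DataCatalog(
    {
        "raw_orders": SparkDataSet(
            filepath="s3a://my-bucket/warehouse/raw/orders/",
            file_format="parquet",
        ),
        "clean_orders": SparkDataSet(
            filepath="s3a://my-bucket/warehouse/clean/orders/",
            file_format="parquet",
            save_args={"mode": "overwrite"},
        ),
    }
)
```

Kedro only orchestrates which dataset goes into which function; the reads, transformations, and writes themselves are ordinary Spark jobs.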

In my opinion, Kedro's real strength is as an iterative development tool, i.e. quick prototyping while keeping things organised in terms of business logic, data catalog, and configuration. Moreover, it was initially designed for, and has since been thoroughly battle-tested by, ~10-person teams building agile 4-month projects.

Once you're done prototyping and need to move into a more production / live data warehousing use case, it might make sense to introduce kedro-airflow. This lets you port your pipelines into Airflow and take advantage of its strong points, i.e. scheduling and logging.

If you're stuck in a SQL world, dbt is the best thing I've seen there; but in the PySpark/Python world, Kedro could be a good fit for what you're trying to do.

@lorenabalan
Contributor

Hello! I'll go ahead and close this issue as answered, but please feel free to reopen / open a new one if you require more information!
