
Is Kedro a good fit for data warehousing? #360

Closed
flvndh opened this issue May 8, 2020 · 2 comments

Comments

@flvndh

flvndh commented May 8, 2020

Hello,

I'm in the process of evaluating the pros/cons of using Kedro to build out my data warehouse pipelines using Spark.

From reading the documentation, my first impression is that Kedro is very well suited to clearly scoped data projects. I'm wondering whether it is also a good fit for data warehousing, where the scope can grow quite large as new processes are added.

Has anyone ever used Kedro in this context? What would you recommend?

Thank you for your feedback.

@datajoely
Contributor

Hi @flvndh, the short answer is: yes, you can. Kedro is a framework for building out complex pipelines, and it provides abstractions for reading, transforming, and writing data in a neat and tidy way.
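
To make that concrete, here is a minimal sketch of those abstractions, assuming roughly the Kedro 0.16-era API (the function and dataset names are hypothetical, and the datasets themselves would be declared in `conf/base/catalog.yml`):

```python
from kedro.pipeline import Pipeline, node


def clean_orders(raw_orders):
    # Plain Python function: Kedro loads "raw_orders" from the catalog,
    # passes it in, and saves whatever is returned as "clean_orders".
    return raw_orders.dropna(subset=["order_id"])


def create_pipeline(**kwargs):
    # The pipeline wires functions together by dataset name only;
    # where the data actually lives is configured in the catalog.
    return Pipeline(
        [
            node(clean_orders, inputs="raw_orders", outputs="clean_orders"),
        ]
    )
```

Because nodes only declare dataset names, the same business logic runs unchanged whether the catalog points at local files or Spark tables.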

Since you're using Spark (and therefore PySpark) as your execution engine, very little of the pipeline's overhead will come from Kedro itself; the heavy lifting will be executed on the cluster via Spark API calls.
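
As a rough illustration, here is a sketch using the SparkDataSet that ships with Kedro 0.16+ (paths and dataset names are made up, and in a real project these entries would normally live in `conf/base/catalog.yml` rather than Python code):

```python
from kedro.extras.datasets.spark import SparkDataSet
from kedro.io import DataCatalog

# Each entry maps a dataset name used in the pipeline to a location and
# format; loads and saves go through Spark, so the work runs on the cluster.
catalog = DataCatalog(
    {
        "raw_orders": SparkDataSet(
            filepath="s3a://my-bucket/warehouse/raw/orders/",
            file_format="parquet",
        ),
        "clean_orders": SparkDataSet(
            filepath="s3a://my-bucket/warehouse/clean/orders/",
            file_format="parquet",
            save_args={"mode": "overwrite"},
        ),
    }
)
```

Kedro only orchestrates which dataset goes into which function; the reads, transformations, and writes themselves are ordinary Spark jobs.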

In my opinion, Kedro's real strength is as an iterative development tool, i.e. quick prototyping while keeping things organised in terms of business logic, data catalog, and configuration. Moreover, it was initially designed for, and has since been thoroughly battle-tested by, ~10-person teams building agile 4-month projects.

Once you're done prototyping and need to move into a more production / live data warehousing use case, it might make sense to introduce kedro-airflow. This lets you port your pipelines into Airflow and take advantage of its strong points, i.e. scheduling and logging.

If you're stuck in a SQL world, dbt is the best thing I've seen there; but in the PySpark/Python world, Kedro could be a good fit for what you're trying to do.

@lorenabalan
Contributor

Hello! I'll go ahead and close this issue as answered, but please feel free to reopen / open a new one if you require more information!
