Is Kedro a good fit for data warehousing ? #360
Comments
Hi @flvndh, the short answer is: yes, you can. Kedro is a framework for building complex pipelines, and it provides abstractions for reading, transforming and writing data in a neat and tidy way. Since you're using Spark (and, by definition here, PySpark) as your execution engine, very little of the pipeline's overhead will come from Kedro; the heavy lifting is executed on the cluster via Spark API calls.

In my opinion, Kedro's real strength is as an iterative development tool, i.e. quick prototyping and keeping things organised in terms of business logic, data catalog and configuration. Moreover, it was initially designed for, and has since been thoroughly battle-tested by, teams of ~10 people building agile 4-month projects.

Once you're done prototyping and need to move to a more production / live data warehousing use case, it might make sense to introduce kedro-airflow. This allows you to port your pipelines into Airflow and take advantage of its strong points, i.e. scheduling and logging.

If you're stuck in a SQL world, dbt is the best thing I've seen there, but in the PySpark/Python world Kedro could be a good fit for what you're trying to do.
Hello! I'll go ahead and close this issue as answered, but please feel free to reopen / open a new one if you require more information!
Hello,
I'm in the process of evaluating the pros/cons of using Kedro to build out my data warehouse pipelines using Spark.
By reading the documentation, my first impression is that Kedro is very well suited for clearly scoped data projects. I'm wondering if it is a good fit for data warehousing as the scope can grow quite large as new processes are added.
Has anyone ever used Kedro in this context? What would you recommend?
Thank you for your feedback.