kedro-airflow support #44
Hello @mwiewior, glad to see that you are trying the plugin out. What you describe is a common problem on how ...
@Galileo-Galilei The solution we have is quite simple: one additional step that runs nothing else than https://kedro-kubeflow.readthedocs.io/en/0.3.0/source/03_getting_started/03_mlflow.html
Do you think this method could work for Airflow as well?
Hi @szczeles, sorry for the late reply. First of all, kudos to you guys for kedro-kubeflow! I have seen the development, and I looked at how you handle the mlflow configuration with this specific issue in mind ;) Basically, you add a node which plays the role of the "before_pipeline_run" hook. I am not sure that it can work for Airflow, but keep in mind that I have almost never used it, so it's hard to be really assertive. There are a few things to consider:
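For readers unfamiliar with that pattern, a hook along these lines could configure MLflow once per pipeline run. This is a minimal sketch, not kedro-mlflow's actual implementation; the tracking URI and experiment name are placeholders.

```python
# Minimal sketch only: NOT kedro-mlflow's implementation.
# The tracking URI and experiment name below are placeholders.
import mlflow
from kedro.framework.hooks import hook_impl


class MlflowPipelineHook:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        mlflow.set_tracking_uri("http://localhost:5000")  # placeholder
        mlflow.set_experiment("demo-experiment")          # placeholder
        # One MLflow run wraps the whole Kedro pipeline execution.
        mlflow.start_run()

    @hook_impl
    def after_pipeline_run(self, run_params, pipeline, catalog):
        mlflow.end_run()
```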
I must confess that since I do not use Airflow as a scheduler in my day-to-day life, this has not been of paramount importance to me. I plan to support Airflow, but I have no timeline to provide right now; I'd say likely not before this summer. PRs are welcome if you really need it soon!

P.S.: This might be a bit off topic and is more general than this specific issue, but I have seen that almost all Kedro plugins or tutorials for deployment to an orchestrator (Airflow, Kubeflow, Prefect, Argo...) tend to simply convert the Kedro pipelines into another pipeline in the target tool. I don't feel this is the right way to deploy ML applications in general, because Kedro pipelines often contain a lot of nodes with very small operations where no data is persisted between nodes, especially for the ML pipelines. From an orchestration point of view, such a pipeline is likely a single node that must be executed once, possibly on dedicated infrastructure (GPU...), while other pipelines (for heavy feature extraction or engineering) might need a different infrastructure / orchestration timeline. In a nutshell, I don't think there is an exact mapping between Kedro nodes (designed by a data scientist for code readability, easy debugging, partial execution...) and orchestrator nodes (designed for system robustness, ease of asynchronous execution, retry strategies, efficient compute...). Kedro nodes are, in my opinion, much more low-level than orchestrator nodes.
@Galileo-Galilei I completely agree with this assessment.

The guides are more of a tutorial than anything else on converting a Kedro pipeline to the target orchestration platform's primitives. The reason we don't turn all of them into plugins is precisely what you say here: how you slice your pipeline and map the slices to the orchestrator's nodes is up to you. A good pattern is using tags: you can map a set of nodes sharing a tag to a single orchestrator node.

Regarding Airflow & Mlflow, let me take a stab at it this weekend. We are actually in active discussion with the Airflow team. Would love to showcase Kedro x Mlflow on Airflow, maybe using your excellent plugin.
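To make the tag pattern concrete, a pipeline registry along these lines would expose one runnable slice per orchestrator task. This is a hypothetical sketch: the module path and tag names are placeholders, not part of any of the plugins discussed here.

```python
# Hypothetical sketch of slicing one Kedro pipeline by tag so that each slice
# can be mapped to a single orchestrator task; names are placeholders.
from my_project.pipeline import create_pipeline  # placeholder import


def register_pipelines():
    full = create_pipeline()
    return {
        "__default__": full,
        # Each entry can be run on its own, e.g. `kedro run --pipeline training`,
        # and therefore mapped to one Airflow / Kubeflow / Prefect task.
        "features": full.only_nodes_with_tags("features"),
        "training": full.only_nodes_with_tags("training"),
    }
```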
Thank you for your instructive comment (and the nice feedback on the plugin!) on the Kedro team's point of view on this. I had this intuition when I saw you chose tutorials rather than plugins, as you explain. This is completely in line with how I envision ML deployment: you can read in the README of this example project how I suggest an organisation in 3 "apps", each containing several Kedro pipelines which are constructed by filtering a bigger pipeline with tags (but the same applies with namespaces). This is not explicit in the example, but in my mind the objects which are going to be turned into orchestrator nodes are these smaller pipelines.

If you want to read more on this, we have a documentation PR open which describes how one can use kedro-mlflow as an MLOps framework. It is quite theoretical (and maybe more suited to a blog post than technical documentation) and focuses on the fact that we need to synchronize training and inference development, which is a big issue in ML (but not our point here), but the underlying proposed architecture is always described with deployment to an orchestrator in mind. For the record, this is very close to how my team deploys its Kedro projects in the real world, at least in its underlying principles. I have in mind to open an issue on Kedro's repo to give feedback on deployment strategies and suggest some documentation design for deployment, but I need to think it through and design it carefully, so it may take weeks (months?).

To come back to this topic, I'd be glad to come up with a solution for the interaction between the 3 tools, since it seems to be a "hot topic" for some users. It's even better if this solution is supported by the Kedro & Airflow core teams! I'd love to see what you come up with and I'll support it as much as I can, so I'll wait for your feedback.
Hi everyone, given all the discussions above and after much thought:
P.S.: @limdauto, if you come up with something you want to share after your discussions with the Airflow team, feel free to reopen the issue.
Hi - is anyone currently working on integration with kedro-airflow (or pipeline scheduling in general)? I've got it working, but the problem is that each task within a DAG is tracked under a separate run id, which of course does not make much sense here. I'm thinking of adding a feature to track the whole pipeline under the same run id when scheduled with Airflow. Any comments or hints on how to approach that are more than welcome!
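One possible way to share a single run id across the whole DAG is sketched below. It assumes each task ends up calling plain `mlflow.start_run()`, which honours the standard `MLFLOW_RUN_ID` environment variable; the DAG id, task ids and pipeline names are made up for illustration and are not part of kedro-airflow or kedro-mlflow.

```python
# Sketch only: one MLflow run shared by every Airflow task of the DAG.
# Assumes the Kedro side resumes the run via the standard MLFLOW_RUN_ID
# environment variable; DAG/task/pipeline names are placeholders, and the
# MLflow tracking URI / experiment configuration is omitted.
from datetime import datetime

import mlflow
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def create_parent_run(**context):
    # Create one MLflow run per DAG run and share its id with downstream
    # tasks via XCom (the return value is pushed automatically).
    with mlflow.start_run(run_name=context["run_id"]) as run:
        return run.info.run_id


with DAG("kedro_pipeline", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    init = PythonOperator(task_id="init_mlflow_run", python_callable=create_parent_run)

    # Every Kedro slice resumes the same MLflow run through MLFLOW_RUN_ID.
    features = BashOperator(
        task_id="features",
        bash_command=(
            "MLFLOW_RUN_ID={{ ti.xcom_pull(task_ids='init_mlflow_run') }} "
            "kedro run --pipeline features"
        ),
    )
    training = BashOperator(
        task_id="training",
        bash_command=(
            "MLFLOW_RUN_ID={{ ti.xcom_pull(task_ids='init_mlflow_run') }} "
            "kedro run --pipeline training"
        ),
    )

    init >> features >> training
```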