-
Notifications
You must be signed in to change notification settings - Fork 14.7k
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement all_done_min_one_success trigger rule #28382
Comments
Thanks for opening your first issue here! Be sure to follow the issue template! |
Is there any reason you can't have an "all_done" task that checks if there is an upstream task that succeeded and trigger data loading then?
IMHO, it is fully doable, does not require to add any complexity and adding more complex trigger rules, it very well reflects your use case (you anyhow need to wait with the decision until all of the ingestion tasks are done), and it also gives you clear indication of the case where the "CHECK ALL UPSTREAMS" task will clearly inform you that this was the reason. Checking the state of those tasks can be easily done by XCom (and if you do the "CHECK" task as @task decorated operator, it is as simple as checking the output of all of the tasks. Moreover the "check" task could simply pass the list of all ingested datasets as a single input tho the LOAD task, so that it does not have to rely on checking if any of the tasks failed or not. For me this is a very nice separation of concerns - having one task to decide whether to proceed or not based on deep analysis of which of those tasks succeed and which did not, and separately having a LOAD tasks that loads whatever you pass to it - without worrying about states there. I am afraid your LOAD task in your case will have to do that kind of check anyway if you would like to get the rule you want (but in a multi-function task). Splitting those seems much cleaner (and you could make yur "CHECK" task reusable across several different LOAD tasks as well - without having to implement the same "select only those that succeeded" logic. Anything wrong with this approach in your case? |
@potiuk your proposed solution sounds really good, thank you! I'd just like to remark that the data loading task doesn't necessarily have to know which data ingestion tasks actually failed/succeeded to make further decisions. E.g., it may use some kind of a pattern like So my idea was to just have a trigger rule handling such simple cases (of course, without introducing extra complexity). Yes, I agree that adding yet another auxiliary task to check the state of the upstream tasks isn't a big deal, but it's still a kind of logic that might arise again and again... Also, I believe the approach described above is pretty generic, so it can be used to imitate a lot of different behaviors, including but not limited to I know that what I'm trying to describe covers a pretty simple use case, but I really think that good software should strive to make simple things even easier (and wrong things impossible!), and in this particular case, unfortunately, there isn't an easy way to achieve the desired behavior. |
Then your LOAD task should fail if it does not find any file matching the pattern (and still can use all_done). Problem solved. |
Conveting it into discussion. I do not really see a reason for new trigger rule, seems that those patterns we have now are supporting the use cases pretty well (either by intermediate task or by simply failing when there are no outputs from upstream tasks produced) - for example with pattern matching, while using |
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
Description
The trigger rule was requested in #17010, but
none_failed_min_one_success
was proposed as a solution.I find the proposed solution unsatisfactory since
none_failed_min_one_success
isn't the same asall_done_min_one_success
(the latter rule allows some upstream tasks to fail as soon as at least one of them succeeds).I can try working on a PR if the feature is approved.
Use case/motivation
I have a few data ingestion tasks (each one ingests data from a separate external data source to the internal data lake) and a data loading task (it loads data from the data lake to the data warehouse). I want the data loading task to be triggered if at least one of the data ingestion tasks succeeds. Obviously, if all the data ingestion tasks fail, then there is no point in triggering the data loading task. But if at least one of them succeeds, then in my particular case it's better to load the just ingested data (and figure out why the other tasks failed later) rather than not loading anything at all.
Related issues
#17010
Are you willing to submit a PR?
Code of Conduct
The text was updated successfully, but these errors were encountered: