
Implement all_done_min_one_success trigger rule #28382

Closed
2 tasks done
gdavoian opened this issue Dec 15, 2022 · 5 comments

@gdavoian
Contributor

gdavoian commented Dec 15, 2022

Description

The trigger rule was requested in #17010, but none_failed_min_one_success was proposed as a solution.

I find the proposed solution unsatisfactory since none_failed_min_one_success isn't the same as all_done_min_one_success (the latter rule allows some upstream tasks to fail, as long as at least one of them succeeds).

I can try working on a PR if the feature is approved.
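To make the distinction concrete, here is a minimal plain-Python model of the two rules over a set of upstream task states (a sketch only; real Airflow evaluates trigger rules internally against TaskInstanceState values, not with functions like these):

```python
# Simplified model of the two trigger rules over upstream task states.
# States are plain strings here; real Airflow uses TaskInstanceState enums.
DONE_STATES = {"success", "failed", "skipped", "upstream_failed"}
FAILED_STATES = {"failed", "upstream_failed"}

def none_failed_min_one_success(states):
    """Fire only if no upstream failed AND at least one succeeded."""
    return (all(s in DONE_STATES for s in states)
            and not any(s in FAILED_STATES for s in states)
            and any(s == "success" for s in states))

def all_done_min_one_success(states):
    """Fire once all upstreams are done, as long as at least one succeeded."""
    return (all(s in DONE_STATES for s in states)
            and any(s == "success" for s in states))

# One ingestion failed, one succeeded, one was skipped:
states = ["success", "failed", "skipped"]
print(none_failed_min_one_success(states))  # False - the failure blocks it
print(all_done_min_one_success(states))     # True - the requested rule fires
```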

Use case/motivation

I have a few data ingestion tasks (each one ingests data from a separate external data source to the internal data lake) and a data loading task (it loads data from the data lake to the data warehouse). I want the data loading task to be triggered if at least one of the data ingestion tasks succeeds. Obviously, if all the data ingestion tasks fail, then there is no point in triggering the data loading task. But if at least one of them succeeds, then in my particular case it's better to load the just ingested data (and figure out why the other tasks failed later) rather than not loading anything at all.

Related issues

#17010

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@gdavoian gdavoian added the kind:feature Feature Requests label Dec 15, 2022
@boring-cyborg

boring-cyborg bot commented Dec 15, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

@potiuk
Member

potiuk commented Dec 15, 2022

Is there any reason you can't have an "all_done" task that checks whether any upstream task succeeded and triggers data loading then?

INGEST 1. ----\

INGEST 2  ----- (all_done) --> CHECK ANY UPSTREAM SUCCEEDED [fail if none of the upstreams succeeded] -> LOAD 

INGEST 3  ----/

IMHO, it is fully doable, does not require adding any complexity or more complex trigger rules, it reflects your use case very well (you need to wait for all of the ingestion tasks to finish before deciding anyway), and the "CHECK ANY UPSTREAM SUCCEEDED" task also gives you a clear indication of the reason when everything upstream fails.

Checking the state of those tasks can easily be done via XCom (and if you implement the "CHECK" task as a @task-decorated operator, it is as simple as checking the output of all of the tasks). Moreover, the "check" task could simply pass the list of all ingested datasets as a single input to the LOAD task, so that it does not have to rely on checking whether any of the tasks failed.

For me this is a very nice separation of concerns: one task decides whether to proceed based on a deep analysis of which of those tasks succeeded and which did not, and a separate LOAD task loads whatever you pass to it, without worrying about states. I am afraid your LOAD task would have to do that kind of check anyway to get the rule you want (but then as a multi-function task). Splitting those seems much cleaner (and you could make your "CHECK" task reusable across several different LOAD tasks as well, without having to implement the same "select only those that succeeded" logic each time).

Anything wrong with this approach in your case?
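As a rough illustration of that separation of concerns (plain Python rather than an actual Airflow DAG; the function and variable names here are hypothetical, and in Airflow these would be @task-decorated functions wired with the all_done trigger rule):

```python
# Plain-Python sketch of the "check then load" pattern described above.

def check_any_upstream_succeeded(ingest_results):
    """Collect outputs of the ingestion tasks; failed tasks yield None.
    Fail (raise) if nothing succeeded, otherwise pass the successful
    datasets downstream - the LOAD task never inspects task states."""
    succeeded = [r for r in ingest_results if r is not None]
    if not succeeded:
        raise RuntimeError("all upstream ingestion tasks failed")
    return succeeded

def load(datasets):
    """Loads whatever it is given, with no knowledge of upstream states."""
    return f"loaded {len(datasets)} dataset(s)"

# INGEST 2 failed (None), the other two produced data:
results = ["source_a.json", None, "source_c.json"]
print(load(check_any_upstream_succeeded(results)))  # loaded 2 dataset(s)
```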

@gdavoian
Contributor Author

gdavoian commented Dec 15, 2022

@potiuk your proposed solution sounds really good, thank you!

I'd just like to remark that the data loading task doesn't necessarily have to know which data ingestion tasks actually failed/succeeded to make further decisions. E.g., it may use some kind of a pattern like YYYY/mm/dd/**.json to load the current/latest data from the data lake into the data warehouse (whatever files it was able to find matching the pattern).

So my idea was to just have a trigger rule handling such simple cases (of course, without introducing extra complexity). Yes, I agree that adding yet another auxiliary task to check the state of the upstream tasks isn't a big deal, but it's still a kind of logic that might arise again and again...

Also, I believe the approach described above is pretty generic, so it can be used to imitate a lot of different behaviors, including but not limited to none_failed, none_skipped, none_failed_min_one_success. Since the latter ones actually have their own trigger rules, I don't see a reason why the all_done_min_one_success behavior doesn't deserve a trigger rule on its own :)

I know that what I'm trying to describe covers a pretty simple use case, but I really think that good software should strive to make simple things even easier (and wrong things impossible!), and in this particular case, unfortunately, there isn't an easy way to achieve the desired behavior.

@potiuk
Member

potiuk commented Dec 16, 2022

I'd just like to remark that the data loading task doesn't necessarily have to know which data ingestion tasks actually failed/succeeded to make further decisions. E.g., it may use some kind of a pattern like YYYY/mm/dd/**.json to load the current/latest data from the data lake into the data warehouse (whatever files it was able to find matching the pattern).

Then your LOAD task should fail if it does not find any file matching the pattern (and still can use all_done). Problem solved.
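A minimal sketch of such a self-failing LOAD step (the directory layout, concrete date, and function name are made up for illustration; the point is only that an empty glob result fails the task, so plain all_done suffices upstream):

```python
# Sketch of a LOAD task that fails by itself when no ingested files
# match the expected pattern. The YYYY/mm/dd layout is illustrative.
import glob
import os
import tempfile

def load_latest(data_lake_root, pattern="2022/12/16/*.json"):
    matches = glob.glob(os.path.join(data_lake_root, pattern))
    if not matches:
        # Failing here marks the task failed, exactly as desired when
        # every upstream ingestion produced nothing.
        raise FileNotFoundError(f"no files match {pattern!r}; failing LOAD")
    return sorted(matches)

# Demo: a data lake with a single ingested file.
root = tempfile.mkdtemp()
day_dir = os.path.join(root, "2022", "12", "16")
os.makedirs(day_dir)
open(os.path.join(day_dir, "source_a.json"), "w").close()
print(len(load_latest(root)))  # 1
```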

@potiuk
Member

potiuk commented Dec 16, 2022

Converting it into a discussion. I do not really see a reason for a new trigger rule; the patterns we have now seem to support these use cases pretty well (either via an intermediate task, or by simply failing when the upstream tasks produced no outputs, for example with pattern matching, while using the all_done trigger rule). IMHO those cases do not justify a new rule. But if others think differently, the discussion is open, and a PR might be created out of it if enough people are convinced it should be done.

@apache apache locked and limited conversation to collaborators Dec 16, 2022
@potiuk potiuk converted this issue into discussion #28396 Dec 16, 2022

This issue was moved to a discussion.

