
Implement all_done_min_one_success trigger rule #28382

Closed
2 tasks done
gdavoian opened this issue Dec 15, 2022 · 5 comments

@gdavoian
Contributor

gdavoian commented Dec 15, 2022

Description

The trigger rule was requested in #17010, but none_failed_min_one_success was proposed as a solution.

I find the proposed solution unsatisfactory since none_failed_min_one_success isn't the same as all_done_min_one_success (the latter rule allows some upstream tasks to fail, as long as at least one of them succeeds).

I can try working on a PR if the feature is approved.
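To make the distinction concrete, here is a minimal plain-Python model of the two rules over a set of upstream task states (a sketch only; real Airflow evaluates trigger rules internally against TaskInstanceState values, not with functions like these):

```python
# Simplified model of the two trigger rules over upstream task states.
# States are plain strings here; real Airflow uses TaskInstanceState enums.
DONE_STATES = {"success", "failed", "skipped", "upstream_failed"}
FAILED_STATES = {"failed", "upstream_failed"}

def none_failed_min_one_success(states):
    """Fire only if no upstream failed AND at least one succeeded."""
    return (all(s in DONE_STATES for s in states)
            and not any(s in FAILED_STATES for s in states)
            and any(s == "success" for s in states))

def all_done_min_one_success(states):
    """Fire once all upstreams are done, as long as at least one succeeded."""
    return (all(s in DONE_STATES for s in states)
            and any(s == "success" for s in states))

# One ingestion failed, one succeeded, one was skipped:
states = ["success", "failed", "skipped"]
print(none_failed_min_one_success(states))  # False - the failure blocks it
print(all_done_min_one_success(states))     # True - the requested rule fires
```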

Use case/motivation

I have a few data ingestion tasks (each one ingests data from a separate external data source to the internal data lake) and a data loading task (it loads data from the data lake to the data warehouse). I want the data loading task to be triggered if at least one of the data ingestion tasks succeeds. Obviously, if all the data ingestion tasks fail, then there is no point in triggering the data loading task. But if at least one of them succeeds, then in my particular case it's better to load the just ingested data (and figure out why the other tasks failed later) rather than not loading anything at all.

Related issues

#17010

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@gdavoian gdavoian added the kind:feature Feature Requests label Dec 15, 2022
@boring-cyborg

boring-cyborg bot commented Dec 15, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

@potiuk
Member

potiuk commented Dec 15, 2022

Is there any reason you can't have an "all_done" task that checks whether any upstream task succeeded and triggers data loading then?

INGEST 1. ----\

INGEST 2  ----- (all_done) --> CHECK ANY UPSTREAM SUCCEEDED [fail if none of the upstreams succeeded] -> LOAD 

INGEST 3  ----/

IMHO, it is fully doable, does not require adding any complexity or more complex trigger rules, it reflects your use case very well (you need to wait for all of the ingestion tasks to finish before deciding anyway), and the "CHECK ANY UPSTREAM SUCCEEDED" task also gives you a clear indication of the reason when everything upstream fails.

Checking the state of those tasks can easily be done via XCom (and if you implement the "CHECK" task as a @task-decorated operator, it is as simple as checking the output of all of the tasks). Moreover, the "check" task could simply pass the list of all ingested datasets as a single input to the LOAD task, so that it does not have to rely on checking whether any of the tasks failed.

For me this is a very nice separation of concerns: one task decides whether to proceed based on a deep analysis of which of those tasks succeeded and which did not, and a separate LOAD task loads whatever you pass to it, without worrying about states. I am afraid your LOAD task would have to do that kind of check anyway to get the rule you want (but then as a multi-function task). Splitting those seems much cleaner (and you could make your "CHECK" task reusable across several different LOAD tasks as well, without having to implement the same "select only those that succeeded" logic each time).

Anything wrong with this approach in your case?
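As a rough illustration of that separation of concerns (plain Python rather than an actual Airflow DAG; the function and variable names here are hypothetical, and in Airflow these would be @task-decorated functions wired with the all_done trigger rule):

```python
# Plain-Python sketch of the "check then load" pattern described above.

def check_any_upstream_succeeded(ingest_results):
    """Collect outputs of the ingestion tasks; failed tasks yield None.
    Fail (raise) if nothing succeeded, otherwise pass the successful
    datasets downstream - the LOAD task never inspects task states."""
    succeeded = [r for r in ingest_results if r is not None]
    if not succeeded:
        raise RuntimeError("all upstream ingestion tasks failed")
    return succeeded

def load(datasets):
    """Loads whatever it is given, with no knowledge of upstream states."""
    return f"loaded {len(datasets)} dataset(s)"

# INGEST 2 failed (None), the other two produced data:
results = ["source_a.json", None, "source_c.json"]
print(load(check_any_upstream_succeeded(results)))  # loaded 2 dataset(s)
```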

@gdavoian
Contributor Author

gdavoian commented Dec 15, 2022

@potiuk your proposed solution sounds really good, thank you!

I'd just like to remark that the data loading task doesn't necessarily have to know which data ingestion tasks actually failed/succeeded to make further decisions. E.g., it may use some kind of a pattern like YYYY/mm/dd/**.json to load the current/latest data from the data lake into the data warehouse (whatever files it was able to find matching the pattern).

So my idea was to just have a trigger rule handling such simple cases (of course, without introducing extra complexity). Yes, I agree that adding yet another auxiliary task to check the state of the upstream tasks isn't a big deal, but it's still a kind of logic that might arise again and again...

Also, I believe the approach described above is pretty generic, so it can be used to imitate a lot of different behaviors, including but not limited to none_failed, none_skipped, none_failed_min_one_success. Since the latter ones actually have their own trigger rules, I don't see a reason why the all_done_min_one_success behavior doesn't deserve a trigger rule on its own :)

I know that what I'm trying to describe covers a pretty simple use case, but I really think that good software should strive to make simple things even easier (and wrong things impossible!), and in this particular case, unfortunately, there isn't an easy way to achieve the desired behavior.

@potiuk
Member

potiuk commented Dec 16, 2022

I'd just like to remark that the data loading task doesn't necessarily have to know which data ingestion tasks actually failed/succeeded to make further decisions. E.g., it may use some kind of a pattern like YYYY/mm/dd/**.json to load the current/latest data from the data lake into the data warehouse (whatever files it was able to find matching the pattern).

Then your LOAD task should fail if it does not find any file matching the pattern (and still can use all_done). Problem solved.
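A minimal sketch of such a self-failing LOAD step (the directory layout, concrete date, and function name are made up for illustration; the point is only that an empty glob result fails the task, so plain all_done suffices upstream):

```python
# Sketch of a LOAD task that fails by itself when no ingested files
# match the expected pattern. The YYYY/mm/dd layout is illustrative.
import glob
import os
import tempfile

def load_latest(data_lake_root, pattern="2022/12/16/*.json"):
    matches = glob.glob(os.path.join(data_lake_root, pattern))
    if not matches:
        # Failing here marks the task failed, exactly as desired when
        # every upstream ingestion produced nothing.
        raise FileNotFoundError(f"no files match {pattern!r}; failing LOAD")
    return sorted(matches)

# Demo: a data lake with a single ingested file.
root = tempfile.mkdtemp()
day_dir = os.path.join(root, "2022", "12", "16")
os.makedirs(day_dir)
open(os.path.join(day_dir, "source_a.json"), "w").close()
print(len(load_latest(root)))  # 1
```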

@potiuk
Member

potiuk commented Dec 16, 2022

Converting it into a discussion. I do not really see a reason for a new trigger rule; the patterns we have now seem to support these use cases pretty well (either via an intermediate task, or by simply failing when the upstream tasks produced no outputs, for example with pattern matching, while using the all_done trigger rule). IMHO those cases do not justify a new rule. But if others think differently, the discussion is open, and a PR might be created out of it if enough people are convinced it should be done.

@apache apache locked and limited conversation to collaborators Dec 16, 2022
@potiuk potiuk converted this issue into discussion #28396 Dec 16, 2022

This issue was moved to a discussion.

