Fix external_executor_id not being set for manually run jobs. #17207
The other idea:
Would look something like adding this:
event_buffer = executor.get_event_buffer()
for ti_key, (state, external_executor_id) in event_buffer.items():
    if ti.key == ti_key and state == State.QUEUED:
        session.merge(ti)
        ti.external_executor_id = external_executor_id
        session.commit()
        updated = True
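The event-buffer idea above can be tried out without a database or a real executor. The following is only a minimal sketch: `FakeTaskInstance`, `apply_event_buffer`, and the string keys are hypothetical stand-ins, not Airflow classes, and the plain `QUEUED` constant stands in for `State.QUEUED`.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

QUEUED = "queued"  # stand-in for State.QUEUED (assumption for this sketch)


@dataclass
class FakeTaskInstance:
    """Hypothetical stand-in for a task instance; not Airflow's class."""
    key: str
    external_executor_id: Optional[str] = None


def apply_event_buffer(
    ti: FakeTaskInstance,
    event_buffer: Dict[str, Tuple[str, Optional[str]]],
) -> bool:
    """Copy the executor's ID onto ti if the buffer holds a QUEUED event for it."""
    updated = False
    for ti_key, (state, external_executor_id) in event_buffer.items():
        if ti.key == ti_key and state == QUEUED:
            ti.external_executor_id = external_executor_id
            updated = True
    return updated


ti = FakeTaskInstance(key="dag.task.run1")
event_buffer = {
    "dag.task.run1": (QUEUED, "celery-id-123"),
    "dag.other.run1": (QUEUED, "celery-id-456"),
}
updated = apply_event_buffer(ti, event_buffer)
print(updated, ti.external_executor_id)
```

In a real setup the update would also have to be persisted through the session, which this sketch leaves out on purpose.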
airflow/cli/commands/task_command.py
Outdated
if "external_executor_id" in args:
    return args.external_executor_id
elif "external_executor_id" in os.environ:
    return args.external_executor_id
Should we read env variable here?
Good catch! Will change that.
What did you think of the overall implementation, @mik-laj?
Fixed the reference.
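For readers following the review thread, a lookup that actually reads the environment in the fallback branch might look roughly like this. It is a sketch under assumptions: the function name `resolve_external_executor_id` is hypothetical, the environment key is simply reused from the snippet under review, and the merged code may differ.

```python
import argparse
import os
from typing import Mapping, Optional

ENV_KEY = "external_executor_id"  # key reused from the reviewed snippet (assumption)


def resolve_external_executor_id(
    args: argparse.Namespace,
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Prefer the CLI argument, then fall back to the environment variable."""
    if "external_executor_id" in args:
        return args.external_executor_id
    elif ENV_KEY in environ:
        # The fallback branch reads from the environment, not from args.
        return environ[ENV_KEY]
    return None


from_cli = resolve_external_executor_id(
    argparse.Namespace(external_executor_id="abc"), {}
)
from_env = resolve_external_executor_id(
    argparse.Namespace(), {ENV_KEY: "from-env"}
)
print(from_cli, from_env)
```

Passing the environment as a parameter is only a convenience for testing; the default still points at `os.environ`.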
Force-pushed from 07e3e68 to 24804a7.
Rebased on latest main.
I feel we should not make this change but dig more to reliably reproduce the linked issue.
On my first try, I was able to see Scheduler marked as failed, but after several more tries I can't reproduce it. I ended up with a task stuck in running, which is related to another issue (trying to find how I got to that).
First of all, thanks for taking a look at the PR! There are a couple things that need to hold for you to be able to reproduce it:
I am able to reliably reproduce this issue when these two conditions hold. However, I am not really sure what you mean by digging more. Unless the whole executor part is rewritten, this is the best I could come up with. Could you please be a bit more explicit?
I did reproduce it initially but can't reproduce it again. Have done
@ephraimbuddy please make sure you are using a form of the Celery executor, as this ticket is only for Celery setups.
This now makes sense to me and will make it possible for CeleryExecutors to adopt tasks from failed SchedulerJobs when tasks are run from the UI with Ignore All deps.
Can you add some tests?
Force-pushed from c9f0505 to 204cb32.
CI/CD failed on transient errors, not related to this PR :)
I tried to add some tests on the CLI task part. That is pretty much done.
However, I had quite some trouble wrapping my head around a decent test approach on the celery_executor part.
There is currently not really a test for any of the functions I modified which makes me wonder if I should add them.
If so, do you have any remarks on how I could best do that?
For the methods modified in CeleryExecutor, you can mock them and assert they were called with a celeryID.
The app can be mocked to return an ID. See
patch_app = mock.patch('airflow.executors.celery_executor.app', test_app)
You can provide appID there.
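The mocking approach suggested above can be sketched in isolation. Everything here is an assumption made for illustration: `TinyExecutor` is a hypothetical class, not `CeleryExecutor`, and the mocked app only imitates "send a task and get back a result with an `id`". A real test would patch `airflow.executors.celery_executor.app` as shown in the comment above.

```python
from unittest import mock


class TinyExecutor:
    """Hypothetical executor: sends a task through an app and remembers its ID."""

    def __init__(self, app):
        self.app = app
        self.external_ids = {}

    def queue(self, ti_key, command):
        result = self.app.send_task("execute_command", args=[command])
        self.external_ids[ti_key] = result.id
        return result.id


# The mocked app returns a result object whose id we control.
fake_app = mock.MagicMock()
fake_app.send_task.return_value.id = "celery-id-123"

executor = TinyExecutor(fake_app)
command = ["airflow", "tasks", "run"]
returned_id = executor.queue("dag.task.run1", command)

# Assert the app was called as expected and the ID was recorded.
fake_app.send_task.assert_called_once_with("execute_command", args=[command])
print(returned_id, executor.external_ids)
```

The same pattern applies to asserting that a modified method received the ID: replace the method with a mock and check its call arguments.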
I will do this after I am back from my vacation in about two weeks 👍.
I modified the existing test to add this behaviour.
You have some conflicts :)
Fixed.
So @ephraimbuddy, do you think it's ready to be merged now?
Looks like you have a test failure.
Force-pushed from aece1ec to e9921d5.
Tests should be fixed now 👍
Can someone please re-trigger the pipeline :)?
You can actually retrigger it by closing and opening the PR.
Didn't know I could re-open it myself. Thank you!
AFAIK, the failures in the tests are not related to this PR.
Force-pushed from 17a2e45 to 0cbe996.
Rebased on latest main again to see if that makes it better.
Yes, the remaining failure is not related to this PR.
The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.
Should be fixed on main now, so re-triggering build :)
Summary of the problem:
Currently the celery task_id is stored only after a scheduler has launched a task in its executor. The celery task_id is then put on the event_buffer, and the scheduler periodically reads out the event_buffer and stores the external_executor_id.
Because manually triggered tasks enter the adoption flow -- as their executor instances are only there for the launching of that one specific task -- and the external_executor_id is not set, they won't get adopted. Instead they get killed. Meaning, any manually triggered task that doesn't have an external_executor_id from a previous scheduled run before being launched might get killed if the adoption routine kicks in while the task is still running.
Solution:
I could imagine two solutions:
external_executor_id at the start up of the task.
I implemented both but found the second version nicer.
I am looking for some feedback so please provide me with any you can think of :)
Opened issues that are related:
related: #16023
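The problem described in the summary can be illustrated with a heavily simplified model of adoption. This is not Airflow's adoption code: `OrphanedTask` and `try_adopt` are hypothetical names, and the only rule modeled is the one stated above, namely that a task without an external_executor_id cannot be adopted.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class OrphanedTask:
    """Hypothetical record of a running task whose launcher is gone."""
    task_id: str
    external_executor_id: Optional[str]


def try_adopt(tasks: List[OrphanedTask]) -> Tuple[List[str], List[str]]:
    """Split tasks into adoptable ones (ID known) and ones that cannot be adopted."""
    adopted: List[str] = []
    not_adopted: List[str] = []
    for task in tasks:
        if task.external_executor_id is None:
            # No ID to look the task up by, so it cannot be taken over.
            not_adopted.append(task.task_id)
        else:
            adopted.append(task.task_id)
    return adopted, not_adopted


tasks = [
    OrphanedTask("scheduled_task", "celery-id-123"),
    OrphanedTask("manually_triggered_task", None),
]
adopted, not_adopted = try_adopt(tasks)
print(adopted, not_adopted)
```

Setting the ID when the task starts, as the PR proposes, would move the manually triggered task into the adoptable group in this model.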