Fix external_executor_id not being set for manually run jobs. #17207

Jorricks · 2021-07-25T11:51:11Z

Summary of the problem:
Currently the Celery task_id is stored only after a scheduler has launched a task in its executor.
The Celery task_id is then put on the event_buffer, and the scheduler periodically reads the event_buffer and stores the value as the external_executor_id.
Manually triggered tasks enter the adoption flow, because their executor instances exist only to launch that one specific task. Since their external_executor_id is not set, they are not adopted; instead they get killed. This means that any manually triggered task that has no external_executor_id from a previous scheduled run may be killed if the adoption routine kicks in while the task is still running.
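
To make the failure mode concrete, here is a minimal, self-contained sketch of the adoption decision described above. It is a simplified model and not Airflow's code: `FakeTaskInstance`, `try_adopt`, and `known_celery_ids` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass
class FakeTaskInstance:
    """Hypothetical stand-in for a running task instance row."""

    task_id: str
    external_executor_id: Optional[str] = None  # the Celery task id, if it was stored


def try_adopt(
    orphans: Iterable[FakeTaskInstance], known_celery_ids: Set[str]
) -> Tuple[List[str], List[str]]:
    """Split orphaned tasks into those that can be adopted and those that cannot.

    A task can only be adopted if it carries an id the executor can look up.
    """
    adopted, not_adopted = [], []
    for ti in orphans:
        if ti.external_executor_id in known_celery_ids:
            adopted.append(ti.task_id)
        else:
            # No stored id: the task cannot be matched to a Celery job.
            not_adopted.append(ti.task_id)
    return adopted, not_adopted


scheduled = FakeTaskInstance("scheduled_task", external_executor_id="celery-123")
manual = FakeTaskInstance("manual_task")  # the id was never written back
adopted, not_adopted = try_adopt([scheduled, manual], {"celery-123"})
print(adopted, not_adopted)
```

In this model the manually triggered task ends up in `not_adopted` purely because its id was never stored, which mirrors the situation the summary describes.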

Solution:
I could imagine two solutions:

1. After every manually triggered task, we read the event_buffer and store that in the task instances.
2. Every task that is triggered for Celery Executors automatically stores its external_executor_id at the start-up of the task.

I implemented both but found the second version nicer.
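
A rough sketch of the idea behind the second option: the worker-side code knows its own Celery task id and writes it to the task instance as soon as the task starts, so nothing depends on a scheduler draining the event_buffer. Everything here is a hypothetical model (`FAKE_DB`, `run_task`, `celery_task_id`); it uses neither Airflow nor Celery and is not the code in this PR.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for the task instance table: key -> external_executor_id.
FAKE_DB: Dict[str, Optional[str]] = {"manual_task": None}


def run_task(ti_key: str, celery_task_id: str, work: Callable[[], None]) -> None:
    """Record the Celery task id first, then run the actual work."""
    # Stored at start-up, so the id is present for the whole time the task runs.
    FAKE_DB[ti_key] = celery_task_id
    work()


seen_during_run: List[Optional[str]] = []
run_task(
    "manual_task",
    "celery-456",
    lambda: seen_during_run.append(FAKE_DB["manual_task"]),
)
print(seen_during_run)
```

The point of the ordering is that the id is already visible while the work is still in progress, which is when an adoption routine would look for it.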
I am looking for some feedback so please provide me with any you can think of :)

Opened issues that are related:
related: #16023

Jorricks · 2021-07-25T11:54:33Z

The other idea:

After every manually triggered task, we read the event_buffer and store that in the task instances.

It would look something like adding this to the run() method of class TaskInstanceModelView:

            # Copy the Celery task id from the executor's event buffer onto the task instance.
            event_buffer = executor.get_event_buffer()
            for ti_key, (state, external_executor_id) in event_buffer.items():
                if ti.key == ti_key and state == State.QUEUED:
                    # Set the id before merging so the merged copy carries it.
                    ti.external_executor_id = external_executor_id
                    session.merge(ti)
                    session.commit()
                    updated = True

mik-laj · 2021-07-27T05:22:39Z

airflow/cli/commands/task_command.py

+    if "external_executor_id" in args:
+        return args.external_executor_id
+    elif "external_executor_id" in os.environ:
+        return args.external_executor_id


Should we read env variable here?

Good catch! Will change that.

What did you think of the overall implementation, @mik-laj?

Fixed the reference.
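
For illustration, one way the lookup could read the environment variable in the second branch instead of `args`. This is only a sketch: the function name `extract_external_executor_id` and the use of `argparse.Namespace` are assumptions made for the example, and the environment variable name is simply taken from the snippet above.

```python
import argparse
import os
from typing import Optional


def extract_external_executor_id(args: argparse.Namespace) -> Optional[str]:
    """Return the id from the parsed args, falling back to the environment."""
    if "external_executor_id" in args:
        return args.external_executor_id
    # Fall back to the environment variable; this is None when it is not set.
    return os.environ.get("external_executor_id")


print(extract_external_executor_id(argparse.Namespace(external_executor_id="abc")))
```

Using `os.environ.get` also removes the need for a separate `elif` membership check, since a missing variable simply yields `None`.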

Jorricks · 2021-08-01T14:00:15Z

Rebased on latest main.

ephraimbuddy

I feel we should not make this change but dig more to reliably reproduce the linked issue.
On my first try, I was able to see Scheduler marked as failed, but on several tries since, I can't reproduce it. I ended up with a task stuck in running, which is related to another issue (still trying to find out how I got there).

airflow/cli/commands/task_command.py

Jorricks · 2021-08-19T11:19:10Z

First of all, thanks for taking a look at the PR!

There are a couple of things that need to hold for you to be able to reproduce it:

1. Before starting the task, the task should not have run before, or at least its external_executor_id must be None.
2. The task must still be running when the adoption flow of one of the schedulers kicks in.

I am able to reliably reproduce this issue when both conditions hold.

However, I am not really sure what you mean by digging more. Unless the whole executor part is rewritten, this is the best I could come up with. Could you please be a bit more explicit?

ephraimbuddy · 2021-08-19T13:44:46Z

I did reproduce it initially but can't reproduce it again. I have done an airflow db reset in breeze but still can't reproduce it. I do not fully understand this area, so let's hear from others.

Jorricks · 2021-08-21T00:24:36Z

@ephraimbuddy please make sure you are using a form of a Celery Executor, as this ticket is only for celery setups.

ephraimbuddy

This now makes sense to me and will make it possible for CeleryExecutors to adopt tasks from failed SchedulerJobs when tasks are run from the UI with Ignore All deps.

Can you add some tests?

airflow/cli/commands/task_command.py

airflow/executors/celery_executor.py

Jorricks · 2021-08-23T19:59:05Z

I tried to add some tests on the CLI task part. That is pretty much done.
However, I had quite some trouble wrapping my head around a decent test approach on the celery_executor part.
There is currently not really a test for any of the functions I modified which makes me wonder if I should add them.
If so, do you have any remarks on how I could best do that?

Jorricks · 2021-08-24T08:33:55Z

CI/CD failed on transient errors, not related to this PR :)

ephraimbuddy

For the methods modified in CeleryExecutor, you can mock them and assert they were called with a celeryID.
The app can be mocked to return an ID. See airflow/tests/executors/test_celery_executor.py, line 75 in 3d96ad6:

            patch_app = mock.patch('airflow.executors.celery_executor.app', test_app)

You can provide appID there.
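
As an illustration of that suggestion, here is a minimal sketch using `unittest.mock`. It does not import Airflow or Celery; `FakeExecutor`, `send_task`, `record_external_id`, and `trigger` are hypothetical names standing in for the real executor and app, and `"fake-celery-id"` is made-up test data.

```python
from unittest import mock


class FakeExecutor:
    """Hypothetical executor: sends a task and records the id it gets back."""

    def __init__(self, app):
        self.app = app

    def record_external_id(self, key, celery_id):
        raise NotImplementedError  # replaced by a mock in the test below

    def trigger(self, key):
        result = self.app.send_task(key)
        self.record_external_id(key, result.id)


# Mock the app so that sending a task returns an object with a known id.
fake_app = mock.Mock()
fake_app.send_task.return_value = mock.Mock(id="fake-celery-id")

executor = FakeExecutor(fake_app)
with mock.patch.object(executor, "record_external_id") as recorder:
    executor.trigger("my_task")

# The modified method must have been called with the id the mocked app returned.
recorder.assert_called_once_with("my_task", "fake-celery-id")
```

The same pattern should carry over to a real test: patch the app so the id is predictable, run the code path, then assert on the mocked method's call arguments.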

airflow/executors/celery_executor.py

Jorricks · 2021-08-27T06:45:36Z

I will do this after I am back from my vacation in about two weeks 👍.

Jorricks · 2021-09-07T11:50:13Z

I modified the existing test to add this behaviour.

ephraimbuddy · 2021-09-08T07:52:20Z

You have some conflicts :)

Jorricks · 2021-09-11T15:59:23Z

Fixed.

tests/cli/commands/test_task_command.py

tests/executors/test_celery_executor.py

Jorricks · 2021-09-13T12:12:39Z

So @ephraimbuddy, do you think it's ready to be merged now or?

ephraimbuddy · 2021-09-13T13:15:19Z

Looks like you have a test failure.

Jorricks · 2021-09-13T19:36:57Z

Tests should be fixed now 👍

Jorricks · 2021-09-14T12:01:18Z

Can someone please re-trigger the pipeline :)?

ephraimbuddy · 2021-09-14T12:09:56Z

You can actually retrigger it by closing and opening the PR

Jorricks · 2021-09-14T13:04:04Z

Didn't know I could re-open it myself. Thank you!

Jorricks · 2021-09-14T13:05:36Z

AFAIK, the failures in the tests are not related to this PR.

Jorricks · 2021-09-14T19:35:21Z

Rebased on latest main again to see if that makes it better.

Jorricks · 2021-09-15T06:18:49Z

Yes the remaining failure is not related to this PR.

github-actions · 2021-09-15T10:27:10Z

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

ephraimbuddy · 2021-09-15T10:28:33Z

cc: @ashb @kaxil

ashb · 2021-09-15T12:09:39Z

Should be fixed on main now, so re-triggering build :)

Jorricks requested review from ashb, kaxil and XD-DENG as code owners July 25, 2021 11:51

boring-cyborg bot added area:CLI area:Scheduler including HA (high availability) scheduler labels Jul 25, 2021

mik-laj reviewed Jul 27, 2021

Jorricks force-pushed the fix-external-executor-id branch from 07e3e68 to 24804a7 August 1, 2021 13:59

ephraimbuddy requested changes Aug 19, 2021

airflow/cli/commands/task_command.py (outdated)

ephraimbuddy requested changes Aug 23, 2021

airflow/cli/commands/task_command.py

Jorricks commented Aug 23, 2021

airflow/executors/celery_executor.py (outdated)

Jorricks force-pushed the fix-external-executor-id branch from c9f0505 to 204cb32 August 23, 2021 19:55

Jorricks changed the title from "WIP: Fix external_executor_id not being set for manually run jobs." to "Fix external_executor_id not being set for manually run jobs." Aug 23, 2021

ephraimbuddy requested changes Aug 24, 2021

airflow/executors/celery_executor.py (outdated)

ephraimbuddy reviewed Sep 11, 2021

tests/cli/commands/test_task_command.py (outdated)

tests/executors/test_celery_executor.py (outdated)

tests/executors/test_celery_executor.py (outdated)

Jorricks force-pushed the fix-external-executor-id branch from aece1ec to e9921d5 September 13, 2021 17:51

ephraimbuddy closed this Sep 14, 2021

ephraimbuddy reopened this Sep 14, 2021

Manual triggering task with celery executor fix

0cbe996

Jorricks force-pushed the fix-external-executor-id branch from 17a2e45 to 0cbe996 September 14, 2021 19:35

ephraimbuddy approved these changes Sep 15, 2021

github-actions bot added the "full tests needed" label (We need to run full set of tests for this PR to merge) Sep 15, 2021

ashb approved these changes Sep 15, 2021

ashb closed this Sep 15, 2021

ashb reopened this Sep 15, 2021

ashb merged commit e7925d8 into apache:main Sep 15, 2021

kaxil added this to the Airflow 2.2.0 milestone Sep 20, 2021
