Fix critical CeleryKubernetesExecutor bug #13247

Merged
merged 7 commits into from
Feb 3, 2021

@dstandish dstandish commented Dec 22, 2020

CeleryKubernetesExecutor is currently broken.

This PR gets it working again.

Issues resolved

1. missing job id

Before SchedulerJob starts the executor, it sets self.executor.job_id = self.id, where self.id is the primary key in the jobs table.

See here: https://github.com/apache/airflow/blob/master/airflow/jobs/scheduler_job.py#L1265.

So with KubernetesExecutor, when start is called, the executor has a value under job_id.

But with CeleryKubernetesExecutor (CKE) enabled, the actual executors are held as attributes on the CKE object, so they never receive the job ID. When the K8s executor tries to start, it fails because it doesn't find a job ID.

Resolution:

I add a setter for job_id, so that when you set job_id on the CKE, it immediately propagates the value to the child executors.

2. missing slots_available

Since this executor does not inherit from BaseExecutor, this property is missing and we must add it.

Resolution:

I add a slots_available property.

3. Celery worker does not run properly with CeleryKubernetesExecutor

When airflow celery worker is run with executor=CeleryKubernetesExecutor, the celery worker fails to fork properly.

See more details in issue #13263.

Resolution:

  • Update the helm chart so that flower and the celery workers use CeleryExecutor while the scheduler etc. use CeleryKubernetesExecutor
  • Update docs to highlight that celery workers must always be configured to use CeleryExecutor.

This is perhaps a bit of a hack, but it provides a workable way to use this executor.
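
The per-component override described above could be sketched roughly as follows. This is not the helm chart's templating: `executor_env` and the component names are hypothetical, and the only real Airflow convention relied on is that the `AIRFLOW__CORE__EXECUTOR` environment variable overrides the `executor` option in the `[core]` section.

```python
from typing import Dict

CELERY_EXECUTOR = "CeleryExecutor"
CELERY_KUBERNETES_EXECUTOR = "CeleryKubernetesExecutor"

# Hypothetical component names for the processes that should see CeleryExecutor.
CELERY_SIDE_COMPONENTS = {"worker", "flower"}


def executor_env(component: str, configured_executor: str) -> Dict[str, str]:
    """Return the executor env var for one component (illustrative helper)."""
    executor = configured_executor
    if (
        configured_executor == CELERY_KUBERNETES_EXECUTOR
        and component in CELERY_SIDE_COMPONENTS
    ):
        # Celery workers and flower do not need to know about CKE.
        executor = CELERY_EXECUTOR
    return {"AIRFLOW__CORE__EXECUTOR": executor}
```

For example, `executor_env("worker", "CeleryKubernetesExecutor")` yields `CeleryExecutor`, while the scheduler keeps the configured value.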

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Dec 22, 2020
@dstandish dstandish changed the title fix celery kubernetes executor bug preventing scheduler from starting WIP fix celery kubernetes executor bug preventing scheduler from starting Dec 22, 2020
@github-actions

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.

@dstandish dstandish changed the title WIP fix celery kubernetes executor bug preventing scheduler from starting [WIP] fix celery kubernetes executor bug preventing scheduler from starting Dec 22, 2020
@dstandish dstandish force-pushed the fix-celery-kubernetes-executor branch 2 times, most recently from a63eaa2 to f5c7010 Compare December 22, 2020 20:55
@dstandish dstandish force-pushed the fix-celery-kubernetes-executor branch from f5c7010 to 62b2886 Compare December 22, 2020 22:12
@dstandish dstandish force-pushed the fix-celery-kubernetes-executor branch from 62b2886 to a642bf7 Compare December 22, 2020 22:58
@dstandish dstandish changed the title [WIP] fix celery kubernetes executor bug preventing scheduler from starting Fix celery kubernetes executor bug preventing scheduler from starting Dec 24, 2020
@dstandish dstandish force-pushed the fix-celery-kubernetes-executor branch from a642bf7 to b2270be Compare December 24, 2020 09:52
@dstandish dstandish changed the title Fix celery kubernetes executor bug preventing scheduler from starting Fix fatal celery kubernetes executor bug preventing scheduler from starting Dec 24, 2020
@dstandish dstandish force-pushed the fix-celery-kubernetes-executor branch from b2270be to 8c34465 Compare December 24, 2020 18:46
@dstandish dstandish force-pushed the fix-celery-kubernetes-executor branch from 8c34465 to 3793d2c Compare December 25, 2020 05:28
@dstandish dstandish changed the title Fix fatal celery kubernetes executor bug preventing scheduler from starting Fix critical celery kubernetes executor bug Dec 25, 2020
@dstandish dstandish changed the title Fix critical celery kubernetes executor bug Fix critical CeleryKubernetesExecutor bug Dec 25, 2020
@dstandish dstandish force-pushed the fix-celery-kubernetes-executor branch 2 times, most recently from db4540c to 0ef3c92 Compare December 27, 2020 19:06
@dstandish dstandish force-pushed the fix-celery-kubernetes-executor branch from 0ef3c92 to 8e95086 Compare December 27, 2020 22:20
@github-actions

The Workflow run is cancelling this PR. Building images for the PR has failed. Follow the workflow link to check the reason.

@dstandish dstandish force-pushed the fix-celery-kubernetes-executor branch 4 times, most recently from 51d8f90 to 7bb8cb5 Compare January 3, 2021 05:34
@@ -30,7 +30,6 @@
from airflow.cli.commands.legacy_commands import check_legacy_command
from airflow.configuration import conf
from airflow.exceptions import AirflowException
from airflow.executors.executor_constants import CELERY_EXECUTOR, CELERY_KUBERNETES_EXECUTOR
Member

Why did you stop using these constants?

Contributor Author

not a particularly good reason, will add back

Contributor Author

@dstandish dstandish Jan 5, 2021

for context, in a previous change, when i thought that CeleryKubernetesExecutor should be consistently set across the cluster, i had added one of them.

in this pr though i was essentially reverting that change and in my mind the constants were not used prior to that change. but in actuality, the constant was used prior to my prev change -- in the if statement but not in the logging. so this is actually a removal and not just a revert of a recent change.

i will restore use of constant.

@kaxil kaxil added this to the Airflow 2.0.1 milestone Jan 27, 2021
@kaxil kaxil force-pushed the fix-celery-kubernetes-executor branch from c958271 to a6853fd Compare January 28, 2021 00:51
@dstandish dstandish force-pushed the fix-celery-kubernetes-executor branch from a6853fd to acfb0c9 Compare January 28, 2021 04:28
@kaxil kaxil force-pushed the fix-celery-kubernetes-executor branch from acfb0c9 to 7367e46 Compare January 29, 2021 00:54
message = f'celery subcommand works only with CeleryExecutor, your current executor: {executor}'
raise ArgumentError(action, message)
if value == 'celery':
if executor != CELERY_EXECUTOR:
Member

This does not cover CELERY_KUBERNETES_EXECUTOR though, right?

Member

i.e. if we don't run it via the Helm Chart, it should still work so I don't think this change is needed, apart from improving the error message

Contributor Author

@dstandish dstandish Jan 30, 2021

@kaxil the problem is that when executor==CELERY_KUBERNETES_EXECUTOR on the celery worker, the celery worker does not run properly.

you get some fork pool error

running airflow celery worker with CELERY_EXECUTOR instead resolves the issue

celery workers don't need to know that the scheduler is using CKE

i know this is a bit hacky. our code should handle having all components set to use the same executor. however, that simply doesn't work right now, and this change (forcing the C workers to think CeleryExecutor is used) makes CKE work again immediately.

perhaps it makes sense to push out this hacky fix and then later look into why we get fork pool issues. but that's a question for you guys

Member

Yea that one feels too hacky. We should at least take care of it when running "airflow celery worker" (a less hacky approach), so that users don't need to have a different configuration for the Celery Worker.

Can you paste the full stack trace of the error please too -- want to see if I have seen that before

Contributor Author

See issue #13263

@jedcunningham
Member

(Moved into the 2.0.3 milestone as it was never cherry picked into 2.0.1)

kaxil pushed a commit to astronomer/airflow that referenced this pull request Apr 26, 2021
kaxil pushed a commit that referenced this pull request Apr 26, 2021
@kaxil
Member

kaxil commented Apr 26, 2021

(Moved into the 2.0.3 milestone as it was never cherry picked into 2.0.1)

🤦 I just cherry-picked it to v2-0-test

potiuk pushed a commit to potiuk/airflow that referenced this pull request May 6, 2021
@ashb ashb modified the milestones: Airflow 2.0.3, Airflow 2.1 May 18, 2021