
Graceful termination does not work with apache chart #13591

Closed
dstandish opened this issue Jan 9, 2021 · 10 comments
Labels
area:helm-chart Airflow Helm Chart · kind:bug This is clearly a bug

Comments


dstandish commented Jan 9, 2021

In apache/airflow, the helm chart has a worker default of terminationGracePeriodSeconds: 600.
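For reference, that setting lives in the rendered worker pod spec. A minimal sketch (the field names are standard Kubernetes; the image and args shown here are illustrative, not the chart's exact rendering):

```yaml
# terminationGracePeriodSeconds controls how long the kubelet waits after
# sending SIGTERM before following up with SIGKILL.
apiVersion: v1
kind: Pod
metadata:
  name: airflow-worker
spec:
  terminationGracePeriodSeconds: 600   # chart default for the worker
  containers:
    - name: worker
      image: apache/airflow:2.0.0      # illustrative
      args: ["celery", "worker"]
```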

After deploying with 1.10.14, I observed that the worker was terminated immediately. This reproduced consistently.

I also tested with 2.0.0 and again had no luck.

Does anyone have hints on what to look into?

Here are some logs from a worker that shutdown ungracefully, running 1.10.14:

worker: Warm shutdown (MainProcess)
[2021-01-09 22:24:30,747: ERROR/MainProcess] Process 'ForkPoolWorker-15' pid:37 exited with 'signal 15 (SIGTERM)'
[2021-01-09 22:24:30,858: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).')
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/celery/worker/worker.py", line 208, in start
    self.blueprint.start(self)
  File "/home/airflow/.local/lib/python3.7/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/home/airflow/.local/lib/python3.7/site-packages/celery/bootsteps.py", line 369, in start
    return self.obj.start()
  File "/home/airflow/.local/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 318, in start
    blueprint.start(self)
  File "/home/airflow/.local/lib/python3.7/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/home/airflow/.local/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 599, in start
    c.loop(*c.loop_args())
  File "/home/airflow/.local/lib/python3.7/site-packages/celery/worker/loops.py", line 83, in asynloop
    next(loop)
  File "/home/airflow/.local/lib/python3.7/site-packages/kombu/asynchronous/hub.py", line 308, in create_loop
    events = poll(poll_timeout)
  File "/home/airflow/.local/lib/python3.7/site-packages/kombu/utils/eventio.py", line 84, in poll
    return self._epoll.poll(timeout if timeout is not None else -1)
  File "/home/airflow/.local/lib/python3.7/site-packages/celery/apps/worker.py", line 285, in _handle_request
    raise exc(exitcode)
celery.exceptions.WorkerShutdown: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/billiard/pool.py", line 1267, in mark_as_worker_lost
    human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM).
[2021-01-09 22:24:30,865: ERROR/MainProcess] Process 'ForkPoolWorker-16' pid:38 exited with 'signal 15 (SIGTERM)'

 -------------- celery@airflow-worker-66b7bf687b-8j2x5 v4.4.7 (cliffs)
--- ***** -----
-- ******* ---- Linux-4.14.209-160.335.amzn2.x86_64-x86_64-with-debian-10.6 2021-01-09 22:22:38
- *** --- * ---
- ** ---------- [config]
- ** ---------- .> app:         airflow.executors.celery_executor:0x7fc31160fd90
- ** ---------- .> transport:   redis://:**@airflow-redis:6379/0
- ** ---------- .> results:     postgresql://postgres:**@airflow-pgbouncer:6543/airflow-result-backend
- *** --- * --- .> concurrency: 16 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
 -------------- [queues]
                .> celery           exchange=celery(direct) key=celery


[tasks]
  . airflow.executors.celery_executor.execute_command

And again with 2.0.0:

[2021-01-10 06:46:21,159: INFO/MainProcess] Connected to redis://:**@airflow-redis:6379/0
[2021-01-10 06:46:21,168: INFO/MainProcess] mingle: searching for neighbors
[2021-01-10 06:46:22,208: INFO/MainProcess] mingle: all alone
[2021-01-10 06:46:22,224: INFO/MainProcess] celery@airflow-worker-66b9b6495b-6m7jd ready.
[2021-01-10 06:46:25,199: INFO/MainProcess] Events of group {task} enabled by remote.
[2021-01-10 06:47:47,441: INFO/MainProcess] Received task: airflow.executors.celery_executor.execute_command[dab010cc-72fb-4f73-8c53-05b46ca71848]
[2021-01-10 06:47:47,503: INFO/ForkPoolWorker-7] Executing command in Celery: ['airflow', 'tasks', 'run', 'standish_test_dag', 'test-secrets-backend', '2021-01-10T06:44:35.573723+00:00', '--local', '--pool', 'default_pool', '--subdir', '/opt/airflow/dags/standish_test.py']
[2021-01-10 06:47:47,549: INFO/ForkPoolWorker-7] Filling up the DagBag from /opt/airflow/dags/standish_test.py
[2021-01-10 06:47:47,830: INFO/ForkPoolWorker-7] Loading 1 plugin(s) took 0.26 seconds
[2021-01-10 06:47:47,845: WARNING/ForkPoolWorker-7] Running <TaskInstance: standish_test_dag.test-secrets-backend 2021-01-10T06:44:35.573723+00:00 [queued]> on host 10.5.21.64
[2021-01-10 06:48:17,735] {_internal.py:113} INFO - 10.5.22.61 - - [10/Jan/2021 06:48:17] "GET /log/standish_test_dag/test-secrets-backend/2021-01-10T06:44:35.573723+00:00/1.log HTTP/1.1" 404 -
[2021-01-10 06:48:17,738] {_internal.py:113} INFO - 10.5.22.61 - - [10/Jan/2021 06:48:17] "GET /log/standish_test_dag/test-secrets-backend/2021-01-10T06:44:35.573723+00:00/2.log HTTP/1.1" 200 -

With 2.0.0 there's no error, but the worker is still terminated immediately, without respecting the grace period.

I tried various combinations of args and saw the same behavior every time:

  • ["bash", "-c", "airflow worker"]
  • ["bash", "-c", "exec airflow worker"]
  • ["worker"]
@dstandish dstandish added the kind:bug This is a clearly a bug label Jan 9, 2021
@vikramkoka vikramkoka added the area:helm-chart Airflow Helm Chart label Jan 9, 2021

dstandish commented Jan 10, 2021

What does work:

I verified, at least with 2.0, that warm shutdown works with the following change to the worker deployment:

          command: ["airflow"]
          args: ["celery", "worker"]

More things that don't work

Before arriving at the above, I tried this (it didn't work):

          command: ["/usr/bin/dumb-init", "--", "airflow"]
          args: ["celery", "worker"]

This also didn't work:

          command: ["/usr/bin/dumb-init"]
          args: ["airflow", "celery", "worker"]

It produced more of these errors:

[2021-01-10 09:36:09,037: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).')

@dstandish
Contributor Author

@potiuk I think you are the main architect of the dockerfile. Do you know what's going on here? I don't really understand this area very well... dumb-init / tini / gosu, and what happens when they are combined with entrypoints and args... though I'd like to!


dstandish commented Jan 10, 2021

Astronomer's helm chart uses gosu and tini, it seems. FWIW, at my previous company we used Astronomer EE and termination did work there.

@xinbinhuang
Contributor

This issue from celery might be relevant: celery/billiard#273

Some people say it's Linux distro related: celery/billiard#273 (comment).


potiuk commented Jan 11, 2021

I will take a look later this week. It also depends on which command is used to run airflow components. You are talking about the current master version of the chart, yeah? No modification to the entrypoint or command?

dumb-init and tini are equivalent, and they are indeed there to forward signals to the running processes. This really matters when you have a bash script as the entrypoint: if bash is the direct entrypoint, it will not forward signals to its children. There are two ways to solve it:

a) dumb-init or tini as the entrypoint
b) exec 'binary' at the end of the bash script (exec-ing another bash won't work)
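The signal-forwarding behaviour described in (a) and (b) can be sketched in plain Python (the wrapper snippets and the child_survives_sigterm helper are hypothetical stand-ins for bash vs dumb-init, not airflow or chart code):

```python
import os
import signal
import subprocess
import sys
import time

# Wrapper 1: spawns the "worker" and waits, like a bash entrypoint without
# exec. SIGTERM kills the wrapper itself but is never forwarded to the child.
WRAPPER_NO_FORWARD = """
import subprocess, sys
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
print(child.pid, flush=True)
child.wait()
"""

# Wrapper 2: like dumb-init/tini, it installs a handler that forwards
# SIGTERM to the child, so the worker gets a chance at a warm shutdown.
WRAPPER_FORWARD = """
import signal, subprocess, sys
child = None
signal.signal(signal.SIGTERM, lambda s, f: child and child.send_signal(s))
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
print(child.pid, flush=True)
child.wait()
"""

def child_survives_sigterm(wrapper_code):
    """SIGTERM the wrapper and report whether its child is still alive."""
    wrapper = subprocess.Popen([sys.executable, "-c", wrapper_code],
                               stdout=subprocess.PIPE, text=True)
    child_pid = int(wrapper.stdout.readline())
    wrapper.send_signal(signal.SIGTERM)
    try:
        wrapper.wait(timeout=10)
    except subprocess.TimeoutExpired:
        wrapper.kill()
    time.sleep(0.5)
    try:
        os.kill(child_pid, 0)                  # probe: does the pid still exist?
        os.kill(child_pid, signal.SIGKILL)     # clean up the orphaned child
        return True
    except ProcessLookupError:
        return False
```

With the first wrapper the child outlives the SIGTERM (the signal died with the wrapper); with the forwarding wrapper the child terminates. exec in a bash entrypoint avoids the problem entirely by replacing the wrapper with the worker, so there is no middleman to miss the signal.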

The default entrypoint in the prod image is dumb-init, so it should propagate the signals properly. But as @xinbinhuang mentioned, the celery worker has a number of config options: when you send a SIGTERM to a celery worker, it will stop spawning new processes and wait for all the running tasks to terminate. So by definition the worker might take quite some time to exit. The termination grace period controls how long celery waits for all processes to terminate before it does a 'kill -9' and exits non-gracefully.

There is also another gotcha: if you send a SECOND SIGTERM to such a celery worker while it is waiting for tasks, it will terminate all the processes with 'kill -9' and exit immediately.

So if you observed the worker terminating immediately, you might actually have observed wrong behaviour where someone sent more than one SIGTERM to those workers (I've seen such setups), but this is a rather bad idea IMHO.


dstandish commented Jan 11, 2021

I will take a look later this week. It also depends on which command is used to run airflow components. You are talking about the current master version of the chart, yeah? No modification to the entrypoint or command?

Correct, no mods to the entrypoint. You can see which things I tried in the helm config above -- different values of args or command.

So if you observed the worker terminating immediately, you might actually have observed wrong behaviour where someone sent more than one SIGTERM to those workers (I've seen such setups), but this is a rather bad idea IMHO

No, I do not want the worker to terminate immediately. I want it to do what it is supposed to do, namely a warm shutdown -- i.e. stop taking tasks and run until either all tasks are done or the grace period has elapsed.

@dstandish dstandish changed the title Graceful termination does not work with apache chart running 1.10.14 Graceful termination does not work with apache chart Jan 11, 2021
@dstandish
Contributor Author

And to clarify @potiuk, yes, it is latest master.


auvipy commented Mar 4, 2021

This sounds interesting: celery/billiard#273 (comment)

@anitakar
Contributor

So does this one: celery/billiard#273 (comment)


dstandish commented Jun 2, 2021

I think we can consider this resolved by #16153.

Graceful termination still does not work out of the box with the released 1.0.0 chart, but with that PR you can use the command/args combination that works, namely this:

          command: ["airflow"]
          args: ["celery", "worker"]
