Celery worker enters a catatonic state after redis restart #26542

@wolfier

Apache Airflow version

main (development)

What happened

The worker appears to be stuck in a catatonic state: queued task instance messages are not consumed from Redis.

Redis restarted while the worker kept running. The worker logged the loss of connection and reconnected once Redis came back online, as shown below.

[2022-09-19 23:58:00,794: ERROR/MainProcess] consumer: Cannot connect to redis://:**@accurate-axis-9558-redis:6379/0: Error 111 connecting to accurate-axis-9558-redis:6379. Connection refused..
Trying again in 2.00 seconds... (1/100)

[2022-09-19 23:58:03,802: ERROR/MainProcess] consumer: Cannot connect to redis://:**@accurate-axis-9558-redis:6379/0: Error 111 connecting to accurate-axis-9558-redis:6379. Connection refused..
Trying again in 4.00 seconds... (2/100)

[2022-09-19 23:58:08,830: ERROR/MainProcess] consumer: Cannot connect to redis://:**@accurate-axis-9558-redis:6379/0: Error 111 connecting to accurate-axis-9558-redis:6379. Connection refused..
Trying again in 6.00 seconds... (3/100)

[2022-09-19 23:58:15,866: ERROR/MainProcess] consumer: Cannot connect to redis://:**@accurate-axis-9558-redis:6379/0: Error 111 connecting to accurate-axis-9558-redis:6379. Connection refused..
Trying again in 8.00 seconds... (4/100)

[2022-09-19 23:58:24,890: ERROR/MainProcess] consumer: Cannot connect to redis://:**@accurate-axis-9558-redis:6379/0: Error 111 connecting to accurate-axis-9558-redis:6379. Connection refused..
Trying again in 10.00 seconds... (5/100)

[2022-09-19 23:58:34,907: INFO/MainProcess] Connected to redis://:**@accurate-axis-9558-redis:6379/0
[2022-09-19 23:58:34,915: INFO/MainProcess] mingle: searching for neighbors
[2022-09-19 23:58:35,923: INFO/MainProcess] mingle: all alone
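
For reference, the wait times in the log above (2, 4, 6, 8, 10 seconds) follow a linear backoff. The sketch below only reproduces that schedule. The function and parameter names are illustrative assumptions modeled on typical broker retry settings, and the 30-second cap is an assumed value that this log never reaches, so treat it as a guess rather than confirmed behavior.

```python
from itertools import islice
from typing import Iterator


def retry_delays(interval_start: float = 2.0,
                 interval_step: float = 2.0,
                 interval_max: float = 30.0) -> Iterator[float]:
    """Yield the wait before each reconnect attempt (linear backoff, capped).

    Start and step are read off the log; interval_max is an assumption.
    """
    delay = interval_start
    while True:
        yield min(delay, interval_max)
        delay += interval_step


# First five waits, matching "Trying again in 2.00 ... 10.00 seconds" in the log.
first_five = list(islice(retry_delays(), 5))
print(first_five)
```

With this schedule the worker was still far from the 100-attempt limit shown in the log when it reconnected, so the stall is not explained by retries being exhausted.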

What you think should happen instead

After Redis comes back online and the worker reconnects, the worker should resume consuming messages and execute the queued task instances.

How to reproduce

  1. Delete the existing Redis pod. The worker can no longer connect to Redis.
  2. Wait for Redis to restart. The worker reconnects as expected.
  3. Queue new task instances. The worker does not consume the new messages.
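
One possible way to confirm step 3 from outside the worker is to sample the length of the broker queue and check that it never shrinks. This is a rough sketch, not part of the original report. It assumes the queue is stored as a Redis list keyed by the queue name, that the queue is called "default", and that the broker URL is a placeholder. The helper names `sample_lengths` and `looks_stalled` are hypothetical.

```python
import time
from typing import Callable, List


def sample_lengths(get_length: Callable[[], int],
                   samples: int = 5,
                   interval: float = 2.0,
                   sleep: Callable[[float], None] = time.sleep) -> List[int]:
    """Read the queue length `samples` times, pausing `interval` seconds between reads."""
    lengths = []
    for i in range(samples):
        lengths.append(get_length())
        if i < samples - 1:
            sleep(interval)
    return lengths


def looks_stalled(lengths: List[int]) -> bool:
    """True if the queue was non-empty throughout and never shrank."""
    return (len(lengths) >= 2
            and min(lengths) > 0
            and all(b >= a for a, b in zip(lengths, lengths[1:])))


# Usage against a real broker (needs the redis package; the URL and the
# queue name "default" are assumptions, adjust them to the deployment):
#   import redis
#   client = redis.Redis.from_url("redis://localhost:6379/0")
#   print(looks_stalled(sample_lengths(lambda: client.llen("default"))))
```

A queue that stays non-empty and keeps growing while the worker reports itself as connected would be consistent with the behaviour described here, though a slow worker could produce the same pattern over a short sampling window.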

Operating System

N/A

Versions of Apache Airflow Providers

No response

Deployment

Astronomer

Deployment details

No response

Anything else

There was a GitHub Discussion about this behaviour earlier this year.

This did not seem to be an issue on an earlier version of Celery (4.4.7).

The currently installed version is celery==5.2.7, and Redis is at version 6.2.6.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
