Generic worker becomes stale #124

Open
koslib opened this issue Apr 28, 2022 · 8 comments

Comments

@koslib

koslib commented Apr 28, 2022

Hello,

I've stumbled upon an issue where generic workers (and possibly scheduled workers too) become stale at arbitrary intervals. By stale I mean they neither pick up new jobs nor process anything. The only workaround I've found so far is to kill the pods so that they get recreated, but I'm trying to automate this.

Has anyone had the same problem?

Chart Version: 3.0.0-beta1

@oedri

oedri commented May 20, 2022

I believe I'm having the same issue: after a while I get "Unknown error occurred while performing connection test" for all queries, and it seems that adhocworker gets stuck. For now this is only a guess, because none of the pods log anything indicative.

@grugnog changed the title from "generic worker becomes stable" to "Generic worker becomes stale" on Jun 13, 2022
@grugnog
Collaborator

grugnog commented Jun 13, 2022

Would be good to confirm if this is a chart specific issue or a data source connection issue - I see some reports of this message (e.g. getredash/redash#2047 & getredash/redash#5664).
If it's only a temporary data source connection issue I would expect the worker to continue once the connection is back at least (not sure exactly how it works), but if that isn't the case I am guessing it's an application bug.
We could also look for ways to improve the health check to detect when this is happening and restart the worker, although we should probably also open a ticket with the application (I guess it's conceivable this is somehow correct behavior).
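
As a rough illustration of the detection idea, here is a minimal Python sketch of a check that could back a liveness probe. It assumes the worker container runs its processes under supervisord and that the text being parsed has the shape printed by `supervisorctl status` (program name, state, details). The function names `unhealthy_programs` and `is_healthy` are hypothetical and not part of Redash or this chart.

```python
from typing import List


def unhealthy_programs(status_output: str) -> List[str]:
    """Return the names of supervised programs that are not RUNNING.

    Each non-empty line is expected to look like:
        <name>   <STATE>   <free-form details>
    """
    bad = []
    for line in status_output.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        name, state = parts[0], parts[1]
        if state != "RUNNING":
            bad.append(name)
    return bad


def is_healthy(status_output: str) -> bool:
    """True only if at least one program is listed and all are RUNNING."""
    listed = [l for l in status_output.splitlines() if len(l.split()) >= 2]
    return bool(listed) and not unhealthy_programs(status_output)
```

One way to use this would be to pipe the status text into a small script that exits non-zero when `is_healthy` returns False, so that an exec-style liveness probe fails and Kubernetes restarts the container. Whether that is the right recovery behavior for every deployment is a separate question.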

@oedri

oedri commented Jun 13, 2022

@grugnog This happened for all datasources I've tried - Postgres and Prometheus. Connections worked again after restarting the workers.

@grugnog
Collaborator

grugnog commented Jun 13, 2022

@oedri if you are able to add any detail (debug logs, strace perhaps?) it would be great if you could open a ticket regarding this on https://github.com/getredash/redash - it seems unlikely to be a Kubernetes issue, except perhaps something environmental (resource exhaustion etc) which is not really in scope of this chart, although we could adjust the docs/defaults perhaps if we identify that as the cause.
On the detection/recovery side, we already have an open issue for that (#72), so I think we can close this one.

@aberenshtein

Happened to me too.

@shubhwip
Contributor

Happening to me every day too.

@shubhwip
Contributor

@grugnog How can we enable debug logs and strace in the Redash Helm charts?

@aneagoe

aneagoe commented Apr 17, 2023

+1. There seems to be an issue respawning the process if it dies. A transient Redis issue triggers a persistent problem for the worker.
Digging through the worker logs, I can see the following:

  April 15th 2023, 02:41:35.532 Traceback (most recent call last):
  April 15th 2023, 02:41:35.532   File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 606, in _connect
  April 15th 2023, 02:41:35.532     raise err
  April 15th 2023, 02:41:35.532 TimeoutError: [Errno 110] Connection timed out
  April 15th 2023, 02:41:35.532     return _process_result(sub_ctx.command.invoke(sub_ctx))
  April 15th 2023, 02:41:35.532     return ctx.invoke(self.callback, **ctx.params)
  April 15th 2023, 02:41:35.532   File "/usr/local/lib/python3.7/site-packages/click/core.py", line 535, in invoke
  April 15th 2023, 02:41:35.532   File "/usr/local/lib/python3.7/site-packages/flask/cli.py", line 426, in decorator
  April 15th 2023, 02:41:35.532   File "/usr/local/lib/python3.7/site-packages/redis/client.py", line 1581, in exists
  April 15th 2023, 02:41:35.532     return self.execute_command('EXISTS', *names)
  April 15th 2023, 02:41:35.532     connection.connect()
  April 15th 2023, 02:41:35.532   File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 550, in connect
  April 15th 2023, 02:41:35.532     sock = self._connect()
  April 15th 2023, 02:41:35.532   File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 594, in _connect
  April 15th 2023, 02:41:35.532     sock.connect(socket_address)
  April 15th 2023, 02:41:35.532 
  April 15th 2023, 02:41:35.532     manager()
  April 15th 2023, 02:41:35.532   File "/usr/local/lib/python3.7/site-packages/click/core.py", line 722, in __call__
  April 15th 2023, 02:41:35.532   File "/usr/local/lib/python3.7/site-packages/flask/cli.py", line 586, in main
  April 15th 2023, 02:41:35.532     return super(FlaskGroup, self).main(*args, **kwargs)
  April 15th 2023, 02:41:35.532   File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
  April 15th 2023, 02:41:35.532   File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
  April 15th 2023, 02:41:35.532   File "/usr/local/lib/python3.7/site-packages/click/core.py", line 895, in invoke
  April 15th 2023, 02:41:35.532   File "/usr/local/lib/python3.7/site-packages/click/decorators.py", line 17, in new_func
  April 15th 2023, 02:41:35.532     return callback(*args, **kwargs)
  April 15th 2023, 02:41:35.532   File "/app/redash/cli/rq.py", line 49, in worker
  April 15th 2023, 02:41:35.532   File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 1182, in get_connection
  April 15th 2023, 02:41:35.532   File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 554, in connect
  April 15th 2023, 02:41:35.532     raise ConnectionError(self._error_message(e))
  April 15th 2023, 02:41:35.532 
  April 15th 2023, 02:41:35.532 During handling of the above exception, another exception occurred:
  April 15th 2023, 02:41:35.532 Traceback (most recent call last):
  April 15th 2023, 02:41:35.532   File "./manage.py", line 9, in <module>
  April 15th 2023, 02:41:35.532     return self.main(*args, **kwargs)
  April 15th 2023, 02:41:35.532   File "/usr/local/lib/python3.7/site-packages/click/core.py", line 697, in main
  April 15th 2023, 02:41:35.532     rv = self.invoke(ctx)
  April 15th 2023, 02:41:35.532     return _process_result(sub_ctx.command.invoke(sub_ctx))
  April 15th 2023, 02:41:35.532     return callback(*args, **kwargs)
  April 15th 2023, 02:41:35.532     return f(get_current_context(), *args, **kwargs)
  April 15th 2023, 02:41:35.532     return __ctx.invoke(f, *args, **kwargs)
  April 15th 2023, 02:41:35.532   File "/usr/local/lib/python3.7/site-packages/click/core.py", line 535, in invoke
  April 15th 2023, 02:41:35.532     w.work()
  April 15th 2023, 02:41:35.532   File "/usr/local/lib/python3.7/site-packages/rq/worker.py", line 511, in work
  April 15th 2023, 02:41:35.532     self.register_birth()
  April 15th 2023, 02:41:35.532   File "/usr/local/lib/python3.7/site-packages/rq/worker.py", line 273, in register_birth
  April 15th 2023, 02:41:35.532     if self.connection.exists(self.key) and \
  April 15th 2023, 02:41:35.532   File "/usr/local/lib/python3.7/site-packages/redis/client.py", line 898, in execute_command
  April 15th 2023, 02:41:35.532     conn = self.connection or pool.get_connection(command_name, **options)
  April 15th 2023, 02:41:35.532 redis.exceptions.ConnectionError: Error 110 connecting to redash-redis-master:6379. Connection timed out.
  April 15th 2023, 02:41:35.986 2023-04-15 00:41:35,986 INFO exited: worker-0 (exit status 1; not expected)
  April 15th 2023, 02:41:36.987 2023-04-15 00:41:36,987 INFO gave up: worker-0 entered FATAL state, too many start retries too quickly
  April 15th 2023, 02:42:01.011 2023/04/15 00:42:01 [worker_healthcheck] Received TICK_60 event from supervisor
  April 15th 2023, 02:42:01.013 RESULT 2
  April 15th 2023, 02:42:01.013 2023/04/15 00:42:01 [worker_healthcheck] No processes in state RUNNING found for process worker
  April 15th 2023, 02:42:01.013 OKREADY

The "Liveness check for workers" PR should improve the situation.
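
The log above shows the worker exiting in `register_birth` when Redis times out, and supervisor then giving up because the restarts happened too quickly. Separately from a liveness check, one possible mitigation is to retry startup with growing delays so that a short Redis outage does not exhaust the retry budget. The following is a minimal sketch of that idea, not how Redash or the chart actually behaves; `retry_with_backoff` and its parameters are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Tuple, Type


def retry_with_backoff(
    start: Callable[[], None],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    max_attempts: int = 8,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call start() until it succeeds; return the number of attempts used.

    Waits base_delay, 2*base_delay, 4*base_delay, ... (capped at
    max_delay) between failed attempts, and re-raises the last error
    once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            start()
            return attempt
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```

Note that the default `retry_on` is Python's built-in `ConnectionError`. The exception in the log is `redis.exceptions.ConnectionError`, which is a different class, so a caller wrapping a Redis-backed startup would need to pass that class explicitly via `retry_on`.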
