Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error handling request (no URI read) #3207

Open
nonprofitnerd opened this issue May 10, 2024 · 26 comments · May be fixed by #3275
Open

Error handling request (no URI read) #3207

nonprofitnerd opened this issue May 10, 2024 · 26 comments · May be fixed by #3275

Comments

@nonprofitnerd
Copy link

After upgrading from version 20.1.0 to 22.0.0, we have encountered "Error handling request (no URI read)" errors.

[ERROR] 2024-05-10 10:15:49,532 gunicorn.error: Error handling request (no URI read)
Traceback (most recent call last):
  File "/app/env/lib64/python3.8/site-packages/gunicorn/workers/base_async.py", line 48, in handle
    req = next(parser)
  File "/app/env/lib64/python3.8/site-packages/gunicorn/http/parser.py", line 42, in __next__
    self.mesg = self.mesg_class(self.cfg, self.unreader, self.source_addr, self.req_count)
  File "/app/env/lib64/python3.8/site-packages/gunicorn/http/message.py", line 257, in __init__
    super().__init__(cfg, unreader, peer_addr)
  File "/app/env/lib64/python3.8/site-packages/gunicorn/http/message.py", line 60, in __init__
    unused = self.parse(self.unreader)
  File "/app/env/lib64/python3.8/site-packages/gunicorn/http/message.py", line 269, in parse
    self.get_data(unreader, buf, stop=True)
  File "/app/env/lib64/python3.8/site-packages/gunicorn/http/message.py", line 260, in get_data
    data = unreader.read()
  File "/app/env/lib64/python3.8/site-packages/gunicorn/http/unreader.py", line 37, in read
    d = self.chunk()
  File "/app/env/lib64/python3.8/site-packages/gunicorn/http/unreader.py", line 64, in chunk
    return self.sock.recv(self.mxchunk)
  File "/app/env/lib64/python3.8/site-packages/gevent/_socketcommon.py", line 666, in recv
    self._wait(self._read_event)
  File "src/gevent/_hub_primitives.py", line 317, in gevent._gevent_c_hub_primitives.wait_on_socket
  File "src/gevent/_hub_primitives.py", line 322, in gevent._gevent_c_hub_primitives.wait_on_socket
  File "src/gevent/_hub_primitives.py", line 304, in gevent._gevent_c_hub_primitives._primitive_wait
  File "src/gevent/_hub_primitives.py", line 46, in gevent._gevent_c_hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_hub_primitives.py", line 46, in gevent._gevent_c_hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_hub_primitives.py", line 55, in gevent._gevent_c_hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_waiter.py", line 154, in gevent._gevent_c_waiter.Waiter.get
  File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 65, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_gevent_c_greenlet_primitives.pxd", line 35, in gevent._gevent_c_greenlet_primitives._greenlet_switch
greenlet.GreenletExit

/var/log/messages

gunicorn: [2024-05-10 10:15:49 +0000] [24020] [INFO] Booting worker with pid: 24020

gunicorn.conf.py

import os
import sys

import gunicorn

gunicorn.SERVER_SOFTWARE = '-'

bind = '0.0.0.0:9000'
workers = 8
worker_class = 'gevent'
max_requests = 5000
max_requests_jitter = 50
keepalive = 45


raw_env = ['SCRIPT_NAME=/xxx']

def pre_exec(server):
    """
    Resets the working directory of the server to current symlink.
    """
    app_root = '/app'

    server.START_CTX['cwd'] = app_root
    os.chdir(app_root)

    orig_path = os.path.dirname(sys.executable)
    new_path = os.path.join(app_root, 'env', 'bin')

    server.START_CTX[0] = server.START_CTX[0].replace(orig_path, new_path)
    server.START_CTX['args'] = [arg.replace(orig_path, new_path)
                                for arg in server.START_CTX['args']]

gunicorn -c /app/xxx/gunicorn.conf.py -p /run/xxx.pid xxx.wsgi

Env:

  • gunicorn==22.0.0
  • gevent==24.2.1
  • greenlet==2.0.2
@pajod
Copy link
Contributor

pajod commented May 10, 2024

Got any context; when does this get triggered?
Did you at the same time bump dependency or Python versions?

I partially reverted 1ccebab (meant to silence exceptions below SSLError) in 0b10cba specifically to make cases like this visible again - you are not necessarily seeing changed behaviour beyond changed log entries.

@nonprofitnerd
Copy link
Author

Hello, we're experiencing an issue in our production environment, and since we have limited information, diagnosing it is challenging. The problem occurs randomly, and we're using Python version 3.8.18 and Django version 4.2.12.

@pajod
Copy link
Contributor

pajod commented May 10, 2024

  • What exactly is the issue here? Just the log entry, or is any request failing that you expect to work?
  • Is there a proxy in front of gunicorn that may have its own logs, possibly providing additional context?
  • How many of those log entries are you getting?
  • Does the timing of the log message correlate with any other symptom or action performed on your environment?

@nonprofitnerd
Copy link
Author

Hello,

  • For now, it's a log entry; we are not sure of any side effects on clients.
  • Yes, there is. We are using Apache2, and we also have a load balancer.
  • 11 logs have appeared in the last 24 hours.
  • I think I found something. Here are the logs.

[INFO] 2024-05-13 06:08:22,863 gunicorn.error: Worker exiting (pid: 28063)
[ERROR] 2024-05-13 06:08:22,860 gunicorn.error: Error handling request (no URI read)
[DEBUG] 2024-05-13 06:08:22,409 gunicorn.error: Closing connection.
[WARNING] 2024-05-13 06:08:21,858 gunicorn.error: Worker graceful timeout (pid:28063)
[DEBUG] 2024-05-13 06:07:51,192 gunicorn.error: Closing connection.
[DEBUG] 2024-05-13 06:07:50,892 gunicorn.error: Closing connection.
[INFO] 2024-05-13 06:07:50,835 gunicorn.error: Autorestarting worker after current request.
--
[INFO] 2024-05-13 04:11:34,150 gunicorn.error: Worker exiting (pid: 5111)
[ERROR] 2024-05-13 04:11:34,146 gunicorn.error: Error handling request (no URI read)
[WARNING] 2024-05-13 04:11:33,145 gunicorn.error: Worker graceful timeout (pid:5111)
[DEBUG] 2024-05-13 04:11:03,058 gunicorn.error: Closing connection.
[INFO] 2024-05-13 04:11:02,192 gunicorn.error: Autorestarting worker after current request.
--
[DEBUG] 2024-05-13 02:48:42,494 gunicorn.error: Closing connection.
[INFO] 2024-05-13 02:48:42,460 gunicorn.error: Worker exiting (pid: 2016)
[ERROR] 2024-05-13 02:48:42,459 gunicorn.error: Error handling request (no URI read)
[WARNING] 2024-05-13 02:48:41,457 gunicorn.error: Worker graceful timeout (pid:2016)
[DEBUG] 2024-05-13 02:48:11,691 gunicorn.error: Closing connection.
[INFO] 2024-05-13 02:48:10,755 gunicorn.error: Autorestarting worker after current request.
--
[INFO] 2024-05-13 00:51:37,997 gunicorn.error: Worker exiting (pid: 123650)
[ERROR] 2024-05-13 00:51:37,994 gunicorn.error: Error handling request (no URI read)
[WARNING] 2024-05-13 00:51:36,992 gunicorn.error: Worker graceful timeout (pid:123650)
[INFO] 2024-05-13 00:51:06,191 gunicorn.error: Autorestarting worker after current request.
--
[INFO] 2024-05-12 23:19:47,497 gunicorn.error: Worker exiting (pid: 93595)
[ERROR] 2024-05-12 23:19:47,494 gunicorn.error: Error handling request (no URI read)
[WARNING] 2024-05-12 23:19:46,492 gunicorn.error: Worker graceful timeout (pid:93595)
[INFO] 2024-05-12 23:19:16,330 gunicorn.error: Autorestarting worker after current request.
--
[INFO] 2024-05-12 12:22:49,710 gunicorn.error: Worker exiting (pid: 61916)
[ERROR] 2024-05-12 12:22:49,709 gunicorn.error: Error handling request (no URI read)
[WARNING] 2024-05-12 12:22:48,708 gunicorn.error: Worker graceful timeout (pid:61916)
[INFO] 2024-05-12 12:22:17,893 gunicorn.error: Autorestarting worker after current request.

@pajod
Copy link
Contributor

pajod commented May 13, 2024

Your logs are upside down 🙃 - but yes, you did find something.

There are various issues around the max_requests feature (and the basic recommendation is to not use it unless you really do need it). I expect this one is not any more serious than the other ones, given that the worker has stopped processing requests and is shutting down. Suppress or patch those lines if it bothers you, while no immediate solution is available.

In 4023228 all BaseException handling was forced into the parent class, and because exceptions handled from connections that had not yet received a full request line were previously not logged, that greenlet-specific BaseException suppression looks(²) neutralized. I think some class hierarchy respecting way of moving just those two would do the trick, but would prefer a solution that undoes the imho too broad except block in favour of only handling those not-quite-fatal BaseException in our area of responsibility. (² I'll have to write a few tests to tell. gevent exception propagation is somewhat special.)

@nonprofitnerd
Copy link
Author

nonprofitnerd commented May 13, 2024

Thank you, @pajod, for the detailed explanation. We will attempt to disable max_requests. Have a wonderful week!

P.S. Sorry for the logs 🙃

@MiaZhou0424
Copy link

Hi @pajod , thank you for looking into this! I have the same issue here.
My environment setting is basically the same but was receiving huge volume of greenlet.GreenletExit (~2k/h).

Is there any reason for exposing no URI read and greenlet.GreenletExit here?
Since greenlet.GreenletExit is claimed to be harmless in greenlet official doc here, can we change from self.log.exception("Error handling request (no URI read)") to self.log.info("Error handling request (no URI read)"). Please correct me if I'm wrong.

Thanks!

pajod referenced this issue Jul 24, 2024
ensure we can catch  correctly  exceptions based on BaseException.

Note: patch was origninally proposed by the pr #2923, but original
author closed it.

Fix #2923
@fsrajer
Copy link

fsrajer commented Jul 25, 2024

FYI, I am describing a related bug. I am not opening a new issue so as not to make too much noise, but I would be interested about comments from the maintainers
4023228#r144628336

@jeverling
Copy link

jeverling commented Jul 30, 2024

Hi, this might be completely unrelated, but I was seeing a similar error message:

app-1  | [2024-07-30 22:16:06 +0000] [9] [CRITICAL] WORKER TIMEOUT (pid:15)
app-1  | [2024-07-30 22:16:06 +0000] [15] [ERROR] Error handling request (no URI read)
app-1  | Traceback (most recent call last):
app-1  |   File "/usr/local/lib/python3.12/site-packages/gunicorn/workers/sync.py", line 134, in handle
app-1  |     req = next(parser)
app-1  |           ^^^^^^^^^^^^
app-1  |   File "/usr/local/lib/python3.12/site-packages/gunicorn/http/parser.py", line 42, in __next__
app-1  |     self.mesg = self.mesg_class(self.cfg, self.unreader, self.source_addr, self.req_count)
app-1  |                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/local/lib/python3.12/site-packages/gunicorn/http/message.py", line 257, in __init__
app-1  |     super().__init__(cfg, unreader, peer_addr)
app-1  |   File "/usr/local/lib/python3.12/site-packages/gunicorn/http/message.py", line 60, in __init__
app-1  |     unused = self.parse(self.unreader)
app-1  |              ^^^^^^^^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/local/lib/python3.12/site-packages/gunicorn/http/message.py", line 269, in parse
app-1  |     self.get_data(unreader, buf, stop=True)
app-1  |   File "/usr/local/lib/python3.12/site-packages/gunicorn/http/message.py", line 260, in get_data
app-1  |     data = unreader.read()
app-1  |            ^^^^^^^^^^^^^^^
app-1  |   File "/usr/local/lib/python3.12/site-packages/gunicorn/http/unreader.py", line 37, in read
app-1  |     d = self.chunk()
app-1  |         ^^^^^^^^^^^^
app-1  |   File "/usr/local/lib/python3.12/site-packages/gunicorn/http/unreader.py", line 64, in chunk
app-1  |     return self.sock.recv(self.mxchunk)
app-1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/local/lib/python3.12/site-packages/gunicorn/workers/base.py", line 203, in handle_abort
app-1  |     sys.exit(1)
app-1  | SystemExit: 1
app-1  | [2024-07-30 22:16:06 +0000] [15] [INFO] Worker exiting (pid: 15)
app-1  | [2024-07-30 22:16:06 +0000] [17] [INFO] Booting worker with pid: 17

In my case, I was playing around with sendfile/X-Accel-Redirect and had a Django view like this:

def login_required_media_view(request, path):
    response = HttpResponse(status=200)
    response["Content-Type"] = ""
    response["X-Accel-Redirect"] = "/media/" + quote(path)
    return response

nginx was not set up at this point, and opening a media URL lead to the timeout ERROR above.

Just in case this is useful for someone here.

@pajod
Copy link
Contributor

pajod commented Jul 30, 2024

@jeverling It is related - that is how the exception handling looks like without the added gevent complexity. Useful for others to find this issue. Where they can learn that they can revert 0b10cba (to silence the error log again) or 4023228 (to return to the previous exception handling) before a proper bugfix release is issued. (At this time I do not suggest a downgrade.)

@KenxinKun
Copy link

KenxinKun commented Aug 14, 2024

So, I'm a bit confused since I'm getting both these errors as well in my Django application:

  • [Critical] WORKER TIMEOUT
  • [ERROR] Error handling request (no URI read)
Screenshot 2024-08-14 at 17 29 25

The Django application is a very simple render, with no slow operations in the background and I can see it complete way before the timeout (checking network in Chrome). It doesn't matter how long I set the --timeout it will eventually kill the worker.

Is my issue linked to this error handling or should I be looking elsewhere? The primary effect I'm getting is that:

  • First requests go well
  • Worker is killed due to alleged timeout
  • Later requests may fail if there's not an available worker since there's some boot up time

Any suggestions?

@pajod pajod linked a pull request Aug 14, 2024 that will close this issue
@braiam
Copy link

braiam commented Aug 26, 2024

I don't understand this issue. Gunicorn responds appropriately when queried directly with wget/curl, but fails when there's a webserver inbetween. Is that correct? How could such behavior be correct? My nginx configuration is using the default uwsgi_params in the example documentation https://docs.nginx.com/nginx/admin-guide/web-server/app-gateway-uwsgi-django/

@ashishislive
Copy link

I can reproduce this issue when I use chrome and after default timeout it through error, but in case of postman or safari I am not seeing it.

@lordsarcastic
Copy link

I am facing the same problem except that the error shows up intermittently. I am running gunicorn in Docker without any proxy:

app-1       | [2024-09-08 05:52:36 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:7)
app-1       | [2024-09-08 06:52:36 +0100] [7] [ERROR] Error handling request (no URI read)
app-1       | Traceback (most recent call last):
app-1       |   File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/sync.py", line 134, in handle
app-1       |     req = next(parser)
app-1       |           ^^^^^^^^^^^^
app-1       |   File "/usr/local/lib/python3.11/site-packages/gunicorn/http/parser.py", line 42, in __next__
app-1       |     self.mesg = self.mesg_class(self.cfg, self.unreader, self.source_addr, self.req_count)
app-1       |                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
app-1       |   File "/usr/local/lib/python3.11/site-packages/gunicorn/http/message.py", line 257, in __init__
app-1       |     super().__init__(cfg, unreader, peer_addr)
app-1       |   File "/usr/local/lib/python3.11/site-packages/gunicorn/http/message.py", line 60, in __init__
app-1       |     unused = self.parse(self.unreader)
app-1       |              ^^^^^^^^^^^^^^^^^^^^^^^^^
app-1       |   File "/usr/local/lib/python3.11/site-packages/gunicorn/http/message.py", line 269, in parse
app-1       |     self.get_data(unreader, buf, stop=True)
app-1       |   File "/usr/local/lib/python3.11/site-packages/gunicorn/http/message.py", line 260, in get_data
app-1       |     data = unreader.read()
app-1       |            ^^^^^^^^^^^^^^^
app-1       |   File "/usr/local/lib/python3.11/site-packages/gunicorn/http/unreader.py", line 37, in read
app-1       |     d = self.chunk()
app-1       |         ^^^^^^^^^^^^
app-1       |   File "/usr/local/lib/python3.11/site-packages/gunicorn/http/unreader.py", line 64, in chunk
app-1       |     return self.sock.recv(self.mxchunk)
app-1       |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
app-1       |   File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/base.py", line 203, in handle_abort
app-1       |     sys.exit(1)
app-1       | SystemExit: 1
app-1       | [2024-09-08 06:52:36 +0100] [7] [INFO] Worker exiting (pid: 7)
app-1       | [2024-09-08 05:52:36 +0000] [13] [INFO] Booting worker with pid: 13

As you can see, we get a timeout and suddenly an exception shows up. When this happens, there is a temporary Internal Server Error due to this but it goes away once another worker is up.

@gasparbrogueira
Copy link

gasparbrogueira commented Sep 11, 2024

I have the same problem:

[2024-09-11 12:45:45 +0100] [26508] [CRITICAL] WORKER TIMEOUT (pid:26634)
[2024-09-11 11:45:45 +0000] [26634] [ERROR] Error handling request (no URI read)
Traceback (most recent call last):
  File "/home/mb/Documents/venv/registo3.11/lib/python3.11/site-packages/gunicorn/workers/sync.py", line 133, in handle
    req = next(parser)
          ^^^^^^^^^^^^
  File "/home/mb/Documents/venv/registo3.11/lib/python3.11/site-packages/gunicorn/http/parser.py", line 41, in __next__
    self.mesg = self.mesg_class(self.cfg, self.unreader, self.source_addr, self.req_count)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mb/Documents/venv/registo3.11/lib/python3.11/site-packages/gunicorn/http/message.py", line 259, in __init__
    super().__init__(cfg, unreader, peer_addr)
  File "/home/mb/Documents/venv/registo3.11/lib/python3.11/site-packages/gunicorn/http/message.py", line 60, in __init__
    unused = self.parse(self.unreader)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mb/Documents/venv/registo3.11/lib/python3.11/site-packages/gunicorn/http/message.py", line 274, in parse
    line, rbuf = self.read_line(unreader, buf, self.limit_request_line)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mb/Documents/venv/registo3.11/lib/python3.11/site-packages/gunicorn/http/message.py", line 326, in read_line
    self.get_data(unreader, buf)
  File "/home/mb/Documents/venv/registo3.11/lib/python3.11/site-packages/gunicorn/http/message.py", line 262, in get_data
    data = unreader.read()
           ^^^^^^^^^^^^^^^
  File "/home/mb/Documents/venv/registo3.11/lib/python3.11/site-packages/gunicorn/http/unreader.py", line 36, in read
    d = self.chunk()
        ^^^^^^^^^^^^
  File "/home/mb/Documents/venv/registo3.11/lib/python3.11/site-packages/gunicorn/http/unreader.py", line 63, in chunk
    return self.sock.recv(self.mxchunk)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mb/Documents/venv/registo3.11/lib/python3.11/site-packages/gunicorn/workers/base.py", line 204, in handle_abort
    sys.exit(1)
SystemExit: 1
[2024-09-11 11:45:45 +0000] [26634] [INFO] Worker exiting (pid: 26634)

Even with a substantial timeout:

gunicorn -w 2 -b 0.0.0.0:7000 --timeout 120 --log-level=debug wsgi:application

Software versions:

nginx version: nginx/1.22.1
Django==5.1
gunicorn==23.0.0

UPDATE:

With this setup on NGINX, the timeout stopped occurring:

(...)
    location / {
        proxy_pass http://0.0.0.0:7000;
        proxy_http_version  1.1;
        proxy_cache_bypass  $http_upgrade;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-SSL   on;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header User-Agent $http_user_agent;
        proxy_set_header REMOTE_ADDR $remote_addr;
    }
(...)

@pajod
Copy link
Contributor

pajod commented Sep 11, 2024

For those in a position to review and/or test it: you are welcome to provide feedback on my suggested patch: #3275

@ncpalmie
Copy link

ncpalmie commented Sep 13, 2024

Has anyone identified what the actual originating issue is that is causing the worker timeout? I see the PR addresses the exception bubbling, but I've yet to find any threads where anyone has identified what the actual cause of this is.

I'm seeing the same issue where a worker will handle the request, the browser receives the data and request ends there, but then the worker dies after the set timeout period and reboots. However, even after rebooting (or with multiple other workers which should be fine to take the request), the next most immediate request fails with 500 Internal Server Error and requests after that work fine.

Setting timeout to 0 appears to prevent the workers from dying (as they never hit the max timeout) but why are they held up after successful requests anyways?

@braiam
Copy link

braiam commented Sep 16, 2024

Considering that using raw http or proxy protocol with nginx doesn't generate problems, while uwsgi_pass does, this problem is contained to using wsgi.

@DvaMishkiLapa
Copy link

I seem to encounter the same problem after running any Celery-task (@shared_task with func.delay).

core-1    | [2024-10-24 15:58:04 +0000] [7] [CRITICAL] WORKER TIMEOUT (pid:8)
core-1    | [2024-10-24 15:58:04 +0000] [8] [ERROR] Error handling request (no URI read)
core-1    | Traceback (most recent call last):
core-1    |   File "/usr/local/lib/python3.12/site-packages/gunicorn/workers/sync.py", line 133, in handle
core-1    |     req = next(parser)
core-1    |           ^^^^^^^^^^^^
core-1    |   File "/usr/local/lib/python3.12/site-packages/gunicorn/http/parser.py", line 41, in __next__
core-1    |     self.mesg = self.mesg_class(self.cfg, self.unreader, self.source_addr, self.req_count)
core-1    |                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
core-1    |   File "/usr/local/lib/python3.12/site-packages/gunicorn/http/message.py", line 259, in __init__
core-1    |     super().__init__(cfg, unreader, peer_addr)
core-1    |   File "/usr/local/lib/python3.12/site-packages/gunicorn/http/message.py", line 60, in __init__
core-1    |     unused = self.parse(self.unreader)
core-1    |              ^^^^^^^^^^^^^^^^^^^^^^^^^
core-1    |   File "/usr/local/lib/python3.12/site-packages/gunicorn/http/message.py", line 271, in parse
core-1    |     self.get_data(unreader, buf, stop=True)
core-1    |   File "/usr/local/lib/python3.12/site-packages/gunicorn/http/message.py", line 262, in get_data
core-1    |     data = unreader.read()
core-1    |            ^^^^^^^^^^^^^^^
core-1    |   File "/usr/local/lib/python3.12/site-packages/gunicorn/http/unreader.py", line 36, in read
core-1    |     d = self.chunk()
core-1    |         ^^^^^^^^^^^^
core-1    |   File "/usr/local/lib/python3.12/site-packages/gunicorn/http/unreader.py", line 63, in chunk
core-1    |     return self.sock.recv(self.mxchunk)
core-1    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
core-1    |   File "/usr/local/lib/python3.12/site-packages/gunicorn/workers/base.py", line 204, in handle_abort
core-1    |     sys.exit(1)
core-1    | SystemExit: 1
core-1    | [2024-10-24 15:58:04 +0000] [8] [INFO] Worker exiting (pid: 8)
core-1    | [2024-10-24 15:58:05 +0000] [17] [INFO] Booting worker with pid: 17
gunicorn==23.0.0
Django==5.1.1
celery==5.4.0
django-celery-beat==2.7.0
django-celery-results==2.5.1

On gunicorn==20.1.0 I see a simpler output:

core-1    | [2024-10-24 16:02:31 +0000] [7] [CRITICAL] WORKER TIMEOUT (pid:8)
core-1    | [2024-10-24 16:02:31 +0000] [8] [INFO] Worker exiting (pid: 8)
core-1    | [2024-10-24 16:02:32 +0000] [10] [INFO] Booting worker with pid: 10

@jbaptperez
Copy link

I also faced the problem but in my case, I could find what was going on.

Very simple (after 1 day to find out why): My worker was trying to make an HTTPS request to another web server (Gunicorn by the way, but not really important) that was listening in HTTP (without TLS).

It was really difficult to "see" that as I was debugging a running set of services without a comfortable way to see every environment variables I passed to each one.
And I was migrating one from HTTPS to HTTP (internal Docker containers, not exposed but talking each others).

I hope I could help someone.

@vbabiy
Copy link

vbabiy commented Oct 29, 2024

I just tried #3275 and I am still getting these errors in sentry.

@pajod
Copy link
Contributor

pajod commented Oct 29, 2024

I just tried #3275 and I am still getting these errors in sentry.

In #3275 the message in the title of this issue is debug, not error-level severity.
If you do still get errors, you should quote them, in that PR.

@passivereader
Copy link

Has anyone identified what the actual originating issue is that is causing the worker timeout?

@ncpalmie I think I did, at least for my specific case (homelab). My frontend is nginx (nginx version: nginx/1.26.2) running in a FreeBSD jail and from there I reverse-proxy to another FreeBSD jail running Gunicorn (gunicorn==23.0.0), serving Flask. The proxy connection is TLS1.3 only - the nginx-side has a client cert and the gunicorn-side has a server-cert, and you would get SSL_ERROR_RX_CERTIFICATE_REQUIRED_ALERT from Firefox when trying to access the Gunicorn side directly as only the client cert (Nginx) and the server cert (Gunicorn) together can establish the connection. The whole CA infrastructure is setup using openssl and I documented every single step when I created my test-CA, the certs, the keys, nginx, gunicorn, etc.

My error logs looked exactly like the ones of @DvaMishkiLapa @lordsarcastic and @nonprofitnerd when I did the following:

After I managed to get the reverse-proxy connection TLS1.3-encrypted I tried to find out how SNI works, as mentioned in the sni_callback() part of the Gunicorn example config file.

To avoid the error everyone had in their logs I believe the following must be set up correctly:

(all this assumes that your firewall is set up correctly, otherwise openssl gives you connect() permission denied errors)

  1. When I created my certs using openssl I made sure to have correct subject alternative names (i.e. an appropriate DNS name as a SAN in the cert) and I believe that a matching CN doesn't hurt here either.
  2. Whatever it mentions in the cert as SAN (i.e. a correct DNS name) must be reflected in nginx.conf as proxy_ssl_name. This combined with proxy_ssl_server_name on; was necessary to ensure that nginx did not crash anymore when proxy_ssl_verify on; is set.
  3. On both sides, Nginx and Gunicorn, the same fullchain.pem, and corresponding cert.pem and key.pem files need to be configured correctly in nginx.conf and gunicorn.conf.py.

When I tried to use sni_callback() in gunicorn.conf.py I outcommented keyfile and certfile, believing it is enough that those will be present anyway in the new_context.load_cert_chain() part of the sni_callback() function.

This was when I got the error everyone had in my logs (restoring keyfile and certfile made the error disappear). Investigating the error brought me here.

Note that I only configured my original Gunicorn server in sni_callback() (and not yet alternative ones which is probably the purpose of all that SNI stuff).

Also note that Gunicorn did not crash until the arrival of a TLS1.3 request. I could reproduce the error via Nginx and also using either of the following openssl-s_client commands that I used to verify that TLS1.3 works:

echo -n | openssl s_client -cert /full/path/to/nginx/side/cert.pem -key /full/path/to/nginx/side/key.pem -CAfile /full/path/fullchain.pem -connect correct-gunicorn-dns-name.com:443 -verify_return_error -servername correct-gunicorn-dns-name.com </dev/null | openssl x509 -noout -text | grep correct-gunicorn-dns-name.com

and

openssl s_client -cert /full/path/to/nginx/side/cert.pem -key /full/path/to/nginx/side/key.pem -CAfile /full/path/fullchain.pem -connect correct-gunicorn-dns-name.com:443 -verify_return_error -crlf -brief

The internet said that the first one is good for checking the SANs in the cert, the -servername option is SNI-related.

Using either of these commands on the Nginx side made Gunicorn crash and produced the error everyone had in the logs, just like Nginx did before.

Not sure if it matters, but I used EC instead of RSA for all my certs and used the P-256 curve in my openssl commands.

I'm fairly new to all this. I was a little suprised that both, Nginx and Gunicorn, seem to behave so erratic when a TLS error of this kind occurs. I don't know what kind of parsing and the like is involved behind the scenes though, this is probably very complicated.

Anyway, I hope someone will benefit from my observations.

@wdar
Copy link

wdar commented Nov 27, 2024

I'm observing the same issue locally and on a production server. This can be replicated with the most basic Flask, Django or simple 'hello world' app.py.

Bug: Chrome on ARM processors stalls booting of Gunicorn worker after timeouts. Chrome user may experience a 500 Internal Server Error.

Envrionment:

  • Python 3.11.5
  • Gunicorn 23.0.0
  • Chrome browser (logged into Google account)
  • M2 Mac hardware

Minimal reproduction code app.py:

def application(environ, start_response):
    status = '200 OK'
    headers = [('Content-type', 'text/plain')]
    start_response(status, headers)
    return [b'Hello, World!']

Steps to reproduce:

  1. Run: gunicorn app:application -b 192.168.86.224:8000 --access-logfile - --timeout 5
  2. Visit http://192.168.86.224:8000 in Chrome (loggle into Google account)
  3. Initial request works successfully
  4. After 5 seconds, observe:
    • [CRITICAL] WORKER TIMEOUT in logs
    • [ERROR] Error handling request (no URI read) in logs
  5. Reload within ~60s: "Internal Server Error" appears on browser (no Gunicorn log entries)
  6. Second reload: Works normally, restarts cycle

Key Observations:

  1. Only affects Chrome-based browsers on ARM architecture:

    • M2 Macs
    • Windows VM on M2 Mac (Chrome/Edge)
    • Jetson Nano (Chromium)
  2. Does NOT occur with:

    • Chrome in Guest/Incognito mode (M2 Macs)
    • Firefox, Safari, Brave (M2 Macs)
    • curl (M2 Macs)
    • Chrome on Intel Mac/AMD Windows
    • Android Chrome
  3. Configuration specific:

    • Occurs with -b 192.168.86.224:8000
    • Does NOT occur with default binding or -b 0.0.0.0:8000
    • Does NOT occur when accessing via local DNS name
  4. Impact:

    • Main concern is the ~60s "Internal Server Error" blocking period
    • Issue less frequent under higher traffic, presumably the workers are too busy to timeout.
    • Behavior changed with Gunicorn >= 22.0.0.

With prior versions of Gunicorn, before version 22, the workers booted instantly after a timeout. Now it seems that under specific circumstances, using a Chrome browser on ARM, this can cause the booting of the new worker to stall without log entries for up to a minute.

Hoping this helps narrow down the issue for the developers.

@wdar
Copy link

wdar commented Nov 28, 2024

To follow up a bit. After just rebooting my laptop the issue with Chrome now does not appear to happen. Chromium on the Jetson Nano still had the issue until that was rebooted. So now everything I was sure about earlier is again a question. What a difficult problem to troubleshoot.

@wojcikstefan
Copy link

wojcikstefan commented Dec 3, 2024

I'm also experiencing this problem after upgrading Gunicorn from v20.1.0 to v22.0.0.

Steps that lead to these exceptions:

  1. Gunicorn Gevent worker decides to shut down because of max_requests, logging Autorestarting worker after current request.:
    Image
  2. The worker stops accepting new requests and starts waiting for the existing ones to finish.
  3. ~30 seconds later (i.e. after a graceful_timeout), the worker acknowledges that it didn't manage to finish processing all the requests and logs Worker graceful timeout (pid:XXXX).
    Screenshot 2024-12-03 at 7 18 20 AM
  4. Then we're getting GreenletExits for all the request Greenlets that got axed:
    Screenshot 2024-12-03 at 7 09 05 AM
  5. Finally the worker shuts down, logging Worker exiting (pid: XXXX):
    Image

What's suspicious is that the tracebacks for requests that didn't manage to finish are always the same and never point to my application code. Instead, they always seem to be stuck at self.sock.recv(self.mxchunk)

Full traceback
Traceback (most recent call last):
  File "/home/closeio/venv/lib/python3.12/site-packages/gunicorn/workers/base_async.py", line 48, in handle
    req = next(parser)
          ^^^^^^^^^^^^
  File "/home/closeio/venv/lib/python3.12/site-packages/gunicorn/http/parser.py", line 42, in __next__
    self.mesg = self.mesg_class(self.cfg, self.unreader, self.source_addr, self.req_count)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/closeio/venv/lib/python3.12/site-packages/gunicorn/http/message.py", line 257, in __init__
    super().__init__(cfg, unreader, peer_addr)
  File "/home/closeio/venv/lib/python3.12/site-packages/gunicorn/http/message.py", line 60, in __init__
    unused = self.parse(self.unreader)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/closeio/venv/lib/python3.12/site-packages/gunicorn/http/message.py", line 269, in parse
    self.get_data(unreader, buf, stop=True)
  File "/home/closeio/venv/lib/python3.12/site-packages/gunicorn/http/message.py", line 260, in get_data
    data = unreader.read()
           ^^^^^^^^^^^^^^^
  File "/home/closeio/venv/lib/python3.12/site-packages/gunicorn/http/unreader.py", line 37, in read
    d = self.chunk()
        ^^^^^^^^^^^^
  File "/home/closeio/venv/lib/python3.12/site-packages/gunicorn/http/unreader.py", line 64, in chunk
    return self.sock.recv(self.mxchunk)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/closeio/venv/lib/python3.12/site-packages/gevent/_socketcommon.py", line 662, in recv
    self._wait(self._read_event)
  File "src/gevent/_hub_primitives.py", line 317, in gevent._gevent_c_hub_primitives.wait_on_socket
  File "src/gevent/_hub_primitives.py", line 322, in gevent._gevent_c_hub_primitives.wait_on_socket
  File "src/gevent/_hub_primitives.py", line 304, in gevent._gevent_c_hub_primitives._primitive_wait
  File "src/gevent/_hub_primitives.py", line 46, in gevent._gevent_c_hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_hub_primitives.py", line 46, in gevent._gevent_c_hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_hub_primitives.py", line 55, in gevent._gevent_c_hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_waiter.py", line 154, in gevent._gevent_c_waiter.Waiter.get
  File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 65, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_gevent_c_greenlet_primitives.pxd", line 35, in gevent._gevent_c_greenlet_primitives._greenlet_switch
greenlet.GreenletExit

Reading a request from a socket should be very quick, orders of magnitude faster than the graceful timeout of 30 seconds. Just guessing: Is it perhaps a bug in some requests still being accepted after the server is not supposed to accept any more? Or maybe the code for reading from a socket and filling in the buffer can get stuck in an infinite loop under some conditions?

Edit: Ah, I see this is already being worked on in #3275

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.