
[WIP] Request timeout for async workers #1730

Open · wants to merge 3 commits into master

Conversation

opiumozor

> Work in progress

Problem

Currently --timeout and --graceful-timeout act like request timeouts on sync workers but not on async workers, which means there is currently no option to "kill" a request that takes too long in an async worker.

Proposed solution

Add a --request-timeout option that defines after how many seconds a request should time out. Set up a signal alarm in the handle_request() method and then have a handler function to handle the timeout.

The code works, but the timeout handling obviously needs to be cleaner. I was wondering whether this kind of implementation could work.

Feel free to share your opinion and ideas for implementations!

Related issue: #1658

@tilgovi
Collaborator

tilgovi commented Mar 21, 2018

Thank you so much for picking this up! You're going to make a lot of people very happy :).

SIGALRM is definitely a common way to do this, but it's probably not necessary. It may be cleaner to use the timeout facility of each async framework. We already do this for keepalive timeouts by overriding the timeout_ctx method of the worker class. Maybe we can do something similar to that?

@orenmazor

@tilgovi great point. It would be ideal if we had the native timeout mechanism for each worker type.

But we don't really have the expertise in worker types other than gevent, so implementing this feature for all workers turns it into the sort of thing that probably won't get tackled anytime soon.

How would you feel if we implemented this just for gevent and then let other contributors add the feature for their favourite worker type?

@tilgovi
Collaborator

tilgovi commented Mar 21, 2018

Sure! I can jump in and help with eventlet and maybe others.

@opiumozor
Author

@tilgovi just pushed a new implementation overriding request_timeout() and using gevent.Timeout(); let me know what you think.

I tried to implement the eventlet part, but for some reason it's not working properly.

    def request_timeout(self):
        return eventlet.Timeout(self.cfg.request_timeout, False)

I read in the eventlet docs that you cannot time out CPU-only operations with this class, so maybe that's where the problem is. (Note: my test is a Flask endpoint that uses time.sleep().)

@tilgovi
Collaborator

tilgovi commented Mar 23, 2018

I think the same is true for gevent as for eventlet. If you do not yield to the event loop, the timeouts don't complete.
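The effect is easy to demonstrate with the standard library's asyncio as a stand-in for gevent/eventlet: a timeout around code that yields to the loop is enforced promptly, while a blocking call runs to completion regardless of the deadline.

```python
import asyncio
import time

async def yielding_work():
    await asyncio.sleep(5)   # yields to the event loop; cancellable
    return "done"

async def blocking_work():
    time.sleep(0.5)          # never yields; the loop cannot interrupt it
    return "done"

async def main():
    results = {}
    # The cooperative sleep is interrupted almost immediately.
    try:
        await asyncio.wait_for(yielding_work(), timeout=0.1)
        results["yielding"] = "completed"
    except asyncio.TimeoutError:
        results["yielding"] = "timed out"
    # The blocking sleep runs for its full 0.5 s despite the 0.1 s timeout:
    # the deadline is only noticed once control returns to the loop.
    start = time.monotonic()
    try:
        await asyncio.wait_for(blocking_work(), timeout=0.1)
    except asyncio.TimeoutError:
        pass
    results["blocking_elapsed"] = time.monotonic() - start
    return results

results = asyncio.run(main())
```

The same dynamic applies to gevent.Timeout and eventlet.Timeout: they fire via the hub, so greenlets that never yield are never interrupted.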

I wonder how well the signals approach would work. It's possible that code could even mask the signals. Or more likely, I think CPython delivers signals to Python code only at certain moments; this timeout would not work if code is in a long running call in a C extension module.

Maybe we should take a step back and figure out what we really want from this feature. When I look at a server like NGINX I note that there is no request timeout. There are only timeouts around I/O. There are timeouts for receiving and parsing the request (headers, body) and for reading and writing data to the socket. For the proxy module, there are timeouts for sending and receiving data from the upstream server. There is no general request timeout.

I am not completely convinced that we should implement such a feature. Do you have a specific use case in mind for yourself that can help us reason through this?

@jamadden
Collaborator

> I think the same is true for gevent as eventlet. If you do not yield to the event loop the timeouts don't complete.

That's true. (Well, they complete, but they don't interrupt.)

> It's possible that code could even mask the signals. Or more likely, I think CPython delivers signals to Python code only at certain moments; this timeout would not work if code is in a long running call in a C extension module.

That's also true. gevent goes out of its way to try to ensure that signals get raised in a timely manner. Other C code may not be so considerate. (And even in gevent in certain cases it really is just "best effort" for timely signal delivery.)

@orenmazor

> I am not completely convinced that we should implement such a feature. Do you have a specific use case in mind for yourself that can help us reason through this?

@tilgovi I get where you're coming from, and framing the nginx proxy timeouts around IO is pretty bang on. This is exactly what I want as well from gunicorn, where it waits on slow IO.

specific use case:

We're currently facing gunicorn worker exhaustion issues during heavy traffic. Nginx times out on slow requests, and is ready to accept new requests, but all of the gunicorn workers are still chugging along. We need to somehow tell these workers nobody cares about them anymore.

Tackling the slow things inside a request is pretty obvious, but I want to address the gunicorn vulnerability here as well.

We're pretty open to any approach you have in mind!

@paultiplady

I'd just like to add another use-case which is similar to the above, in case more user stories are required to justify this feature.

In Django, a very easy mistake is to write an ORM query that will unexpectedly take a long time to complete (say a 10m query, in a pathological case of loading a full DB table with lots of joins). If you're using a sync worker, then gunicorn will time out after --timeout seconds, since the entire worker pool (of one) is busy. However, if you're using async workers, then the individual worker thread/gthread will just keep working even after the nginx reverse proxy has closed the client connection.

This poses two problems:

  1. Your worker is being wasted on work that nobody cares about, since the client has already timed out and gone away. (This is @orenmazor 's point above). If you ever got to a fully-starved daemon, then you should timeout and kill all the workers. But in the partial-exhaustion state that would lead up to this, you'll just see the latency creep up as more workers get blocked.
  2. This can produce seriously unexpected behaviour, if for example your worker commits changes to the DB long after the client has received a 503 error and assumed that their request was not fulfilled. While obviously you should use idempotency keys or some other method for guaranteeing that important requests aren't duplicated, having the option of terminating the gunicorn worker and returning a 500 error safely to the client, instead of leaving a zombie worker, would be preferable from a "principle of least surprise" perspective.

It's true that if the greenlet is blocked on a C extension, then you can't get the cancel signal through. But there are many cases where the worker is doing work that does yield to the event loop (say hitting other backends as an API gateway).

And even if you're using a C extension (like an ORM wrapper), I believe we'd still have the option of adding a gevent.sleep(0) before calling out to the C code, if we wanted to enable this feature in that context.

I may just be missing an obvious way to fix these issues; if there's another approach for handling this in the sync worker case I'd love to know it. Thanks!

@tilgovi
Collaborator

tilgovi commented Jan 23, 2019

Thanks, @paultiplady.

It's definitely possible for a user to wrap a timeout function around the application as WSGI middleware. However, I'm not sure it can be done in an event loop agnostic way. For that reason, we may want to have support in Gunicorn.

I also think there's an argument for just having something to clear up confusion about how --timeout works. 😁

@yoobles

yoobles commented May 27, 2020

Bumping this PR.

Without this PR, how do you timeout an async worker request? This seems like a large deficiency in gunicorn + gevent.

@tilgovi
Collaborator

tilgovi commented May 27, 2020

@yoobles if you know what event loop you're using you can implement your own timeouts. That is the most robust solution because it lets your application handle cancellation signals and perform any necessary cleanup so that timeouts don't cause clients to simply retry and overwhelm an already slow system. If you read the discussion, I hope you'll agree that we haven't yet come to a consensus about what general purpose timeout would be appropriate to implement, or how it would work. If you have something to contribute to that discussion, it would be very appreciated.

@yoobles

yoobles commented May 29, 2020

I implemented a middleware based off the gevent timer code here that times out requests, but I believe that overall this should be handled in gunicorn.

I don't have enough knowledge about event loops to say what the right approach is, but uwsgi seems to support this with harakiri, which appears to be implemented at least partially with SIGALRM (https://uwsgi-docs.readthedocs.io/en/latest/FAQ.html#what-is-harakiri-mode).

Even if the SIGALRM approach doesn't handle every application, I think having a starting point is superior to letting requests potentially run unbounded.

7 participants