
[WIP] Request timeout for async workers #1730

Open · wants to merge 3 commits into master

Conversation

opiumozor

> Work in progress

Problem

Currently --timeout and --graceful-timeout act like request timeouts on sync workers but not on async workers, which means there is currently no option to "kill" a request that takes too long in an async worker.

Proposed solution

Add a --request-timeout option that defines after how many seconds a request should time out. Set up a signal alarm in the handle_request() method and then have a handler function to handle the timeout.

The code works, but the timeout handling obviously needs to be cleaner. I was wondering whether this kind of implementation could work.

Feel free to share your opinion and ideas for implementations!

Related issue: #1658

@tilgovi
Collaborator

tilgovi commented Mar 21, 2018

Thank you so much for picking this up! You're going to make a lot of people very happy :).

SIGALRM is definitely a common way to do this, but it's probably not necessary. It may be cleaner to use the timeout facility of each async framework. We already do this for keepalive timeouts by overriding the timeout_ctx method of the worker class. Maybe we can do something similar to that?

@orenmazor

@tilgovi great point. It would be ideal if we had the native timeout mechanism for each worker type.

But we don't really have the expertise in worker types other than gevent, so implementing this feature for all workers turns it into the sort of thing that probably won't get tackled anytime soon.

How would you feel if we implemented this just for gevent and then let other contributors add the feature for their favourite worker type?

@tilgovi
Collaborator

tilgovi commented Mar 21, 2018

Sure! I can jump in and help with eventlet and maybe others.

@opiumozor
Author

@tilgovi just pushed a new implementation overriding request_timeout() and using gevent.Timeout(); let me know what you think.

I tried to implement the eventlet part, but for some reason it's not working properly.

    def request_timeout(self):
        return eventlet.Timeout(self.cfg.request_timeout, False)

I read in the eventlet docs that you cannot time out CPU-only operations with this class, so maybe that's where the problem is. (Note: my test is a Flask endpoint that uses time.sleep().)

@tilgovi
Collaborator

tilgovi commented Mar 23, 2018

I think the same is true for gevent as for eventlet. If you do not yield to the event loop, the timeouts don't complete.
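The effect is easy to demonstrate with the standard library's asyncio as a stand-in for gevent/eventlet: a timeout around code that yields to the loop is enforced promptly, while a blocking call runs to completion regardless of the deadline.

```python
import asyncio
import time

async def yielding_work():
    await asyncio.sleep(5)   # yields to the event loop; cancellable
    return "done"

async def blocking_work():
    time.sleep(0.5)          # never yields; the loop cannot interrupt it
    return "done"

async def main():
    results = {}
    # The cooperative sleep is interrupted almost immediately.
    try:
        await asyncio.wait_for(yielding_work(), timeout=0.1)
        results["yielding"] = "completed"
    except asyncio.TimeoutError:
        results["yielding"] = "timed out"
    # The blocking sleep runs for its full 0.5 s despite the 0.1 s timeout:
    # the deadline is only noticed once control returns to the loop.
    start = time.monotonic()
    try:
        await asyncio.wait_for(blocking_work(), timeout=0.1)
    except asyncio.TimeoutError:
        pass
    results["blocking_elapsed"] = time.monotonic() - start
    return results

results = asyncio.run(main())
```

The same dynamic applies to gevent.Timeout and eventlet.Timeout: they fire via the hub, so greenlets that never yield are never interrupted.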

I wonder how well the signals approach would work. It's possible that code could even mask the signals. Or more likely, I think CPython delivers signals to Python code only at certain moments; this timeout would not work if code is in a long running call in a C extension module.

Maybe we should take a step back and figure out what we really want from this feature. When I look at a server like NGINX I note that there is no request timeout. There are only timeouts around I/O. There are timeouts for receiving and parsing the request (headers, body) and for reading and writing data to the socket. For the proxy module, there are timeouts for sending and receiving data from the upstream server. There is no general request timeout.

I am not completely convinced that we should implement such a feature. Do you have a specific use case in mind for yourself that can help us reason through this?

@jamadden
Collaborator

> I think the same is true for gevent as eventlet. If you do not yield to the event loop the timeouts don't complete.

That's true. (Well, they complete, but they don't interrupt.)

> It's possible that code could even mask the signals. Or more likely, I think CPython delivers signals to Python code only at certain moments; this timeout would not work if code is in a long running call in a C extension module.

That's also true. gevent goes out of its way to try to ensure that signals get raised in a timely manner. Other C code may not be so considerate. (And even in gevent in certain cases it really is just "best effort" for timely signal delivery.)

@orenmazor

> I am not completely convinced that we should implement such a feature. Do you have a specific use case in mind for yourself that can help us reason through this?

@tilgovi I get where you're coming from, and framing the nginx proxy timeouts around IO is pretty bang on. This is exactly what I want as well from gunicorn, where it waits on slow IO.

specific use case:

We're currently facing gunicorn worker exhaustion issues during heavy traffic. Nginx times out on slow requests, and is ready to accept new requests, but all of the gunicorn workers are still chugging along. We need to somehow tell these workers nobody cares about them anymore.

Tackling the slow things inside a request is pretty obvious, but I want to address the gunicorn vulnerability here as well.

We're pretty open to any approach you have in mind!

@paultiplady

I'd just like to add another use-case which is similar to the above, in case more user stories are required to justify this feature.

In Django, a very easy mistake is to write an ORM query that will unexpectedly take a long time to complete (say a 10m query, in a pathological case of loading a full DB table with lots of joins). If you're using a sync worker, then gunicorn will time out after --timeout seconds, since the entire worker pool (of one) is busy. However, if you're using async workers, then the individual worker thread/gthread will just keep working even after the nginx reverse proxy has closed the client connection.

This poses two problems:

  1. Your worker is being wasted on work that nobody cares about, since the client has already timed out and gone away. (This is @orenmazor 's point above). If you ever got to a fully-starved daemon, then you should timeout and kill all the workers. But in the partial-exhaustion state that would lead up to this, you'll just see the latency creep up as more workers get blocked.
  2. This can produce seriously unexpected behaviour, if for example your worker commits changes to the DB long after the client has received a 503 error and assumed that their request was not fulfilled. While obviously you should use idempotency keys or some other method for guaranteeing that important requests aren't duplicated, having the option of terminating the gunicorn worker and returning a 500 error safely to the client, instead of leaving a zombie worker, would be preferable from a "principle of least surprise" perspective.

It's true that if the greenlet is blocked on a C extension, then you can't get the cancel signal through. But there are many cases where the worker is doing work that does yield to the event loop (say hitting other backends as an API gateway).

And even if you're using a C extension (like an ORM wrapper), I believe we'd still have the option of adding a gevent.sleep(0) before calling out to the C code, if we wanted to enable this feature in that context.

I may just be missing an obvious way to fix these issues; if there's another approach for handling this in the sync worker case I'd love to know it. Thanks!

@tilgovi
Collaborator

tilgovi commented Jan 23, 2019

Thanks, @paultiplady.

It's definitely possible for a user to wrap a timeout function around the application as WSGI middleware. However, I'm not sure it can be done in an event loop agnostic way. For that reason, we may want to have support in Gunicorn.

I also think there's an argument for just having something to clear up confusion about how --timeout works. 😁

@yoobles

yoobles commented May 27, 2020

Bumping this PR.

Without this PR, how do you timeout an async worker request? This seems like a large deficiency in gunicorn + gevent.

@tilgovi
Collaborator

tilgovi commented May 27, 2020

@yoobles if you know what event loop you're using you can implement your own timeouts. That is the most robust solution because it lets your application handle cancellation signals and perform any necessary cleanup so that timeouts don't cause clients to simply retry and overwhelm an already slow system. If you read the discussion, I hope you'll agree that we haven't yet come to a consensus about what general purpose timeout would be appropriate to implement, or how it would work. If you have something to contribute to that discussion, it would be very appreciated.

@yoobles

yoobles commented May 29, 2020

I implemented a middleware based off the gevent timer code here that times out requests, but I believe that overall this should be handled in gunicorn.

I don't have enough knowledge about event loops to say what the right approach is, but uwsgi seems to support this with harakiri, which appears to be implemented at least partially with SIGALRM (https://uwsgi-docs.readthedocs.io/en/latest/FAQ.html#what-is-harakiri-mode).

Even if the SIGALRM approach doesn't handle every application, I think having a starting point is superior to letting requests potentially run unbounded.

7 participants