Figure out backpressure #3

Closed
carllerche opened this issue Jul 29, 2017 · 1 comment · Fixed by #6

Comments

@carllerche (Member)

A placeholder for figuring out how to correctly handle back pressure w/ Service. I have attached notes I wrote a while ago, but my thoughts have evolved since. I haven't written them down yet.

Service back pressure

Basic options

Option 1: Use response futures

In this strategy, there is no back pressure mechanism built into a service implementation. Each service implementation always accepts a request and starts processing it, queuing if necessary when upstream resources are not available.

The assumption is that the caller of the service (tokio-proto) will maintain a maximum number of outstanding futures for a given connection. For example, a common limit for an HTTP/2.0 implementation is 100 in-flight requests.
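
For illustration, here is a minimal sketch of that caller-side limit; the Connection type, MAX_IN_FLIGHT constant, and methods are made up for this example and are not tokio-proto's actual internals.

// Hypothetical stand-in for a per-connection dispatcher; back pressure comes
// only from capping the number of outstanding response futures.
const MAX_IN_FLIGHT: usize = 100;

struct Connection<F> {
    // Response futures for requests that have been accepted but not completed.
    in_flight: Vec<F>,
}

impl<F> Connection<F> {
    // The connection only reads a new request off the wire while it has a
    // free in-flight slot.
    fn can_dispatch(&self) -> bool {
        self.in_flight.len() < MAX_IN_FLIGHT
    }

    fn dispatch(&mut self, response_future: F) {
        debug_assert!(self.can_dispatch());
        self.in_flight.push(response_future);
    }
}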

The problem here is that this strategy only really works if each connection is independent. If all connections require access to a global resource, say a global queue with its buffer set to 1,000, and there are 100k open connections, the number of outstanding requests will start to back up heavily. In this case, 10 open connections (at 100 in-flight requests each) are enough to fill the global queue, leaving the other 99,990 connections each able to buffer up to 100 requests of their own.

The ideal situation here is that, as the global queue becomes full, tokio-proto stops accepting new connections.

Option 2: AsyncService

This strategy modifies Service::call to align more closely with the Sink API. If a service is not ready to accept a new request, the call returns immediately with AsyncService::NotReady(request). When the service becomes ready, the task is notified and the request can be attempted again. When tokio-proto and other callers of Service receive an AsyncService::NotReady(req), the common response will be to buffer the returned request until the service becomes ready again and to not generate any further requests until that buffer slot is cleared. In rarer cases, it may be possible to fail over to a secondary service.
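
For concreteness, a minimal sketch of what the return type described here could look like; the exact shape is an assumption drawn from this discussion, not a published API.

enum AsyncService<F, R> {
    // The service accepted the request and handed back its response future.
    Ready(F),
    // The service is at capacity; the original request is returned so the
    // caller can buffer it and retry once the task is notified.
    NotReady(R),
}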

This strategy creates significant complexity for all implementations of Service. First, it exposes AsyncService and requires understanding that concept; second, every middleware needs to be able to return the original request if the upstream is not ready.

BEFORE:

impl<T, P> Service for ClientService<T, P> where T: 'static, P: ClientProto<T> {
    type Request = P::Request;
    type Response = P::Response;
    type Error = P::Error;
    type Future = ClientFuture<T, P>;

    fn call(&mut self, req: P::Request) -> Self::Future {
        ClientFuture {
            inner: self.inner.call(Message::WithoutBody(req))
        }
    }
}

AFTER:

impl<T, P> Service for ClientService<T, P> where T: 'static, P: ClientProto<T> {
    type Request = P::Request;
    type Response = P::Response;
    type Error = P::Error;
    type Future = ClientFuture<T, P>;

    fn call(&mut self, req: P::Request) -> AsyncService<Self::Future, P::Request> {
        match self.inner.call(Message::WithoutBody(req)) {
            AsyncService::Ready(f) => {
                AsyncService::Ready(ClientFuture {
                    inner: f,
                })
            },
            AsyncService::NotReady(req) => {
                match req {
                    Message::WithoutBody(req) => AsyncService::NotReady(req),
                    _ => panic!("wat"),
                }
            }
        }
    }
}

If the middleware mutates the request in a way that is not reversible and the upstream returns AsyncService::NotReady, then the middleware has no choice but to error the request, which kind of defeats the purpose.

Also, a service must be able to determine if the request can be processed immediately, which can cause additional complexity in the case where a service implementation does something like:

upstream_a.call(request)
	.and_then(|resp_a| upstream_b.call(resp_a));

It is unclear how to handle upstream_b.call returning AsyncService::NotReady since the original call has already returned. The only solution that I can think of is to buffer resp_a and set a flag on the middleware to not accept any more requests.
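
A rough sketch of that workaround; the ChainedMiddleware name and its fields are made up for illustration.

struct ChainedMiddleware<A, B, RespA> {
    upstream_a: A,
    upstream_b: B,
    // Intermediate response waiting for upstream_b to have capacity; while
    // this is Some, the middleware refuses to accept new requests.
    parked: Option<RespA>,
}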

AsyncService also does not handle the “router” problem, where one route may be ready while another is not. If a request results in AsyncService::NotReady, the caller will either have to stop all further requests or buffer the rejected requests while continuing to send new ones. This “pending” buffer could become large. Also, on every “tick” of the event loop, the service will need to attempt to flush all buffered requests even if only a single one may be ready (it’s also unclear how the service flushes the buffered requests, since there is no “poke”). And if a buffer is introduced anyway, it is unclear why the router doesn’t just manage the buffer itself with the poll_ready strategy, which would be much simpler.

This seems to imply that the right thing to do when a service is overloaded is to resolve the response future as an error. The problem then is: how does the caller know when the service will accept another request?

Option 3: poll_ready

In this strategy, Service has an additional function: poll_ready(). The function returns Async::Ready when the service is ready to accept another request. The caller then sends the request with call, which returns a response future, since the caller “knows” that the service can accept the request.
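
A minimal, self-contained sketch of this shape, using simplified local stand-ins for futures' Async/Poll types; the exact names and signatures are assumptions, not a final API.

enum Async<T> {
    Ready(T),
    NotReady,
}

type Poll<T, E> = Result<Async<T>, E>;

trait Service {
    type Request;
    type Response;
    type Error;
    type Future; // would be a Future resolving to Self::Response

    // Returns Ready(()) when the service can accept another request; if it
    // returns NotReady, the current task is notified once capacity frees up.
    fn poll_ready(&mut self) -> Poll<(), Self::Error>;

    // Only called after poll_ready returned Ready, so it can always return a
    // response future and never has to hand the request back.
    fn call(&mut self, req: Self::Request) -> Self::Future;
}

// The caller-side pattern: check readiness, and only then hand over the request.
fn try_dispatch<S: Service>(svc: &mut S, req: S::Request) -> Result<Option<S::Future>, S::Error> {
    match svc.poll_ready()? {
        Async::Ready(()) => Ok(Some(svc.call(req))),
        // Not ready: keep the request and stop generating new ones until the
        // task is woken again.
        Async::NotReady => Ok(None),
    }
}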

Questions would be: is poll_ready() a guarantee, or can it return false positives? If it does return a false positive, how does a service respond to a call when there is no availability? I would suggest that this can be left up to the service implementation (maybe it is a configuration setting) and that we provide some best practice hints. There really are two options: either the service accepts the request, buffers it, and returns a response future, or the service returns an error future. Both seem like they could be acceptable depending on the situation. Note that if the service returns an “out of capacity” error, the poll_ready function should help the caller determine when to resume sending requests.

Another question is how to handle services that are conditionally ready depending on the details of the request. For example, a router may have one route that is ready and another that isn’t. The way this would work with poll_ready is that the router is configured with a max buffer size per route. When a route’s buffer is exceeded, the router can either disable the entire router (poll_ready returns not ready) or error further requests on that route while keeping the router service “ready”.
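
A hedged sketch of that per-route policy; the Router type, its fields, and the routing function are all illustrative, not real tower code.

use std::collections::{HashMap, VecDeque};

struct Router<R> {
    max_per_route: usize,
    // Requests buffered per route name.
    buffers: HashMap<String, VecDeque<R>>,
    // Policy choice from above: a single full route either makes the whole
    // router report "not ready", or only that route's requests get rejected.
    disable_whole_router: bool,
}

impl<R> Router<R> {
    // Placeholder routing decision; a real router would inspect the request.
    fn route_of(&self, _req: &R) -> String {
        "default".to_string()
    }

    // What poll_ready would report under each policy.
    fn is_ready(&self) -> bool {
        if self.disable_whole_router {
            self.buffers.values().all(|b| b.len() < self.max_per_route)
        } else {
            true
        }
    }

    // Returns Err(req) when the chosen route's buffer is full, mirroring the
    // "error further requests on the route" option.
    fn try_push(&mut self, req: R) -> Result<(), R> {
        let route = self.route_of(&req);
        let buf = self.buffers.entry(route).or_insert_with(VecDeque::new);
        if buf.len() < self.max_per_route {
            buf.push_back(req);
            Ok(())
        } else {
            Err(req)
        }
    }
}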

I believe that this strategy is simpler and provides the same capabilities as option 2.

Advanced opt-in possibilities

Service-specific back pressure strategies could be employed as well. For example, in the router case, when a route is not ready, the router could error the request but provide back pressure information in the error:

struct RouteUnavailable<R> {
    request: R,
    route: String,
    // Resolves when this specific route has capacity again (error type
    // simplified to () here).
    ready: Box<dyn Future<Item = (), Error = ()>>,
}

In this case, the error includes the route that was unavailable as well as a future representing that specific route becoming available again. The application can then handle the back pressure signal or just ignore the error and the request is aborted.

Another strategy could be having a “state” on the response future:

enum ResponseState {
	Healthy,
	Distressed,
}

Services could then always accept all requests, but if there is a back pressure situation going on, the response future will be in the “distressed” state. The service caller could then decide if it wants to abort the request (drop the response future) or buffer it…
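
A rough sketch of what that could look like, reusing the ResponseState enum above; the Response wrapper and its method are made up for illustration.

struct Response<F> {
    state: ResponseState,
    // The actual response future; dropping the whole Response aborts the request.
    inner: F,
}

impl<F> Response<F> {
    // The caller checks this before deciding whether to buffer or abort.
    fn is_distressed(&self) -> bool {
        match self.state {
            ResponseState::Distressed => true,
            ResponseState::Healthy => false,
        }
    }
}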

@danburkert

Here are a few scenarios I've run into while writing Service implementations:

  1. Pushing on to a queue
  2. Pushing on to one of many queues, e.g., a router service
  3. Pushing on to a sequence of queues

For each of these, 'pushing on to a queue' can be replaced with pretty much any operation that needs backpressure.
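
For scenario 1, a minimal sketch of a bounded queue under the poll_ready model from Option 3; the QueueService type is illustrative and not from any real crate.

use std::collections::VecDeque;

struct QueueService<T> {
    queue: VecDeque<T>,
    capacity: usize,
}

impl<T> QueueService<T> {
    // The poll_ready check, simplified to a bool: ready while the bounded
    // queue has a free slot. A real implementation would also register the
    // task for wakeup when full.
    fn ready(&self) -> bool {
        self.queue.len() < self.capacity
    }

    // Only called after ready() returned true, so the push never exceeds the
    // configured capacity.
    fn call(&mut self, item: T) {
        debug_assert!(self.ready());
        self.queue.push_back(item);
    }
}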
