Error from gRPC .NET client throws "proxy max-concurrency exhausted" #5183
Comments
proxy max-concurrency exhausted
@rafalkasa we'll need a lot more information to help out I'm afraid. A replication setup would be perfect, but maybe some more details about the types of workloads and what's happening at the time? Perhaps some metrics from the client side? |
We had the same kind of error in one of our services. The error message was confusing because there was no way we were generating enough connections for "proxy max-concurrency exhausted" to make any sense. However, the gRPC service being unavailable gave us something to try, because the service was listening on the pod's IP address. When we injected linkerd, the requests started coming from the proxy container, and when we changed the address we were listening on to 127.0.0.1, connections to our service started succeeding. We're just starting with linkerd, so I'm not sure whether this is expected behavior or whether we had done something else to get the connection working at that time, but changing the address we were listening on seemed to be key in our case. |
In today's edge-20.11.5, we've made some adjustments to the default proxy configuration to try to avoid this type of error. If you have a staging environment where this issue can be reproduced, we'd love help testing these changes that we plan to include in stable-2.9.1. |
@olix0r I actually get a proper UNAVAILABLE status when using edge-20.11.5 and stable-2.9.1, which is an improvement over stable-2.9.0. It still says For context, I'm simulating some error scenarios by killing grpc service pods and checking that the clients return reasonable errors. |
Is there any update on if |
I'm still occasionally seeing max-concurrency errors on 2.9.1. It seems to happen when k8s removes pods which are serving active requests. |
We are still on |
Same here on |
So, this error means that the proxy has 10K pending requests--that is, requests for which no response headers have been received. In a recent edge release we've updated this limit to 100K, but I suspect that this is just hiding some other sort of problem. Unfortunately, I've been unable to reproduce this with our test apps, and we haven't been provided with diagnostics to help narrow this down. In fact, we don't even have a full log message that indicates whether this is being emitted on the inbound or outbound side. If someone can provide a test application (k8s yaml) that can reproduce this, that would be ideal. Barring that, it would be helpful to at least get proxy debug logs (i.e., with the |
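For reference, proxy debug logging is typically enabled with the `config.linkerd.io/proxy-log-level` pod annotation; a minimal sketch of the relevant part of a workload spec (only the annotation matters here; the rest of the workload is omitted):

```yaml
# Fragment of a Deployment's pod template.
spec:
  template:
    metadata:
      annotations:
        # Standard Linkerd pod annotation controlling the proxy's log level.
        config.linkerd.io/proxy-log-level: debug
```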
Hi @olix0r, here's some more info for our use case: the error shows up on the client side, i.e. the sidecar of a pod that's making requests to a service. The service has a sidecar with the same version of linkerd. This happens even when there's virtually zero other traffic. I'm not 100% sure if the service sidecar emits any sort of error. We're using gRPC with a Scala (akka-grpc) client and service. I can set up a sandbox deployment of this service with a cronjob making requests on a five-minute schedule with the logging level you suggested. I'll let it run for about 24 hours and post back any occurrences of this error. |
Unfortunately I haven't been able to replicate the exact behavior on-demand. Go figure. I'll remember to post back once I see it again. |
I also haven't yet been able to reproduce this on a sample application. These are the debug and warn logs from the server: I noticed that I hadn't set up any timeouts on the calls, and we thought adding those might help. After doing so and setting the timeout to 30s, it took about an hour before the problem started showing up again. We don't really see how there could be 10k pending requests on each of the server instances' linkerd proxies. The same setup was working on a cluster which was running linkerd version I'm still trying to provide you with a sample application, but maybe this helps already in narrowing the issue down. |
@karsten42 thanks. That's helpful context. If you're able to, it would be helpful to see proxy debug logs (i.e., with the |
The gist I attached are the proxy debug logs, sorry if that wasn't clear.
I will try to get them. |
I added the output from |
While debugging this today, we noticed something surprising: this error message is completely misleading. The underlying error doesn't actually indicate that the service is at capacity (so we've been trying to debug a very different situation). Instead, this error signals, simply, that the proxy is in failfast. A service may enter failfast whenever it is unable to process requests for some amount of time---outbound=3s, inbound=1s. A load balancer, for instance, may have no endpoints; or, when no balancer is present, an individual endpoint may be down or otherwise not processing requests. The failfast mechanism is primarily intended to prevent the proxy from buffering requests indefinitely: if the service has been stuck in an unavailable state, we start serving error responses immediately, rather than waiting a full timeout before each request fails. Recently (since 2.9.x), we've updated the proxy to include more descriptive failfast error messages that indicate which layer of the proxy is in failfast. I've opened a PR to replace "max concurrency exhausted" with this more descriptive error message (linkerd/linkerd2-proxy#847); and we'll revisit the data we have in light of this revelation. This is a high priority for us to address before stable-2.10. |
We are seeing this issue almost constantly when we try to add linkerd2 to our system. This morning I upgraded to linkerd 2.9.2 and the issue continues to persist. After the mention by @olix0r of the potential issue with failfast, I tried to grab some logs to help. See this gist. In the logs, most of the requests are coming from The health-check tool is periodically checking the Let me know if any other information would be helpful. |
Thanks @rltvty. In your case we see:
This error is pretty straightforward: the application isn't accepting these connections, so the proxy is unable to forward requests. Is it possible that your application is not listening on If your application does not listen on localhost, communication would work without the proxy; but the proxy forwards all inbound traffic to the application over localhost, to prevent creating traffic loops, which are all too possible in more complex container networking configurations. |
For us at least part of the issue was actually in our code where we initialised our grpc server with the setting
where We saw another service where this was happening yesterday which didn't have this grpc server setting though. We will continue to investigate this next week. |
In my issue, the server container is still starting as these requests are coming in. It makes sense that the linkerd-proxy is unable to forward the requests while the server container is starting, but it seems that even after the server is up, linkerd-proxy continues to block/drop all requests to the pod for some time. The server is written in Scala... I'll check into how it is listening. In the pod configuration, we are exposing the following ports on the server container:
|
@olix0r It looks like we are listening on Thanks. |
@rltvty No, There is a default connect timeout of 100ms--I'd be surprised if you were hitting that on a pod where the application is effectively idle, but we could try increasing that limit. I believe that we'd have to use It would also probably be pretty illuminating to capture PCAP using either the linkerd debug container or |
I haven't seen the issue now since I added the |
We've merged linkerd/linkerd2-proxy#847 and linkerd/linkerd2-proxy#848 to main to help improve failfast diagnostics. I've built and pushed a container image for the proxy that includes these changes. It can be used with control planes from 2.9.x or recent edge releases.
|
@olix0r I tried using the new build, and my services are failing on startup. Logs are in this gist. I included both the logs from the service container itself and the linkerd-proxy container, for two of my services.
Both of these services worked "most of the time" with 2.9.0, 2.9.1, 2.9.2 builds, except the proxy would occasionally go into the "max-concurrency exhausted" state. With this test build, I can't get the services to start successfully. Please let me know if you need any additional info. |
Thanks for the helpful data! My observations:

egret

inbound

All of the inbound traffic to egret appears to be prometheus scrapes on port 9100; and it appears that all of these requests fail due to

To test this, I'd

If not, is it possible that the metrics server doesn't start listening until some initialization is complete? Perhaps it doesn't serve until the database connection is established, for instance?

outbound

On the outbound side we see the application trying to connect to two different types of endpoints:

port 5432 (rds)

These connections are hitting a protocol detection timeout:
Because the proxy transparently detects HTTP traffic, it may fail to allow connections for some applications. See the docs for more information. The most reliable way to fix this is to simply disable the proxy on these connections by setting a workload annotation like:
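The annotation in question is presumably the standard skip-ports one; a minimal sketch of the relevant part of the client workload's pod template (the surrounding workload spec is omitted):

```yaml
# Fragment of the client Deployment's pod template. The standard
# config.linkerd.io/skip-outbound-ports annotation tells the proxy-init
# iptables rules to bypass the proxy for outbound connections to port 5432.
spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/skip-outbound-ports: "5432"
```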
This will configure iptables to bypass the proxy on these connections.

port 8200

It also looks like your application is establishing non-HTTP connections on port 8200. We are able to establish connections on these endpoints, though it appears that one of the peers is closing the connection fairly quickly. I don't have enough information to know whether this is by design or whether it indicates some problem:

marlin

The marlin logs are not at the debug level, so we don't have as much insight into what's going on here. It would probably be helpful to increase the log level via the workload annotation Here's what we see, though:

inbound

Like the

port 8558 (http healthcheck)

We see more

If you

port 50051 (grpc healthcheck)

The log level is hiding more details about the errors, but it's probably safe to assume that they are the same

I'd be curious to see this pod's logs with trace logging enabled. But, I think if we can at least determine whether the above

outbound

port 9042 (cassandra)

Again, this error indicates that one of the peers is disconnecting. Without more detailed logging, it's hard to make a guess at what's going on here. We can try to improve the proxy's diagnostics in this regard, but more logging would help clarify things. One thing that stands out is that these errors are fairly sporadic:

With the most recent edge releases, there's another thing you can try here, though I don't have a strong signal about whether it would help. You can set the

To summarize
|
I've opened linkerd/linkerd2-proxy#852 to help give us more information about which peer is failing when we see errors like Transport endpoint is not connected. |
I've verified that Instructing the linkerd-proxy to skip outbound port I might not have a chance to look at |
I found some time to play with marlin. I updated its linkerd-proxy to skip outbound port 9042 for Cassandra, and also updated the linkerd-proxy on Cassandra to skip inbound on 9042. Now marlin is starting successfully. Is there any documentation with some general guidance on which ports should skip the proxy? Initially, I was thinking it made sense to have the proxy handle all traffic to gain the benefits of auto mTLS and span-id injection for tracing. But now I'm realizing that a lot of traffic (like databases) already has built-in TLS, and doesn't use http/http2 protocols, so it isn't compatible with span-id injection. Should linkerd-proxy connections be limited to http/http2 traffic? |
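For reference, the change described above maps to a pair of standard pod annotations; a minimal sketch, with the surrounding workload specs omitted:

```yaml
# Client side (marlin): bypass the proxy for outbound Cassandra traffic on 9042.
spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/skip-outbound-ports: "9042"
---
# Server side (the Cassandra workload): bypass the proxy for inbound traffic
# on the same port.
spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/skip-inbound-ports: "9042"
```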
https://linkerd.io/2/features/protocol-detection/ -- we ship with a standard set of skip ports to get many common cases; but clearly we don't catch all of them.
Correct, tracing doesn't really work with arbitrary TCP protocols. As for mTLS, the opaque ports feature I mentioned is intended to allow the proxy to transport arbitrary protocols over mTLS. This currently only works with resources that are running in your cluster (so if you're using RDS from your cloud provider, that won't work here). Though, it could potentially work for your cassandra use case.
Linkerd adds the most value for HTTP. However, in 2.9 we started supporting more features (mTLS, traffic split, etc) for non-HTTP traffic. In 2.10, we'll introduce the new opaque ports feature to broaden this support to non-detectable protocols; and we'll also start supporting non-HTTP traffic in multicluster configurations. |
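As a rough sketch of how the opaque ports feature is expressed, assuming the pod annotation shape used in the edge releases leading up to 2.10:

```yaml
# Fragment of the Cassandra workload's pod template: the proxy transports
# connections to port 9042 over mTLS without attempting protocol detection.
spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/opaque-ports: "9042"
```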
Thanks @olix0r! I'm beginning to think our issues with "proxy max-concurrency exhausted" are just due to us configuring linkerd incorrectly. I'm pushing out your suggested changes to more of our test environments today. I'll let you know if I encounter any additional issues. |
@olix0r we are still running into Is there some way we can configure linkerd to not drop into |
When a Service is in failfast, the inner service is only polled as new requests are processed. This means it's theoretically possible for certain service tasks to be starved. This change ensures that these layers are paired with a `SpawnReady` layer to ensure that the inner service is always driven to readiness. This could potentially explain behavior as described in linkerd/linkerd2#5183; though we don't have strong evidence to support that. This seems like a healthy defensive measure, in any case. This change also improves stack commentary to favor larger descriptive comments over layer-level annotations. While auditing services for readiness, an unnecessary buffer has been removed from the ingress HTTP stack.
@rltvty A few questions: Does this only occur at startup or does this also occur after the process has successfully processed requests? Does the proxy ever recover (if given enough time) or does it stay stuck this way after the process has initialized? You could check this by If the proxy gets "stuck" in this state, it would be awesome if you can test this proxy version:
This may not fix the behavior you're seeing, but I found a place where, at least theoretically, a service could get stuck in failfast on the inbound side. How would this situation differ when Linkerd isn't present? I assume that the client would just get a TCP Connection Refused error and try reconnecting? How does the client connect to marlin? Is it connecting to a Does marlin have readiness probes configured? Generally, I would not expect pods to be discoverable until the readiness probe succeeds. If they're not present, I strongly recommend configuring them--with or without Linkerd, it's a generally important feature of Kubernetes. |
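As a point of reference, a readiness probe on the serving container might look roughly like this; the port and path below are illustrative placeholders, not marlin's actual configuration:

```yaml
# Fragment of a container spec: the pod is not added to Service endpoints
# until this probe succeeds, so clients (and the proxy's balancer) should not
# see the pod before it can serve.
readinessProbe:
  httpGet:
    path: /ready
    port: 8558
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
```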
Hi all,

Infra

I produced the issue with 2 scenarios so far:
pod error: client error: on having 20 connection with 400 concurrent configured on the client, |
Your error includes:
Are you sure there are actually endpoints in the service at the time when the error is logged? If you are running a control plane from version stable-2.9.x or a recent edge release, it may be helpful to test the above-mentioned proxy version by setting annotations like:
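A minimal sketch of what such annotations look like, with the image and tag values as placeholders rather than the actual test build:

```yaml
# Fragment of a pod template: run a specific proxy build for this workload.
# Replace the placeholder values with the test image referenced earlier.
spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/proxy-image: example.registry/l2-proxy  # placeholder
        config.linkerd.io/proxy-version: test-build               # placeholder
```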
If none of this helps, it would probably be best to open a new issue so we get into more of the details of your specific issue. |
@olix0r I had 2 pods in Ready status under that deployment, so that wouldn't make much sense |
When a Service is in failfast, the inner service is only polled as new requests are processed. This means it's theoretically possible for certain service tasks to be starved. This change ensures that these layers are paired with a `SpawnReady` layer to ensure that the inner service is always driven to readiness. This could potentially explain behavior as described in linkerd/linkerd2#5183; though we don't have strong evidence to support that. This seems like a healthy defensive measure, in any case. This change also improves stack commentary to favor larger descriptive comments over layer-level annotations. While auditing services for readiness, an unnecessary buffer has been removed from the ingress HTTP stack. Finally, this change updates the inbound connect timeout to 300ms to accommodate especially slow endpoints.
@olix0r Now that we are configuring linkerd-proxy properly, the issue seems to primarily happen during the launch of a new cluster of pods. Initializing and getting to a quorum state on a new cluster takes about 3 minutes. After reaching quorum the first time, a single pod takes about 45 seconds to restart and rejoin the cluster successfully. Yesterday in this scenario, I was seeing that the linkerd-proxy started blocking requests to the pods on the gRPC port (with fail-fast) shortly after launching the pods. I waited about 30 minutes to see if the proxy would start re-allowing requests to the pods, but it never happened. Then I rolled each of the pods in the cluster individually, and the proxy did not block requests on gRPC after the pods restarted. Note that we are currently only using linkerd to proxy the gRPC port that is served by If linkerd wasn't present, these requests would still fail, due to connection refused or connection timeout errors. However, after the service starts successfully the requests would then return successfully. With the linkerd-proxy, the requests seem to fail indefinitely. This is the issue. Linkerd never seems to allow requests back to the service to re-check if it should be blocking. |
oh, I'll try the new build too. thanks! |
@rltvty Thanks for sharing. Let us know how the new build works. If you're still seeing problems, I think we have enough information to try to reproduce this in a test configuration. |
@olix0r I don't see much improvement with the new build. Our pods are still not recovering on startup. I also noticed another thing... The marlin pods communicate with each other using UDP on port 25520. When the I can see the UDP traffic in the Any ideas on how to resolve this? |
Bummer. We'll take a deeper look at trying to reproduce this. (Specifically, a pod that receives requests before the application has initialized).
Linkerd shouldn't touch UDP at all right now. It makes sense that you wouldn't see anything in the proxy's logs because these packets should never hit the proxy. This could potentially point to some issue with our iptables initialization, but I'd be pretty surprised if that was the case. If you can't figure this out, it's probably best to open a new issue focused on that distinct problem. |
We continue to work on diagnosing and reproducing this issue. We've made some changes to the inbound stack that may better highlight what's going on. I'd recommend testing with the following workload annotations:
attn: @rltvty |
Sorry that I have been unable to test this as of yet. Maybe next week. |
Is this in the class of issues that was fixed in 2.10? |
I'm going to close this issue, as the proxy can no longer return this error. There are hopefully more descriptive errors and diagnostics now that we can use to debug this class of timeout. Please feel free to open new issues if undiagnosable failfast errors are encountered. |
Bug Report
For the past few days I have been receiving this error:
Grpc.Core.RpcException: Status(StatusCode=Unavailable, Detail="proxy max-concurrency exhausted")
What is the issue?
Problems connecting from a .NET client to a Python gRPC server on Kubernetes, using linkerd as the service mesh
How can it be reproduced?
I can't reproduce this myself; it has only happened in the production environment over the last few days
Logs, error output, etc
linkerd check output

Environment
Possible solution
Additional context