
gRPC .NET client throws error "proxy max-concurrency exhausted" #5183

Closed · rafalkasa opened this issue Nov 6, 2020 · 45 comments

@rafalkasa

Bug Report

For the past few days I have been receiving the following error:
Grpc.Core.RpcException: Status(StatusCode=Unavailable, Detail="proxy max-concurrency exhausted")

What is the issue?

Problems connecting from a .NET client to a Python gRPC server on Kubernetes, using Linkerd as the service mesh.

How can it be reproduced?

I can't reproduce this myself; it has only happened in the production environment over the last few days.

Logs, error output, etc


linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

linkerd-addons
--------------
√ 'linkerd-config-addons' config map exists

linkerd-grafana
---------------
√ grafana add-on service account exists
√ grafana add-on config map exists
√ grafana pod is running

Status check results are √

Environment

  • Kubernetes Version: 1.15.12
  • Cluster Environment: AKS
  • Host OS: Linux
  • Linkerd version: 2.8.1 (stable)


rafalkasa changed the title from Error from gRPC .net client throw error proxy max-concurrency exhausted to Error from gRPC .net client throw error "proxy max-concurrency exhausted" on Nov 6, 2020
@grampelberg
Contributor

@rafalkasa we'll need a lot more information to help out, I'm afraid. A reproduction setup would be perfect, but could you share more details about the types of workloads and what's happening at the time? Perhaps some metrics from the client side?

@brianwint

We had the same kind of error in one of our services. The error message was confusing, because there was no way we were generating enough connections for "proxy max-concurrency exhausted" to make any sense. However, the gRPC service being unavailable gave us something to try, because the service was listening on the pod's IP address. Once Linkerd was injected, requests came from the proxy container, and when we changed the address we were listening on to 127.0.0.1, connections to our service started succeeding.

We're just getting started with Linkerd, so I'm not sure whether this is expected behavior or whether something else we did got the connection working at that time, but changing the address we were listening on seemed to be the key in our case.

@olix0r
Member

olix0r commented Nov 27, 2020

In today's edge-20.11.5, we've made some adjustments to the default proxy configuration to try to avoid this type of error. If you have a staging environment where this issue can be reproduced, we'd love help testing these changes that we plan to include in stable-2.9.1.

@alexklibisz

alexklibisz commented Dec 11, 2020

@olix0r I actually get a proper UNAVAILABLE status when using edge-20.11.5 and stable-2.9.1, which is an improvement over stable-2.9.0. It still says proxy max-concurrency exhausted, which doesn't make sense IMO, but at least it's UNAVAILABLE and not a 200 containing an empty body.

For context, I'm simulating some error scenarios by killing grpc service pods and checking that the clients return reasonable errors.

@karsten42

Is there any update on whether stable-2.9.1 helps with this error? I'm seeing the same issue between two meshed Go services at around 20 requests per second.
Is it possible to configure the max concurrency? What is it set to by default? I didn't find anything in the docs about this.

@alexklibisz

I'm still occasionally seeing max-concurrency errors on 2.9.1. It seems to happen when k8s removes pods which are serving active requests.

@karsten42

We are still on stable-2.9.0 and are seeing this constantly, not just in the context of pod removals or restarts.

@nnqq

nnqq commented Jan 4, 2021

Same here on stable-2.9.1. It appears from time to time and I can't reproduce it.

@olix0r
Member

olix0r commented Jan 4, 2021

So, this error means that the proxy has 10K pending requests, i.e. requests for which no response headers have been received. In a recent edge release we've raised this limit to 100K, but I suspect that this just hides some other sort of problem.
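
To make the limit concrete, here is a minimal Go sketch of a pending-request cap of this kind: requests are counted until they produce a response, and anything over the cap is rejected immediately. The port, cap value, and handler are illustrative only; the proxy itself is written in Rust and behaves differently in detail.

package main

import (
	"fmt"
	"log"
	"net/http"
	"sync/atomic"
)

// maxPending stands in for the proxy's cap on requests that have been
// accepted but have not yet produced response headers.
const maxPending = 10000

var pending int64

// limit rejects new requests once maxPending are already in flight,
// answering immediately with 503 rather than queueing them forever.
func limit(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt64(&pending, 1) > maxPending {
			atomic.AddInt64(&pending, -1)
			http.Error(w, "max-concurrency exhausted", http.StatusServiceUnavailable)
			return
		}
		defer atomic.AddInt64(&pending, -1)
		next.ServeHTTP(w, r)
	})
}

func main() {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	log.Fatal(http.ListenAndServe("127.0.0.1:8080", limit(ok)))
}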

Unfortunately, I've been unable to reproduce this with our test apps and we haven't been provided with diagnostics to help narrow this down. In fact, we don't even have a full log message that indicates whether this is being emitted on the inbound (server) or outbound (client) side of a connection. As mentioned above, we need better information if we're going to be able to fix this.

If someone can provide a test application (k8s yaml) that can reproduce this, that would be ideal. Barring that, it would be helpful to at least get proxy debug logs (i.e., with the config.linkerd.io/proxy-log-level: linkerd=debug,warn annotation set) from both the client and server proxies as well as the output of linkerd metrics on these pods.

@alexklibisz

Hi @olix0r, here's some more info for our use case: the error shows up on the client side, i.e. in the sidecar of a pod that's making requests to a service. The service has a sidecar with the same version of Linkerd. This happens even when there's virtually zero other traffic. I'm not 100% sure whether the service sidecar emits any sort of error. We're using gRPC with a Scala (akka-grpc) client and service.

I can set up a sandbox deployment of this service with a CronJob making requests every five minutes, with the logging level you suggested. I'll let it run for about 24 hours and post back any occurrences of this error.

@alexklibisz

Unfortunately I haven't been able to replicate the exact behavior on-demand. Go figure. I'll remember to post back once I see it again.

@karsten42

karsten42 commented Jan 11, 2021

I also haven't yet been able to reproduce this with a sample application.
But I re-enabled Linkerd on our production services and the problem shows up again.
We have version stable-2.9.0 running.
The communication is between two Go services. In this scenario the client makes server-side streaming calls at a rate of about 25 Hz with an average call duration of around 200 ms (although this varies a lot by request). There are 16 server instances running.
After roughly 15 minutes we start seeing the error "proxy max-concurrency exhausted" on the server-side proxy.

These are the debug and warn logs from the server:
https://gist.github.com/karsten42/2198075dabdf34a6fc41cdf3c7f84db2

I noticed that I hadn't set up any timeouts on the calls and we thought adding those might help. After doing so and setting the timeout to 30s it took about an hour before the problem started showing up again.

We don't really see how there could be 10k pending requests on each server instance's Linkerd proxy.

The same setup was working on a cluster which was running linkerd version stable-2.6.1.

I'm still trying to provide you with a sample application, but maybe this already helps narrow the issue down.

@olix0r
Member

olix0r commented Jan 11, 2021

@karsten42 thanks. That's helpful context. If you're able to, it would be helpful to see proxy debug logs (i.e., with the config.linkerd.io/proxy-log-level: linkerd=debug,warn annotation set) as well as the output of linkerd metrics on a pod in this state.

@karsten42

If you're able to, it would be helpful to see proxy debug logs

The gist I attached contains the proxy debug logs, sorry if that wasn't clear.

linkerd metrics

I will try to get them.

@karsten42

I added the output from linkerd metrics to the existing gist

@olix0r
Member

olix0r commented Jan 15, 2021

While debugging this today, we noticed something surprising: this error message is completely misleading. The underlying error doesn't actually indicate that the service is at capacity (so we've been trying to debug a very different situation). Instead, this error simply signals that the proxy is in failfast. A service may enter failfast whenever it is unable to process requests for some amount of time (outbound: 3s, inbound: 1s). A load balancer, for instance, may have no endpoints; or, when no balancer is present, an individual endpoint may be down or otherwise not processing requests. The failfast mechanism is primarily intended to prevent the proxy from buffering requests indefinitely. If the service has been stuck in an unavailable state, we start serving error responses immediately, rather than waiting a full timeout before each request fails.
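
For illustration, a toy Go sketch of the failfast idea under those timeouts; the type and function names are invented for this example, and the proxy's actual (Rust) implementation differs (in particular, leaving failfast depends on the inner service becoming ready again).

package main

import (
	"errors"
	"fmt"
	"time"
)

var errFailFast = errors.New("service in fail-fast")

// failFastGate wraps an unreliable inner call: once the inner call has been
// failing continuously for longer than `timeout`, requests are answered with
// an error immediately instead of being buffered while callers wait.
type failFastGate struct {
	timeout   time.Duration
	downSince time.Time // zero while the inner call is healthy
	inner     func() error
}

func (g *failFastGate) call() error {
	if !g.downSince.IsZero() && time.Since(g.downSince) > g.timeout {
		return errFailFast // unavailable for too long: fail immediately
	}
	err := g.inner()
	switch {
	case err == nil:
		g.downSince = time.Time{} // recovered: leave failfast
	case g.downSince.IsZero():
		g.downSince = time.Now() // first failure: start the failfast clock
	}
	return err
}

func main() {
	g := &failFastGate{
		timeout: time.Second, // the inbound timeout mentioned above
		inner:   func() error { return errors.New("connection refused") },
	}
	for i := 0; i < 3; i++ {
		fmt.Println(g.call())
		time.Sleep(700 * time.Millisecond)
	}
	// Prints two "connection refused" errors, then "service in fail-fast".
}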

Recently (since 2.9.x), we've updated the proxy to include more descriptive failfast error messages, to indicate which layer of the proxy is in failfast. I've opened a PR to replace "max concurrency exhausted" with this more descriptive error message (linkerd/linkerd2-proxy#847); and we'll revisit the data we have in light of this revelation.

This is a high priority for us to address before stable-2.10.

@epinzur

epinzur commented Jan 15, 2021

We are seeing this issue almost constantly when we try to add linkerd2 to our system.

This morning I upgraded to linkerd 2.9.2 and the issue continues to persist.

After the mention by @olix0r of the potential issue with failfast, I tried to grab some logs to help. See this gist.

In the logs, most of the requests are coming from 10.110.16.96, which is a home-made tool written in Go that checks the current status of our server pods. The logs are from the linkerd-proxy on one of the server pods, 10.110.80.70.

The health-check tool is periodically checking the /ready endpoint on the server pod using HTTP on port 8558, and also trying to hit some gRPC HTTP2 endpoints on port 50051.

Let me know if any other information would be helpful.

@olix0r
Member

olix0r commented Jan 15, 2021

Thanks @rltvty. In your case we see:

Jan 15 16:34:57.892 DEBUG ThreadId(01) inbound:accept{peer.addr=10.110.16.96:46856 target.addr=10.110.80.70:8558}:http{v=h2}:http1{name=marlin-0.marlin-internal:8558}:profile:http1: linkerd2_proxy_transport::connect: Connecting peer.addr=127.0.0.1:8558\n
Jan 15 16:34:57.892 WARN  ThreadId(01) inbound:accept{peer.addr=10.110.16.96:47812 target.addr=10.110.80.70:8558}:http{v=h2}: linkerd2_app_core::errors: Failed to proxy request: error trying to connect: Connection refused (os error 111)\n
Jan 15 16:35:00.697 DEBUG ThreadId(01) inbound:accept{peer.addr=10.110.16.96:35650 target.addr=10.110.80.70:50051}:http{v=h2}:http2{name=marlin-0.marlin-internal:50051}:profile: linkerd2_reconnect::service: Failed to connect error=Connection refused (os error 111)\n

This error is pretty straightforward; the application isn't accepting these connections so the proxy is unable to forward requests. Is it possible that your application is not listening on 127.0.0.1:8558 and 127.0.0.1:50051? I.e., perhaps it is only bound on the public interface so it's not accepting connections on the loopback interface? For example, if the server is implemented in Go, I'd look for net.Listen or http.ListenAndServe.

If your application does not listen on localhost, communication would work without the proxy; but the proxy forwards all inbound traffic on localhost to prevent creating traffic loops, which are all too possible in more complex container networking configurations.
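
As a concrete illustration of that last point, here is a minimal Go sketch (the port matches the logs above; the handler and addresses are illustrative): binding 0.0.0.0 covers every interface including loopback, whereas binding only the pod IP would refuse the proxy's connections from 127.0.0.1.

package main

import (
	"fmt"
	"log"
	"net"
	"net/http"
)

func main() {
	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})

	// Listening on 0.0.0.0 binds all interfaces, including loopback, so the
	// sidecar proxy (which forwards inbound traffic to 127.0.0.1) can connect.
	ln, err := net.Listen("tcp", "0.0.0.0:8558")
	if err != nil {
		log.Fatal(err)
	}

	// By contrast, something like net.Listen("tcp", "10.110.80.70:8558")
	// (the pod IP only) would reject the proxy's loopback connections with
	// "connection refused".

	log.Fatal(http.Serve(ln, nil))
}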

@karsten42

For us, at least part of the issue was actually in our own code, where we initialised our gRPC server with this option:

// MaxConcurrentStreams returns a ServerOption that will apply a limit on the number
// of concurrent streams to each ServerTransport.
func MaxConcurrentStreams(n uint32) ServerOption {
	return newFuncServerOption(func(o *serverOptions) {
		o.maxConcurrentStreams = n
	})
}

where n was 1. 🤦
So it's really no wonder that, after a while, the Linkerd proxy couldn't establish connections anymore.
Removing this server option solved the issue at hand and the proxy has been working smoothly ever since.
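
For reference, a minimal sketch of the difference using google.golang.org/grpc (the port is illustrative and service registration is omitted):

package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

func main() {
	// Problematic setup described above: a cap of one concurrent stream per
	// HTTP/2 connection makes the proxy's multiplexed requests queue behind
	// each other until the proxy gives up and fails fast.
	limited := grpc.NewServer(grpc.MaxConcurrentStreams(1))
	_ = limited

	// Dropping the option (or choosing a realistic value) restores normal
	// behavior; register services on srv before serving in a real program.
	srv := grpc.NewServer()

	ln, err := net.Listen("tcp", "0.0.0.0:50051")
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(srv.Serve(ln))
}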

We saw another service where this was happening yesterday which didn't have this grpc server setting though. We will continue to investigate this next week.

@epinzur

epinzur commented Jan 15, 2021

@olix0r

In my case, the server container is still starting as these requests come in. It makes sense that the linkerd-proxy is unable to forward the requests while the server container is starting, but it seems that even after the server is up, linkerd-proxy continues to drop all requests to the pod for some time.

The server is written in Scala... I'll check how it is listening.

In the pod configuration, we are exposing the following ports on the server container:

ports:
    - containerPort: 8558
      name: management
      protocol: TCP
    - containerPort: 50051
      name: grpc
      protocol: TCP
    - containerPort: 9095
      name: prometheus
      protocol: TCP

@epinzur

epinzur commented Jan 15, 2021

@olix0r It looks like we are listening on 0.0.0.0. Do we need to move to 127.0.0.1 for linkerd-proxy to function correctly?

Thanks.

@olix0r
Member

olix0r commented Jan 15, 2021

@rltvty No, 0.0.0.0 should bind on all interfaces...

There is a default connect timeout of 100ms--I'd be surprised if you were hitting that on a pod where the application is effectively idle, but we could try increasing that limit. I believe that we'd have to use linkerd inject --manual and then manually edit the yaml to include a LINKERD2_PROXY_INBOUND_CONNECT_TIMEOUT environment variable.

It would also probably be pretty illuminating to capture PCAP using either the linkerd debug container or ksniff.

@epinzur

epinzur commented Jan 15, 2021

I haven't seen the issue since I added the linkerd debug sidecar container, and removing it didn't bring the issue back either. I'll try to get more logs the next time the issue comes up.

@olix0r
Member

olix0r commented Jan 15, 2021

We've merged linkerd/linkerd2-proxy#847 and linkerd/linkerd2-proxy#848 to main to help improve failfast diagnostics. I've built and pushed a container image for the proxy that includes these changes. It can be used with control planes from 2.9.x or recent edge releases.

spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
        config.linkerd.io/proxy-version: main.79c95af8

@epinzur

epinzur commented Jan 16, 2021

@olix0r I tried using the new build, and my services are failing on startup. Logs are in this gist.

I included both the logs from the service container itself and the linkerd-proxy container, for two of my services.

egret is a Go service that is failing to connect (on port 5432) to an RDS database on startup. marlin is a Scala service that is failing to connect (on port 9042) to a Cassandra database inside the Kubernetes cluster on startup. When operational, egret is a client of the gRPC server marlin.

Both of these services worked "most of the time" with the 2.9.0, 2.9.1, and 2.9.2 builds, except that the proxy would occasionally go into the "max-concurrency exhausted" state. With this test build, I can't get the services to start successfully.

Please let me know if you need any additional info.

@olix0r
Member

olix0r commented Jan 16, 2021

Thanks for the helpful data! My observations:

egret

inbound

All of the inbound traffic to egret appears to be prometheus scrapes on port 9100; and it appears that all of these requests fail due to Connection refused errors.

To test this, I'd kubectl exec into the pod and try to curl -v localhost:9100/metrics. This should be, effectively, exactly what the proxy is doing. Does it work?

If not, is it possible that the metrics server doesn't start listening until some initialization is complete? Perhaps it doesn't serve until the database connection is established, for instance?

outbound

On the outbound side we see the application trying to connect to two different types of endpoints:

port 5432 (rds)

These connections are hitting a protocol detection timeout:

[    11.422586s]  INFO ThreadId(01) outbound:accept{peer.addr=10.110.91.24:37066 target.addr=10.110.50.173:5432}: linkerd_app_core::serve: Connection closed error=Protocol detection timeout after: 8B after 10.001269961s

Because the proxy transparently detects HTTP traffic, it may fail to allow connections for some applications. See the docs for more information. The most reliable way to fix this is to simply disable the proxy on these connections by setting a workload annotation like:

config.linkerd.io/skip-outbound-ports: "5432"

This will configure iptables to bypass the proxy on these connections.

port 8200

It also looks like your application is establishing non-HTTP connections on port 8200. We are able to establish connections on these endpoints, though it appears that one of the peers is closing the connection fairly quickly. I don't have enough information to know whether this is by design or whether this indicates some problem:

[     0.957520s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.110.91.24:36794 target.addr=10.110.87.72:8200}: linkerd_detect: Detected protocol=None elapsed=13.706µs
[     0.957535s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.110.91.24:36794 target.addr=10.110.87.72:8200}: linkerd_cache: Caching new service target=(None, Logical { orig_dst: 10.110.87.72:8200, protocol: (), profile: Some(..) })
[     0.957612s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.110.91.24:36794 target.addr=10.110.87.72:8200}:tcp: linkerd_tls::client: Peer does not support TLS reason=not_provided_by_service_discovery
[     0.957624s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.110.91.24:36794 target.addr=10.110.87.72:8200}:tcp: linkerd_proxy_transport::connect: Connecting peer.addr=10.110.87.72:8200
[     0.958029s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.110.91.24:36794 target.addr=10.110.87.72:8200}:tcp: linkerd_proxy_transport::connect: Connected local.addr=10.110.91.24:36798 keepalive=Some(10s)
[     0.958057s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.110.91.24:36794 target.addr=10.110.87.72:8200}:tcp: linkerd_proxy_transport::metrics: client connection open
[     1.352683s]  INFO ThreadId(01) outbound:accept{peer.addr=10.110.91.24:36794 target.addr=10.110.87.72:8200}: linkerd_app_core::serve: Connection closed error=Transport endpoint is not connected (os error 107)

ksniff data would be useful for digging into which peer is shutting down the connection. We can probably also improve the proxy's error reporting in this regard...

marlin

The marlin logs are not at the debug level, so we don't have as much insight into what's going on here. It would probably be helpful to increase the log level via the workload annotation config.linkerd.io/proxy-log-level: linkerd=trace,info. Trace logging should give us a lot more data, but the first few seconds of information are going to be all we really need to see.

Here's what we see, though:

inbound

Like the egret pod, we see Connection refused errors on localhost. Does this application bind its listeners immediately, or is this blocked on some initialization logic? For instance, might it not serve requests until the cassandra client is established?

port 8558 (HTTP healthcheck)

We see more Connection refused errors, immediately, on port 8558:

[     0.487099s]  WARN ThreadId(01) inbound:accept{peer.addr=10.110.16.96:34556 target.addr=10.110.84.220:8558}: linkerd_app_core::errors: Failed to proxy request: error trying to connect: Connection refused (os error 111)
[     0.599162s]  WARN ThreadId(01) inbound:accept{peer.addr=10.110.16.96:34560 target.addr=10.110.84.220:8558}: linkerd_app_core::errors: Failed to proxy request: error trying to connect: Connection refused (os error 111)

If you kubectl exec to the pod and run curl -v 127.0.0.1:8558/ready, what happens?

port 50051 (gRPC healthcheck)

The log level is hiding more details about the errors, but it's probably safe to assume that they are the same Connection refused errors:

[     1.955723s]  WARN ThreadId(01) inbound:accept{peer.addr=10.110.16.96:33340 target.addr=10.110.84.220:50051}: linkerd_timeout::failfast: Logical entering failfast after 1s
[     1.956015s]  WARN ThreadId(01) inbound:accept{peer.addr=10.110.16.96:33340 target.addr=10.110.84.220:50051}: linkerd_app_core::errors: Failed to proxy request: Logical service in fail-fast

I'd be curious to see this pod's logs with trace logging enabled. But, I think if we can at least determine whether the above curl works, we can start narrowing down on why we're unable to establish these connections.

outbound

port 9042 (cassandra)

Again, this error indicates that one of the peers is disconnecting. Without more detailed logging, it's hard to make a guess at what's going on here. We can try to improve the proxy's diagnostics in this regard, but more logging would help clarify things.

One thing that stands out is that these errors are fairly sporadic:

[     8.511302s]  INFO ThreadId(01) outbound:accept{peer.addr=10.110.84.220:48354 target.addr=10.110.63.53:9042}: linkerd_app_core::serve: Connection closed error=Transport endpoint is not connected (os error 107)
[     9.013378s]  INFO ThreadId(01) outbound:accept{peer.addr=10.110.84.220:32956 target.addr=10.110.28.150:9042}: linkerd_app_core::serve: Connection closed error=Transport endpoint is not connected (os error 107)
[   121.940884s]  INFO ThreadId(01) outbound:accept{peer.addr=10.110.84.220:49986 target.addr=10.110.63.53:9042}: linkerd_app_core::serve: Connection closed error=Transport endpoint is not connected (os error 107)
[   242.072544s]  INFO ThreadId(01) outbound:accept{peer.addr=10.110.84.220:51320 target.addr=10.110.81.195:9042}: linkerd_app_core::serve: Connection closed error=Transport endpoint is not connected (os error 107)
[   361.307869s]  INFO ThreadId(01) outbound:accept{peer.addr=10.110.84.220:52324 target.addr=10.110.56.17:9042}: linkerd_app_core::serve: Connection closed error=Transport endpoint is not connected (os error 107)
[   361.816767s]  INFO ThreadId(01) outbound:accept{peer.addr=10.110.84.220:59366 target.addr=10.110.20.223:9042}: linkerd_app_core::serve: Connection closed error=Transport endpoint is not connected (os error 107)
[   481.308787s]  INFO ThreadId(01) outbound:accept{peer.addr=10.110.84.220:54072 target.addr=10.110.56.17:9042}: linkerd_app_core::serve: Connection closed error=Transport endpoint is not connected (os error 107)
[   481.827436s]  INFO ThreadId(01) outbound:accept{peer.addr=10.110.84.220:32884 target.addr=10.110.20.223:9042}: linkerd_app_core::serve: Connection closed error=Transport endpoint is not connected (os error 107)
[   482.380836s]  INFO ThreadId(01) outbound:accept{peer.addr=10.110.84.220:36812 target.addr=10.110.95.48:9042}: linkerd_app_core::serve: Connection closed error=Transport endpoint is not connected (os error 107)

With the most recent edge releases, there's another thing you can try here; though I don't have a strong signal about whether it would help.

You can set the config.linkerd.io/opaque-ports: "9042" annotation on the cassandra workload. This will cause proxies that communicate with it to bypass protocol detection completely on these connections. We could also try the skip-outbound-ports approach on this pod to take the proxy out of the equation entirely when communicating with Cassandra.

To summarize

  • Skip the RDS port on the egret pod.
  • Try curling the HTTP ready/metrics endpoints from within the pod. This can help tell us whether the pod/application is in a state where it's actually refusing connections, or whether this behavior is specific to the proxy.
  • Consider marking cassandra as opaque or skipping it on the marlin pod.
  • Enable trace-level logging if we need to get more info from the proxy.

@olix0r
Member

olix0r commented Jan 16, 2021

I've opened linkerd/linkerd2-proxy#852 to help give us more information about which peer is failing when we see errors like Transport endpoint is not connected.

@epinzur

epinzur commented Jan 16, 2021

I've verified that egret doesn't start listening on any ports until after verifying the database connection.

Instructing the linkerd-proxy to skip outbound port 5432 allows the service to start up successfully.

I might not have a chance to look at marlin until tomorrow.

@epinzur

epinzur commented Jan 16, 2021

I found some time to play with marlin.

I updated its linkerd-proxy to skip outbound port 9042 for Cassandra, and also updated the linkerd-proxy on Cassandra to skip inbound on 9042. Now marlin is starting successfully.

Is there any documentation with some general guidance on which ports should skip the proxy?

Initially, I thought it made sense to have the proxy handle all traffic to gain the benefits of automatic mTLS and span-id injection for tracing. But now I'm realizing that a lot of traffic (like database traffic) already has built-in TLS and doesn't use HTTP/HTTP2, so it isn't compatible with span-id injection.

Should linkerd-proxy connections be limited to HTTP/HTTP2 traffic?

@olix0r
Member

olix0r commented Jan 17, 2021

Is there any documentation with some general guidance on which ports should skip the proxy?

https://linkerd.io/2/features/protocol-detection/ -- we ship with a standard set of skip ports to get many common cases; but clearly we don't catch all of them.

Initially, I thought it made sense to have the proxy handle all traffic to gain the benefits of automatic mTLS and span-id injection for tracing. But now I'm realizing that a lot of traffic (like database traffic) already has built-in TLS and doesn't use HTTP/HTTP2, so it isn't compatible with span-id injection.

Correct, tracing doesn't really work with arbitrary TCP protocols.

As for mTLS, the opaque ports feature I mentioned is intended to allow the proxy to transport arbitrary protocols over mTLS. This currently only works with resources that are running in your cluster (so if you're using RDS from your cloud provider, that won't work here). Though, it could potentially work for your cassandra use case.

Should linkerd-proxy connections be limited to HTTP/HTTP2 traffic?

Linkerd adds the most value for HTTP. However, in 2.9 we started supporting more features (mTLS, traffic split, etc) for non-HTTP traffic. In 2.10, we'll introduce the new opaque ports feature to broaden this support to non-detectable protocols; and we'll also start supporting non-HTTP traffic in multicluster configurations.

@epinzur

epinzur commented Jan 18, 2021

Thanks @olix0r! I'm beginning to think our issues with "proxy max-concurrency exhausted" are just due to us configuring Linkerd incorrectly.

I'm pushing out your suggested changes to more of our test environments today. I'll let you know if I encounter any additional issues.

@epinzur

epinzur commented Jan 18, 2021

@olix0r we are still running into fail-fast issues on marlin, and we are currently only proxying gRPC calls. I think the issue is that our service doesn't respond to requests until its startup sequence is complete.

Is there some way we can configure linkerd to not drop into fail-fast in this scenario? Like, wait some time before giving up after container startup?

olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Jan 19, 2021
When a Service is in failfast, the inner service is only polled as new
requests are processed. This means it's theoretically possible for
certain service tasks to be starved.

This change ensures that these layers are paired with a `SpawnReady`
layer to ensure that the inner service is always driven to readiness.

This could potentially explain behavior as described in
linkerd/linkerd2#5183; though we don't have strong evidence to support
that. This seems like a healthy defensive measure, in any case.

This change also improves stack commentary to favor larger descriptive
comments over layer-level annotations.

While auditing services for readiness, an unnecessary buffer has been
removed from the ingress HTTP stack.
olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Jan 19, 2021
@olix0r
Member

olix0r commented Jan 19, 2021

@rltvty A few questions:

Does this only occur at startup or does this also occur after the process has successfully processed requests?

Does the proxy ever recover (if given enough time) or does it stay stuck this way after the process has initialized? You could check this by execing into the pod and looking at ss -an |grep LISTEN | grep $port (or netstat -an if ss is not available).

If the proxy gets "stuck" in this state, it would be awesome if you can test this proxy version:

config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
config.linkerd.io/proxy-version: spawn-failfast.5257b3e1

This may not fix the behavior you're seeing, but I found a place where, at least theoretically, a service could get stuck in failfast on the inbound side.

How would this situation differ when Linkerd isn't present? I assume that the client would just get a TCP Connection Refused error and try reconnecting?

How does the client connect to marlin? Is it connecting to a ClusterIP Service? Or does the application connect to individual pods?

Does marlin have readiness probes configured? Generally, I would not expect pods to be discoverable until the readiness probe succeeds. If they're not present, I strongly recommend configuring them; with or without Linkerd, it's a generally important feature of Kubernetes.

@RagingPuppies

Hi all,
I would like to contribute some information from my side about gRPC + Linkerd + fail-fast and the proxy exhaustion exception with .NET Core.

Infra
[Console app written in .NET Core] -> uses 1 connection with 1 concurrent gRPC stream -> [HAProxy with H2 enabled, with 3 worker nodes as backends on port 443 (ingress)] -> NGINX ingress controller with Linkerd injected -> Kubernetes Service (ClusterIP) -> .NET Core server (pod) with Linkerd injected.

I have reproduced the issue in 2 scenarios so far:

  1. Redeploying a pod, which results in:
    ingress linkerd error:
    [ 70370.592490s] WARN ThreadId(01) outbound:accept{peer.addr=10.42.44.149:48502 target.addr=10.42.217.47:5000}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
    [ 70374.308413s] INFO ThreadId(01) outbound: linkerd2_proxy_api_resolve::resolve: No endpoints

    pod error:
    [ 78.453426s] INFO ThreadId(01) inbound:accept{peer.addr=10.42.217.15:36800 target.addr=10.42.217.24:5000}: linkerd2_app_core::serve: Connection closed error=connection error: Broken pipe (os error 32)

    client error:
    Error Grpc.Core.RpcException: Status(StatusCode=Unavailable, Detail="proxy max-concurrency exhausted")

  2. Having 20 connections with 400 concurrent streams configured on the client.

I will extract logs later today; the fail-fast usually happens on the Linkerd server side.

@olix0r
Member

olix0r commented Jan 19, 2021

@RagingPuppies

Your error includes:

[ 70374.308413s] INFO ThreadId(01) outbound: linkerd2_proxy_api_resolve::resolve: No endpoints

Are you sure there are actually endpoints in the service at the time when the error is logged?

If you are running a control plane from version stable-2.9.x or a recent edge release, it may be helpful to test the above-mentioned proxy version by setting annotations like:

config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
config.linkerd.io/proxy-version: spawn-failfast.5257b3e1
config.linkerd.io/proxy-log-level: linkerd=debug,warn

If none of this helps, it would probably be best to open a new issue so we get into more of the details of your specific issue.

@RagingPuppies

@olix0r I had 2 pods in Ready status under that deployment, so that doesn't make much sense.

olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Jan 19, 2021
When a Service is in failfast, the inner service is only polled as new
requests are processed. This means it's theoretically possible for
certain service tasks to be starved.

This change ensures that these layers are paired with a `SpawnReady`
layer to ensure that the inner service is always driven to readiness.

This could potentially explain behavior as described in
linkerd/linkerd2#5183; though we don't have strong evidence to support
that. This seems like a healthy defensive measure, in any case.

This change also improves stack commentary to favor larger descriptive
comments over layer-level annotations.

While auditing services for readiness, an unnecessary buffer has been
removed from the ingress HTTP stack.

Finally, this change updates the inbound connect timeout to 300ms to
accommodate especially slow endpoints.
@epinzur

epinzur commented Jan 19, 2021

@olix0r Now that we are configuring linkerd-proxy properly, the issue seems to primarily happen during the launch of a new cluster of pods.

Initializing and getting to a quorum state on a new cluster takes about 3 minutes. After reaching quorum the first time, a single pod takes about 45 seconds to restart and rejoin the cluster successfully.

Yesterday in this scenario, I was seeing that the linkerd-proxy started blocking requests to the pods on the gRPC port (with fail-fast) shortly after launching the pods. I waited about 30 minutes to see if the proxy would start re-allowing requests to the pods, but it never happened.

Then I rolled each of the pods in the cluster individually, and the proxy did not block requests on gRPC after the pods restarted.

Note that we are currently only using linkerd to proxy the gRPC port that is served by marlin. All other traffic to/from the service is currently being skipped via annotations.

If Linkerd weren't present, these requests would still fail with connection-refused or connection-timeout errors; however, once the service starts successfully, requests would then succeed. With the linkerd-proxy, the requests seem to fail indefinitely. This is the issue: Linkerd never seems to let requests back through to the service to re-check whether it should still be blocking them.

@epinzur

epinzur commented Jan 19, 2021

oh, I'll try the new build too. thanks!

@olix0r
Member

olix0r commented Jan 19, 2021

@rltvty Thanks for sharing. Let us know how the new build works. If you're still seeing problems, I think we have enough information to try to reproduce this in a test configuration.

@epinzur

epinzur commented Jan 20, 2021

@olix0r I don't see much improvement with the new build. Our pods are still not recovering on startup.

I also noticed another thing... The marlin pods communicate with each other using UDP on port 25520. When the linkerd-proxy is disabled, this works without issue. But when the proxy is enabled, the communication fails even if we tell the proxy to skip port 25520 in both directions.

I can see the UDP traffic in the linkerd-debug sidecar, but I don't see any logs for it in the linkerd-proxy, even with debug & warn enabled.

Any ideas on how to resolve this?

@olix0r
Member

olix0r commented Jan 20, 2021

I don't see much improvement with the new build. Our pods are still not recovering on startup.

Bummer. We'll take a deeper look at trying to reproduce this. (Specifically, a pod that receives requests before the application has initialized).

I also noticed another thing... The marlin pods communicate with each other using UDP on port 25520. When the linkerd-proxy is disabled, this works without issue. But when the proxy is enabled, the communication fails even if we tell the proxy to skip port 25520 in both directions.

Linkerd shouldn't touch UDP at all right now. It makes sense that you wouldn't see anything in the proxy's logs because these packets should never hit the proxy. This could potentially point to some issue with our iptables initialization, but I'd be pretty surprised if that was the case. If you can't figure this out, it's probably best to open a new issue focused on that distinct problem.

@olix0r olix0r added this to the stable-2.10 milestone Jan 21, 2021
@olix0r
Member

olix0r commented Jan 25, 2021

We continue to work on diagnosing and reproducing this issue. We've made some changes to the inbound stack that may better highlight what's going on. I'd recommend testing with the following workload annotations:

config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
config.linkerd.io/proxy-version: main.b8c2aa71
config.linkerd.io/proxy-log-level: linkerd_stack::switch_ready=debug,linkerd_reconnect=debug,linkerd_transport::connect=debug,linkerd=info,warn

attn: @rltvty

@epinzur

epinzur commented Jan 29, 2021

Sorry that I have been unable to test this as of yet. Maybe next week.

@wmorgan
Member

wmorgan commented Apr 28, 2021

Is this in the class of issues that was fixed in 2.10?

@olix0r
Member

olix0r commented Apr 30, 2021

I'm going to close this issue, as the proxy can no longer return this error. There are hopefully more descriptive errors and diagnostics now that we can use to debug this class of timeout. Please feel free to open new issues if undiagnosable failfast errors are encountered.

@olix0r olix0r closed this as completed Apr 30, 2021