"HTTP Logical service in fail-fast" logs in linkerd-proxy #5599
Comments
Thanks @hadican; we've been looking into similar reports of inbound proxies getting stuck in failfast. It would be helpful to see debug logs from this proxy. You can enable debug logging by adding the following annotation to your pod template: It may be the case that our connect timeout is too aggressive and the proxy is unable to establish a connection, or this could point to a more subtle bug... |
Note that the period between
Just in case you copied and pasted what @olix0r wrote directly :) |
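For readers following along, here is a minimal sketch of the kind of pod-template annotation being discussed. The annotation key is Linkerd's standard `config.linkerd.io/proxy-log-level`, but the value shown is an assumed example rather than the exact string from the comment above:

```yaml
# Sketch only: turns on debug logging for the injected proxy.
# Add this to the workload's pod template metadata; the value is a
# generic log directive, not necessarily the one suggested above.
spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/proxy-log-level: "warn,linkerd=debug"
```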
I tried it back then, to see if I could catch something. Here are the logs: Not sure if I caught the correct ones.
|
I've made some changes to the HTTP Logical stack that could potentially address this and, if not, should help us have some clearer diagnostics around this part of the stack. If you can set the following pod annotations and share logs from the first ~30s of the pod's runtime, that may help shed light on the issue:
|
Hi @olix0r, I've tried with your recent build and here are the logs of proxy container from the beginning:
I waited for more than 30 seconds, then 2-3 minutes, but it didn't recover. I'll check again, but I did not see any error other than "HTTP Logical service in fail-fast". I injected with:
Also, one thing I noticed: in the logs the version appears as One more thing: I think this happens right after the whole pod set switches over, but I'm not sure; I couldn't track the logs simultaneously. Maybe after the last node terminated, it does not discover the new pods? 🤔 |
Hi @hadican, just providing an idea to help you troubleshoot:
Thanks. Looking at this, I realize I mis-specified the log configuration. It would be great if you could try again with:
It's curious to me that we don't see any logging from the reconnect module, though we see connection refused errors. It's also interesting that we see attempts at 4s, 6s, and then at 55s -- there's a 50s gap there. What I'm trying to understand from the logs is whether we're trying to connect or whether there's something blocking the reconnect logic from being applied.
No, that's fine. When we build development proxy images, the base image isn't always the latest, but since we're replacing the proxy binary, it's fine. The base image provides the start script that generates a private key, etc., which we use to bootstrap identity. Thanks for your help tracking this down! |
Hi @olix0r, here is the log from startup:
After a while...
It was working fine, then went into fail-fast. These logs are from 1-2 minutes later:
|
Hi @liuerfire, thanks for the tip. We don't use the CNI plugin. However, do you have any other tips? 🤔 |
@hadican Thanks, this is helpful. Just for the sake of ruling out configuration issues: is it possible that your gRPC server is setting a stream concurrency limit? |
Hi @olix0r, I believe it's 10. |
Hi @olix0r, any updates on this issue? 😔 |
@hadican We've recently spent some time trying to reproduce this using demo apps and we've been unable to trigger this behavior. Specifically, we modified our server's container to block for 2 minutes before starting the application. The proxy encounters connection refused errors and goes into failfast. However, once the application starts, we leave the failfast state and process requests as expected. You mention above that you have |
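For anyone trying to reproduce this themselves, here is a rough sketch of the kind of setup described above; the image, binary path, and port are illustrative placeholders, not taken from the demo apps mentioned:

```yaml
# Sketch of the repro setup described above: the application container
# sleeps for two minutes before starting, so the injected proxy sees
# connection-refused errors and enters failfast until the server begins
# listening.
spec:
  template:
    spec:
      containers:
        - name: server
          image: example/grpc-server:latest   # placeholder image
          command: ["sh", "-c", "sleep 120 && exec /server"]
          ports:
            - containerPort: 8080
```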
Coming here from #5183. Now we are seeing the same error. A bit of context on the service in question:
I used the configuration @olix0r suggested here. The proxy image we are using is: I put the logs from the startup of the service, and from after it fully booted up, in this gist. One log message caught my eye, which is: We currently have a version of this service running without any issues, which only has one port that serves Hope this helps. |
This means that the proxy's client for talking to the injected service didn't become ready (i.e. the proxy wasn't able to connect to the service) after 500 ms, and so the proxy switched to eagerly failing requests until the client connection to the service becomes usable. It's emitted here: https://github.com/linkerd/linkerd2-proxy/blob/b8c2aa7168d93eda3ea5f2e3b7fc6a9910cf0c4f/linkerd/stack/src/switch_ready.rs#L127 |
I've been trying to reproduce this using a simple demo client/service, and varying the amount of time the demo service waits before accepting connections, the latency distribution of the demo service, the number of clients, and the gRPC per-connection concurrency limit on the server. So far, I'm still not having any luck — the proxy still comes out of fail-fast as soon as the demo service comes up. I have a feeling there's probably some missing piece involved that my test setup isn't capturing, but I'm not sure what it is. @karsten42 and @hadican, can you think of any other details we might be missing? |
@hadican thanks — that's also useful to know as we try to narrow down what's going on in the proxy. Do you have a rough estimate of the request load on the service? |
@hawkw I played around some more with the setup I described here.
I have also seen this issue when port 8085 wasn't active, but not as reliably as with the setup just described. |
We have a similar problem where two pods that talk to each other over HTTP are in the service mesh. We get intermittent 503s 3-5 minutes after deploying them on the mesh. Here are the logs from the pod initiating the connection:
What is a logical service, and how does the proxy decide to put it in failfast? Also, this only happens when the client is also in the mesh. If we take the client out of the mesh, we see zero 503s. |
@hawkw Hi, we set the "max_concurrent_streams" config on the server side. Each pod handles around 20 RPS. |
Our issue was caused by #5684. Might be worth checking the latency of |
I hit this error while checking the multi-cluster capabilities of Linkerd. I also spotted For me it happened during a test flight of this great demo: https://github.com/olix0r/l2-k3d-multi I made a few runs and investigated a bit with Then I read the docs on exporting services and realized that the annotations used for export were wrong. The demo repo uses the annotation below, from https://github.com/olix0r/l2-k3d-multi/blob/master/base/export_service.yml
while the docs state that it needs a label
and I was able to get the multicluster demo working with traffic splits. 🛰️
Hope this helps with the investigation. |
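For anyone hitting the same annotation-vs-label mismatch, a hedged sketch of the label form the docs describe; the key is recalled from the Linkerd multicluster documentation and may differ across versions, and the service name is a placeholder:

```yaml
# Sketch: exporting a service for multicluster mirroring via a label,
# as described in the docs referenced above.
apiVersion: v1
kind: Service
metadata:
  name: my-service
  labels:
    mirror.linkerd.io/exported: "true"
```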
Thanks @svilensabev -- I'll make sure we update that repo to use the newer labels! |
@hadican I'm going to close this issue due to inactivity for now. If you're still experiencing this, I recommend trying stable-2.10.0 or the latest edge release and re-opening this issue if you still see the problem. |
@adleong I use stable-2.10.0 and I see the same logs on all linkerd-proxy containers
|
@pepsi1k If you continue to run into issues, please open a new issue so that we can start from a fresh description of the problem. Thanks! |
Bug Report
In our system, we have 400+ lightweight pods making gRPC requests to 6 powerful pods (running on preemptible nodes) at 3-4 RPS each. (All 6 pods have an Nvidia GPU.) After 1-2 minutes of a successful deployment and automatic Linkerd proxy injection with

`kubectl -n the_namespace get deploy/the_deployment -o yaml | linkerd inject - | kubectl apply -f -`

we are seeing hundreds of `Failed to proxy request: HTTP Logical service in fail-fast` errors. I checked the source code of the `linkerd-proxy` repository and found the place where this message is printed, but I couldn't understand the error or the message. I mean, what is an `HTTP Logical` service? Below you can find a snippet of the logs. I waited for 10-15 minutes, but the system didn't recover, so I reverted and uninjected the proxy from the pods.

In the `linkerd-proxy` source code, I think there was a comment saying "the inner service is not ready", but readiness/liveness probes were passing. I thought this might be about the dispatch timeout and wanted to increase it, but it isn't configurable (`config.rs`). I thought it might be related to all this high load: I installed in HA mode, and nothing changed; I increased both the CPU and memory proxy requests/limits, and again nothing changed. After all these actions we're still getting fail-fast errors. What is the fail-fast condition? What is an "HTTP Logical" service? 🧐 How can we resolve this?

`linkerd check` output:

Environment

Additional context
Let me know if you need further information.
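For reference, the `linkerd inject` pipeline used above typically works by adding Linkerd's inject annotation to the workload's pod template, which the proxy-injector webhook then acts on when pods are created; a minimal sketch with the other deployment fields elided:

```yaml
# Sketch: the annotation-based equivalent of piping a manifest
# through `linkerd inject`.
spec:
  template:
    metadata:
      annotations:
        linkerd.io/inject: enabled
```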