Consul Connect - Envoy Sidecars failing to bootstrap listeners after certain # of services #8627
Comments
Diving in a bit further: We weren't sure if Envoy was having issues talking to Consul through the
Taking a deeper look at stats, I noticed Envoy's initial xds request is stuck in pending:
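For reference, this is roughly how we're pulling those stats; the admin port here (19000) is Envoy's default and an assumption on my part, so adjust for your sidecar:

```sh
# Envoy's admin endpoint exposes xDS progress counters:
curl -s localhost:19000/stats | grep -E 'control_plane|listener_manager.lds'
# control_plane.connected_state: 1        -> the gRPC stream to Consul is up
# listener_manager.lds.update_success: 0  -> no listener config was ever delivered
```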
It seems the bidirectional gRPC stream is being created and connects fine, but Consul never returns any new services and the request hangs.
More networking info from the Envoy sidecar, gathered from the nomad-client host.

netstat dump:

Pretty convinced the bidi stream is hanging. Nomad is establishing connections to Consul over 8502:
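A quick sketch of how to check those connections from the host; 8502 is Consul's default gRPC/xDS port, but your bind may differ:

```sh
# Established connections to/from the local Consul agent's gRPC port:
ss -tnp state established '( dport = :8502 or sport = :8502 )'
# netstat equivalent:
netstat -anp | grep 8502
```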
A note here: we had the Consul agent running via systemd, with:

We changed this to:

Then restarted Consul, and things started picking up listeners again. Wondering if Consul hit its ulimits.
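In case it helps anyone else, here's a minimal sketch (paths and process name assumed) for verifying whether the running agent is actually near its file-descriptor limit, rather than checking the shell's ulimit:

```sh
# Effective limits of the running Consul process, not the login shell:
grep 'open files' "/proc/$(pidof consul)/limits"
# File descriptors currently in use by the agent:
ls "/proc/$(pidof consul)/fd" | wc -l
# The systemd knob for this is LimitNOFILE under [Service] in the unit file.
```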
Update: after fully pulling back the cluster again, new Envoy proxies in new allocations are not gathering listeners again - the same issue. Adjusting ulimits has not solved it.
Any update on this by chance?
We're definitely still seeing this, and a few updates. The sidecars are failing health checks:
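(For completeness, a sketch of how we enumerate the failing checks; assumes the agent HTTP API on the default 8500 and jq installed:)

```sh
curl -s localhost:8500/v1/agent/checks \
  | jq -r 'to_entries[] | select(.value.Status != "passing") | "\(.key): \(.value.Status)"'
```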
If you look at the Envoy sidecar proxy, it's not getting its listeners:
The domain socket for the Consul API is active:
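A sketch of the check we're using here; the socket path below is from our task layout and is an assumption for anyone else:

```sh
# curl can speak HTTP over a unix socket directly:
curl -s --unix-socket /alloc/tmp/consul_http.sock \
  http://localhost/v1/agent/self | jq .Config.Version
```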
The only relevant log we see is:
According to https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/draining, this happens when "Individual listeners are being modified or removed via LDS." We note this happens almost immediately after the container comes up, which means the listeners are never populated, which means the Envoy sidecar likely never fully initializes and will then fail its checks. So for some reason, Envoy is failing to gather the listeners from Consul's API. Any ideas, or things we can try/inspect further?
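One more check we can run ourselves (a sketch; admin port 19000 is Envoy's default and assumed here) is Envoy's lifecycle state, which tells you whether initial configuration ever completed:

```sh
curl -s localhost:19000/server_info | jq .state
# "PRE_INITIALIZING" or "INITIALIZING" would mean the initial xDS fetch never
# finished, which is consistent with empty listeners and immediate draining.
```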
We believe this might have actually been caused by a bootstrap deadlock in Consul: hashicorp/consul#9689 - since applying the patch there, I don't think we've seen any further occurrences of this.
Going to close this; as @chrisboulton said, after hashicorp/consul#9689 we have not seen this recur.
Nomad version
Nomad v0.12.1 (14a6893)
Operating system and Environment details
Running a Nomad cluster on Debian Buster with Consul + Vault, Consul Connect enabled.
Versions:
3-0debian-buster

Issue
We're having an issue where Envoy proxies, after a certain number of them are spun up across a Nomad cluster, become disconnected from Consul and stop pulling services as clusters or setting up the appropriate listeners. (Note: all logs here have redacted names + public IPs. We can send more detailed job/alloc logs via email if needed.)

This effectively causes all traffic across Consul Connect to fail, as all Envoy listeners are cleared out.
This should, from our job file, be:
We proxy http1/gRPC traffic out via those upstreams, and then use service-routers with path prefixes + headers to determine the appropriate routing (an illustrative shape is sketched below). This worked until we hit a sufficiently high number of sidecars.
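For context, the shape of service-router we use looks roughly like this; the service and path names below are invented for illustration, not our actual config:

```sh
cat <<'EOF' > api-router.hcl
Kind = "service-router"
Name = "api"
Routes = [
  {
    Match {
      HTTP {
        PathPrefix = "/v2/"
        Header = [
          { Name = "x-canary", Exact = "1" },
        ]
      }
    }
    Destination {
      Service = "api-canary"
    }
  },
]
EOF
consul config write api-router.hcl
```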
However, the Envoy proxies don't seem to be measuring any failures:
Allocation ID fb594d5d-3d4f-edc0-8d8b-8184b54c71f7 shows two port bindings:

The biggest red flag we see is in syslog, from Consul (note that in reproduction attempts I will see these failures in logs prior to Envoy failing to pick up listeners from Consul):
I do notice that the socket is there:

And that it is connectable (nc -U also works):

Consistently seeing these errors for all proxy sidecars. An Envoy bootstrap file:
Envoy admin clusters endpoint:

The concerning thing in there is the success_rate::-1, but not sure if that's a false flag?
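Worth noting, as I read the Envoy docs: success_rate in /clusters comes from outlier detection, and -1 just means there isn't enough request volume to compute a rate yet, so on mostly idle clusters it's expected rather than an error. A quick way to eyeball how widespread it is (admin port assumed):

```sh
curl -s localhost:19000/clusters | grep -c 'success_rate::-1'
```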
Consul agent config:
Reproduction steps
This has been very difficult to reproduce. Essentially we've seen it start to occur at around 100-200+ sidecars across a Nomad cluster (8+ clients).
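To correlate onset with scale, this is roughly how we count registered sidecars; it assumes the default HTTP API port and Connect's default -sidecar-proxy service naming:

```sh
curl -s localhost:8500/v1/catalog/services \
  | jq -r 'keys[]' | grep -c -- '-sidecar-proxy$'
```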
Job file (if appropriate)
Can be sent via email. Here are some relevant parts:
Connect stanza:
Service network block:
Nomad Client logs (if appropriate)