FilterMisdirectedRequests filter causes segfault in envoy #2662
Adding a more detailed backtrace using the envoy-alpine-debug-dev container:
Looks like it failed to allocate a new Lua state, filed envoyproxy/envoy#11948. Faulting address 0x8 looks a lot like dereferencing NULL. Not able to reproduce locally (so far).
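For anyone unfamiliar with reading fault addresses, here is a tiny, hypothetical Go illustration (not Envoy's actual code, which is C++; the struct and field names are invented) of why a faulting address of 0x8 suggests a NULL pointer dereference: reading a field that sits 8 bytes into a struct through a nil pointer faults at 0x0 + 0x8.

```go
package main

import "fmt"

// luaStateWrapper is a made-up struct standing in for any object that
// keeps a field at offset 8 on a 64-bit platform.
type luaStateWrapper struct {
	refcount uint64 // offset 0
	state    uint64 // offset 8
}

func main() {
	var w *luaStateWrapper // nil, like an allocation failure that went unchecked
	// Reading the offset-8 field through a nil pointer faults at address
	// 0x8; the Go runtime reports "signal SIGSEGV ... addr=0x8".
	fmt.Println(w.state)
}
```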
I've had Contour up to 3k HTTP proxies with no envoy crashes in both kind and GCP clusters.
I can reproduce on GKE with 3K HTTPProxies and the Envoy concurrency set to 40. Setting vm.max_map_count = 1966080 has no effect; the envoy process holds only around 2.7K mappings, so tweaking max_map_count is unlikely to help.
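For anyone wanting to verify the same thing, a minimal sketch (assuming Linux procfs; the program name `mapcount` is mine) that counts a process's memory mappings, which is the number vm.max_map_count caps — each line of /proc/&lt;pid&gt;/maps is one mapping:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: mapcount <pid>")
		os.Exit(1)
	}
	// /proc/<pid>/maps lists one memory mapping per line.
	data, err := os.ReadFile("/proc/" + os.Args[1] + "/maps")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(strings.Count(string(data), "\n"), "mappings")
}
```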
This was reproduced on an n1-standard-2 instance. If I flip to n1-highmem-8, I don't reproduce any more. Envoy has around 8G RSS in this config.
This seems really similar to envoyproxy/envoy#10865, but I have not found any reason to believe that issue isn't fully fixed. Proximate cause is
Did some additional testing as well and do not see the issue when setting a lower concurrency. The hosts we run these on have 16 vCPUs and plenty of RAM. When envoy segfaults, other containers on the node are unaffected.
I hacked up a test harness that creates a configurable number of Lua filters in Envoy. Unable to reproduce the crash on my Fedora dev machine (32G Intel NUC) even with ~400K Lua states (10000 HTTP Connection Managers * 40 threads).
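The harness itself isn't linked here, so the following is only a rough sketch of the idea under my own assumptions (the listener names, port range, no-op Lua script, and direct-response routes are all invented): emit N Envoy listeners, each with its own HTTP connection manager carrying an inline Lua filter, so the total Lua state count is roughly N listeners × worker threads (Envoy's --concurrency).

```go
package main

import "fmt"

// listenerTmpl is one static Envoy v3 listener with an HTTP connection
// manager, an inline no-op Lua filter, and a direct-response route.
const listenerTmpl = `  - name: listener_%[1]d
    address:
      socket_address: { address: 127.0.0.1, port_value: %[2]d }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: hcm_%[1]d
          route_config:
            virtual_hosts:
            - name: vh_%[1]d
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                direct_response: { status: 200 }
          http_filters:
          - name: envoy.filters.http.lua
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
              inline_code: |
                function envoy_on_request(handle)
                end
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
`

func main() {
	const n = 10000 // number of HTTP connection managers to generate
	fmt.Println("static_resources:\n  listeners:")
	for i := 0; i < n; i++ {
		fmt.Printf(listenerTmpl, i, 20000+i)
	}
}
```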
Is it possible the upstream builds for envoy use moonjit vs luajit? It looks like
IIUC, switching to moonjit needs a Bazel build flag and I don't see that anywhere in the Envoy build repository. |
I checked the Envoy tags, which I should have done in the first place. Envoy 1.14.4 doesn't have the fix for envoyproxy/envoy#10865, but Envoy 1.15.0 does. I expect that the problem here is resolved by envoyproxy/envoy#10865. |
After running with envoy 1.15.0 for 12h+, we no longer see a segfault! I think we'll plan to run contour 1.6.x with envoy 1.15.0 until 1.7 is released.
This updates projectcontour#2662. This updates projectcontour#2673. Signed-off-by: James Peach <jpeach@vmware.com>
I think this is all fixed up so I'm going to close. If that's not the case, please reopen @brenix.
What steps did you take and what happened:
As part of upgrading from contour-1.4.0/envoy-1.14.1 to contour-1.6.1/envoy-1.14.3 in one of our clusters, the envoy instances started segfaulting after 1-2 minutes, which left them in a CrashLoopBackOff state
Rolling back envoy to the previous release (1.14.1) did not work
Rolling back contour to the previous release (1.4.0) worked; the segfault no longer occurred
Upgrading contour from 1.4.0 to 1.5.1 with envoy 1.14.3 also resulted in a segfault
Per a Slack conversation, I built a custom contour (release-1.6 branch) container with the AddFilter(envoy.FilterMisdirectedRequests(vh.VirtualHost.Name)) line removed and have been running it successfully without seeing any segfault. It appears that the workaround introduced in #2483 causes the segfault we are seeing. Our cluster which was seeing these issues does have several hundred certificates, including a few wildcards, so I'm not sure what the best solution is here.
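For context on what the removed line wires up: #2483 rejects requests whose :authority does not match the virtual host the filter was installed for, answering with HTTP 421 Misdirected Request (RFC 7540 §9.1.2). The following is only my reconstruction of such a filter using Envoy's public Lua stream API, templated from Go by FQDN — not Contour's actual script, whose details may differ:

```go
package main

import "fmt"

// misdirectedLua is a hypothetical per-virtual-host Lua filter body;
// %q is filled with the virtual host's FQDN.
const misdirectedLua = `
function envoy_on_request(request_handle)
  local host = request_handle:headers():get(":authority")
  -- strip an optional :port suffix before comparing
  local i = string.find(host, ":", 1, true)
  if i then
    host = string.sub(host, 1, i - 1)
  end
  if host ~= %q then
    request_handle:respond({[":status"] = "421"}, "misdirected request")
  end
end
`

// FilterMisdirectedRequestsSketch returns the Lua source for one
// virtual host. The name mirrors Contour's helper but is illustrative.
func FilterMisdirectedRequestsSketch(fqdn string) string {
	return fmt.Sprintf(misdirectedLua, fqdn)
}

func main() {
	fmt.Print(FilterMisdirectedRequestsSketch("example.com"))
}
```

With several hundred certificates, one such Lua filter (and therefore one Lua state per worker thread) gets attached for every TLS virtual host, which is why the state count grows with both certificate count and concurrency.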
Additional details/counts regarding this cluster:
certificates: ~400 issued through letsencrypt
httpproxy resources: ~700
ingress resources: 21
request rate: ~80 per minute
What did you expect to happen:
Upgrading from Contour 1.4.0 to 1.6.1 (following the upgrade docs) should continue to work as expected
Environment:
Kubernetes version (use kubectl version): 1.18.5