
FilterMisdirectedRequests filter causes segfault in envoy #2662

Closed
brenix opened this issue Jul 7, 2020 · 14 comments
Labels
area/deployment: Issues or PRs related to deployment tooling or infrastructure.
kind/regression: Categorizes issue or PR as related to a regression from a prior release.

Comments

brenix commented Jul 7, 2020

What steps did you take and what happened:

  • As part of upgrading from contour-1.4.0/envoy-1.14.1 to contour-1.6.1/envoy-1.14.3 in one of our clusters, the envoy instances started segfaulting after 1-2 minutes, which left them in a CrashLoopBackOff state

    [2020-07-07 17:03:48.670][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:104] Caught Segmentation fault, suspect faulting address 0x8
    [2020-07-07 17:03:48.670][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
    [2020-07-07 17:03:48.670][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:92] Envoy version: 8fed4856a7cfe79cf60aa3682eff3ae55b231e49/1.14.3/Clean/RELEASE/BoringSSL
    [2020-07-07 17:03:48.672][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #0: __restore_rt [0x7fa180938390]
    [2020-07-07 17:03:48.672][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #1: [0x55daefe97c67]
    [2020-07-07 17:03:48.672][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #2: [0x55daefe2dd57]
    [2020-07-07 17:03:48.672][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #3: [0x55daefe2e5dc]
    [2020-07-07 17:03:48.672][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #4: [0x55daf0632658]
    [2020-07-07 17:03:48.672][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #5: [0x55daf0633888]
    [2020-07-07 17:03:48.672][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #6: [0x55daf06ac706]
    [2020-07-07 17:03:48.673][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #7: [0x55daf0aed956]
    [2020-07-07 17:03:48.673][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #8: [0x55daf0aec4de]
    [2020-07-07 17:03:48.673][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #9: [0x55daf06a21e4]
    [2020-07-07 17:03:48.673][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #10: [0x55daf0b90833]
    [2020-07-07 17:03:48.673][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #11: start_thread [0x7fa18092e6ba]
    
  • Rolling back envoy to the previous release (1.14.1) did not resolve the segfault

  • Rolling back contour to the previous release (1.4.0) did resolve it; the segfault no longer occurred

  • Upgrading contour from 1.4.0 to 1.5.1 with envoy 1.14.3 also resulted in a segfault

Per a Slack conversation, I built a custom contour (release-1.6 branch) container with the `AddFilter(envoy.FilterMisdirectedRequests(vh.VirtualHost.Name))` line removed and have been running it successfully without seeing any segfault.

It appears that the workaround introduced in #2483 causes the segfault we are seeing. The cluster that hits these issues has several hundred certificates, including a few wildcards, so I'm not sure what the best solution is here.
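For context, the filter added by #2483 is a per-virtual-host Lua script that rejects requests whose :authority header does not match the vhost's FQDN with a 421 Misdirected Request. A rough Go sketch of a generator for such a script follows; this is an approximation for illustration, and the actual Contour source may differ in the details:

```go
package main

import "fmt"

// misdirectedRequestsLua approximates the Lua script that Contour's
// envoy.FilterMisdirectedRequests attaches to each virtual host (per #2483):
// reject any request whose :authority does not match the vhost FQDN with a
// 421 Misdirected Request. The exact script in Contour may differ.
func misdirectedRequestsLua(fqdn string) string {
	return fmt.Sprintf(`
function envoy_on_request(request_handle)
    local host = request_handle:headers():get(":authority")
    -- strip any :port suffix before comparing against the FQDN
    local i = string.find(host, ":", 1, true)
    if i ~= nil then
        host = string.sub(host, 1, i-1)
    end
    if host ~= %q then
        request_handle:respond({[":status"] = "421"}, "misdirected request")
    end
end`, fqdn)
}

func main() {
	fmt.Println(misdirectedRequestsLua("example.com"))
}
```

Since Contour emits one of these filters per vhost, and Envoy's Lua filter keeps one Lua state per filter per worker thread, the number of Lua states grows with both the vhost count and the --concurrency setting.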

Additional details/counts regarding this cluster:

certificates: ~400 issued through letsencrypt
httpproxy resources: ~700
ingress resources: 21
request rate: ~80 per minute

What did you expect to happen:

Upgrading from Contour 1.4.0 to 1.6.1 (following the upgrade docs) should continue to work as expected

Environment:

  • Contour version: 1.6.1
  • Envoy version: 1.14.3
  • Kubernetes version: (use kubectl version): 1.18.5
  • Kubernetes installer & version: kubeadm
  • Cloud provider or hardware configuration: AWS
brenix changed the title from "FilterMisdirectedRequests filter causes segfault in envoy (contour-1.5+)" to "FilterMisdirectedRequests filter causes segfault in envoy" Jul 7, 2020

brenix commented Jul 7, 2020

Adding a more detailed backtrace using the envoy-alpine-debug-dev container:

[2020-07-07 23:24:58.471][38][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:104] Caught Segmentation fault, suspect faulting address 0x8
[2020-07-07 23:24:58.471][38][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[2020-07-07 23:24:58.471][38][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:92] Envoy version: 8fed4856a7cfe79cf60aa3682eff3ae55b231e49/1.14.3/Clean/RELEASE/BoringSSL
[2020-07-07 23:24:58.473][38][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #0: [0x7f289191d3d0]
[2020-07-07 23:24:58.481][38][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #1: luaL_openlibs [0x5611320c7c67]
[2020-07-07 23:24:58.490][38][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #2: Envoy::Extensions::Filters::Common::Lua::ThreadLocalState::LuaThreadLocal::LuaThreadLocal() [0x56113205dd57]
[2020-07-07 23:24:58.498][38][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #3: std::__1::__function::__func<>::operator()() [0x56113205e5dc]
[2020-07-07 23:24:58.506][38][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #4: std::__1::__function::__func<>::operator()() [0x561132862658]
[2020-07-07 23:24:58.514][38][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #5: std::__1::__function::__func<>::operator()() [0x561132863888]
[2020-07-07 23:24:58.525][38][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #6: Envoy::Event::DispatcherImpl::runPostCallbacks() [0x5611328dc706]
[2020-07-07 23:24:58.536][38][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #7: event_process_active_single_queue [0x561132d1d956]


jpeach commented Jul 8, 2020

Looks like it failed to allocate a new Lua state; filed envoyproxy/envoy#11948. Faulting address 0x8 looks a lot like a NULL pointer dereference. Not able to reproduce locally (so far).

stevesloka added the area/deployment and kind/regression labels Jul 8, 2020

jpeach commented Jul 8, 2020

I've run Contour with up to 3k HTTPProxies with no Envoy crashes in both kind and GCP clusters.


jpeach commented Jul 9, 2020

I can reproduce on GKE with 3K HTTPProxies and the Envoy concurrency set to 40. Setting vm.max_map_count = 1966080 has no effect; the map count for the envoy process tops out around 2.7K, so tweaking max_map_count is unlikely to help.


jpeach commented Jul 9, 2020

> I can reproduce on GKE with 3K HTTPProxies and the Envoy concurrency set to 40. Setting vm.max_map_count = 1966080 has no effect; the map count for the envoy process tops out around 2.7K, so tweaking max_map_count is unlikely to help.

This was reproduced on an n1-standard-2 instance. If I switch to n1-highmem-8, I can no longer reproduce; Envoy sits at around 8G RSS in this configuration.
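Back-of-the-envelope on why the instance size matters: with one Lua filter per vhost and one lua_State per filter per worker thread, the state count multiplies quickly. A rough Go calculation; the per-state footprint below is an assumed figure, chosen only to show that the order of magnitude lines up with the ~8G RSS observed:

```go
package main

import "fmt"

func main() {
	const (
		proxies    = 3000 // HTTPProxies in the GKE reproduction
		workers    = 40   // envoy --concurrency
		perStateKB = 64   // assumed average lua_State footprint (illustrative)
	)
	states := proxies * workers
	fmt.Printf("lua states: %d\n", states)                              // 120000
	fmt.Printf("approx mem: %.2f GB\n", float64(states*perStateKB)/1e6) // ~7.68 GB
}
```

An n1-standard-2 has 7.5G of memory, so an allocation failure in luaL_newstate at roughly this scale is plausible; an n1-highmem-8 (52G) has plenty of headroom.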


jpeach commented Jul 9, 2020

> Adding a more detailed backtrace using the envoy-alpine-debug-dev container: [backtrace quoted above]

This seems really similar to envoyproxy/envoy#10865, but I have not found any reason to believe that issue isn't fully fixed. The proximate cause is `luaL_newstate` failing, which usually indicates a memory exhaustion condition.


brenix commented Jul 9, 2020

I did some additional testing as well and do not see the issue when setting a lower --concurrency value for envoy. I also confirmed that increasing vm.max_map_count makes no difference.

The hosts we run these on have 16 vCPUs and plenty of RAM. When envoy segfaults, other containers on the node are unaffected.


jpeach commented Jul 10, 2020

I hacked up a test harness that creates a configurable number of Lua filters in Envoy. I was unable to reproduce the crash on my Fedora dev machine (32G Intel NUC) even with ~400K Lua states (10000 HTTP Connection Managers * 40 threads).
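For reference, a sketch along the lines of that harness: a small Go generator that emits an Envoy static config with N HTTP Connection Managers, each carrying a no-op inline Lua filter, so envoy --concurrency M ends up with roughly N*M Lua states. This is an illustration rather than the actual harness; the type URLs target the v2 API used by Envoy 1.14 and may need adjusting:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// One listener per HTTP Connection Manager, each with a no-op inline Lua
// filter. Every Lua filter gets its own lua_State on every worker thread.
const listenerTmpl = `  - name: listener_%d
    address:
      socket_address: { address: 127.0.0.1, port_value: %d }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager
          stat_prefix: hcm_%d
          route_config:
            virtual_hosts:
            - name: vh_%d
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                direct_response: { status: 200 }
          http_filters:
          - name: envoy.filters.http.lua
            typed_config:
              "@type": type.googleapis.com/envoy.config.filter.http.lua.v2.Lua
              inline_code: |
                function envoy_on_request(handle)
                end
          - name: envoy.filters.http.router
`

func main() {
	n := 100 // number of HCMs; override with the first argument
	if len(os.Args) > 1 {
		if v, err := strconv.Atoi(os.Args[1]); err == nil {
			n = v
		}
	}
	fmt.Println("static_resources:")
	fmt.Println("  listeners:")
	for i := 0; i < n; i++ {
		fmt.Printf(listenerTmpl, i, 10000+i, i, i)
	}
}
```

Something like `go run gen.go 10000 > envoy.yaml && envoy -c envoy.yaml --concurrency 40` (file name hypothetical) would then exercise ~400K Lua states.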


brenix commented Jul 10, 2020

Is it possible the upstream builds for envoy use moonjit instead of LuaJIT? It looks like `XCFLAGS+= -DLUAJIT_ENABLE_GC64` was not added to the moonjit patch in one of the upstream PRs. I've tried digging through some of the docker builds but can't find anything specific.


jpeach commented Jul 11, 2020

> Is it possible the upstream builds for envoy use moonjit instead of LuaJIT? It looks like `XCFLAGS+= -DLUAJIT_ENABLE_GC64` was not added to the moonjit patch in one of the upstream PRs. I've tried digging through some of the docker builds but can't find anything specific.

IIUC, switching to moonjit requires a Bazel build flag, and I don't see that anywhere in the Envoy build repository.


jpeach commented Jul 13, 2020

I checked the Envoy tags, which I should have done in the first place. Envoy 1.14.4 doesn't have the fix for envoyproxy/envoy#10865, but Envoy 1.15.0 does. I expect that the problem here is resolved by envoyproxy/envoy#10865.


brenix commented Jul 13, 2020

After running with envoy 1.15.0 for 12h+, we no longer see a segfault! I think we'll plan to run contour 1.6.x with envoy 1.15.0 until 1.7 is released.


jpeach commented Jul 13, 2020

xref envoyproxy/envoy#12065

jpeach added a commit to jpeach/contour that referenced this issue Jul 13, 2020
This updates projectcontour#2662.
This updates projectcontour#2673.

Signed-off-by: James Peach <jpeach@vmware.com>
jpeach added a commit that referenced this issue Jul 13, 2020
This updates #2662.
This updates #2673.

Signed-off-by: James Peach <jpeach@vmware.com>
stevesloka commented

I think this is all fixed up so I'm going to close. If that's not the case, please reopen @brenix.
