
FilterMisdirectedRequests filter causes segfault in envoy #2662

Closed
brenix opened this issue Jul 7, 2020 · 14 comments
Labels
area/deployment: Issues or PRs related to deployment tooling or infrastructure.
kind/regression: Categorizes issue or PR as related to a regression from a prior release.

Comments

brenix commented Jul 7, 2020

What steps did you take and what happened:

  • As part of upgrading from contour-1.4.0/envoy-1.14.1 to contour-1.6.1/envoy-1.14.3 in one of our clusters, the envoy instances started segfaulting after 1-2 minutes, which left them in a CrashLoopBackOff state

    [2020-07-07 17:03:48.670][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:104] Caught Segmentation fault, suspect faulting address 0x8
    [2020-07-07 17:03:48.670][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
    [2020-07-07 17:03:48.670][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:92] Envoy version: 8fed4856a7cfe79cf60aa3682eff3ae55b231e49/1.14.3/Clean/RELEASE/BoringSSL
    [2020-07-07 17:03:48.672][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #0: __restore_rt [0x7fa180938390]
    [2020-07-07 17:03:48.672][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #1: [0x55daefe97c67]
    [2020-07-07 17:03:48.672][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #2: [0x55daefe2dd57]
    [2020-07-07 17:03:48.672][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #3: [0x55daefe2e5dc]
    [2020-07-07 17:03:48.672][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #4: [0x55daf0632658]
    [2020-07-07 17:03:48.672][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #5: [0x55daf0633888]
    [2020-07-07 17:03:48.672][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #6: [0x55daf06ac706]
    [2020-07-07 17:03:48.673][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #7: [0x55daf0aed956]
    [2020-07-07 17:03:48.673][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #8: [0x55daf0aec4de]
    [2020-07-07 17:03:48.673][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #9: [0x55daf06a21e4]
    [2020-07-07 17:03:48.673][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #10: [0x55daf0b90833]
    [2020-07-07 17:03:48.673][34][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #11: start_thread [0x7fa18092e6ba]
    
  • Rolling back envoy to the previous release (1.14.1) did not resolve the segfault

  • Rolling back contour to the previous release (1.4.0) did resolve it; the segfault no longer occurred

  • Upgrading contour from 1.4.0 to 1.5.1 with envoy 1.14.3 also resulted in a segfault

Per a Slack conversation, I built a custom contour (release-1.6 branch) container with the `AddFilter(envoy.FilterMisdirectedRequests(vh.VirtualHost.Name))` line removed and have been running it successfully without seeing any segfault.

It appears that the workaround introduced in #2483 causes the segfault we are seeing. The cluster that hits these issues has several hundred certificates, including a few wildcards, so I'm not sure what the best solution is here.
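For context, the filter added by #2483 is a per-virtual-host Lua script that rejects requests whose :authority header does not match the vhost's FQDN with a 421 Misdirected Request. A rough Go sketch of a generator for such a script follows; this is an approximation for illustration, and the actual Contour source may differ in the details:

```go
package main

import "fmt"

// misdirectedRequestsLua approximates the Lua script that Contour's
// envoy.FilterMisdirectedRequests attaches to each virtual host (per #2483):
// reject any request whose :authority does not match the vhost FQDN with a
// 421 Misdirected Request. The exact script in Contour may differ.
func misdirectedRequestsLua(fqdn string) string {
	return fmt.Sprintf(`
function envoy_on_request(request_handle)
    local host = request_handle:headers():get(":authority")
    -- strip any :port suffix before comparing against the FQDN
    local i = string.find(host, ":", 1, true)
    if i ~= nil then
        host = string.sub(host, 1, i-1)
    end
    if host ~= %q then
        request_handle:respond({[":status"] = "421"}, "misdirected request")
    end
end`, fqdn)
}

func main() {
	fmt.Println(misdirectedRequestsLua("example.com"))
}
```

Since Contour emits one of these filters per vhost, and Envoy's Lua filter keeps one Lua state per filter per worker thread, the number of Lua states grows with both the vhost count and the --concurrency setting.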

Additional details/counts regarding this cluster:

certificates: ~400 issued through letsencrypt
httpproxy resources: ~700
ingress resources: 21
request rate: ~80 per minute

What did you expect to happen:

Upgrading from Contour 1.4.0 to 1.6.1 (following the upgrade docs) should continue to work as expected

Environment:

  • Contour version: 1.6.1
  • Envoy version: 1.14.3
  • Kubernetes version: (use kubectl version): 1.18.5
  • Kubernetes installer & version: kubeadm
  • Cloud provider or hardware configuration: AWS
brenix changed the title from "FilterMisdirectedRequests filter causes segfault in envoy (contour-1.5+)" to "FilterMisdirectedRequests filter causes segfault in envoy" Jul 7, 2020

brenix commented Jul 7, 2020

Adding a more detailed backtrace using the envoy-alpine-debug-dev container:

[2020-07-07 23:24:58.471][38][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:104] Caught Segmentation fault, suspect faulting address 0x8
[2020-07-07 23:24:58.471][38][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[2020-07-07 23:24:58.471][38][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:92] Envoy version: 8fed4856a7cfe79cf60aa3682eff3ae55b231e49/1.14.3/Clean/RELEASE/BoringSSL
[2020-07-07 23:24:58.473][38][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #0: [0x7f289191d3d0]
[2020-07-07 23:24:58.481][38][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #1: luaL_openlibs [0x5611320c7c67]
[2020-07-07 23:24:58.490][38][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #2: Envoy::Extensions::Filters::Common::Lua::ThreadLocalState::LuaThreadLocal::LuaThreadLocal() [0x56113205dd57]
[2020-07-07 23:24:58.498][38][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #3: std::__1::__function::__func<>::operator()() [0x56113205e5dc]
[2020-07-07 23:24:58.506][38][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #4: std::__1::__function::__func<>::operator()() [0x561132862658]
[2020-07-07 23:24:58.514][38][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #5: std::__1::__function::__func<>::operator()() [0x561132863888]
[2020-07-07 23:24:58.525][38][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #6: Envoy::Event::DispatcherImpl::runPostCallbacks() [0x5611328dc706]
[2020-07-07 23:24:58.536][38][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #7: event_process_active_single_queue [0x561132d1d956]


jpeach commented Jul 8, 2020

Looks like it failed to allocate a new Lua state; filed envoyproxy/envoy#11948. Faulting address 0x8 looks a lot like a NULL pointer dereference. Not able to reproduce locally (so far).

stevesloka added the area/deployment and kind/regression labels Jul 8, 2020

jpeach commented Jul 8, 2020

I've run Contour with up to 3k HTTPProxies with no Envoy crashes in both kind and GCP clusters.


jpeach commented Jul 9, 2020

I can reproduce on GKE with 3K HTTPProxies and the Envoy concurrency set to 40. Setting vm.max_map_count = 1966080 has no effect; the map count for the envoy process tops out around 2.7K, so tweaking max_map_count is unlikely to help.


jpeach commented Jul 9, 2020

> I can reproduce on GKE with 3K HTTPProxies and the Envoy concurrency set to 40. Setting vm.max_map_count = 1966080 has no effect; the map count for the envoy process tops out around 2.7K, so tweaking max_map_count is unlikely to help.

This was reproduced on an n1-standard-2 instance. If I switch to n1-highmem-8, I can no longer reproduce; Envoy sits at around 8G RSS in this configuration.
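Back-of-the-envelope on why the instance size matters: with one Lua filter per vhost and one lua_State per filter per worker thread, the state count multiplies quickly. A rough Go calculation; the per-state footprint below is an assumed figure, chosen only to show that the order of magnitude lines up with the ~8G RSS observed:

```go
package main

import "fmt"

func main() {
	const (
		proxies    = 3000 // HTTPProxies in the GKE reproduction
		workers    = 40   // envoy --concurrency
		perStateKB = 64   // assumed average lua_State footprint (illustrative)
	)
	states := proxies * workers
	fmt.Printf("lua states: %d\n", states)                              // 120000
	fmt.Printf("approx mem: %.2f GB\n", float64(states*perStateKB)/1e6) // ~7.68 GB
}
```

An n1-standard-2 has 7.5G of memory, so an allocation failure in luaL_newstate at roughly this scale is plausible; an n1-highmem-8 (52G) has plenty of headroom.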


jpeach commented Jul 9, 2020

> Adding a more detailed backtrace using the envoy-alpine-debug-dev container: [backtrace quoted above]

This seems really similar to envoyproxy/envoy#10865, but I have not found any reason to believe that issue isn't fully fixed. The proximate cause is `luaL_newstate` failing, which usually indicates a memory exhaustion condition.


brenix commented Jul 9, 2020

I did some additional testing as well and do not see the issue when setting a lower --concurrency value for envoy. I also confirmed that increasing vm.max_map_count makes no difference.

The hosts we run these on have 16 vCPUs and plenty of RAM. When envoy segfaults, other containers on the node are unaffected.


jpeach commented Jul 10, 2020

I hacked up a test harness that creates a configurable number of Lua filters in Envoy. I was unable to reproduce the crash on my Fedora dev machine (32G Intel NUC) even with ~400K Lua states (10000 HTTP Connection Managers * 40 threads).
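For reference, a sketch along the lines of that harness: a small Go generator that emits an Envoy static config with N HTTP Connection Managers, each carrying a no-op inline Lua filter, so envoy --concurrency M ends up with roughly N*M Lua states. This is an illustration rather than the actual harness; the type URLs target the v2 API used by Envoy 1.14 and may need adjusting:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// One listener per HTTP Connection Manager, each with a no-op inline Lua
// filter. Every Lua filter gets its own lua_State on every worker thread.
const listenerTmpl = `  - name: listener_%d
    address:
      socket_address: { address: 127.0.0.1, port_value: %d }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager
          stat_prefix: hcm_%d
          route_config:
            virtual_hosts:
            - name: vh_%d
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                direct_response: { status: 200 }
          http_filters:
          - name: envoy.filters.http.lua
            typed_config:
              "@type": type.googleapis.com/envoy.config.filter.http.lua.v2.Lua
              inline_code: |
                function envoy_on_request(handle)
                end
          - name: envoy.filters.http.router
`

func main() {
	n := 100 // number of HCMs; override with the first argument
	if len(os.Args) > 1 {
		if v, err := strconv.Atoi(os.Args[1]); err == nil {
			n = v
		}
	}
	fmt.Println("static_resources:")
	fmt.Println("  listeners:")
	for i := 0; i < n; i++ {
		fmt.Printf(listenerTmpl, i, 10000+i, i, i)
	}
}
```

Something like `go run gen.go 10000 > envoy.yaml && envoy -c envoy.yaml --concurrency 40` (file name hypothetical) would then exercise ~400K Lua states.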


brenix commented Jul 10, 2020

Is it possible the upstream builds for envoy use moonjit instead of LuaJIT? It looks like `XCFLAGS+= -DLUAJIT_ENABLE_GC64` was not added to the moonjit patch in one of the upstream PRs. I've tried digging through some of the docker builds but can't find anything specific.


jpeach commented Jul 11, 2020

> Is it possible the upstream builds for envoy use moonjit instead of LuaJIT? It looks like `XCFLAGS+= -DLUAJIT_ENABLE_GC64` was not added to the moonjit patch in one of the upstream PRs. I've tried digging through some of the docker builds but can't find anything specific.

IIUC, switching to moonjit requires a Bazel build flag, and I don't see that anywhere in the Envoy build repository.


jpeach commented Jul 13, 2020

I checked the Envoy tags, which I should have done in the first place. Envoy 1.14.4 doesn't have the fix for envoyproxy/envoy#10865, but Envoy 1.15.0 does. I expect that the problem here is resolved by envoyproxy/envoy#10865.


brenix commented Jul 13, 2020

After running with envoy 1.15.0 for 12h+, we no longer see a segfault! I think we'll plan to run contour 1.6.x with envoy 1.15.0 until 1.7 is released.


jpeach commented Jul 13, 2020

xref envoyproxy/envoy#12065

jpeach added a commit to jpeach/contour that referenced this issue Jul 13, 2020
This updates projectcontour#2662.
This updates projectcontour#2673.

Signed-off-by: James Peach <jpeach@vmware.com>
jpeach added a commit that referenced this issue Jul 13, 2020
This updates #2662.
This updates #2673.

Signed-off-by: James Peach <jpeach@vmware.com>
stevesloka commented

I think this is all fixed up so I'm going to close. If that's not the case, please reopen @brenix.
