
Envoy OOM problem with TLS #3592

Closed
thileesf opened this issue Jun 12, 2018 · 19 comments
Labels
question Questions that are neither investigations, bugs, nor enhancements


@thileesf

thileesf commented Jun 12, 2018

Hi, we are running into an unbounded-growth problem with envoy.server.memory_heap_size, which causes the kernel to kill the envoy process. Below are the details:

We had a lightstep-collector deployment in a 4-node cluster (r4.large EC2 instances with 15GB of RAM), each of which also had envoyproxy running on it (not processing any traffic). The collectors directly received system traces (from about 40 servers) on a secure port over grpc. The hosts were running at about 9GB of system memory utilization.

Then we added a TLS listener to the envoyproxy on the same 4 hosts to intercept the traces and route them to the localhost lightstep collector on a plaintext port. The traffic was correctly going via envoy to the collector, but at some point envoy crashed due to running out of memory.

Looking at envoy.server.memory_heap_size, we see it increasing linearly at about 100MB/hr for roughly 6 hours, then growing at a faster rate to reach ~9GB in under 12 hours, at which point the kernel killed envoy due to out of memory (the system and the other processes account for the remaining memory).

Is there a memory leak in envoy, or is there a config I can set to throttle envoy or control its memory buffers? I am not reporting this as a bug because I am not sure whether it is a config issue.

Over the whole period, the CPU utilization on the 4 hosts was fairly low, hovering around 18%.

The relevant envoy config is:

static_resources:
  listeners:
  # Local Application gRPC listener
  - address:
      socket_address:
        address: 0.0.0.0
        port_value: 15149
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        config:
          codec_type: auto
          stat_prefix: grpc
          route_config:
            name: local_grpc_route
            virtual_hosts:
            - name: service
              domains:
              - "*"
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: grpc
          http_filters:
          - name: envoy.router
            config: {}
          access_log:
          - name: envoy.file_access_log
            config:
              path: /var/log/envoy/grpc_access.log
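      # Terminate TLS for the incoming trace connections using the certificate and key below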
      tls_context:
        common_tls_context:
          tls_certificates:
          - certificate_chain:
              filename: /path/to.bundle.pem
            private_key:
              filename: /path/to.key.pem

  clusters:
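  # Local lightstep collector, reached over plaintext on the loopback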
  - name: grpc
    connect_timeout: 0.25s
    type: static
    lb_policy: round_robin
    http2_protocol_options: {}
    hosts:
    - socket_address:
        address: 127.0.0.1
        port_value: 5150
mattklein123 added the question label on Jun 12, 2018
@mattklein123
Member

It's hard to say without knowing where the memory is being used. I would take a look at some stats to see what is growing, and you can also enable heap profiles via tcmalloc: https://gperftools.github.io/gperftools/heapprofile.html
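
For reference, a quick way to watch the relevant gauges is the admin /stats endpoint (a sketch; it assumes the admin interface is enabled and reachable on 127.0.0.1:9901, which is not shown in the config above):

# Assumption: admin interface on 127.0.0.1:9901; adjust to your config.
$ curl -s http://127.0.0.1:9901/stats | grep -E 'memory_heap_size|downstream_cx_active|bytes_buffered'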

thileesf changed the title from "Envoy OOM problem" to "Envoy OOM problem with TLS" on Jun 12, 2018
@thileesf
Author

thileesf commented Jun 12, 2018

Thanks @mattklein123.
Is there a particular stat you suggest I look at?
Most stats I sampled seem normal, except this one:
envoy.listener.0.0.0.0_15149.ssl.ciphers.ECDHE-RSA-AES128-GCM-SHA256:
[screenshot: counter over time]
With the corresponding OOM restarts:
[screenshot: OOM restarts]

@mattklein123
Member

@thileesf the stat you point out is a counter, so it loosely tracks with the rate of incoming connections. It's not a gauge. For stats I would be looking at increasing active connections, increasing buffered data, etc. Those might tell you if there is some type of unbounded growth that could be fixed by config.

Otherwise, it's a memory leak (though I don't know of any open leak issues). For that, I would just start up the tcmalloc heap profiler, take differential heap dumps, and then compare them. It should be fairly obvious what the issue is.
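
A sketch of that workflow, assuming the Envoy binary is built with gperftools/tcmalloc and the gperftools pprof tool is installed (the paths and dump numbers are placeholders):

# Write periodic heap dumps to /tmp/envoy.hprof.NNNN.heap while Envoy runs.
$ HEAPPROFILE=/tmp/envoy.hprof /usr/local/bin/envoy -c /etc/envoy/envoy.yaml

# Later, diff an early dump against a late one to see where the growth is.
$ pprof --text --base=/tmp/envoy.hprof.0001.heap /usr/local/bin/envoy /tmp/envoy.hprof.0010.heap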

@thileesf
Author

@mattklein123 thank you for the pointer. I ran with the heap profiler, and it shows evbuffer_chain_new using a large amount of memory. Is this normal?

Total: 3620.0 MB
  3620.6 100.0% 100.0%   3620.6 100.0% evbuffer_chain_new
     0.5   0.0% 100.0%      0.5   0.0% nghttp2_nv_array_copy
     0.4   0.0% 100.0%      0.7   0.0% Envoy::Http::Http2::ConnectionImpl::StreamImpl::StreamImpl
     0.4   0.0% 100.1%      1.1   0.0% Envoy::Http::Http2::ClientConnectionImpl::newStream
     0.2   0.0% 100.1%      0.2   0.0% evbuffer_new
     0.1   0.0% 100.1%      0.1   0.0% submit_headers_shared_nva (inline)
     0.1   0.0% 100.1%      0.1   0.0% __gnu_cxx::new_allocator::allocate

The full file is attached here:
envoy_profile_differential.txt
envoy_profile_differential.pdf

Note that envoy hasn't crashed yet, but it is getting close to exhausting system memory. I'll update with a newer differential when it gets to OOM.

@mattklein123
Member

@thileesf this looks to me like there is a large amount of buffered data that is not being drained. I think it's unlikely there is a leak here, though it's possible. Did you check the buffered-data stats to see if they account for the memory? Do you have a lot of stalled streams that aren't getting written out somehow, due to missing timeouts or otherwise? cc @alyssawilk @PiotrSikora

@alyssawilk
Contributor

Yeah, that's OOMing really frequently - as Matt says, I'd be surprised if we have a leak that bad that no one else has picked up on. I'd definitely check your connection stats - presumably you added the new listener to handle new traffic flows, so maybe there are just too many connections?

If it is just one direction draining slowly you could make the per-connection buffer limits smaller. #373 will also help, and we finally have someone to work on that who has time to actually pick it up, so there should be progress soon!
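
For reference, the per-connection limits in question are the per_connection_buffer_limit_bytes fields on the listener and on the cluster. A sketch of how they would slot into the config from the issue description (the 64 KiB value is purely illustrative):

listeners:
- address:
    socket_address:
      address: 0.0.0.0
      port_value: 15149
  # Illustrative: cap buffering per downstream connection at 64 KiB.
  per_connection_buffer_limit_bytes: 65536
  # (filter_chains etc. unchanged from the config in the issue description)

clusters:
- name: grpc
  # Illustrative: cap buffering per upstream connection at 64 KiB.
  per_connection_buffer_limit_bytes: 65536
  # (connect_timeout, hosts, etc. unchanged)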

@PiotrSikora
Contributor

I doubt that's it, but seeing as this is TLS & libevent related, could you try reverting #2680? Maybe there is some issue with evbuffer_pullup().

@thileesf
Author

This is the revision we are working with:

$ /usr/local/bin/envoy --version

/usr/local/bin/envoy  version: 4dd49d8809f7aaa580538b3c228dd99a2fae92a4/1.6.0/Clean/DEBUG

@htuch
Member

htuch commented Jun 13, 2018

@eziskind for visibility as well.

@thileesf
Author

@mattklein123 @alyssawilk it does not look like there are a lot of connections. The buffer stats also do not show anything that can explain the multi-GB growth. See the screenshots below (the green line is the envoy being heap profiled). The processes consistently hit OOM in 11-12 hours.

  1. server.envoy_heapsize, to see when we hit OOM (rapid memory use started around 4:45PM UTC; the process ended around 7PM UTC):

[screenshot: server.envoy_heapsize]

  2. server.total_connections during this period: the ~120 connections look reasonable to me.

[screenshot: server.total_connections]

  3. http.grpc.downstream_cx_rx_bytes_buffered during this period. Refer to the initial description where I've attached the envoy config file we use. The ~35MB doesn't reflect the multi-GB growth.

[screenshot: http.grpc.downstream_cx_rx_bytes_buffered]

  4. cluster.grpc.upstream_cx_rx_bytes_buffered. Again, refer to the config file for the grpc cluster definition. This looks a bit fishy because ~150 bytes seems much lower than the incoming ~35MB, but considering we are routing to 127.0.0.1 this may be expected.

[screenshot: cluster.grpc.upstream_cx_rx_bytes_buffered]

  5. cluster.grpc.upstream_rq_timeout shows something interesting, but I am not sure if this is a symptom or a cause.

[screenshot: cluster.grpc.upstream_rq_timeout]

I am also adding a couple of other stats that seem interesting, but these may not be relevant.

  6. cluster.grpc.upstream_flow_control_backed_up_total

[screenshot: cluster.grpc.upstream_flow_control_backed_up_total]

  7. cluster.grpc.http2.tx_reset

[screenshot: cluster.grpc.http2.tx_reset]

@thileesf
Author

@mattklein123 how do I check "if there are a lot of stalled streams that aren't getting written out somehow due to no timeouts" as in your comment?

Let me also see if I can adjust the per-connection buffer limits to be smaller, as @alyssawilk mentioned.

@mattklein123
Member

I have two questions:

  1. You said you had a config that worked before you turned on TLS. Are you absolutely positive that TLS is the only thing that changed? What is the other config that worked? I'm pretty doubtful that TLS is involved here at all.
  2. Can you please provide a couple of complete /stats dumps over time that we can compare once the memory growth starts up?
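
A simple way to capture those (a sketch, assuming the admin interface is reachable on 127.0.0.1:9901) is to snapshot /stats periodically while the memory grows:

# Dump the full stats page every 5 minutes; adjust the admin address to your config.
$ while true; do curl -s http://127.0.0.1:9901/stats > /tmp/envoy_stats_$(date +%s).txt; sleep 300; done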

@thileesf
Author

thileesf commented Jun 14, 2018

  1. No, I said that the traffic was going directly to the collector before, and that Envoy was not intercepting incoming traffic. When I redirected the traffic via envoy, I also turned on TLS, and then envoy started OOM-ing.

[screenshot]

  2. I need to re-run envoy to capture /stats at different points in time. I'll update this issue when I have the data.

@thileesf
Author

thileesf commented Jun 14, 2018

Btw, I see a similar (closed) issue #1684. Did you find the root cause for this? I ask because #1684 also talks about grpc.

Also, in case it is useful, we are running envoy with the hot restarter.

@mattklein123
Member

@thileesf it's highly likely that the upstream (collector) is not keeping up and is backing up, causing a lot of buffering in Envoy. With the full stat dump it should be more clear what needs to be tuned. I really doubt this is an actual leak.

@thileesf
Author

@mattklein123 we are investigating the collector also. I'll update here if/when we find any issues.

In the meantime, here are the envoy /stats you wanted to see, captured over the envoy run time until it OOM-ed:
envoy_stats_oom.tar.gz

@mattklein123
Member

@thileesf how big is each of the requests you are sending to the collector? By my math they are very large. Also, please start running master and provide fresh differential profiles and stat dumps. It's not clear to me what's going on here but I don't want to do further debugging unless you are running master. Thank you!

@mattklein123
Member

Closing, please reopen if this still shows up on master and you have the answers to the above questions.

@mattalberts

mattalberts commented Jul 3, 2018

@mattklein123
I've linked a discussion I started on the contour repository related to the observed memory consumption of envoy when adding TLS routes. It feels related to this issue. I've posted the heap traces reporting memory leaks. If I believe the output of pprof top, I'm seeing GBs of leaks?

Welcome to pprof!  For help, type 'help'.
(pprof) top 
Total: 9982.5 MB
Leak of 697222424 bytes in 831016 objects allocated from:
	@ 0090f485 unknown
	@ 00000000008df612 BUF_memdup ??:0
	@ 000000000091b217 CRYPTO_BUFFER_new ??:0
	@ 00000000008ba473 bssl::x509_to_buffer ssl_x509.cc:0
	@ 00000000008baa39 ssl_use_certificate ssl_x509.cc:0
	@ 0000000000688f0c Envoy::Ssl::ContextImpl::ContextImpl /proc/self/cwd/source/common/ssl/context_impl.cc:144
	@ 0000000000689e06 Envoy::Ssl::ServerContextImpl::ServerContextImpl /proc/self/cwd/source/common/ssl/context_impl.cc:439
	@ 000000000068c6af Envoy::Ssl::ContextManagerImpl::createSslServerContext /proc/self/cwd/source/common/ssl/context_manager_impl.cc:75
	@ 000000000056e9bc Envoy::Ssl::ServerSslSocketFactory::ServerSslSocketFactory /proc/self/cwd/source/common/ssl/ssl_socket.cc:375
	@ 000000000051f83c Envoy::Server::Configuration::DownstreamSslSocketFactory::createTransportSocketFactory /proc/self/cwd/source/server/config/network/ssl_socket.cc:36
	@ 000000000055a02c Envoy::Server::ListenerImpl::ListenerImpl /proc/self/cwd/source/server/listener_manager_impl.cc:203
	@ 000000000055b575 Envoy::Server::ListenerManagerImpl::addOrUpdateListener /proc/self/cwd/source/server/listener_manager_impl.cc:330
	@ 000000000076b0dd Envoy::Server::LdsApi::onConfigUpdate /proc/self/cwd/source/server/lds_api.cc:59
	@ 000000000076cce0 Envoy::Config::GrpcMuxSubscriptionImpl::onConfigUpdate /proc/self/cwd/bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:53
	@ 00000000007722b3 Envoy::Config::GrpcMuxImpl::onReceiveMessage /proc/self/cwd/source/common/config/grpc_mux_impl.cc:174
	@ 000000000076f622 Envoy::Grpc::TypedAsyncStreamCallbacks::onReceiveMessageUntyped /proc/self/cwd/bazel-out/k8-opt/bin/include/envoy/grpc/_virtual_includes/async_client_interface/envoy/grpc/async_client.h:172
	@ 0000000000788c65 Envoy::Grpc::AsyncStreamImpl::onData /proc/self/cwd/source/common/grpc/async_client_impl.cc:131
	@ 000000000078d93b Envoy::Http::AsyncStreamImpl::encodeData /proc/self/cwd/source/common/http/async_client_impl.cc:108
	@ 00000000006b2a93 Envoy::Http::Http2::ConnectionImpl::onFrameReceived /proc/self/cwd/source/common/http/http2/codec_impl.cc:445
	@ 00000000006b58b6 nghttp2_session_on_data_received /tmp/nghttp2.dep.build/nghttp2-1.29.0/lib/nghttp2_session.c:4881
	@ 00000000006b94e1 nghttp2_session_mem_recv /tmp/nghttp2.dep.build/nghttp2-1.29.0/lib/nghttp2_session.c:6443
	@ 00000000006b1aee Envoy::Http::Http2::ConnectionImpl::dispatch /proc/self/cwd/source/common/http/http2/codec_impl.cc:302
	@ 000000000066634e Envoy::Http::CodecClient::onData /proc/self/cwd/source/common/http/codec_client.cc:115
	@ 00000000006664cc Envoy::Http::CodecClient::CodecReadFilter::onData /proc/self/cwd/bazel-out/k8-opt/bin/source/common/http/_virtual_includes/codec_client_lib/common/http/codec_client.h:159
	@ 000000000056dc66 Envoy::Network::FilterManagerImpl::onContinueReading /proc/self/cwd/source/common/network/filter_manager_impl.cc:56
	@ 000000000056c5ce Envoy::Network::ConnectionImpl::onReadReady /proc/self/cwd/source/common/network/connection_impl.cc:443
	@ 000000000056cded Envoy::Network::ConnectionImpl::onFileEvent /proc/self/cwd/source/common/network/connection_impl.cc:419
	@ 0000000000566307 _FUN /proc/self/cwd/source/common/event/file_event_impl.cc:61
	@ 00000000008a1d11 event_process_active_single_queue.isra.29 /tmp/libevent.dep.build/libevent-2.1.8-stable/event.c:1639
	@ 00000000008a246e event_base_loop /tmp/libevent.dep.build/libevent-2.1.8-stable/event.c:1961
	@ 000000000054dcdd Envoy::Server::InstanceImpl::run /proc/self/cwd/source/server/server.cc:356
	@ 0000000000464850 Envoy::MainCommonBase::run /proc/self/cwd/source/exe/main_common.cc:83
Leak of 498609600 bytes in 831016 objects allocated from:
	@ 0090f485 unknown
	@ 00000000008ae16c SSL_CTX_new ??:0
	@ 00000000006888ae Envoy::Ssl::ContextImpl::ContextImpl /proc/self/cwd/source/common/ssl/context_impl.cc:34
	@ 0000000000689e06 Envoy::Ssl::ServerContextImpl::ServerContextImpl /proc/self/cwd/source/common/ssl/context_impl.cc:439
	@ 000000000068c6af Envoy::Ssl::ContextManagerImpl::createSslServerContext /proc/self/cwd/source/common/ssl/context_manager_impl.cc:75
	@ 000000000056e9bc Envoy::Ssl::ServerSslSocketFactory::ServerSslSocketFactory /proc/self/cwd/source/common/ssl/ssl_socket.cc:375
	@ 000000000051f83c Envoy::Server::Configuration::DownstreamSslSocketFactory::createTransportSocketFactory /proc/self/cwd/source/server/config/network/ssl_socket.cc:36
	@ 000000000055a02c Envoy::Server::ListenerImpl::ListenerImpl /proc/self/cwd/source/server/listener_manager_impl.cc:203
	@ 000000000055b575 Envoy::Server::ListenerManagerImpl::addOrUpdateListener /proc/self/cwd/source/server/listener_manager_impl.cc:330
	@ 000000000076b0dd Envoy::Server::LdsApi::onConfigUpdate /proc/self/cwd/source/server/lds_api.cc:59
	@ 000000000076cce0 Envoy::Config::GrpcMuxSubscriptionImpl::onConfigUpdate /proc/self/cwd/bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:53
	@ 00000000007722b3 Envoy::Config::GrpcMuxImpl::onReceiveMessage /proc/self/cwd/source/common/config/grpc_mux_impl.cc:174
	@ 000000000076f622 Envoy::Grpc::TypedAsyncStreamCallbacks::onReceiveMessageUntyped /proc/self/cwd/bazel-out/k8-opt/bin/include/envoy/grpc/_virtual_includes/async_client_interface/envoy/grpc/async_client.h:172
	@ 0000000000788c65 Envoy::Grpc::AsyncStreamImpl::onData /proc/self/cwd/source/common/grpc/async_client_impl.cc:131
	@ 000000000078d93b Envoy::Http::AsyncStreamImpl::encodeData /proc/self/cwd/source/common/http/async_client_impl.cc:108
	@ 00000000006b2a93 Envoy::Http::Http2::ConnectionImpl::onFrameReceived /proc/self/cwd/source/common/http/http2/codec_impl.cc:445
	@ 00000000006b58b6 nghttp2_session_on_data_received /tmp/nghttp2.dep.build/nghttp2-1.29.0/lib/nghttp2_session.c:4881
	@ 00000000006b94e1 nghttp2_session_mem_recv /tmp/nghttp2.dep.build/nghttp2-1.29.0/lib/nghttp2_session.c:6443
	@ 00000000006b1aee Envoy::Http::Http2::ConnectionImpl::dispatch /proc/self/cwd/source/common/http/http2/codec_impl.cc:302
	@ 000000000066634e Envoy::Http::CodecClient::onData /proc/self/cwd/source/common/http/codec_client.cc:115
	@ 00000000006664cc Envoy::Http::CodecClient::CodecReadFilter::onData /proc/self/cwd/bazel-out/k8-opt/bin/source/common/http/_virtual_includes/codec_client_lib/common/http/codec_client.h:159
	@ 000000000056dc66 Envoy::Network::FilterManagerImpl::onContinueReading /proc/self/cwd/source/common/network/filter_manager_impl.cc:56
	@ 000000000056c5ce Envoy::Network::ConnectionImpl::onReadReady /proc/self/cwd/source/common/network/connection_impl.cc:443
	@ 000000000056cded Envoy::Network::ConnectionImpl::onFileEvent /proc/self/cwd/source/common/network/connection_impl.cc:419
	@ 0000000000566307 _FUN /proc/self/cwd/source/common/event/file_event_impl.cc:61
	@ 00000000008a1d11 event_process_active_single_queue.isra.29 /tmp/libevent.dep.build/libevent-2.1.8-stable/event.c:1639
	@ 00000000008a246e event_base_loop /tmp/libevent.dep.build/libevent-2.1.8-stable/event.c:1961
	@ 000000000054dcdd Envoy::Server::InstanceImpl::run /proc/self/cwd/source/server/server.cc:356
	@ 0000000000464850 Envoy::MainCommonBase::run /proc/self/cwd/source/exe/main_common.cc:83
	@ 00000000004156c8 main /proc/self/cwd/source/exe/main.cc:30
	@ 00007f114a497b8d unknown
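
Both stacks above run through Envoy::Server::ListenerManagerImpl::addOrUpdateListener, i.e. a new SSL server context is built on each LDS listener add/update. One quick sanity check is whether LDS is churning TLS listeners; a sketch, assuming the admin interface is reachable on 127.0.0.1:9901 (stat names may vary slightly by version):

# Assumption: admin interface on 127.0.0.1:9901.
$ curl -s http://127.0.0.1:9901/stats | grep -E 'listener_manager\.(listener_added|listener_modified)|lds\.update_success'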
