Segfaults when CDS data contains conflicting clusters with the same name #3214

Closed
jmillikin-stripe opened this issue Apr 25, 2018 · 11 comments · Fixed by #3253
jmillikin-stripe (Contributor) commented Apr 25, 2018

If a CDS response contains multiple clusters with different properties but the same name, each one is quickly added and then removed again in the cluster manager. Something in Envoy doesn't handle this correctly, and an accidental duplication in our config produced a series of segfaults.

The stack traces are different each time, possibly due to version differences or slightly different execution paths. We got a core dump from one crash and were able to catch two others in Valgrind.
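To make the trigger concrete: nothing more is needed than a clusters array that repeats the same name with different settings, as in the CDS response further below. A minimal standalone sketch (a hypothetical helper, not Envoy code) that flags such a conflicting response:

#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper, not part of Envoy: returns each cluster name that
// appears more than once in a CDS response.
std::vector<std::string> findDuplicateClusterNames(const std::vector<std::string>& names) {
  std::unordered_set<std::string> seen;
  std::unordered_set<std::string> reported;
  std::vector<std::string> duplicates;
  for (const auto& name : names) {
    if (!seen.insert(name).second && reported.insert(name).second) {
      duplicates.push_back(name);
    }
  }
  return duplicates;
}

int main() {
  // Mirrors the CDS response below: three entries, all named "cluster-from-cds".
  const std::vector<std::string> names = {"cluster-from-cds", "cluster-from-cds",
                                          "cluster-from-cds"};
  for (const auto& name : findDuplicateClusterNames(names)) {
    std::cout << "conflicting cluster name: " << name << "\n";
  }
  return 0;
}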


Configs and commands for reproducing the issue:

$ valgrind ./envoy-static --config-path local-sds.yaml --v2-config-only --service-cluster cluster-foo --service-node node-foo
# local-sds.yaml
admin:
  access_log_path: "/dev/null"
  address:
    socket_address:
      address: "127.0.0.1"
      port_value: 9000

static_resources:
  listeners:
    - name: "main-listener"
      address:
        socket_address:
          address: "127.0.0.1"
          port_value: 9001
      filter_chains:
        - filters:
            - name: "envoy.http_connection_manager"
              config:
                codec_type: "AUTO"
                stat_prefix: "outbound_http"
                access_log:
                  - name: "envoy.file_access_log"
                    config:
                      path: "/dev/null"
                http_filters:
                  - name: envoy.router
                    config: {}
                rds:
                  route_config_name: "default"
                  config_source:
                    api_config_source:
                      api_type: "REST_LEGACY"
                      cluster_names: ["cds-source"]
                      refresh_delay:
                        seconds: 1

  clusters:
    - name: cds-source
      connect_timeout: 5s
      type: strict_dns
      lb_policy: round_robin
      hosts:
        - socket_address:
            address: "127.0.0.1"
            port_value: 9002

dynamic_resources:
  cds_config:
    api_config_source:
      api_type: "REST_LEGACY"
      cluster_names: ["cds-source"]
      refresh_delay:
        seconds: 1
  deprecated_v1:
    sds_config:
      api_config_source:
        api_type: "REST_LEGACY"
        cluster_names: ["cds-source"]
        refresh_delay:
          seconds: 1

CDS config:

{
  "clusters": [
    {
      "name": "cluster-from-cds",
      "hosts": [{"url": "tcp://127.0.0.1:8078"}],
      "connect_timeout_ms": 10000,
      "type": "static",
      "lb_type": "round_robin",
      "max_requests_per_connection": 1
    },
    {
      "name": "cluster-from-cds",
      "hosts": [{"url": "tcp://127.0.0.1:8079"}],
      "connect_timeout_ms": 10000,
      "type": "static",
      "lb_type": "round_robin",
      "max_requests_per_connection": 1
    },
    {
      "name": "cluster-from-cds",
      "hosts": [{"url": "tcp://127.0.0.1:8080"}],
      "connect_timeout_ms": 10000,
      "type": "static",
      "lb_type": "round_robin",
      "max_requests_per_connection": 1
    }
  ]
}

RDS config:

{
  "virtual_hosts": [{
    "name": "vhost-from-cds",
    "domains": ["*"],
    "routes": [{
      "prefix": "/",
      "cluster": "cluster-from-cds"
    }]
  }]
}

Stack trace 1 (core + gdb, envoy version 97b69ce)

(gdb) bt
#0  0x00007fb1bf9f81fb in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00000000008f7a23 in Envoy::SignalAction::sigHandler (sig=11, info=<optimized out>, context=<optimized out>) at external/envoy/source/exe/signal_action.cc:39
#2  <signal handler called>
#3  0x0000000007f28720 in ?? ()
#4  0x000000000064013c in Envoy::Ssl::ContextImpl::logHandshake (this=0x43df340, ssl=0x7ebfe08) at external/envoy/source/common/ssl/context_impl.cc:268
#5  0x000000000053f3a3 in Envoy::Ssl::SslSocket::doHandshake (this=this@entry=0x446f650) at external/envoy/source/common/ssl/ssl_socket.cc:96
#6  0x000000000053f91b in Envoy::Ssl::SslSocket::doWrite (this=0x446f650, write_buffer=...) at external/envoy/source/common/ssl/ssl_socket.cc:143
#7  0x000000000053cd48 in Envoy::Network::ConnectionImpl::onWriteReady (this=this@entry=0x434e280) at external/envoy/source/common/network/connection_impl.cc:421
#8  0x000000000053d078 in Envoy::Network::ConnectionImpl::onFileEvent (this=0x434e280, events=3) at external/envoy/source/common/network/connection_impl.cc:364
#9  0x0000000000536b38 in operator() (__args#0=3, this=<optimized out>) at /usr/include/c++/5/functional:2267
#10 operator() (__closure=0x0, arg=<optimized out>, what=<optimized out>) at external/envoy/source/common/event/file_event_impl.cc:60
#11 Envoy::Event::FileEventImpl::<lambda(int, short int, void*)>::_FUN(int, short, void *) () at external/envoy/source/common/event/file_event_impl.cc:61
#12 0x000000000092abd2 in event_persist_closure (ev=<optimized out>, base=0x265e2c0) at event.c:1580
#13 event_process_active_single_queue (base=base@entry=0x265e2c0, max_to_process=max_to_process@entry=2147483647, endtime=endtime@entry=0x0, activeq=<optimized out>) at event.c:1639
#14 0x000000000092b32f in event_process_active (base=0x265e2c0) at event.c:1738
#15 event_base_loop (base=0x265e2c0, flags=0) at event.c:1961
#16 0x000000000051e868 in Envoy::Server::InstanceImpl::run (this=this@entry=0x7ffdc2090720) at external/envoy/source/server/server.cc:362
#17 0x000000000045c7ad in Envoy::main_common (options=...) at external/envoy/source/exe/main_common.cc:91
#18 0x00000000004153be in main (argc=16, argv=0x7ffdc2090d78) at external/envoy/source/exe/main.cc:54

Stack trace 2 (valgrind, envoy version 97b69ce; note: this one used a TLS-enabled cluster list because we thought the crash was related to BoringSSL)

[2018-04-25 22:04:12.628][96556][info][upstream] external/envoy/source/common/upstream/cluster_manager_impl.cc:355] add/update cluster cluster-from-cds
==96556== Thread 19:
==96556== Invalid read of size 8
==96556==    at 0x65F9CB: runLowWatermarkCallbacks (codec_helper.h:18)
==96556==    by 0x65F9CB: Envoy::Http::Http1::ClientConnectionImpl::onBelowLowWatermark() (codec_impl.cc:684)
==96556==    by 0x5FCDE5: complete (timespan.h:24)
==96556==    by 0x5FCDE5: Envoy::Http::Http2::ConnPoolImpl::onConnectionEvent(Envoy::Http::Http2::ConnPoolImpl::ActiveClient&, Envoy::Network::ConnectionEvent) (conn_pool.cc:143)
==96556==    by 0x539634: Envoy::Network::ConnectionImpl::raiseEvent(Envoy::Network::ConnectionEvent) (connection_impl.cc:249)
==96556==    by 0x53F3B1: Envoy::Ssl::SslSocket::doHandshake() (ssl_socket.cc:97)
==96556==    by 0x53F91A: Envoy::Ssl::SslSocket::doWrite(Envoy::Buffer::Instance&) (ssl_socket.cc:143)
==96556==    by 0x53CD47: Envoy::Network::ConnectionImpl::onWriteReady() (connection_impl.cc:421)
==96556==    by 0x53D077: Envoy::Network::ConnectionImpl::onFileEvent(unsigned int) (connection_impl.cc:364)
==96556==    by 0x536B37: Envoy::Event::FileEventImpl::assignEvents(unsigned int)::{lambda(int, short, void*)#1}::_FUN(int, short, void*) (functional:2267)
==96556==    by 0x92ABD1: event_persist_closure (event.c:1580)
==96556==    by 0x92ABD1: event_process_active_single_queue.isra.29 (event.c:1639)
==96556==    by 0x92B32E: event_process_active (event.c:1738)
==96556==    by 0x92B32E: event_base_loop (event.c:1961)
==96556==    by 0x532208: Envoy::Server::WorkerImpl::threadRoutine(Envoy::Server::GuardDog&) (worker_impl.cc:94)
==96556==    by 0x934A7D: operator() (functional:2267)
==96556==    by 0x934A7D: operator() (thread.cc:20)
==96556==    by 0x934A7D: Envoy::Thread::Thread::Thread(std::function<void ()>)::{lambda(void*)#1}::_FUN(void*) (thread.cc:22)
==96556==  Address 0x45503063c303e51 is not stack'd, malloc'd or (recently) free'd

Stack trace 3 (valgrind, envoy version a328bc5)

[2018-04-25 22:41:17.238][59208][info][upstream] external/envoy/source/common/upstream/cluster_manager_impl.cc:392] warming cluster cluster-from-cds complete
==59208== Thread 6:
==59208== Jump to the invalid address stated on the next line
==59208==    at 0x1E: ???
==59208==    by 0x73E567: Envoy::Http::Http2::ConnPoolImpl::newStream(Envoy::Http::StreamDecoder&, Envoy::Http::ConnectionPool::Callbacks&) (conn_pool.cc:89)
==59208==    by 0x770360: encodeHeaders (router.cc:853)
==59208==    by 0x770360: Envoy::Router::Filter::decodeHeaders(Envoy::Http::HeaderMap&, bool) (router.cc:313)
==59208==    by 0x7A88A3: Envoy::Http::ConnectionManagerImpl::ActiveStream::decodeHeaders(Envoy::Http::ConnectionManagerImpl::ActiveStreamDecoderFilter*, Envoy::Http::HeaderMap&, bool) (conn_manager_impl.cc:665)
==59208==    by 0x7A9936: Envoy::Http::ConnectionManagerImpl::ActiveStream::decodeHeaders(std::unique_ptr<Envoy::Http::HeaderMap, std::default_delete<Envoy::Http::HeaderMap> >&&, bool) (conn_manager_impl.cc:593)
==59208==    by 0x7E03C9: Envoy::Http::Http1::ServerConnectionImpl::onMessageComplete() (codec_impl.cc:549)
==59208==    by 0x7DC89D: operator() (codec_impl.cc:270)
==59208==    by 0x7DC89D: Envoy::Http::Http1::ConnectionImpl::{lambda(http_parser*)#7}::_FUN(http_parser*) (codec_impl.cc:272)
==59208==    by 0x7E285F: http_parser_execute (in /pay/home/jmillikin/build-b887d182a3da1cda937a4fc063280a9c940accea/envoy-stripe)
==59208==    by 0x7DDFE0: Envoy::Http::Http1::ConnectionImpl::dispatchSlice(char const*, unsigned long) (codec_impl.cc:327)
==59208==    by 0x7E016A: Envoy::Http::Http1::ConnectionImpl::dispatch(Envoy::Buffer::Instance&) (codec_impl.cc:316)
==59208==    by 0x7A646F: Envoy::Http::ConnectionManagerImpl::onData(Envoy::Buffer::Instance&, bool) (conn_manager_impl.cc:226)
==59208==    by 0x64F8E6: Envoy::Network::FilterManagerImpl::onContinueReading(Envoy::Network::FilterManagerImpl::ActiveReadFilter*) (filter_manager_impl.cc:56)
==59208==  Address 0x1e is not stack'd, malloc'd or (recently) free'd
==59208== 
[2018-04-25 22:41:18.548][59586][critical][backtrace] bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] Caught Segmentation fault, suspect faulting address 0x1e
mattklein123 added this to the 1.7.0 milestone Apr 25, 2018
mattklein123 added the help wanted (Needs help!) label Apr 25, 2018
mattklein123 (Member) commented:

@jmillikin-stripe I think it's probably not related, but can you check whether the issue still happens outside of the commit range specified in #3223? It's possible this is the same issue.

jmillikin-stripe (Contributor, Author) commented:

I can still get it to crash as of 0cadce3 (current HEAD, merged an hour ago).

[2018-04-26 17:26:58.829][64660][info][upstream] external/envoy/source/common/upstream/cluster_manager_impl.cc:392] warming cluster cluster-from-cds complete
==64660== Thread 7:
==64660== Invalid read of size 8
==64660==    at 0x73DE4B: complete (timespan.h:24)
==64660==    by 0x73DE4B: Envoy::Http::Http2::ConnPoolImpl::onConnectionEvent(Envoy::Http::Http2::ConnPoolImpl::ActiveClient&, Envoy::Network::ConnectionEvent) (conn_pool.cc:143)
==64660==    by 0x64E894: Envoy::Network::ConnectionImpl::raiseEvent(Envoy::Network::ConnectionEvent) (connection_impl.cc:275)
==64660==    by 0x651080: Envoy::Ssl::SslSocket::doHandshake() (ssl_socket.cc:102)
==64660==    by 0x651627: Envoy::Ssl::SslSocket::doWrite(Envoy::Buffer::Instance&, bool) (ssl_socket.cc:148)
==64660==    by 0x64E553: Envoy::Network::ConnectionImpl::onWriteReady() (connection_impl.cc:477)
==64660==    by 0x64EA67: Envoy::Network::ConnectionImpl::onFileEvent(unsigned int) (connection_impl.cc:412)
==64660==    by 0x647F37: Envoy::Event::FileEventImpl::assignEvents(unsigned int)::{lambda(int, short, void*)#1}::_FUN(int, short, void*) (functional:2267)
==64660==    by 0x97A1A1: event_persist_closure (event.c:1580)
==64660==    by 0x97A1A1: event_process_active_single_queue.isra.29 (event.c:1639)
==64660==    by 0x97A8FE: event_process_active (event.c:1738)
==64660==    by 0x97A8FE: event_base_loop (event.c:1961)
==64660==    by 0x6437D5: Envoy::Server::WorkerImpl::threadRoutine(Envoy::Server::GuardDog&) (worker_impl.cc:94)
==64660==    by 0x98414D: operator() (functional:2267)
==64660==    by 0x98414D: operator() (thread.cc:22)
==64660==    by 0x98414D: Envoy::Thread::Thread::Thread(std::function<void ()>)::{lambda(void*)#1}::_FUN(void*) (thread.cc:24)
==64660==    by 0x524B183: start_thread (pthread_create.c:312)
==64660==  Address 0x13c is not stack'd, malloc'd or (recently) free'd
==64660== 

mattklein123 (Member) commented:

Current HEAD is broken. Can you try a commit from before the range I mentioned?

jmillikin-stripe (Contributor, Author) commented:

I can reproduce the crash at 14c0c07 (from the 23rd, before the stats changes).

julia-stripe (Contributor) commented Apr 28, 2018

After more spelunking, I believe this may be related to this comment in thread_local_store.h:

  • NOTE: It is theoretically possible that when a scope is deleted, it could be reallocated
    with the same address, and a cache flush operation could race and delete cache data
    for the new scope. This is extremely unlikely, and if it happens the cache will be
    repopulated on the next access.

What I'm seeing (after adding some print statements) is:

new scope 0x3accb60
using scope 0x3accb60
using scope 0x3accb60
new scope 0x3accb60
using scope 0x3accb60
using scope 0x3accb60
erasing scope 0x3accb60
using scope 0x3accb60
>>> segfault <<< 

The pointer printed here is the address of the ScopeImpl. (The "new scope", "using scope", and "erasing scope" lines come from print statements I added locally to trace scope creation, cache use, and cache erasure.)

The segfault happens on this line (in .inc()): https://git.corp.stripe.com/stripe-internal/envoy/blob/97b69ce6a507471180c0585b6742c627962b433b/source/common/http/http1/conn_pool.cc#L294, but the specific line isn't that relevant; it's just the first time a statistic for that cluster happens to be incremented.

The segfault always seems to follow this exact pattern: clearScopeFromCaches is called, and then on the next attempt to increment a counter in the scope it just erased, the program segfaults. I reproduced the segfault about 10 times, and that is what happened every time.
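To make the race concrete, here is a minimal standalone sketch, assuming a thread-local cache keyed by the scope's raw address (a simplification of the real ThreadLocalStoreImpl, with hypothetical stand-in types). The main thread deletes a scope and creates a new one that may land at the same address; the clearScopeFromCaches posted for the old scope then runs on a worker and clobbers state that now belongs to the new scope, while any stale cached entry still references the destroyed scope:

#include <cstdint>
#include <iostream>
#include <memory>
#include <unordered_map>

// Hypothetical stand-ins; not Envoy's actual types.
struct Scope {};
struct CachedStats { int counter = 0; };

// Per-worker cache keyed by the scope's address (the pre-fix scheme).
std::unordered_map<std::uintptr_t, CachedStats> tls_cache;

std::uintptr_t addressKey(const Scope* scope) {
  return reinterpret_cast<std::uintptr_t>(scope);
}

int main() {
  auto scope = std::make_unique<Scope>();
  const std::uintptr_t old_key = addressKey(scope.get());
  tls_cache[old_key].counter = 42;  // a worker thread populated its cache for this scope

  scope.reset();                               // main thread deletes the scope...
  auto new_scope = std::make_unique<Scope>();  // ...and the allocator may reuse the address
  const std::uintptr_t new_key = addressKey(new_scope.get());

  // The clearScopeFromCaches() posted for the OLD scope only runs on the worker
  // some time later. If the address was reused, it erases the entry the NEW scope
  // is already using; conversely, a stale cached entry still points at the
  // destroyed scope, which is the use-after-free seen in the traces above.
  tls_cache.erase(old_key);

  std::cout << std::boolalpha << "address reused: " << (old_key == new_key) << "\n";
  return 0;
}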

mattklein123 (Member) commented:

@julia-stripe I think this is different from the histogram issue. If possible, can you provide a self-contained repro or a core dump with debug symbols?

julia-stripe (Contributor) commented Apr 28, 2018

@mattklein123 not sure what you're asking for; the instructions to reproduce this segfault are in @jmillikin-stripe's first comment, and they're what I'm using to test (except without Valgrind).

I also think it's unrelated to the histogram issue.

mattklein123 (Member) commented:

We don't have access to the management server that you are using. A totally self-contained repro, even one using a dummy management server in a Docker container, would make it easier to debug; alternatively, a full core dump with symbols would be pretty helpful.

julia-stripe (Contributor) commented Apr 28, 2018

I packaged the files from the first comment into a zip file (server.zip) to make this easier to reproduce.

Here are the instructions:

wget https://github.com/envoyproxy/envoy/files/1957908/server.zip
unzip server.zip
# run `curl` in another terminal to exercise Envoy
# while true; do curl localhost:9001; sleep 1; done &
python -m SimpleHTTPServer 9002 & # this is the management server
./envoy-static --config-path local-sds.yaml --v2-config-only --service-cluster cluster-foo --service-node node-foo

julia-stripe (Contributor) commented:

I'm also happy to put a core dump plus debug symbols somewhere, but the Envoy binary with symbols is 200 MB, which makes it a bit hard to send. Let me know if you want it and I'll figure out somewhere to upload it!

mattklein123 (Member) commented:

@julia-stripe thanks. I have a vague idea of what the issue is but the repro will help. Will take a look.

mattklein123 self-assigned this Apr 28, 2018
mattklein123 removed the help wanted (Needs help!) label Apr 28, 2018
mattklein123 added a commit that referenced this issue Apr 29, 2018
The memory allocator can reuse a memory address immediately, leading to
a situation in which TLS references are used for an already destroyed
scope. Instead, an incrementing ID is used as the cache key.

Fixes #3214

Signed-off-by: Matt Klein <mklein@lyft.com>
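Restated as code: the fix described in the commit message above keys the per-worker caches by an ID that is never reused, instead of by the scope's address. The sketch below is only an illustration of that idea; the names are hypothetical and the actual change is in #3253 (in real multi-threaded code the ID counter would need to be atomic or lock-protected):

#include <cstdint>
#include <iostream>
#include <memory>
#include <unordered_map>

// Illustrative sketch only; hypothetical names, not the code from #3253.
struct CachedStats { int counter = 0; };

class Scope {
public:
  Scope() : cache_id_(next_id_++) {}
  std::uint64_t cacheId() const { return cache_id_; }

private:
  // Monotonically increasing and never reused, unlike a heap address.
  static std::uint64_t next_id_;
  const std::uint64_t cache_id_;
};
std::uint64_t Scope::next_id_ = 0;

// Per-worker cache keyed by the scope ID rather than the scope address.
std::unordered_map<std::uint64_t, CachedStats> tls_cache;

int main() {
  auto scope = std::make_unique<Scope>();
  const std::uint64_t old_id = scope->cacheId();
  tls_cache[old_id].counter = 42;

  scope.reset();                               // even if the allocator reuses the address...
  auto new_scope = std::make_unique<Scope>();  // ...the new scope gets a fresh ID,
  tls_cache.erase(old_id);                     // so a flush for the old scope cannot touch
                                               // cache entries belonging to the new one.
  std::cout << "old id: " << old_id << ", new id: " << new_scope->cacheId() << "\n";
  return 0;
}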
mattklein123 added a commit that referenced this issue May 1, 2018
…#3253)

The memory allocator can reuse a memory address immediately, leading to
a situation in which TLS references are used for an already destroyed
scope. Instead, an incrementing ID is used as the cache key.

Fixes #3214

Signed-off-by: Matt Klein <mklein@lyft.com>
ramaraochavali pushed a commit to ramaraochavali/envoy that referenced this issue May 3, 2018
…envoyproxy#3253)

The memory allocator can reuse a memory address immediately, leading to
a situation in which TLS references are used for an already destroyed
scope. Instead, an incrementing ID is used as the cache key.

Fixes envoyproxy#3214

Signed-off-by: Matt Klein <mklein@lyft.com>

Signed-off-by: Rama <rama.rao@salesforce.com>