Segfaults when CDS data contains conflicting clusters with the same name #3214

Closed
jmillikin-stripe opened this issue Apr 25, 2018 · 11 comments · Fixed by #3253
jmillikin-stripe (Contributor) commented Apr 25, 2018

If a CDS response contains multiple clusters with different properties but the same name, each one is quickly added and then removed again in the cluster manager. Something in Envoy doesn't handle this correctly, and an accidental duplication in our config produced a series of segfaults.

The stack traces are different each time, possibly due to version differences or slightly different execution paths. We got a core dump from one crash and were able to catch two others in Valgrind.
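To make the trigger concrete: nothing more is needed than a clusters array that repeats the same name with different settings, as in the CDS response further below. A minimal standalone sketch (a hypothetical helper, not Envoy code) that flags such a conflicting response:

#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper, not part of Envoy: returns each cluster name that
// appears more than once in a CDS response.
std::vector<std::string> findDuplicateClusterNames(const std::vector<std::string>& names) {
  std::unordered_set<std::string> seen;
  std::unordered_set<std::string> reported;
  std::vector<std::string> duplicates;
  for (const auto& name : names) {
    if (!seen.insert(name).second && reported.insert(name).second) {
      duplicates.push_back(name);
    }
  }
  return duplicates;
}

int main() {
  // Mirrors the CDS response below: three entries, all named "cluster-from-cds".
  const std::vector<std::string> names = {"cluster-from-cds", "cluster-from-cds",
                                          "cluster-from-cds"};
  for (const auto& name : findDuplicateClusterNames(names)) {
    std::cout << "conflicting cluster name: " << name << "\n";
  }
  return 0;
}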


Configs and commands for reproducing the issue:

$ valgrind ./envoy-static --config-path local-sds.yaml --v2-config-only --service-cluster cluster-foo --service-node node-foo
# local-sds.yaml
admin:
  access_log_path: "/dev/null"
  address:
    socket_address:
      address: "127.0.0.1"
      port_value: 9000

static_resources:
  listeners:
    - name: "main-listener"
      address:
        socket_address:
          address: "127.0.0.1"
          port_value: 9001
      filter_chains:
        - filters:
            - name: "envoy.http_connection_manager"
              config:
                codec_type: "AUTO"
                stat_prefix: "outbound_http"
                access_log:
                  - name: "envoy.file_access_log"
                    config:
                      path: "/dev/null"
                http_filters:
                  - name: envoy.router
                    config: {}
                rds:
                  route_config_name: "default"
                  config_source:
                    api_config_source:
                      api_type: "REST_LEGACY"
                      cluster_names: ["cds-source"]
                      refresh_delay:
                        seconds: 1

  clusters:
    - name: cds-source
      connect_timeout: 5s
      type: strict_dns
      lb_policy: round_robin
      hosts:
        - socket_address:
            address: "127.0.0.1"
            port_value: 9002

dynamic_resources:
  cds_config:
    api_config_source:
      api_type: "REST_LEGACY"
      cluster_names: ["cds-source"]
      refresh_delay:
        seconds: 1
  deprecated_v1:
    sds_config:
      api_config_source:
        api_type: "REST_LEGACY"
        cluster_names: ["cds-source"]
        refresh_delay:
          seconds: 1

CDS config:

{
  "clusters": [
    {
      "name": "cluster-from-cds",
      "hosts": [{"url": "tcp://127.0.0.1:8078"}],
      "connect_timeout_ms": 10000,
      "type": "static",
      "lb_type": "round_robin",
      "max_requests_per_connection": 1
    },
    {
      "name": "cluster-from-cds",
      "hosts": [{"url": "tcp://127.0.0.1:8079"}],
      "connect_timeout_ms": 10000,
      "type": "static",
      "lb_type": "round_robin",
      "max_requests_per_connection": 1
    },
    {
      "name": "cluster-from-cds",
      "hosts": [{"url": "tcp://127.0.0.1:8080"}],
      "connect_timeout_ms": 10000,
      "type": "static",
      "lb_type": "round_robin",
      "max_requests_per_connection": 1
    }
  ]
}

RDS config:

{
  "virtual_hosts": [{
    "name": "vhost-from-cds",
    "domains": ["*"],
    "routes": [{
      "prefix": "/",
      "cluster": "cluster-from-cds"
    }]
  }]
}

Stack trace 1 (core + gdb, envoy version 97b69ce)

(gdb) bt
#0  0x00007fb1bf9f81fb in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00000000008f7a23 in Envoy::SignalAction::sigHandler (sig=11, info=<optimized out>, context=<optimized out>) at external/envoy/source/exe/signal_action.cc:39
#2  <signal handler called>
#3  0x0000000007f28720 in ?? ()
#4  0x000000000064013c in Envoy::Ssl::ContextImpl::logHandshake (this=0x43df340, ssl=0x7ebfe08) at external/envoy/source/common/ssl/context_impl.cc:268
#5  0x000000000053f3a3 in Envoy::Ssl::SslSocket::doHandshake (this=this@entry=0x446f650) at external/envoy/source/common/ssl/ssl_socket.cc:96
#6  0x000000000053f91b in Envoy::Ssl::SslSocket::doWrite (this=0x446f650, write_buffer=...) at external/envoy/source/common/ssl/ssl_socket.cc:143
#7  0x000000000053cd48 in Envoy::Network::ConnectionImpl::onWriteReady (this=this@entry=0x434e280) at external/envoy/source/common/network/connection_impl.cc:421
#8  0x000000000053d078 in Envoy::Network::ConnectionImpl::onFileEvent (this=0x434e280, events=3) at external/envoy/source/common/network/connection_impl.cc:364
#9  0x0000000000536b38 in operator() (__args#0=3, this=<optimized out>) at /usr/include/c++/5/functional:2267
#10 operator() (__closure=0x0, arg=<optimized out>, what=<optimized out>) at external/envoy/source/common/event/file_event_impl.cc:60
#11 Envoy::Event::FileEventImpl::<lambda(int, short int, void*)>::_FUN(int, short, void *) () at external/envoy/source/common/event/file_event_impl.cc:61
#12 0x000000000092abd2 in event_persist_closure (ev=<optimized out>, base=0x265e2c0) at event.c:1580
#13 event_process_active_single_queue (base=base@entry=0x265e2c0, max_to_process=max_to_process@entry=2147483647, endtime=endtime@entry=0x0, activeq=<optimized out>) at event.c:1639
#14 0x000000000092b32f in event_process_active (base=0x265e2c0) at event.c:1738
#15 event_base_loop (base=0x265e2c0, flags=0) at event.c:1961
#16 0x000000000051e868 in Envoy::Server::InstanceImpl::run (this=this@entry=0x7ffdc2090720) at external/envoy/source/server/server.cc:362
#17 0x000000000045c7ad in Envoy::main_common (options=...) at external/envoy/source/exe/main_common.cc:91
#18 0x00000000004153be in main (argc=16, argv=0x7ffdc2090d78) at external/envoy/source/exe/main.cc:54

Stack trace 2 (valgrind, envoy version 97b69ce; note: this one used a TLS-enabled cluster list because we thought the crash was related to BoringSSL)

[2018-04-25 22:04:12.628][96556][info][upstream] external/envoy/source/common/upstream/cluster_manager_impl.cc:355] add/update cluster cluster-from-cds
==96556== Thread 19:
==96556== Invalid read of size 8
==96556==    at 0x65F9CB: runLowWatermarkCallbacks (codec_helper.h:18)
==96556==    by 0x65F9CB: Envoy::Http::Http1::ClientConnectionImpl::onBelowLowWatermark() (codec_impl.cc:684)
==96556==    by 0x5FCDE5: complete (timespan.h:24)
==96556==    by 0x5FCDE5: Envoy::Http::Http2::ConnPoolImpl::onConnectionEvent(Envoy::Http::Http2::ConnPoolImpl::ActiveClient&, Envoy::Network::ConnectionEvent) (conn_pool.cc:143)
==96556==    by 0x539634: Envoy::Network::ConnectionImpl::raiseEvent(Envoy::Network::ConnectionEvent) (connection_impl.cc:249)
==96556==    by 0x53F3B1: Envoy::Ssl::SslSocket::doHandshake() (ssl_socket.cc:97)
==96556==    by 0x53F91A: Envoy::Ssl::SslSocket::doWrite(Envoy::Buffer::Instance&) (ssl_socket.cc:143)
==96556==    by 0x53CD47: Envoy::Network::ConnectionImpl::onWriteReady() (connection_impl.cc:421)
==96556==    by 0x53D077: Envoy::Network::ConnectionImpl::onFileEvent(unsigned int) (connection_impl.cc:364)
==96556==    by 0x536B37: Envoy::Event::FileEventImpl::assignEvents(unsigned int)::{lambda(int, short, void*)#1}::_FUN(int, short, void*) (functional:2267)
==96556==    by 0x92ABD1: event_persist_closure (event.c:1580)
==96556==    by 0x92ABD1: event_process_active_single_queue.isra.29 (event.c:1639)
==96556==    by 0x92B32E: event_process_active (event.c:1738)
==96556==    by 0x92B32E: event_base_loop (event.c:1961)
==96556==    by 0x532208: Envoy::Server::WorkerImpl::threadRoutine(Envoy::Server::GuardDog&) (worker_impl.cc:94)
==96556==    by 0x934A7D: operator() (functional:2267)
==96556==    by 0x934A7D: operator() (thread.cc:20)
==96556==    by 0x934A7D: Envoy::Thread::Thread::Thread(std::function<void ()>)::{lambda(void*)#1}::_FUN(void*) (thread.cc:22)
==96556==  Address 0x45503063c303e51 is not stack'd, malloc'd or (recently) free'd

Stack trace 3 (valgrind, envoy version a328bc5)

[2018-04-25 22:41:17.238][59208][info][upstream] external/envoy/source/common/upstream/cluster_manager_impl.cc:392] warming cluster cluster-from-cds complete
==59208== Thread 6:
==59208== Jump to the invalid address stated on the next line
==59208==    at 0x1E: ???
==59208==    by 0x73E567: Envoy::Http::Http2::ConnPoolImpl::newStream(Envoy::Http::StreamDecoder&, Envoy::Http::ConnectionPool::Callbacks&) (conn_pool.cc:89)
==59208==    by 0x770360: encodeHeaders (router.cc:853)
==59208==    by 0x770360: Envoy::Router::Filter::decodeHeaders(Envoy::Http::HeaderMap&, bool) (router.cc:313)
==59208==    by 0x7A88A3: Envoy::Http::ConnectionManagerImpl::ActiveStream::decodeHeaders(Envoy::Http::ConnectionManagerImpl::ActiveStreamDecoderFilter*, Envoy::Http::HeaderMap&, bool) (conn_manager_impl.cc:665)
==59208==    by 0x7A9936: Envoy::Http::ConnectionManagerImpl::ActiveStream::decodeHeaders(std::unique_ptr<Envoy::Http::HeaderMap, std::default_delete<Envoy::Http::HeaderMap> >&&, bool) (conn_manager_impl.cc:593)
==59208==    by 0x7E03C9: Envoy::Http::Http1::ServerConnectionImpl::onMessageComplete() (codec_impl.cc:549)
==59208==    by 0x7DC89D: operator() (codec_impl.cc:270)
==59208==    by 0x7DC89D: Envoy::Http::Http1::ConnectionImpl::{lambda(http_parser*)#7}::_FUN(http_parser*) (codec_impl.cc:272)
==59208==    by 0x7E285F: http_parser_execute (in /pay/home/jmillikin/build-b887d182a3da1cda937a4fc063280a9c940accea/envoy-stripe)
==59208==    by 0x7DDFE0: Envoy::Http::Http1::ConnectionImpl::dispatchSlice(char const*, unsigned long) (codec_impl.cc:327)
==59208==    by 0x7E016A: Envoy::Http::Http1::ConnectionImpl::dispatch(Envoy::Buffer::Instance&) (codec_impl.cc:316)
==59208==    by 0x7A646F: Envoy::Http::ConnectionManagerImpl::onData(Envoy::Buffer::Instance&, bool) (conn_manager_impl.cc:226)
==59208==    by 0x64F8E6: Envoy::Network::FilterManagerImpl::onContinueReading(Envoy::Network::FilterManagerImpl::ActiveReadFilter*) (filter_manager_impl.cc:56)
==59208==  Address 0x1e is not stack'd, malloc'd or (recently) free'd
==59208== 
[2018-04-25 22:41:18.548][59586][critical][backtrace] bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] Caught Segmentation fault, suspect faulting address 0x1e
mattklein123 added this to the 1.7.0 milestone Apr 25, 2018
mattklein123 added the help wanted (Needs help!) label Apr 25, 2018
mattklein123 (Member) commented:

@jmillikin-stripe I think it's probably not related, but can you check whether the issue still happens outside of the commit range specified in #3223? It's possible this is the same issue.

jmillikin-stripe (Contributor, Author) commented:

I can still get it to crash as of 0cadce3 (current HEAD, merged an hour ago).

[2018-04-26 17:26:58.829][64660][info][upstream] external/envoy/source/common/upstream/cluster_manager_impl.cc:392] warming cluster cluster-from-cds complete
==64660== Thread 7:
==64660== Invalid read of size 8
==64660==    at 0x73DE4B: complete (timespan.h:24)
==64660==    by 0x73DE4B: Envoy::Http::Http2::ConnPoolImpl::onConnectionEvent(Envoy::Http::Http2::ConnPoolImpl::ActiveClient&, Envoy::Network::ConnectionEvent) (conn_pool.cc:143)
==64660==    by 0x64E894: Envoy::Network::ConnectionImpl::raiseEvent(Envoy::Network::ConnectionEvent) (connection_impl.cc:275)
==64660==    by 0x651080: Envoy::Ssl::SslSocket::doHandshake() (ssl_socket.cc:102)
==64660==    by 0x651627: Envoy::Ssl::SslSocket::doWrite(Envoy::Buffer::Instance&, bool) (ssl_socket.cc:148)
==64660==    by 0x64E553: Envoy::Network::ConnectionImpl::onWriteReady() (connection_impl.cc:477)
==64660==    by 0x64EA67: Envoy::Network::ConnectionImpl::onFileEvent(unsigned int) (connection_impl.cc:412)
==64660==    by 0x647F37: Envoy::Event::FileEventImpl::assignEvents(unsigned int)::{lambda(int, short, void*)#1}::_FUN(int, short, void*) (functional:2267)
==64660==    by 0x97A1A1: event_persist_closure (event.c:1580)
==64660==    by 0x97A1A1: event_process_active_single_queue.isra.29 (event.c:1639)
==64660==    by 0x97A8FE: event_process_active (event.c:1738)
==64660==    by 0x97A8FE: event_base_loop (event.c:1961)
==64660==    by 0x6437D5: Envoy::Server::WorkerImpl::threadRoutine(Envoy::Server::GuardDog&) (worker_impl.cc:94)
==64660==    by 0x98414D: operator() (functional:2267)
==64660==    by 0x98414D: operator() (thread.cc:22)
==64660==    by 0x98414D: Envoy::Thread::Thread::Thread(std::function<void ()>)::{lambda(void*)#1}::_FUN(void*) (thread.cc:24)
==64660==    by 0x524B183: start_thread (pthread_create.c:312)
==64660==  Address 0x13c is not stack'd, malloc'd or (recently) free'd
==64660== 

mattklein123 (Member) commented:

Current HEAD is broken. Can you try a commit from before the range I mentioned?

jmillikin-stripe (Contributor, Author) commented:

I can reproduce the crash at 14c0c07 (from the 23rd, before the stats changes).

julia-stripe (Contributor) commented Apr 28, 2018

After more spelunking, I believe this may be related to this comment in thread_local_store.h:

  • NOTE: It is theoretically possible that when a scope is deleted, it could be reallocated
    with the same address, and a cache flush operation could race and delete cache data
    for the new scope. This is extremely unlikely, and if it happens the cache will be
    repopulated on the next access.

What I'm seeing (after adding some print statements) is:

new scope 0x3accb60
using scope 0x3accb60
using scope 0x3accb60
new scope 0x3accb60
using scope 0x3accb60
using scope 0x3accb60
erasing scope 0x3accb60
using scope 0x3accb60
>>> segfault <<< 

The pointer printed here is the address of the ScopeImpl. (The "new scope", "using scope", and "erasing scope" lines come from print statements I added locally to trace scope creation, cache use, and cache erasure.)

The segfault happens on this line (in .inc()): https://git.corp.stripe.com/stripe-internal/envoy/blob/97b69ce6a507471180c0585b6742c627962b433b/source/common/http/http1/conn_pool.cc#L294, but the specific line isn't that relevant; it's just the first time a statistic for that cluster happens to be incremented.

The segfault always seems to follow this exact pattern: clearScopeFromCaches is called, and then on the next attempt to increment a counter in the scope it just erased, the program segfaults. I reproduced the segfault about 10 times, and that is what happened every time.
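To make the race concrete, here is a minimal standalone sketch, assuming a thread-local cache keyed by the scope's raw address (a simplification of the real ThreadLocalStoreImpl, with hypothetical stand-in types). The main thread deletes a scope and creates a new one that may land at the same address; the clearScopeFromCaches posted for the old scope then runs on a worker and clobbers state that now belongs to the new scope, while any stale cached entry still references the destroyed scope:

#include <cstdint>
#include <iostream>
#include <memory>
#include <unordered_map>

// Hypothetical stand-ins; not Envoy's actual types.
struct Scope {};
struct CachedStats { int counter = 0; };

// Per-worker cache keyed by the scope's address (the pre-fix scheme).
std::unordered_map<std::uintptr_t, CachedStats> tls_cache;

std::uintptr_t addressKey(const Scope* scope) {
  return reinterpret_cast<std::uintptr_t>(scope);
}

int main() {
  auto scope = std::make_unique<Scope>();
  const std::uintptr_t old_key = addressKey(scope.get());
  tls_cache[old_key].counter = 42;  // a worker thread populated its cache for this scope

  scope.reset();                               // main thread deletes the scope...
  auto new_scope = std::make_unique<Scope>();  // ...and the allocator may reuse the address
  const std::uintptr_t new_key = addressKey(new_scope.get());

  // The clearScopeFromCaches() posted for the OLD scope only runs on the worker
  // some time later. If the address was reused, it erases the entry the NEW scope
  // is already using; conversely, a stale cached entry still points at the
  // destroyed scope, which is the use-after-free seen in the traces above.
  tls_cache.erase(old_key);

  std::cout << std::boolalpha << "address reused: " << (old_key == new_key) << "\n";
  return 0;
}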

mattklein123 (Member) commented:

@julia-stripe I think this is different from the histogram issue. If possible, can you provide a self-contained repro or a core dump with debug symbols?

julia-stripe (Contributor) commented Apr 28, 2018

@mattklein123 not sure what you're asking for; the instructions to reproduce this segfault are in @jmillikin-stripe's first comment, and they're what I'm using to test (except without Valgrind).

I also think it's unrelated to the histogram issue.

mattklein123 (Member) commented:

We don't have access to the management server that you are using. A totally self-contained repro, even one using a dummy management server in a Docker container, would make it easier to debug; alternatively, a full core dump with symbols would be pretty helpful.

julia-stripe (Contributor) commented Apr 28, 2018

I packaged the files from the first comment into a zip file (server.zip) to make this easier to reproduce.

Here are the instructions:

wget https://github.com/envoyproxy/envoy/files/1957908/server.zip
unzip server.zip
# run `curl` in another terminal to exercise Envoy
# while true; do curl localhost:9001; sleep 1; done &
python -m SimpleHTTPServer 9002 & # this is the management server
./envoy-static --config-path local-sds.yaml --v2-config-only --service-cluster cluster-foo --service-node node-foo

julia-stripe (Contributor) commented:

I'm also happy to put a core dump plus debug symbols somewhere, but the Envoy binary with symbols is 200 MB, which makes it a bit hard to send. Let me know if you want it and I'll figure out somewhere to upload it!

mattklein123 (Member) commented:

@julia-stripe thanks. I have a vague idea of what the issue is but the repro will help. Will take a look.

mattklein123 self-assigned this Apr 28, 2018
mattklein123 removed the help wanted (Needs help!) label Apr 28, 2018
mattklein123 added a commit that referenced this issue Apr 29, 2018
The memory allocator can reuse a memory address immediately, leading to
a situation in which TLS references are used for an already destroyed
scope. Instead, an incrementing ID is used as the cache key.

Fixes #3214

Signed-off-by: Matt Klein <mklein@lyft.com>
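Restated as code: the fix described in the commit message above keys the per-worker caches by an ID that is never reused, instead of by the scope's address. The sketch below is only an illustration of that idea; the names are hypothetical and the actual change is in #3253 (in real multi-threaded code the ID counter would need to be atomic or lock-protected):

#include <cstdint>
#include <iostream>
#include <memory>
#include <unordered_map>

// Illustrative sketch only; hypothetical names, not the code from #3253.
struct CachedStats { int counter = 0; };

class Scope {
public:
  Scope() : cache_id_(next_id_++) {}
  std::uint64_t cacheId() const { return cache_id_; }

private:
  // Monotonically increasing and never reused, unlike a heap address.
  static std::uint64_t next_id_;
  const std::uint64_t cache_id_;
};
std::uint64_t Scope::next_id_ = 0;

// Per-worker cache keyed by the scope ID rather than the scope address.
std::unordered_map<std::uint64_t, CachedStats> tls_cache;

int main() {
  auto scope = std::make_unique<Scope>();
  const std::uint64_t old_id = scope->cacheId();
  tls_cache[old_id].counter = 42;

  scope.reset();                               // even if the allocator reuses the address...
  auto new_scope = std::make_unique<Scope>();  // ...the new scope gets a fresh ID,
  tls_cache.erase(old_id);                     // so a flush for the old scope cannot touch
                                               // cache entries belonging to the new one.
  std::cout << "old id: " << old_id << ", new id: " << new_scope->cacheId() << "\n";
  return 0;
}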
mattklein123 added a commit that referenced this issue May 1, 2018
…#3253)

The memory allocator can reuse a memory address immediately, leading to
a situation in which TLS references are used for an already destroyed
scope. Instead, an incrementing ID is used as the cache key.

Fixes #3214

Signed-off-by: Matt Klein <mklein@lyft.com>
ramaraochavali pushed a commit to ramaraochavali/envoy that referenced this issue May 3, 2018
…envoyproxy#3253)

The memory allocator can reuse a memory address immediately, leading to
a situation in which TLS references are used for an already destroyed
scope. Instead, an incrementing ID is used as the cache key.

Fixes envoyproxy#3214

Signed-off-by: Matt Klein <mklein@lyft.com>

Signed-off-by: Rama <rama.rao@salesforce.com>