Segfaults when CDS data contains conflicting clusters with the same name #3214
@jmillikin-stripe I think it's probably not related, but can you check to make sure the issue still happens outside of the range specified in #3223? It's possible this is the same issue.
I can still get it to crash as of 0cadce3 (current HEAD, merged an hour ago).
Current HEAD is broken. Can you try before the range I mentioned?
I can reproduce the crash at 14c0c07 (from the 23rd, before the stats changes).
After more spelunking, I believe this may be related to this comment in thread_local_store.h:
What I'm seeing (after adding some print statements) is:
The pointer printed out here is to the scope; here's where those print statements come from:
The segfault happens on this line. The segfault always seems to follow this exact pattern --
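To make that failure mode concrete, here is a minimal, self-contained sketch (not Envoy code; the types and names are hypothetical) of the pattern described above: a per-thread cache keyed by the raw scope pointer can serve an entry that belongs to an already destroyed scope once the allocator reuses the freed address.

```cpp
// Hypothetical illustration of a thread-local cache keyed by Scope*.
// If a scope is destroyed and the allocator immediately reuses its address
// for a new scope, a lookup for the new scope hits the stale entry cached
// for the old one (in the real crash that entry refers to freed memory,
// hence the segfault; here we only demonstrate the stale lookup).
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>

struct Scope {
  explicit Scope(std::string name) : name_(std::move(name)) {}
  std::string name_;
};

// Per-thread cache keyed by pointer (the pre-fix scheme).
std::unordered_map<const Scope*, std::string> tls_cache;

int main() {
  auto* old_scope = new Scope("cluster.foo_v1");
  const uintptr_t old_addr = reinterpret_cast<uintptr_t>(old_scope);
  tls_cache[old_scope] = "entry for " + old_scope->name_;

  delete old_scope;  // scope destroyed; cross-thread cache cleanup hasn't run yet

  // The allocator is free to hand the same address to the next allocation.
  auto* new_scope = new Scope("cluster.foo_v2");
  if (reinterpret_cast<uintptr_t>(new_scope) == old_addr) {
    // Lookup by pointer finds the entry cached for the destroyed scope.
    std::cout << "stale hit: " << tls_cache[new_scope] << "\n";
  }
  delete new_scope;
  return 0;
}
```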
@julia-stripe I think it's a different issue from the histogram issue. If possible, can you provide a self-contained repro or a core dump w/ debug symbols?
@mattklein123 not sure what you're asking for -- the instructions to reproduce this segfault are in @jmillikin-stripe's first comment, and they're what I'm using to test (except without using valgrind). I also think it's unrelated to the histogram issue.
We don't have access to the management server that you are using. A totally self-contained repro, even one using a dummy management server in a Docker container, would make it easier to debug, or a full core dump w/ symbols would be pretty helpful.
I packaged the files in the first comment into a zip file (server.zip) to make it easier to reproduce. Here are the instructions:
I'm also happy to put a core dump + debug symbols somewhere, but the Envoy binary with the symbols is 200MB, which makes it a bit hard to send. Let me know if you want it and I'll figure out somewhere to upload it!
@julia-stripe thanks. I have a vague idea of what the issue is but the repro will help. Will take a look.
The memory allocator can reuse a memory address immediately, leading to a situation in which TLS references are used for an already destroyed scope. Instead, an incrementing ID is used as the cache key.

Fixes #3214

Signed-off-by: Matt Klein <mklein@lyft.com>
…envoyproxy#3253) The memory allocator can reuse a memory address immediately, leading to a situation in which TLS references are used for an already destroyed scope. Instead, an incrementing ID is used as the cache key.

Fixes envoyproxy#3214

Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
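For reference, the shape of the fix described in the commit message above is to stop keying the cache on the scope's address. A rough sketch of that idea, with illustrative names rather than Envoy's actual types:

```cpp
// Sketch of the approach from the fix commit: each scope takes a monotonically
// incrementing ID at construction, and the per-thread cache is keyed by that
// ID. Even if the allocator reuses a destroyed scope's address, the new scope
// gets a fresh ID, so it can never alias a stale cache entry.
#include <atomic>
#include <cstdint>
#include <string>
#include <unordered_map>

class ScopeWithId {
public:
  explicit ScopeWithId(std::string name)
      : id_(next_id_.fetch_add(1)), name_(std::move(name)) {}
  uint64_t id() const { return id_; }
  const std::string& name() const { return name_; }

private:
  static std::atomic<uint64_t> next_id_;
  const uint64_t id_;
  std::string name_;
};

std::atomic<uint64_t> ScopeWithId::next_id_{0};

// Per-thread cache keyed by scope ID instead of Scope*.
thread_local std::unordered_map<uint64_t, std::string> tls_cache_by_id;

int main() {
  ScopeWithId v1("cluster.foo_v1");
  tls_cache_by_id[v1.id()] = "entry for v1";
  // A later scope, even one occupying reused memory, has a distinct ID,
  // so its lookups cannot hit the entry cached for the earlier scope.
  ScopeWithId v2("cluster.foo_v2");
  return static_cast<int>(tls_cache_by_id.count(v2.id()));  // 0: no stale hit
}
```

Unlike an address, the counter value is never handed out twice, which is what removes the reuse race.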
If a CDS response contains multiple clusters with different properties but the same name, they each get quickly added and removed in the cluster manager. Something in Envoy isn't properly handling this, and we got a bunch of segfaults from an accidental duplication in our config.
The stack traces are different each time, possibly due to version differences or slightly different execution paths. We got a core dump from one, and were able to catch two others in Valgrind.
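As a concrete illustration of the duplicate-name churn described above (this is not Envoy's actual cluster-manager code; the names and config fields are made up), a single update that carries the same cluster name twice with different properties ends up removing and re-adding that cluster back to back:

```cpp
// Illustrative sketch of the add/remove churn: if an update keyed by cluster
// name sees the same name twice with different properties, the first cluster
// is torn down almost immediately after it was created, which is the rapid
// create/destroy cycle this issue reports.
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct ClusterConfig {
  std::string name;
  std::string properties;  // stand-in for the full cluster definition
};

int main() {
  // Hypothetical CDS response with two conflicting definitions of "foo".
  std::vector<ClusterConfig> cds_response = {
      {"foo", "connect_timeout: 1s"},
      {"foo", "connect_timeout: 5s"},
  };

  std::map<std::string, ClusterConfig> active_clusters;
  for (const auto& cluster : cds_response) {
    auto it = active_clusters.find(cluster.name);
    if (it != active_clusters.end() && it->second.properties != cluster.properties) {
      // Different config under the same name: remove the old cluster...
      std::cout << "removing " << cluster.name << "\n";
      active_clusters.erase(it);
    }
    // ...and add the new one. With duplicates in a single response this
    // happens back to back, destroying objects created moments earlier.
    std::cout << "adding " << cluster.name << "\n";
    active_clusters.emplace(cluster.name, cluster);
  }
  return 0;
}
```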
Configs+commands for reproducing the issue:
CDS config:
RDS config:
Stack trace 1 (core + gdb, envoy version 97b69ce)
Stack trace 2 (valgrind, envoy version 97b69ce) (note, this one using a TLS-enabled cluster list because we thought the crash was related to BoringSSL)
Stack trace 3 (valgrind, envoy version a328bc5)