memory corruption in histogram change #3223
There is a memory corruption issue somewhere in the following commit range: b9e7961...a328bc5
The crash signatures are not consistent, but when it first happened I was pretty sure it was somehow related to e293ffb. However, I reverted that, and we are still seeing crashes. (I also spent 2 hours looking at the code and it seems fine.)
Given the changes in the above range, I think the next most likely candidate is the histogram change: 9a1e49f
I'm going to spend time today digging into this, but I wanted to open this in case anyone else wants to scan through the diffs and help with bug spotting.
cc @envoyproxy/maintainers @jmarantz @mrice32 @ramaraochavali
Comments
Have you found a way to repro? |
No, it's a low rate crash. No repro currently, just load/time. |
Can you recover any stacktrace? |
I'm going to spin up running all tests under valgrind, but in terms of bug-finding, we may also need to look at the circllhist C impl, or the interaction with it, as opposed to just the diffs in Envoy's repo. |
valgrind is noisy, but I have a pending branch somewhere to allow for our rampant mismatched new/malloc/free calls (or those in dependencies). Nevertheless, running it changed the timing, and this assert popped up:
[2018-04-26 14:11:43.056][24][critical][assert] source/common/stats/thread_local_store.cc:103] assert failure: !merge_in_progress_
This was in metrics_service_integration_test. This looks familiar, @ramaraochavali |
I have lots of stack traces, but none of them are a smoking gun. When I get to the office in a few hours I will dig in more.
|
Yeah, the !merge_in_progress_ ASSERT was added on the last day. I will dig in to see why the MetricsService integration test is running merge twice and report back.
|
Thanks @ramaraochavali -- Probably we wanted to use either |
Yes. The original implementation was to tie it to !shutting_down_ and skip the callback, but at the last minute we changed it to an ASSERT. |
Yeah, but we should still leave in the conditional, especially as this is not a RELEASE_ASSERT; otherwise, if it happens under load, the hard-to-define behavior leaks through. Of course there's no guarantee this is actually the issue Matt saw. |
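To make the tradeoff concrete, here is a minimal sketch of the guard shape under discussion. StoreSketch and mergeHistograms are simplified stand-ins, not Envoy's actual ThreadLocalStore code, and assert() stands in for Envoy's ASSERT macro:

```cpp
#include <cassert>
#include <functional>

// Simplified stand-in; merge_in_progress_ and shutting_down_ mirror the
// flags mentioned in this thread, everything else is assumed.
struct StoreSketch {
  bool merge_in_progress_{false};
  bool shutting_down_{false};

  void mergeHistograms(const std::function<void()>& merge_complete_cb) {
    assert(!merge_in_progress_);  // debug builds: flag overlapping merges loudly
    if (merge_in_progress_ || shutting_down_) {
      return;  // release builds (assert compiled out): skip rather than overlap
    }
    merge_in_progress_ = true;
    // In the real store, beginMerge() would be posted to every worker here;
    // once all respond, the main thread runs the callback and clears the flag.
    merge_complete_cb();
    merge_in_progress_ = false;
  }
};
```

Keeping both means debug builds fail fast while release builds degrade to a skipped flush instead of undefined behavior.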
Yeah. Let us see. |
@jmarantz @ramaraochavali I have more crash dumps sporadically rolling in and I'm almost positive now the issue is in the histogram code. I already see at least one issue through code inspection but there might be more. Let me investigate a bit and I will get back. @jmarantz @ramaraochavali any work that can be done to get valgrind working, load test, etc. would be pretty nice. |
^ is a failure on a debug build running on a production box. This looks like a real assert. |
@mrice32 this is almost a smoking gun for my theory. An alternate PR could be quickly constructed that just blocks the overlapped merge, which might make the problem go away. Or if Matt merges #3228 either in the repo or just locally for his testing, then it would actually show as an assert. @mattklein123 wdyt? |
@jmarantz I'm running debug build now w/ asserts. Let's just debug a bit. I think we can figure this out relatively soon. |
@postwait can you take a look at ^ stack trace and see if that quickly means anything to you? I'm looking at the code now. |
I do see some issues in the original commit that need to be fixed on the Envoy side, but I kind of doubt those are causing the crashes I'm seeing. The above stack trace/assert is highly suspicious and I suspect we might be hitting a bug in the library where accumulation is not working correctly in some case. I'm going to recompile w/ symbols and see if I can get a core dump with symbols. |
I'm able to repro this somewhat regularly on one of our production hosts. I'm going to deploy a build with debug symbols and get a core dump which should allow me to inspect the histograms and see what is going on. Following the library code is pretty difficult for the uninitiated. |
I've not seen this issue before. To summarize the library intent: you wish to merge a set of histograms with (possibly) varying bins. We iterate through the ordered bins of each, counting uniques. We assign that count and allocate that number of bins in a target histogram. We iterate again, accumulating those bins into the new target histogram. This could happen if one of the source histograms were modified in another thread while this operation was happening... All inputs to the merge function are unprotected; synchronization must be handled by the calling app (which must guarantee that they will not be modified by another thread). All this said, I'll pore over this some more and see if I can find issues. This merge function is used extensively in our database to perform very large reductions of time-offset histograms and we've never seen a crash like this one. Understanding a bit more context might be helpful; do you use something like backtrace.io to catch faults? That would allow a bit more insight into the stack vars and such. |
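To make that two-pass shape concrete, here is a minimal C++ sketch; the Bin struct and merge() signature are illustrative assumptions, not the circllhist API. The target is sized by pass 1 and filled by pass 2, which is exactly why a source mutating between passes is dangerous:

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

struct Bin { int64_t bucket; uint64_t count; };  // sorted by bucket per source

std::vector<Bin> merge(const std::vector<std::vector<Bin>>& sources) {
  // Pass 1: count the unique buckets across all sources, then allocate
  // exactly that many bins in the target.
  std::set<int64_t> buckets;
  for (const auto& src : sources)
    for (const auto& bin : src) buckets.insert(bin.bucket);
  std::vector<Bin> target;
  target.reserve(buckets.size());
  for (int64_t b : buckets) target.push_back(Bin{b, 0});

  // Pass 2: walk each source again, accumulating into the pre-sized target.
  for (const auto& src : sources) {
    std::size_t i = 0;
    for (const auto& bin : src) {
      while (i < target.size() && target[i].bucket < bin.bucket) ++i;
      if (i < target.size() && target[i].bucket == bin.bucket) {
        target[i].count += bin.count;  // matched bin: accumulate
      }
      // Unmatched buckets are silently skipped here; a C implementation
      // writing through raw pointers would corrupt the heap at this point.
    }
  }
  return target;
}
```

If a source gains a bucket between the passes, the target was never sized for it; that violated precondition is precisely the corruption theory discussed in this thread.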
@postwait I'm gathering a core dump w/ full symbols. I've been looking at the library code and I also agree this is likely a concurrency issue. I'm trying to track down where/how this might be happening. |
@mattklein123 thanks much for digging in to this. Any pointers from you on what I should look at to help this debugging?
Based on the stack, I am suspecting a couple of things:
1. runOnAllThreads might be calling the post_complete_callback before all workers have finished, so the worker thread's currentIndex is still pointing to the histogram (so it is writing to it) while merge is also working on the same histogram?
2. The merge-method micro-optimization that we did at the last minute: instead of creating a temp array we just passed a direct pointer. This should not cause an issue, but it is one area of code to look at. @jmarantz WDYT?
To eliminate the suspicion of a concurrency issue, should we just temporarily introduce a lock in the merge method and see if the problem goes away?
|
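As a concrete reading of the lock experiment proposed above, the diagnostic could look like the sketch below; Histogram and the hist_* functions are stand-ins, not the real library, and this would be a temporary probe rather than a fix:

```cpp
#include <cstdint>
#include <mutex>

struct Histogram {};                                    // opaque stand-in
inline void hist_insert(Histogram*, uint64_t) {}        // library stand-ins
inline void hist_accumulate(Histogram*, Histogram*) {}

std::mutex histogram_lock;  // one coarse lock, purely for the experiment

// If serializing every write against every merge makes the crash vanish,
// that points at a concurrency bug; if it still crashes (as reported later
// in this thread), the corruption likely lies elsewhere.
void recordValueLocked(Histogram* hist, uint64_t value) {
  std::lock_guard<std::mutex> guard(histogram_lock);
  hist_insert(hist, value);
}

void mergeLocked(Histogram* target, Histogram* source) {
  std::lock_guard<std::mutex> guard(histogram_lock);
  hist_accumulate(target, source);
}
```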
Neither of those seem likely to me; I just looked at those. I think this is some subtle threading thing. There's an inherent safety in the fact that most data is thread-local, except during merges. Because the merges are posted to the main thread, there is no possibility of actually doing them concurrently. However, @mrice32 and I were discussing whether more barriers were needed to transmit data cross-thread reliably; in particular, I was interested in the views different CPUs might have of ThreadLocalHistogramImpl::current_active_.
|
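In sketch form, the double-buffered pattern being discussed looks roughly like this; the names echo Envoy's ThreadLocalHistogramImpl, but the class below is a simplification and the hist_* calls are stand-ins:

```cpp
#include <cstdint>

struct Histogram {};                                   // opaque stand-in
inline void hist_insert(Histogram*, uint64_t) {}       // library stand-ins
inline void hist_accumulate(Histogram*, Histogram*) {}
inline void hist_clear(Histogram*) {}

class ThreadLocalHistogramSketch {
public:
  // Worker thread: always write into the currently active buffer.
  void recordValue(uint64_t value) {
    hist_insert(&histograms_[current_active_], value);
  }
  // Posted to the worker thread at the start of a flush: flip buffers so
  // the worker writes to the other histogram from now on.
  void beginMerge() { current_active_ = otherIndex(); }
  // Main thread, after every worker has acknowledged the flip: the inactive
  // buffer is quiescent, so accumulate and clear it.
  void merge(Histogram* target) {
    Histogram* hist = &histograms_[otherIndex()];
    hist_accumulate(target, hist);
    hist_clear(hist);
  }

private:
  uint64_t otherIndex() const { return 1 - current_active_; }
  // The open question above: since the flip runs on the owning worker via a
  // cross-thread post, does current_active_ need to be std::atomic for other
  // CPUs to see the new value, or is the post itself a sufficient barrier?
  uint64_t current_active_{0};
  Histogram histograms_[2];
};
```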
Assuming the posting machinery is working correctly, current_active_ should not need to be atomic, but obviously it can't hurt to make it so. I'm working on getting a TSAN/ASAN build running in our production environment, which should make it pretty clear whether this is a threading issue or not. I already tried some locking and it still crashed, which is somewhat surprising, so it's still not out of the realm of possibility that it's a library issue, though from my reading of the code I find that unlikely. |
@mattklein123 Thank you. Let us see what TSAN/ASAN shows us. FWIW, today I tried with TSAN on Mac with 50 threads of the same service sending requests continuously. TSAN did not report any errors; not sure if this load is sufficient to trigger the condition. I am still puzzled as to why locking did not help, though it is inefficient. Will wait for Matt's results from the TSAN/ASAN run in production. |
Update, |
ASSERT hit on ASAN build also. This is very confusing. Will keep poking. |
I'd love to explore a core to look at the internals of all merging histograms. |
@postwait I can get you a core dump. Let me email you offline. |
Just a small update here. I've still been plugging away at this. I've tried the following things:
I'm now running a test using extra assertions provided by @postwait. The rough things I've tried are shown here: 7499284. I will keep thinking about this, but I've been through the Envoy-side code really carefully, and given the extra locking it's really difficult to see where the corruption would be coming from (also no hits on ASAN/TSAN). |
@mattklein123 @postwait Couple of thoughts
|
I applied the following diff on top of your current master:
I'm able to cause the last assert in that diff to hit:
From my read of the code there is nothing in [...]. One thing I noticed is that [...]. I'm almost 100% positive at this point that there is no threading issue happening on the Envoy side. |
@mattklein123 thanks for the update. I think a complete reallocation might just help if the target histogram is the problem, which seems to be the case as I mentioned above... |
Yes, given that I now think this is some type of library overflow issue, I would not be surprised if reallocation fixes it. I'm running a test now and will report back. |
And if the source is not the problem, I think you are 100% right that there is no threading issue on the Envoy side. |
This patch looks like it fixes the problem:
I'm pretty positive at this point that [...]. The reason I think this always shows up on the same stat in my debugging (listener.0.0.0.0_9300.downstream_cx_length_ms) is that this is one of our edge boxes, and the connection length ms stat is going to see an almost endless array of values (in our env, ~0 - 900,000). Over time I think we get into a situation in which only using hist_clear() leads to bin growth until we overflow. @postwait let us know how else we can help debug. |
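The reallocation workaround described above amounts to roughly the following shape; hist_alloc/hist_free/hist_clear are the circllhist entry points referenced in this thread, with exact signatures assumed for illustration:

```cpp
extern "C" {
typedef struct histogram histogram_t;  // opaque circllhist handle (assumed)
histogram_t* hist_alloc(void);
void hist_free(histogram_t* hist);
}

// After each merge, drop the allocation entirely instead of clearing it in
// place, so any internal bin bookkeeping starts over from empty.
void resetAfterMerge(histogram_t*& hist) {
  // Previous approach: hist_clear(hist) reused the existing allocation.
  hist_free(hist);
  hist = hist_alloc();
}
```

This sidesteps whatever state hist_clear() leaves behind, at the cost of an allocation per flush interval.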
@mattklein123 thanks for the update. Good to know reallocation at least fixed the issue. Should we remove the locks we added for recordValue and merge and see, since they seem not to be required? |
Once the underlying library issue is understood/fixed I will clean up the change, test, and do a PR. There are some fixes that need to go in. |
sure. SGTM. |
Small correction, looks like |
I'll have some other engineers here look at this as well. We'll try to both statically analyze and repeat. |
@mattklein123 Awesome news. Thank you. @postwait thanks for fixing it so quickly. |
This change unreverts: #3130 #3183 #3219. Also fixes #3223. Please see the 2nd commit for the actual changes. The changes are:
- Bring in a new histogram library version with the fix (and more debug checking).
- Fix a small issue with scope iteration during merging.
- Remove de-dup for histograms until we iterate to shared storage for overlapping scope histograms. Personally, I found this very confusing when debugging and I think the way I changed it is better for now given the code we have.
Signed-off-by: Matt Klein <mklein@lyft.com>