memory corruption in histogram change #3223
There is a memory corruption issue somewhere in the following commit range: b9e7961...a328bc5
The crash signatures are not consistent, but when it first happened I was pretty sure it was somehow related to e293ffb. However, I reverted that, and we are still seeing crashes. (I also spent 2 hours looking at the code and it seems fine.)
Given the changes in the above range, I think the next most likely candidate is the histogram change: 9a1e49f
I'm going to spend time today digging into this, but I wanted to open this in case anyone else wants to scan through the diffs and help with bug spotting.
cc @envoyproxy/maintainers @jmarantz @mrice32 @ramaraochavali
Comments
Have you found a way to repro? |
No, it's a low rate crash. No repro currently, just load/time. |
Can you recover any stacktrace? |
I'm going to spin up running all tests under valgrind, but in terms of bug-finding, we may also need to look at the circllhist C impl, or the interaction with it, as opposed to just the diffs in Envoy's repo. |
valgrind is noisy, but I have a pending branch somewhere to allow for our rampant mismatched new/malloc/free calls (or those in dependencies). Nevertheless, running it changed the timing, and this assert popped up:
[2018-04-26 14:11:43.056][24][critical][assert] source/common/stats/thread_local_store.cc:103] assert failure: !merge_in_progress_
This was in metrics_service_integration_test. This looks familiar, @ramaraochavali |
I have lots of stack traces, but none of them are a smoking gun. When I get to the office in a few hours I will dig in more.
|
Yeah, the !merge_in_progress_ ASSERT was added on the last day. I will dig in to see why the MetricsService integration test is running merge twice and report back.
|
Thanks @ramaraochavali -- Probably we wanted to use either |
Yes. The original implementation was to tie it to !shutting_down_ and skip the callback, but at the last minute we changed it to an ASSERT. |
Yeah, but we should still leave in the conditional, especially as this is not a RELEASE_ASSERT; otherwise, if it happens under load, the hard-to-define behavior leaks through. Of course there's no guarantee this is actually the issue Matt saw. |
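To make the tradeoff concrete, here is a minimal sketch of the guard shape under discussion. StoreSketch and mergeHistograms are simplified stand-ins, not Envoy's actual ThreadLocalStore code, and assert() stands in for Envoy's ASSERT macro:

```cpp
#include <cassert>
#include <functional>

// Simplified stand-in; merge_in_progress_ and shutting_down_ mirror the
// flags mentioned in this thread, everything else is assumed.
struct StoreSketch {
  bool merge_in_progress_{false};
  bool shutting_down_{false};

  void mergeHistograms(const std::function<void()>& merge_complete_cb) {
    assert(!merge_in_progress_);  // debug builds: flag overlapping merges loudly
    if (merge_in_progress_ || shutting_down_) {
      return;  // release builds (assert compiled out): skip rather than overlap
    }
    merge_in_progress_ = true;
    // In the real store, beginMerge() would be posted to every worker here;
    // once all respond, the main thread runs the callback and clears the flag.
    merge_complete_cb();
    merge_in_progress_ = false;
  }
};
```

Keeping both means debug builds fail fast while release builds degrade to a skipped flush instead of undefined behavior.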
Yeah. Let us see. |
@jmarantz @ramaraochavali I have more crash dumps sporadically rolling in and I'm almost positive now the issue is in the histogram code. I already see at least one issue through code inspection but there might be more. Let me investigate a bit and I will get back. @jmarantz @ramaraochavali any work that can be done to get valgrind working, load test, etc. would be pretty nice. |
^ is a failure on a debug build running on a production box. This looks like a real assert. |
@mrice32 this is almost a smoking gun for my theory. An alternate PR could be quickly constructed that just blocks the overlapped merge, which might make the problem go away. Or if Matt merges #3228 either in the repo or just locally for his testing, then it would actually show as an assert. @mattklein123 wdyt? |
@jmarantz I'm running debug build now w/ asserts. Let's just debug a bit. I think we can figure this out relatively soon. |
@postwait can you take a look at ^ stack trace and see if that quickly means anything to you? I'm looking at the code now. |
I do see some issues in the original commit that need to be fixed on the Envoy side, but I kind of doubt those are causing the crashes I'm seeing. The above stack trace/assert is highly suspicious and I suspect we might be hitting a bug in the library where accumulation is not working correctly in some case. I'm going to recompile w/ symbols and see if I can get a core dump with symbols. |
I'm able to repro this somewhat regularly on one of our production hosts. I'm going to deploy a build with debug symbols and get a core dump which should allow me to inspect the histograms and see what is going on. Following the library code is pretty difficult for the uninitiated. |
I've not seen this issue before. To summarize the library intent: you wish to merge a set of histograms with (possibly) varying bins. We iterate through the ordered bins of each, counting uniques. We assign that count and allocate that number of bins in a target histogram. We iterate again, accumulating those bins into the new target histogram. This could happen if one of the source histograms were modified in another thread while this operation was happening... All inputs to the merge function are unprotected; synchronization must be handled by the calling app (which must guarantee that they will not be modified by another thread). All this said, I'll pore over this some more and see if I can find issues. This merge function is used extensively in our database to perform very large reductions of time-offset histograms and we've never seen a crash like this one. Understanding a bit more context might be helpful; do you use something like backtrace.io to catch faults? That would allow a bit more insight into the stack vars and such. |
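To make that two-pass shape concrete, here is a minimal C++ sketch; the Bin struct and merge() signature are illustrative assumptions, not the circllhist API. The target is sized by pass 1 and filled by pass 2, which is exactly why a source mutating between passes is dangerous:

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

struct Bin { int64_t bucket; uint64_t count; };  // sorted by bucket per source

std::vector<Bin> merge(const std::vector<std::vector<Bin>>& sources) {
  // Pass 1: count the unique buckets across all sources, then allocate
  // exactly that many bins in the target.
  std::set<int64_t> buckets;
  for (const auto& src : sources)
    for (const auto& bin : src) buckets.insert(bin.bucket);
  std::vector<Bin> target;
  target.reserve(buckets.size());
  for (int64_t b : buckets) target.push_back(Bin{b, 0});

  // Pass 2: walk each source again, accumulating into the pre-sized target.
  for (const auto& src : sources) {
    std::size_t i = 0;
    for (const auto& bin : src) {
      while (i < target.size() && target[i].bucket < bin.bucket) ++i;
      if (i < target.size() && target[i].bucket == bin.bucket) {
        target[i].count += bin.count;  // matched bin: accumulate
      }
      // Unmatched buckets are silently skipped here; a C implementation
      // writing through raw pointers would corrupt the heap at this point.
    }
  }
  return target;
}
```

If a source gains a bucket between the passes, the target was never sized for it; that violated precondition is precisely the corruption theory discussed in this thread.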
@postwait I'm gathering a core dump w/ full symbols. I've been looking at the library code and I also agree this is likely a concurrency issue. I'm trying to track down where/how this might be happening. |
@mattklein123 thanks much for digging in to this. Any pointers from you on what I should look at to help this debugging?
Based on the stack, I am suspecting a couple of things:
1. runOnAllThreads might be calling the post_complete_callback before all workers have finished, so the worker thread's currentIndex is still pointing to the histogram (so it is writing to it) while merge is also working on the same histogram?
2. The merge-method micro-optimization that we did at the last minute: instead of creating a temp array we just passed a direct pointer. This should not cause an issue, but it is one area of code to look at. @jmarantz WDYT?
To eliminate the suspicion of a concurrency issue, should we just temporarily introduce a lock in the merge method and see if the problem goes away?
|
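As a concrete reading of the lock experiment proposed above, the diagnostic could look like the sketch below; Histogram and the hist_* functions are stand-ins, not the real library, and this would be a temporary probe rather than a fix:

```cpp
#include <cstdint>
#include <mutex>

struct Histogram {};                                    // opaque stand-in
inline void hist_insert(Histogram*, uint64_t) {}        // library stand-ins
inline void hist_accumulate(Histogram*, Histogram*) {}

std::mutex histogram_lock;  // one coarse lock, purely for the experiment

// If serializing every write against every merge makes the crash vanish,
// that points at a concurrency bug; if it still crashes (as reported later
// in this thread), the corruption likely lies elsewhere.
void recordValueLocked(Histogram* hist, uint64_t value) {
  std::lock_guard<std::mutex> guard(histogram_lock);
  hist_insert(hist, value);
}

void mergeLocked(Histogram* target, Histogram* source) {
  std::lock_guard<std::mutex> guard(histogram_lock);
  hist_accumulate(target, source);
}
```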
Neither of those seem likely to me; I just looked at those. I think this is some subtle threading thing. There's an inherent safety in the fact that most data is thread-local, except during merges. Because the merges are posted to the main thread, there is no possibility of actually doing them concurrently. However, @mrice32 and I were discussing whether more barriers were needed to transmit data cross-thread reliably; in particular, I was interested in the views different CPUs might have of ThreadLocalHistogramImpl::current_active_.
|
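In sketch form, the double-buffered pattern being discussed looks roughly like this; the names echo Envoy's ThreadLocalHistogramImpl, but the class below is a simplification and the hist_* calls are stand-ins:

```cpp
#include <cstdint>

struct Histogram {};                                   // opaque stand-in
inline void hist_insert(Histogram*, uint64_t) {}       // library stand-ins
inline void hist_accumulate(Histogram*, Histogram*) {}
inline void hist_clear(Histogram*) {}

class ThreadLocalHistogramSketch {
public:
  // Worker thread: always write into the currently active buffer.
  void recordValue(uint64_t value) {
    hist_insert(&histograms_[current_active_], value);
  }
  // Posted to the worker thread at the start of a flush: flip buffers so
  // the worker writes to the other histogram from now on.
  void beginMerge() { current_active_ = otherIndex(); }
  // Main thread, after every worker has acknowledged the flip: the inactive
  // buffer is quiescent, so accumulate and clear it.
  void merge(Histogram* target) {
    Histogram* hist = &histograms_[otherIndex()];
    hist_accumulate(target, hist);
    hist_clear(hist);
  }

private:
  uint64_t otherIndex() const { return 1 - current_active_; }
  // The open question above: since the flip runs on the owning worker via a
  // cross-thread post, does current_active_ need to be std::atomic for other
  // CPUs to see the new value, or is the post itself a sufficient barrier?
  uint64_t current_active_{0};
  Histogram histograms_[2];
};
```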
Assuming the posting machinery is working correctly, current_active_ should not need to be atomic, but obviously it can't hurt to make it so. I'm working on getting a TSAN/ASAN build running in our production environment, which should make it pretty clear whether this is a threading issue or not. I already tried some locking and it still crashed, which is somewhat surprising, so it's still not out of the realm of possibility that it's a library issue, though from my reading of the code I find that unlikely. |
@mattklein123 Thank you. Let us see what TSAN/ASAN shows us. FWIW, today I tried with TSAN on Mac with 50 threads of the same service sending requests continuously. TSAN did not report any errors; not sure if this load is sufficient to trigger the condition. I am still puzzled as to why locking did not help, though it is inefficient. Will wait for Matt's results from the TSAN/ASAN run in production. |
Update, |
ASSERT hit on ASAN build also. This is very confusing. Will keep poking. |
I'd love to explore a core to look at the internals of all merging histograms. |
@postwait I can get you a core dump. Let me email you offline. |
Just a small update here. I've still been plugging away at this. I've tried the following things:
I'm now running a test using extra assertions provided by @postwait. The rough things I've tried are shown here: 7499284. I will keep thinking about this, but I've been through the Envoy-side code really carefully, and given the extra locking it's really difficult to see where the corruption would be coming from (also no hits on ASAN/TSAN). |
@mattklein123 @postwait Couple of thoughts
|
I applied the following diff on top of your current master:
I'm able to cause the last assert in that diff to hit:
From my read of the code there is nothing in [...]. One thing I noticed is that [...]. I'm almost 100% positive at this point that there is no threading issue happening on the Envoy side. |
@mattklein123 thanks for the update. I think a complete reallocation might just help if the target histogram is the problem, which seems to be the case as I mentioned above... |
Yes, given that I now think this is some type of library overflow issue, I would not be surprised if reallocation fixes it. I'm running a test now and will report back. |
And if the source is not the problem, I think you are 100% right that there is no threading issue on the Envoy side. |
This patch looks like it fixes the problem:
I'm pretty positive at this point that [...]. The reason I think this always shows up on the same stat in my debugging (listener.0.0.0.0_9300.downstream_cx_length_ms) is that this is one of our edge boxes, and the connection length ms stat is going to see an almost endless array of values (in our env, ~0 - 900,000). Over time I think we get into a situation in which only using hist_clear() leads to bin growth until we overflow. @postwait let us know how else we can help debug. |
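The reallocation workaround described above amounts to roughly the following shape; hist_alloc/hist_free/hist_clear are the circllhist entry points referenced in this thread, with exact signatures assumed for illustration:

```cpp
extern "C" {
typedef struct histogram histogram_t;  // opaque circllhist handle (assumed)
histogram_t* hist_alloc(void);
void hist_free(histogram_t* hist);
}

// After each merge, drop the allocation entirely instead of clearing it in
// place, so any internal bin bookkeeping starts over from empty.
void resetAfterMerge(histogram_t*& hist) {
  // Previous approach: hist_clear(hist) reused the existing allocation.
  hist_free(hist);
  hist = hist_alloc();
}
```

This sidesteps whatever state hist_clear() leaves behind, at the cost of an allocation per flush interval.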
@mattklein123 thanks for the update. Good to know reallocation at least fixed the issue. Should we remove the locks we added for recordValue and merge and see, since they seem not to be required? |
Once the underlying library issue is understood/fixed I will clean up the change, test, and do a PR. There are some fixes that need to go in. |
sure. SGTM. |
Small correction, looks like |
I'll have some other engineers here look at this as well. We'll try to both statically analyze and repeat. |
@mattklein123 Awesome news. Thank you. @postwait thanks for fixing it so quickly. |
This change unreverts: #3130 #3183 #3219. Also fixes #3223. Please see the 2nd commit for the actual changes. The changes are:
- Bring in a new histogram library version with the fix (and more debug checking).
- Fix a small issue with scope iteration during merging.
- Remove de-dup for histograms until we iterate to shared storage for overlapping scope histograms. Personally, I found this very confusing when debugging and I think the way I changed it is better for now given the code we have.
Signed-off-by: Matt Klein <mklein@lyft.com>