
memory corruption in histogram change #3223

Closed
mattklein123 opened this issue Apr 26, 2018 · 43 comments · Fixed by #3262
@mattklein123 (Member) commented Apr 26, 2018

There is a memory corruption issue somewhere in the following commit range:
b9e7961...a328bc5

The crash signatures are not consistent, but when it first happened I was pretty sure it was somehow related to e293ffb.

However, I reverted that, and we are still seeing crashes. (I also spent 2 hours looking at the code and it seems fine.)

Given the changes in the above range, I think the next most likely candidate is the histogram change: 9a1e49f

I'm going to spend time today digging into this, but I wanted to open this in case anyone else wants to scan through the diffs and help with bug spotting.

cc @envoyproxy/maintainers @jmarantz @mrice32 @ramaraochavali

@jmarantz (Contributor):

Have you found a way to repro?

@mattklein123 (Member, Author):

No, it's a low rate crash. No repro currently, just load/time.

@jmarantz (Contributor):

Can you recover any stacktrace?

@jmarantz (Contributor):

I'm going to spin up running all tests under valgrind, but in terms of bug-finding, we may also need to look at the circllhist C implementation, or our interaction with it, as opposed to just the diffs in Envoy's repo.

@jmarantz (Contributor):

valgrind is noisy, and I have a pending branch somewhere to allow for our rampant mismatched new/malloc/free calls (and those in dependencies). Nevertheless, running it changed the timing, and this assert popped up:

[2018-04-26 14:11:43.056][24][critical][assert] source/common/stats/thread_local_store.cc:103] assert failure: !merge_in_progress_

This was in metrics_service_integration_test. This looks familiar, @ramaraochavali.

@mattklein123 (Member, Author) commented Apr 26, 2018 via email

@ramaraochavali (Contributor) commented Apr 26, 2018 via email

@jmarantz (Contributor):

Thanks @ramaraochavali -- probably we wanted to either use RELEASE_ASSERT, or else leave !merge_in_progress_ in the conditional along with !shutting_down_.

@ramaraochavali (Contributor):

Yes. The original implementation tied it to !shutting_down_ and skipped the callback, but at the last minute we changed it to an ASSERT.

@jmarantz (Contributor):

Yeah, but we should still leave in the conditional, especially as this is not a RELEASE_ASSERT. Otherwise, if it happens under load, the hard-to-define behavior leaks through. Of course there's no guarantee this is actually the issue Matt saw.
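
For illustration, a minimal sketch of the two approaches under discussion, using stand-in names rather than the real Envoy classes (assert() here stands in for Envoy's ASSERT macro, which likewise compiles out of release builds):

#include <cassert>
#include <functional>

// Stand-in sketch, not the actual Envoy code. A bare assert compiles out of
// release builds, so under production load an overlapped merge would proceed
// with hard-to-define results; keeping the conditional guards release builds.
struct MergeSketch {
  bool shutting_down_ = false;
  bool merge_in_progress_ = false;

  void mergeHistograms(std::function<void()> merge_complete_cb) {
    assert(!merge_in_progress_); // flags the unexpected overlap in debug builds
    if (shutting_down_ || merge_in_progress_) {
      return; // release builds skip the overlapped merge instead of racing
    }
    merge_in_progress_ = true;
    // ... post merge work to the worker threads; the completion path would
    // clear merge_in_progress_ and then invoke merge_complete_cb ...
    merge_in_progress_ = false;
    merge_complete_cb();
  }
};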

@ramaraochavali (Contributor):

Yeah. Let us see.

@mattklein123 (Member, Author):

@jmarantz @ramaraochavali I have more crash dumps sporadically rolling in and I'm almost positive now the issue is in the histogram code. I already see at least one issue through code inspection but there might be more. Let me investigate a bit and I will get back. @jmarantz @ramaraochavali any work that can be done to get valgrind working, load test, etc. would be pretty nice.

@jmarantz (Contributor):

PRs issued for RELEASE_ASSERT (#3228) and valgrind helper (#3227)

@mattklein123 (Member, Author) commented Apr 26, 2018

#0  0x00007f99a2f671fb in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x00000000010affa3 in Envoy::SignalAction::sigHandler(int, siginfo_t*, void*) ()
#2  <signal handler called>
#3  0x00007f99a28bec37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#4  0x00007f99a28c2028 in __GI_abort () at abort.c:89
#5  0x00007f99a28b7bf6 in __assert_fail_base (fmt=0x7f99a2a0c058 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x151538e "tgtidx < tgt->allocd", file=file@entry=0x151524f "external/com_github_circonus_labs_libcircllhist/src/circllhist.c", line=line@entry=844, function=function@entry=0x15153a3 "void internal_bucket_accum(histogram_t *, int, histogram_t *, int)") at assert.c:92
#6  0x00007f99a28b7ca2 in __GI___assert_fail (assertion=0x151538e "tgtidx < tgt->allocd", file=0x151524f "external/com_github_circonus_labs_libcircllhist/src/circllhist.c", line=844, function=0x15153a3 "void internal_bucket_accum(histogram_t *, int, histogram_t *, int)") at assert.c:101
#7  0x0000000000fcef2c in internal_bucket_accum ()
#8  0x0000000000fcee57 in hist_needed_merge_size_fc ()
#9  0x0000000000fcea77 in hist_accumulate ()
#10 0x00000000009d9538 in Envoy::Stats::ThreadLocalHistogramImpl::merge(histogram*) ()
#11 0x00000000009d9ca7 in Envoy::Stats::ParentHistogramImpl::merge() ()
#12 0x00000000009d74e4 in Envoy::Stats::ThreadLocalStoreImpl::mergeInternal(std::function<void ()>) ()
#13 0x00000000009dac00 in Envoy::Stats::ThreadLocalStoreImpl::mergeHistograms(std::function<void ()>)::$_2::operator()() const ()
#14 0x00000000009da9ed in std::_Function_handler<void (), Envoy::Stats::ThreadLocalStoreImpl::mergeHistograms(std::function<void ()>)::$_2>::_M_invoke(std::_Any_data const&) ()
#15 0x000000000061ac4e in std::function<void ()>::operator()() const ()
#16 0x000000000098bec7 in Envoy::Event::DispatcherImpl::runPostCallbacks() ()
#17 0x000000000098c8d8 in Envoy::Event::DispatcherImpl::DispatcherImpl(std::unique_ptr<Envoy::Buffer::WatermarkFactory, std::default_delete<Envoy::Buffer::WatermarkFactory> >&&)::$_1::operator()() const ()
#18 0x000000000098c78d in std::_Function_handler<void (), Envoy::Event::DispatcherImpl::DispatcherImpl(std::unique_ptr<Envoy::Buffer::WatermarkFactory, std::default_delete<Envoy::Buffer::WatermarkFactory> >&&)::$_1>::_M_invoke(std::_Any_data const&) ()
#19 0x000000000061ac4e in std::function<void ()>::operator()() const ()
#20 0x00000000009921fd in Envoy::Event::TimerImpl::TimerImpl(Envoy::Event::DispatcherImpl&, std::function<void ()>)::$_0::operator()(int, short, void*) const ()
#21 0x00000000009921c8 in Envoy::Event::TimerImpl::TimerImpl(Envoy::Event::DispatcherImpl&, std::function<void ()>)::$_0::__invoke(int, short, void*) ()
#22 0x0000000000f93343 in event_process_active_single_queue ()
#23 0x0000000000f9036e in event_base_loop ()
#24 0x000000000098be1d in Envoy::Event::DispatcherImpl::run(Envoy::Event::Dispatcher::RunType) ()
#25 0x000000000092f21b in Envoy::Server::InstanceImpl::run() ()
#26 0x0000000000588758 in Envoy::MainCommonBase::run() ()
#27 0x000000000058731c in Envoy::MainCommon::run() ()
#28 0x0000000000586bcd in main ()

^ is a failure on a debug build running on a production box. This looks like a real assert.

@jmarantz (Contributor):

@mrice32 this is almost a smoking gun for my theory. An alternate PR could be quickly constructed that just blocks the overlapped merge, which might make the problem go away. Or if Matt merges #3228 either in the repo or just locally for his testing, then it would actually show as an assert. @mattklein123 wdyt?

@mattklein123 (Member, Author):

@jmarantz I'm running debug build now w/ asserts. Let's just debug a bit. I think we can figure this out relatively soon.

@mattklein123 (Member, Author):

@postwait can you take a look at ^ stack trace and see if that quickly means anything to you? I'm looking at the code now.

@mattklein123 (Member, Author) commented Apr 26, 2018

I do see some issues in the original commit that need to be fixed on the Envoy side, but I kind of doubt those are causing the crashes I'm seeing. The above stack trace/assert is highly suspicious and I suspect we might be hitting a bug in the library where accumulation is not working correctly in some case. I'm going to recompile w/ symbols and see if I can get a core dump with symbols.

@mattklein123 (Member, Author):

I'm able to repro this somewhat regularly on one of our production hosts. I'm going to deploy a build with debug symbols and get a core dump which should allow me to inspect the histograms and see what is going on. Following the library code is pretty difficult for the uninitiated.

@postwait:

I've not seen this issue before. To summarize the library's intent:

You wish to merge a set of histograms with (possibly) varying bins.

We iterate through the ordered bins of each source, counting unique bins.

We assign that count and allocate that number of bins in a target histogram.

We iterate again, accumulating those bins into the new target histogram.

This could happen if one of the source histograms were modified in another thread while this operation was happening. All inputs to the merge function are unprotected; thread safety must be handled by the calling app (it must guarantee that they will not be modified by another thread).

All this said, I'll pore over this some more and see if I can find issues. This merge function is used extensively in our database to perform very large reductions of time-offset histograms, and we've never seen a crash like this one. Understanding a bit more context might be helpful: do you use something like backtrace.io to catch faults? That would allow a bit more insight into the stack vars and such.
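
For readers unfamiliar with the library, a rough model of that two-pass merge, using an ordered std::map as a stand-in for the library's sorted bin arrays (names here are hypothetical, not the real circllhist internals):

#include <cstdint>
#include <map>
#include <vector>

using Bin = int64_t;                        // stand-in for hist_bucket_t
using Histogram = std::map<Bin, uint64_t>;  // bin -> sample count, kept ordered

Histogram merge(const std::vector<const Histogram*>& sources) {
  Histogram target;
  // Pass 1: walk the ordered bins of every source, collecting the union of
  // unique bins. The real code sizes the target's bin allocation from this
  // count, which is exactly what goes stale if a source mutates concurrently.
  for (const Histogram* src : sources) {
    for (const auto& kv : *src) {
      target.emplace(kv.first, 0);
    }
  }
  // Pass 2: accumulate each source's counts into the pre-sized target.
  for (const Histogram* src : sources) {
    for (const auto& kv : *src) {
      target[kv.first] += kv.second;
    }
  }
  return target;
}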

@mattklein123 (Member, Author):

@postwait I'm gathering a core dump w/ full symbols. I've been looking at the library code and I also agree this is likely a concurrency issue. I'm trying to track down where/how this might be happening.

@ramaraochavali (Contributor):

@mattklein123 thanks very much for digging in to this. Any pointers on what I should look at to help with the debugging?
Based on the stack, I am suspecting a couple of things:

  1. Might runOnAllThreads be calling the post_complete_callback before all workers have finished? If so, a worker thread's currentIndex could still be pointing at the histogram (so it is writing to it) while merge is working on the same histogram.
  2. The merge-method micro-optimization that we made at the last minute: instead of creating a temp array we just passed a direct pointer. This should not cause an issue, but it is one area of code to look at. @jmarantz WDYT?

To eliminate the suspicion of a concurrency issue, should we just temporarily introduce a lock in the merge method and see if the problem goes away? (A rough sketch of that diagnostic follows below.)
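
A minimal sketch of that diagnostic locking, with illustrative member names (the real circllhist calls appear later in this thread):

#include <cstdint>
#include <mutex>

// Hypothetical sketch: serialize the worker-side write path against the
// main-thread merge. If a data race is the culprit, the crashes should stop
// under this lock, at an obvious throughput cost.
struct TlsHistogramSketch {
  std::mutex lock_;
  // histogram_t* histograms_[2]; // active/inactive pair, flipped at merge time

  void recordValue(uint64_t value) {
    std::lock_guard<std::mutex> lock(lock_);
    // ... insert value into histograms_[current index] ...
    (void)value;
  }

  void merge(/* histogram_t* target */) {
    std::lock_guard<std::mutex> lock(lock_);
    // hist_accumulate(target, &histograms_[other index], 1);
    // hist_clear(histograms_[other index]);
  }
};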

@jmarantz (Contributor) commented Apr 27, 2018 via email

@mattklein123 (Member, Author):

Assuming the posting machinery is working correctly, current_active_ should not need to be atomic, but obviously it can't hurt to make it so. I'm working on getting a TSAN/ASAN build running in our production environment. This should make it pretty clear as to whether this is a threading issue or not. I already tried some locking and it still crashed which is somewhat surprising, so it's still not out of the realm of possibility that it's a library issue, though from my looking at the code I find that unlikely.

@ramaraochavali (Contributor) commented Apr 27, 2018

@mattklein123 Thank you. Let us see what TSAN/ASAN shows us. FWIW, today I tried with TSAN on a Mac, with 50 threads of the same service sending requests continuously. TSAN did not report any errors; I'm not sure if that load is sufficient to trigger the condition. I am still puzzled as to why locking did not help, though it is inefficient. Will wait for Matt's results from the TSAN/ASAN run in production.

@mattklein123 (Member, Author):

Update:

2018-04-27T15:57:02.42857 envoy: external/com_github_circonus_labs_libcircllhist/src/circllhist.c:844: void internal_bucket_accum(histogram_t *, int, histogram_t *, int): Assertion `tgtidx < tgt->allocd' failed.

fired on the TSAN build with no TSAN error. Going to try ASAN now.

@mattklein123 (Member, Author):

ASSERT hit on ASAN build also. This is very confusing. Will keep poking.

@postwait:

I'd love to explore a core to look at the internals of all merging histograms.

@mattklein123 (Member, Author):

@postwait I can get you a core dump. Let me email you offline.

@mattklein123 changed the title from "memory corruption in recent commits" to "memory corruption in histogram change" on Apr 27, 2018
@mattklein123 (Member, Author):

Just a small update here. I've still been plugging away at this. I've tried the following things:

  1. Making current_active_ atomic
  2. Adding complete locking around record/merge
  3. Some small other fixes around scope listing during merge
  4. Using "fast" histograms

I'm now running a test using extra assertions provided by @postwait. The rough things I've tried are shown here: 7499284

I will keep thinking about this, but I've been through the Envoy-side code really carefully, and given the extra locking it's really difficult to see where the corruption would be coming from (also no hits on ASAN/TSAN).

@ramaraochavali (Contributor):

@mattklein123 @postwait A couple of thoughts:

  1. If the issue is with the source histogram, can we try cloning it with hist_clone and passing the clone to merge? This would completely eliminate any doubt about it being modified by someone else while hist_accumulate is executing. The lock that @mattklein123 introduced does the same thing, but this is a way to try it from the library's side; since the interval histogram may not hold a huge number of values, the clone should be fast. @postwait WDYT? (A sketch of this follows below.)
  2. If the assert is due to the target histogram (based on the conversation I suppose this is not the problem), we could try reallocating it before merge (instead of hist_clear()).
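
A sketch of the clone-before-merge idea from point 1, assuming libcircllhist's hist_clone()/hist_free() API (the hist_accumulate signature matches the stack traces above):

extern "C" {
#include "circllhist.h" // libcircllhist public API
}

// Hypothetical helper: hist_accumulate only ever sees a private snapshot, so
// no concurrent writer can invalidate the bin count between the two merge
// passes. Cost: one copy of the (small) interval histogram per merge.
void mergeViaClone(histogram_t* target, histogram_t* source) {
  histogram_t* snapshot = hist_clone(source);
  const histogram_t* const srcs[] = {snapshot};
  hist_accumulate(target, srcs, 1);
  hist_free(snapshot);
}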

@mattklein123 (Member, Author):

@postwait:

I applied the following diff on top of your current master:

mklein@localhost:~/Source/libcircllhist$ git diff HEAD~1
diff --git a/src/circllhist.c b/src/circllhist.c
index 23ad46f..ceaf202 100644
--- a/src/circllhist.c
+++ b/src/circllhist.c
@@ -54,14 +54,14 @@ static union {
 } private_nan_union = { .private_nan_internal_rep = 0x7fffffffffffffff };
 
 #define MAX_HIST_BINS (2 + 2 * 90 * 256)
-#ifdef DEBUG
+//#ifdef DEBUG
 #define assert_good_hist(h) do { \
   assert(h->allocd <= MAX_HIST_BINS); \
   assert(h->used <= h->allocd); \
 } while(0)
-#else
-#define assert_good_hist(h)
-#endif
+//#else
+//#define assert_good_hist(h)
+//#endif
 #define private_nan private_nan_union.private_nan_double_rep
 
 static double power_of_ten[256] = {
@@ -823,6 +823,9 @@ hist_needed_merge_size_fc(histogram_t **hist, int cnt,
                                     histogram_t *src, int srcidx),
                           histogram_t *tgt) {
   assert_good_hist(hist[0]);
+  if (tgt) {
+     assert_good_hist(tgt);
+  }
   unsigned short idx_static[8192];
   unsigned short *idx = idx_static;
   int i, count = 0;
@@ -831,6 +834,9 @@ hist_needed_merge_size_fc(histogram_t **hist, int cnt,
     if(!idx) return -1;
   }
   memset(idx, 0, cnt * sizeof(*idx));
+  if (tgt) {
+     assert_good_hist(tgt);
+  }
   while(1) {
     hist_bucket_t smallest = { .exp = 0, .val = 0 };
     for(i=0;i<cnt;i++)
@@ -944,6 +950,7 @@ hist_accumulate(histogram_t *tgt, const histogram_t* const *src, int cnt) {
   tgt->used = 0;
   if (! tgt->allocd) tgt->allocd = 1;
   tgt->bvs = tgt->allocator->calloc(tgt->allocd, sizeof(*tgt->bvs));
+  assert_good_hist(tgt);
   hist_needed_merge_size_fc(inclusive_src, cnt+1, internal_bucket_accum, tgt);
   if(oldtgtbuff) tgt->allocator->free(oldtgtbuff);
   if(inclusive_src != inclusive_src_static) free(inclusive_src);

I'm able to cause the last assert in that diff to hit:

(gdb) bt
#0  0x00007f60488701fb in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x00000000010b9193 in Envoy::SignalAction::sigHandler (sig=6, info=0x7f60490aa4b0, context=0x7f60490aa380) at external/envoy/source/exe/signal_action.cc:40
#2  <signal handler called>
#3  0x00007f60481c7c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#4  0x00007f60481cb028 in __GI_abort () at abort.c:89
#5  0x00007f60481c0bf6 in __assert_fail_base (fmt=0x7f6048315058 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x151e990 "tgt->allocd <= MAX_HIST_BINS", file=file@entry=0x151e6b0 "external/com_github_circonus_labs_libcircllhist/src/circllhist.c", line=line@entry=953, 
    function=function@entry=0x151ea4a "int hist_accumulate(histogram_t *, const histogram_t *const *, int)") at assert.c:92
#6  0x00007f60481c0ca2 in __GI___assert_fail (assertion=0x151e990 "tgt->allocd <= MAX_HIST_BINS", file=0x151e6b0 "external/com_github_circonus_labs_libcircllhist/src/circllhist.c", line=953, function=0x151ea4a "int hist_accumulate(histogram_t *, const histogram_t *const *, int)") at assert.c:101
#7  0x0000000000fd796e in hist_accumulate (tgt=0x3fb5da0, src=0x3fd7748, cnt=1) at external/com_github_circonus_labs_libcircllhist/src/circllhist.c:953
#8  0x00000000009dce99 in Envoy::Stats::ThreadLocalHistogramImpl::merge (this=0x3fd76d0, target=0x3fb5da0) at external/envoy/source/common/stats/thread_local_store.cc:340
#9  0x00000000009dd637 in Envoy::Stats::ParentHistogramImpl::merge (this=0x3fce6c0) at external/envoy/source/common/stats/thread_local_store.cc:377
#10 0x00000000009dabb4 in Envoy::Stats::ThreadLocalStoreImpl::mergeInternal(std::function<void ()>) (this=0x3c40e40, merge_complete_cb=...) at external/envoy/source/common/stats/thread_local_store.cc:122
#11 0x00000000009de510 in Envoy::Stats::ThreadLocalStoreImpl::mergeHistograms(std::function<void ()>)::$_2::operator()() const (this=0x8812960) at external/envoy/source/common/stats/thread_local_store.cc:115
#12 0x00000000009de2fd in std::_Function_handler<void (), Envoy::Stats::ThreadLocalStoreImpl::mergeHistograms(std::function<void ()>)::$_2>::_M_invoke(std::_Any_data const&) (__functor=...) at /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.1/../../../../include/c++/5.4.1/functional:1871
#13 0x000000000061c95e in std::function<void ()>::operator()() const (this=0x7ffe021187d8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.1/../../../../include/c++/5.4.1/functional:2267
#14 0x000000000098f677 in Envoy::Event::DispatcherImpl::runPostCallbacks (this=0x3c41200) at external/envoy/source/common/event/dispatcher_impl.cc:171
#15 0x0000000000990088 in Envoy::Event::DispatcherImpl::DispatcherImpl(std::unique_ptr<Envoy::Buffer::WatermarkFactory, std::default_delete<Envoy::Buffer::WatermarkFactory> >&&)::$_1::operator()() const (this=0x3c381e8) at external/envoy/source/common/event/dispatcher_impl.cc:36
#16 0x000000000098ff3d in std::_Function_handler<void (), Envoy::Event::DispatcherImpl::DispatcherImpl(std::unique_ptr<Envoy::Buffer::WatermarkFactory, std::default_delete<Envoy::Buffer::WatermarkFactory> >&&)::$_1>::_M_invoke(std::_Any_data const&) (__functor=...)
    at /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.1/../../../../include/c++/5.4.1/functional:1871
#17 0x000000000061c95e in std::function<void ()>::operator()() const (this=0x3c381e8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.1/../../../../include/c++/5.4.1/functional:2267
#18 0x00000000009959ad in Envoy::Event::TimerImpl::TimerImpl(Envoy::Event::DispatcherImpl&, std::function<void ()>)::$_0::operator()(int, short, void*) const (this=0x3ca04c0, arg=0x3c38160) at external/envoy/source/common/event/timer_impl.cc:15
#19 0x0000000000995978 in Envoy::Event::TimerImpl::TimerImpl(Envoy::Event::DispatcherImpl&, std::function<void ()>)::$_0::__invoke(int, short, void*) (arg=0x3c38160) at external/envoy/source/common/event/timer_impl.cc:15
#20 0x0000000000f9b6b3 in event_process_active_single_queue (base=0x3ca02c0, activeq=0x3c3e580, max_to_process=2147483647, endtime=0x0) at event.c:1646
#21 0x0000000000f986de in event_process_active (base=<optimized out>) at event.c:1741
#22 event_base_loop (base=0x3ca02c0, flags=<optimized out>) at event.c:1961
#23 0x000000000098f5cd in Envoy::Event::DispatcherImpl::run (this=0x3c41200, type=Envoy::Event::Dispatcher::Block) at external/envoy/source/common/event/dispatcher_impl.cc:161
#24 0x00000000009329cb in Envoy::Server::InstanceImpl::run (this=0x3cc5180) at external/envoy/source/server/server.cc:402
#25 0x0000000000589ba8 in Envoy::MainCommonBase::run (this=0x3c8c998) at external/envoy/source/exe/main_common.cc:83
#26 0x000000000058872c in Envoy::MainCommon::run (this=0x3c8c500) at bazel-out/k8-dbg/bin/external/envoy/source/exe/_virtual_includes/envoy_main_common_lib/exe/main_common.h:44
#27 0x0000000000587fdd in main (argc=18, argv=0x7ffe02118d88) at external/envoy/source/exe/main.cc:30
(gdb) p *tgt
$11 = {
  allocd = 46086, 
  used = 0, 
  fast = 0, 
  allocator = 0x1d68b00 <default_allocator>, 
  bvs = 0x988a000
}
(gdb) p *src[0]
$12 = {
  allocd = 500, 
  used = 418, 
  fast = 0, 
  allocator = 0x1d68b00 <default_allocator>, 
  bvs = 0x4cf5400
}
(gdb) p tgt_copy
$13 = {
  allocd = 46080, 
  used = 46080, 
  fast = 0, 
  allocator = 0x1d68b00 <default_allocator>, 
  bvs = 0x9818000
}

From my read of the code there is nothing in hist_needed_merge_size_fc() that checks for the maximum histogram size, but maybe I'm missing something. (For that matter, I don't see anything obvious in hist_insert_raw() that checks bounds either.)

One thing I noticed is that hist_clear() just sets all used counts to zero. I'm going to run an experiment now to see whether the issue goes away if I completely reallocate the histograms during each interval.

I'm almost 100% positive at this point that there is no threading issue happening on the Envoy side.
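
One detail worth pulling out of the dump above: MAX_HIST_BINS is 2 + 2 * 90 * 256 = 46082, and the dumped target shows allocd = 46086, i.e. the target has already grown past the library's own maximum. That is consistent with the counting pass having no upper bound. A hypothetical illustration of the kind of check being described (names and placement illustrative only):

#include <cassert>

// The library's hard cap on bins, from circllhist.c in the diff above.
constexpr int MAX_HIST_BINS = 2 + 2 * 90 * 256; // == 46082; the dump shows 46086

// Hypothetical shape of the missing bound: wherever the merge-size counting
// pass increments the number of distinct merged buckets, check it before the
// count is used to size the target's bin allocation.
inline void checkMergedBinCount(int count) {
  assert(count <= MAX_HIST_BINS);
}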

@ramaraochavali (Contributor):

@mattklein123 thanks for the update. I think a complete reallocate might just help if the target histogram is the problem, which seems to be the case, as I mentioned above...

@mattklein123 (Member, Author):

Yes, given that, I now think this is some type of library overflow issue, and I would not be surprised if reallocation fixes it. I'm running a test now and will report back.

@ramaraochavali (Contributor):

And if the source is not the problem, I think you are 100% right that there is no threading issue on the Envoy side.

@mattklein123 (Member, Author):

This patch looks like it fixes the problem:

mklein@localhost:~/Source/envoy-private/envoy$ git diff HEAD~1
diff --git a/source/common/stats/thread_local_store.cc b/source/common/stats/thread_local_store.cc
index 0bc82f0..731dd0f 100644
--- a/source/common/stats/thread_local_store.cc
+++ b/source/common/stats/thread_local_store.cc
@@ -338,7 +338,8 @@ void ThreadLocalHistogramImpl::merge(histogram_t* target) {
   std::lock_guard<std::mutex> lock(lock_);
   histogram_t** other_histogram = &histograms_[otherHistogramIndex()];
   hist_accumulate(target, other_histogram, 1);
-  hist_clear(*other_histogram);
+  hist_free(*other_histogram);
+  *other_histogram = hist_alloc();
 }
 
 ParentHistogramImpl::ParentHistogramImpl(const std::string& name, Store& parent,
@@ -368,7 +369,8 @@ bool ParentHistogramImpl::used() const {
 void ParentHistogramImpl::merge() {
   std::unique_lock<std::mutex> lock(merge_lock_);
   if (usedLockHeld()) {
-    hist_clear(interval_histogram_);
+    hist_free(interval_histogram_);
+    interval_histogram_ = hist_alloc();
     // Here we could copy all the pointers to TLS histograms in the tls_histogram_ list,
     // then release the lock before we do the actual merge. However it is not a big deal
     // because the tls_histogram merge is not that expensive as it is a single histogram
@@ -380,7 +382,7 @@ void ParentHistogramImpl::merge() {
     lock.unlock();
     //hist_accumulate(cumulative_histogram_, &interval_histogram_, 1);
     //cumulative_statistics_.refresh(cumulative_histogram_);
-    //interval_statistics_.refresh(interval_histogram_);
+    interval_statistics_.refresh(interval_histogram_);
   }
 }

I'm pretty positive at this point that hist_needed_merge_size_fc (https://github.com/circonus-labs/libcircllhist/blob/master/src/circllhist.c#L821) can overflow count. Then when we go back in to do the copy we start stomping. @postwait this assert is not sufficient, as it won't catch the overflow: https://github.com/circonus-labs/libcircllhist/blob/master/src/circllhist.c#L942. There would need to be a check inside hist_needed_merge_size_fc every time count is incremented.

The reason I think this always shows up on the same stat in my debugging (listener.0.0.0.0_9300.downstream_cx_length_ms) is that this is one of our edge boxes, and the connection-length-ms stat is going to see an almost endless range of values (in our env, ~0 - 900,000). Over time I think we get into a situation in which using only hist_clear() leads to bin growth until we overflow.

@postwait let us know how else we can help debug.
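
An aside on scale, assuming the bin layout implied by MAX_HIST_BINS = 2 + 2 * 90 * 256 (two significant decimal digits per bin, i.e. 90 bins per decade of magnitude): values spanning roughly 0 - 900,000 ms cover about six decades, on the order of 500+ distinct bins. That matches the src histogram in the dump above (allocd = 500, used = 418) and makes this stat a plausible driver of steady bin growth under hist_clear()-only recycling.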

@ramaraochavali (Contributor):

@mattklein123 thanks for the update. Good to know that reallocation at least fixed the issue. Should we remove the locks we added for recordValue and merge and see, since they seem not to be required?

@mattklein123 (Member, Author):

Once the underlying library issue is understood/fixed I will cleanup the change, test, and do a PR. There are some fixes that need to go in.

@ramaraochavali (Contributor):

sure. SGTM.

@mattklein123 (Member, Author):

Small correction: it looks like hist_needed_merge_size_fc is using an int for count, so I doubt that is overflowing, and I guess the assert() is probably good. (Though it might be.) So it's a little less clear to me where the issue is, but hopefully @postwait can spot it.

@postwait:

I'll have some other engineers here look at this as well. We'll try to both statically analyze the code and reproduce the issue.

@mattklein123 (Member, Author):
Member Author

Update: @postwait has fixed an issue in the library and my sanity tests look good. I've unrolled all the extra locking I added and things appear to be running stably. @postwait, let me know when the fix gets officially pushed; then I'll clean up my un-revert branch and push a new PR. Thank you!!!

@ramaraochavali (Contributor):

@mattklein123 Awesome news. Thank you. @postwait thanks for fixing it so quickly.

htuch pushed a commit that referenced this issue May 3, 2018
This change unreverts:
#3130
#3183
#3219
Also fixes #3223

Please see the 2nd commit for the actual changes. The changes are:

  1. Bring in a new histogram library version with the fix (and more debug checking).
  2. Fix a small issue with scope iteration during merging.
  3. Remove de-dup for histograms until we iterate to shared storage for overlapping scope histograms. Personally, I found this very confusing when debugging, and I think the way I changed it is better for now given the code we have.

Signed-off-by: Matt Klein <mklein@lyft.com>
ramaraochavali pushed a commit to ramaraochavali/envoy that referenced this issue May 3, 2018 (the same change as above).