Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core: sigabort in histogram code #688

Closed
junr03 opened this issue Feb 13, 2020 · 15 comments · Fixed by #839
Closed

core: sigabort in histogram code #688

junr03 opened this issue Feb 13, 2020 · 15 comments · Fixed by #839

Comments

@junr03
Copy link
Member

junr03 commented Feb 13, 2020

SIGABRT: 

0  libsystem_kernel.dylib  ___pthread_kill
1  libsystem_pthread.dylib _pthread_kill$VARIANT$armv81
2  libsystem_c.dylib       _abort
3  Lyft                    Envoy::Stats::HistogramStatisticsImpl::refresh(histogram const*) (Lyft)
4  Lyft                    Envoy::Stats::ParentHistogramImpl::merge() (Lyft)
5  Lyft                    Envoy::Stats::ThreadLocalStoreImpl::mergeInternal(std::__1::function<void ()>) (Lyft)
6  Lyft                    std::__1::__function::__func<Envoy::Stats::ThreadLocalStoreImpl::mergeHistograms(std::__1::function<void ()>)::$_2, std::__1::allocator<Envoy::Stats::ThreadLocalStoreImpl::mergeHistograms(std::__1::function<void ()>)::$_2>, void ()>::operator()() (Lyft)
7  Lyft                    Envoy::Event::DispatcherImpl::runPostCallbacks() (Lyft)
8  Lyft                    event_process_active_single_queue (Lyft)
9  Lyft                    event_base_loop (Lyft)
10 Lyft                    Envoy::Server::InstanceImpl::run() (Lyft)
11 Lyft                    Envoy::MainCommonBase::run() (Lyft)
12 Lyft                    Envoy::Engine::run(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) (Lyft)
13 Lyft                    void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, envoy_status_t (Envoy::Engine::*)(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >), Envoy::Engine*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >(void*) (Lyft)
14 libsystem_pthread.dylib __pthread_start
@junr03 junr03 added the crash label Feb 13, 2020
@junr03 junr03 added this to the v0.3 "Secondi" milestone Feb 13, 2020
@junr03 junr03 self-assigned this Feb 13, 2020
@junr03
Copy link
Member Author

junr03 commented Feb 13, 2020

note that this issue is extremely low frequency (~5 times in the last 6 weeks with thousands of sessions in hundreds of clients at Lyft) and happened before the major refactor in #616

@junr03
Copy link
Member Author

junr03 commented Mar 11, 2020

@jmarantz, sorry to pull you in here. I came back to this today, and re-read the histogram code. I am a little bit at a loss about how we could be triggering either of the assertions in the refresh code. All manipulations of the two vectors should ensure that we end up with the supported number of elements...

Do you have any intuition about what might be happening here?

@junr03
Copy link
Member Author

junr03 commented Mar 13, 2020

@ramaraochavali in envoyproxy/envoy#10179 you mentioned the previous merge messing up with data. Well, what is interesting is that we noticed that another crash started happening after we started flushing metrics by manual triggers #748. That problem is more simple, as it makes sense that we are hitting the assertion.

However, what is interesting is that the crash reported in this issue saw an uptick in occurrences since we started doing manual flushes (vs. in the past is happened very very rarely). So I am wondering if it is possible that a previous merge is messing up a current merge. However, what gives me less confidence in that theory is that the assert triggered in #748 should protect us against my theory.

Obviously, there is also the possibility that the uptick in this crash since the introduction of #748 might just be a red herring, correlation != causation and all, but it seems suspicious.

lmk if you have any further thoughts.

@ramaraochavali
Copy link

Just to clarify on my previous comment about the previous merge messing with the current merge, I did not mean they are running in parallel - If they were running in parallel, the merge_in_progress assert will protect us.

What I meant was at https://github.com/envoyproxy/envoy/blob/master/source/common/stats/histogram_impl.cc#L67 we call hist_approx_quantile to compute the quantiles for the new histogram. That populates the computes_quantiles_ vector. If that call for example populates incorrectly, we would come to know about that only in the next merge because assert at line 66 would fire when it is firing next time. We should probably assert immediately after the call to hist_approx_quantile - otherwise computed_quantiles_ is never altered outside of this call. This theory is assuming that you are hitting that assert some how. Does this make sense?

I do not know enough prometheus part to confirm about computed buckets though.

Any way, let us see what you find.

@ramaraochavali
Copy link

@junr03 Just curious, did you find any thing on this?

@junr03
Copy link
Member Author

junr03 commented Mar 27, 2020

Hi @ramaraochavali so sorry for not updating this, I totally missed your last two messages. Your clarification makes total sense, I agree that that is how subsequent calls could mess up the assertion. Of course the calls could not run in parallel, as protected by the assertion.

I don't have anything to report as of yet. After #749, the rate of this crash went back down to very sporadically, so I de-prioritized it for the time being. I will update this once I pick it back up.

@ramaraochavali
Copy link

@junr03 No problem. Thanks for the details and will wait to see what you find.

@stale
Copy link

stale bot commented Apr 27, 2020

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 27, 2020
@stale
Copy link

stale bot commented May 4, 2020

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.

@stale stale bot closed this as completed May 4, 2020
@rebello95 rebello95 reopened this May 4, 2020
@junr03
Copy link
Member Author

junr03 commented May 8, 2020

In trying to repro #838 reliably, I actually found a 100% repro for this crash! It helped me understand what is actually happening here. HistogramStatisticsImpl has a static const std::vector<doubles> here and here. On the other hand, Envoy Mobile's Engine has a static lifecycle held by a shared_ptr here. The thread that runs the envoy event loop exits via ~Engine(). Thus, because the static destruction outside of compilation units is undefined, it is possible for the static vector in the histogram class to be destroyed while the envoy event loop is still running. And thus, during a histogram merge, the assert comparing the supported vs the computed quantiles would fail. In brief, classic static deinitialization fiasco.

The fix from the envoy mobile side is to pursue instance-based engines that do not have a static lifetime.

cc @ramaraochavali in case you are curious

@ramaraochavali
Copy link

@junr03 Great find and thank you for the update.

@jmarantz
Copy link

jmarantz commented May 9, 2020 via email

@ramaraochavali
Copy link

ramaraochavali commented May 10, 2020

changing to construct_on_first_use sounds good to me - Envoy PR envoyproxy/envoy#11127

@junr03
Copy link
Member Author

junr03 commented May 10, 2020

Should we just make the static vector of doubles be a c array, or lazy init it per style guide?

Yep, way easier than solving on our side. Thanks :)

@junr03
Copy link
Member Author

junr03 commented May 10, 2020

Maybe we should add a help-wanted bug to eliminate all of those? Maybe with a format-check to ensure no more are added?

envoyproxy/envoy#11130

junr03 added a commit that referenced this issue May 12, 2020
Description: brings in an envoy update. Relevant commits:
    - envoyproxy/envoy#11127 which uses the construct on first use idiom for certain static variables in the histogram code which were causing #688.
Risk Level: low
Testing: fixed local repro of the crash.

Fixes #688

Signed-off-by: Jose Nino <jnino@lyft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants