core: sigabort in histogram code #688
Comments
Note that this issue occurs extremely infrequently (~5 times in the last 6 weeks, across thousands of sessions in hundreds of clients at Lyft) and was already happening before the major refactor in #616. |
@jmarantz, sorry to pull you in here. I came back to this today, and re-read the histogram code. I am a little bit at a loss about how we could be triggering either of the assertions in the refresh code. All manipulations of the two vectors should ensure that we end up with the supported number of elements... Do you have any intuition about what might be happening here? |
@ramaraochavali in envoyproxy/envoy#10179 you mentioned the previous merge messing up the data. What is interesting is that we noticed another crash started happening after we started flushing metrics by manual triggers (#748). That problem is simpler, as it makes sense that we are hitting the assertion. However, the crash reported in this issue has also seen an uptick in occurrences since we started doing manual flushes (in the past it happened very, very rarely). So I am wondering if it is possible that a previous merge is messing up a current merge. What gives me less confidence in that theory is that the assert triggered in #748 should protect us against it. Obviously, there is also the possibility that the uptick in this crash since the introduction of #748 is just a red herring, correlation != causation and all, but it seems suspicious. lmk if you have any further thoughts. |
Just to clarify my previous comment about the previous merge messing with the current merge: I did not mean that they are running in parallel. If they were running in parallel, the merge_in_progress assert would protect us. What I meant was at https://github.com/envoyproxy/envoy/blob/master/source/common/stats/histogram_impl.cc#L67 we call I do not know enough about the Prometheus part to confirm about computed buckets, though. Anyway, let us see what you find. |
@junr03 Just curious, did you find anything on this? |
Hi @ramaraochavali, so sorry for not updating this; I totally missed your last two messages. Your clarification makes total sense, and I agree that that is how subsequent calls could mess up the assertion. Of course the calls could not run in parallel, as protected by the assertion. I don't have anything to report as of yet. After #749, the rate of this crash dropped back to very sporadic, so I de-prioritized it for the time being. I will update this once I pick it back up. |
@junr03 No problem. Thanks for the details and will wait to see what you find. |
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions. |
In trying to repro #838 reliably, I actually found a 100% repro for this crash! It helped me understand what is actually happening here. The fix from the envoy mobile side is to pursue instance-based engines that do not have a static lifetime. cc @ramaraochavali in case you are curious |
@junr03 Great find and thank you for the update. |
Should we just make the static vector of doubles a C array, or lazy-init it per the style guide?
There are also a fairly significant number of static std::string and other small structures in the code base. Maybe we should add a help-wanted bug to eliminate all of those? Maybe with a format check to ensure no more are added?
|
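The "C array" option suggested above can be sketched as follows. This is a minimal illustration only: `kSupportedBuckets`, `supportedBucketCount`, `bucketAt`, and the boundary values are hypothetical and are not taken from Envoy's histogram code. The point is that a `constexpr` array of doubles is constant-initialized and trivially destructible, so nothing is constructed at startup or destroyed at exit.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical bucket boundaries, for illustration only (not Envoy's values).
// A constexpr C array is constant-initialized and has no destructor, so it
// cannot take part in static initialization or destruction order problems.
constexpr double kSupportedBuckets[] = {0.5, 1.0, 5.0, 10.0, 25.0};

// Number of entries, computed at compile time.
constexpr std::size_t supportedBucketCount() {
  return sizeof(kSupportedBuckets) / sizeof(kSupportedBuckets[0]);
}

// Bounds-checked accessor (hypothetical helper).
inline double bucketAt(std::size_t index) {
  assert(index < supportedBucketCount());
  return kSupportedBuckets[index];
}
```

The trade-off is that callers expecting a `std::vector<double>` would need an adapter or a changed signature, which is one reason lazy initialization may be the less invasive choice.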
changing to |
Yep, way easier than solving on our side. Thanks :) |
|
Description: brings in an Envoy update. Relevant commits:
- envoyproxy/envoy#11127, which uses the construct on first use idiom for certain static variables in the histogram code which were causing #688.
Risk Level: low
Testing: fixed local repro of the crash.
Fixes #688
Signed-off-by: Jose Nino <jnino@lyft.com>
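For readers unfamiliar with the construct on first use idiom named in the description, here is a minimal sketch. It is not the code from envoyproxy/envoy#11127: the function names and bucket values are hypothetical. The idea is that the object is created the first time the accessor is called and is intentionally never destroyed, so code that still runs late in process shutdown does not touch an already-destroyed static.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a static table of histogram bucket boundaries.
const std::vector<double>& supportedBuckets() {
  // Constructed on the first call (initialization of a function-local static
  // is thread-safe since C++11) and deliberately leaked, so no destructor
  // runs for it during static destruction at exit.
  static const auto* buckets =
      new std::vector<double>{0.5, 1.0, 5.0, 10.0, 25.0};
  return *buckets;
}

// Bounds-checked accessor (hypothetical helper).
inline double supportedBucketAt(std::size_t index) {
  assert(index < supportedBuckets().size());
  return supportedBuckets()[index];
}
```

Every call returns a reference to the same vector, so callers can keep using it for the whole lifetime of the process regardless of when other statics are torn down.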