Bound the time that stats calculation can take #467
1 second is pretty short. Have you timed bear calculating histograms on a busy cluster?
I have not. I'll give that a shot.
Hrm. What are the odds that the vnode_status delay can be made worse by a retry-timeout-retry-too-soon loop?
So the problem here isn't that vnode status is taking forever to return, [...]
+1 merge
We usually detect stale stats in the output by looking at riak_kv_stat_ts.
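For illustration, a monitoring-side staleness check along these lines might look like the sketch below. It assumes riak_kv_stat_ts is a Unix timestamp in seconds and that the stats have already been parsed into a dict; the function name and the `max_age_s` threshold are hypothetical, not part of Riak.

```python
import time


def stats_are_stale(stats, max_age_s, now=None):
    """Return True if the stats snapshot looks older than max_age_s seconds.

    Assumption: stats["riak_kv_stat_ts"] holds seconds since the Unix epoch.
    A missing timestamp is treated as stale, since freshness cannot be shown.
    """
    now = time.time() if now is None else now
    ts = stats.get("riak_kv_stat_ts")
    if ts is None:
        return True
    return (now - ts) > max_age_s
```

A poller would call this on each fetched snapshot and alert when it returns True, rather than trusting the percentiles in a snapshot that may no longer be current.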
That's an excellent question. There are two components here, to [...] Do you have an opinion, Rune?
The most important use case for stats is as a snapshot to indicate how Riak is doing. Operators set up their monitoring tools to poll "riak-admin status" or "/stats" every few minutes, and to alert them if percentiles start climbing or other potential precursors of service degradation appear. Trustworthy stats are what let operators sleep well at night. :-) If we want to remain friends with the Riak operators, we should do our best to avoid undetectable stale stats. The times when it is most important to be able to rely on stats being fresh are also the times when the likelihood of them going stale is highest, such as overload scenarios or hardware failures. Returning stale stats here could mask symptoms and delay the necessary intervention. I think marking a stat as failed in the output is a good idea. On a side note, basho/riak_kv#753 could be the culprit behind the observed hung stats calculations during overload.
Oh indeed.
Given time constraints, I think that we'll leave this in 1.4.4, but I'll issue a new PR to mark the value with calculation failure for 2.0. Yes, 753 is the direct cause of this particular failure. Part of the 2.0 fix will be to make various things that stats depend on (vnode_status, particularly) overload safe.
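As a rough sketch of what "mark the value with calculation failure" could mean for a consumer, the snippet below renders per-stat results with an explicit failure marker instead of silently reusing an old value. It is not the 2.0 implementation: the `CALC_FAILED` marker, both function names, and the `("ok", value)` / `("error", reason)` result shape are all assumptions made for illustration.

```python
# Hypothetical marker; the real output format is not specified in this thread.
CALC_FAILED = "calculation_failed"


def render_stats(results):
    """Turn {name: ("ok", value) | ("error", reason)} into output values.

    Failed calculations are replaced by CALC_FAILED so that a stale value
    can never be mistaken for a fresh one.
    """
    out = {}
    for name, (status, value) in sorted(results.items()):
        out[name] = value if status == "ok" else CALC_FAILED
    return out


def failed_stats(rendered):
    """List the names of stats whose calculation was marked as failed."""
    return [name for name, value in rendered.items() if value == CALC_FAILED]
```

With this shape, a monitoring tool can alert on a non-empty `failed_stats(...)` result during overload instead of reading numbers that stopped updating.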
Bound the time that stats calculation can take
Add a timeout to stat calculation so that stuck or extremely slow processes no longer keep stats from proceeding forever.
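The general idea of bounding a calculation can be sketched as follows. This is a Python illustration only, not the Erlang code in this PR: `calculate_with_timeout` and its return shape are hypothetical, and unlike an Erlang process, a Python thread cannot be killed, so a timed-out worker is simply abandoned here.

```python
import threading
import time


def calculate_with_timeout(calc, timeout_s):
    """Run calc() in a worker thread and wait at most timeout_s seconds.

    Returns ("ok", value) on success, ("error", "timeout") if the worker is
    still running when the deadline passes, or ("error", repr(exc)) if calc
    raised. The caller is never blocked longer than timeout_s.
    """
    result = {}

    def worker():
        try:
            result["value"] = calc()
        except Exception as exc:  # report the failure instead of hanging
            result["error"] = exc

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout_s)
    if t.is_alive():
        return ("error", "timeout")
    if "error" in result:
        return ("error", repr(result["error"]))
    return ("ok", result["value"])


# Usage: a fast calculation succeeds, a slow one is cut off.
fast = calculate_with_timeout(lambda: 42, 1.0)
slow = calculate_with_timeout(lambda: time.sleep(0.5) or 1, 0.05)
```

The point of the design is that one stuck calculation costs the caller at most the timeout, so the remaining stats can still be produced.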
cc @jonmeredith @russelldb
Note that Jon already did a review, so we may just want to merge this.
We took two different approaches to testing this:
This should also get into develop, along with a fix for the base problem (vnode_status not being overload safe).