Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bound the time that stats calculation can take #467

Merged
merged 2 commits into from
Dec 7, 2013
Merged

Conversation

evanmcc
Copy link
Contributor

@evanmcc evanmcc commented Dec 6, 2013

add a timeout to stat calculation so that stuck or extremely slow processes no longer keep stats from proceeding forever.

cc @jonmeredith @russelldb

Note that jon already did a revew, so we may just want to merge this.

We took two different approaches to testing this:

  • I mecked riak_kv_vnode:vnode_status to never return (which is what we were seeing in the field) and instrumented the code to be more verbose in logging.
  • Jon IIRC set the proxy overload length to 1 and then loaded the cluster to get a more "natural" reproduction.

This should also get into develop, along with a fix for the base problem (vnode_status not being overload safe).

@russelldb
Copy link
Member

1 second is pretty short, have you timed bear calculating histograms on a busy cluster?

@evanmcc
Copy link
Contributor Author

evanmcc commented Dec 6, 2013

I have not, I'll give that a shot.

On Thu, Dec 5, 2013 at 4:38 PM, Russell Brown notifications@github.comwrote:

1 second is pretty short, have you timed bear calculating histograms on a
busy cluster?


Reply to this email directly or view it on GitHubhttps://github.com//pull/467#issuecomment-29953614
.

@slfritchie
Copy link
Contributor

Hrm. What are the odds that the vnode_status delay can be made worse by a retry-timeout-retry-too-soon loop?

@evanmcc
Copy link
Contributor Author

evanmcc commented Dec 6, 2013

So the problem here isn't that vnode status is taking forever to return,
it's that the vnode is never getting the status message, and its receive
never times out. That said, I am totally willing to raise the timeout, I
just selected 1s arbitrarily to make stats move on more quickly.

On Thu, Dec 5, 2013 at 7:32 PM, Scott Lystig Fritchie <
notifications@github.com> wrote:

Hrm. What are the odds that the vnode_status delay can be made worse by a
retry-timeout-retry-too-soon loop?


Reply to this email directly or view it on GitHubhttps://github.com//pull/467#issuecomment-29961171
.

@jonmeredith
Copy link
Contributor

+1 merge

evanmcc added a commit that referenced this pull request Dec 7, 2013
Bound the time that stats calculation can take
@evanmcc evanmcc merged commit 366b9dd into 1.4 Dec 7, 2013
@cmeiklejohn cmeiklejohn deleted the pevm-status-safety branch December 7, 2013 04:09
@runesl
Copy link

runesl commented Dec 10, 2013

We usually detect stale stats in the output by looking at riak_kv_stat_ts.
Is it still possible to detect stale stats after this pull?

@evanmcc
Copy link
Contributor Author

evanmcc commented Dec 10, 2013

That's an excellent question. There are two components here, to
staleness. The implication of this patch is that some stats may silently
become stale for an unknown period of time, while allowing the rest of the
stats to proceed more or less in sync with the timestamp. We could
possibly mark the failing stat instead of allowing it to get staler (I've
yet to port this change to the 2.0 branch so we could change it there).

Do you have an opinion, Rune?

@runesl
Copy link

runesl commented Dec 11, 2013

The most important use case for stats is as a snapshot to indicate how Riak is doing. Operators set up their monitoring tools to poll "riak-admin status" or "/stats" every few minutes, and alert them if percentiles start climbing or other potential precursors of service degradation. Trustworthy stats is what makes operators sleep well at night. :-)

If we want to remain friends with the Riak operators, we should do our best to avoid undetectable stale stats. The times where it is most important to be able to be rely on stats being fresh, is when the the likelyhood of them going stale is also highest - such as overload scenarios or hardware failures. Returning stale stats here could potentially mask symptoms, and delay the necessary intervention.

I think marking a stat as failed in the output is a good idea.

On a side note, basho/riak_kv#753 could be the culprit behind the observed hung stats calculations during overload.

@russelldb
Copy link
Member

Oh indeed.

@evanmcc
Copy link
Contributor Author

evanmcc commented Dec 11, 2013

Given time constraints, I think that we'll leave this in 1.4.4, but I'll issue a new PR to mark the value with calculation failure for 2.0.

Yes, 753 is the direct cause of this particular failure. Part of the 2.0 fix will be to make various things that stats depends on (vnode_status, particularly) overload safe.

evanmcc added a commit that referenced this pull request Dec 14, 2013
Bound the time that stats calculation can take
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants