*: add inflight snapshot metrics #11009

Merged
merged 3 commits into from
Aug 8, 2019

gyuho
Contributor

@gyuho gyuho commented Aug 8, 2019

Currently, there is no way to tell whether a follower is receiving a snapshot from the leader. If a snapshot receive is in flight, the operator can wait until it finishes rather than marking the follower as unhealthy. This is useful for large clusters, where sending a snapshot (e.g. 4 GB) can take several minutes.

/cc @xiang90 @jpbetz @jingyih

Contributor

@jpbetz jpbetz left a comment

This looks useful! A couple of minor comments:

applySnapshotInflights = prometheus.NewGauge(prometheus.GaugeOpts{
	Namespace: "etcd",
	Subsystem: "server",
	Name:      "snapshot_apply_inflights_total",

nit: Since this is not a network metric (subsystem=server), maybe name it something like "snapshot_apply_in_progress_total" instead of "inflight"?

@@ -287,6 +289,7 @@ func (h *snapshotHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
		plog.Error(msg)
	}
	http.Error(w, msg, http.StatusInternalServerError)
	snapshotReceiveInflights.WithLabelValues(from).Dec()

nit: It would be great if we could structure the code so that we only need one call to Dec(), ideally in a defer. The code looks correct, but I worry we'll end up miscounting at some point...

@gyuho
Contributor Author

gyuho commented Aug 8, 2019

@jpbetz All addressed. Thanks!

@@ -90,6 +90,7 @@ func (s *snapshotSender) send(merged snap.Message) {
		plog.Infof("start to send database snapshot [index: %d, to %s]...", m.Snapshot.Metadata.Index, types.ID(m.To))
	}

	snapshotSendInflights.WithLabelValues(to).Inc()

defer for Dec() here too?

Contributor Author

Just fixed :)

@jpbetz
Contributor

jpbetz commented Aug 8, 2019

LGTM after other defer Dec() is addressed (#11009 (comment)).

Thanks!

…cd_network_snapshot_receive_inflights_total"

Useful for deciding when to terminate the unhealthy follower.
If the follower is receiving a leader snapshot, operator may wait.

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
@jingyih
Contributor

jingyih commented Aug 8, 2019

Agree with Joe. Thanks!

gyuho added 2 commits August 8, 2019 13:33
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
@jpbetz
Contributor

jpbetz commented Aug 8, 2019

LGTM!

@hexfusion
Contributor

LGTM thanks @gyuho!

@gyuho gyuho merged commit 046c705 into etcd-io:master Aug 8, 2019
@gyuho gyuho deleted the snapshot branch August 8, 2019 20:56
gyuho added a commit to gyuho/etcd that referenced this pull request Aug 8, 2019
*: add inflight snapshot metrics
@wenjiaswe
Contributor

Could we backport this to 3.2 and 3.3?

@gyuho
Contributor Author

gyuho commented Oct 9, 2019

@wenjiaswe Sure!
