
etcdserver: add watchdog to detect stalled writes #15440

Closed
wants to merge 1 commit

Conversation

@ahrtr (Member) commented Mar 10, 2023

@ahrtr ahrtr marked this pull request as draft March 10, 2023 02:31
@ahrtr (Member Author) commented Mar 10, 2023

This is the draft PR per our discussion in the doc.

@mitake @ptabor @serathius @chaochn47 @fuweid Please let me know if you have any immediate comments or concerns before I continue to add tests.

Resolved review threads: server/embed/etcd.go, server/watchdog/watchdog.go
@ahrtr ahrtr force-pushed the watchdog_20230309 branch 2 times, most recently from b8e2988 to 989cce8 on March 13, 2023 00:58
```
v.inactiveElapsed++
if v.inactiveElapsed > wd.inactiveTimeoutTick/2 {
	elapsedTime := time.Duration(v.inactiveElapsed*tickMs) * time.Millisecond
	wd.lg.Warn("Slow activity detected", zap.String("activity", v.name), zap.Duration("duration", elapsedTime))
```

nit: please add a gauge metric for the slow-activity count, with the activity name as a label. I think this will be useful for folks who rely on metrics to actively monitor an etcd cluster.
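For context, a minimal sketch of what such a metric could look like, using the Prometheus client that etcd already depends on; the metric name, labels, and wiring below are illustrative assumptions, not code from this PR:

```go
package watchdog

import "github.com/prometheus/client_golang/prometheus"

// slowActivities is a hypothetical gauge tracking how many activities the
// watchdog currently considers slow, labeled by activity name.
var slowActivities = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: "etcd",
		Subsystem: "server",
		Name:      "watchdog_slow_activities",
		Help:      "Number of activities currently flagged as slow by the watchdog.",
	},
	[]string{"activity"},
)

func init() {
	prometheus.MustRegister(slowActivities)
}

// In the tick loop, next to the "Slow activity detected" warning, the gauge
// could be bumped and later cleared when the activity completes:
//
//	slowActivities.WithLabelValues(v.name).Inc()
//	...
//	slowActivities.WithLabelValues(v.name).Dec()
```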

@serathius (Member)

I would like to clarify what the proposed solution is and how it differs from the design proposed by @ptabor and me. Can you please clean up the design document before people jump into reviewing the code?

Resolved review threads: server/embed/config.go, server/etcdserver/api/snap/snapshotter.go
```
@@ -88,7 +89,9 @@ func (s *Snapshotter) save(snapshot *raftpb.Snapshot) error {
	spath := filepath.Join(s.dir, fname)

	fsyncStart := time.Now()
	cancel := watchdog.Register("save v2 snapshot")
```
Contributor

I think we need to decide on better naming.

a)
```go
endActivity := wd.StartActivity("saving v2 snapshot")
endActivity()
```

b)
```go
markAsDone := wd.Start("saving v2 snapshot")
markAsDone()
```

c)
```go
markAsDone := wd.Notify("saving v2 snapshot")
markAsDone()
```

d)
```go
endScope := wd.StartScope("saving v2 snapshot")
endScope()
```

We might also have syntactic sugar:

```go
watchdog.executeInScope("saving v2 snapshot", func() { ... })
```

I don't like 'cancel', as it suggests an interruption of an existing process, and it's not.

I had hoped we would settle this (API/naming) in a document rather than keep updating the PR, which is relatively expensive.

Member Author

I like the suggestion of syntactic sugar. I imagine it's similar to bbolt's View and Update methods.
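For readers unfamiliar with the analogy: bbolt's View/Update take a closure and handle setup/teardown around it, which is the shape the watchdog sugar would mimic. The snippet below is a generic bbolt usage example, not related to this PR:

```go
package main

import (
	bolt "go.etcd.io/bbolt"
)

func main() {
	// Open (or create) a bbolt database file.
	db, err := bolt.Open("example.db", 0600, nil)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Update wraps the closure in a read-write transaction and commits or
	// rolls back around it, the same shape as wd.Execute(name, fn).
	err = db.Update(func(tx *bolt.Tx) error {
		_, err := tx.CreateBucketIfNotExists([]byte("demo"))
		return err
	})
	if err != nil {
		panic(err)
	}
}
```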

Member Author

StartXXX doesn't seem good to me; it doesn't start any activity, it essentially just registers the activity with the watchdog. So how about Register and unregister?

```go
unregister := wd.Register("saving v2 snapshot")
unregister()
```

Users can also use the syntactic sugar you proposed:

```go
wd.Execute("saving v2 snapshot", fn)
```

Please also see the updated doc https://docs.google.com/document/d/1U9hAcZQp3Y36q_JFiw2VBJXVAo2dK2a-8Rsbqv3GgDo/edit#heading=h.3oeryohw1c9o
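To make the proposed API concrete, here is a minimal sketch of how a Register that returns an unregister function and the Execute sugar could fit together; the struct layout and locking are assumptions for illustration, not the PR's actual code:

```go
package watchdog

import "sync"

// Watchdog tracks in-flight activities. The fields and locking here are
// illustrative assumptions, not the PR's implementation.
type Watchdog struct {
	mu         sync.Mutex
	nextID     uint64
	activities map[uint64]string
}

// New creates a watchdog with an empty activity table.
func New() *Watchdog {
	return &Watchdog{activities: map[uint64]string{}}
}

// Register records an in-flight activity and returns a function that
// unregisters it once the activity completes.
func (wd *Watchdog) Register(name string) (unregister func()) {
	wd.mu.Lock()
	defer wd.mu.Unlock()
	id := wd.nextID
	wd.nextID++
	wd.activities[id] = name
	return func() {
		wd.mu.Lock()
		defer wd.mu.Unlock()
		delete(wd.activities, id)
	}
}

// Execute is the syntactic sugar discussed above: it wraps fn in a
// Register/unregister pair, similar in spirit to bbolt's View/Update.
func (wd *Watchdog) Execute(name string, fn func()) {
	unregister := wd.Register(name)
	defer unregister()
	fn()
}
```

Call sites could then write `wd.Execute("saving v2 snapshot", fn)` or `defer wd.Register("saving v2 snapshot")()`.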

```
@@ -88,7 +89,9 @@ func (s *Snapshotter) save(snapshot *raftpb.Snapshot) error {
	spath := filepath.Join(s.dir, fname)

	fsyncStart := time.Now()
	cancel := watchdog.Register("save v2 snapshot")
```
Contributor

I wonder what you think about pre-registering the types of activities that are tracked by the watchdog:

```go
// at the package level:
var SAVING_V2_SNAPSHOTS = watchdog.RegisterActivity("saving v2 snapshots")
```

```go
endActivity := wd.StartActivity(SAVING_V2_SNAPSHOTS)
endActivity()
```

The main benefit is that it drives API usage toward a small number of strings rather than patterns like:

```go
wd.StartActivity(fmt.Sprintf("Writing log entry: %d", entryId))
```

so the API can, for example, be used to monitor how long classes of activities take within the scope of a watchdog, i.e. the watchdog sees that 17% of its wall time was spent in the writing-logs routine.

It's a stretch and over-design, but in theory such registration would also allow building a hierarchy of activities/scopes.
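A rough sketch of that pre-registration idea, with the handle type, registry, and names below as illustrative assumptions rather than code from the PR:

```go
package watchdog

import "sync"

// ActivityType is a handle returned by RegisterActivity; call sites pass a
// fixed handle instead of ad-hoc formatted strings.
type ActivityType struct {
	name string
}

var (
	registryMu sync.Mutex
	registry   []*ActivityType
)

// RegisterActivity pre-registers a class of activity, typically from a
// package-level var, so the set of activity types stays small and static.
func RegisterActivity(name string) *ActivityType {
	registryMu.Lock()
	defer registryMu.Unlock()
	t := &ActivityType{name: name}
	registry = append(registry, t)
	return t
}

// Package-level handle, as in the comment above (illustrative only).
var SavingV2Snapshots = RegisterActivity("saving v2 snapshots")

// A watchdog could then aggregate per-type statistics, e.g. the share of
// wall time each activity class spends active:
//
//	endActivity := wd.StartActivity(SavingV2Snapshots)
//	defer endActivity()
```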

```
wd.mu.Lock()
for _, v := range wd.activities {
	v.inactiveElapsed++
	if v.inactiveElapsed > wd.inactiveTimeoutTick/2 {
```
Contributor

The fundamental difference between this design and the watchdog pattern in general is that you propose to monitor specific registered activities.

I assumed the goal is to monitor the watchdog as a whole: the watchdog is unhealthy when it received no 'Start' nor 'Stop/Done' notification for the whole 'timeout'. Any such notification resets the watchdog; a lack of new activities is especially an alarm.

We register specific activities only to:
a) have actionable error messages:
- the watchdog is stuck with the following active activities...
- the watchdog is stuck and the last started/done activity is X
b) [stretch] piggyback on it and collect metrics on what percentage of time specific activities are in the active state
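For contrast with the per-activity approach in the PR, here is a minimal sketch (an assumption for illustration, not code from either design) of the "watchdog as a whole" pattern described above: any start/done notification resets a deadline, and the watchdog reports unhealthy only when no notification arrives for the full timeout.

```go
package watchdog

import (
	"sync"
	"time"
)

// Watchdog in the "monitor the watchdog as a whole" style: it is healthy as
// long as *any* start or done notification arrives within the timeout window.
type Watchdog struct {
	mu       sync.Mutex
	timeout  time.Duration
	lastSeen time.Time
	lastName string // last started/finished activity, kept for actionable errors
}

func New(timeout time.Duration) *Watchdog {
	return &Watchdog{timeout: timeout, lastSeen: time.Now()}
}

// Notify is called on every activity start or completion and resets the clock.
func (wd *Watchdog) Notify(activity string) {
	wd.mu.Lock()
	defer wd.mu.Unlock()
	wd.lastSeen = time.Now()
	wd.lastName = activity
}

// Healthy reports false once no notification has arrived for the whole
// timeout, along with the last activity seen, for an actionable message.
func (wd *Watchdog) Healthy() (healthy bool, lastActivity string) {
	wd.mu.Lock()
	defer wd.mu.Unlock()
	return time.Since(wd.lastSeen) <= wd.timeout, wd.lastName
}
```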

Resolved review threads: server/watchdog/watchdog.go
@ahrtr (Member Author) commented Mar 14, 2023

Thanks all for the feedback, which basically makes sense to me. The overall idea essentially comes from one of @ptabor's comments in the doc. I will update and clean up the doc later, and I will also keep this PR (just a PoC) in sync with the doc to avoid any misunderstanding.

One thing I'd like to clarify: previously I thought it should be an etcd-raft-loop watchdog, notified each time a Ready struct is received. The watchdog would be unhealthy if it received no new notification within the given max duration. But on second thought, that doesn't seem good, because:

  • The etcd raft loop is not triggered at all in a one-node cluster when there are no client requests.
  • Our purpose for now is to monitor stalled writes, so why should we care about the etcd raft loop?

So eventually I changed it to monitor only registered activities (e.g. syncing the WAL, committing boltDB transactions, etc.). Any concerns?
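Tying that decision back to the tick loop quoted earlier in this thread, a condensed, self-contained sketch of how registered activities could be aged and flagged; the tick size, field names, and half-timeout warning threshold are assumptions mirroring the quoted snippet, not the PR's exact code:

```go
package watchdog

import (
	"sync"
	"time"

	"go.uber.org/zap"
)

const tickMs = 100 // assumed tick granularity in milliseconds

type activity struct {
	name            string
	inactiveElapsed int // ticks since the activity was registered or last seen
}

type Watchdog struct {
	mu                  sync.Mutex
	lg                  *zap.Logger
	inactiveTimeoutTick int // ticks after which an activity counts as stalled
	activities          map[uint64]*activity
}

// tick ages every registered activity and warns once one has been inactive
// for more than half of the timeout, mirroring the snippet quoted above.
func (wd *Watchdog) tick() {
	wd.mu.Lock()
	defer wd.mu.Unlock()
	for _, v := range wd.activities {
		v.inactiveElapsed++
		if v.inactiveElapsed > wd.inactiveTimeoutTick/2 {
			elapsed := time.Duration(v.inactiveElapsed*tickMs) * time.Millisecond
			wd.lg.Warn("Slow activity detected",
				zap.String("activity", v.name),
				zap.Duration("duration", elapsed))
		}
	}
}
```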

@ahrtr ahrtr force-pushed the watchdog_20230309 branch 3 times, most recently from 865764b to 9b6a4a3 on March 17, 2023 05:34
Use watchdog to monitor all storage read/write operations

Signed-off-by: Benjamin Wang <wachao@vmware.com>
@ahrtr (Member Author) commented Mar 19, 2023

@ahrtr (Member Author) commented Apr 22, 2023

#15477 (comment)

stale bot commented Aug 12, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Aug 12, 2023
@ahrtr ahrtr added stage/tracked and removed stale labels Aug 12, 2023
@k8s-ci-robot

@ahrtr: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-etcd-verify | 93f6dcc | link | true | /test pull-etcd-verify |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ruiming-lu commented Mar 9, 2024

Hello developers! I am wondering what the status of this PR is right now. Will it be merged into the main branch?

@serathius (Member)

We decided to go with a different design: etcd will detect stalled writes by checking raft loop execution and expose a /livez endpoint that a supervisor can use to restart etcd if needed.
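As a rough illustration of that direction (not etcd's actual implementation), a liveness handler driven by raft loop progress could look roughly like this; the progress marker, threshold, and port are assumptions for the sketch:

```go
package main

import (
	"net/http"
	"sync/atomic"
	"time"
)

// lastRaftLoopProgress is updated (e.g. by the raft loop) each time it makes
// progress; the handler below treats staleness as a liveness failure.
var lastRaftLoopProgress atomic.Int64

func markRaftLoopProgress() {
	lastRaftLoopProgress.Store(time.Now().UnixNano())
}

// livezHandler returns 200 while the raft loop progressed recently and 503
// otherwise, so an external supervisor (systemd, kubelet, ...) can restart
// the process.
func livezHandler(threshold time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		last := time.Unix(0, lastRaftLoopProgress.Load())
		if time.Since(last) > threshold {
			http.Error(w, "stalled: raft loop made no recent progress", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("ok"))
	}
}

func main() {
	markRaftLoopProgress()
	http.Handle("/livez", livezHandler(10*time.Second))
	_ = http.ListenAndServe(":2381", nil)
}
```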

@serathius serathius closed this Mar 9, 2024

8 participants