clientv3: cancel watches proactively on client context cancellation #11850
Conversation
Have you noticed any degraded performance of etcd because of this?
Not in my basic testing. I'll roll it out to a heavy lock-throughput service and see what happens. Are there any useful benchmarks we could run? What might be the source of degraded performance? As far as I can tell, we are simply shifting an inevitable close message to occur earlier.
I've checked on a service doing a few hundred locks per second; no noticeable effect on 99th-percentile lock-acquire latency.
Currently, watch cancel requests are only sent to the server after a message comes through on a watch where the client has cancelled. This means that cancelled watches that don't receive any new messages are never cancelled; they persist for the lifetime of the client stream. This has negative consequences for locking applications, where a watch may observe a key which might never change again after cancellation, leading to many accumulating watches on the server.

By cancelling proactively, in most cases we simply move the cancel request to happen earlier, and additionally we solve the case where the cancel request would never be sent.

Fixes etcd-io#9416

Heavy inspiration drawn from the solutions proposed there.
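To make the locking scenario concrete, here is a minimal client-side sketch (the endpoint, key name, and timings are illustrative, and the import path assumes the 3.4-era clientv3 package). The caller only cancels the watch context; with this change, the client sends the server a cancel request at that point instead of waiting for the next event on a key that may never change again.

```go
package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	// Illustrative endpoint; adjust for your cluster.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Watch a key that may never change again, e.g. a lock key we lost the race for.
	ctx, cancel := context.WithCancel(context.Background())
	wch := cli.Watch(ctx, "locks/my-lock")

	// Give up waiting. Previously, the client only told the server to cancel
	// this watch the next time an event arrived on it; if the key never changed
	// again, the watch lingered on the server for the lifetime of the stream.
	// With proactive cancellation, the cancel request is sent promptly.
	cancel()

	// The watch channel closes once the watcher is torn down.
	for range wch {
	}
	log.Println("watch cancelled")
}
```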
Can we update the changelog to link to this PR?
Thanks a lot.
lgtm.
The watch count metrics are not robust to duplicate cancellations. These cause the count to be decremented twice, leading eventually to negative counts. We are seeing this in production. The duplicate cancellations are not themselves a big problem (except for performance), but they are caused by the new proactive cancellation logic (etcd-io#11850). As it turns out, w.closingc seems to receive two messages for a cancellation. I have added a heuristic to help us avoid sending two cancellations, but it's not guaranteed. We might want to change the behaviour of w.closingc to avoid duplicates there, but that could be more involved. It seems wise to me, at least, to fix the metrics issue. The heuristic to avoid duplicate cancellation may be valuable to those who update their client but not their server with this fix.
The watch count metrics are not robust to duplicate cancellations. These cause the count to be decremented twice, leading eventually to negative counts. We are seeing this in production. The duplicate cancellations are not themselves a big problem (except for performance), but they are caused by the new proactive cancellation logic (etcd-io#11850). As it turns out, w.closingc seems to receive two messages for a cancellation. I have added a fix which ensures that we won't send duplicate cancel requests.
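The deduplication can be sketched roughly as follows. The type and method names here are hypothetical, not the actual clientv3 code; the point is only the guard: remember, per watcher, whether a cancel request has already gone out, and let only the first path through send one.

```go
package main

import (
	"fmt"
	"sync"
)

// watcherState is a stand-in for the per-watch bookkeeping; the real client
// tracks more, but only the cancel flag matters for deduplication.
type watcherState struct {
	mu         sync.Mutex
	cancelSent bool // true once a cancel request has gone out for this watch ID
}

// markCancelSent reports whether the caller is the first to cancel this
// watcher and should therefore send the cancel request to the server.
// Later callers get false, so the server never sees a second cancel and
// never decrements its watch-count metric twice.
func (w *watcherState) markCancelSent() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.cancelSent {
		return false
	}
	w.cancelSent = true
	return true
}

func main() {
	w := &watcherState{}
	fmt.Println(w.markCancelSent()) // true: send the cancel request
	fmt.Println(w.markCancelSent()) // false: duplicate, suppress it
}
```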
The watch count metrics are not robust to duplicate cancellations. These cause the count to be decremented twice, leading eventually to negative counts. We are seeing this in production. The duplicate cancellations are not themselves a big problem (except for performance), but they are caused by the new proactive cancellation logic (etcd-io#11850), which cancels proactively even immediately before initiating a Close, thus nearly guaranteeing a Close-cancel race, as discussed in watchable_store.go. We can avoid this in most cases by not sending a cancellation when we are going to Close.
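A rough illustration of the workaround (hypothetical names; a sketch of the guard, not the actual clientv3 implementation): the stream records that a Close is underway, and the proactive per-watch cancel is skipped in that case, since Close already tears the watches down on the server.

```go
package main

import (
	"fmt"
	"sync"
)

// watchGRPCStream is a stand-in for the client's per-stream state; the name
// and fields are hypothetical.
type watchGRPCStream struct {
	mu      sync.Mutex
	closing bool // set once Close has begun for the whole stream
}

// beginClose marks the stream as closing before it is torn down.
func (w *watchGRPCStream) beginClose() {
	w.mu.Lock()
	w.closing = true
	w.mu.Unlock()
}

// shouldSendCancel reports whether a proactive per-watch cancel request is
// still worthwhile. Once the stream itself is closing, sending a cancel as
// well nearly guarantees the Close-cancel race described above, so it is
// skipped.
func (w *watchGRPCStream) shouldSendCancel() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	return !w.closing
}

func main() {
	s := &watchGRPCStream{}
	fmt.Println(s.shouldSendCancel()) // true: stream still open, cancel proactively
	s.beginClose()
	fmt.Println(s.shouldSendCancel()) // false: Close is imminent, skip the cancel
}
```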
…50-origin-release-3.4 Automated cherry pick of #11850