
clientv3: cancel watches proactively on client context cancellation #11850

Merged 1 commit into etcd-io:master on May 7, 2020

Conversation

jackkleeman
Contributor

Currently, watch cancel requests are only sent to the server after a
message comes through on a watch that the client has cancelled. This
means that cancelled watches that don't receive any new messages are
never cancelled; they persist for the lifetime of the client stream.
This is a problem for locking applications, where a watch may observe a
key that might never change again after cancellation, leading to many
watches accumulating on the server.

By cancelling proactively, in most cases we simply move the cancel
request earlier, and in addition we cover the case where the cancel
request would otherwise never be sent.

Fixes #9416
Heavy inspiration drawn from the solutions proposed there.

@gyuho gyuho self-assigned this May 6, 2020

@gyuho gyuho left a comment


Have you noticed any degraded performance of etcd because of this?

Review threads (resolved): clientv3/watch.go, integration/v3_watch_test.go
@jackkleeman jackkleeman requested a review from gyuho May 7, 2020 15:17
@jackkleeman
Contributor Author

Have you noticed any degraded performance of etcd because of this?

Not in my basic testing. I'll roll it out to a heavy lock throughput service and see what happens. Are there any useful benchmarks we could run? What might be the source of degraded performance? As far as I can tell, we are simply shifting an inevitable close message to occur earlier.

@jackkleeman
Contributor Author

I've checked on a service doing a few hundred locks per second; no noticeable effect on 99th-percentile lock-acquire latency.

@gyuho

gyuho commented May 7, 2020

watch may observe a key which might never change again after cancellation,
leading to many accumulating watches on the server.

Can we update the changelog to link to this PR?

I've checked on a service doing a few hundred locks per second; no noticeable effect on 99th-percentile lock-acquire latency.

Thanks a lot.


@gyuho gyuho left a comment


lgtm.

@gyuho gyuho merged commit 6d06799 into etcd-io:master May 7, 2020
@jackkleeman jackkleeman deleted the cancel-watch branch May 11, 2020 09:23
jackkleeman added a commit to jackkleeman/etcd that referenced this pull request May 11, 2020
jackkleeman added a commit to monzo/etcd that referenced this pull request May 11, 2020
The watch count metrics are not robust to duplicate cancellations. These
cause the count to be decremented twice, leading eventually to negative
counts. We are seeing this in production. The duplicate cancellations
are not in themselves a big problem (except for performance), but they
are caused by the new proactive cancellation logic (etcd-io#11850). As it
turns out, w.closingc seems to receive two messages for a cancellation.
I have added a heuristic to help us avoid sending two cancellations, but
it's not guaranteed.

We might want to change the behaviour of w.closingc to avoid duplicates
there, but that could be more involved. It seems wise, at least, to fix
the metrics issue. The heuristic to avoid duplicate cancellation may be
valuable to those who update their client but not their server with this
fix.
jackkleeman added a commit to monzo/etcd that referenced this pull request May 11, 2020
The watch count metrics are not robust to duplicate cancellations. These
cause the count to be decremented twice, leading eventually to negative
counts. We are seeing this in production. The duplicate cancellations
are not in themselves a big problem (except for performance), but they
are caused by the new proactive cancellation logic (etcd-io#11850). As it
turns out, w.closingc seems to receive two messages for a cancellation.
I have added a fix which ensures that we won't send duplicate cancel
requests.
jackkleeman added a commit to monzo/etcd that referenced this pull request May 11, 2020
The watch count metrics are not robust to duplicate cancellations. These
cause the count to be decremented twice, leading eventually to negative
counts. We are seeing this in production. The duplicate cancellations
are not in themselves a big problem (except for performance), but they
are caused by the new proactive cancellation logic (etcd-io#11850), which
cancels proactively even immediately before initiating a Close, thus
nearly guaranteeing a Close-cancel race, as discussed in
watchable_store.go. We can avoid this in most cases by not sending a
cancellation when we are going to Close.
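A minimal sketch of this approach (hypothetical struct and fields, not the real clientv3 types): when the stream is about to be closed anyway, a separate cancel request would race with the Close on the server, so the cancel is skipped.

```go
package main

// watcherStream is a hypothetical slice of per-watch client state used
// only to illustrate the fix; the real state lives in clientv3/watch.go.
type watcherStream struct {
	closing bool
	sent    []string // requests "sent" to the server, recorded for illustration
}

// close tears down every watch on the stream, so no per-watch cancel is
// needed afterwards.
func (w *watcherStream) close() {
	w.closing = true
	w.sent = append(w.sent, "close")
}

// cancel sends a cancel request unless the stream is closing, avoiding
// the Close-cancel race that double-decrements the watch count metrics.
func (w *watcherStream) cancel() {
	if w.closing {
		return
	}
	w.sent = append(w.sent, "cancel")
}
```

In the closing path only the "close" is sent; a watch cancelled before any Close still sends its "cancel" as before.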
xiang90 pushed a commit that referenced this pull request May 12, 2020
jackkleeman added a commit to jackkleeman/etcd that referenced this pull request May 14, 2020
gyuho added a commit that referenced this pull request Jun 25, 2020

Automated cherry pick of #11850 (…50-origin-release-3.4)
fcvarela pushed a commit to monzo/etcd that referenced this pull request Aug 27, 2020
suhailpatel pushed a commit to monzo/etcd that referenced this pull request Feb 3, 2022
Linked issue: clientv3: Canceling Watch() doesn't send &pb.WatchCancelRequest