clientv3: cancel watches proactively on client context cancellation #11850
Conversation
Have you noticed any degraded performance of etcd because of this?
Not in my basic testing. I'll roll it out to a heavy lock-throughput service and see what happens. Are there any useful benchmarks we could run? What might be the source of degraded performance? As far as I can tell, we are simply shifting an inevitable close message to occur earlier.
I've checked on a service doing a few hundred locks per second; no noticeable effect on 99th-percentile lock-acquire latency.
Currently, watch cancel requests are only sent to the server after a message comes through on a watch where the client has cancelled. This means that cancelled watches that don't receive any new messages are never cancelled; they persist for the lifetime of the client stream. This has negative consequences for locking applications, where a watch may observe a key which might never change again after cancellation, leading to many accumulating watches on the server.

By cancelling proactively, in most cases we simply move the cancel request to happen earlier, and additionally we solve the case where the cancel request would never be sent.

Fixes etcd-io#9416

Heavy inspiration drawn from the solutions proposed there.
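To make the locking scenario concrete, here is a minimal client-side sketch (the endpoint, key name, and timings are illustrative, and the import path assumes the 3.4-era clientv3 package). The caller only cancels the watch context; with this change, the client sends the server a cancel request at that point instead of waiting for the next event on a key that may never change again.

```go
package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	// Illustrative endpoint; adjust for your cluster.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Watch a key that may never change again, e.g. a lock key we lost the race for.
	ctx, cancel := context.WithCancel(context.Background())
	wch := cli.Watch(ctx, "locks/my-lock")

	// Give up waiting. Previously, the client only told the server to cancel
	// this watch the next time an event arrived on it; if the key never changed
	// again, the watch lingered on the server for the lifetime of the stream.
	// With proactive cancellation, the cancel request is sent promptly.
	cancel()

	// The watch channel closes once the watcher is torn down.
	for range wch {
	}
	log.Println("watch cancelled")
}
```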
Can we update the changelog to link to this PR?
Thanks a lot.
lgtm.
The watch count metrics are not robust to duplicate cancellations. These cause the count to be decremented twice, leading eventually to negative counts. We are seeing this in production. The duplicate cancellations are not themselves a big problem (except for performance), but they are caused by the new proactive cancellation logic (etcd-io#11850). As it turns out, w.closingc seems to receive two messages for a cancellation. I have added a heuristic to help us avoid sending two cancellations, but it's not guaranteed. We might want to change the behaviour of w.closingc to avoid duplicates there, but that could be more involved. It seems wise to me, at least, to fix the metrics issue. The heuristic to avoid duplicate cancellation may be valuable to those who update their client but not their server with this fix.
The watch count metrics are not robust to duplicate cancellations. These cause the count to be decremented twice, leading eventually to negative counts. We are seeing this in production. The duplicate cancellations are not themselves a big problem (except for performance), but they are caused by the new proactive cancellation logic (etcd-io#11850). As it turns out, w.closingc seems to receive two messages for a cancellation. I have added a fix which ensures that we won't send duplicate cancel requests.
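The deduplication can be sketched roughly as follows. The type and method names here are hypothetical, not the actual clientv3 code; the point is only the guard: remember, per watcher, whether a cancel request has already gone out, and let only the first path through send one.

```go
package main

import (
	"fmt"
	"sync"
)

// watcherState is a stand-in for the per-watch bookkeeping; the real client
// tracks more, but only the cancel flag matters for deduplication.
type watcherState struct {
	mu         sync.Mutex
	cancelSent bool // true once a cancel request has gone out for this watch ID
}

// markCancelSent reports whether the caller is the first to cancel this
// watcher and should therefore send the cancel request to the server.
// Later callers get false, so the server never sees a second cancel and
// never decrements its watch-count metric twice.
func (w *watcherState) markCancelSent() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.cancelSent {
		return false
	}
	w.cancelSent = true
	return true
}

func main() {
	w := &watcherState{}
	fmt.Println(w.markCancelSent()) // true: send the cancel request
	fmt.Println(w.markCancelSent()) // false: duplicate, suppress it
}
```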
The watch count metrics are not robust to duplicate cancellations. These cause the count to be decremented twice, leading eventually to negative counts. We are seeing this in production. The duplicate cancellations are not themselves a big problem (except for performance), but they are caused by the new proactive cancellation logic (etcd-io#11850), which cancels proactively even immediately before initiating a Close, thus nearly guaranteeing a Close-cancel race, as discussed in watchable_store.go. We can avoid this in most cases by not sending a cancellation when we are going to Close.
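A rough illustration of the workaround (hypothetical names; a sketch of the guard, not the actual clientv3 implementation): the stream records that a Close is underway, and the proactive per-watch cancel is skipped in that case, since Close already tears the watches down on the server.

```go
package main

import (
	"fmt"
	"sync"
)

// watchGRPCStream is a stand-in for the client's per-stream state; the name
// and fields are hypothetical.
type watchGRPCStream struct {
	mu      sync.Mutex
	closing bool // set once Close has begun for the whole stream
}

// beginClose marks the stream as closing before it is torn down.
func (w *watchGRPCStream) beginClose() {
	w.mu.Lock()
	w.closing = true
	w.mu.Unlock()
}

// shouldSendCancel reports whether a proactive per-watch cancel request is
// still worthwhile. Once the stream itself is closing, sending a cancel as
// well nearly guarantees the Close-cancel race described above, so it is
// skipped.
func (w *watchGRPCStream) shouldSendCancel() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	return !w.closing
}

func main() {
	s := &watchGRPCStream{}
	fmt.Println(s.shouldSendCancel()) // true: stream still open, cancel proactively
	s.beginClose()
	fmt.Println(s.shouldSendCancel()) // false: Close is imminent, skip the cancel
}
```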
…50-origin-release-3.4 Automated cherry pick of #11850