Flaky test TestLeasingDeleteRangeContendTxn #15352
Comments
Thanks for the report and repro!
Just adding another error message here: in the test linked above, the result was empty (`[]`).
Fixes etcd-io#15352.

Depending on the goroutine scheduling, the expected count of 8 might not have been reached yet. This ensures the routine won't stop earlier than that.

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
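The guard this commit message describes can be sketched roughly as follows. This is a minimal, standalone illustration and not the code from the patch: `putUntil` and `countPuts` are hypothetical names, and the `put` callback stands in for the real write against the leasing client. The loop only starts honoring cancellation once the minimum number of puts has been issued.

```go
package main

import (
	"context"
	"fmt"
)

// putUntil (hypothetical helper) calls put repeatedly. It ignores
// cancellation of ctx until at least minPuts calls have been made,
// then stops as soon as ctx is done. It returns the number of calls.
func putUntil(ctx context.Context, minPuts int, put func(i int)) int {
	n := 0
	for {
		if n >= minPuts {
			select {
			case <-ctx.Done():
				return n
			default:
			}
		}
		put(n) // stand-in for the real Put against the key
		n++
	}
}

// countPuts runs putUntil with a context that gets canceled during
// put number cancelAt, which plays the role of the other goroutine
// finishing early or late depending on scheduling.
func countPuts(cancelAt, minPuts int) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	return putUntil(ctx, minPuts, func(i int) {
		if i == cancelAt {
			cancel()
		}
	})
}

func main() {
	fmt.Println(countPuts(2, 8)) // canceled early: still 8 puts
	fmt.Println(countPuts(9, 8)) // canceled late: 10 puts
}
```

With an early cancel the count is pinned at the minimum, so an assertion on a version of 8 no longer depends on how the two goroutines are scheduled.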
The TestLeasingDeleteRangeContendTxn is trying to test for RangeDelete when the target resources are being updated. When the `txnLeasing` wants a server-side transaction, it needs to ensure that the mod revision of every key is less than what it saw. If the compare fails, it retries the server-side transaction until it succeeds. I believe the test case is trying to verify how `txnLeasing` handles the race.

Before the patch etcd-io#15401, the resource-updating goroutine kept updating until the RangeDelete finished. The test case is flaky because the two goroutines share one `ctx`, and the grpc-go client won't wait for the response if `ctx` has been canceled. For example,

| DelLease Goroutine | PutLease Goroutine | ETCD Server | Key/0 Status |
| -- | --- | -- | -- |
| deleted | | | version = 0 |
| | send update(key/0=123) req | received update(key/0=123) req | version = 0 |
| cancel | | | version = 0 |
| | exit because of cancel | | version = 0 |
| get key/0 by putkv | | | version = 0 |
| | | applied update(key/0=123) | version = 1 |
| get key/0 by raw-cli | | | version = 1 |

So `raw-cli` gets `[key/0=123]` while `putkv` gets `[]`. If `putkv` applies two update reqs to the ETCD server and the last one is canceled before it is applied, the error will look like:

```
expected [key:"key/0" version:2 value:"123" ], got [key:"key/0" version:1 value:"123" ]
```

The resource-updating goroutine should not share the ctx with RangeDelete here. I also revert the change on the current main branch, because there the resource-updating goroutine only updates 8 times and might exit before `RangeDelete`. In that case, `txnLeasing` is not handling the race.

Fixes: etcd-io#15352

Signed-off-by: Wei Fu <fuweid89@gmail.com>
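The interleaving in the table above can be reproduced with a toy model that uses only the standard library. This is a sketch under stated assumptions, not etcd or grpc-go code: `fakeServer`, `put`, and `demo` are hypothetical names, and the `release` channel stands in for the delay between the server receiving a request and applying it. It shows why a writer that shares a canceled context can see fewer acknowledged writes than the server actually applied, and why giving the writer its own context removes the mismatch.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// fakeServer (hypothetical) tracks the version of a single key.
type fakeServer struct {
	mu      sync.Mutex
	version int
}

func (s *fakeServer) apply() {
	s.mu.Lock()
	s.version++
	s.mu.Unlock()
}

func (s *fakeServer) get() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.version
}

// put models a client call: the request always reaches the server,
// which applies it once release is closed. The client returns early
// with ctx.Err() if ctx is canceled first. The returned channel is
// closed when the server has applied the write.
func put(ctx context.Context, s *fakeServer, release <-chan struct{}) (<-chan struct{}, error) {
	applied := make(chan struct{})
	go func() {
		<-release // server-side delay before the write is applied
		s.apply()
		close(applied)
	}()
	select {
	case <-applied:
		return applied, nil
	case <-ctx.Done():
		return applied, ctx.Err()
	}
}

// demo returns how many writes the client saw acknowledged and the
// version the server ends up with.
func demo(shared bool) (acked, serverVersion int) {
	deleteCtx, cancel := context.WithCancel(context.Background())
	putCtx := context.Background() // writer owns its context
	if shared {
		putCtx = deleteCtx // writer shares the deleter's context
	}
	cancel() // the delete side finishes and cancels its context

	s := &fakeServer{}
	release := make(chan struct{})
	if !shared {
		close(release) // nothing interrupts the writer; the server replies normally
	}
	applied, err := put(putCtx, s, release)
	if err == nil {
		acked++
	}
	if shared {
		close(release) // the server applies the write after the client gave up
	}
	<-applied
	return acked, s.get()
}

func main() {
	a, v := demo(true)
	fmt.Printf("shared ctx:      acked=%d server version=%d\n", a, v)
	a, v = demo(false)
	fmt.Printf("independent ctx: acked=%d server version=%d\n", a, v)
}
```

In the shared case the client counts zero acknowledged writes while the server holds version 1, which mirrors the `putkv` versus `raw-cli` disagreement; with an independent context both views agree.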
Fixes etcd-io#15352.

Depending on the goroutine scheduling, the expected count of 8 might not have been reached yet. This ensures the routine won't stop earlier than that.

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
Signed-off-by: Prasad Chandrasekaran <prasadc@vmware.com>
The TestLeasingDeleteRangeContendTxn is trying to test for RangeDelete when the target resources are being updated. When the `txnLeasing` wants a server-side transaction, it needs to ensure that the mod revision of every key is less than what it saw. If the compare fails, it retries the server-side transaction until it succeeds. I believe the test case is trying to verify how `txnLeasing` handles the race.

Before the patch etcd-io#15401, the resource-updating goroutine kept updating until the RangeDelete finished. The test case is flaky because the two goroutines share one `ctx`, and the grpc-go client won't wait for the response if `ctx` has been canceled. For example,

| DelLease Goroutine | PutLease Goroutine | ETCD Server | Key/0 Status |
| -- | --- | -- | -- |
| deleted | | | version = 0 |
| | send update(key/0=123) req | received update(key/0=123) req | version = 0 |
| cancel | | | version = 0 |
| | exit because of cancel | | version = 0 |
| get key/0 by putkv | | | version = 0 |
| | | applied update(key/0=123) | version = 1 |
| get key/0 by raw-cli | | | version = 1 |

So `raw-cli` gets `[key/0=123]` while `putkv` gets `[]`. If `putkv` applies two update reqs to the ETCD server and the last one is canceled before it is applied, the error will look like:

```
expected [key:"key/0" version:2 value:"123" ], got [key:"key/0" version:1 value:"123" ]
```

The resource-updating goroutine should not share the ctx with RangeDelete here. I also revert the change on the current main branch, because there the resource-updating goroutine only updates 8 times and might exit before `RangeDelete`. In that case, `txnLeasing` is not handling the race.

Fixes: etcd-io#15352

Signed-off-by: Wei Fu <fuweid89@gmail.com>
Signed-off-by: Prasad Chandrasekaran <prasadc@vmware.com>
Which github workflows are flaking?
Failed in a GitHub workflow of the forked repo: https://github.com/chaochn47/etcd/actions/runs/4258007188/jobs/7408759584
The flaky test was mentioned in the PR comment #14918 (comment) and can still be reproduced now.
Which tests are flaking?
TestLeasingDeleteRangeContendTxn
Github Action link
No response
Reason for failure (if possible)
No response
Anything else we need to know?
Reproduced in the local box