Fix regression for pool creation timeout retry #887

tiagolobocastro · 2024-11-20T00:19:01Z

test: use tmp in project workspace

Use a tmp folder inside the workspace, which makes it much easier to clean up
things like LVM volumes, as we can just purge the folder.

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>

test(pool): create on very large or very slow disks

Uses LVM lvols as backend devices for the pool.
We suspend these before pool creation, which lets us simulate slow
pool creation.
This test ensures that the pool creation completes by itself, and also
that a client can complete it by calling create again.

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>
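
For readers unfamiliar with the trick, here is a minimal sketch of how suspending an LVM lvol could be driven from a test. It assumes the suspension is done with `dmsetup suspend` / `dmsetup resume`, which this PR does not state; the helper names (`dm_name`, `dmsetup`) and the volume names are hypothetical. The sketch only builds the commands; actually running them requires root and an existing volume.

```rust
use std::process::Command;

/// Device-mapper name of an LVM logical volume: "<vg>-<lv>", with any
/// hyphen inside either name doubled.
fn dm_name(vg: &str, lv: &str) -> String {
    format!("{}-{}", vg.replace('-', "--"), lv.replace('-', "--"))
}

/// Builds (but does not run) `dmsetup <action> <dm-name>`.
/// ASSUMPTION: the test suspends lvols through dmsetup; hypothetical helper.
fn dmsetup(action: &str, vg: &str, lv: &str) -> Command {
    let mut cmd = Command::new("dmsetup");
    cmd.arg(action).arg(dm_name(vg, lv));
    cmd
}
```

A test would run `dmsetup("suspend", ...)` before calling create, and `dmsetup("resume", ...)` later so that the queued I/O, and with it the pool creation, can finish.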

fix: allow pool creation to complete asynchronously

When the initial create gRPC times out, the data-plane may still be creating
the pool in the background, which can happen for very large pools.
Rather than assume failure, we allow the creation to complete in the
background, up to a large arbitrary amount of time. If the pool creation
completes before then, we retry the creation flow.
We don't simply use very large timeouts because the gRPC operations are
currently sequential, mostly due to historical reasons.
Now that the data-plane allows concurrent calls, we should also allow
this on the control-plane.
TODO: allow concurrent node operations

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>
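
A rough illustration of the flow described in this commit message, not the actual code in `operations_helper.rs`. A thread and a channel stand in for the data-plane and the gRPC call; every name here (`CreateOutcome`, `spawn_data_plane_create`, `create_pool`) and every duration is hypothetical.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Outcome of the hypothetical create flow.
#[derive(Debug, PartialEq)]
enum CreateOutcome {
    /// The data-plane replied within the request timeout.
    Created,
    /// The request timed out, but the pool finished in the background;
    /// this is the point where the creation flow would be retried.
    CompletedInBackground,
    /// The pool was still not ready when the grace period ran out.
    GaveUp,
}

/// Stand-in for the data-plane: signals completion after `work` has elapsed.
fn spawn_data_plane_create(work: Duration) -> mpsc::Receiver<()> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        thread::sleep(work);
        // The receiver may already be gone if the caller gave up.
        let _ = tx.send(());
    });
    rx
}

/// Waits up to `request_timeout` for the create, then, rather than
/// assuming failure, keeps waiting for up to `grace` more.
fn create_pool(work: Duration, request_timeout: Duration, grace: Duration) -> CreateOutcome {
    let done = spawn_data_plane_create(work);
    match done.recv_timeout(request_timeout) {
        Ok(()) => CreateOutcome::Created,
        Err(RecvTimeoutError::Timeout) => match done.recv_timeout(grace) {
            Ok(()) => CreateOutcome::CompletedInBackground,
            Err(_) => CreateOutcome::GaveUp,
        },
        Err(RecvTimeoutError::Disconnected) => CreateOutcome::GaveUp,
    }
}
```

In this model the caller blocks for the whole grace period, which keeps the sketch short; the commit message instead describes the creation completing in the background.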

fix: check for correct not found error code

A previous fix ended up not working correctly because it was merged
incorrectly, somehow!

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>

chore: update terraform node prep

Pull the Release key from a recent k8s version since the old keys are no
longer valid.
This will have to be updated from time to time.

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>

tiagolobocastro · 2024-11-20T00:19:13Z

Resolves openebs/mayastor#1772

tiagolobocastro · 2024-11-25T17:29:57Z

bors try

bors-openebs-mayastor · 2024-11-25T17:44:39Z

try

Build failed:

continuous-integration/jenkins/branch

tiagolobocastro · 2024-11-25T19:32:27Z

bors try

bors-openebs-mayastor · 2024-11-25T19:52:55Z

try

Build succeeded:

continuous-integration/jenkins/branch

tiagolobocastro · 2024-11-26T09:31:14Z

bors merge

bors-openebs-mayastor · 2024-11-26T09:53:10Z

Build succeeded:

continuous-integration/jenkins/branch

887: Fix regression for pool creation timeout retry r=tiagolobocastro a=tiagolobocastro

test: use tmp in project workspace

Use a tmp folder from the workspace, allowing us to clean up things like LVM volumes a lot more easily, as we can just purge it.

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>

---

test(pool): create on very large or very slow disks

Uses LVM Lvols as backend devices for the pool. We suspend these before pool creation, allowing us to simulate slow pool creation. This test ensures that the pool creation is completed by itself and that a client can also complete it by calling create again.

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>

---

fix: allow pool creation to complete asynchronously

When the initial create gRPC times out, the data-plane may still be creating the pool in the background, which can happen for very large pools. Rather than assume failure, we allow this to complete in the background up to a large arbitrary amount of time. If the pool creation completes before then, we retry the creation flow. We don't simply use very large timeouts because the gRPC operations are currently sequential, mostly due to historical reasons. Now that the data-plane is allowing concurrent calls, we should also allow this on the control-plane.

TODO: allow concurrent node operations

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>

---

fix: check for correct not found error code

A previous fix ended up not working correctly because it was merged incorrectly, somehow!

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>

---

chore: update terraform node prep

Pull the Release key from a recent k8s version since the old keys are no longer valid. This will have to be updated from time to time.

Co-authored-by: Tiago Castro <tiagolobocastro@gmail.com>

890: Backport fixes to release/2.7 r=tiagolobocastro a=tiagolobocastro

chore(bors): merge pull request #887

887: Fix regression for pool creation timeout retry r=tiagolobocastro a=tiagolobocastro

test: use tmp in project workspace

Use a tmp folder from the workspace, allowing us to clean up things like LVM volumes a lot more easily, as we can just purge it.

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>

---

test(pool): create on very large or very slow disks

Uses LVM Lvols as backend devices for the pool. We suspend these before pool creation, allowing us to simulate slow pool creation. This test ensures that the pool creation is completed by itself and that a client can also complete it by calling create again.

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>

---

fix: allow pool creation to complete asynchronously

When the initial create gRPC times out, the data-plane may still be creating the pool in the background, which can happen for very large pools. Rather than assume failure, we allow this to complete in the background up to a large arbitrary amount of time. If the pool creation completes before then, we retry the creation flow. We don't simply use very large timeouts because the gRPC operations are currently sequential, mostly due to historical reasons. Now that the data-plane is allowing concurrent calls, we should also allow this on the control-plane.

TODO: allow concurrent node operations

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>

---

fix: check for correct not found error code

A previous fix ended up not working correctly because it was merged incorrectly, somehow!

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>

---

chore: update terraform node prep

Pull the Release key from a recent k8s version since the old keys are no longer valid. This will have to be updated from time to time.

Co-authored-by: Tiago Castro <tiagolobocastro@gmail.com>

---

fix(resize): atomically check for the required size

Ensures races don't lead to volume resize failures.

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>

---

test(bdd/thin): fix racy thin prov test

Add retry waiting for condition to be met.

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>

---

feat(topology): remove the internal labels while displaying

Signed-off-by: sinhaashish <ashi.sinha.87@gmail.com>

---

fix(fsfreeze): improved error message when volume is not staged

Signed-off-by: Abhinandan Purkait <purkaitabhinandan@gmail.com>

---

fix(deployer): increasing the max number of allowed connection attempts to the io-engine

Signed-off-by: sinhaashish <ashi.sinha.87@gmail.com>

---

fix(topology): hasTopologyKey overwites affinityTopologyLabels

Signed-off-by: sinhaashish <ashi.sinha.87@gmail.com>

Co-authored-by: sinhaashish <ashi.sinha.87@gmail.com>
Co-authored-by: Abhinandan Purkait <purkaitabhinandan@gmail.com>
Co-authored-by: Tiago Castro <tiagolobocastro@gmail.com>
Co-authored-by: mayastor-bors <mayastor-bors@noreply.github.com>

tiagolobocastro added 2 commits November 18, 2024 12:44

chore: update terraform node prep

8333853

Pull the Release key from a recent k8s version since the old keys are no longer valid. This will have to be updated from time to time. Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>

fix: check for correct not found error code

8c2b226

A previous fix ended up not working correctly because it was merged incorrectly, somehow! Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>

tiagolobocastro requested review from sinhaashish, Abhinandan-Purkait, abhilashshetty04 and dsharma-dc November 20, 2024 00:19

dsharma-dc approved these changes Nov 20, 2024

Review thread on control-plane/agents/src/bin/core/controller/resources/operations_helper.rs (outdated, resolved)

tiagolobocastro force-pushed the pool-timeout branch 3 times, most recently from 9607994 to 405f633 Compare November 25, 2024 12:59

tiagolobocastro force-pushed the pool-timeout branch 2 times, most recently from b0afe9d to 83ac4b4 Compare November 25, 2024 15:27

tiagolobocastro mentioned this pull request Nov 25, 2024

feat(chart): add requestTimeout for core-agent container openebs/mayastor-extensions#569

Merged

7 tasks

bors-openebs-mayastor bot pushed a commit that referenced this pull request Nov 25, 2024

Try #887:

1471be1

tiagolobocastro added 2 commits November 25, 2024 19:30

test: use tmp in project workspace

1f638e3

Use a tmp folder inside the workspace, which makes it much easier to clean up things like LVM volumes, as we can just purge the folder. Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>

test: increase etcd pagination test retries

8caf2a6

Not sure why this is starting to fail now... even on an unchanged release branch it's failing now!? Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>

tiagolobocastro force-pushed the pool-timeout branch from 83ac4b4 to 8caf2a6 Compare November 25, 2024 19:32

bors-openebs-mayastor bot pushed a commit that referenced this pull request Nov 25, 2024

Try #887:

f336f99

Abhinandan-Purkait approved these changes Nov 26, 2024

bors-openebs-mayastor bot merged commit 61ef768 into develop Nov 26, 2024
4 checks passed

bors-openebs-mayastor bot deleted the pool-timeout branch November 26, 2024 09:53
