
Make decommission test more reliable #2018

Merged

Conversation

@cniackz (Contributor) commented Mar 5, 2024

Objective:

Enhance the stability of our decommissioning test.

@cniackz added the "enhancement" (New feature or request) label on Mar 5, 2024
@cniackz self-assigned this on Mar 5, 2024
@cniackz (Contributor, Author) commented Mar 5, 2024

Next step: get the logs when the test fails.
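A straightforward way to get that is to dump pod status and logs before the script exits on failure. A minimal sketch, assuming a bash trap hook; the function name and output format here are illustrative, not the script's actual code:

```bash
# Hypothetical failure hook: dump tenant state before the script exits.
dump_tenant_logs() {
  echo "### Pod status at failure time ###"
  kubectl -n tenant-lite get pods -o wide

  # Print logs of every tenant pod so the CI output shows why it crashed.
  for pod in $(kubectl -n tenant-lite get pods \
      -l v1.min.io/tenant=myminio -o name); do
    echo "### Logs for ${pod} ###"
    kubectl -n tenant-lite logs "${pod}" --all-containers=true --tail=100 || true
  done
}

# Run the dump whenever the test script exits with a non-zero status.
trap '[ $? -ne 0 ] && dump_tenant_logs' EXIT
```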

@cniackz (Contributor, Author) commented Mar 5, 2024

Attempt 1:

[Screenshot 2024-03-05 at 6:11:26 PM]

7 minutes

@cniackz (Contributor, Author) commented Mar 5, 2024

Log when it passes:

logs_21419704597.zip

@cniackz (Contributor, Author) commented Mar 5, 2024

Rerunning the decommission test again...

@cniackz (Contributor, Author) commented Mar 5, 2024

I am going to execute this test 10 times, or as many times as needed to reproduce the failure. I want to see how reliable it is and how to make it more robust.
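A simple driver for that, assuming the test script can be invoked directly at the path shown in the CI logs:

```bash
# Run the decommission test up to 10 times, stopping at the first failure
# so the failing run's output is the last thing on screen.
for i in $(seq 1 10); do
  echo "=== Run ${i} ==="
  ./testing/decommission-test.sh
  rc=$?
  if [ "${rc}" -ne 0 ]; then
    echo "Run ${i} failed with exit code ${rc}"
    break
  fi
done
```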

@pjuarezd (Member) commented Mar 5, 2024

> I am going to execute this test 10 times, or as many times as needed to reproduce the failure. I want to see how reliable it is and how to make it more robust.

Check this test; it has failed 4 times. If this is a true positive, it would be good to know why it failed:
https://github.com/minio/operator/actions/runs/8163548330/job/22318543265?pr=2017

@cniackz force-pushed the make-decommission-more-reliable-1 branch from 17943b3 to d5ff329 on March 6, 2024 15:07
@cniackz (Contributor, Author) commented Mar 6, 2024

Thank you, Pedro. I see the failure occurs after the new pool is added:

* Adding another pool to myminio tenant to test decommissioning
wait_for_n_tenant_pods(): * Waiting for 8 'myminio' tenant pods in tenant-lite namespace
Requested specific waiting time
* Waiting for 8 pods to come up; wait_time: 600;
* Waiting on: kubectl -n tenant-lite get pods --field-selector=status.phase=Running --no-headers --ignore-not-found=true -l v1.min.io/tenant=myminio
 
 
##############################
To show visibility in all pods
##############################
NAMESPACE            NAME                                         READY   STATUS             RESTARTS      AGE
default              ubuntu-pod                                   1/1     Running            0             4m22s
kube-system          coredns-76f75df574-mz4wr                     1/1     Running            0             5m16s
kube-system          coredns-76f75df574-xj54s                     1/1     Running            0             5m16s
kube-system          etcd-kind-control-plane                      1/1     Running            0             5m30s
kube-system          kindnet-8hpv4                                1/1     Running            0             5m16s
kube-system          kindnet-c78zc                                1/1     Running            0             5m10s
kube-system          kindnet-cqtxp                                1/1     Running            0             5m11s
kube-system          kindnet-f5rjw                                1/1     Running            0             5m16s
kube-system          kindnet-tqjgk                                1/1     Running            0             5m10s
kube-system          kube-apiserver-kind-control-plane            1/1     Running            0             5m30s
kube-system          kube-controller-manager-kind-control-plane   1/1     Running            0             5m31s
kube-system          kube-proxy-24m4n                             1/1     Running            0             5m16s
kube-system          kube-proxy-5cc46                             1/1     Running            0             5m10s
kube-system          kube-proxy-92jkb                             1/1     Running            0             5m16s
kube-system          kube-proxy-nq8t2                             1/1     Running            0             5m11s
kube-system          kube-proxy-s4rvw                             1/1     Running            0             5m10s
kube-system          kube-scheduler-kind-control-plane            1/1     Running            0             5m30s
local-path-storage   local-path-provisioner-7577fdbbfb-bwrjn      1/1     Running            0             5m16s
minio-operator       console-7984458cf5-p2j8f                     1/1     Running            0             4m54s
minio-operator       minio-operator-666c7cb4b8-bp58j              1/1     Running            0             4m54s
minio-operator       minio-operator-666c7cb4b8-ss8zt              1/1     Running            0             4m54s
tenant-lite          myminio-pool-0-0                             2/2     Running            0             3m22s
tenant-lite          myminio-pool-0-1                             2/2     Running            0             3m22s
tenant-lite          myminio-pool-0-2                             2/2     Running            0             3m22s
tenant-lite          myminio-pool-0-3                             1/2     Error              1 (14s ago)   15s
tenant-lite          myminio-pool-1-0                             1/2     CrashLoopBackOff   3 (40s ago)   97s
tenant-lite          myminio-pool-1-1                             1/2     Error              1 (5s ago)    8s
tenant-lite          myminio-pool-1-2                             1/2     Error              1 (10s ago)   12s
tenant-lite          myminio-pool-1-3                             1/2     Error              1 (13s ago)   15s
 
 
 
Waiting for the tenant pods to be ready (5m timeout)
pod/myminio-pool-0-0 condition met
pod/myminio-pool-0-1 condition met
pod/myminio-pool-0-2 condition met
pod/myminio-pool-0-3 condition met
pod/myminio-pool-1-0 condition met
pod/myminio-pool-1-1 condition met
pod/myminio-pool-1-2 condition met
error: timed out waiting for the condition on pods/myminio-pool-1-3
/home/runner/work/operator/operator/testing/decommission-test.sh: cannot kubectl wait --namespace tenant-lite --for=condition=ready pod --selector v1.min.io/tenant=myminio --timeout=300s
Deleting cluster "kind" ...
Deleted nodes: ["kind-control-plane" "kind-worker3" "kind-worker" "kind-worker2" "kind-worker4"]
Error: Process completed with exit code 111.
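For context, the wait_for_n_tenant_pods() helper the log mentions boils down to polling until the expected number of Running pods appears. Its actual body isn't shown in this thread; this is a sketch reconstructed from the kubectl command the log prints, and the argument order is an assumption:

```bash
# Poll until N tenant pods report Running, or give up after wait_time seconds.
# Mirrors the kubectl command shown in the log; the function body is a sketch.
wait_for_n_tenant_pods() {
  local expected=$1 namespace=$2 tenant=$3 wait_time=${4:-600}
  local elapsed=0
  while [ "${elapsed}" -lt "${wait_time}" ]; do
    local running
    running=$(kubectl -n "${namespace}" get pods \
      --field-selector=status.phase=Running --no-headers \
      --ignore-not-found=true -l "v1.min.io/tenant=${tenant}" | wc -l)
    if [ "${running}" -ge "${expected}" ]; then
      echo "All ${expected} pods are Running"
      return 0
    fi
    sleep 5
    elapsed=$((elapsed + 5))
  done
  echo "Timed out waiting for ${expected} pods" >&2
  return 1
}

# Usage matching the log: 8 pods, tenant-lite namespace, 600s budget.
# wait_for_n_tenant_pods 8 tenant-lite myminio 600
```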

@cniackz (Contributor, Author) commented Mar 6, 2024

I haven't been able to reproduce this failure yet, and I currently don't understand its root cause. However, this issue arises intermittently when adding a new pool. To address it, I will start by increasing the timeout from 300 seconds to 600 seconds and observe if this makes the test more reliable. Perhaps it's just a matter of giving the pool enough time to be properly added.
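Concretely, that means raising the timeout on the wait command from the failing log above, along these lines:

```bash
# Before: pods had 300s to become ready after the new pool was added.
# After: give them 600s, since pool addition can be slow on CI runners.
kubectl wait --namespace tenant-lite \
  --for=condition=ready pod \
  --selector=v1.min.io/tenant=myminio \
  --timeout=600s
```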

@cniackz force-pushed the make-decommission-more-reliable-1 branch from d5ff329 to b33dfde on March 6, 2024 16:10
@cniackz (Contributor, Author) commented Mar 6, 2024

Another cause of failure was a missing object following decommission. I am printing the list of objects to further debug this issue:

Verify Data in remaining pool(s) after decommission test
Get data and verify files are still present as they were uploaded
There was an error in mc ls, retrying
1
There was an error in mc ls, retrying
0
mc ls was successful
fail, there is a missing file: file1
exiting with 1 as there is missing file...
command terminated with exit code 1
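For reference, the verification logic behind this output retries mc ls and then checks that each uploaded object is still listed. A sketch of that flow; the alias, bucket, and the file names other than file1 are placeholders, not the script's actual values:

```bash
# Retry mc ls until it succeeds, counting down the remaining attempts,
# then verify each expected object survived the decommission.
retries=2
while ! output=$(mc ls myminio/data/ 2>/dev/null); do
  echo "There was an error in mc ls, retrying"
  retries=$((retries - 1))
  echo "${retries}"
  if [ "${retries}" -lt 0 ]; then
    echo "mc ls kept failing, giving up"
    exit 1
  fi
  sleep 5
done
echo "mc ls was successful"

# Every file uploaded before decommission must still be listed afterwards.
for f in file1 file2 file3; do
  if ! echo "${output}" | grep -q "${f}"; then
    echo "fail, there is a missing file: ${f}"
    echo "exiting with 1 as there is missing file..."
    exit 1
  fi
done
```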

@cniackz force-pushed the make-decommission-more-reliable-1 branch from b33dfde to 472d60a on March 6, 2024 16:20
@cniackz changed the title from "[WIP] - Make decommission test more reliable" to "Make decommission test more reliable" on Mar 6, 2024
@cniackz (Contributor, Author) commented Mar 6, 2024

This version is more stable than before. If any further issues arise, please notify me, and I will work to find solutions to make it even better.

README.md: review comment (outdated, resolved)
@cniackz force-pushed the make-decommission-more-reliable-1 branch 2 times, most recently from 4214a68 to 9aeaab8 on March 7, 2024 18:49
@cniackz force-pushed the make-decommission-more-reliable-1 branch from 9aeaab8 to 6332d7b on March 7, 2024 18:57
@cniackz requested a review from cesnietor on March 7, 2024 19:10
@harshavardhana merged commit 3afe8f1 into minio:master on Mar 8, 2024
26 checks passed