ccruntime e2e test nightly - unstable #339

Closed
wainersm opened this issue Jan 24, 2024 · 1 comment
@wainersm (Member)

The ccruntime e2e test nightly jobs are pretty unstable: 5 of the latest 9 runs failed.

They aren't failing for the same reason. For example:

DEBUG: Pod: cc-operator-controller-manager-ccbbcfdf7-vtk82, Container: kube-rbac-proxy, Restart count: 0
DEBUG: Pod: manager, Container: 3, Restart count: 
DEBUG: Pod: cc-operator-daemon-install-2v5xd, Container: cc-runtime-install-pod, Restart count: 0
DEBUG: Pod: cc-operator-pre-install-daemon-hpgqq, Container: cc-runtime-pre-install-pod, Restart count: 0
INFO: No new restarts in 3x21s, proceeding...
INFO: Run tests
INFO: Running operator tests for kata-clh
1..2
Error: The action has timed out.

In another job:

~/actions-runner/_work/operator/operator/install/pre-install-payload
ccruntime.confidentialcontainers.org/ccruntime-sample created
No resources found
No resources found
No resources found
No resources found
No resources found
No resources found
No resources found
No resources found
No resources found
No resources found
ERROR: runtimeclass kata-qemu is not up
INFO: Uninstall the operator
ccruntime.confidentialcontainers.org "ccruntime-sample" deleted
ERROR: there are labels left behind
{"beta.kubernetes.io/arch":"amd64","beta.kubernetes.io/os":"linux","cc-preinstall/done":"true","kubernetes.io/arch":"amd64","kubernetes.io/hostname":"garm-rolg2ncd0m","kubernetes.io/os":"linux","node-role.kubernetes.io/control-plane":"","node.kubernetes.io/exclude-from-external-load-balancers":"","node.kubernetes.io/worker":""}INFO: Shutdown the cluster
@ldoktor (Contributor) commented Jan 25, 2024

I managed to reproduce the "ERROR: there are labels left behind" failure while running:

kcli create vm -i ubuntu2204 -P memory=8G -P numcpus=4 -P disks=[50] e2e
kcli ssh e2e
git clone --depth=1 https://github.com/confidential-containers/operator
cd operator/tests/e2e
export PATH="$PATH:/usr/local/bin"
ansible-playbook -i localhost, -c local --tags untagged ansible/main.yml
sudo -E PATH="$PATH" bash -c './cluster/up.sh'
export KUBECONFIG=/etc/kubernetes/admin.conf

followed by a loop:

export "PATH=$PATH:/usr/local/bin"
export KUBECONFIG=/etc/kubernetes/admin.conf

UP=0
TEST=0
DOWN=0

I=0
while :; do
    echo "---< START ITERATION $I: $(date) >--" | tee -a job.log; SECONDS=0
    sudo -E PATH="$PATH" timeout 25m bash -c './operator.sh' || { date; exit -1; }
    UP="$SECONDS"; SECONDS=0; echo "UP    $(date) ($UP)" | tee -a job.log
    sudo -E PATH="$PATH" timeout 25m bash -c ./tests_runner.sh -r kata-qemu || { date; exit -2; }
    TEST="$SECONDS"; SECONDS=0; echo "TESTS $(date) ($TEST)" | tee -a job.log
    sudo -E PATH="$PATH" timeout 25m bash -c './operator.sh uninstall' || { date; exit -3; }
    DOWN="$SECONDS"; SECONDS=0; echo "DOWN  $(date) ($TEST)" | tee -a job.log
    echo -e "---< END ITERATION $I: $(date) ($UP\t$TEST\t$DOWN)\t[$((UP+TEST+DOWN))] >---" | tee -a job.log
    ((I+=1))
done

This loop eventually reproduced the left-behind labels. Interestingly, the operator stayed installed along with the cc-operator-pre-install-daemon container, which might indicate that the first grep (around line 171) finished before the daemon started. That is just an assumption at this point; I'm adding some debug output and will re-try.
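
If that assumption holds, the failure mode would look roughly like the sketch below (a hypothetical helper, not the actual operator.sh code): a negative grep used as a "wait until the pod is gone" check passes trivially when it runs before the pod has even been scheduled.

# Hypothetical illustration of the suspected race, not the real operator.sh code:
# a "wait until gone" check based on grep finding nothing passes immediately
# if it runs before the pre-install pod has been created at all.
wait_for_pod_gone() {
    local pattern="$1" retries="${2:-20}" i
    for ((i = 0; i < retries; i++)); do
        # If the pod has not been scheduled yet, grep already finds nothing
        # and we report "gone" even though the pod is about to start.
        if ! kubectl get pods -n confidential-containers-system --no-headers 2>/dev/null \
                | grep -q "$pattern"; then
            return 0
        fi
        sleep 5
    done
    return 1
}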

ldoktor added a commit to ldoktor/coco-operator that referenced this issue Jan 30, 2024
Recent issues in CI indicate that kubectl might sometimes fail, which
results in wait_for_process interrupting the loop. Let's improve the
command to ensure the kubectl command passed and only then grep for the
(un)expected output.

Note that the positive checks do not need this treatment, as on failure
the output should not contain the pod names.

Fixes: confidential-containers#339

Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
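
A minimal sketch of the pattern the commit describes, using hypothetical helper names rather than the actual code: run kubectl on its own, propagate its failure, and only then grep the captured output, so a transient kubectl error is not mistaken for "no matching pods".

# Sketch of the "ensure kubectl passed, then grep" pattern (hypothetical names):
pods_gone() {
    local pattern="$1" out
    # Capture the output first; if kubectl itself fails, return failure instead
    # of letting the empty output look like "the pods are gone".
    out=$(kubectl get pods -n confidential-containers-system --no-headers) || return 1
    ! grep -q "$pattern" <<< "$out"
}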
ldoktor added a commit to ldoktor/coco-operator that referenced this issue Jan 31, 2024
The network in the CI environment tends to break from time to time; let's
allow up to 3 retries for tasks that support it and that use external
sources.

Fixes: confidential-containers#339

Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
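
The commit applies the retries at the Ansible task level (presumably via Ansible's retries/until on tasks that pull from external sources); purely as an illustration of the same idea, a shell-level retry wrapper would look like this:

# Illustrative retry wrapper for flaky, network-dependent commands; the actual
# change uses Ansible task retries, this is only a shell-level equivalent.
retry() {
    local attempts="$1"; shift
    local i
    for ((i = 1; i <= attempts; i++)); do
        "$@" && return 0
        echo "Attempt $i/$attempts failed: $*" >&2
        sleep 5
    done
    return 1
}

# e.g.: retry 3 git clone --depth=1 https://github.com/confidential-containers/operator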