-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CI: Improve e2e tests reliability #343
Conversation
554b7d5
to
d84dc6d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great
undo_changes | ||
fi | ||
[ "$RET" -ne 0 ] && echo && echo "::error:: Testing failed with $RET" || echo "::info:: Testing passed" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When undo
is true and RET
is nonzero, we'll get a couple "testing failed" error messages (one before undoing changes, and one after). I guess that's OK, though this could be tweaked if you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was my intention, you get notified things went wrong (as it might not be obvious from the logs) and that things are going to be cleared, then you get a bunch of output spilled over finishing with ansible message that everything went well after which you get this second Testing failed message to emphasize although Ansible is happy the testing actually did not went well.
we are seeing stalled jobs, let's ensure our steps won't take longer than expected. Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
ensure each test won't take longer than expected treshold by setting a 10m timeout. Note one can only set one timeout for all tests within a single file. Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
recent issues in CI indicate that kubectl might sometimes fail which results in wait_for_process interrupting the loop. Let's improve the command to ensure kubectl command passed and only then grep for the (un)expected output. Note the positive commands do not need to be treated as the output should not contain the pod names on failure. Fixes: confidential-containers#339 Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
On test failure we might still execute a cleanup that spils a bunch of text making it not obvious whether the testing passed or failed. Note the return code is already fine, this change is only for the users to better notice things didn't went well. Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
Replace the simple "DEBUG|ERROR|INFO" prefixes with the github action commands "::debug::" as it should improve the GH logs readability while leaving the bash outputs still parsable by humans. Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
Changes:
Hopefully no other changes, tried hard not to introduce new "spaces" rather than "tabs" :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks Lukas!
@@ -62,7 +62,7 @@ jobs: | |||
if [ $RUNNING_INSTANCE = "s390x" ]; then | |||
args="" | |||
fi | |||
./run-local.sh -r "${{ matrix.runtimeclass }}" "${args}" | |||
./run-local.sh -t -r "${{ matrix.runtimeclass }}" "${args}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it makes sense to have the timeout in the script because we don't map this steps to github job's steps, otherwise the timeouts could be set on the github workflows. This scenario might change when we address #309 .
run() { | ||
duration=$1; shift | ||
if [ "$timeout" == "true" ]; then | ||
timeout $duration "$@" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if it prints a friendly message (e.g. "Run timed out after XX") when it timed out? i.e. when $?
-eq 124
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That can open a can of worms as the script itself can return 124 so we'd have to add a logic to get the actual time (I mean we could use the $SECONDS so it's not that extensive but still) and then report "Run probably timed out after XXXs" when the timeout seems correct. Do you want me to add it or are we going to rely on log timestamps only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... we better leave as is. If we start seeing too many timeouts and it proves to be confusing then we may change it.
# | ||
local cmd="! sudo -E kubectl get pods -n $op_ns |" | ||
cmd+="grep -q -e cc-operator-daemon-install" | ||
# (ensure failing kubectl keeps iterating) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
Hi @ldoktor ! I left one suggestion that you might accept. Everything else looks good. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ldoktor
There are several improvements to the e2e pipeline to avoid odd returns as well as to interrupt on hangs. Last but not least I added extra messages that should improve reading the logs.
I'm definitely open to suggestions to remove/improve individual commits....