Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix CI: misc parallel flakes #23600

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

Honny1
Copy link
Member

@Honny1 Honny1 commented Aug 13, 2024

This PR fixes parallel flakes.

  • |035| podman logs - --until --follow journald The bash $SECONDS are not accurate because the time is rounded to whole seconds. So if the t0 is 1856ms, but the $SECONDS is still 1, this inaccuracy causes the command to be at most only about 150ms late which is less variation than I observed between test runs (the time was around 3150-3650ms). At higher workloads, this delay can be larger. Measuring time in ms should solve the problem.

  • [035] podman logs - multi k8s-file test: issue: Long k8s-file logs are truncated on 'podman logs --follow' output of stopped container #21914

  • [035] podman logs - --since --follow journald I would say that when running in parallel the journald is used by multiple containers, so it will be necessary to increase the timeout time to give the container more time to write to the journald, and also perform a check, end of journald content.

Fixes: #23682

Does this PR introduce a user-facing change?

None

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Enforce release-note requirement, even if just None labels Aug 13, 2024
Copy link
Contributor

openshift-ci bot commented Aug 13, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Honny1
Once this PR has been reviewed and has the lgtm label, please assign giuseppe for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@edsantiago
Copy link
Member

I think you accidentally included #23275 :-)

@Honny1 Honny1 force-pushed the misc-parallel-flakes branch 6 times, most recently from 02d43e8 to 3f1ad80 Compare August 15, 2024 11:46
@openshift-ci openshift-ci bot added release-note-none and removed do-not-merge/release-note-label-needed Enforce release-note requirement, even if just None labels Aug 15, 2024
@Honny1 Honny1 changed the title Flake hunting of CI: misc parallel flakes Fix CI: misc parallel flakes Aug 15, 2024
@Honny1 Honny1 marked this pull request as ready for review August 15, 2024 15:30
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 15, 2024
Copy link
Member

@Luap99 Luap99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edsantiago PTAL and test these changes in your PR.

@@ -104,6 +104,8 @@ function _log_test_multi() {
doit $cname1 "echo a1; echo a2"
doit $cname2 "echo b1; echo b2"

run_podman wait $cname1 $cname2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-f already wait for the containers to exit and is supposed to read until the end
I think #21914 is the much more likely problem with logs -f. The extra wait delay might fix that silently but I don't see this one here as proper fix.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know that -f was waiting for containers to exit. I will remove this change.

@@ -328,12 +330,12 @@ function _log_test_follow_since() {

# sleep is required to make sure the podman event backend no longer sees the start event in the log
# This value must be greater or equal than the value given in --since below
sleep 0.2
sleep 0.5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a proper fix, the issue can go in both directions. If we are slow here and need 1.1 s in total the log line printed in the ctr will be missed and the next line is only written after 5s so I think at the very least we should make the container print every 1s or something

@Honny1 Honny1 force-pushed the misc-parallel-flakes branch 2 times, most recently from 4b019f7 to 0dd9179 Compare August 19, 2024 08:41
Copy link
Member

@edsantiago edsantiago left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, seems to work and I'm OK giving this a try on main. One change request please.

assert "$output" =~ "^$content
timeout: sending signal TERM to command.*" "logs --since -f on running container works"
assert "$output" =~ "$content
timeout: sending signal TERM to command.*$" "logs --since -f on running container works"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding $ here is a NOP which may confuse future maintenance programmers

@edsantiago
Copy link
Member

Also, I changed the Fixes line.

assert "$output" =~ "^$content
timeout: sending signal TERM to command.*" "logs --since -f on running container works"
assert "$output" =~ "${content}
timeout: sending signal TERM to command.*\$" "logs --since -f on running container works"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I wasn't clear.

You added a dollar sign. It is a NOP, it does not do anything. So when the next programmer comes to look at this code some day, she or he will be confused by it, wondering why it exists. This will cost them time.

Please remove it, making it back like it was. Even better, remove the .* which is also useless (that one is my fault). Make it just ...signal TERM to command"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I get it now.

Signed-off-by: Jan Rodák <hony.com@seznam.cz>
Signed-off-by: Jan Rodák <hony.com@seznam.cz>
@edsantiago
Copy link
Member

@mheon I've seen this flake in my parallel testing and, just now, in regular plain old usual bats:

[+0089s] not ok 60 [030] podman run --rmi in 2526ms
...
<+009ms> # # podman-remote --url unix:///tmp/CI_0fS1/podman_tmp_nSJR run --rmi quay.io/libpod/testimage:00000004 /bin/true
<+498ms> # time="2024-08-21T11:33:50Z" level=warning msg="quay.io/libpod/testimage:00000004: image not known"
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: Command succeeded, but issued unexpected warnings
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Is this obvious to you, or shall I file an issue and try to find a reproducer?

@mheon
Copy link
Member

mheon commented Aug 21, 2024

Not obvious. We pulled the image successfully 3 previous times in the test by my count. Is it always that last call?

@edsantiago
Copy link
Member

I don't have enough data to answer an "always" question. I'll look into it.

Copy link

A friendly reminder that this PR had no activity for 30 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI: parallel system tests: logs --until --follow: exited too late
4 participants