increase timer for TestRealRunnerTimeout #6409
Conversation
This commit increases the timer for TestRealRunnerTimeout, in the hope that this reduces the flake of tektoncd#4643. Some thoughts on why tektoncd#4643 happened: the flaky test got "step didn't timeout", which means that rr.Run doesn't return any error, including the DeadlineExceeded error. It could be that the context timeout is accidentally larger than the sleep time, so Run finishes without the context timing out. So I think we may increase the sleep time to avoid this flake, even though it is already a rare case. Signed-off-by: Yongxuan Zhang yongxuanzhang@google.com
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: […] The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing […]
Force-pushed from 0d05620 to 4555a80
@Yongxuanzhang: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Thanks @Yongxuanzhang for looking into this! I agree that "step didn't timeout" signals that rr.Run doesn't return any errors, but I'm not convinced by the explanation here that for whatever reason, "context timeout is accidentally larger than the sleep time". I think this change will make it less likely that we observe test failure, but I'm concerned the problem is in the code rather than the tests. My theory of what's happening here: The entrypoint waits for the stdout/stderr buffers here before waiting for the command to finish: pipeline/cmd/entrypoint/runner.go Lines 150 to 159 in f83cd1f
If reading from stdout/stderr takes longer than the context timeout, I think the call to […]. This would imply a bug in the code rather than in the tests. I think it could be fixed by running the two […]. I'm not sure if this is actually problematic in real scenarios or not.
If I run the test locally it will return here: pipeline/cmd/entrypoint/runner.go Lines 119 to 125 in f83cd1f
This is a case where it will not reach the code you mentioned, and I suspect this may also happen in our CI (I can open a test PR to test this behaviour). That's why I commented in the original issue that I'm not sure if the test is testing what it originally wants to test. So I think maybe we should also update the context […]
Regarding the possible reason you mentioned, if that is true then we should be able to reproduce this flake? We could add sleep time after the wg.Wait() and test whether there are no errors returned. And I'm also not sure about this:
If it takes a longer time, why wouldn't cmd.Wait() return an error?
Good idea! I tried this with sleep = 10s, timeout = 1s, and command = 5s at commit 538fee3. Interestingly, what happens is […]. However, #6162 removed the call to […]. My guess is that there may have been a bug, but if so it was fixed by #6162. However, it seems like the existing test we have doesn't reliably tell us whether the entrypoint times out correctly; I wonder if we can find a way to address this.
I was wrong!
Haven't seen the flake for a long time; closing this PR.
Changes
This commit increases the timer for TestRealRunnerTimeout, in the hope that this could mitigate/fix the flake of #4643.
Some thoughts about why #4643 happened. The flaky test got "step didn't timeout", which means that rr.Run doesn't return any error, including the DeadlineExceeded error. It could be that the context timeout is accidentally larger than the sleep time, so Run finishes without the context timing out (this could happen if the sleep time decreases or the ctx timeout increases).
If we look at those failed reports:
https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/5666/pull-tekton-pipeline-unit-tests/1584556112152104960
https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/5666/pull-tekton-pipeline-unit-tests/1584676793351147520
https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/5465/pull-tekton-pipeline-unit-tests/1570762197682884608
https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/6120/pull-tekton-pipeline-unit-tests/1623358151954796544
TestRealRunnerTimeout took either 0.09s or 0.08s in these runs, but a passing case should take less time than that (code run time plus ctx timeout time, plus sleep time if the process started; looking at passing reports, most take less than 0.02s). So I suspect these are cases where the ctx timeout took much longer to fire, and the test finished (code run time + process sleep time) in less than the ctx timeout. I think we may increase the sleep time to 0.2s to leave more room, which may avoid this flake.
/kind flake
Signed-off-by: Yongxuan Zhang yongxuanzhang@google.com