Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

increase timer for TestRealRunnerTimeout #6409

Closed
wants to merge 1 commit into from

Conversation

Yongxuanzhang
Copy link
Member

@Yongxuanzhang Yongxuanzhang commented Mar 21, 2023

Changes

This commit increase the timer for TestRealRunnerTimeout and hope this could mitigate/fix the flake of #4643.

Some thoughts about why #4643 happened. The flaky test got "step didn't timeout", which means that the rr.Run doesn't return any errors, including the DeadlineExceeded error. It could be that the context timeout is accidentally larger than the sleep time and the Run finishes without context timeout (This could be the sleep time decreases, or the ctx timeout increases).
If we look at those failed reports:
https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/5666/pull-tekton-pipeline-unit-tests/1584556112152104960
https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/5666/pull-tekton-pipeline-unit-tests/1584676793351147520
https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/5465/pull-tekton-pipeline-unit-tests/1570762197682884608
https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/6120/pull-tekton-pipeline-unit-tests/1623358151954796544

TestRealRunnerTimeout took either 0.09s or 0.08s but the passed case should take less time than that (includes code run time and ctx timeout time, sleep time if the process started, if we look at passed reports most of them should be less than 0.02s). So I suspect it may be the case where ctx timeout took much longer time, and the test finished (code run time+process sleep time) less than ctx timeout. And I think we may increase the sleep time to 0.2s to leave more room and may avoid this flake.

/kind flake

Signed-off-by: Yongxuan Zhang yongxuanzhang@google.com

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs included if any changes are user facing
  • Has Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including
    functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

NONE

This commit increase the timer for TestRealRunnerTimeout and hope this
could reduce the flake of tektoncd#4643.

Some thoughts about why tektoncd#4643 happened.The flaky test got "step didn't timeout", which means that the rr.Run
doesn't return any errors, including the DeadlineExceeded error. It
could be that the context timeout is accidentaly larger than the sleep
time and the Run finishes without context timeout. So I think we may
increase the sleep time to avoid this flake. Even it is already a rare
case.

Signed-off-by: Yongxuan Zhang yongxuanzhang@google.com
@tekton-robot
Copy link
Collaborator

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@tekton-robot tekton-robot added kind/flake Categorizes issue or PR as related to a flakey test release-note-none Denotes a PR that doesnt merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Mar 21, 2023
@tekton-robot tekton-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Mar 21, 2023
@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign dibyom after the PR has been reviewed.
You can assign the PR to them by writing /assign @dibyom in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 21, 2023
@Yongxuanzhang Yongxuanzhang marked this pull request as ready for review March 21, 2023 13:08
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 21, 2023
@tekton-robot tekton-robot requested a review from afrittoli March 21, 2023 13:08
@lbernick lbernick self-assigned this Mar 21, 2023
@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 30, 2023
@tekton-robot
Copy link
Collaborator

@Yongxuanzhang: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@lbernick
Copy link
Member

Thanks @Yongxuanzhang for looking into this!

I agree that "step didn't timeout" signals that rr.Run doesn't return any errors, but I'm not convinced by the explanation here that for whatever reason, "context timeout is accidentally larger than the sleep time".

I think this change will make it less likely that we observe test failure, but I'm concerned the problem is in the code rather than the tests. My theory of what's happening here:

The entrypoint waits for the stdout/stderr buffers here before waiting for the command to finish:

// Wait for stdout/err buffers to finish reading before returning.
wg.Wait()
// Wait for command to exit
if err := cmd.Wait(); err != nil {
if ctx.Err() == context.DeadlineExceeded {
return context.DeadlineExceeded
}
return err
}

If reading from stdout/stderr takes longer than the context timeout, I think the call to cmd.Wait will not return an error, because the command will have already exited with code 0. Reading from stdout/stderr could take extra time for the same reasons as in #5254 (comment).

This would imply a bug in the code rather than in the tests. I think it could be fixed by running the two Waits in different goroutines? The problem is that we don't have a reliable way of reproducing this bug.

I'm not sure if this is actually problematic in real scenarios or not.

@Yongxuanzhang
Copy link
Member Author

Yongxuanzhang commented Apr 3, 2023

Thanks @Yongxuanzhang for looking into this!

I agree that "step didn't timeout" signals that rr.Run doesn't return any errors, but I'm not convinced by the explanation here that for whatever reason, "context timeout is accidentally larger than the sleep time".

I think this change will make it less likely that we observe test failure, but I'm concerned the problem is in the code rather than the tests. My theory of what's happening here:

The entrypoint waits for the stdout/stderr buffers here before waiting for the command to finish:

// Wait for stdout/err buffers to finish reading before returning.
wg.Wait()
// Wait for command to exit
if err := cmd.Wait(); err != nil {
if ctx.Err() == context.DeadlineExceeded {
return context.DeadlineExceeded
}
return err
}

If reading from stdout/stderr takes longer than the context timeout, I think the call to cmd.Wait will not return an error, because the command will have already exited with code 0. Reading from stdout/stderr could take extra time for the same reasons as in #5254 (comment).

This would imply a bug in the code rather than in the tests. I think it could be fixed by running the two Waits in different goroutines? The problem is that we don't have a reliable way of reproducing this bug.

I'm not sure if this is actually problematic in real scenarios or not.

If I run the test locally it will return here:

// Start defined command
if err := cmd.Start(); err != nil {
if ctx.Err() == context.DeadlineExceeded {
return context.DeadlineExceeded
}
return err
}

This is a case that it will not reach the code you mentioned and I suspect this may also happen in our ci. (I can open a test PR to test this behaviour) That's why I commented in the original issue that I'm not sure if the test is testing what it originally wants to test

So I think maybe we should also update the context timeout value if this needs to be fixed?

@Yongxuanzhang
Copy link
Member Author

Thanks @Yongxuanzhang for looking into this!

I agree that "step didn't timeout" signals that rr.Run doesn't return any errors, but I'm not convinced by the explanation here that for whatever reason, "context timeout is accidentally larger than the sleep time".

I think this change will make it less likely that we observe test failure, but I'm concerned the problem is in the code rather than the tests. My theory of what's happening here:

The entrypoint waits for the stdout/stderr buffers here before waiting for the command to finish:

// Wait for stdout/err buffers to finish reading before returning.
wg.Wait()
// Wait for command to exit
if err := cmd.Wait(); err != nil {
if ctx.Err() == context.DeadlineExceeded {
return context.DeadlineExceeded
}
return err
}

If reading from stdout/stderr takes longer than the context timeout, I think the call to cmd.Wait will not return an error, because the command will have already exited with code 0. Reading from stdout/stderr could take extra time for the same reasons as in #5254 (comment).

This would imply a bug in the code rather than in the tests. I think it could be fixed by running the two Waits in different goroutines? The problem is that we don't have a reliable way of reproducing this bug.

I'm not sure if this is actually problematic in real scenarios or not.

Regarding on the possible reason you mentioned, if that is true then we should be able to reproduce this flake? We could add sleep time after the wg.wait() and test if there are no errors returned.

And I'm also not sure about this:

If reading from stdout/stderr takes longer than the context timeout, I think the call to cmd.Wait will not return an error,

If it takes longer time, why cmd.Wait() won't return error?

@lbernick
Copy link
Member

lbernick commented Apr 4, 2023

Regarding on the possible reason you mentioned, if that is true then we should be able to reproduce this flake? We could add sleep time after the wg.wait() and test if there are no errors returned.

Good idea! I tried this with sleep = 10s, timeout = 1s, and command = 5s at commit 538fee3. Interestingly what happens is rr.Run takes 10s (when it should time out after 1s), but then the tests pass.

However, #6162 removed the call to wg.Wait(), so there's nothing blocking in between the calls to cmd.Start and cmd.Wait.

My guess is that there may have been a bug, but if so it was fixed by #6162. However, it seems like the existing test we have doesn't reliably tell us whether the entrypoint times out correctly-- I wonder if we can find a way to address this.

And I'm also not sure about this:

If reading from stdout/stderr takes longer than the context timeout, I think the call to cmd.Wait will not return an error,

If it takes longer time, why cmd.Wait() won't return error?

I was wrong!

@Yongxuanzhang
Copy link
Member Author

Haven't seen the flaky for a long time, close this PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/flake Categorizes issue or PR as related to a flakey test needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note-none Denotes a PR that doesnt merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants