Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test Flake: Unit test cmd/entrypoint TestRealRunnerTimeout #4643

Open
lbernick opened this issue Mar 3, 2022 · 10 comments
Open

Test Flake: Unit test cmd/entrypoint TestRealRunnerTimeout #4643

lbernick opened this issue Mar 3, 2022 · 10 comments
Assignees
Labels
kind/flake Categorizes issue or PR as related to a flakey test lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@lbernick
Copy link
Member

lbernick commented Mar 3, 2022

Sadly not much info present in log file:

=== RUN   TestRealRunnerTimeout
    runner_test.go:37: step didn't timeout
--- FAIL: TestRealRunnerTimeout (0.03s)

/kind flake
/priority important-soon

@tekton-robot tekton-robot added kind/flake Categorizes issue or PR as related to a flakey test priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Mar 3, 2022
@lbernick
Copy link
Member Author

lbernick commented Mar 3, 2022

This could be fixed by rewriting waiter.Wait to use the k8s Clock.Sleep, and the test to use FakeClock.Sleep.

@dibyom dibyom added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Apr 25, 2022
@tekton-robot
Copy link
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 24, 2022
@abayer
Copy link
Contributor

abayer commented Aug 11, 2022

/lifecycle frozen

I hit this one while trying to reproduce the various events-related flakes, so it's still a thing. I thought that maybe increasing the timeout in the test would help, but I seem to still be hitting it...

@tekton-robot tekton-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 11, 2022
@abayer
Copy link
Contributor

abayer commented Aug 11, 2022

Oh, and I reproduced it with this:

while go test -race -count 100 -v ./cmd/entrypoint > flake-log; do :; done

@abayer
Copy link
Contributor

abayer commented Aug 11, 2022

...and now I can't reproduce it at all! Fun times.

@abayer
Copy link
Contributor

abayer commented Aug 11, 2022

Yeah, it failing or not depended on what I had for -count and the overall load on my system, not on anything flaky. Sigh. Well, it's still a flake, just one I can't come up with a fix for...

@Yongxuanzhang
Copy link
Member

I'm looking into this issue and I'm not sure what the testcase wants to test. From the test description it wants to test that the runner will be killed after a millisecond (though it wants to sleep 10 ms?),

// TestRealRunnerTimeout tests whether cmd is killed after a millisecond even though it's supposed to sleep for 10 milliseconds.

and the DeadlineExceeded error can be retuned in 2 places:
if err := cmd.Start(); err != nil {
if ctx.Err() == context.DeadlineExceeded {
return context.DeadlineExceeded
}
return err
}

if err := cmd.Wait(); err != nil {
if ctx.Err() == context.DeadlineExceeded {
return context.DeadlineExceeded
}
return err

It seems that the DeadlineExceeded can still be returned if the runner really sleep 10 ms?

@lbernick
Copy link
Member Author

lbernick commented Dec 2, 2022

I'm looking into this issue and I'm not sure what the testcase wants to test. From the test description it wants to test that the runner will be killed after a millisecond (though it wants to sleep 10 ms?),

The command here that the entrypoint is supposed to be running is "sleep 10ms". This test ensures that the runner can be timed out in the appropriate time. i.e. it wants to ensure that the runner times out after 1ms even though the command is not done.

It seems that the DeadlineExceeded can still be returned if the runner really sleep 10 ms?

Maybe? If so, that would mean that our tests could pass because the runner finished running and not because it got timed out correctly, which isn't great. (I suspect our timeout logic works correctly but ideally our tests should tell us this!) To test out whether or not this could happen, you could set the timeout to a very long time, and see if the tests still pass when the "sleep" command completes after 10ms.

@Yongxuanzhang
Copy link
Member

Yongxuanzhang commented Dec 2, 2022

From the test and code, it uses context.WithTimeout(context.Background(), timeout) this timeout ctx to count the time, and the ctx is also added to the cmd, right now the error is returned here:

if err := cmd.Start(); err != nil {

And if we jump into the Start(), it will return before the process is created to execute the cmd b/c the ctx is already done with timeout err, it this expected?
image

So I'm curious if the test is really testing what it says. Or maybe my understanding is wrong

@Yongxuanzhang
Copy link
Member

/assign

Yongxuanzhang added a commit to Yongxuanzhang/pipeline that referenced this issue Mar 21, 2023
This commit increase the timer for TestRealRunnerTimeout and hope this
could reduce the flake of tektoncd#4643.

Some thoughts about why tektoncd#4643 happened.The flaky test got "step didn't timeout", which means that the rr.Run
doesn't return any errors, including the DeadlineExceeded error. It
could be that the context timeout is accidentaly larger than the sleep
time and the Run finishes without context timeout. So I think we may
increase the sleep time to avoid this flake. Even it is already a rare
case.

Signed-off-by: Yongxuan Zhang yongxuanzhang@google.com
Yongxuanzhang added a commit to Yongxuanzhang/pipeline that referenced this issue Mar 21, 2023
This commit increase the timer for TestRealRunnerTimeout and hope this
could reduce the flake of tektoncd#4643.

Some thoughts about why tektoncd#4643 happened.The flaky test got "step didn't timeout", which means that the rr.Run
doesn't return any errors, including the DeadlineExceeded error. It
could be that the context timeout is accidentaly larger than the sleep
time and the Run finishes without context timeout. So I think we may
increase the sleep time to avoid this flake. Even it is already a rare
case.

Signed-off-by: Yongxuan Zhang yongxuanzhang@google.com
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/flake Categorizes issue or PR as related to a flakey test lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
Projects
Status: Todo
Development

No branches or pull requests

5 participants