Flake: entrypoint test TestRealWaiterWaitWithContent #5254

Open
lbernick opened this issue Aug 2, 2022 · 2 comments · Fixed by #5626
Labels
kind/flake Categorizes issue or PR as related to a flakey test

Comments

@lbernick
Member

lbernick commented Aug 2, 2022

=== RUN   TestRealWaiterWaitWithContent
    waiter_test.go:123: expected Wait() to have detected a non-zero file size by now
@lbernick added the kind/flake label on Aug 2, 2022
@lbernick
Member Author

Same error for TestRealWaiterWaitWithBreakpointOnFailure on #5622

=== RUN   TestRealWaiterWaitWithBreakpointOnFailure
    waiter_test.go:191: expected Wait() to have detected a non-zero file size by now

Thanks @bendory for spending some time looking into this issue. He was not able to get it to flake locally, and I wasn't able to get it to flake when running locally in a container, so I'm not sure why this is still happening on CI. Unfortunately, that makes it hard to know whether a given fix works.
His description of what's probably going on:

  • there is no guarantee of when the spawned goroutine kicks off
  • it may not start until after the Timer is created
  • the timer ticks pretty quickly -- you've only got 20ms max
  • assume the test goroutine doesn't yield until it blocks in the select{}; you've now got <20ms because the Timer is already running
  • the realWaiter now starts polling -- but it depends on i/o
  • if i/o now takes > 20ms for the OS to report the existence of the non-empty file, the test fails

Two possible ideas for why:

  • Based on https://itnext.io/temporary-storage-for-kubernetes-pods-f8330ad8db88: this says "to create a usable file system from container layers, a storage driver has to do extra work, which has its performance penalties" and that you should use emptyDir volumes rather than container memory for "applications that work a lot with temporary storage, and performance is crucial". It sounds like writing to a tempfile on your laptop is a lot faster than writing to a tempfile within a container, which might be why the test is timing out on CI but not locally.
  • If multiple instances of this test are somehow running concurrently on the same pod (which I don't think they should be?), race conditions could cause flakiness, since the temp file names we are using are deterministic.

Possible mitigations:

  • just increase the timeout (doesn't fix underlying problem but definitely easiest)
  • use a filesystem mocking library like afero (a decent amount of work compared to increasing the timeout)
  • something similar to the strategy proposed in this article, although that is based on io.Reader, while our entrypoint tests just check whether the file has contents

@jerop
Member

jerop commented Nov 3, 2023

Flaked again in #7328 🤔

@jerop reopened this on Nov 3, 2023
@github-project-automation (bot) moved this from Done to In Progress in Tekton Community Roadmap on Nov 3, 2023