Unit and end-to-end tests are flakey #2672
Comments
I had a case on #2668: https://circleci.com/gh/fluxcd/flux/8904.
Thanks
It turns out unit tests are also flakey. Three instances: https://circleci.com/gh/fluxcd/flux/8957, https://circleci.com/gh/fluxcd/flux/8964 and https://circleci.com/gh/fluxcd/flux/8966. I believe this started happening as a result of adding two more cores to CircleCI (see #2647).
https://circleci.com/gh/fluxcd/flux/8984 is a PR rebased on #2674 and still fails to pass unit tests.
Without running them in parallel at least we get a simpler backtrace:

```
goroutine 274 [running]:
testing.(*M).startAlarm.func1()
	/usr/local/go/src/testing/testing.go:1377 +0x11c
created by time.goFunc
	/usr/local/go/src/time/sleep.go:168 +0x52

goroutine 1 [chan receive]:
goroutine 19 [chan receive]:
goroutine 193 [chan receive]:
goroutine 220 [syscall]:
goroutine 223 [IO wait]:
goroutine 222 [IO wait]:
```
I think it may be
Of course, as soon as I add debug printouts it doesn't fail anymore, grrrrr: https://circleci.com/gh/fluxcd/flux/9004
I managed to make it fail https://circleci.com/gh/fluxcd/flux/9008
I am pretty sure docker gets stuck when executing
Here is another failure instance: https://circleci.com/gh/fluxcd/flux/9013. In this case it fails in a different test case:
OK. I think I know what's happening (and I feel a bit stupid about not having figured it out before). TL;DR: we were being too conservative with the timeout we pass to `go test`. We set a timeout of 60s when running the tests. I had assumed that the timeout applied per test (i.e. per `Test*` function), but it actually applies to the whole test binary (sketched below).
It's not completely clear to me what a binary means here, but in the Makefile we pass all the flux Go packages to `go test`.
For instance, running the full set of tests fails, but filtering which tests run (so each test binary has less to do within its 60s budget) makes it pass.
So, I think this started happening when new tests were added to the package. Also, this didn't surface all that much because of caching (the tests were only re-run when the package was modified and, if you got lucky, they ran fast enough to get cached again). Since it's hard to control the caching, the failures only showed up sporadically. Jeez, this took way longer than it should have.
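To make the failure mode concrete, here is a minimal sketch of the behaviour (the exact packages and Makefile targets are assumptions, not copied from the repo):

```sh
# go test enforces -timeout per compiled test *binary* (one binary per
# package), not per Test* function. A package whose tests collectively need
# more than 60s aborts with "panic: test timed out after 1m0s", even if
# every individual test finishes quickly.
go test -timeout 60s ./...

# Giving each package's test binary a larger total budget avoids the abort:
go test -timeout 5m ./...
```

The `testing.(*M).startAlarm.func1` frame in the backtrace above is exactly that timeout alarm firing for the whole binary.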
Also (if the root cause didn't change after #2674), this means that parallelization wasn't really doing much, because the tests of that one package were the long pole anyway.
And ... here is another instance of e2e tests failing: https://circleci.com/gh/fluxcd/flux/9070. In this case, and for some reason, gitsrv cannot be reached on port 22.
And gitsrv failed to start again (in master): https://circleci.com/gh/fluxcd/flux/9090
I have seen CI builds on other platforms where port 22 was problematic.
As we discussed offline: it's only port 22 inside the pod; the port used on the client side (outside the cluster) is a different one.
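For illustration only, a hypothetical sketch of the kind of mapping meant here; the service name, local port and repo path are assumptions, not taken from the e2e scripts:

```sh
# gitsrv listens on 22 inside the pod; the client on the CI host reaches it
# through a locally forwarded port, so nothing has to bind port 22 on the
# build machine itself.
kubectl port-forward svc/gitsrv 2222:22 &

# The client then speaks SSH to the forwarded local port, not to 22.
GIT_SSH_COMMAND="ssh -o StrictHostKeyChecking=no" \
  git ls-remote ssh://git@localhost:2222/repo.git
```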
Here is another e2e test failing: https://circleci.com/gh/fluxcd/flux/9093 (from #2684). In this case I am not even sure about what's wrong.
The problem here seems to be:
Looking at the failing test, it seems that our checks are insufficient and there is a (small) possibility that we compare an empty string to the head hash:

```sh
# This does not ensure the tag was pushed by Flux,
# only that the sync was applied.
poll_until_equals "podinfo image" "stefanprodan/podinfo:3.1.5" "kubectl get pod -n demo -l app=podinfo -o\"jsonpath={['items'][0]['spec']['containers'][0]['image']}\""
# So it is possible that this does not result in a rev string.
git pull -f --tags
sync_tag_hash=$(git rev-list -n 1 flux)
[ "$head_hash" = "$sync_tag_hash" ]
```
Good catch. It should be easy to turn that final comparison into a poll instead of a one-shot check.
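A minimal sketch of what that could look like, written as plain bash rather than with the repo's own poll helper (retry count and sleep interval are arbitrary):

```sh
# Retry until the `flux` sync tag exists and points at HEAD, instead of
# comparing once and possibly matching against an empty string.
head_hash=$(git rev-list -n 1 HEAD)
for _ in $(seq 1 30); do
  git pull -f --tags >/dev/null 2>&1 || true
  sync_tag_hash=$(git rev-list -n 1 flux 2>/dev/null || true)
  if [ -n "$sync_tag_hash" ] && [ "$head_hash" = "$sync_tag_hash" ]; then
    exit 0
  fi
  sleep 2
done
echo "sync tag never caught up with HEAD" >&2
exit 1
```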
There is also this unit test failure, which may be a legitimate problem: https://circleci.com/gh/fluxcd/flux/9093
I think you mean the unit test failure in https://circleci.com/gh/fluxcd/flux/9109? My theory is that this is due to some CircleCI nodes having a better connection than others, resulting in a (temporary) rate limit from DockerHub because we are hitting their registry too hard. I think this can simply be avoided by mocking the registry instead of talking to DockerHub.
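One variant of that idea, sketched in shell rather than as an in-process mock (the image name is reused from the e2e snippet above; everything else is assumed): point the test at a throwaway local registry so it never queries DockerHub.

```sh
# Run a disposable registry and seed it with the image the test needs, so
# repeated tag/manifest queries hit localhost instead of DockerHub.
docker run -d --name test-registry -p 5000:5000 registry:2
docker pull stefanprodan/podinfo:3.1.5          # pulled once, not per test run
docker tag stefanprodan/podinfo:3.1.5 localhost:5000/podinfo:3.1.5
docker push localhost:5000/podinfo:3.1.5
# The test would then be configured to query localhost:5000/podinfo.
```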
Sorry, wrong link. I was referring to https://circleci.com/gh/fluxcd/flux/9158
After #2688 it seems like e2e tests are way more stable. I ran them ~30 times in a row at https://circleci.com/gh/fluxcd/workflows/flux/tree/reproduce-flakey-tests and they all passed, except for one occurrence of Kind failing to create a cluster, which I think we will have to live with. We still have the unit test failures from https://circleci.com/gh/fluxcd/flux/9158 and https://circleci.com/gh/fluxcd/flux/9109, but they seem to occur only seldom. I am going to step away from this issue for a while, but let's keep adding flakey tests here.
Another one just happened on
Another failure, due to kind failing to create a cluster: https://circleci.com/gh/fluxcd/flux/9444
Another failure: https://circleci.com/gh/fluxcd/flux/9462
Same failure, different build: https://circleci.com/gh/fluxcd/flux/9465
Another failure from Kind: https://circleci.com/gh/fluxcd/flux/9822
I created an upstream issue for this at kubernetes-sigs/kind#1288
Flakey execution of the image release e2e test: https://circleci.com/gh/fluxcd/flux/9930
I actually don't think it was a flakey test, since it failed 3 times in a row. There must be a bug in that PR. EDIT: it wasn't.
Flakey policy update unit test: https://circleci.com/gh/fluxcd/flux/9984
Policy update unit test, and panic in releaser unit test due to registry timeout: https://circleci.com/gh/fluxcd/flux/9998
Kind initialization error: https://circleci.com/gh/fluxcd/flux/10092. I will report it upstream.
I have not seen any flaky tests so far, except for the one I narrowly managed to avoid merging in with the last release (my first Flux Daemon release, 1.21.2). Thank you for documenting this. I will reopen (or, more likely, start a new issue so you are not bothered with it) if it turns out that there are still flaky tests while Flux v1 is in maintenance.
They fail from time to time, e.g. https://circleci.com/gh/fluxcd/flux/8957
@hiddeco @squaremo @stefanprodan please add more cases here (if you see them) so that I can fix them.