
Race Condition in parallel_tests integration: an outdated report may be written #815

Open
PragTob opened this issue Jan 11, 2020 · 5 comments


@PragTob
Collaborator

PragTob commented Jan 11, 2020

Discovered/verified in #814

Basically, it appears we can generate the correct report but then overwrite it with an outdated one. Discovered while running the parallel-tests part of the features introduced in #841 (currently tagged wip to avoid flakies; remove the wip tag if you want to test this out). It doesn't happen reliably: it might happen 3 times in a row or not at all for 20 runs (for me at least).

I gathered some examples here: #814 (comment)

What you can see there is something like this:

Coverage report generated for (2/8) to /home/tobi/github/simplecov/tmp/aruba/project/coverage. 23 / 33 LOC (69.7%) covered.
Coverage report generated for (1/8), (2/8), (3/8) to /home/tobi/github/simplecov/tmp/aruba/project/coverage. 37 / 43 LOC (86.05%) covered.
Coverage report generated for (1/8), (2/8), (3/8), (4/8) to /home/tobi/github/simplecov/tmp/aruba/project/coverage. 42 / 47 LOC (89.36%) covered.
Coverage report generated for (1/8), (2/8) to /home/tobi/github/simplecov/tmp/aruba/project/coverage. 32 / 39 LOC (82.05%) covered.

Notice that the second-to-last line has the right LOC count and the coverage expected by the test, and it also seems to take everything into account (unsure why it says n/8 though; my guess is that I could run 8 in parallel, but parallel_tests only spawned 4 as there are only 4 test files), yet it is seemingly overridden later by an outdated report.

If possible we should of course find and fix that race condition, and then remove the wip tags to always run these tests!
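For illustration only, here is a minimal, self-contained Ruby sketch of the suspected failure mode (this is not SimpleCov's actual code, and every name in it is made up): each process merges whatever results it currently sees and then writes the shared report, so a process holding a stale view can finish its write after the complete one and clobber it.

require "json"
require "tmpdir"

# Hypothetical last-writer-wins illustration: four forked processes each
# merge the results visible to them and write the same report file.
# Whichever write lands last wins, so a stale view can overwrite the
# complete one.
report  = File.join(Dir.tmpdir, "report.json")
results = { "1/4" => 10, "2/4" => 12, "3/4" => 8, "4/4" => 7 }

pids = results.keys.each_with_index.map do |_name, i|
  Process.fork do
    seen   = results.keys.first(i + 1)   # stale view: only some results visible
    merged = seen.sum { |key| results[key] }
    sleep(rand / 10)                     # finishing order is not deterministic
    File.write(report, JSON.dump("commands" => seen, "covered" => merged))
  end
end
pids.each { |pid| Process.wait(pid) }

puts File.read(report)                   # may or may not list all four commands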

PragTob added a commit that referenced this issue Jan 11, 2020
It's WIP because it's flaky at the moment and we don't want
our CI to be flaky. See #815
PragTob added a commit that referenced this issue Jan 12, 2020
It's WIP because it's flaky at the moment and we don't want
our CI to be flaky. See #815
@nickshatilo

+1 here

@benissimo

benissimo commented Feb 19, 2020

Does bumping up merge_timeout mitigate this? It looks like the number of tests involved is small enough that the default timeout of 10 minutes should be more than enough. Just curious whether that's a factor.

I ran into the following case with some long running test suites:

[osm.jruby] Using 'RSpecdevn3' as SimpleCov command name
[osm.jruby] Using 'RSpecdevn1' as SimpleCov command name
[osm.jruby] Using 'RSpecdevn4' as SimpleCov command name
[osm.jruby] Using 'RSpecdevn2' as SimpleCov command name
[osm.jruby] Coverage report generated for RSpecdevn3 to rails/coverage. 44377 / 168731 LOC (26.3%) covered.
[osm.jruby] Coverage report generated for RSpecdevn2, RSpecdevn3 to rails/coverage. 69229 / 156447 LOC (44.25%) covered.
[osm.jruby] Coverage report generated for RSpecdevn1, RSpecdevn2 to rails/coverage. 73551 / 150657 LOC (48.82%) covered.
[osm.jruby] Coverage report generated for RSpecdevn1, RSpecdevn2, RSpecdevn4 to ...rails/coverage. 86913 / 144881 LOC (59.99%) covered.

and the fix, in my case, was to bump up merge_timeout.

That might be a separate issue with similar symptoms. Mentioning it here in case others googling for this fall into the same case.
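For anyone who lands here with that symptom, bumping the timeout is a one-liner in the SimpleCov config; a minimal sketch (the 3600 value is arbitrary, pick something comfortably longer than your slowest parallel group):

# .simplecov (or the top of spec_helper.rb)
require "simplecov"

SimpleCov.start do
  merge_timeout 3600   # seconds; the default is 600 (10 minutes)
end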

@PragTob
Collaborator Author

PragTob commented Feb 19, 2020

@benissimo similar symptoms, separate issue (more user configuration)

In the local test case the tests finish within ~2 seconds, so they're nowhere near being impacted by the merge timeout.

@PragTob
Collaborator Author

PragTob commented Aug 11, 2020

This might be related to us not doing the check correctly, which I ask about/discuss in grosser/parallel_tests#772. Once that is resolved and the changes are implemented, this will be worth re-checking.
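For context, the pattern the parallel_tests README documents for running something exactly once after every process has finished looks roughly like this (a sketch; whether SimpleCov's own exit hook does the equivalent check correctly is what the linked issue is about):

# spec_helper.rb — sketch of the parallel_tests after-suite pattern: only the
# first process waits for the others and then does the final work, so a
# slower process can't overwrite the complete result afterwards.
require "parallel_tests"

RSpec.configure do |config|
  config.after(:suite) do
    if ParallelTests.first_process?
      ParallelTests.wait_for_other_processes_to_finish
      # e.g. collate the final coverage report here
    end
  end
end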

@PragTob
Collaborator Author

PragTob commented Aug 16, 2020

It seems to be better now but not fixed. CI now occasionally fails with:

(::) failed steps (::)

Expected `bundle exec parallel_rspec spec` to succeed but got non-zero exit status and the following output:

2 processes for 4 specs, ~ 2 specs per process
...

Finished in 0.02936 seconds (files took 0.67039 seconds to load)
3 examples, 0 failures

...

Finished in 0.02201 seconds (files took 0.6368 seconds to load)
3 examples, 0 failures

Coverage report generated for (2/2) to /home/runner/work/simplecov/simplecov/tmp/aruba/project/coverage. 28 / 37 LOC (75.68%) covered.

which, funnily enough, has happened on JRuby the 2-3 times I've seen it, and only for that spec. Needs a fix soon-ish, or a wip tag on the cuke, so as not to be annoying/make people think they did something wrong.
