Intermittent deadlocks in MutationCoverageReport process #1047
Comments
@davidburstrom Thanks for the report. I've (unsurprisingly) not been able to reproduce this, so a fix is going to be difficult (not helped by the fact that the code in this area is old and deeply unpleasant). Are any of the minion processes left alive when this occurs? Or is only the main process still running?
There are no minions left alive when this occurs. I don't see any particular error message either. Is there any built-in mechanism available to provide better minion logging, so it's possible to check if the process dies from unnatural causes?
@davidburstrom If you set verbose to true you'll get much more detailed logging output.
@davidburstrom I've just released pitest 1.9.1. This contains a small change to test the theory that the hangs are caused by the final 'done' signal from the minion getting lost. Could you try this out once it's synced through and report back on whether the hangs still occur?
I can try that out! Will that require verbose logging to be enabled as well?
@davidburstrom The change just resends the signal a few times. So, if the issue is that it's getting lost it should eliminate the problem (or at least reduce the frequency with which it happens). If the volume of output doesn't cause you any problems, enabling verbose would still be good. If you do get another hang, it might give a bit more info to work from.
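For readers following along, a minimal sketch of the "resend the done signal" idea described above (hypothetical code, not the actual 1.9.1 change; it assumes the completion signal is a single marker byte written back to the main process):

```java
import java.io.IOException;
import java.io.OutputStream;

// Sketch only: write the completion marker a few times so that a single lost
// message does not leave the main process waiting forever.
final class DoneSignal {
    static void send(OutputStream toMainProcess, byte doneMarker) {
        for (int attempt = 0; attempt < 3; attempt++) {
            try {
                toMainProcess.write(doneMarker);
                toMainProcess.flush();
            } catch (IOException e) {
                // Best effort: the main process may already have closed the stream.
            }
        }
    }
}
```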
I seem to have found the root cause of the issue: the minion process wasn't able to allocate the required memory, so it died before connecting to the socket. This is in a CI context where concurrent processes starve each other of memory. There is no indication that the coverage report process detects that. I reckon this is testable by supplying bogus JVM arguments to the minion process.
Replicated with "-Xmx1k". Thanks, I'll have a look at what to do about it. Not sure whether the right thing to do is to fail fast with an error, or try relaunching first.
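As a stand-alone illustration (not pitest code) of what happens to a child JVM launched with an unusably small heap: it fails during VM initialization and exits before running any application code, which matches a minion dying before it can connect back.

```java
import java.io.IOException;

// Illustration only: a JVM started with -Xmx1k exits with a non-zero code
// during VM initialization, roughly what happens to a minion that cannot
// allocate its heap.
public class TinyHeapRepro {
    public static void main(String[] args) throws IOException, InterruptedException {
        Process child = new ProcessBuilder("java", "-Xmx1k", "-version")
                .inheritIO()
                .start();
        System.out.println("child exited with code " + child.waitFor());
    }
}
```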
If you ask me, I'd say failing fast is the better option, instead of assuming responsibility for stabilizing with respect to the environment.
Which is happily also the easiest option.
Root cause of #1047 seems to be processes that are unable to start due to insufficient memory. This results in the main process hanging, waiting for their signal. At some point the checks that the minion processes are alive look to have been removed. This change reintroduces them with a simple poll.
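As a rough illustration of the poll described in that change (a sketch only, not the actual pitest implementation; it assumes the main process waits on some completion latch per minion):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch of a liveness poll: instead of blocking indefinitely on the minion's
// completion signal, wake up every second and fail fast if the child process
// has already died.
final class MinionWatcher {
    static void await(Process minion, CountDownLatch done) throws InterruptedException {
        while (!done.await(1, TimeUnit.SECONDS)) {
            if (!minion.isAlive()) {
                throw new IllegalStateException(
                        "Minion exited with code " + minion.exitValue()
                        + " before signalling completion");
            }
        }
    }
}
```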
@davidburstrom
Cool, I'm trying it out now!
Actually, my description of the behaviour is not quite right. If the coverage process that pitest runs first dies, you'll get a fast error. If the process doesn't start or dies during the mutation testing phase, you'll see run errors reported against each mutation. The first scenario was easy to reproduce in a test. The second is not so easy (without first causing it at the coverage stage).
I can confirm that I came across the fail-fast error.
Regarding the overarching stability question, I notice that the various spawned Pitest processes (MutationCoverageReport, CoverageMinion, MutationTestMinion) are started with default memory allowances, which on my machine means allocating 1 GB up front, far more than necessary. Depending on the number of threads, this means e.g. 5 GB per execution. Since I'm using the Gradle plugin, I've tried experimenting with setting custom JVM memory arguments, but these aren't respected: neither the MutationCoverageReport nor the CoverageMinion picks them up. I'd say it's necessary to support custom memory settings in order to stabilize execution locally and in CI.
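A sketch of the behaviour being requested (hypothetical helper and names, not pitest's API): the minion command line built from user-supplied JVM arguments so the child heap can be capped, instead of inheriting the JVM's default maximum heap.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: assemble a minion launch command from arguments
// supplied in the build configuration rather than hard-coded defaults.
final class MinionCommandLine {
    static List<String> build(String javaExecutable, List<String> userJvmArgs, String minionMainClass) {
        List<String> cmd = new ArrayList<>();
        cmd.add(javaExecutable);
        cmd.addAll(userJvmArgs);     // e.g. ["-Xmx256m"] taken from the build configuration
        cmd.add(minionMainClass);    // e.g. the CoverageMinion entry point
        return cmd;
    }
}
```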
Sporadically, the MutationCoverageReport process deadlocks waiting for the minions to die, but there are no minion processes alive according to `jps`. It's as if there is some signalling that gets randomly dropped, maybe due to a concurrency issue. I've seen this across various versions of Pitest, most recently version 1.9.0, run through Gradle plugin info.solidsoft.pitest:1.7.4. This happens both on Mac and Windows.
The "pit communication" thread seems to wait for a connect that never arrives.