Intermittent deadlocks in MutationCoverageReport process #1047

Open

davidburstrom opened this issue Jul 2, 2022 · 15 comments
@davidburstrom
Contributor

davidburstrom commented Jul 2, 2022

Intermittently, the MutationCoverageReport process deadlocks while waiting for the minions to die, even though there are no minion processes alive according to jps. It's as if some signalling gets randomly dropped, maybe due to a concurrency issue.

I've seen this across various versions of Pitest, most recently 1.9.0, run through the Gradle plugin info.solidsoft.pitest:1.7.4. It happens on both Mac and Windows.

"pool-1-thread-3" #17 prio=5 os_prio=31 tid=0x000000015998e800 nid=0x5807 waiting on condition [0x0000000178412000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x0000000673fa2310> (a java.util.concurrent.FutureTask)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
	at java.util.concurrent.FutureTask.get(FutureTask.java:191)
	at org.pitest.util.CommunicationThread.waitToFinish(CommunicationThread.java:59)
	at org.pitest.mutationtest.execute.MutationTestProcess.waitToDie(MutationTestProcess.java:46)
	at org.pitest.mutationtest.build.MutationTestUnit.waitForMinionToDie(MutationTestUnit.java:92)
	at org.pitest.mutationtest.build.MutationTestUnit.runTestInSeperateProcessForMutationRange(MutationTestUnit.java:85)
	at org.pitest.mutationtest.build.MutationTestUnit.runTestsInSeperateProcess(MutationTestUnit.java:68)
	at org.pitest.mutationtest.build.MutationTestUnit.call(MutationTestUnit.java:55)
	at org.pitest.mutationtest.build.MutationTestUnit.call(MutationTestUnit.java:30)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

The "pit communication" thread seems to wait for a connect that never arrives.

"pit communication" #19 daemon prio=5 os_prio=31 tid=0x000000014a198800 nid=0xa503 runnable [0x000000017882a000]
   java.lang.Thread.State: RUNNABLE
	at java.net.PlainSocketImpl.socketAccept(Native Method)
	at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
	at java.net.ServerSocket.implAccept(ServerSocket.java:562)
	at java.net.ServerSocket.accept(ServerSocket.java:530)
	at org.pitest.util.SocketReadingCallable.call(SocketReadingCallable.java:27)
	at org.pitest.util.SocketReadingCallable.call(SocketReadingCallable.java:11)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.lang.Thread.run(Thread.java:748)
@hcoles
Owner

hcoles commented Jul 4, 2022

@davidburstrom Thanks for the report.

I've (unsurprisingly) not been able to reproduce this, so a fix is going to be difficult (not helped by the fact that the code in this area is old and deeply unpleasant).

Are any of the minion processes left alive when this occurs? Or is only the main process still running?

@davidburstrom
Contributor Author

davidburstrom commented Jul 4, 2022

There are no minions left alive when this occurs, and I don't see any particular error message either. Is there any built-in mechanism to provide better minion logging, so it's possible to check whether the process dies from unnatural causes?

@hcoles
Owner

hcoles commented Jul 6, 2022

@davidburstrom If you set verbose=true pitest will spit out a lot more info.
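For the Gradle plugin, that roughly corresponds to something like the following (a sketch assuming the info.solidsoft.pitest Kotlin DSL exposes pitest's verbose option as a lazy property, as its documentation suggests):

pitest {
    // Forwarded to pitest's own verbose flag (assumed property name).
    verbose.set(true)
}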

hcoles pushed a commit that referenced this issue Jul 6, 2022
Possible mitigation for #1047
@hcoles
Owner

hcoles commented Jul 6, 2022

@davidburstrom I've just released pitest 1.9.1. This contains a small change to test the theory that the hangs are caused by the final 'done' signal from the minion getting lost.

Could you try this out once it's synced through and report back on whether the hangs still occur?

@davidburstrom
Contributor Author

davidburstrom commented Jul 6, 2022

I can try that out! Will that require verbose=true to be detected, or will it be obvious from the context?

@hcoles
Owner

hcoles commented Jul 7, 2022

@davidburstrom The change just resends the signal a few times. So, if the issue is that it's getting lost, it should eliminate the problem (or at least reduce the frequency with which it happens).
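As an illustration only, that kind of mitigation might look something like this on the minion side (hypothetical names in a Kotlin sketch, not the actual pitest code):

import java.io.DataOutputStream

// Resend the 'done' signal a few times so a single lost write does not
// leave the main process waiting forever (hypothetical sketch).
fun signalDone(out: DataOutputStream, doneCommand: Int) {
    repeat(3) {
        out.writeByte(doneCommand)
        out.flush()
    }
}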

If the volume of output doesn't cause you any problems, enabling verbose would still be good. If you do get another hang, it might give a bit more info to work from.

@davidburstrom
Contributor Author

I seem to have found the root cause of the issue: the minion process wasn't able to allocate the memory it required, so it died before connecting to the socket. This is in a CI context where concurrent processes starve each other. There is no indication that the coverage report process detects this. I reckon it's testable by supplying bogus JVM arguments to the minion process.

@hcoles
Owner

hcoles commented Jul 7, 2022

@davidburstrom

Replicated with "-Xmx1k"

Thanks, I'll have a look at what to do about it. Not sure whether the right thing to do is to fail fast with an error, or try relaunching first.
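For anyone else wanting to reproduce this, a minimal route via the Gradle plugin might be the following (a sketch assuming the plugin's jvmArgs option is forwarded to the minion JVMs):

pitest {
    // Deliberately impossible heap so the minion dies before it can
    // connect back to the main process.
    jvmArgs.set(listOf("-Xmx1k"))
}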

@davidburstrom
Contributor Author

If you ask me, I'd say failing fast is the better option, rather than taking on responsibility for stabilizing with respect to the environment.

@hcoles
Owner

hcoles commented Jul 7, 2022

Which is happily also the easiest option.

hcoles pushed a commit that referenced this issue Jul 7, 2022
Root cause of #1047 seems to be processes that are unable to start due
to insufficient memory. This results in the main process hanging,
waiting for their signal.

At some point the checks that the minion processes are alive look to
have been removed. This change reintroduces them with a simple poll.
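A rough sketch of what such a liveness poll could look like (hypothetical names in Kotlin, not the actual pitest code):

import java.util.concurrent.Future
import java.util.concurrent.TimeUnit
import java.util.concurrent.TimeoutException

// Wait for the minion's 'done' signal, but periodically check that the
// minion process is still alive, so a crashed minion fails the run
// instead of hanging it (hypothetical sketch).
fun waitForMinionToDie(minion: Process, done: Future<Int>): Int {
    while (true) {
        try {
            return done.get(2, TimeUnit.SECONDS)
        } catch (e: TimeoutException) {
            if (!minion.isAlive) {
                throw RuntimeException("Minion exited abnormally without reporting a result")
            }
        }
    }
}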
@hcoles
Owner

hcoles commented Jul 8, 2022

@davidburstrom
Releasing a fix for this now as pitest 1.9.2. Instead of hangs you should now receive errors.

@davidburstrom
Contributor Author

Cool, I'm trying it out now!

@hcoles
Owner

hcoles commented Jul 8, 2022

Actually, my description of the behaviour is not quite right. If the coverage process pitest runs first dies, you'll get a fast error. If a process doesn't start, or dies during the mutation testing phase, you'll see run errors reported against each mutation.

The first scenario was easy to reproduce for a test. The second is not so easy (without first causing it at the coverage stage).

@davidburstrom
Contributor Author

I can confirm that I hit the fail-fast path, as in Exception in thread "main" org.pitest.util.PitError: Coverage generation minion exited abnormally! (MINION_DIED).

@davidburstrom
Contributor Author

Regarding the overarching stability question, I notice that the various spawned Pitest processes (MutationCoverageReport, CoverageMinion, MutationTestMinion) are started with default memory allowances, which on my machine means 1 GB is allocated up front per process, far more than necessary. Depending on the number of threads, that can add up to e.g. 5 GB per execution.

Since I'm using the Gradle plugin, I've tried experimenting with setting

mainProcessJvmArgs.add("-Xmx100m")
childProcessJvmArgs.add("-Xmx100m")

but these aren't respected. I see that neither the MutationCoverageReport nor the CoverageMinion process picks them up.

I'd say it's necessary to support custom memory settings in order to stabilize execution locally and in CI.
