
Messages to actors are lost under heavy load #204

Closed
scameronde opened this issue Jun 16, 2016 · 20 comments

@scameronde

Sometimes, under heavy load with cascading messages, some messages seem to get lost. When monitoring the actors, the number of received messages for some actors is too low.

I have attached a small sample project. I have to run the sample 3-5 times for the problem to occur.

Please keep in mind, that I am new to Quasar and might be doing something completely wrong.

Test environment
Macbook Pro:

  • 2,5 GHz Intel Core i7
  • 16 GB RAM
  • OSX 10.11.5

Java:

  • Oracle Java 1.8.0_60-b27 server vm
  • Maven 3.0.5 to build and run

bugreport.zip
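
For readers without the attachment, the setup described is roughly a parent actor spawning many children and cascading each message it receives to all of them. A minimal sketch of that shape (hypothetical names and messages, not the code from bugreport.zip):

    import co.paralleluniverse.actors.ActorRef;
    import co.paralleluniverse.actors.BasicActor;
    import co.paralleluniverse.fibers.SuspendExecution;

    // Hypothetical sketch: a parent actor spawns children and cascades
    // ("forwards") every message it receives to all of them.
    class ChildActor extends BasicActor<String, Void> {
        @Override
        protected Void doRun() throws InterruptedException, SuspendExecution {
            for (;;) {
                String msg = receive();
                if ("shutdown".equals(msg))
                    return null;              // leave the doRun loop
            }
        }
    }

    class ParentActor extends BasicActor<String, Void> {
        private final int numChildren;

        ParentActor(int numChildren) { this.numChildren = numChildren; }

        @Override
        protected Void doRun() throws InterruptedException, SuspendExecution {
            @SuppressWarnings("unchecked")
            ActorRef<String>[] children = new ActorRef[numChildren];
            for (int i = 0; i < numChildren; i++)
                children[i] = new ChildActor().spawn();
            for (;;) {
                String msg = receive();
                for (ActorRef<String> child : children)
                    child.send(msg);          // cascade the message downwards
                if ("shutdown".equals(msg))
                    return null;
            }
        }
    }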

@circlespainter
Member

Hi, I have tried many times but I've been unable to reproduce it. My setup is: MBP5,1 (2.4 GHz Core 2 Duo, 2 cores, HT disabled in hardware), 8 GB RAM, OSX 10.11.5, Oracle JDK 1.8.0_92-b14.

I made only a few minor changes; the most relevant ones are logging all actions, explicitly allocating a 4 GB heap (with your constants it crashes with an OOM here), and removing Guava 15 from the dependencies: Quasar depends on 19.0 and, since I think Maven uses a nearest-wins (tree-wise) conflict resolution strategy, this could perhaps cause problems.

Could you try again with my version?

quasar-204.zip

@scameronde
Author

Hi, I changed my setup step by step to eliminate potential culprits.

  1. removed Guava 15 from dependencies: problem still exists
  2. set max memory to 4 GB: problem still exists
  3. set min memory to 4 GB: problem still exists
  4. upgraded to JDK 1.8.0_92-b14: problem still exists
  5. used your System.err.println statements: problem gone.

Looks like a timing issue, because the logging slows everything down.
In a second test, I redirected the output to a file on an SSD, with the same result: the problem is gone.
Then I removed most of your logging statements and voilà: the problem is back.

It smells like a race condition that you only hit with unlucky timing.

@circlespainter
Member

Yes, and it must happen in a very small time window, because I can't reproduce it even with in-memory logging (attached).

Thanks for investing the time; if you can spend a bit more, it would help to try with a mailbox size of 0. Less interesting, but please also try this new version, just to get some empirical data on the best race-finding setups (more cores could increase the chances).

quasar-204.zip
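
For reference, a mailbox size of 0 with a blocking overflow policy can be configured roughly like this when constructing the actor (a sketch assuming Quasar 0.7's MailboxConfig and OverflowPolicy; the actor itself is hypothetical):

    import co.paralleluniverse.actors.BasicActor;
    import co.paralleluniverse.actors.MailboxConfig;
    import co.paralleluniverse.fibers.SuspendExecution;
    import co.paralleluniverse.strands.channels.Channels.OverflowPolicy;

    // Sketch: an actor whose mailbox has size 0 and BLOCK as the overflow policy,
    // so senders block instead of messages being dropped or an exception thrown.
    class ZeroMailboxActor extends BasicActor<String, Void> {
        ZeroMailboxActor() {
            super(new MailboxConfig(0, OverflowPolicy.BLOCK));
        }

        @Override
        protected Void doRun() throws InterruptedException, SuspendExecution {
            for (;;) {
                if ("shutdown".equals(receive()))
                    return null;
            }
        }
    }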

@scameronde
Author

I did two tests:

  1. with your in-memory logging: still about 1 out of 5 runs does not finish.
  2. with a mailbox size of 0 and BLOCK as the overflow policy: 1 in 10 runs did not finish.

With (1) I had to increase the heap to 8 GB, because I got OOMs with 4 GB:

Exception in Fiber "fiber-10343926" Exception in Fiber "fiber-10250341" Exception in Fiber "fiber-10516198" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at eu.lindenbaum.lab.racetest.Util.log(Util.java:16)
at eu.lindenbaum.lab.racetest.ChildActor.doRun(ChildActor.java:22)
at eu.lindenbaum.lab.racetest.ChildActor.doRun(ChildActor.java:8)
at co.paralleluniverse.actors.Actor.run0(Actor.java:691)
at co.paralleluniverse.actors.ActorRunner.run(ActorRunner.java:51)
at co.paralleluniverse.fibers.Fiber.run(Fiber.java:1027)

If you need further help identifying the problem, just tell me what I should do.

@circlespainter
Member

Thanks, are you using 10 parent actors and 100k children each?

@circlespainter
Member

Also, can you confirm from the logs that the issue is due to lost messages and not something else? If so, is it possible to tell how many are lost?

@scameronde
Author

I tried it with different numbers: 10 parents with 100k children each, and 1k parents with 1k children each, with the same result. It seems to be the total number of children (actors) that is the problem; both cases have 1 million actors. When I lower the numbers the problem happens less often; when I go higher it happens more often.

I did not use log messages to monitor the behavior, but the JMX console of JVisualVM / Mission Control. The number of received messages the parent actors report via JMX is off. When everything is OK and all actors terminate, the reported number of received messages is correct. When some of the actors do not terminate, sometimes one actor reports fewer than the expected number of received messages, sometimes two or three actors do. Sometimes only one message for a parent actor is missing, sometimes 80%.

@circlespainter
Member

Thanks, we'll look into it.

@scameronde
Author

scameronde commented Jun 23, 2016

I investigated the situation a bit more by monitoring sent and received messages.
The messages that get lost are of type LifecycleMessage (ExitMessage).

The root actor watches 10 parent actors. Each parent actor watches 100,000 child actors. All child actors receive the shutdown message sent by my code, and all child actors exit their doRun loop. I am using JMX counters to verify that the numbers are correct.
When the problem occurs, most of the time 9 out of 10 parent actors receive the expected number of ExitMessages and only one parent actor has problems. The number of missing messages for this one parent actor is always a multiple of 10. I have seen 10, 20 or 30 messages missing.

Maybe this helps finding the problem.

@circlespainter
Member

Thanks, I think it does. Lifecycle messages are pretty normal messages, except that they're produced and consumed by watch- and link-related logic (and filterMessage/handleLifecycleMessage). The problem could be there, because if no normal message is ever lost, then we can probably exclude a mailbox/queue issue.
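
To illustrate the path in question: when a parent watches a child, the child's ExitMessage is not delivered like a normal message but goes through filterMessage/handleLifecycleMessage. A rough sketch of a watching parent counting those exits (hypothetical names, not the code from the attached sample):

    import java.util.concurrent.TimeUnit;

    import co.paralleluniverse.actors.ActorRef;
    import co.paralleluniverse.actors.BasicActor;
    import co.paralleluniverse.actors.ExitMessage;
    import co.paralleluniverse.actors.LifecycleMessage;
    import co.paralleluniverse.fibers.SuspendExecution;

    // Sketch: a parent that watches its children; their ExitMessages arrive via
    // the lifecycle path (watch logic + handleLifecycleMessage), not as normal messages.
    class WatchingParent extends BasicActor<String, Void> {
        private final int numChildren;
        private int exited = 0;

        WatchingParent(int numChildren) { this.numChildren = numChildren; }

        @Override
        protected Void doRun() throws InterruptedException, SuspendExecution {
            for (int i = 0; i < numChildren; i++) {
                ActorRef<String> child = new ChildActor().spawn();
                watch(child);            // request an ExitMessage when the child terminates
                child.send("shutdown");
            }
            while (exited < numChildren)
                receive(100, TimeUnit.MILLISECONDS);  // lifecycle messages are handled during receive
            return null;
        }

        @Override
        protected String handleLifecycleMessage(LifecycleMessage m) {
            if (m instanceof ExitMessage)
                exited++;                // one watched child has terminated
            return super.handleLifecycleMessage(m);
        }

        static class ChildActor extends BasicActor<String, Void> {
            @Override
            protected Void doRun() throws InterruptedException, SuspendExecution {
                receive();               // wait for the shutdown message, then terminate
                return null;
            }
        }
    }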

@circlespainter
Member

Could you run again with Quasar's "flight recorder" turned on and share the output file? It records many internal events in memory (including actor-related ones) and then dumps them on exit. You can enable it with -Dco.paralleluniverse.debugMode=true and have a look at more options here.

@scameronde
Author

I tried it with the following settings:

        <argument>-Dco.paralleluniverse.fibers.detectRunawayFibers=false</argument>
        <argument>-Dco.paralleluniverse.fibers.verifyInstrumentation=false</argument>
        <argument>-Dco.paralleluniverse.debugMode=true</argument>
        <argument>-Dco.paralleluniverse.flightRecorderDumpFile=quasar.log</argument>
        <argument>-Dco.paralleluniverse.monitoring.flightRecorderLevel=5</argument>
        <argument>-Dco.paralleluniverse.globalFlightRecorder=true</argument>
        <argument>-Dco.paralleluniverse.monitoring.flightRecorderSize=200000</argument>

The flight recorder slowed everything down from a few seconds of runtime to more than an hour with heavy system load. The problem did not occur in two runs (I have not had the time for more runs yet).

How can I tweak the parameters to get a better runtime, increase the chance of the problem occurring, and squeeze more runs into one day?

@circlespainter
Member

Sorry, I didn't expect such long runtimes. Level 1 should already produce interesting information but if it doesn't help we can try with a custom Quasar build. Thanks for your support.

@scameronde
Author

When logging at level 1 and increasing the size of the flight recorder, the problem occurs again. The problem is that the program does not terminate, because some actors are still running due to missing lifecycle messages. It seems that the flight recorder log is only written to disk when the program ends.

My idea is to install a shutdown hook that catches a kill signal and writes the log via an API call. Which API call do I have to make to force writing of the log?
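
As a sketch of that idea, a JVM shutdown hook could trigger the dump on a kill signal; the actual Quasar call to force the dump is not named in this thread, so it is left as a placeholder below:

    // Sketch of the shutdown-hook idea: on a kill signal (SIGTERM) the hook runs
    // before the JVM exits and would force the flight-recorder dump there.
    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
        System.err.println("Shutting down, dumping flight recorder...");
        // Placeholder: invoke Quasar's flight-recorder dump API here.
        // The exact call is not named in this thread and is left as an assumption.
    }));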

@pron
Contributor

pron commented Jun 30, 2016

Hi @scameronde . We've been able to reproduce the bug on a different machine, and are looking into it.

@scameronde
Author

Hi. How is the fix coming along? Did you manage to identify the cause of the problem?

@pron
Contributor

pron commented Jul 29, 2016

I haven't had the time to work on the issue lately, but I'll devote next week to fixing it, as well as a couple of other important Quasar bugs. I hope and expect you'll have a fix by the end of next week.

@linkerlin

@pron ExecutorService will drop some messages if the queue of Callables is full.

@pron pron closed this as completed in 97ef7c4 Aug 3, 2016
@pron
Contributor

pron commented Aug 3, 2016

@scameronde The issue should be fixed now in 0.7.6-SNAPSHOT. It was a subtle race in the implementation of the unbounded queue used by the actor mailbox.

@scameronde
Author

Thanks! I will test it this weekend.
