
Messages to actors are lost under heavy load #204

Closed
scameronde opened this issue Jun 16, 2016 · 20 comments

@scameronde

Sometimes, under heavy load with cascading messages, some messages seem to get lost. When monitoring the actors, the number of received messages for some actors is too low.

I have attached a small sample project. I have to run the sample 3-5 times for the problem to occur.

Please keep in mind, that I am new to Quasar and might be doing something completely wrong.

Test environment
Macbook Pro:

  • 2,5 GHz Intel Core i7
  • 16 GB RAM
  • OSX 10.11.5

Java:

  • Oracle Java 1.8.0_60-b27 server vm
  • Maven 3.0.5 to build and run

bugreport.zip
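
For readers without the attachment, the setup described is roughly a parent actor spawning many children and cascading each message it receives to all of them. A minimal sketch of that shape (hypothetical names and messages, not the code from bugreport.zip):

    import co.paralleluniverse.actors.ActorRef;
    import co.paralleluniverse.actors.BasicActor;
    import co.paralleluniverse.fibers.SuspendExecution;

    // Hypothetical sketch: a parent actor spawns children and cascades
    // ("forwards") every message it receives to all of them.
    class ChildActor extends BasicActor<String, Void> {
        @Override
        protected Void doRun() throws InterruptedException, SuspendExecution {
            for (;;) {
                String msg = receive();
                if ("shutdown".equals(msg))
                    return null;              // leave the doRun loop
            }
        }
    }

    class ParentActor extends BasicActor<String, Void> {
        private final int numChildren;

        ParentActor(int numChildren) { this.numChildren = numChildren; }

        @Override
        protected Void doRun() throws InterruptedException, SuspendExecution {
            @SuppressWarnings("unchecked")
            ActorRef<String>[] children = new ActorRef[numChildren];
            for (int i = 0; i < numChildren; i++)
                children[i] = new ChildActor().spawn();
            for (;;) {
                String msg = receive();
                for (ActorRef<String> child : children)
                    child.send(msg);          // cascade the message downwards
                if ("shutdown".equals(msg))
                    return null;
            }
        }
    }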

@circlespainter
Member

Hi, I have tried many times but I've been unable to reproduce it. My setup is: MBP5,1 (2.4 GHz Core 2 Duo, 2 cores, HT disabled in hardware), 8 GB RAM, OSX 10.11.5, Oracle JDK 1.8.0_92-b14.

I made only a few minor changes; the most relevant ones are logging all actions, explicitly allocating a 4 GB heap (with your constants it crashes with an OOM here), and removing Guava 15 from the dependencies: Quasar depends on 19.0 and, since I think Maven uses a nearest-wins (tree-wise) conflict resolution strategy, this could perhaps cause problems.

Could you try again with my version?

quasar-204.zip

@scameronde
Author

Hi, I changed my setup step by step to eliminate potential culprits.

  1. removed Guava 15 from dependencies: problem still exists
  2. set max memory to 4 GB: problem still exists
  3. set min memory to 4 GB: problem still exists
  4. upgraded to JDK 1.8.0_92-b14: problem still exists
  5. used your System.err.println statements: problem gone.

Looks like a timing issue, because the logging slows everything down.
In a second test, I redirected the output to a file on an SSD, with the same result: the problem is gone.
Then I removed most of your logging statements and voilà: the problem is back.

It smells like a race condition that you only hit with unlucky timing.

@circlespainter
Member

Yes, and it must happen in a very small time window, because I can't reproduce it even with in-memory logging (attached).

Thanks for investing the time; if you can spend a bit more, it would help to try with a mailbox size of 0. Less interesting, but please also try this new version, just to get some empirical data on the best race-finding setups (more cores could increase the chances).

quasar-204.zip
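
For reference, a mailbox size of 0 with a blocking overflow policy can be configured roughly like this when constructing the actor (a sketch assuming Quasar 0.7's MailboxConfig and OverflowPolicy; the actor itself is hypothetical):

    import co.paralleluniverse.actors.BasicActor;
    import co.paralleluniverse.actors.MailboxConfig;
    import co.paralleluniverse.fibers.SuspendExecution;
    import co.paralleluniverse.strands.channels.Channels.OverflowPolicy;

    // Sketch: an actor whose mailbox has size 0 and BLOCK as the overflow policy,
    // so senders block instead of messages being dropped or an exception thrown.
    class ZeroMailboxActor extends BasicActor<String, Void> {
        ZeroMailboxActor() {
            super(new MailboxConfig(0, OverflowPolicy.BLOCK));
        }

        @Override
        protected Void doRun() throws InterruptedException, SuspendExecution {
            for (;;) {
                if ("shutdown".equals(receive()))
                    return null;
            }
        }
    }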

@scameronde
Author

I did two tests:

  1. with your in-memory logging: still about 1 out of 5 runs does not finish.
  2. with a mailbox size of 0 and BLOCK as the overflow policy: 1 in 10 runs did not finish.

With (1) I had to increase the heap to 8 GB, because I got OOMs with 4 GB:

Exception in Fiber "fiber-10343926" Exception in Fiber "fiber-10250341" Exception in Fiber "fiber-10516198" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at eu.lindenbaum.lab.racetest.Util.log(Util.java:16)
at eu.lindenbaum.lab.racetest.ChildActor.doRun(ChildActor.java:22)
at eu.lindenbaum.lab.racetest.ChildActor.doRun(ChildActor.java:8)
at co.paralleluniverse.actors.Actor.run0(Actor.java:691)
at co.paralleluniverse.actors.ActorRunner.run(ActorRunner.java:51)
at co.paralleluniverse.fibers.Fiber.run(Fiber.java:1027)

If you need further help identifying the problem, just tell me what I should do.

@circlespainter
Member

Thanks, are you using 10 parent actors and 100k children each?

@circlespainter
Member

Also, can you confirm from the logs that the issue is due to lost messages and not something else? If so, is it possible to tell how many are lost?

@scameronde
Author

I tried it with different numbers: 10 parents with 100k children each, and 1k parents with 1k children each, with the same result. It seems to be the total number of children (actors) that is the problem; both cases have 1 million actors. When I lower the numbers the problem happens less often; when I go higher it happens more often.

I did not use log messages to monitor the behavior, but the JMX console of JVisualVM / Mission Control. The number of received messages the parent actors report via JMX is off. When everything is OK and all actors terminate, the reported number of received messages is correct. When some of the actors do not terminate, sometimes one actor reports fewer than the expected number of received messages, sometimes two or three actors do. Sometimes only one message for a parent actor is missing, sometimes 80%.

@circlespainter
Member

Thanks, we'll look into it.

@scameronde
Author

scameronde commented Jun 23, 2016

I investigated the situation a bit more by monitoring sent and received messages.
The messages that get lost are of type LifecycleMessage (ExitMessage).

The root actor watches 10 parent actors. Each parent actor watches 100,000 child actors. All child actors receive the shutdown message sent by my code, and all child actors exit their doRun loop. I am using JMX counters to verify that the numbers are correct.
When the problem occurs, most of the time 9 out of 10 parent actors receive the expected number of ExitMessages and only one parent actor has problems. The number of missing messages for this one parent actor is always a multiple of 10. I have seen 10, 20 or 30 messages missing.

Maybe this helps finding the problem.

@circlespainter
Member

Thanks, I think it does. Lifecycle messages are pretty normal messages, except that they're produced and consumed by watch- and link-related logic (and filterMessage/handleLifecycleMessage). The problem could be there, because if no normal message is ever lost, then we can probably exclude a mailbox/queue issue.
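
To illustrate the path in question: when a parent watches a child, the child's ExitMessage is not delivered like a normal message but goes through filterMessage/handleLifecycleMessage. A rough sketch of a watching parent counting those exits (hypothetical names, not the code from the attached sample):

    import java.util.concurrent.TimeUnit;

    import co.paralleluniverse.actors.ActorRef;
    import co.paralleluniverse.actors.BasicActor;
    import co.paralleluniverse.actors.ExitMessage;
    import co.paralleluniverse.actors.LifecycleMessage;
    import co.paralleluniverse.fibers.SuspendExecution;

    // Sketch: a parent that watches its children; their ExitMessages arrive via
    // the lifecycle path (watch logic + handleLifecycleMessage), not as normal messages.
    class WatchingParent extends BasicActor<String, Void> {
        private final int numChildren;
        private int exited = 0;

        WatchingParent(int numChildren) { this.numChildren = numChildren; }

        @Override
        protected Void doRun() throws InterruptedException, SuspendExecution {
            for (int i = 0; i < numChildren; i++) {
                ActorRef<String> child = new ChildActor().spawn();
                watch(child);            // request an ExitMessage when the child terminates
                child.send("shutdown");
            }
            while (exited < numChildren)
                receive(100, TimeUnit.MILLISECONDS);  // lifecycle messages are handled during receive
            return null;
        }

        @Override
        protected String handleLifecycleMessage(LifecycleMessage m) {
            if (m instanceof ExitMessage)
                exited++;                // one watched child has terminated
            return super.handleLifecycleMessage(m);
        }

        static class ChildActor extends BasicActor<String, Void> {
            @Override
            protected Void doRun() throws InterruptedException, SuspendExecution {
                receive();               // wait for the shutdown message, then terminate
                return null;
            }
        }
    }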

@circlespainter
Member

Could you run again with Quasar's "flight recorder" turned on and share the output file? It records many internal events in memory (including actor-related ones) and then dumps them on exit. You can enable it with -Dco.paralleluniverse.debugMode=true and have a look at more options here.

@scameronde
Author

I tried it with the following settings:

        <argument>-Dco.paralleluniverse.fibers.detectRunawayFibers=false</argument>
        <argument>-Dco.paralleluniverse.fibers.verifyInstrumentation=false</argument>
        <argument>-Dco.paralleluniverse.debugMode=true</argument>
        <argument>-Dco.paralleluniverse.flightRecorderDumpFile=quasar.log</argument>
        <argument>-Dco.paralleluniverse.monitoring.flightRecorderLevel=5</argument>
        <argument>-Dco.paralleluniverse.globalFlightRecorder=true</argument>
        <argument>-Dco.paralleluniverse.monitoring.flightRecorderSize=200000</argument>

The flight recorder slowed everything down from a few seconds of runtime to more than an hour with heavy system load. The problem did not occur in two runs (I have not had the time for more runs yet).

How can I tweak the parameters to get a better runtime, increase the chance of the problem occurring, and squeeze more runs into one day?

@circlespainter
Member

Sorry, I didn't expect such long runtimes. Level 1 should already produce interesting information but if it doesn't help we can try with a custom Quasar build. Thanks for your support.

@scameronde
Author

When logging at level 1 and increasing the size of the flight recorder, the problem occurs again. The problem is that the program does not terminate, because some actors are still running due to missing lifecycle messages. It seems that the flight recorder log is only written to disk when the program ends.

My idea is to install a shutdown hook that catches a kill signal and writes the log via an API call. Which API call do I have to make to force writing of the log?
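
As a sketch of that idea, a JVM shutdown hook could trigger the dump on a kill signal; the actual Quasar call to force the dump is not named in this thread, so it is left as a placeholder below:

    // Sketch of the shutdown-hook idea: on a kill signal (SIGTERM) the hook runs
    // before the JVM exits and would force the flight-recorder dump there.
    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
        System.err.println("Shutting down, dumping flight recorder...");
        // Placeholder: invoke Quasar's flight-recorder dump API here.
        // The exact call is not named in this thread and is left as an assumption.
    }));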

@pron
Contributor

pron commented Jun 30, 2016

Hi @scameronde . We've been able to reproduce the bug on a different machine, and are looking into it.

@scameronde
Author

Hi. How is the fix coming along? Did you manage to identify the cause of the problem?

@pron
Contributor

pron commented Jul 29, 2016

I haven't had the time to work on the issue lately, but I'll devote next week to fixing it, as well as a couple of other important Quasar bugs. I hope and expect you'll have a fix by the end of next week.

@linkerlin

@pron ExecutorService will drop some messages if the queue of Callables is full.

@pron pron closed this as completed in 97ef7c4 Aug 3, 2016
@pron
Contributor

pron commented Aug 3, 2016

@scameronde The issue should be fixed now in 0.7.6-SNAPSHOT. It was a subtle race in the implementation of the unbounded queue used by the actor mailbox.

@scameronde
Author

Thanks! I will test it this weekend.
