Messages to actors are lost under heavy load #204
Comments
Hi, I have tried a lot but I've been unable to reproduce it. My setup is: MBP5,1 (2.4 GHz Core 2 Duo, 2 cores, HT disabled in HW), 8GB RAM, OSX 10.11.5, Oracle JDK 1.8.0_92-b14. I made only a few minor changes; the most relevant ones are logging all actions, explicitly allocating a 4G heap (with your constants it crashes because of OOM here) and removing Guava 15 from the dependencies: Quasar depends on 19.0 and, since I think Maven uses a nearest-wins (tree-wise) conflict resolution strategy, this could perhaps cause problems. Could you try again with my version?
Hi, I changed my setup step by step to eliminate potential culprits.
Looks like a timing issue, because the logging slows everything down. It smells like a race condition that only shows up with unlucky timing.
Yes, and it must happen in a very small time window, because I can't reproduce it even with in-memory logging (attaching). Thanks for investing the time; if you can spend a bit more, it would help to try it out with a mailbox size of 0. Less interesting, but it would also help to try with this new version, just to get some empirical data on which setups are best at finding the race (more cores could increase the chances).
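For reference, a minimal sketch of what the mailbox-size-0 configuration could look like, assuming Quasar's `MailboxConfig` / `OverflowPolicy` API; the class name, the `THROW` policy choice and the exact semantics of size 0 are illustrative and should be checked against the Quasar docs:

```java
import co.paralleluniverse.actors.MailboxConfig;
import co.paralleluniverse.strands.channels.Channels.OverflowPolicy;

public class MailboxSettings {
    public static void main(String[] args) {
        // Zero-capacity mailbox; THROW surfaces overflow as an exception
        // instead of handling it silently (policy choice is illustrative).
        MailboxConfig zeroMailbox = new MailboxConfig(0, OverflowPolicy.THROW);

        // The config would then be passed to the actor's constructor,
        // e.g. super(zeroMailbox) in a BasicActor subclass.
        System.out.println("configured mailbox size 0 with policy " + OverflowPolicy.THROW);
    }
}
```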
I did two tests:
With 1) I had to increase the heap to 8 GB, because I got OOMs with 4 GB:
Exception in Fiber "fiber-10343926"
Exception in Fiber "fiber-10250341"
Exception in Fiber "fiber-10516198"
java.lang.OutOfMemoryError: Java heap space
If you need further help identifying the problem, just tell me what I should do.
Thanks, are you using 10 parent actors and 100k children each?
Also, can you confirm from the logs that the issue is due to lost messages and not something else? And if so, is it possible to tell how many are lost?
I tried it with different numbers: 10 parents and 100k children, and 1k parents and 1k children, with the same result. It seems to be the total number of children (actors) that is the problem; both cases have 1 million actors. When I go lower in numbers the problem happens less often, if I go higher it happens more often. I did not use log messages to monitor the behavior, but the JMX console of JVisualVM / Mission Control. The number of received messages that the parent actors report via JMX is off. When everything is OK and all actors terminate, the reported number of received messages is correct. When some of the actors do not terminate, sometimes one actor reports fewer than the expected number of received messages, sometimes two or three actors do. Sometimes only one message for a parent actor is missing, sometimes 80%.
Thanks, we'll look into it.
I investigated the situation a bit more by monitoring sent and received messages. The root actor watches 10 parent actors. Each parent actor watches 100,000 child actors. All child actors receive the shutdown message sent by my code. All child actors do quit their doRun loop. I am using JMX counters to ensure that the numbers are correct. Maybe this helps in finding the problem.
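For context, a minimal sketch of the watch-and-count pattern described above, assuming Quasar's `watch()` / `ExitMessage` / `handleLifecycleMessage` API; class and field names are illustrative and not taken from the attached project:

```java
import java.util.concurrent.TimeUnit;

import co.paralleluniverse.actors.ActorRef;
import co.paralleluniverse.actors.BasicActor;
import co.paralleluniverse.actors.ExitMessage;
import co.paralleluniverse.actors.LifecycleMessage;
import co.paralleluniverse.fibers.SuspendExecution;

// Illustrative parent: watches its children and waits until all of them have
// terminated. If an ExitMessage is lost, 'remaining' never reaches 0 and
// doRun() never returns -- matching the hang described above.
public class ParentActor extends BasicActor<Object, Void> {
    private final Iterable<ActorRef<Object>> children;
    private int remaining;

    public ParentActor(Iterable<ActorRef<Object>> children, int count) {
        this.children = children;
        this.remaining = count;
    }

    @Override
    protected Void doRun() throws InterruptedException, SuspendExecution {
        for (ActorRef<Object> child : children)
            watch(child);                        // ask for an ExitMessage when the child dies
        while (remaining > 0)
            receive(100, TimeUnit.MILLISECONDS); // wake up periodically to re-check the counter
        return null;
    }

    @Override
    protected Object handleLifecycleMessage(LifecycleMessage m) {
        if (m instanceof ExitMessage)
            remaining--;                         // one watched child has terminated
        return super.handleLifecycleMessage(m);
    }
}
```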
Thanks, I think it does. Lifecycle messages are pretty normal messages except that they're produced and consumed by |
Could you run again with Quasar's "flight recorder" turned on and share the output file? It basically records many internal events in memory (including actor-related ones) and then dumps them upon exit. You can enable it with
I tried it with the following settings:
The flight recorder slowed everything down from a few seconds of runtime to more than an hour with heavy system load. The problem did not occur in two runs (I did not have the time for more runs yet). How can I tweak the parameters to get a better runtime, to increase the chance of the problem occurring, and to squeeze more runs into one day?
Sorry, I didn't expect such long runtimes. Level 1 should already produce interesting information, but if it doesn't help we can try a custom Quasar build. Thanks for your support.
When logging at level 1 and increasing the size of the flight recorder, the problem occurs again. The problem is that the program does not terminate, because there are still actors running due to missing lifecycle messages. It seems that the flight recorder log is only written to disk when the program ends. My idea is to install a shutdown hook that catches a kill signal and writes the log via an API call. What API call do I have to make to force writing of the log?
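Not an answer to which Quasar call dumps the recording, but for the shutdown-hook part, a minimal sketch of the standard JVM mechanism; `dumpFlightRecorderPlaceholder` is a hypothetical stub marking where the real (to-be-confirmed) dump call would go:

```java
public class RecorderDumpHook {
    // Hypothetical stub: the actual Quasar flight-recorder dump call is the
    // open question above; this method only marks where it would be invoked.
    static void dumpFlightRecorderPlaceholder() {
        System.err.println("would dump the flight recorder here");
    }

    public static void main(String[] args) throws InterruptedException {
        // Shutdown hooks run on normal exit and on SIGTERM/SIGINT (plain `kill`),
        // but not on SIGKILL (`kill -9`) or a hard JVM crash.
        Runtime.getRuntime().addShutdownHook(
                new Thread(RecorderDumpHook::dumpFlightRecorderPlaceholder));

        // ... start the actual test workload here, then kill the process ...
        Thread.sleep(60_000);
    }
}
```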
Hi @scameronde. We've been able to reproduce the bug on a different machine and are looking into it.
Hi. How is the fix coming along? Did you manage to identify the cause of the problem?
I haven't had the time to work on the issue lately, but I'll devote next week to fixing it as well as a couple of other important Quasar bugs. I hope and expect you'll have a fix by the end of next week.
@pron ExecutorService will drop some messages if the queue of Callables is full. |
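To illustrate the failure mode being pointed out (plain JDK code, not Quasar's actual scheduler): a `ThreadPoolExecutor` with a bounded queue and a `DiscardPolicy` silently drops submissions that don't fit.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class DiscardDemo {
    public static void main(String[] args) throws InterruptedException {
        AtomicInteger executed = new AtomicInteger();

        // Single worker, queue capacity 1, DiscardPolicy: tasks submitted while
        // the queue is full are dropped silently -- no exception, no callback.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1),
                new ThreadPoolExecutor.DiscardPolicy());

        for (int i = 0; i < 1_000; i++)
            pool.execute(executed::incrementAndGet);

        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("executed " + executed.get() + " of 1000 submitted tasks");
    }
}
```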
@scameronde The issue should be fixed now in |
Thanks! I will test it this weekend. |
Sometimes, under heavy load and with cascading messages, some messages seem to get lost. When monitoring the actors, the number of received messages for some actors is too low.
I have attached a small sample project. I have to run the sample 3-5 times for the problem to occur.
Please keep in mind that I am new to Quasar and might be doing something completely wrong.
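For readers without the attachment, a rough sketch of the child side of the scenario, assuming Quasar's `BasicActor` API; this is illustrative and is not the code from bugreport.zip:

```java
import co.paralleluniverse.actors.BasicActor;
import co.paralleluniverse.fibers.SuspendExecution;

// Illustrative child: loops in doRun() until a shutdown message arrives, then
// returns. Returning from doRun() terminates the actor, which is what should
// trigger the lifecycle notification its watching parent counts.
public class ChildActor extends BasicActor<String, Void> {
    @Override
    protected Void doRun() throws InterruptedException, SuspendExecution {
        for (;;) {
            String msg = receive();
            if ("shutdown".equals(msg))
                return null;   // leaving doRun() ends the actor
            // ... handle/forward ordinary (cascading) messages here ...
        }
    }
}
```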
Test environment
MacBook Pro:
Java:
bugreport.zip