A possible memory leak in ShardManager.retryQueue? #817
Comments
Hi @jgodoniuk, could you please share more details about your testing? Actually, if you check the code, the
Hi. Thanks for your feedback. Yes, from the code analysis one might suppose that no duplicates are inserted, since it is the same instance of WorkContainer. But please take a look at my drawings, especially the difference between the first and the second. This is where the thread interplay takes place: other threads process and modify WorkContainer entries that are already stored in the TreeMap, and they do it in a way that leaves the WorkContainers in incorrect positions of the tree. In the second drawing, the node 2222 (12:31) is placed to the right of 3333 (12:30) because that was its initial place, as in the first drawing. But after other threads have modified the lastFailedAt attributes of these WorkContainers, the placement of 2222 is no longer correct. Taking lastFailedAt into consideration, 2222 (12:32) should now be to the left of 3333 (12:33). Have you run the sample project I provided? It should give better evidence of the issue. I have just run it and observed 185 entries in ShardManager.retryQueue after sending only 10 messages.
We are also encountering the same kind of behaviour that was explained in detail by @jgodoniuk. Are you planning on fixing the issue any time soon, @sangreal?
Hi @jgodoniuk - I am not 100% convinced that the issue is as you describe - I have done a small test of the retry queue handling, and as long as it is the same WorkContainer object instance that is being updated and re-added into the queue, everything looks correct:
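For reference, here is a minimal standalone sketch of that kind of test. It uses a hypothetical Entry class and a plain TreeSet ordered by lastFailedAt in place of the real WorkContainer and retry queue, so the names and comparator are assumptions rather than the library's actual code:

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.TreeSet;

public class SingleEntryRetryQueueSketch {

    // Hypothetical stand-in for WorkContainer: the sort key, lastFailedAt, is mutable.
    static class Entry {
        final long offset;
        Instant lastFailedAt;

        Entry(long offset, Instant lastFailedAt) {
            this.offset = offset;
            this.lastFailedAt = lastFailedAt;
        }
    }

    public static void main(String[] args) {
        // Ordered by lastFailedAt, loosely mimicking retryQueueWorkContainerComparator.
        TreeSet<Entry> retryQueue = new TreeSet<>(Comparator.comparing((Entry e) -> e.lastFailedAt));

        Entry wc = new Entry(1111, Instant.parse("2023-01-01T12:30:00Z"));
        retryQueue.add(wc);

        // Simulate a failed retry: mutate lastFailedAt on the SAME instance and re-add it.
        wc.lastFailedAt = Instant.parse("2023-01-01T12:31:00Z");
        retryQueue.add(wc);

        // With only one element, the comparator ends up comparing the instance
        // against itself, so no duplicate appears and the size stays at 1.
        System.out.println("retryQueue size = " + retryQueue.size()); // 1
    }
}
```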
Now if I modify the test to instead create a new instance of WorkContainer and add it - then we have a problem, as
So there may be an issue with re-adding a created WorkContainer into the queue if it is created as a new object, rather than just updating (and re-adding) the existing WorkContainer.
I had a further look - the WorkContainer is only ever constructed once, after it is polled by the consumer - there are no other calls to the WorkContainer constructor from anywhere else - the only invocation is here - Line 191 in 29795bf
Honestly, I have already checked - there is no other place where a new WorkContainer is created.
@rkolesnev / @sangreal - try to follow what @jgodoniuk said. You need to run his example and follow his guide. I could reproduce the error. It won't happen every time, but at some point the behaviour will be triggered and the memory leak will occur. In our case we have quite a lot of retrying (huge traffic) and we can run out of memory very quickly. Moreover, the problem lies in re-using the WorkContainer. As @jgodoniuk pointed out, the WorkContainer is mutable, and that is where the problem arises. The moment the lastFailedAt attribute is changed, there is a possibility that the entry will end up on the wrong side of the tree, and thus it won't be found next time (because according to the comparator it should be on the other side of the tree), which leads to the memory leak.
@netroute-js, @jgodoniuk - The retry queue was recently changed to use ConcurrentSkipListSet instead of TreeMap - could you please retest to see whether the internal implementation differences between the two invalidate this issue, or whether it is still present?
Hi @rkolesnev & @sangreal. As regards the test you provided, there has to be more than one element in the retryQueue to obtain interesting results. See below. What is interesting is that this test fails non-deterministically, which I really don't understand.
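For context, here is a hedged standalone sketch along those lines (my own simplified model with a hypothetical Entry class, not the project's actual test): several entries are held in a ConcurrentSkipListSet ordered by lastFailedAt, their timestamps are mutated while they sit in the set, and they are then re-added. Printing the resulting size shows whether the same instance has ended up in more than one position.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListSet;

public class MultiEntryRetryQueueSketch {

    // Hypothetical stand-in for WorkContainer with a mutable lastFailedAt.
    static class Entry {
        final long offset;
        Instant lastFailedAt;

        Entry(long offset, Instant lastFailedAt) {
            this.offset = offset;
            this.lastFailedAt = lastFailedAt;
        }
    }

    public static void main(String[] args) {
        ConcurrentSkipListSet<Entry> retryQueue =
                new ConcurrentSkipListSet<>(Comparator.comparing((Entry e) -> e.lastFailedAt));

        Instant base = Instant.parse("2023-01-01T12:30:00Z");
        List<Entry> entries = List.of(
                new Entry(1111, base),
                new Entry(2222, base.plusSeconds(60)),
                new Entry(3333, base.plusSeconds(120)));
        retryQueue.addAll(entries);

        // Simulate interleaved retries: lastFailedAt is changed while the entries are
        // still inside the set, so the set's internal ordering becomes stale.
        entries.get(0).lastFailedAt = base.plusSeconds(300); // 1111 is now the "latest"
        entries.get(2).lastFailedAt = base.plusSeconds(30);  // 3333 is now the "earliest"

        // Re-add each entry after its "failed" attempt.
        for (Entry e : entries) {
            retryQueue.add(e);
        }

        // There are only 3 distinct instances; any size above 3 means the same
        // instance now occupies more than one position in the skip list.
        System.out.println("distinct entries = " + entries.size()
                + ", retryQueue size = " + retryQueue.size());
    }
}
```

Whether a duplicate actually appears may depend on where the skip list's navigation stops for the stale keys, which could be one reason the failures look non-deterministic; this is speculation on my part, not a confirmed explanation.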
Hi @jgodoniuk, thanks for the tests. First, I do encounter, in rare cases, the retryQueue size being 4.
@jgodoniuk, @sangreal - thanks for keeping digging into this - there is definitely some very strange behaviour in play - probably something to do with randoms / hashing etc - failure of the above test (and my own modified version) at least seems pretty random - I've tried for example to wrap the last
but it either never causes the duplicate to be added to the retryQueue or adds it on the first add - which doesn't seem logical.
I am closing the issue with the fix for this (#834) released - please reopen if needed. |
Hi.
We have recently observed out-of-memory problems in our Java application instances. Generally, our scenario is that we have many retries configured (3600, every 1 sec) in the parallel consumer observing a Kafka topic with 5 partitions, and we encountered problematic messages (about 200 messages were enough) that triggered these retries. After the retries started, the memory in the application instances was soon exhausted.
An analysis revealed the culprit to be an instance of the ShardManager class. A heap dump showed that the retryQueue structure inside this class had grown to an enormous size. I was expecting the retryQueue not to grow larger than the number of processed messages (200), whereas it happened to reach, e.g., a few thousand entries.
Further investigation of the Java heaps showed that it is the TreeMap backing the retryQueue that stores far more nodes than the message count. It turned out that some WorkContainer instances are stored in the tree map nodes multiple times, in different positions of the tree.
What exactly happens is that WorkContainer is a mutable class. While WorkContainers are stored in and removed from the nodes of the retryQueue tree map, there are also threads operating on these WorkContainers at the same time, changing the value of their lastFailedAt attributes. This attribute is the basis of ShardManager.retryQueueWorkContainerComparator, and in some situations it may happen that, after a lastFailedAt change, an instance of WorkContainer sits in an incorrect position of the tree.
Starting with an example situation:
It may happen that lastFailedAt in 3333 and 2222 is changed by other threads this way:
As a result we end up with a structure:
Now, when trying to add [2222, 12:32] again (after the next failed retry attempt), the comparison of lastFailedAt attributes results in 2222 landing in a new node in the map, while the same instance also still sits in its previous position.
Taking into consideration that the retryQueue seems not to be cleared, this looks like a potential memory leak.
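To make the walkthrough above concrete, here is a hedged, self-contained sketch of the 2222/3333 scenario using a plain TreeSet ordered by lastFailedAt; the Entry class and comparator are stand-ins for the real WorkContainer and retryQueueWorkContainerComparator, not the library's code:

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.TreeSet;

public class RetryQueueLeakSketch {

    // Hypothetical stand-in for WorkContainer: the sort key, lastFailedAt, is mutable.
    static class Entry {
        final long offset;
        Instant lastFailedAt;

        Entry(long offset, Instant lastFailedAt) {
            this.offset = offset;
            this.lastFailedAt = lastFailedAt;
        }
    }

    public static void main(String[] args) {
        TreeSet<Entry> retryQueue = new TreeSet<>(Comparator.comparing((Entry e) -> e.lastFailedAt));

        Entry wc3333 = new Entry(3333, Instant.parse("2023-01-01T12:30:00Z"));
        Entry wc2222 = new Entry(2222, Instant.parse("2023-01-01T12:31:00Z"));
        retryQueue.add(wc3333); // becomes the root
        retryQueue.add(wc2222); // 12:31 > 12:30, so 2222 is placed to the right of 3333

        // Other threads update lastFailedAt while both entries are still in the tree.
        wc3333.lastFailedAt = Instant.parse("2023-01-01T12:33:00Z");
        wc2222.lastFailedAt = Instant.parse("2023-01-01T12:32:00Z");
        // By the comparator, 2222 (12:32) should now be to the LEFT of 3333 (12:33),
        // but the tree still holds it on the right, where it was originally inserted.

        // Re-adding 2222 after its next failed retry navigates to the left and creates
        // a new node, so the same instance is now referenced by two tree nodes.
        retryQueue.add(wc2222);
        System.out.println("size after re-add = " + retryQueue.size()); // 3, not 2

        // Removing 2222 only finds and removes the left node; the stale node on the
        // right can no longer be reached by comparator-based navigation.
        retryQueue.remove(wc2222);
        System.out.println("size after remove = " + retryQueue.size());          // 2
        System.out.println("contains 2222?     " + retryQueue.contains(wc2222)); // false, yet it is still in the tree
    }
}
```

The stale node that can no longer be found or removed is, in this simplified model, the kind of entry that would accumulate over many retries.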
I have prepared a simple Java project to help in simulating the described issue: https://github.com/jgodoniuk/parallel-consumer-memory-hunger
Just let it run and observe ShardManager.retryQueue in the heap dump after some time intervals.