A possible memory leak in ShardManager.retryQueue? #817
Comments
Hi @jgodoniuk, could you please share more details about your testing? Actually, if you check the code, the
Hi. Thanks for your feedback. Yes, from the code analysis one might suppose that no duplicates are inserted, since it is the same instance of WorkContainer. But please take a look at my drawings, especially the difference between the first and the second. This is where the thread interplay takes place: other threads process and modify WorkContainer entries that are already stored in the TreeMap, and they do it in a way that leaves the WorkContainers in incorrect positions of the tree. In the second drawing, the node 2222 (12:31) is placed to the right of 3333 (12:30) because that was its initial place, as in the first drawing. But after other threads have modified the lastFailedAt attributes of these WorkContainers, the placement of 2222 is no longer correct. Taking lastFailedAt into consideration, 2222 (12:32) should now be to the left of 3333 (12:33). Have you run the sample project I provided? It should give better evidence of the issue. I have just run it and observed 185 entries in ShardManager.retryQueue after sending only 10 messages.
We are also encountering the same kind of behaviour that was explained in detail by @jgodoniuk. Are you planning on fixing the issue any time soon, @sangreal?
Hi @jgodoniuk - I am not 100% convinced that the issue is as you describe - I have done a small test of the retry queue handling, and as long as it is the same WorkContainer object instance that is being updated and re-added into the queue, everything looks correct:
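For reference, here is a minimal standalone sketch of that kind of test. It uses a hypothetical Entry class and a plain TreeSet ordered by lastFailedAt in place of the real WorkContainer and retry queue, so the names and comparator are assumptions rather than the library's actual code:

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.TreeSet;

public class SingleEntryRetryQueueSketch {

    // Hypothetical stand-in for WorkContainer: the sort key, lastFailedAt, is mutable.
    static class Entry {
        final long offset;
        Instant lastFailedAt;

        Entry(long offset, Instant lastFailedAt) {
            this.offset = offset;
            this.lastFailedAt = lastFailedAt;
        }
    }

    public static void main(String[] args) {
        // Ordered by lastFailedAt, loosely mimicking retryQueueWorkContainerComparator.
        TreeSet<Entry> retryQueue = new TreeSet<>(Comparator.comparing((Entry e) -> e.lastFailedAt));

        Entry wc = new Entry(1111, Instant.parse("2023-01-01T12:30:00Z"));
        retryQueue.add(wc);

        // Simulate a failed retry: mutate lastFailedAt on the SAME instance and re-add it.
        wc.lastFailedAt = Instant.parse("2023-01-01T12:31:00Z");
        retryQueue.add(wc);

        // With only one element, the comparator ends up comparing the instance
        // against itself, so no duplicate appears and the size stays at 1.
        System.out.println("retryQueue size = " + retryQueue.size()); // 1
    }
}
```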
Now if I modify the test to instead create a new instance of WorkContainer and add it - then we have a problem, as
So there may be an issue with re-adding a created WorkContainer into the queue if it is created as a new object, rather than just updating (and re-adding) the existing WorkContainer.
I had a further look - the WorkContainer is only ever constructed once, after it is polled by the consumer - there are no other calls to the WorkContainer constructor from anywhere else - the only invocation is here - Line 191 in 29795bf
Honestly, I have already checked - there is no other place where a new WorkContainer is created.
@rkolesnev / @sangreal - try to follow what @jgodoniuk said. You need to run his example and follow his guide. I could reproduce the error. It won't happen every time, but at some point the behaviour will be triggered and the memory leak will occur. In our case we have quite a lot of retrying (huge traffic) and we can run out of memory very quickly. Moreover, the problem lies in re-using the WorkContainer. As @jgodoniuk pointed out, the WorkContainer is mutable, and that is where the problem arises. The moment the lastFailedAt attribute is changed, there is a possibility that the entry will end up on the wrong side of the tree, and thus it won't be found next time (because according to the comparator it should be on the other side of the tree), which leads to the memory leak.
@netroute-js, @jgodoniuk - The retry queue was recently changed to use ConcurrentSkipListSet instead of TreeMap - could you please retest to see whether the internal implementation differences between the two invalidate this issue, or whether it is still present?
Hi @rkolesnev & @sangreal. As regards the test you provided, there has to be more than one element in the retryQueue to obtain interesting results. See below. What is interesting is that this test fails non-deterministically, which I really don't understand.
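For context, here is a hedged standalone sketch along those lines (my own simplified model with a hypothetical Entry class, not the project's actual test): several entries are held in a ConcurrentSkipListSet ordered by lastFailedAt, their timestamps are mutated while they sit in the set, and they are then re-added. Printing the resulting size shows whether the same instance has ended up in more than one position.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListSet;

public class MultiEntryRetryQueueSketch {

    // Hypothetical stand-in for WorkContainer with a mutable lastFailedAt.
    static class Entry {
        final long offset;
        Instant lastFailedAt;

        Entry(long offset, Instant lastFailedAt) {
            this.offset = offset;
            this.lastFailedAt = lastFailedAt;
        }
    }

    public static void main(String[] args) {
        ConcurrentSkipListSet<Entry> retryQueue =
                new ConcurrentSkipListSet<>(Comparator.comparing((Entry e) -> e.lastFailedAt));

        Instant base = Instant.parse("2023-01-01T12:30:00Z");
        List<Entry> entries = List.of(
                new Entry(1111, base),
                new Entry(2222, base.plusSeconds(60)),
                new Entry(3333, base.plusSeconds(120)));
        retryQueue.addAll(entries);

        // Simulate interleaved retries: lastFailedAt is changed while the entries are
        // still inside the set, so the set's internal ordering becomes stale.
        entries.get(0).lastFailedAt = base.plusSeconds(300); // 1111 is now the "latest"
        entries.get(2).lastFailedAt = base.plusSeconds(30);  // 3333 is now the "earliest"

        // Re-add each entry after its "failed" attempt.
        for (Entry e : entries) {
            retryQueue.add(e);
        }

        // There are only 3 distinct instances; any size above 3 means the same
        // instance now occupies more than one position in the skip list.
        System.out.println("distinct entries = " + entries.size()
                + ", retryQueue size = " + retryQueue.size());
    }
}
```

Whether a duplicate actually appears may depend on where the skip list's navigation stops for the stale keys, which could be one reason the failures look non-deterministic; this is speculation on my part, not a confirmed explanation.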
Hi @jgodoniuk, thanks for the tests. First, I do encounter, in rare cases, the retryQueue size being 4.
@jgodoniuk, @sangreal - thanks for keeping digging into this - there is definitely some very strange behaviour in play - probably something to do with randoms / hashing etc - failure of the above test (and my own modified version) at least seems pretty random - I've tried for example to wrap the last
but it either never causes the duplicate to be added to the retryQueue or adds it on the first add - which doesn't seem logical.
I am closing the issue with the fix for this (#834) released - please reopen if needed. |
Hi.
We have recently observed out-of-memory problems in our Java application instances. Generally, our scenario is that we have many retries configured (3600, every 1 sec) in the parallel consumer observing a Kafka topic with 5 partitions, and we encountered problematic messages (about 200 messages were enough) that triggered these retries. After the retries started, the memory in the application instances was soon exhausted.
An analysis revealed the culprit to be an instance of the ShardManager class. A heap dump showed that the retryQueue structure inside this class had grown to an enormous size. I was expecting the retryQueue not to grow larger than the number of processed messages (200), whereas it happened to reach, e.g., a few thousand entries.
Further investigation of the Java heaps showed that it is the TreeMap backing the retryQueue that stores far more nodes than the message count. It turned out that some WorkContainer instances are stored in the tree map nodes multiple times, in different positions of the tree.
What exactly happens is that WorkContainer is a mutable class. While WorkContainers are stored in and removed from the nodes of the retryQueue tree map, there are also threads operating on these WorkContainers at the same time, changing the value of their lastFailedAt attributes. This attribute is the basis of ShardManager.retryQueueWorkContainerComparator, and in some situations it may happen that, after a lastFailedAt change, an instance of WorkContainer sits in an incorrect position of the tree.
Starting with an example situation:
It may happen that lastFailedAt in 3333 and 2222 is changed by other threads this way:
As a result we end up with a structure:
Now, when trying to add [2222, 12:32] again (after the next failed retry attempt), the comparison of lastFailedAt attributes results in 2222 landing in a new node in the map, while the same instance also still sits in its previous position.
Taking into consideration that the retryQueue seems not to be cleared, this looks like a potential memory leak.
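To make the walkthrough above concrete, here is a hedged, self-contained sketch of the 2222/3333 scenario using a plain TreeSet ordered by lastFailedAt; the Entry class and comparator are stand-ins for the real WorkContainer and retryQueueWorkContainerComparator, not the library's code:

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.TreeSet;

public class RetryQueueLeakSketch {

    // Hypothetical stand-in for WorkContainer: the sort key, lastFailedAt, is mutable.
    static class Entry {
        final long offset;
        Instant lastFailedAt;

        Entry(long offset, Instant lastFailedAt) {
            this.offset = offset;
            this.lastFailedAt = lastFailedAt;
        }
    }

    public static void main(String[] args) {
        TreeSet<Entry> retryQueue = new TreeSet<>(Comparator.comparing((Entry e) -> e.lastFailedAt));

        Entry wc3333 = new Entry(3333, Instant.parse("2023-01-01T12:30:00Z"));
        Entry wc2222 = new Entry(2222, Instant.parse("2023-01-01T12:31:00Z"));
        retryQueue.add(wc3333); // becomes the root
        retryQueue.add(wc2222); // 12:31 > 12:30, so 2222 is placed to the right of 3333

        // Other threads update lastFailedAt while both entries are still in the tree.
        wc3333.lastFailedAt = Instant.parse("2023-01-01T12:33:00Z");
        wc2222.lastFailedAt = Instant.parse("2023-01-01T12:32:00Z");
        // By the comparator, 2222 (12:32) should now be to the LEFT of 3333 (12:33),
        // but the tree still holds it on the right, where it was originally inserted.

        // Re-adding 2222 after its next failed retry navigates to the left and creates
        // a new node, so the same instance is now referenced by two tree nodes.
        retryQueue.add(wc2222);
        System.out.println("size after re-add = " + retryQueue.size()); // 3, not 2

        // Removing 2222 only finds and removes the left node; the stale node on the
        // right can no longer be reached by comparator-based navigation.
        retryQueue.remove(wc2222);
        System.out.println("size after remove = " + retryQueue.size());          // 2
        System.out.println("contains 2222?     " + retryQueue.contains(wc2222)); // false, yet it is still in the tree
    }
}
```

The stale node that can no longer be found or removed is, in this simplified model, the kind of entry that would accumulate over many retries.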
I have prepared a simple Java project to help in simulating the described issue: https://github.com/jgodoniuk/parallel-consumer-memory-hunger
Just let it run and observe ShardManager.retryQueue in the heap dump after some time intervals.