
Question: Memory usage pattern and potential leak #858

Closed

Sovietaced opened this issue Nov 13, 2024 · 7 comments

Comments

Sovietaced commented Nov 13, 2024

Hello. We recently moved from Amazon SQS to Kafka and chose franz-go as our client library. So far the library has been easy to use.

Since the transition, the memory usage pattern of our service has changed: under the old resource requests the service was OOM killed fairly often, and after increasing the resource requests we see what looks like a slow leak that progresses over time.

[Screenshot: 2024-11-13 at 1:21 PM]

I installed pprof today, and the preliminary data shows that the majority of the memory consumption is in the decompression logic. The Kafka-related code appears to be responsible for almost all of the memory usage in the application.

[Screenshot: 2024-11-13 at 1:34 PM]
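For reference, the profiling is just the standard net/http/pprof wiring, roughly like this (port and details here are placeholders):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve pprof on a side port; the heap profile can then be pulled with:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the service runs here ...
	select {}
}
```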

Some notes on our usage pattern: our service has a single goroutine in which the Kafka client calls PollRecords with a maximum of 10 records per poll. Each record has a maximum decompressed size of 1MB, so I'd assume the maximum size of a decompression byte buffer would be whatever is needed to hold 10 * 1MB of data.
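To make that concrete, the consume loop is shaped roughly like the sketch below (broker, topic, and group names are placeholders, not our real config):

```go
package main

import (
	"context"
	"log"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// Placeholder configuration -- not our real brokers/topic/group.
	client, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),
		kgo.ConsumeTopics("example-topic"),
		kgo.ConsumerGroup("example-group"),
		kgo.DisableAutoCommit(), // we commit manually after processing
	)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := context.Background()
	for {
		fetches := client.PollRecords(ctx, 10) // at most 10 records per poll
		if fetches.IsClientClosed() {
			return
		}
		fetches.EachRecord(func(r *kgo.Record) {
			_ = r.Value // process the record (up to 1MB decompressed)
		})
		if err := client.CommitRecords(ctx, fetches.Records()...); err != nil {
			log.Printf("commit failed: %v", err)
		}
	}
}
```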

After looking into this logic, the best explanation for the disproportionate memory usage seems to be the sync.Pool of byte buffers that are reused, though I'm not an expert in interpreting these graphs. I'll continue to watch the memory grow now that pprof is running and update with any new findings.

Thanks!

twmb (Owner) commented Nov 13, 2024

There's a related issue, #823, and Grafana forked this to add some logic to improve the pooling for them. I haven't set aside a full day to think about an API that works generally for end users and meshes with the existing library (and then the follow-on implementation). It may be worth taking a look at the fork -- and ideally, proposing a great API to add such that there is no need for a fork anymore :)

Sovietaced (Author) commented Nov 14, 2024

I think #823 is somewhat related, but I'm not sure it will solve anything for us. That issue was mostly focused on allocations, though it does mention some improvements to avoid poisoning the sync.Pool. I'm still a bit confused: during steady-state operation our service reads 1 message (at most 1MB decompressed) every 3 seconds, so I'm trying to wrap my head around why the decompressor holds onto so much memory seemingly forever, and then slowly increases its memory usage over days.

Based on this article, it looks like a sync.Pool may hold at least as many byte buffers as there are "logical processors" (in our case 4), and potentially more depending on the level of concurrency on the pool, which in our case is probably 2 goroutines. That still wouldn't explain 130MB of usage.
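For anyone following along, here is the rough shape of the buffer-pool pattern I mean -- just an illustration of how sync.Pool can retain grown buffers, not franz-go's actual code:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// Illustrative only: a pool of byte buffers like the one being discussed.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// decompress stands in for the real decompression path.
func decompress(src []byte) []byte {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	buf.Write(src) // pretend this is the decompressed payload

	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())

	// Putting the buffer back keeps its grown capacity alive in the pool,
	// which is the "poisoning" concern: one large message can pin a large
	// buffer per logical processor.
	bufPool.Put(buf)
	return out
}

func main() {
	_ = decompress(make([]byte, 1<<20)) // a single 1MB message grows a pooled buffer to ~1MB
	out := decompress([]byte("hello"))  // later small messages still reuse that large buffer
	fmt.Println(len(out))
}
```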

I may be naive but for our simpler use case I feel like we could probably avoid caching the byte buffers entirely since we are not running at scale yet.

Edit: Just looked at our metrics and the average message size is 800 bytes.

Sovietaced (Author) commented:

Here is a look at the inuse objects.

[Screenshot: 2024-11-13 at 4:54 PM]

Sovietaced (Author) commented:

Looking at the code a bit more, I now see that the byteBuffers pool is used more widely, including by the logging implementation, which is where the concurrency might ramp up. Now I can see where poisoning the pool might have an effect here.

Sovietaced (Author) commented:

I'm unable to reproduce the issue on my local machine, so I may vendor franz-go to get more visibility into the allocations being triggered in the decompressor in our production environment, where the issue is happening.

Sovietaced (Author) commented:

Alright, I think I have mostly root-caused the issue, and it looks like a problem in our code. I was able to validate that the byte buffer pools were not an issue at all -- only a few were even allocated -- so that code works great. I now see that the allocations attributed to decompress are actually the byte array it returns after decoding. I'm relatively new to Go, so it took me a while to fully grok what pprof is indicating here: an allocation directly in the decompress function block.

I traced the code and discovered that the byte array is referenced by a slice, and by further slices on top of that, all the way into the kgo.Record struct. So it seems that, for whatever reason, kgo.Records were not being garbage collected.

We have some message queue infrastructure that isolates network IO from the processing of records, and we had some leftover code from the SQS logic that made a copy of each kgo.Record. So we were manually committing records while passing around pointers to copies of the records. Looking at the committing code this looks fine, but I'm assuming there is some logic elsewhere that does not behave well with our incorrect pointers.
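For posterity, the leftover pattern looked roughly like the sketch below (heavily simplified; the channel and helper names are made up, not our actual code):

```go
package main

import (
	"context"
	"log"

	"github.com/twmb/franz-go/pkg/kgo"
)

// process stands in for our record handler.
func process(r *kgo.Record) { _ = r.Value }

// poll copies each record and hands a pointer to the copy downstream,
// as the leftover SQS-style code did.
func poll(ctx context.Context, client *kgo.Client, queue chan<- *kgo.Record) {
	for {
		fetches := client.PollRecords(ctx, 10)
		if fetches.IsClientClosed() {
			close(queue)
			return
		}
		fetches.EachRecord(func(r *kgo.Record) {
			rec := *r     // copy the record
			queue <- &rec // pass a pointer to the copy, not the original
		})
	}
}

// work processes records and commits the copied pointers. The committing code
// itself looks fine; the suspicion is that these copies kept the record values
// (and the decompressed byte arrays they slice into) reachable.
func work(ctx context.Context, client *kgo.Client, queue <-chan *kgo.Record) {
	for rec := range queue {
		process(rec)
		if err := client.CommitRecords(ctx, rec); err != nil {
			log.Printf("commit failed: %v", err)
		}
	}
}

func main() {
	client, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"), // placeholders, as before
		kgo.ConsumeTopics("example-topic"),
		kgo.ConsumerGroup("example-group"),
		kgo.DisableAutoCommit(),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	queue := make(chan *kgo.Record, 10)
	go poll(context.Background(), client, queue)
	work(context.Background(), client, queue)
}
```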

Sovietaced (Author) commented:

Confirmed.

[Screenshot: 2024-11-15 at 10:22 PM]
