
Question: Memory usage pattern and potential leak #858

Closed

Sovietaced opened this issue Nov 13, 2024 · 7 comments

Comments

Sovietaced commented Nov 13, 2024

Hello. We recently moved from Amazon SQS to Kafka and chose franz-go as our client library. So far the library has been easy to use.

Since the transition, the memory usage pattern of our service has changed: under the old resource requests the service was OOM killed fairly often, and after increasing the resource requests we see what looks like a slow leak that progresses over time.

[Screenshot: 2024-11-13 at 1:21 PM]

I installed pprof today, and the preliminary data shows that the majority of the memory consumption is in the decompression logic. The Kafka-related code appears to be responsible for almost all of the memory usage in the application.

[Screenshot: 2024-11-13 at 1:34 PM]
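For reference, the profiling is just the standard net/http/pprof wiring, roughly like this (port and details here are placeholders):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve pprof on a side port; the heap profile can then be pulled with:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the service runs here ...
	select {}
}
```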

Some notes on our usage pattern: our service has a single goroutine in which the Kafka client calls PollRecords with a maximum of 10 records per poll. Each record has a maximum decompressed size of 1MB, so I'd assume the maximum size of a decompression byte buffer would be whatever is needed to hold 10 * 1MB of data.
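To make that concrete, the consume loop is shaped roughly like the sketch below (broker, topic, and group names are placeholders, not our real config):

```go
package main

import (
	"context"
	"log"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// Placeholder configuration -- not our real brokers/topic/group.
	client, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),
		kgo.ConsumeTopics("example-topic"),
		kgo.ConsumerGroup("example-group"),
		kgo.DisableAutoCommit(), // we commit manually after processing
	)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := context.Background()
	for {
		fetches := client.PollRecords(ctx, 10) // at most 10 records per poll
		if fetches.IsClientClosed() {
			return
		}
		fetches.EachRecord(func(r *kgo.Record) {
			_ = r.Value // process the record (up to 1MB decompressed)
		})
		if err := client.CommitRecords(ctx, fetches.Records()...); err != nil {
			log.Printf("commit failed: %v", err)
		}
	}
}
```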

After looking into this logic, the best explanation for the disproportionate memory usage seems to be the sync.Pool of byte buffers that are reused, though I'm not an expert in interpreting these graphs. I'll continue to watch the memory grow now that pprof is running and update with any new findings.

Thanks!

twmb (Owner) commented Nov 13, 2024

There's a related issue, #823, and Grafana forked this to add some logic to improve the pooling for them. I haven't set aside a full day to think about an API that works generally for end users and meshes with the existing library (and then the follow-on implementation). It may be worth taking a look at the fork -- and ideally, proposing a great API to add such that there is no need for a fork anymore :)

Sovietaced (Author) commented Nov 14, 2024

I think #823 is somewhat related, but I'm not sure it will solve anything for us. That issue was mostly focused on allocations, though it does mention some improvements to avoid poisoning the sync.Pool. I'm still a bit confused: during steady-state operation our service reads 1 message (at most 1MB decompressed) every 3 seconds, so I'm trying to wrap my head around why the decompressor holds onto so much memory seemingly forever, and then slowly increases its memory usage over days.

Based on this article, it looks like a sync.Pool may hold at least as many byte buffers as there are "logical processors" (in our case 4), and potentially more depending on the level of concurrency on the pool, which in our case is probably 2 goroutines. That still wouldn't explain 130MB of usage.
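For anyone following along, here is the rough shape of the buffer-pool pattern I mean -- just an illustration of how sync.Pool can retain grown buffers, not franz-go's actual code:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// Illustrative only: a pool of byte buffers like the one being discussed.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// decompress stands in for the real decompression path.
func decompress(src []byte) []byte {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	buf.Write(src) // pretend this is the decompressed payload

	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())

	// Putting the buffer back keeps its grown capacity alive in the pool,
	// which is the "poisoning" concern: one large message can pin a large
	// buffer per logical processor.
	bufPool.Put(buf)
	return out
}

func main() {
	_ = decompress(make([]byte, 1<<20)) // a single 1MB message grows a pooled buffer to ~1MB
	out := decompress([]byte("hello"))  // later small messages still reuse that large buffer
	fmt.Println(len(out))
}
```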

I may be naive but for our simpler use case I feel like we could probably avoid caching the byte buffers entirely since we are not running at scale yet.

Edit: Just looked at our metrics and the average message size is 800 bytes.

Sovietaced (Author) commented:

Here is a look at the inuse objects.

[Screenshot: 2024-11-13 at 4:54 PM]

Sovietaced (Author) commented:

Looking at the code a bit more, I now see that the byteBuffers pool is used more widely, including by the logging implementation, which is where the concurrency might ramp up. Now I can see where poisoning the pool might have an effect here.

Sovietaced (Author) commented:

I'm unable to reproduce the issue on my local machine, so I may vendor franz-go to get more visibility into the allocations being triggered in the decompressor in our production environment, where the issue is happening.

Sovietaced (Author) commented:

Alright, I think I have mostly root-caused the issue, and it looks like a problem in our code. I was able to validate that the byte buffer pools were not an issue at all -- only a few were even allocated -- so that code works great. I now see that the allocations attributed to decompress are actually the byte array it returns after decoding. I'm relatively new to Go, so it took me a while to fully grok what pprof is indicating here: an allocation directly in the decompress function block.

I traced the code and discovered that the byte array is referenced by a slice, and by further slices on top of that, all the way into the kgo.Record struct. So it seems that, for whatever reason, kgo.Records were not being garbage collected.

We have some message queue infrastructure that isolates network IO from the processing of records, and we had some leftover code from the SQS logic that made a copy of each kgo.Record. So we were manually committing records while passing around pointers to copies of the records. Looking at the committing code this looks fine, but I'm assuming there is some logic elsewhere that does not behave well with our incorrect pointers.
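For posterity, the leftover pattern looked roughly like the sketch below (heavily simplified; the channel and helper names are made up, not our actual code):

```go
package main

import (
	"context"
	"log"

	"github.com/twmb/franz-go/pkg/kgo"
)

// process stands in for our record handler.
func process(r *kgo.Record) { _ = r.Value }

// poll copies each record and hands a pointer to the copy downstream,
// as the leftover SQS-style code did.
func poll(ctx context.Context, client *kgo.Client, queue chan<- *kgo.Record) {
	for {
		fetches := client.PollRecords(ctx, 10)
		if fetches.IsClientClosed() {
			close(queue)
			return
		}
		fetches.EachRecord(func(r *kgo.Record) {
			rec := *r     // copy the record
			queue <- &rec // pass a pointer to the copy, not the original
		})
	}
}

// work processes records and commits the copied pointers. The committing code
// itself looks fine; the suspicion is that these copies kept the record values
// (and the decompressed byte arrays they slice into) reachable.
func work(ctx context.Context, client *kgo.Client, queue <-chan *kgo.Record) {
	for rec := range queue {
		process(rec)
		if err := client.CommitRecords(ctx, rec); err != nil {
			log.Printf("commit failed: %v", err)
		}
	}
}

func main() {
	client, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"), // placeholders, as before
		kgo.ConsumeTopics("example-topic"),
		kgo.ConsumerGroup("example-group"),
		kgo.DisableAutoCommit(),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	queue := make(chan *kgo.Record, 10)
	go poll(context.Background(), client, queue)
	work(context.Background(), client, queue)
}
```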

Sovietaced (Author) commented:

Confirmed.

[Screenshot: 2024-11-15 at 10:22 PM]
