Question: Memory usage pattern and potential leak #858
Comments
There's a related issue, #823, and Grafana forked this to add some logic to improve the pooling for them. I haven't set aside a full day to think about an API that works generally for end users and meshes with the existing library (and then the follow-on implementation). It may be worth taking a look at the fork -- and ideally, proposing a great API to add such that there is no need for a fork anymore :)
I think #823 is somewhat related, but I'm not sure it will solve anything for us. That issue was mostly focused on allocations, though it does mention some improvements to avoid poisoning the pool. Based on this article, it looks like there may be at least as many byte buffers as "logical processors" (in our case 4) and potentially more depending on the level of concurrency on the pool, which in our case is probably 2 goroutines. That still wouldn't explain 130MB of usage. I may be naive, but for our simpler use case I feel like we could probably avoid caching the byte buffers entirely since we are not running at scale yet.

Edit: Just looked at our metrics and the average size of a message is 800 bytes.
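To make the pool-poisoning concern concrete, here is a toy Go sketch (not franz-go's actual pooling code; the 800-byte and 1 MB sizes are taken from this thread) showing how a single large message can leave an oversized buffer in a `sync.Pool` that small messages then keep reusing:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool stands in for the kind of decompression buffer pooling discussed
// above; it is a toy, not the library's actual implementation.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// handle "decompresses" a message into a pooled buffer and returns the buffer.
func handle(msg []byte) {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	buf.Write(msg) // stand-in for decompressing into the buffer
	fmt.Println("buffer capacity after use:", buf.Cap())
	bufPool.Put(buf) // the buffer keeps its grown capacity when returned
}

func main() {
	handle(make([]byte, 800))   // a typical ~800 byte message
	handle(make([]byte, 1<<20)) // one 1 MB message grows the pooled buffer...
	handle(make([]byte, 800))   // ...and the ~1 MB capacity sticks around for small messages too
}
```

Because `sync.Pool` keeps per-P caches, roughly `GOMAXPROCS` such buffers can be live at once, which lines up with the "at least as many byte buffers as logical processors" observation above.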
Looking at the code a bit more, I now see that the
I'm unable to reproduce the issue on my local machine, but I'm going to possibly vendor the library.
Alright, I think I have somewhat root-caused the issue, and I think it's a problem in our code. I was able to validate that the byte buffer pools were not an issue at all; only a few were even allocated, so that code works great. I now see where the allocations come from: I traced the code and discovered that the byte array is referenced by a slice, and by many slices on top of that, all the way up into our own code.

We have some message queue infrastructure that isolates network IO from the processing of records, and we had some leftover code from the SQS logic that was making a copy of each record.
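For illustration of the retention pattern described here, a generic Go sketch (the `record` type is made up for this example; it is not kgo's type and not our actual code) showing how holding a sub-slice of a large decompressed buffer pins the whole backing array, while copying the bytes releases it:

```go
package main

import "fmt"

// record mimics a consumed record whose Value is a sub-slice of a much larger
// decompressed batch buffer. The type and field names are illustrative only.
type record struct {
	Value []byte
}

func main() {
	batch := make([]byte, 1<<20) // a 1 MB decompressed batch

	// Keeping this record alive pins the entire 1 MB backing array, because
	// Value shares batch's underlying storage.
	pinned := record{Value: batch[:800]}

	// Copying the bytes detaches the record from the big buffer, so the batch
	// can be garbage collected once nothing else references it.
	detached := record{Value: append([]byte(nil), batch[:800]...)}

	// The first capacity reflects the full 1 MB buffer; the second only ~800 bytes.
	fmt.Println(cap(pinned.Value), cap(detached.Value))
}
```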
Hello. We recently moved from using Amazon SQS to using Kafka, and we chose `franz-go` as our client library. So far the library has been easy to use.

During the transition from Amazon SQS, it seems that the memory usage pattern of our service has changed with the use of Kafka and `franz-go`, and we've had our service OOM killed fairly often under the old resource requests. After increasing our resource requests, we seem to see a slow leak of sorts that progresses over time.

I installed pprof today, and the preliminary data shows that the majority of the memory consumption is in the decompression logic. The Kafka-related code seems to be responsible for almost all of the memory usage in the application.
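For context, here is a minimal sketch of the kind of pprof wiring assumed in this report (the listen address is made up; the issue doesn't show how pprof was actually installed):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the profiling endpoints on a side port. The heap profile can then
	// be inspected with: go tool pprof http://localhost:6060/debug/pprof/heap
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the service's real work
}
```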
Some notes on our usage pattern: our service has a single goroutine in which the Kafka client calls `PollRecords` with a max poll records of 10 (a rough sketch of this loop is below). Each record has a maximum decompressed size of 1MB, so I'd assume that the maximum size of a decompression byte buffer would be whatever is necessary to hold 10 * 1MB of data.

After looking into this logic, the best explanation for the disproportionate memory usage seems to be the `sync.Pool` of byte buffers that are reused, but I'm not an expert in interpreting these graphs. I'll continue to watch the memory grow now that pprof is running and will update with any new findings.

Thanks!
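Here is a rough sketch of the polling loop described above. The broker address, topic, and group names are placeholders, and error handling is reduced to logging; it is not the service's actual code:

```go
package main

import (
	"context"
	"log"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"), // placeholder broker address
		kgo.ConsumeTopics("events"),       // placeholder topic
		kgo.ConsumerGroup("my-service"),   // placeholder group
	)
	if err != nil {
		log.Fatal(err)
	}
	defer cl.Close()

	ctx := context.Background()
	for {
		// At most 10 records per poll, matching the usage described above.
		fetches := cl.PollRecords(ctx, 10)
		if errs := fetches.Errors(); len(errs) > 0 {
			log.Printf("fetch errors: %v", errs)
		}
		fetches.EachRecord(func(r *kgo.Record) {
			process(r.Value)
		})
	}
}

func process(value []byte) {
	// Placeholder for the service's record handling.
	_ = value
}
```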