
store-gateway: protect from OOMs by limiting chunk bytes per/sec #6833

Open

dimitarvdimitrov opened this issue Dec 5, 2023 · 6 comments

@dimitarvdimitrov
Contributor

Background

A store-gateway can still run out of memory when enough queries that request a lot of chunks execute at the same time. Combined with #3939 this can lead to out-of-memory errors.

This is a profile from a period when the store-gateways in a Mimir cluster went out of memory. Most of the memory was spent on the chunks SlabPool:

[Screenshot (2023-12-05): memory profile with most allocations attributed to the chunks SlabPool]

Proposal

I propose to implement an instance limit that caps the total number of bytes used for chunks. The idea is to provide a new SlabPool implementation which tracks the number of currently allocated bytes and returns an error when an allocation would exceed the limit.

// Create a batched memory pool that can be released all at once.
chunksPool := pool.NewSafeSlabPool[byte](chunkBytesSlicePool, chunkBytesSlabSize)
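A minimal sketch of what such a limiting pool could look like, assuming a simplified SlabPool interface; the names below (SlabPool, LimitedSlabPool, ErrChunkBytesLimitReached) are illustrative and not the actual Mimir API:

// Hypothetical sketch of a byte-limiting slab pool. Sizes are treated as
// bytes, which holds for the []byte chunks pool.
package pool

import (
	"errors"
	"sync/atomic"
)

var ErrChunkBytesLimitReached = errors.New("chunk bytes instance limit reached")

// SlabPool is the assumed shape of the existing pool: hand out slices and
// release everything at once when the request finishes.
type SlabPool[T any] interface {
	Get(size int) []T
	Release()
}

// LimitedSlabPool wraps a SlabPool and refuses allocations once an
// instance-wide byte counter would exceed the configured limit.
type LimitedSlabPool[T any] struct {
	inner     SlabPool[T]
	limit     int64
	used      *atomic.Int64 // shared by all pools on the store-gateway
	allocated atomic.Int64  // bytes reserved by this pool, returned on Release
}

func (p *LimitedSlabPool[T]) Get(size int) ([]T, error) {
	if p.used.Add(int64(size)) > p.limit {
		p.used.Add(-int64(size))
		return nil, ErrChunkBytesLimitReached
	}
	p.allocated.Add(int64(size))
	return p.inner.Get(size), nil
}

// Release frees the underlying slabs and gives the reserved bytes back to
// the shared counter.
func (p *LimitedSlabPool[T]) Release() {
	p.used.Add(-p.allocated.Swap(0))
	p.inner.Release()
}

A single shared counter would be injected into every request's pool so the limit applies instance-wide rather than per request.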

@dimitarvdimitrov
Contributor Author

I realized this is less trivial than anticipated because we'll have to change the memcached library to also account for failed allocations.

@KyriosGN0
Contributor

Hi @dimitarvdimitrov, I would like to take a go at it. Could you give a bit more info about the memcached lib?
Also, should it be a configurable parameter?

@dimitarvdimitrov
Contributor Author

We use a fork of gomemcache which allows injecting a bytes pool for cache get operations. The cache uses the pool to allocate []byte instead of allocating it via the Go runtime (Allocator.Get() instead of make([]byte)). I believe these are the usages of the interface in grafana/gomemcache, and here in Mimir is where the allocator is provided to the cache.

If we're going the route of modifying gomemcache, we should do it in a non-breaking way, because other downstream users of grafana/gomemcache (Tempo and Loki, for example) may only implement the current interface.
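As a rough sketch of a non-breaking extension (the Allocator shape below only approximates the fork's interface; ErrorAllocator and the allocate helper are hypothetical):

// Sketch of a backwards-compatible extension to the allocator injected into
// the gomemcache fork.
package memcache

// Allocator is roughly the interface the fork already accepts for
// allocating []byte buffers during get operations.
type Allocator interface {
	Get(sz int) *[]byte
	Put(b *[]byte)
}

// ErrorAllocator is an optional extra interface: implementations that
// enforce a limit can report an allocation failure instead of allocating
// beyond the limit.
type ErrorAllocator interface {
	GetWithError(sz int) (*[]byte, error)
}

// allocate prefers the error-aware path when the injected allocator supports
// it, so existing Allocator implementations keep working unchanged.
func allocate(a Allocator, sz int) (*[]byte, error) {
	if ea, ok := a.(ErrorAllocator); ok {
		return ea.GetWithError(sz)
	}
	return a.Get(sz), nil
}

Because the new interface is optional, existing allocator implementations in other projects keep compiling and behaving as before.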

@KyriosGN0
Contributor

@dimitarvdimitrov so as I understand this, the change in the gomemcache lib should take into account that we will pass it a low value and therefore it would not be able to get objects?

@dimitarvdimitrov
Contributor Author

I was thinking of returning an error from the allocator and bubbling this up in gomemcache back to the caller.
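On the Mimir side the caller could then tell the limit error apart from an ordinary cache failure, roughly like this (the cache interface and helper below are made up for illustration, not the actual Mimir/dskit API):

// Hypothetical store-gateway-side handling once the allocator error has
// bubbled up through the cache client.
package storegateway

import (
	"context"
	"errors"
)

// ErrChunkBytesLimitReached would come from the limited pool described above.
var ErrChunkBytesLimitReached = errors.New("chunk bytes instance limit reached")

type chunksCache interface {
	FetchMultiChunks(ctx context.Context, keys []string) (map[string][]byte, error)
}

func fetchCachedChunks(ctx context.Context, cache chunksCache, keys []string) (map[string][]byte, error) {
	hits, err := cache.FetchMultiChunks(ctx, keys)
	switch {
	case errors.Is(err, ErrChunkBytesLimitReached):
		// Abort the query: failing one request is better than OOMing the
		// whole store-gateway.
		return nil, err
	case err != nil:
		// Any other cache error keeps today's behaviour: treat as a miss.
		return map[string][]byte{}, nil
	}
	return hits, nil
}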

@KyriosGN0
Contributor

@dimitarvdimitrov I will try to code up a draft PR over the week.
