
Reduce chunk write queue memory usage #131

Merged (9 commits into main on Jun 2, 2022)

Conversation

@replay (Contributor) commented on Feb 8, 2022

This avoids wasting memory on c.chunkRefMap by re-initializing it regularly. When the map is re-initialized, it is sized to half of the peak usage observed since the last re-init event; to support this, we also track the peak usage and reset it on every re-init.

Very frequent re-initialization of the map would cause unnecessary allocations; to avoid that, two factors limit how frequently it can happen:

  • There is a minimum interval of 10min between re-init events
  • In order to re-init the map the recorded peak usage since the last re-init event must be at least 10 (objects in c.chunkRefMap).

When re-initializing the map we size it to half of the peak usage since the last re-init event, to hit the sweet spot between a very small initial size (which could require many allocations to grow it back) and a very large one (which could leave allocated memory unused).

With this solution we have the following advantages (a sketch of the shrinking heuristic follows the list):

  • If a tenant's number of active series decreases over time, we want their queue size to shrink as well. By always resetting the map to half of the previous peak, it shrinks together with the usage over time.
  • We don't want to initialize it to a size of 0, because that would cause many allocations to grow it back to the size it actually needs. By initializing it to half of the previous peak, it will rarely have to grow to more than double the initialized size.
  • We don't want to re-initialize it too frequently, because that would also cause avoidable allocations, so there is a minimum interval of 10min between re-init events.
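
For illustration, here is a minimal sketch of that shrinking heuristic. It is not the PR's exact code: the constant names, the chunkRefMapPeakSize/chunkRefMapLastShrink fields, and the assumption that shrinking only happens while the map is empty are mine; the real implementation lives in tsdb/chunks/chunk_write_queue.go.

```go
package chunks

import "time"

// Assumed names for the two constants discussed in this PR.
const (
	chunkRefMapShrinkThreshold   = 10               // minimum peak size before shrinking is worthwhile
	chunkRefMapMinShrinkInterval = 10 * time.Minute // minimum interval between shrink events
)

type chunkWriteQueue struct {
	chunkRefMap           map[uint64]interface{} // ref -> chunk; value type simplified for the sketch
	chunkRefMapPeakSize   int                    // peak len(chunkRefMap) since the last shrink
	chunkRefMapLastShrink time.Time
}

// shrinkChunkRefMap must be called with the map's mutex held. If the conditions
// are met, it discards chunkRefMap and allocates a new one sized to half of the
// peak usage observed since the last shrink, so the queue's memory follows a
// tenant's active series count down over time.
func (c *chunkWriteQueue) shrinkChunkRefMap() {
	if len(c.chunkRefMap) > 0 {
		return // only shrink an empty map, so no entries need to be copied
	}
	if c.chunkRefMapPeakSize < chunkRefMapShrinkThreshold {
		return // peak usage was too small for shrinking to pay off
	}
	if time.Since(c.chunkRefMapLastShrink) < chunkRefMapMinShrinkInterval {
		return // shrinking too often would just cause avoidable allocations
	}

	// Half of the previous peak: large enough to rarely need more than one
	// doubling to grow back, small enough to actually release memory.
	c.chunkRefMap = make(map[uint64]interface{}, c.chunkRefMapPeakSize/2)
	c.chunkRefMapPeakSize = 0
	c.chunkRefMapLastShrink = time.Now()
}
```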

@replay force-pushed the reduce_chunk_write_queue_memory_usage branch from 6448d4b to 20f3746 on February 8, 2022 17:39
@replay changed the title from "[WIP] Reduce chunk write queue memory usage" to "Reduce chunk write queue memory usage" on Feb 8, 2022
@replay marked this pull request as ready for review on February 8, 2022 20:08
@replay force-pushed the reduce_chunk_write_queue_memory_usage branch 2 times, most recently from 8b4bfa2 to 7cb970a on February 9, 2022 22:33
@replay (Contributor, Author) commented on Feb 9, 2022

I don't know how to make this CI pass... It seems to be complaining about stuff which I haven't changed. I don't want to fix everything it complains about as part of this PR because doing so would blow up the scope unnecessarily.

@replay (Contributor, Author) commented on Feb 9, 2022

@codesome @pracucci @bboreham if you get a chance I'd appreciate some feedback on this proposal.
On one hand I don't like adding 2 relatively arbitrary constants, on the other hand I think this is way too deep in the internals to expose it via config flags.
This is to solve the issue which almost blocked the rollout of r171 due to an increase in memory usage; the context is in this Slack thread.

@bboreham (Contributor) left a comment

The functionality seems fine; I had some comments about naming / definition.

@@ -121,6 +168,8 @@ func (c *chunkWriteQueue) addJob(job chunkWriteJob) (err error) {
}
}()

// c.isRunningMtx serializes the adding of jobs to the c.chunkRefMap, if c.jobs is full then c.addJob() will block
@bboreham (Contributor) commented on the line above:

For me this statement needs to go on the definition of isRunningMtx.
Ideally things should be named after what they do, so perhaps change the name.

Reply (Member):

Moved the comment. Kept the name isRunningMtx because it also protects the isRunning field.

// freeChunkRefMap checks whether the conditions to free the chunkRefMap are met,
// if so it re-initializes it and frees the memory which it currently uses.
// The chunkRefMapMtx must be held when calling this method.
func (c *chunkWriteQueue) freeChunkRefMap() {
@bboreham (Contributor) commented on freeChunkRefMap:

To me "free" is an implementation detail; what you are trying to do is "shrink" the data structure.
(Possibly needs a note that the current Go runtime never releases the internal memory used for a map, which is why you are discarding the old one and making a new one.)

Reply (Member):

Renamed "free" to "shrink" everywhere, and added a comment about the Go runtime.

@bboreham (Contributor) commented:

> 2 relatively arbitrary constants

I'm fine with them being constants; suggest putting a note in the diary to review the stats after 1 week in dev and after a few weeks in prod.

@codesome (Member) left a comment

The method looks good.
(An outdated review comment on tsdb/chunks/chunk_write_queue.go was resolved.)
@codesome (Member) commented:

I just had another thought: in the upstream PR, we thought that because addJob would block if the channel is full, the map would not grow beyond the queue size. That seems to be right, but there was a subtle bug there. Is the patch below enough, so that we don't need the more complex recycling, given that we always need to be ready for the peak usage?

diff --git a/tsdb/chunks/chunk_write_queue.go b/tsdb/chunks/chunk_write_queue.go
index 5cdd2e81f..6502c570e 100644
--- a/tsdb/chunks/chunk_write_queue.go
+++ b/tsdb/chunks/chunk_write_queue.go
@@ -128,12 +128,12 @@ func (c *chunkWriteQueue) addJob(job chunkWriteJob) (err error) {
                return errors.New("queue is not started")
        }
 
+       c.jobs <- job
+
        c.chunkRefMapMtx.Lock()
        c.chunkRefMap[job.ref] = job.chk
        c.chunkRefMapMtx.Unlock()
 
-       c.jobs <- job
-
        return nil
 }

(Basically it moves the blocking channel send before the map update; currently we update the map first and then wait on the channel.)

@replay (Contributor, Author) commented on Feb 17, 2022

I don't think we should apply that patch, for two reasons:

  1. I don't think it solves a bug, because at the time when an object is pushed into c.jobs the lock c.isRunningMtx is held; that's what I was trying to point out with the comment on line :171 in this PR:
// c.isRunningMtx serializes the adding of jobs to the c.chunkRefMap, if c.jobs is full then c.addJob() will block
// while holding c.isRunningMtx, this guarantees that c.chunkRefMap won't ever grow beyond the queue size + 1.

So if another thread calls .addJob() it will already block on trying to acquire c.isRunningMtx.

  2. If we push the job into c.jobs before adding it to c.chunkRefMap[job.ref], there's a chance that the consumer takes it from the channel and tries to delete it from the map before it's there; that would be a bug. (See the sketch below for the ordering the current code relies on.)
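
For readers following along, here is a simplified, self-contained sketch of the ordering being discussed. The field names (isRunningMtx, isRunning, chunkRefMapMtx, chunkRefMap, jobs) are taken from the snippets quoted in this thread, but the types are stand-ins rather than the real ones from tsdb/chunks.

```go
package chunks

import (
	"errors"
	"sync"
)

type chunkWriteJob struct {
	ref uint64
	chk interface{} // stands in for the chunk being written
}

type chunkWriteQueue struct {
	isRunningMtx sync.Mutex
	isRunning    bool

	chunkRefMapMtx sync.Mutex
	chunkRefMap    map[uint64]interface{}

	jobs chan chunkWriteJob
}

func (c *chunkWriteQueue) addJob(job chunkWriteJob) error {
	// isRunningMtx also serializes producers: a concurrent addJob blocks here.
	c.isRunningMtx.Lock()
	defer c.isRunningMtx.Unlock()

	if !c.isRunning {
		return errors.New("queue is not started")
	}

	// 1. Publish the chunk in the map first, so that by the time the consumer
	//    pops the job from the channel, the map entry it will delete exists.
	c.chunkRefMapMtx.Lock()
	c.chunkRefMap[job.ref] = job.chk
	c.chunkRefMapMtx.Unlock()

	// 2. Then enqueue. If the channel is full, this blocks while isRunningMtx is
	//    held, which is what bounds chunkRefMap to the queue size + 1. Sending
	//    before the map insert (as in the patch above) would let the consumer
	//    try to delete a map entry that is not there yet.
	c.jobs <- job

	return nil
}
```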

@codesome (Member) commented:

Oh right, missed the other lock, makes sense.

@pstibrany (Member) commented:

I've addressed review feedback, and would like to get this merged. Can you please re-review the PR?

I have a doubt about using a 10-minute timeout for shrinking the map; it seems too long to me. If the map is empty, we could shrink it sooner (e.g. after 1 min) and pay the allocation price after that instead.

My other concern is that the PR only focuses on the map size but doesn't address the channel size. Allocating a channel with a buffer of size 1000000 allocates all of its memory upfront and never releases it. Since a chunkWriteJob uses 64 bytes, this will use 64 MB of heap for each open TSDB.

But let's see how this works in practice and decide if we need to address timeout and/or channel size based on our experience.
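
As a rough illustration of the channel-size concern above (not code from the PR; the chunkWriteJob layout below is only an assumption that happens to come out to about 64 bytes, matching the number quoted): a buffered channel allocates its backing buffer when it is created, so the heap cost is roughly capacity times element size regardless of how many jobs are actually in flight.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Assumed layout, roughly matching the ~64 bytes mentioned above; the real
// chunkWriteJob lives in tsdb/chunks.
type chunkWriteJob struct {
	cutFile   bool
	seriesRef uint64
	mint      int64
	maxt      int64
	chk       interface{} // the chunk to write
	ref       uint64
	callback  func(error)
}

func main() {
	const queueSize = 1_000_000

	jobSize := int(unsafe.Sizeof(chunkWriteJob{})) // ~64 bytes on 64-bit platforms

	// A buffered channel's ring buffer is allocated when the channel is created,
	// so this memory stays in use for the channel's whole lifetime.
	fmt.Printf("per-TSDB channel buffer: ~%d MB (%d jobs * %d bytes)\n",
		queueSize*jobSize/1_000_000, queueSize, jobSize)
}
```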

replay and others added 8 commits on June 1, 2022 14:53
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
@pstibrany force-pushed the reduce_chunk_write_queue_memory_usage branch from d7a75bc to 56e3a4a on June 1, 2022 12:53
@bboreham (Contributor) left a comment

Seems plausible.
Given the title of the PR, I would expect some data on results, i.e. memory usage measured before and after.

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
@replay (Contributor, Author) left a comment

Thanks for picking this up @pstibrany !
Your changes look good to me.
(I can't approve my own PR, otherwise I would)

@pstibrany (Member) commented:

> Given the title of the PR, I would expect some data on results, i.e. memory usage measured before and after.

Testing on Mimir ingesters with 7 tenants shows a drop of memory allocated by chunks.newChunkWriteQueue by half (from 814 MB to 427 MB when using a queue size of 1000000). shrinkChunkRefMap didn't show up in the profile as a place allocating "in-use" objects. This is consistent with expectations: memory for the map isn't kept allocated unless it's needed, but memory for the channel itself is still in use (7 tenants, 1M channel buffer, 64 bytes per entry = 427 MiB).

@pstibrany merged commit 459f599 into main on Jun 2, 2022
@pstibrany (Member) commented:

#247 addresses the memory usage of the channel.

pstibrany added commits to grafana/mimir referencing this pull request on Jun 7 and Jun 8, 2022. Included changes:

grafana/mimir-prometheus#131
grafana/mimir-prometheus#247

These should result in lower memory usage by the chunk mapper.
pstibrany added a commit to grafana/mimir that referenced this pull request Jun 8, 2022
* Update Prometheus with async chunk mapper changes.

Included changes:

grafana/mimir-prometheus#131
grafana/mimir-prometheus#247

These result in lower memory usage by the chunk mapper.

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
@pstibrany (Member) commented:

Sent to upstream Prometheus as prometheus/prometheus#10873.

@pstibrany deleted the reduce_chunk_write_queue_memory_usage branch on November 28, 2022 08:16