Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce chunk write queue memory usage 2 #10874

Merged
merged 3 commits into from
Jun 29, 2022

Conversation

pstibrany
Copy link
Contributor

@pstibrany pstibrany commented Jun 16, 2022

This PR reimplements chan chunkWriteJob with custom buffered queue that should use less memory, because it doesn't preallocate entire buffer for maximum queue size at once. Instead it allocates individual "segments" with smaller size.

As elements are added to the queue, they fill individual segments. When elements are removed from the queue (and segments), empty segments can be thrown away. This doesn't change memory usage of the queue when it's full, but decreases its memory footprint when queue is empty (queue will keep max 1 segment in such case).

This PR comes from grafana/mimir-prometheus#247. We use it in Grafana Mimir to reduce memory usage in Mimir clusters with thousands of open TSDBs in single process.

Related to #10873.

@pstibrany pstibrany requested a review from codesome as a code owner June 16, 2022 08:30
@pstibrany pstibrany changed the title Job queue Reduce chunk write queue memory usage Jun 16, 2022
@pstibrany pstibrany changed the title Reduce chunk write queue memory usage Reduce chunk write queue memory usage 2 Jun 16, 2022
@pstibrany
Copy link
Contributor Author

pstibrany commented Jun 16, 2022

See #10873 (comment) for first part of the story.

After applying fix from #10873 and then fix from this PR at ~14:00 to ingester-zone-b, we've got:

Legend:

We can see that applying both fixes (#10873 and from this PR) reduced memory usage of new chunk mapper to the same values as old chunk mapper.

(In this case, each zone had 20 TSDBs open altogether)

@pstibrany pstibrany force-pushed the chunk-mapper-chan branch 2 times, most recently from c7e3bdd to af02c17 Compare June 16, 2022 09:27
@pstibrany
Copy link
Contributor Author

pstibrany commented Jun 16, 2022

Here is a graph with much larger Mimir deployment with 5200 open TSDBs in each zone, after enabling new chunk mapper with both fixes in ingester-zone-a, while zones B and C still use old chunk mapper. Graph starts with rollout in zone-a, and shows cumulative memory usage for each zone (in GB):

This PR reimplements chan chunkWriteJob with custom buffered queue that should use less memory, because it doesn't preallocate entire buffer for maximum queue size at once. Instead it allocates individual "segments" with smaller size.

As elements are added to the queue, they fill individual segments. When elements are removed from the queue (and segments), empty segments can be thrown away. This doesn't change memory usage of the queue when it's full, but should decrease its memory footprint when it's empty (queue will keep max 1 segment in such case).

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
@pstibrany pstibrany force-pushed the chunk-mapper-chan branch from af02c17 to 5198a92 Compare June 17, 2022 07:44
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, thanks!

However, I don't see the mentioned improvement in the second graph: #10874 (comment) - do you mean those small steps down when the load was potentially smaller?

Anyway, it's great to reduce the allocation when the queue is smaller, so LGTM. Thanks 💪🏽

Just FYI the only negative I see is that generally linked lists can trash L-caches when iterating. And we iterate jobs to process them. However, given the write chunk job requires disk write so higher I/O - linked list overhead probably does not matter. Did you maybe check the impact of this on latency for writing those chunks (queue feeling up quicker)? I wonder if we have even a means to measure this currently. 🤔

tsdb/chunks/chunk_write_queue.go Outdated Show resolved Hide resolved
tsdb/chunks/queue.go Outdated Show resolved Hide resolved
tsdb/chunks/queue.go Outdated Show resolved Hide resolved
@pstibrany
Copy link
Contributor Author

pstibrany commented Jun 29, 2022

This looks good, thanks!

Thank you for review!

However, I don't see the mentioned improvement in the second graph: #10874 (comment) - do you mean those small steps down when the load was potentially smaller?

The improvement is in the yellow line. At first you can see it being better (lower) than green (zone-a), but worse than blue (zone-c). After applying fix from this PR (roughtly at 14 UTC on the graph), it got to the same level as blue (zone-c, running old chunk mapper).

Anyway, it's great to reduce the allocation when the queue is smaller, so LGTM. Thanks 💪🏽

Just FYI the only negative I see is that generally linked lists can trash L-caches when iterating. And we iterate jobs to process them. However, given the write chunk job requires disk write so higher I/O - linked list overhead probably does not matter. Did you maybe check the impact of this on latency for writing those chunks (queue feeling up quicker)? I wonder if we have even a means to measure this currently. 🤔

We iterate full segment first (slice of 8192 jobs), before jumping (using linked list) to the next segment. As you point out each chunk job performs IO. I haven't checked the impact on latency of writing the chunks though.

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
@bwplotka
Copy link
Member

Restarted windows test flake

@bwplotka
Copy link
Member

Happy to merge, unless you want to check it @codesome

@codesome
Copy link
Member

1 less PR to review for me? Yes please!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants