
Fix dissemination large shard count #22977

Conversation

@mmaslankaprv (Member) commented Aug 21, 2024

Previously, each time a leadership update request was received by the
`metadata_dissemination_handler`, it created a copy of the updates for every
shard on the shard handling the request. This is inefficient and may lead to
an OOM on the handler shard, especially on machines with very large core counts.

Instead of creating a copy for each shard, we can simply pass a const
reference: the updates vector does not change, so it is safe to access it
from other cores.
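
A minimal sketch of the pattern described above, not the exact Redpanda code:
`leader_update` and `apply_on_shard` are hypothetical placeholders. The updates
are materialized once on the receiving shard, kept alive for the whole fan-out,
and every shard reads them through a const reference instead of its own copy.

#include <seastar/core/do_with.hh>
#include <seastar/core/future.hh>
#include <seastar/core/loop.hh>
#include <seastar/core/smp.hh>

#include <boost/range/irange.hpp>

#include <vector>

namespace ss = seastar;

struct leader_update {}; // hypothetical stand-in for an ntp leader update

// hypothetical per-shard application of the updates
ss::future<> apply_on_shard(const std::vector<leader_update>& updates) {
    (void)updates;
    return ss::make_ready_future<>();
}

ss::future<> handle_update(std::vector<leader_update> updates) {
    // do_with keeps `updates` alive until every per-shard task finishes
    return ss::do_with(
      std::move(updates), [](const std::vector<leader_update>& updates) {
          return ss::parallel_for_each(
            boost::irange<ss::shard_id>(0, ss::smp::count),
            [&updates](ss::shard_id shard) {
                // only a const reference crosses the core boundary; the
                // vector is never mutated, so no per-shard copy is needed
                return ss::smp::submit_to(
                  shard, [&updates] { return apply_on_shard(updates); });
            });
      });
}

Whether the lifetime is anchored with do_with or with a coroutine frame (as
discussed in the review below) is mostly a question of allocation overhead
and readability.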

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

Improvements

  • optimized memory usage on nodes with large core counts

vlog(clusterlog.trace, "Received a metadata update");
co_await ss::parallel_for_each(
  boost::irange<ss::shard_id>(0, ss::smp::count),
  [this, leaders = std::move(leaders)](ss::shard_id shard) {
Member
Is there a specific reason you switched to ss::do_with?

Just doing &leaders should do the same, no? The coroutine will keep it alive until ss::parallel_for_each returns?

Member Author

This part is called very often; I dropped coroutines to prevent allocations, and in this case the coroutine does not help much with readability.

Contributor

why not use _leaders.invoke_on_all(...)?

Member

> This part is called very often; I dropped coroutines to prevent allocations, and in this case the coroutine does not help much with readability.

I really wouldn't overthink this in general. do_with will need an extra alloc as well to store the state.
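
For comparison, a hedged sketch of the coroutine variant discussed in this
thread, reusing the hypothetical `leader_update` / `apply_on_shard` names from
the sketch above (plus `<seastar/core/coroutine.hh>`): the coroutine frame keeps
the updates alive until `parallel_for_each` resolves, so the lambdas can capture
a plain reference without `do_with`, at the cost of one coroutine-frame
allocation per call.

#include <seastar/core/coroutine.hh>

ss::future<> handle_update_coro(std::vector<leader_update> updates) {
    // the coroutine frame owns `updates` until the co_await below resolves,
    // so capturing a reference in the per-shard lambdas is safe
    co_await ss::parallel_for_each(
      boost::irange<ss::shard_id>(0, ss::smp::count),
      [&updates](ss::shard_id shard) {
          return ss::smp::submit_to(
            shard, [&updates] { return apply_on_shard(updates); });
      });
    // `updates` is destroyed here, after all shards have finished
}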

@StephanDollberg (Member)

This path isn't covered by any microbench right now, right? (Just asking)

@mmaslankaprv (Member Author)

> This path isn't covered by any microbench right now, right? (Just asking)

No, it is not.

The `update_leader_request` may contain only a few ntp leader updates. In
this case, using a `fragmented_vector` with a large chunk size wastes memory.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
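
A rough, purely illustrative calculation of the waste this commit describes;
the fragment and update sizes below are assumptions, not Redpanda's real
constants. A fragmented vector allocates whole fragments, so a request carrying
only a handful of updates still pays for at least one full chunk.

#include <cstddef>
#include <cstdio>

int main() {
    constexpr std::size_t fragment_bytes = 32 * 1024; // assumed large chunk size
    constexpr std::size_t update_bytes = 64;          // assumed size of one update
    constexpr std::size_t updates = 3;                // "only a few" updates

    const std::size_t used = updates * update_bytes;
    // at least one full fragment is allocated, so for a small request
    // most of the chunk sits unused
    std::printf(
      "used %zu of %zu bytes, %.1f%% of the fragment is wasted\n",
      used,
      fragment_bytes,
      100.0 * double(fragment_bytes - used) / double(fragment_bytes));
    return 0;
}
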
Previously, each time a leadership update request was received by the
`metadata_dissemination_handler`, it created a copy of the updates for every
shard on the shard handling the request. This is inefficient and may lead to
an OOM on the handler shard, especially on machines with very large core counts.

Instead of creating a copy for each shard, we can simply pass a const
reference: the updates vector does not change, so it is safe to access it
from other cores.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
@mmaslankaprv force-pushed the fix-dissemination-large-shard-count branch from 377dceb to 8e4f51f on August 21, 2024 12:09

@mmaslankaprv merged commit ec8493e into redpanda-data:dev on Aug 21, 2024
17 checks passed
@vbotbuildovich (Collaborator)

/backport v24.2.x

@vbotbuildovich (Collaborator)

/backport v24.1.x
