
Node-local core assignment: rebalancing #19864

Merged: 14 commits into redpanda-data:dev, Jun 26, 2024

Conversation

Contributor
@ztlpn ztlpn commented Jun 17, 2024

Introduce rebalancing of partitions across cores, triggered by the following events:

  • core count change (currently only increases are supported)
  • partition replica removal from the node
  • manual trigger via the admin API
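As a rough sketch of how these three triggers could drive a node-local balancer (hypothetical types and names, not the actual Redpanda implementation):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical sketch of a node-local shard balancer. Partitions are
// assigned to shards (cores); rebalance() greedily moves partitions from
// the most loaded shard to the least loaded one.
struct shard_balancer {
    uint32_t core_count;
    std::map<int, uint32_t> partition_to_shard; // partition id -> shard

    // Trigger 1: core count change (only increases supported here).
    void on_core_count_change(uint32_t new_count) {
        assert(new_count >= core_count);
        core_count = new_count;
        rebalance();
    }

    // Trigger 2: a partition replica was removed from this node.
    void on_partition_removed(int partition) {
        partition_to_shard.erase(partition);
        rebalance();
    }

    // Trigger 3: manual trigger, e.g. from an admin API handler.
    void on_manual_trigger() { rebalance(); }

    void rebalance() {
        assert(core_count > 0);
        for (;;) {
            // Count partitions per shard.
            std::vector<uint32_t> counts(core_count, 0);
            for (auto& [p, s] : partition_to_shard) {
                counts[s]++;
            }
            // Find the least and most loaded shards.
            uint32_t min_s = 0, max_s = 0;
            for (uint32_t s = 0; s < core_count; ++s) {
                if (counts[s] < counts[min_s]) min_s = s;
                if (counts[s] > counts[max_s]) max_s = s;
            }
            // Balanced when shard counts differ by at most 1.
            if (counts[max_s] - counts[min_s] <= 1) {
                return;
            }
            // Move one partition from the fullest to the emptiest shard.
            for (auto& [p, s] : partition_to_shard) {
                if (s == max_s) {
                    s = min_s;
                    break;
                }
            }
        }
    }
};
```

The real implementation is more involved (it must actually transfer partition state between shards), but the trigger structure is the same: each event funnels into a single rebalance pass.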

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

Features

  • If the node_local_core_assignment flag is enabled, Redpanda will try to maintain a balanced distribution of partition replicas across cores.

@ztlpn ztlpn requested a review from a team as a code owner June 17, 2024 18:53
@ztlpn ztlpn changed the title from "Node-local partition assignment: rebalancing" to "Node-local core assignment: rebalancing" on Jun 17, 2024
@ztlpn ztlpn force-pushed the flex-assignment-balancing branch from 061a14a to 3a4cc8b on June 18, 2024
kbatuigas previously approved these changes Jun 18, 2024

@kbatuigas kbatuigas left a comment

Added one clarifying question

, shard_balancing_on_core_count_change(
*this,
"shard_balancing_on_core_count_change",
"If enabled, and if after a restart the number of cores changes, "

Suggested change
"If enabled, and if after a restart the number of cores changes, "
"If 'true', Redpanda moves partitions between shards if the number of cores changes after a restart, "

@ztlpn I see that currently in our docs we define "shard" as "core" or "logical CPU core". Is there now a distinction in this case? I'm not sure if they are interchangeable here e.g. "Redpanda moves partitions between shards if the number of shards changes after a restart"

Contributor Author

@ztlpn ztlpn Jun 20, 2024

@kbatuigas They are mostly interchangeable. To be very precise, "shard" is a Seastar concept, referring to a compartmentalized part of the Redpanda application that runs as a single thread. Actually, you can run Redpanda with fewer or more shards than CPU cores, but the most common and sensible configuration is a 1-to-1 mapping. That's why we say they are interchangeable.

I've been mostly using the word "core" for user-visible stuff such as configuration property names (because it is a concept that is more familiar to users), but switching to "shard" for precise and/or internal wording. Let me know if it makes sense or if you want to change it.

Contributor Author

Though for consistency I should probably rename the properties to core_balancing_* then :)


I think consistency is good. If "shard" is mostly an implementation detail and does not impact how customers should understand or use the property, it makes sense to stick to core_. Thank you!

Resolved review threads:

  • src/v/config/configuration.cc (outdated, resolved) ×4
  • src/v/cluster/shard_placement_table.h (outdated, resolved)
  • src/v/cluster/shard_balancer.cc (resolved)
mmaslankaprv previously approved these changes Jun 25, 2024
Member

@mmaslankaprv mmaslankaprv left a comment

LGTM, great PR

@ztlpn ztlpn dismissed stale reviews from mmaslankaprv and kbatuigas via 7cc114d June 25, 2024 23:50
@ztlpn ztlpn force-pushed the flex-assignment-balancing branch from 3a4cc8b to 7cc114d on June 25, 2024
ztlpn added 7 commits June 26, 2024 09:52

  • Save the core count of the last successful rebalance in the kvstore, and if it doesn't match the current core count, trigger a rebalance on startup.
  • To maintain balanced shard counts, we need to rebalance after partitions are moved away from the node, because the distribution of the remaining partitions might be unbalanced.
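The startup check described in the first commit message could be sketched like this (hypothetical kvstore interface and key name; the real code uses Redpanda's own kvstore abstraction):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Stand-in for a persistent key-value store; a placeholder for
// Redpanda's kvstore. The key name below is hypothetical.
struct kvstore {
    std::map<std::string, uint32_t> data;

    std::optional<uint32_t> get(const std::string& key) const {
        auto it = data.find(key);
        if (it == data.end()) {
            return std::nullopt;
        }
        return it->second;
    }

    void put(const std::string& key, uint32_t value) { data[key] = value; }
};

// On startup, compare the core count saved at the last successful
// rebalance with the current one; if they differ (or nothing was saved
// yet), trigger a rebalance and persist the new count on success.
// Returns true if a rebalance was triggered.
bool maybe_rebalance_on_startup(kvstore& kvs, uint32_t current_cores) {
    auto saved = kvs.get("balanced_core_count");
    if (saved && *saved == current_cores) {
        return false; // core count unchanged since the last rebalance
    }
    // ... rebalance partitions across shards here ...
    kvs.put("balanced_core_count", current_cores); // save on success
    return true;
}
```

Persisting the count only after a successful rebalance means that a crash mid-rebalance simply retries on the next startup.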
@ztlpn ztlpn force-pushed the flex-assignment-balancing branch from 7cc114d to 928723d on June 26, 2024
@ztlpn ztlpn merged commit 1c40c32 into redpanda-data:dev Jun 26, 2024
19 checks passed
@ztlpn ztlpn deleted the flex-assignment-balancing branch June 26, 2024 15:51
5 participants