Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node-local shard assignments for created/moved partitions #18581

Merged
merged 19 commits into from
Jun 10, 2024

Conversation

ztlpn
Copy link
Contributor

@ztlpn ztlpn commented May 21, 2024

This PR implements node-local core assignment - nodes decide themselves on which core to put partitions instead of using global assignments. Partitions are placed so that topic-aware core counts are balanced. No online balancing is preformed yet (i.e. cores only for newly appearing partitions are chosen).

More detailed changes rundown:

  • Implemented migrator for the feature flag, so that we will switch to node-local assignment only after all nodes in the cluster successfully persist their shard placement tables
  • Implemented topic-aware partition placement in shard_balancer. For this it needs to maintain per-topic core counts.
  • Added admin API to manually change replica core on a node
  • Modified partition_allocator to stop tracking and assigning cores
  • Modified partition admin APIs to ignore core values passed by the client
  • Modified tests to accommodate both new and old way (for upgraded clusters) of assigning cores.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

Improvements

  • Implemented node-local core assignment - nodes decide themselves on which core to put partitions instead of using global assignments. This feature is enabled by default for new clusters.

@ztlpn
Copy link
Contributor Author

ztlpn commented May 21, 2024

/ci-repeat

@ztlpn ztlpn force-pushed the flex-assignment-local branch from a995272 to 9f26058 Compare May 22, 2024 10:16
@ztlpn
Copy link
Contributor Author

ztlpn commented May 22, 2024

/ci-repeat

@ztlpn ztlpn force-pushed the flex-assignment-local branch from e01b9e1 to b9b3207 Compare May 23, 2024 14:57
@ztlpn
Copy link
Contributor Author

ztlpn commented May 23, 2024

/ci-repeat

1 similar comment
@ztlpn
Copy link
Contributor Author

ztlpn commented May 27, 2024

/ci-repeat

@ztlpn ztlpn force-pushed the flex-assignment-local branch from 608f5c9 to 5ba91dc Compare May 28, 2024 17:31
@ztlpn
Copy link
Contributor Author

ztlpn commented May 28, 2024

/ci-repeat

@ztlpn ztlpn force-pushed the flex-assignment-local branch from 5ba91dc to 78ac5b8 Compare May 28, 2024 18:29
@ztlpn
Copy link
Contributor Author

ztlpn commented May 28, 2024

/ci-repeat

@ztlpn ztlpn force-pushed the flex-assignment-local branch from 78ac5b8 to 71e7417 Compare May 29, 2024 22:52
@ztlpn
Copy link
Contributor Author

ztlpn commented May 29, 2024

/ci-repeat

@ztlpn ztlpn force-pushed the flex-assignment-local branch 2 times, most recently from 0000f63 to 8e25a46 Compare June 3, 2024 22:31
@ztlpn
Copy link
Contributor Author

ztlpn commented Jun 3, 2024

/ci-repeat

We are going to use this feature flag to mark transition to node-local
core assignment, so rename it. It will also require migration (during
which we will wait for all nodes to persist their shard placement
tables), so mark it as requires_migration.
@ztlpn ztlpn force-pushed the flex-assignment-local branch from 8e25a46 to c645225 Compare June 4, 2024 12:17
@ztlpn ztlpn marked this pull request as ready for review June 4, 2024 12:34
@ztlpn ztlpn requested a review from a team as a code owner June 4, 2024 12:34
micheleRP
micheleRP previously approved these changes Jun 4, 2024
Copy link

@micheleRP micheleRP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good for doc!

@ztlpn
Copy link
Contributor Author

ztlpn commented Jun 4, 2024

Test failure is #18386

@mmaslankaprv
Copy link
Member

this is great, mostly nit comments

ztlpn added 16 commits June 7, 2024 12:21
It is inconvenient to override start/stop methods because they have to
do some common bookkeeping (such as setting the abort source). Instead,
give derived classes a possibility to override do_migrate() method if
the migration is more complicated than executing some action on the
controller leader (for these simple cases they can override do_mutate).
When enabling node-local shard placement, we need 2 stages: first, we
need to persist current state of the shard_placement_table to kvstore,
and then, after *all* nodes finish this, shard values in topic_table
become obsolete.

Implement this 2-stage process using feature migrator and a feature
barrier.
For the purpose of balancing assignments we need to maintain overall and
per-topic partition shard counts.
Assign shards based on local information, not on assignments from
topic_table.
The method allows changing a partition replica shard assignment from any
cluster node (not just the node hosting that replica).
No functional changes in this commit.
If node-local core assignment is active, we stop doing per-shard
accounting and assign invalid/zero shard when allocating new partition
replicas.
If node-local core assignment is enabled, we can ignore replica core
values passed by the client.
@ztlpn ztlpn force-pushed the flex-assignment-local branch from c645225 to 4ba1aed Compare June 7, 2024 10:38
@ztlpn ztlpn requested a review from mmaslankaprv June 7, 2024 10:38
mmaslankaprv
mmaslankaprv previously approved these changes Jun 7, 2024
@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Jun 7, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/50000#018ff29e-4606-40c5-9480-b17765609ed1:

"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_learner_gap_metrics"
"rptest.tests.shard_placement_test.ShardPlacementTest.test_upgrade"

new failures in https://buildkite.com/redpanda/redpanda/builds/50000#018ff29e-4608-49f3-adc2-acceb1e8db8a:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_admin_api_recovery.cloud_storage_type=CloudStorageType.S3"

ztlpn added 2 commits June 7, 2024 19:17
Add a test checking basic functionality of node-local shard placement
after upgrade.
In tests we need to support both the old way to dispatch x-core
movements (via admin.set_partition_replicas) and the new way (via
admin.set_partition_replica_core) depending on whether the cluster is
new or in the process of upgrade. To allow that, change functions in
PartitionMovementMixin:
1) allow omitting "core" field in replica assignment dicts (absent field
   means that the core doesn't matter)
2) add node_local_core_assignment flag to switch between the old and the
   new way.

Also, modify available_policy for the feature flag to ensure that for
most tests node-local core assignment will be enabled.
@ztlpn
Copy link
Contributor Author

ztlpn commented Jun 7, 2024

Test failures triage:

  • NodesDecommissioningTest.test_learner_gap_metrics: also happens in dev
  • ShardPlacementTest.test_upgrade: test problem (didn't wait for the feature to become available, fixed)
  • TopicRecoveryTest.test_admin_api_recovery: unrelated, looks similar to what was reported here

@ztlpn ztlpn requested a review from mmaslankaprv June 7, 2024 20:07
@ztlpn ztlpn merged commit dcbba8c into redpanda-data:dev Jun 10, 2024
19 checks passed
@ztlpn ztlpn deleted the flex-assignment-local branch June 10, 2024 10:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants