Prioritize primary shard movement during shard relocation #1445
Conversation
Signed-off-by: Ankit Jain <jain.ankitk@gmail.com>
✅ Gradle Wrapper Validation success d85c55c
Can one of the admins verify this patch?
✅ DCO Check Passed d85c55c
Signed-off-by: Ankit Jain <jain.ankitk@gmail.com>
✅ DCO Check Passed 9f5b306
✅ Gradle Wrapper Validation success 9f5b306
✅ Gradle Precommit success 9f5b306
A few things missing here:
- What problem is this solving? The linked issue doesn't explain the problem.
- The PR is missing tests. This is a change to code in the critical path. Unit and integration tests should be included.
- Are there benchmarks that show this isn't introducing undesired regressions for common cases? Are there benchmarks that show this is improving performance for certain use cases?
- Is this an enhancement or a new feature?
- What version is this targeting? 2.0? 1.2? 1.3?
Thank you @nknize for the initial review.
Let's say we exclude a set of nodes (based on some attribute) via a cluster setting. The current implementation of BalancedShardsAllocator iterates over those nodes breadth-first, picking one shard from each node and repeating the process. The shards on each node are picked randomly, so it can happen that we relocate the p and r of shard1 first, leaving behind both the p and r of shard2. If the excluded nodes were to go down, the cluster becomes red. If we instead prioritize the p of both shard1 and shard2 first, the cluster does not become red if the excluded nodes go down before the other shards are relocated (a small sketch of the two orderings follows at the end of this comment).
While existing UTs give sufficient coverage, I am planning to add more unit and integration tests for this specific functionality.
I don't expect this to improve performance, as the change is looking to improve robustness. There should not be any regression either, as the implementation cleanly abstracts out the functionality.
This is an enhancement to improve robustness.
I am looking to target this for 1.3.
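To make that ordering concrete, here is a hypothetical sketch (Java 17 syntax; Shard is a stand-in for ShardRouting, not the actual allocator code) contrasting the random order with a primary-first order:

import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for ShardRouting: a shard id plus a primary flag.
record Shard(String id, boolean primary) {}

class RelocationOrderSketch {
    public static void main(String[] args) {
        // Both copies of shard1 and shard2 sit on the excluded nodes.
        List<Shard> onExcludedNodes = List.of(
            new Shard("shard1", true), new Shard("shard1", false),
            new Shard("shard2", true), new Shard("shard2", false)
        );

        // Random order may relocate shard1[p] and shard1[r] first; if the excluded
        // nodes die at that point, shard2 has no copy left and the cluster turns red.

        // Primary-first order relocates shard1[p] and shard2[p] before any replica,
        // so every shard keeps at least its primary on a surviving node.
        List<Shard> primaryFirst = onExcludedNodes.stream()
            .sorted(Comparator.comparing((Shard s) -> !s.primary())) // false (primary) sorts first
            .collect(Collectors.toList());

        primaryFirst.forEach(s -> System.out.println(s.id() + (s.primary() ? "[p]" : "[r]")));
        // Prints: shard1[p], shard2[p], shard1[r], shard2[r]
    }
}

With the random order nothing prevents both copies of the same shard from being the first two relocations; the primary-first order removes that failure mode.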
Benchmark results for shard relocation:
@nknize - I have uploaded the results from the shard relocation benchmark. Kindly review.
Signed-off-by: Ankit Jain <jain.ankitk@gmail.com>
✅ Gradle Wrapper Validation success 8e7c453
Thank you so much for the information @jainankitk! This is super useful and I think a good example of a thorough PR for a critical core change.
It looks like this is the crux of the problem? Can you add an integration test that simulates this use case? I don't think the existing UTs provide sufficient coverage beyond detecting basic logic failures. We need a thorough integration test that validates and verifies that the change does what we expect it to and prevents those RED scenarios.
I'm not so sure I agree. It looks like performance is a function of scale (the number of indices, shards, replicas, and nodes).
Possibly at the cost of performance. We might consider placing this behind a cluster-wide setting and discuss what the default should be? Or is this concerning enough that it's worth paying the performance penalty?
I will try to add an integration test simulating this use case and ensure it passes with this code change.
Though the performance numbers don't look concerning to me, as the absolute time difference is fairly small, it is worth placing this behind a cluster-wide setting. We can begin with the default disabled, and consider switching the default once we get more confidence?
Today, shards are balanced across nodes with count as the key factor, irrespective of whether they are primaries or replicas. There are multiple scenarios where certain nodes (e.g. a restarted node) will only have replica shards.
I like how ordering the relocation to consider primaries first can help with robustness, but I have concerns that this change can yield a net reduction in relocation speed. By considering only primaries first, overall relocation speed for the cluster can drop, with a tail on those nodes that have more replicas than primaries. Could you please add a test case to the relocation benchmarks for this?
Moving from simple count-based to primary-count-based balancing logic in the allocator would address this limitation. In the meantime, gating this change behind a setting (defaulting to disabled) could be a good option to experiment with.
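For reference, a minimal sketch of what such a gate could look like in OpenSearch. The setting key comes from the merged commit message further down this thread; the wrapper class and constant name are illustrative, not the actual source:

import org.opensearch.common.settings.Setting;

// Illustrative only; the real constant lives in the OpenSearch allocation code.
public final class MovePrimaryFirstSettingSketch {
    public static final Setting<Boolean> MOVE_PRIMARY_FIRST_SETTING = Setting.boolSetting(
        "cluster.routing.allocation.move.primary_first", // key named in the merged commit message
        false,                                           // default: disabled, as discussed above
        Setting.Property.Dynamic,                        // can be flipped on a live cluster
        Setting.Property.NodeScope
    );
}

Registering it as Dynamic would let operators experiment on a running cluster before any default change is considered.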
// Maps the primary flag to an index into the shards array: primary -> 0, replica -> 1.
private static Map<Boolean, Integer> map = new HashMap<Boolean, Integer>() {
    {
        put(true, 0);
        put(false, 1);
    }
};
If the only use case is indexing into the shards array (0 or 1), would an enum be better?
Maybe I am missing something here, but an enum would define a new type, which is different from the boolean returned by shardRouting.primary()?
You are right, an enum won't work here. We can leave it as is.
Maybe
org.opensearch.common.collect.Map.of(true, 0, false, 1)
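For context, a hedged sketch of how that immutable map could drive the two-bucket lookup discussed above (using java.util.Map.of here for a self-contained example; the shards array layout and names are hypothetical, not the actual RoutingNode fields):

import java.util.Arrays;
import java.util.Map;

class PrimaryIndexLookupSketch {
    // Immutable lookup: primary -> index 0, replica -> index 1.
    private static final Map<Boolean, Integer> PRIMARY_INDEX = Map.of(true, 0, false, 1);

    public static void main(String[] args) {
        // Hypothetical two-bucket layout: index 0 holds primaries, index 1 holds replicas.
        String[][] shards = {
            { "shard1[p]", "shard2[p]" },
            { "shard1[r]", "shard2[r]" }
        };
        boolean primary = true; // e.g. the value of shardRouting.primary()
        System.out.println(Arrays.toString(shards[PRIMARY_INDEX.get(primary)]));
        // Prints: [shard1[p], shard2[p]]
    }
}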
Signed-off-by: Ankit Jain <jain.ankitk@gmail.com>
✅ Gradle Wrapper Validation success fc281a8
Signed-off-by: Ankit Jain <jain.ankitk@gmail.com>
Nice! Thx for putting this behind a cluster setting! Tests look good as well. LGTM!
Merged to main. I updated the commit message to be more descriptive of the problem this change addresses. Can we get a separate PR to add some more thorough documentation of this new cluster setting and its behavior (including performance as a function of scale)?
Just a couple minor comments, otherwise looks good to me.
When some node or set of nodes is excluded (based on some cluster setting), BalancedShardsAllocator iterates over them in breadth-first order, picking 1 shard from each node and repeating the process until all shards are balanced. Since shards from each node are picked randomly, it's possible that the p and r of shard1 are relocated first, leaving behind both p and r of shard2. If the excluded nodes were to go down, the cluster becomes red.

This commit introduces a new setting "cluster.routing.allocation.move.primary_first" that prioritizes the p of both shard1 and shard2 first, so the cluster does not become red if the excluded nodes were to go down before relocating other shards.

Note that with this setting enabled, performance of this change is a direct function of the number of indices, shards, replicas, and nodes. The larger the indices, replicas, and distribution scale, the slower the allocation becomes. This should be used with care.

Signed-off-by: Ankit Jain <jain.ankitk@gmail.com>
(cherry picked from commit 6eb8f6f)
Co-authored-by: Ankit Jain <jain.ankitk@gmail.com>
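For completeness, a hedged sketch of enabling the new setting from Java client code. ClusterUpdateSettingsRequest and Settings are real OpenSearch types, but the surrounding wiring (how the request is executed) is omitted and the helper class is made up:

import org.opensearch.action.admin.cluster.settings.ClusterUpdateSettingsRequest;
import org.opensearch.common.settings.Settings;

class EnablePrimaryFirstSketch {
    // Builds a transient (non-persistent) settings update turning primary-first movement on.
    static ClusterUpdateSettingsRequest enablePrimaryFirst() {
        return new ClusterUpdateSettingsRequest().transientSettings(
            Settings.builder()
                .put("cluster.routing.allocation.move.primary_first", true) // default is false
                .build()
        );
    }
}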
…relocation (#8875) (#9153)

When some node or set of nodes is excluded, the shards are moved away in random order. When segment replication is enabled for a cluster, we might end up in a mixed-version state where replicas are on a lower version, unable to read segments sent from higher-version primaries, and fail. To avoid this, we could prioritize replica shard movement so we never enter that situation.

This adds a new shard movement strategy setting, `SHARD_MOVEMENT_STRATEGY_SETTING`, that allows us to specify the order in which we want to move our shards: `NO_PREFERENCE` (default), `PRIMARY_FIRST` or `REPLICA_FIRST`. The `PRIMARY_FIRST` option performs the same behavior as the previous setting `SHARD_MOVE_PRIMARY_FIRST_SETTING`, which is now deprecated in favor of the shard movement strategy setting.

Expected behavior: if `SHARD_MOVEMENT_STRATEGY_SETTING` is changed from its default to either `PRIMARY_FIRST` or `REPLICA_FIRST`, we perform that behavior whether or not `SHARD_MOVE_PRIMARY_FIRST_SETTING` is enabled. If `SHARD_MOVEMENT_STRATEGY_SETTING` is still at its default of `NO_PREFERENCE` and `SHARD_MOVE_PRIMARY_FIRST_SETTING` is enabled, we move the primary shards first. This ensures that users still using the old setting will not see any change in behavior.

Reference: #1445
Parent issue: #3881

Signed-off-by: Poojita Raj <poojiraj@amazon.com>
(cherry picked from commit c6e4bcd)
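A hedged sketch of what an enum-backed version of that setting could look like. Only the constant name and the three values come from the commit message above; the key string, enum, and class here are assumptions for illustration:

import java.util.Locale;

import org.opensearch.common.settings.Setting;

public final class ShardMovementStrategySketch {

    // Values taken from the commit message above.
    public enum ShardMovementStrategy { NO_PREFERENCE, PRIMARY_FIRST, REPLICA_FIRST }

    // The key string is assumed for illustration, not confirmed by this thread.
    public static final Setting<ShardMovementStrategy> SHARD_MOVEMENT_STRATEGY_SETTING = new Setting<>(
        "cluster.routing.allocation.shard_movement_strategy",
        ShardMovementStrategy.NO_PREFERENCE.toString(),
        value -> ShardMovementStrategy.valueOf(value.toUpperCase(Locale.ROOT)),
        Setting.Property.Dynamic,
        Setting.Property.NodeScope
    );
}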
Description
The primary shards are always picked up first from a node for shard movement. This is achieved by bucketing the shards into primaries and replicas and iterating over the primaries first.
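A minimal sketch of that bucketing, assuming a generic shard type and a primary predicate (illustrative only, not the actual RoutingNode implementation):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

final class PrimaryFirstIterationSketch {
    // Splits shards into two buckets and iterates the primary bucket before the replica bucket.
    static <T> Iterator<T> primariesFirst(Iterable<T> shards, Predicate<T> isPrimary) {
        List<T> primaries = new ArrayList<>();
        List<T> replicas = new ArrayList<>();
        for (T shard : shards) {
            (isPrimary.test(shard) ? primaries : replicas).add(shard);
        }
        List<T> ordered = new ArrayList<>(primaries);
        ordered.addAll(replicas); // every primary comes before any replica
        return ordered.iterator();
    }
}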
Issues Resolved
#1349
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.