Prioritize primary shard movement during shard relocation #1445
Conversation
Signed-off-by: Ankit Jain <jain.ankitk@gmail.com>
✅ Gradle Wrapper Validation success d85c55c
Can one of the admins verify this patch?
✅ DCO Check Passed d85c55c
Signed-off-by: Ankit Jain <jain.ankitk@gmail.com>
✅ DCO Check Passed 9f5b306
✅ Gradle Wrapper Validation success 9f5b306
✅ Gradle Precommit success 9f5b306
A few things missing here:
- What problem is this solving? The linked issue doesn't explain the problem.
- The PR is missing tests. This is a change to code in the critical path. Unit and integration tests should be included.
- Are there benchmarks that show this isn't introducing undesired regressions for common cases? Are there benchmarks that show this is improving performance for certain use cases?
- Is this an enhancement or a new feature?
- What version is this targeting? 2.0? 1.2? 1.3?
Thank you @nknize for the initial review.
Let's say we exclude a set of nodes (based on some attribute) via a cluster setting. The current implementation of BalancedShardsAllocator iterates over those nodes breadth-first, picking one shard from each node and repeating the process. The shards on each node are picked randomly, so it can happen that we relocate the p and r of shard1 first, leaving behind both the p and r of shard2. If the excluded nodes were to go down, the cluster becomes red. If we instead prioritize the p of both shard1 and shard2 first, the cluster does not become red if the excluded nodes go down before the other shards are relocated (a small sketch of the two orderings follows at the end of this comment).
While existing UTs give sufficient coverage, I am planning to add more unit and integration tests for this specific functionality.
I don't expect this to improve performance, as the change is looking to improve robustness. There should not be any regression either, as the implementation cleanly abstracts out the functionality.
This is an enhancement to improve robustness.
I am looking to target this for 1.3.
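To make that ordering concrete, here is a hypothetical sketch (Java 17 syntax; Shard is a stand-in for ShardRouting, not the actual allocator code) contrasting the random order with a primary-first order:

import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for ShardRouting: a shard id plus a primary flag.
record Shard(String id, boolean primary) {}

class RelocationOrderSketch {
    public static void main(String[] args) {
        // Both copies of shard1 and shard2 sit on the excluded nodes.
        List<Shard> onExcludedNodes = List.of(
            new Shard("shard1", true), new Shard("shard1", false),
            new Shard("shard2", true), new Shard("shard2", false)
        );

        // Random order may relocate shard1[p] and shard1[r] first; if the excluded
        // nodes die at that point, shard2 has no copy left and the cluster turns red.

        // Primary-first order relocates shard1[p] and shard2[p] before any replica,
        // so every shard keeps at least its primary on a surviving node.
        List<Shard> primaryFirst = onExcludedNodes.stream()
            .sorted(Comparator.comparing((Shard s) -> !s.primary())) // false (primary) sorts first
            .collect(Collectors.toList());

        primaryFirst.forEach(s -> System.out.println(s.id() + (s.primary() ? "[p]" : "[r]")));
        // Prints: shard1[p], shard2[p], shard1[r], shard2[r]
    }
}

With the random order nothing prevents both copies of the same shard from being the first two relocations; the primary-first order removes that failure mode.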
Benchmark results for shard relocation:
@nknize - I have uploaded the results from the shard relocation benchmark. Kindly review.
Signed-off-by: Ankit Jain <jain.ankitk@gmail.com>
✅ Gradle Wrapper Validation success 8e7c453
Thank you so much for the information @jainankitk! This is super useful and I think a good example of a thorough PR for a critical core change.
It looks like this is the crux of the problem? Can you add an integration test that simulates this use case? I don't think the existing UTs provide sufficient coverage beyond detecting basic logic failures. We need a thorough integration test that validates and verifies that the change does what we expect it to and prevents those RED scenarios.
I'm not so sure I agree. It looks like performance is a function of scale (the number of indices, shards, replicas, and nodes).
Possibly at the cost of performance. We might consider placing this behind a cluster-wide setting and discuss what the default should be? Or is this concerning enough that it's worth paying the performance penalty?
I will try to add an integration test simulating this use case and ensure it passes with this code change.
Though the performance numbers don't look concerning to me, as the absolute time difference is fairly small, it is worth placing this behind a cluster-wide setting. We can begin with the default disabled, and consider switching the default once we get more confidence?
Today, shards are balanced across nodes with count as the key factor, irrespective of whether they are primaries or replicas. There are multiple scenarios where certain nodes (e.g. a restarted node) will only have replica shards.
I like how ordering the relocation to consider primaries first can help with robustness, but I have concerns that this change can yield a net reduction in relocation speed. By considering only primaries first, overall relocation speed for the cluster can drop, with a tail on those nodes that have more replicas than primaries. Could you please add a test case to the relocation benchmarks for this?
Moving from simple count-based to primary-count-based balancing logic in the allocator would address this limitation. In the meantime, gating this change behind a setting (defaulting to disabled) could be a good option to experiment with.
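For reference, a minimal sketch of what such a gate could look like in OpenSearch. The setting key comes from the merged commit message further down this thread; the wrapper class and constant name are illustrative, not the actual source:

import org.opensearch.common.settings.Setting;

// Illustrative only; the real constant lives in the OpenSearch allocation code.
public final class MovePrimaryFirstSettingSketch {
    public static final Setting<Boolean> MOVE_PRIMARY_FIRST_SETTING = Setting.boolSetting(
        "cluster.routing.allocation.move.primary_first", // key named in the merged commit message
        false,                                           // default: disabled, as discussed above
        Setting.Property.Dynamic,                        // can be flipped on a live cluster
        Setting.Property.NodeScope
    );
}

Registering it as Dynamic would let operators experiment on a running cluster before any default change is considered.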
// Maps the primary flag to an index into the shards array: primary -> 0, replica -> 1.
private static Map<Boolean, Integer> map = new HashMap<Boolean, Integer>() {
    {
        put(true, 0);
        put(false, 1);
    }
};
If the only use case is indexing into the shards array (0 or 1), would an enum be better?
Maybe I am missing something here, but an enum would define a new type, which is different from the boolean returned by shardRouting.primary()?
You are right, an enum won't work here. We can leave it as is.
Maybe
org.opensearch.common.collect.Map.of(true, 0, false, 1)
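For context, a hedged sketch of how that immutable map could drive the two-bucket lookup discussed above (using java.util.Map.of here for a self-contained example; the shards array layout and names are hypothetical, not the actual RoutingNode fields):

import java.util.Arrays;
import java.util.Map;

class PrimaryIndexLookupSketch {
    // Immutable lookup: primary -> index 0, replica -> index 1.
    private static final Map<Boolean, Integer> PRIMARY_INDEX = Map.of(true, 0, false, 1);

    public static void main(String[] args) {
        // Hypothetical two-bucket layout: index 0 holds primaries, index 1 holds replicas.
        String[][] shards = {
            { "shard1[p]", "shard2[p]" },
            { "shard1[r]", "shard2[r]" }
        };
        boolean primary = true; // e.g. the value of shardRouting.primary()
        System.out.println(Arrays.toString(shards[PRIMARY_INDEX.get(primary)]));
        // Prints: [shard1[p], shard2[p]]
    }
}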
Signed-off-by: Ankit Jain <jain.ankitk@gmail.com>
✅ Gradle Wrapper Validation success fc281a8
Signed-off-by: Ankit Jain <jain.ankitk@gmail.com>
Nice! Thx for putting this behind a cluster setting! Tests look good as well. LGTM!
Merged to main. I updated the commit message to be more descriptive of the problem this change addresses. Can we get a separate PR to add some more thorough documentation of this new cluster setting and its behavior (including performance as a function of scale)?
Just a couple minor comments, otherwise looks good to me.
When some node or set of nodes is excluded (based on some cluster setting), BalancedShardsAllocator iterates over them in breadth-first order, picking 1 shard from each node and repeating the process until all shards are balanced. Since shards from each node are picked randomly, it's possible that the p and r of shard1 are relocated first, leaving behind both p and r of shard2. If the excluded nodes were to go down, the cluster becomes red.

This commit introduces a new setting "cluster.routing.allocation.move.primary_first" that prioritizes the p of both shard1 and shard2 first, so the cluster does not become red if the excluded nodes were to go down before relocating other shards.

Note that with this setting enabled, performance of this change is a direct function of the number of indices, shards, replicas, and nodes. The larger the indices, replicas, and distribution scale, the slower the allocation becomes. This should be used with care.

Signed-off-by: Ankit Jain <jain.ankitk@gmail.com>
(cherry picked from commit 6eb8f6f)
Co-authored-by: Ankit Jain <jain.ankitk@gmail.com>
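For completeness, a hedged sketch of enabling the new setting from Java client code. ClusterUpdateSettingsRequest and Settings are real OpenSearch types, but the surrounding wiring (how the request is executed) is omitted and the helper class is made up:

import org.opensearch.action.admin.cluster.settings.ClusterUpdateSettingsRequest;
import org.opensearch.common.settings.Settings;

class EnablePrimaryFirstSketch {
    // Builds a transient (non-persistent) settings update turning primary-first movement on.
    static ClusterUpdateSettingsRequest enablePrimaryFirst() {
        return new ClusterUpdateSettingsRequest().transientSettings(
            Settings.builder()
                .put("cluster.routing.allocation.move.primary_first", true) // default is false
                .build()
        );
    }
}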
…relocation (#8875) (#9153)

When some node or set of nodes is excluded, the shards are moved away in random order. When segment replication is enabled for a cluster, we might end up in a mixed-version state where replicas are on a lower version, unable to read segments sent from higher-version primaries, and fail. To avoid this, we could prioritize replica shard movement so we never enter that situation.

This adds a new shard movement strategy setting, `SHARD_MOVEMENT_STRATEGY_SETTING`, that allows us to specify the order in which we want to move our shards: `NO_PREFERENCE` (default), `PRIMARY_FIRST` or `REPLICA_FIRST`. The `PRIMARY_FIRST` option performs the same behavior as the previous setting `SHARD_MOVE_PRIMARY_FIRST_SETTING`, which is now deprecated in favor of the shard movement strategy setting.

Expected behavior: if `SHARD_MOVEMENT_STRATEGY_SETTING` is changed from its default to either `PRIMARY_FIRST` or `REPLICA_FIRST`, we perform that behavior whether or not `SHARD_MOVE_PRIMARY_FIRST_SETTING` is enabled. If `SHARD_MOVEMENT_STRATEGY_SETTING` is still at its default of `NO_PREFERENCE` and `SHARD_MOVE_PRIMARY_FIRST_SETTING` is enabled, we move the primary shards first. This ensures that users still using the old setting will not see any change in behavior.

Reference: #1445
Parent issue: #3881

Signed-off-by: Poojita Raj <poojiraj@amazon.com>
(cherry picked from commit c6e4bcd)
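A hedged sketch of what an enum-backed version of that setting could look like. Only the constant name and the three values come from the commit message above; the key string, enum, and class here are assumptions for illustration:

import java.util.Locale;

import org.opensearch.common.settings.Setting;

public final class ShardMovementStrategySketch {

    // Values taken from the commit message above.
    public enum ShardMovementStrategy { NO_PREFERENCE, PRIMARY_FIRST, REPLICA_FIRST }

    // The key string is assumed for illustration, not confirmed by this thread.
    public static final Setting<ShardMovementStrategy> SHARD_MOVEMENT_STRATEGY_SETTING = new Setting<>(
        "cluster.routing.allocation.shard_movement_strategy",
        ShardMovementStrategy.NO_PREFERENCE.toString(),
        value -> ShardMovementStrategy.valueOf(value.toUpperCase(Locale.ROOT)),
        Setting.Property.Dynamic,
        Setting.Property.NodeScope
    );
}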
Description
The primary shards are always picked up first from a node for shard movement. This is achieved by bucketing the shards into primaries and replicas and iterating over the primaries first.
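A minimal sketch of that bucketing, assuming a generic shard type and a primary predicate (illustrative only, not the actual RoutingNode implementation):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

final class PrimaryFirstIterationSketch {
    // Splits shards into two buckets and iterates the primary bucket before the replica bucket.
    static <T> Iterator<T> primariesFirst(Iterable<T> shards, Predicate<T> isPrimary) {
        List<T> primaries = new ArrayList<>();
        List<T> replicas = new ArrayList<>();
        for (T shard : shards) {
            (isPrimary.test(shard) ? primaries : replicas).add(shard);
        }
        List<T> ordered = new ArrayList<>(primaries);
        ordered.addAll(replicas); // every primary comes before any replica
        return ordered.iterator();
    }
}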
Issues Resolved
#1349
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.