
[META] Making all copies of shards spread evenly across all Awareness Attribute #3367

Closed · gbbafna opened this issue on May 18, 2022 · 12 comments

Labels: discuss, enhancement, v2.2.0

gbbafna (Collaborator) commented on May 18, 2022

Is your feature request related to a problem? Please describe.

In cloud HA deployments, customers usually deploy across multiple zones, and zone is usually the awareness attribute there. However, there is no enforcement that all copies of a shard are spread evenly across all zones. This can cause uneven distribution of shards and create shard hotspots. A failure in a single zone might also cause data loss and unavailability for a shard if its copies aren't evenly spread out.

Describe the solution you'd like

There are two possible solutions:

  1. [Chosen approach] A boolean cluster-level setting routing.allocation.awareness.balance, which is false by default. When true, we would validate that the total number of copies is always a multiple of the awareness attribute value count; if not, we throw a validation exception. If there are multiple awareness attributes, the balance needs to hold for every value of each awareness attribute. For example, if there are 2 awareness attributes, zone and rack id, each having 2 possible values, the total number of copies needs to be a multiple of 2.
  2. A boolean cluster-level setting auto_balance_across_awareness_attribute. If this is true, we would increase the total number of copies to be a multiple of the AZ count. For instance, if there are 3 AZs and an index creation request comes with 7 replicas, OpenSearch will create 8 replicas, to ensure that there are 9 copies in total.

Both solutions take effect only when cluster.routing.allocation.awareness.attributes and cluster.routing.allocation.awareness.force.zone.values are set; otherwise the setting has no effect. A sketch of the relevant settings is shown below.
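As a rough sketch, enabling the first approach could look like the following. The two awareness settings already exist in OpenSearch today; the balance flag is the setting name proposed above and may change during implementation:

```
# the two awareness settings exist today; the balance flag is the proposed setting
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone",
    "cluster.routing.allocation.awareness.force.zone.values": "zone-1,zone-2,zone-3",
    "cluster.routing.allocation.awareness.balance": "true"
  }
}
```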

Trade-offs

First approach: Plugins like ISM and CCR need to do proactive validation while creating and updating policies; if not, actions/replication will fail silently at a later point in time. As and when new policies or index creation paths are added, we will need to keep adding the validation there for a good experience.

Second approach: Since the replica count is adjusted by OpenSearch, plugins and new index creation/modification paths don't need any handling, which makes this very low maintenance. However, deviating from the API-supplied parameter may not be a good user experience.

User Experience

  1. User sets cluster.routing.allocation.awareness.attributes and cluster.routing.allocation.awareness.force.zone.values.
  2. If the user enables routing.allocation.awareness.balance, the total number of copies needs to be a multiple of the number of possible values of the awareness attribute. If not, we will do one of the following (illustrated below):
  • Reject the create/update index request.
  • Auto-expand the replica count as needed.
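For example, with the 3-zone configuration sketched earlier, the following request (the index name is illustrative) yields 2 total copies (1 primary + 1 replica), which is not a multiple of 3, so under the first option it would be rejected with a validation exception:

```
# with 3 zones: 1 primary + 1 replica = 2 copies, not a multiple of 3
PUT my-index
{
  "settings": {
    "index.number_of_replicas": 1
  }
}
```

Under the second option, OpenSearch would instead bump the replica count to 2, giving 3 copies in total.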

Why it should be built

This is to ensure that the OpenSearch cluster remains well balanced as well as resilient to failures of a zone/rack, etc.

What will it take to execute?

Changes in OpenSearch as well as in plugins to honor the new flag.

gbbafna added the enhancement and untriaged labels on May 18, 2022
Bukhtawar added the discuss label and removed the untriaged label on May 18, 2022
gbbafna (Collaborator, Author) commented on May 23, 2022

Requesting feedback from the community, and also pinging @shwetathareja @dblock @reta @nknize for their thoughts on the above.

gbbafna changed the title from "[RFC] Making all copies of shards spread evenly across all Awareness Attribute" to "[PROPOSAL] Making all copies of shards spread evenly across all Awareness Attribute" on May 23, 2022
Bukhtawar (Collaborator) commented
Thanks @gbbafna for opening this issue. Some initial thoughts:

  1. I see value in doing both, starting with a validation exception after enabling the setting enforce_awareness_attribute_balance (suggested name).
  2. The auto_balance seems very similar to auto_expand_replicas: 0-all, and looks like a reasonable extension to 1.
  3. The behaviour of both 1 and 2 at the time of index creation with a non-compliant replica count could be a validation exception, so as to not acknowledge the response with conflicting settings. Is the idea not to throw a validation exception in case of 2?
  4. We should ensure all API responses (cat indices/cat shards, etc.) are consistent with the settings, reflecting the latest index metadata.

reta (Collaborator) commented on May 23, 2022

Thanks @gbbafna, I clearly see the benefits, but there are some concerns as well. Primarily, the basic premise I have seen in many deployments is that clusters are quite dynamic by nature. With that:

  1. The auto_balance_xxx (2nd approach) should probably disable manual replica management. In this regard I second @Bukhtawar's concern related to auto_expand_replicas. It seems to me the number of replicas should be dictated by cluster topology + awareness attributes (with caveats to preserve HA).
  2. A cluster may not only grow (new nodes/zones/racks added) but also shrink on purpose (decommissioning nodes/racks/zones); this plays well with the 1st point (disable manual replica management). The behavior should be aligned with the delayed allocation rules [1] to trigger the process.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/7.10/delayed-allocation.html

gbbafna (Collaborator, Author) commented on May 24, 2022

Thanks @Bukhtawar, @reta for your feedback.

> Thanks @gbbafna for opening this issue. Some initial thoughts:

> 1. I see value in doing both, starting with a validation exception after enabling the setting `enforce_awareness_attribute_balance` (suggested name).

Makes sense. We can break this into two parts and start with enforce_awareness_attribute_balance first.

> 2. The auto_balance seems very similar to `auto_expand_replicas`: 0-all, and looks like a reasonable extension to 1.

Agreed. For search use cases, the replica count can be a multiple of the number of zone attribute values, so as to scale for reads. We can have auto_balance_xxx as well as a count_per_awareness setting at the index level for the same (sketched below).
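As a purely hypothetical sketch of that index-level knob (neither the setting name nor its exact scope is decided yet), a per-zone count of 2 across 3 zones would yield 6 copies in total:

```
# hypothetical index-level setting; name and semantics are not final
PUT my-index/_settings
{
  "index.routing.allocation.awareness.count_per_awareness": 2
}
```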

> 3. The behaviour of both 1 and 2 at the time of index creation with a non-compliant replica count could be a validation exception, so as to not acknowledge the response with conflicting settings. Is the idea not to throw a validation exception in case of 2?

In case of 2, the replica count will not even be an accepted parameter at the time of index creation.

> 4. We should ensure all API responses (cat indices/cat shards, etc.) are consistent with the settings, reflecting the latest index metadata.

Agreed.

> Thanks @gbbafna, I clearly see the benefits, but there are some concerns as well. Primarily, the basic premise I have seen in many deployments is that clusters are quite dynamic by nature. With that:

> 1. The `auto_balance_xxx` (2nd approach) should probably disable manual replica management. In this regard I second @Bukhtawar's concern related to `auto_expand_replicas`. It seems to me the number of replicas should be dictated by cluster topology + awareness attributes (with caveats to preserve HA).

Yes.

> 2. A cluster may not only grow (new nodes/zones/racks added) but also shrink on purpose (decommissioning nodes/racks/zones); this plays well with the 1st point (disable manual replica management). The behavior should be aligned with the delayed allocation rules [1] to trigger the process.

> [1] https://www.elastic.co/guide/en/elasticsearch/reference/7.10/delayed-allocation.html

Taking full control of replica count management makes a lot of sense for ease of management. We can have auto_balance_xxx as well as count_per_awareness at the index level to completely disable manual replica management.

Summarizing the above points, I propose breaking the feature into two parts:

  1. Start with the enforce_awareness_attribute_balance setting: I can get started on the LLD for this.
  2. auto_balance: This will expand the replicas as per the current awareness attributes, more along the lines of auto_expand_replicas. It will also have a count_per_awareness index-level setting. This can be a follow-up item.

reta (Collaborator) commented on May 24, 2022

Thanks, @gbbafna, just curious how enforce_awareness_attribute_balance would impact changes in awareness attributes? For example, the cluster + indices were created in one zone initially (with rack awareness only), but later scaled across multiple zones (with another attribute for zone awareness added). What will happen to the existing indices in this case?

gbbafna (Collaborator, Author) commented on May 25, 2022

> Thanks, @gbbafna, just curious how enforce_awareness_attribute_balance would impact changes in awareness attributes? For example, the cluster + indices were created in one zone initially (with rack awareness only), but later scaled across multiple zones (with another attribute for zone awareness added). What will happen to the existing indices in this case?

The validation setting enforce_awareness_attribute_balance is enforced only at the time of index creation and update. So for the above-mentioned case, the existing indices would remain as they are; only update calls to those indices will have the awareness checks. It is not the ideal user experience, but it is better than the existing behavior. We are putting the responsibility on the user to modify the replica count of existing indices while scaling up/down; hence, this behavior is kept as the status quo.

The second part of the feature, auto_balance, would take care of this and is completely hands-free. But it is a much bigger change, which we can take up as a follow-up item. In auto_balance, we will also need the validations done in the first part. So enforce_awareness_attribute_balance is a precursor to auto_balance.

elfisher commented on May 26, 2022

A few questions:

> the awareness.attributes

What else is this used for?

> A boolean cluster level setting balanced_across_awareness_attribute which is false by default. When true, we would validate that total copies is always a multiple of the awareness attribute value count. If not, we will throw a validation exception...

Is this on indexing, rebalancing, some other operation or multiple?

> A boolean cluster level setting auto_balance_across_awareness_attribute. If this is true, we would increase the total copies to be a multiple of the AZ count. For instance, there are 3 AZs and an index creation request comes with 7 replicas. OpenSearch will create 8 replicas, to ensure that there are 9 copies in total.

How is the AZ count specified?

> User Experience
>
> User sets cluster.routing.allocation.awareness.attributes and cluster.routing.allocation.awareness.force.zone.values
> If user enables balanced_across_awareness_attribute, the total copy count needs to be a multiple of all possible values of the awareness attribute. If not, we will do one of the following:
> Reject the create/update index
> Auto expand the replica count as needed

Are these live-reloaded settings, or do they require a cluster restart? Any new REST APIs?

gbbafna (Collaborator, Author) commented on May 27, 2022

> A few questions:
>
> the awareness.attributes
>
> What else is this used for?

This is used to distribute the shards across the AZs/racks. If there are 2 copies of a shard and 2 zones, each zone will have 1 copy of the shard.

> A boolean cluster level setting balanced_across_awareness_attribute which is false by default. When true, we would validate that total copies is always a multiple of the awareness attribute value count. If not, we will throw a validation exception...
>
> Is this on indexing, rebalancing, some other operation or multiple?

This applies to operations which create/modify an index.

> A boolean cluster level setting auto_balance_across_awareness_attribute. If this is true, we would increase the total copies to be a multiple of the AZ count. For instance, there are 3 AZs and an index creation request comes with 7 replicas. OpenSearch will create 8 replicas, to ensure that there are 9 copies in total.
>
> How is the AZ count specified?

Assuming cluster.routing.allocation.awareness.attributes specifies zone as the attribute, the AZ count is specified via cluster.routing.allocation.awareness.force.zone.values.

The settings below illustrate the same for two awareness attributes:

```
cluster.routing.allocation.awareness.attributes: ["zone", "rack"]
cluster.routing.allocation.awareness.force.zone.values: zone-1, zone-2
cluster.routing.allocation.awareness.force.rack.values: rack-1, rack-2
```

> User Experience
>
> User sets cluster.routing.allocation.awareness.attributes and cluster.routing.allocation.awareness.force.zone.values
> If user enables balanced_across_awareness_attribute, the total copy count needs to be a multiple of all possible values of the awareness attribute. If not, we will do one of the following:
> Reject the create/update index
> Auto expand the replica count as needed

> Are these live-reloaded settings, or do they require a cluster restart? Any new REST APIs?

These are dynamic cluster-level settings. The existing APIs to update settings will be reused to modify them, as shown below.
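For instance, toggling the flag at runtime would go through the existing cluster settings API; the balance setting name here is the one proposed in this issue and may change:

```
# "balance" is the setting name proposed in this issue; it may change during implementation
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.awareness.balance": "false"
  }
}
```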

gbbafna changed the title from "[PROPOSAL] Making all copies of shards spread evenly across all Awareness Attribute" to "[META] Making all copies of shards spread evenly across all Awareness Attribute" on May 27, 2022
elfisher commented

@gbbafna is this tracking 2.2? Can we add two labels to this? 1/ "roadmap" to highlight this improvement on the project roadmap; 2/ the version of OpenSearch this is targeting? Thanks!

gbbafna (Collaborator, Author) commented on Jul 21, 2022

Yes, we are tracking 2.2 for this. @Bukhtawar, can you please help with the same, as I don't have permissions?

elfisher commented

Thanks! I see it now. Can we also open an issue in the docs repo to track any documentation updates that might need to happen for this?

kartg (Member) commented on Aug 3, 2022

@gbbafna can this issue be closed? I see #3461 which tracks the first solution here, with #3462 as the PR to main and #4086 as the backport to 2.x
