
[META] Making all copies of shards spread evenly across all Awareness Attribute #3367

Closed · gbbafna opened this issue on May 18, 2022 · 12 comments

Labels: discuss, enhancement, v2.2.0

gbbafna (Collaborator) commented on May 18, 2022

Is your feature request related to a problem? Please describe.

In cloud HA deployments, customers usually deploy across multiple zones, and zone is usually the awareness attribute there. However, there is no enforcement that all copies of a shard are spread evenly across all zones. This can cause uneven distribution of shards and create shard hotspots. A failure in a single zone might also cause data loss and unavailability for a shard if its copies aren't evenly spread out.

Describe the solution you'd like

There are two possible solutions:

  1. [Chosen approach] A boolean cluster-level setting routing.allocation.awareness.balance, which is false by default. When true, we would validate that the total number of copies is always a multiple of the awareness attribute value count; if not, we throw a validation exception. If there are multiple awareness attributes, the balance needs to hold for every value of each awareness attribute. For example, if there are 2 awareness attributes, zone and rack id, each having 2 possible values, the total number of copies needs to be a multiple of 2.
  2. A boolean cluster-level setting auto_balance_across_awareness_attribute. If this is true, we would increase the total number of copies to be a multiple of the AZ count. For instance, if there are 3 AZs and an index creation request comes with 7 replicas, OpenSearch will create 8 replicas, to ensure that there are 9 copies in total.

Both solutions take effect only when cluster.routing.allocation.awareness.attributes and cluster.routing.allocation.awareness.force.zone.values are set; otherwise the setting has no effect. A sketch of the relevant settings is shown below.
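As a rough sketch, enabling the first approach could look like the following. The two awareness settings already exist in OpenSearch today; the balance flag is the setting name proposed above and may change during implementation:

```
# the two awareness settings exist today; the balance flag is the proposed setting
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone",
    "cluster.routing.allocation.awareness.force.zone.values": "zone-1,zone-2,zone-3",
    "cluster.routing.allocation.awareness.balance": "true"
  }
}
```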

Trade-offs

First approach: Plugins like ISM and CCR need to do proactive validation while creating and updating policies; if not, actions/replication will fail silently at a later point in time. As and when new policies or index creation paths are added, we will need to keep adding the validation there for a good experience.

Second approach: Since the replica count is adjusted by OpenSearch, plugins and new index creation/modification paths don't need any handling, which makes this very low maintenance. However, deviating from the API-supplied parameter may not be a good user experience.

User Experience

  1. User sets cluster.routing.allocation.awareness.attributes and cluster.routing.allocation.awareness.force.zone.values.
  2. If the user enables routing.allocation.awareness.balance, the total number of copies needs to be a multiple of the number of possible values of the awareness attribute. If not, we will do one of the following (illustrated below):
  • Reject the create/update index request.
  • Auto-expand the replica count as needed.
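For example, with the 3-zone configuration sketched earlier, the following request (the index name is illustrative) yields 2 total copies (1 primary + 1 replica), which is not a multiple of 3, so under the first option it would be rejected with a validation exception:

```
# with 3 zones: 1 primary + 1 replica = 2 copies, not a multiple of 3
PUT my-index
{
  "settings": {
    "index.number_of_replicas": 1
  }
}
```

Under the second option, OpenSearch would instead bump the replica count to 2, giving 3 copies in total.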

Why it should be built

This is to ensure that the OpenSearch cluster remains well balanced as well as resilient to failures of a zone/rack, etc.

What will it take to execute?

Changes in OpenSearch as well as in plugins to honor the new flag.

gbbafna added the enhancement and untriaged labels on May 18, 2022
Bukhtawar added the discuss label and removed the untriaged label on May 18, 2022
gbbafna (Collaborator, Author) commented on May 23, 2022

Requesting feedback from the community, and also pinging @shwetathareja @dblock @reta @nknize for their thoughts on the above.

gbbafna changed the title from "[RFC] Making all copies of shards spread evenly across all Awareness Attribute" to "[PROPOSAL] Making all copies of shards spread evenly across all Awareness Attribute" on May 23, 2022
Bukhtawar (Collaborator) commented
Thanks @gbbafna for opening this issue. Some initial thoughts:

  1. I see value in doing both, starting with a validation exception after enabling the setting enforce_awareness_attribute_balance (suggested name).
  2. The auto_balance seems very similar to auto_expand_replicas: 0-all, and looks like a reasonable extension to 1.
  3. The behaviour of both 1 and 2 at the time of index creation with a non-compliant replica count could be a validation exception, so as to not acknowledge the response with conflicting settings. Is the idea not to throw a validation exception in case of 2?
  4. We should ensure all API responses (cat indices/cat shards, etc.) are consistent with the settings, reflecting the latest index metadata.

reta (Collaborator) commented on May 23, 2022

Thanks @gbbafna, I clearly see the benefits, but there are some concerns as well. Primarily, the basic premise I have seen in many deployments is that clusters are quite dynamic by nature. With that:

  1. The auto_balance_xxx (2nd approach) should probably disable manual replica management. In this regard I second @Bukhtawar's concern related to auto_expand_replicas. It seems to me the number of replicas should be dictated by cluster topology + awareness attributes (with caveats to preserve HA).
  2. A cluster may not only grow (new nodes/zones/racks added) but also shrink on purpose (decommissioning nodes/racks/zones); this plays well with the 1st point (disable manual replica management). The behavior should be aligned with the delayed allocation rules [1] to trigger the process.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/7.10/delayed-allocation.html

gbbafna (Collaborator, Author) commented on May 24, 2022

Thanks @Bukhtawar, @reta for your feedback.

> Thanks @gbbafna for opening this issue. Some initial thoughts:

> 1. I see value in doing both, starting with a validation exception after enabling the setting `enforce_awareness_attribute_balance` (suggested name).

Makes sense. We can break this into two parts and start with enforce_awareness_attribute_balance first.

> 2. The auto_balance seems very similar to `auto_expand_replicas`: 0-all, and looks like a reasonable extension to 1.

Agreed. For search use cases, the replica count can be a multiple of the number of zone attribute values, so as to scale for reads. We can have auto_balance_xxx as well as a count_per_awareness setting at the index level for the same (sketched below).
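As a purely hypothetical sketch of that index-level knob (neither the setting name nor its exact scope is decided yet), a per-zone count of 2 across 3 zones would yield 6 copies in total:

```
# hypothetical index-level setting; name and semantics are not final
PUT my-index/_settings
{
  "index.routing.allocation.awareness.count_per_awareness": 2
}
```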

> 3. The behaviour of both 1 and 2 at the time of index creation with a non-compliant replica count could be a validation exception, so as to not acknowledge the response with conflicting settings. Is the idea not to throw a validation exception in case of 2?

In case of 2, the replica count will not even be an accepted parameter at the time of index creation.

> 4. We should ensure all API responses (cat indices/cat shards, etc.) are consistent with the settings, reflecting the latest index metadata.

Agreed.

> Thanks @gbbafna, I clearly see the benefits, but there are some concerns as well. Primarily, the basic premise I have seen in many deployments is that clusters are quite dynamic by nature. With that:

> 1. The `auto_balance_xxx` (2nd approach) should probably disable manual replica management. In this regard I second @Bukhtawar's concern related to `auto_expand_replicas`. It seems to me the number of replicas should be dictated by cluster topology + awareness attributes (with caveats to preserve HA).

Yes.

> 2. A cluster may not only grow (new nodes/zones/racks added) but also shrink on purpose (decommissioning nodes/racks/zones); this plays well with the 1st point (disable manual replica management). The behavior should be aligned with the delayed allocation rules [1] to trigger the process.

> [1] https://www.elastic.co/guide/en/elasticsearch/reference/7.10/delayed-allocation.html

Taking full control of replica count management makes a lot of sense for ease of management. We can have auto_balance_xxx as well as count_per_awareness at the index level to completely disable manual replica management.

Summarizing the above points, I propose breaking the feature into two parts:

  1. Start with the enforce_awareness_attribute_balance setting: I can get started on the LLD for this.
  2. auto_balance: This will expand the replicas as per the current awareness attributes, more along the lines of auto_expand_replicas. It will also have a count_per_awareness index-level setting. This can be a follow-up item.

reta (Collaborator) commented on May 24, 2022

Thanks, @gbbafna, just curious how enforce_awareness_attribute_balance would impact changes in awareness attributes? For example, the cluster + indices were created in one zone initially (with rack awareness only), but later scaled across multiple zones (with another attribute for zone awareness added). What will happen to the existing indices in this case?

gbbafna (Collaborator, Author) commented on May 25, 2022

> Thanks, @gbbafna, just curious how enforce_awareness_attribute_balance would impact changes in awareness attributes? For example, the cluster + indices were created in one zone initially (with rack awareness only), but later scaled across multiple zones (with another attribute for zone awareness added). What will happen to the existing indices in this case?

The validation setting enforce_awareness_attribute_balance is enforced only at the time of index creation and update. So for the above-mentioned case, the existing indices would remain as they are; only update calls to those indices will have the awareness checks. It is not the ideal user experience, but it is better than the existing behavior. We are putting the responsibility on the user to modify the replica count of existing indices while scaling up/down; hence, this behavior is kept as the status quo.

The second part of the feature, auto_balance, would take care of this and is completely hands-free. But it is a much bigger change, which we can take up as a follow-up item. In auto_balance, we will also need the validations done in the first part. So enforce_awareness_attribute_balance is a precursor to auto_balance.

elfisher commented on May 26, 2022

A few questions:

> the awareness.attributes

What else is this used for?

> A boolean cluster level setting balanced_across_awareness_attribute which is false by default. When true, we would validate that total copies is always a multiple of the awareness attribute value count. If not, we will throw a validation exception...

Is this on indexing, rebalancing, some other operation or multiple?

> A boolean cluster level setting auto_balance_across_awareness_attribute. If this is true, we would increase the total copies to be a multiple of the AZ count. For instance, there are 3 AZs and an index creation request comes with 7 replicas. OpenSearch will create 8 replicas, to ensure that there are 9 copies in total.

How is the AZ count specified?

> User Experience
>
> User sets cluster.routing.allocation.awareness.attributes and cluster.routing.allocation.awareness.force.zone.values
> If user enables balanced_across_awareness_attribute, the total copy count needs to be a multiple of all possible values of the awareness attribute. If not, we will do one of the following:
> Reject the create/update index
> Auto expand the replica count as needed

Are these live-reloaded settings, or do they require a cluster restart? Any new REST APIs?

gbbafna (Collaborator, Author) commented on May 27, 2022

> A few questions:
>
> the awareness.attributes
>
> What else is this used for?

This is used to distribute the shards across the AZs/racks. If there are 2 copies of a shard and 2 zones, each zone will have 1 copy of the shard.

> A boolean cluster level setting balanced_across_awareness_attribute which is false by default. When true, we would validate that total copies is always a multiple of the awareness attribute value count. If not, we will throw a validation exception...
>
> Is this on indexing, rebalancing, some other operation or multiple?

This applies to operations which create/modify an index.

> A boolean cluster level setting auto_balance_across_awareness_attribute. If this is true, we would increase the total copies to be a multiple of the AZ count. For instance, there are 3 AZs and an index creation request comes with 7 replicas. OpenSearch will create 8 replicas, to ensure that there are 9 copies in total.
>
> How is the AZ count specified?

Assuming cluster.routing.allocation.awareness.attributes specifies zone as the attribute, the AZ count is specified via cluster.routing.allocation.awareness.force.zone.values.

The settings below illustrate the same for two awareness attributes:

```
cluster.routing.allocation.awareness.attributes: ["zone", "rack"]
cluster.routing.allocation.awareness.force.zone.values: zone-1, zone-2
cluster.routing.allocation.awareness.force.rack.values: rack-1, rack-2
```

> User Experience
>
> User sets cluster.routing.allocation.awareness.attributes and cluster.routing.allocation.awareness.force.zone.values
> If user enables balanced_across_awareness_attribute, the total copy count needs to be a multiple of all possible values of the awareness attribute. If not, we will do one of the following:
> Reject the create/update index
> Auto expand the replica count as needed

> Are these live-reloaded settings, or do they require a cluster restart? Any new REST APIs?

These are dynamic cluster-level settings. The existing APIs to update settings will be reused to modify them, as shown below.
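For instance, toggling the flag at runtime would go through the existing cluster settings API; the balance setting name here is the one proposed in this issue and may change:

```
# "balance" is the setting name proposed in this issue; it may change during implementation
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.awareness.balance": "false"
  }
}
```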

gbbafna changed the title from "[PROPOSAL] Making all copies of shards spread evenly across all Awareness Attribute" to "[META] Making all copies of shards spread evenly across all Awareness Attribute" on May 27, 2022
elfisher commented

@gbbafna is this tracking 2.2? Can we add two labels to this? 1/ "roadmap" to highlight this improvement on the project roadmap; 2/ the version of OpenSearch this is targeting? Thanks!

gbbafna (Collaborator, Author) commented on Jul 21, 2022

Yes, we are tracking 2.2 for this. @Bukhtawar, can you please help with the same, as I don't have permissions?

elfisher commented

Thanks! I see it now. Can we also open an issue in the docs repo to track any documentation updates that might need to happen for this?

kartg (Member) commented on Aug 3, 2022

@gbbafna can this issue be closed? I see #3461 which tracks the first solution here, with #3462 as the PR to main and #4086 as the backport to 2.x
