
[BUG] Exception raised even though number of shard copies is a multiple of awareness attributes #8205

Closed
IanHoang opened this issue Jun 21, 2023 · 9 comments
Labels
benchmarking · bug · distributed framework

Comments

@IanHoang

Describe the bug
We opened an issue in opensearch-py (opensearch-project/opensearch-py#411) but realized that the issue might be related to OpenSearch core instead.

OpenSearch-Benchmark (OSB) uses opensearch-py under the hood to perform CRUD operations on target clusters. Before users run a test, they can configure a datastore, often another OpenSearch cluster, to hold their metrics and results. Users can override the index settings for this datastore by specifying the following in a config:

# Example config fields to ensure that indices created have 9 primary shards and 1 set of replicas 
datastore.number_of_shards = 9
datastore.number_of_replicas = 1

Using the example above, there should be a total of 18 shards for each index in the datastore cluster. When we curl the datastore cluster, the indices have the correct primary and replica count set.

### Indices in single node 1AZ OpenSearch cluster
health status index                             uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   benchmark-metrics-2023-06         9JB1Ua3DRvW99DwokDyziA   9   1        280            0    779.2kb        779.2kb
yellow open   benchmark-results-2023-06         hhIrNV3KR4C7_kXS0_hkqg   9   1        184            0    152.3kb        152.3kb
yellow open   benchmark-test-executions-2023-06 0Kt0nFtJSTmHJ5llmjMIYA   9   1         27            0     57.6kb         57.6kb
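
For reference, a listing like the one above can be produced with the `_cat/indices` API (the endpoint here assumes a local cluster; adjust it for your datastore):

```shell
# Show health, primary/replica counts, and store size for the benchmark indices
curl -s "localhost:9200/_cat/indices/benchmark-*?v&h=health,status,index,pri,rep,docs.count,store.size"
```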

However, when we use the same config settings for another datastore cluster that spans 3 AZs and has default_number_of_replicas = 2, we encounter this error:

opensearchpy.exceptions.RequestError: RequestError(400, 'invalid_index_template_exception', 'index_template [benchmark-metrics] invalid, cause [Validation Failed: 1: expected total copies needs to be a multiple of total awareness attributes [3];]')

18 should work since it's a multiple of 3. The only way we found to get around this issue with the same datastore configuration is with datastore.number_of_replicas = 2. We've been using managed service datastore clusters. We're curious if default_number_of_replicas = 2 is the culprit.

To Reproduce
The option to edit the number of shards and replicas for the datastore is not officially out yet. However, it does exist on a feature branch in a forked repository. Let me know if you'd like to test it out and I can provide it.

Expected behavior
Since my cluster has 3 AZs, index creation should succeed: 18 total shard copies is a multiple of 3.

Plugins
None

Screenshots
None

Host/Environment (please complete the following information):
The host shouldn't matter here, since we're running against an external cluster; the client runs on my local machine (macOS, x86).

Additional context

@dblock
Member

dblock commented Jun 27, 2023

I think the next step should be to narrow this down to an API call/REST request that produces invalid_index_template_exception. There should be a way to reproduce it without benchmarks or managed systems being involved.

@Rishikesh1159 added the benchmarking and distributed framework labels Jun 27, 2023
@anasalkouz
Member

@imRishN, is this related to Zone Decommission? Any ideas?

@gbbafna
Collaborator

gbbafna commented Jun 28, 2023

Hi @anasalkouz , @IanHoang ,

This is related to balanced replica count : #3461

18 should work since it's a multiple of 3. The only way we found to get around this issue with the same datastore configuration is with datastore.number_of_replicas = 2. We've been using managed service datastore clusters. We're curious if default_number_of_replicas = 2 is the culprit.

You are accounting for the number of shards in the index as well. The validation only checks that the total number of copies of a given shard (primary plus replicas) is a multiple of the AZ count. With number_of_replicas = 1, each shard has 2 copies (1 primary + 1 replica), which is not a multiple of 3; hence you are getting the validation exception.
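
The check described above can be sketched in Python (a simplified model of the awareness-balance validation, not the actual OpenSearch implementation):

```python
def total_copies_valid(number_of_replicas: int, awareness_attributes: int) -> bool:
    """Each shard has 1 primary plus N replicas; the validation requires this
    per-shard copy count to be a multiple of the number of awareness
    attribute values (e.g. AZs). The index's shard count is irrelevant."""
    total_copies = 1 + number_of_replicas
    return total_copies % awareness_attributes == 0

# number_of_replicas = 1 on a 3-AZ cluster: 2 copies per shard -> rejected
print(total_copies_valid(1, 3))  # False
# number_of_replicas = 2: 3 copies per shard -> accepted
print(total_copies_valid(2, 3))  # True
```

Note that 9 primaries × 2 copies = 18 total shards is a red herring: the multiple-of-3 requirement applies per shard, not to the index total.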

@matthew-mcallister

This is trivial to reproduce:

  1. Create a new 3-AZ OpenSearch instance in AWS.
  2. Attempt to create a new index through the dashboard.

Creating the index will fail.

I was able to work around this by copying the configuration of an automatically generated index. Here are the settings I used:

{
  "index.auto_expand_replicas": "0-2",
  "index.number_of_replicas": "2",
  "index.number_of_shards": "1"
}
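
As a sketch, the workaround settings above can be applied at index-creation time (the index name and endpoint are placeholders):

```shell
curl -X PUT "localhost:9200/my-index?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.auto_expand_replicas": "0-2",
    "index.number_of_replicas": "2",
    "index.number_of_shards": "1"
  }
}
'
```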

@justjais

justjais commented Aug 10, 2023

I am also facing a similar failure when trying to restore from one domain to another in the same AWS region; both domains are on the latest 2.5 version.
I've tried the suggestion, but I still get the same error:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Validation Failed: 1: expected total copies needs to be a multiple of total awareness attributes [3];"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Validation Failed: 1: expected total copies needs to be a multiple of total awareness attributes [3];"
  },
  "status": 400
}

Current index settings are:

{
  "index.creation_date": "xxxxx",
  "index.knn": "true",
  "index.number_of_replicas": "2",
  "index.number_of_shards": "2",
  "index.provided_name": "test-data",
  "index.refresh_interval": "1s",
  "index.uuid": "xxxxx",
  "index.version.created": "xxxxx"
}

Also, I've tried with "index.number_of_replicas": "1" in the settings, but I get the same error.

@sapatel12

Faced the same issue. Was able to resolve by disabling Stand-By mode and then running the restore command.

@rsolano

rsolano commented Dec 13, 2023

Faced the same issue. Was able to resolve by disabling Stand-By mode and then running the restore command.

This did the trick for me as well on a newly created 3-AZ, 3-node OpenSearch domain. After turning off Standby, I was able to restore a snapshot from S3.

@dblock
Member

dblock commented Dec 13, 2023

@gbbafna thanks for digging this up, is there something we need/can fix/improve in OpenSearch (e.g. error message) or should this be closed?

@gbbafna
Collaborator

gbbafna commented Dec 14, 2023

@sapatel12 , @rsolano :

The snapshot restore API also accepts an index settings override, which can be used here:

curl -X POST "localhost:9200/_snapshot/my_repository/my_snapshot_1/_restore?pretty" -H 'Content-Type: application/json' -d'
{
  "index_settings": {
    "index.number_of_replicas": 2
  }
}
'

Hi @dblock,

The error message is descriptive in itself. I am going to close this one.

@gbbafna gbbafna closed this as completed Dec 14, 2023