
[BUG] Exception raised even though number of shard copies is a multiple of awareness attributes #8205

Closed
IanHoang opened this issue Jun 21, 2023 · 9 comments
Labels
benchmarking · bug · distributed framework

Comments

@IanHoang

Describe the bug
We opened an issue in opensearch-py (opensearch-project/opensearch-py#411) but realized that the issue might be related to OpenSearch core instead.

OpenSearch-Benchmark (OSB) uses opensearch-py under the hood to perform CRUD operations on target clusters. Before users run a test, they can configure a datastore, often another OpenSearch cluster, to hold their metrics and results. Users can override the index settings for this datastore by specifying the following in a config:

# Example config fields to ensure that indices created have 9 primary shards and 1 set of replicas 
datastore.number_of_shards = 9
datastore.number_of_replicas = 1

Using the example above, there should be a total of 18 shards for each index in the datastore cluster. When we curl the datastore cluster, the indices have the correct primary and replica count set.

### Indices in single node 1AZ OpenSearch cluster
health status index                             uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   benchmark-metrics-2023-06         9JB1Ua3DRvW99DwokDyziA   9   1        280            0    779.2kb        779.2kb
yellow open   benchmark-results-2023-06         hhIrNV3KR4C7_kXS0_hkqg   9   1        184            0    152.3kb        152.3kb
yellow open   benchmark-test-executions-2023-06 0Kt0nFtJSTmHJ5llmjMIYA   9   1         27            0     57.6kb         57.6kb
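
For reference, a listing like the one above can be produced with the `_cat/indices` API (the endpoint here assumes a local cluster; adjust it for your datastore):

```shell
# Show health, primary/replica counts, and store size for the benchmark indices
curl -s "localhost:9200/_cat/indices/benchmark-*?v&h=health,status,index,pri,rep,docs.count,store.size"
```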

However, when we use the same config settings for another datastore cluster that spans 3 AZs and has default_number_of_replicas = 2, we encounter this error:

opensearchpy.exceptions.RequestError: RequestError(400, 'invalid_index_template_exception', 'index_template [benchmark-metrics] invalid, cause [Validation Failed: 1: expected total copies needs to be a multiple of total awareness attributes [3];]')

18 should work since it's a multiple of 3. The only way we found to get around this issue with the same datastore configuration is with datastore.number_of_replicas = 2. We've been using managed service datastore clusters. We're curious if default_number_of_replicas = 2 is the culprit.

To Reproduce
The option to edit the number of shards and replicas for the datastore is not officially out yet. However, it does exist on a feature branch in a forked repository. Let me know if you'd like to test it out and I can provide it.

Expected behavior
Since my cluster has 3 AZs, index creation should succeed: 18 total shard copies is a multiple of 3.

Plugins
None

Screenshots
None

Host/Environment (please complete the following information):
The host shouldn't matter here, since we're running against an external cluster; the client runs on my local machine (macOS, x86).

Additional context

@dblock
Member

dblock commented Jun 27, 2023

I think the next step should be to narrow this down to an API call/REST request that produces invalid_index_template_exception. There should be a way to reproduce it without benchmarks or managed systems being involved.

@Rishikesh1159 added the benchmarking and distributed framework labels Jun 27, 2023
@anasalkouz
Member

@imRishN, is this related to Zone Decommission? Any ideas?

@gbbafna
Collaborator

gbbafna commented Jun 28, 2023

Hi @anasalkouz , @IanHoang ,

This is related to balanced replica count : #3461

18 should work since it's a multiple of 3. The only way we found to get around this issue with the same datastore configuration is with datastore.number_of_replicas = 2. We've been using managed service datastore clusters. We're curious if default_number_of_replicas = 2 is the culprit.

You are accounting for the number of shards in the index as well. The validation only checks that the total number of copies of a given shard (primary plus replicas) is a multiple of the AZ count. With number_of_replicas = 1, each shard has 2 copies (1 primary + 1 replica), which is not a multiple of 3; hence you are getting the validation exception.
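
The check described above can be sketched in Python (a simplified model of the awareness-balance validation, not the actual OpenSearch implementation):

```python
def total_copies_valid(number_of_replicas: int, awareness_attributes: int) -> bool:
    """Each shard has 1 primary plus N replicas; the validation requires this
    per-shard copy count to be a multiple of the number of awareness
    attribute values (e.g. AZs). The index's shard count is irrelevant."""
    total_copies = 1 + number_of_replicas
    return total_copies % awareness_attributes == 0

# number_of_replicas = 1 on a 3-AZ cluster: 2 copies per shard -> rejected
print(total_copies_valid(1, 3))  # False
# number_of_replicas = 2: 3 copies per shard -> accepted
print(total_copies_valid(2, 3))  # True
```

Note that 9 primaries × 2 copies = 18 total shards is a red herring: the multiple-of-3 requirement applies per shard, not to the index total.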

@matthew-mcallister

This is trivial to reproduce:

  1. Create a new 3-AZ OpenSearch instance in AWS.
  2. Attempt to create a new index through the dashboard.

Creating the index will fail.

I was able to work around this by copying the configuration of an automatically generated index. Here are the settings I used:

{
  "index.auto_expand_replicas": "0-2",
  "index.number_of_replicas": "2",
  "index.number_of_shards": "1"
}
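
As a sketch, the workaround settings above can be applied at index-creation time (the index name and endpoint are placeholders):

```shell
curl -X PUT "localhost:9200/my-index?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.auto_expand_replicas": "0-2",
    "index.number_of_replicas": "2",
    "index.number_of_shards": "1"
  }
}
'
```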

@justjais

justjais commented Aug 10, 2023

I am also facing a similar failure when trying to restore from one domain to another in the same AWS region; both domains are on the latest 2.5 version.
I've tried the suggestion, but I still get the same error:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Validation Failed: 1: expected total copies needs to be a multiple of total awareness attributes [3];"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Validation Failed: 1: expected total copies needs to be a multiple of total awareness attributes [3];"
  },
  "status": 400
}

Current index settings are:

{
  "index.creation_date": "xxxxx",
  "index.knn": "true",
  "index.number_of_replicas": "2",
  "index.number_of_shards": "2",
  "index.provided_name": "test-data",
  "index.refresh_interval": "1s",
  "index.uuid": "xxxxx",
  "index.version.created": "xxxxx"
}

Also, I've tried with "index.number_of_replicas": "1" in the settings, but I get the same error.

@sapatel12

Faced the same issue. Was able to resolve by disabling Stand-By mode and then running the restore command.

@rsolano

rsolano commented Dec 13, 2023

Faced the same issue. Was able to resolve by disabling Stand-By mode and then running the restore command.

This did the trick for me as well on a newly created 3-AZ, 3-node OpenSearch domain. After turning off Standby, I was able to restore a snapshot from S3.

@dblock
Member

dblock commented Dec 13, 2023

@gbbafna thanks for digging this up, is there something we need/can fix/improve in OpenSearch (e.g. error message) or should this be closed?

@gbbafna
Collaborator

gbbafna commented Dec 14, 2023

@sapatel12 , @rsolano :

The snapshot restore API also accepts an index settings override, which can be used here:

curl -X POST "localhost:9200/_snapshot/my_repository/my_snapshot_1/_restore?pretty" -H 'Content-Type: application/json' -d'
{
  "index_settings": {
    "index.number_of_replicas": 2
  }
}
'

Hi @dblock,

The error message is descriptive in itself. I am going to close this one.

@gbbafna gbbafna closed this as completed Dec 14, 2023