[BUG] Searchable snapshot restore - null_pointer_exception #12966

Closed
xodat2 opened this issue Mar 28, 2024 · 12 comments · Fixed by #13911

Comments


xodat2 commented Mar 28, 2024

Describe the bug

Hello,
Recently, restoring a snapshot as a searchable snapshot has not been working on our cluster.
I've tried to reproduce it on my local cluster, but it worked there...
When I change storage_type from remote_snapshot to local, the snapshot is restored without issue.
Can you please help with this?

POST /_snapshot/azure_data_repository/retail-legacy-ppe-000337-2024.03.27-06:44:53.545/_restore
{
  "indices": "*",
  "storage_type": "remote_snapshot",
  "ignore_unavailable": true,
  "include_global_state": false,
  "rename_pattern": "(.+)",
  "rename_replacement": "observability-searchable-sna2p-test2-$1",
  "include_aliases": false
}

Response:

{ "error": { "root_cause": [ { "type": "null_pointer_exception", "reason": "Cannot invoke \"org.opensearch.index.IndexService.getIndexSettings()\" because the return value of \"org.opensearch.indices.IndicesService.indexService(org.opensearch.core.index.Index)\" is null" } ], "type": "null_pointer_exception", "reason": "Cannot invoke \"org.opensearch.index.IndexService.getIndexSettings()\" because the return value of \"org.opensearch.indices.IndicesService.indexService(org.opensearch.core.index.Index)\" is null" }, "status": 500 }

Logs:
org.opensearch.transport.RemoteTransportException: [observability-opensearch-master-1][***][cluster:admin/snapshot/restore]
Caused by: java.lang.NullPointerException: Cannot invoke "org.opensearch.index.IndexService.getIndexSettings()" because the return value of "org.opensearch.indices.IndicesService.indexService(org.opensearch.core.index.Index)" is null
	at org.opensearch.snapshots.RestoreService$1.lambda$validateSearchableSnapshotRestorable$5(RestoreService.java:809) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.routing.RoutingTable.allShardsSatisfyingPredicate(RoutingTable.java:325) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.routing.RoutingTable.allShardsSatisfyingPredicate(RoutingTable.java:307) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.snapshots.RestoreService$1.validateSearchableSnapshotRestorable(RestoreService.java:813) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.snapshots.RestoreService$1.execute(RestoreService.java:591) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:67) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.service.MasterService.executeTasks(MasterService.java:882) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:434) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.service.MasterService.runTasks(MasterService.java:301) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.service.MasterService$Batcher.run(MasterService.java:212) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:204) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:242) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246) ~[opensearch-2.12.0.jar:2.12.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1583) [?:?]

Related component

Storage:Snapshots

To Reproduce

Restore a snapshot with "storage_type": "remote_snapshot".

Expected behavior

The snapshot is restored as a remote (searchable) snapshot.

Additional Details


Host/Environment (please complete the following information):

  • OS: k8s
  • Version 2.12.0


xodat2 added the bug and untriaged labels Mar 28, 2024
@peternied
Member

Thanks for writing up this bug; we should fix all exceptions like this.


xodat2 commented Apr 5, 2024

Thanks for writing up this bug; we should fix all exceptions like this.

Hello,
Do you have any clue how I can debug this issue? Are there debug/trace logs available for snapshots? I've tried to change the log level for the restore service logger, but it didn't emit any debug or trace logs.


xodat2 commented Apr 16, 2024

Hello,
I've discovered that when I stop our "search" nodes, it is possible to restore the searchable snapshot, and the shards are assigned once those nodes are started again.


ahmadbabaeimoghadam commented May 4, 2024

We are facing the same issue. Any solution or workaround other than stopping and starting search nodes?


xodat2 commented May 4, 2024

We are facing the same issue. Any solution or workaround other than stopping and starting search nodes?

Hello, were you able to reproduce this issue on a different cluster?
We didn't find any solution, but it's hard to debug it on a production cluster.

@ahmadbabaeimoghadam

Hello, were you able to reproduce this issue on a different cluster? We didn't find any solution, but it's hard to debug it on a production cluster.

Yes, we use the OpenSearch operator to provision a cluster on K8s. I have provisioned multiple v2.12 clusters with ‘search’ nodes and all had the same issue. Even with v2.13 the issue exists. I'm wondering how you are handling that on your prod cluster?


damslo commented May 15, 2024

Hello @ahmadbabaeimoghadam,
If you could provide as much detail as possible to reproduce the issue, I could help you with a solution. We made a fix in the code; however, it needs to be tested, and we can't do that because we were not able to reproduce the issue.


ahmadbabaeimoghadam commented May 16, 2024

Hello @ahmadbabaeimoghadam, if you could provide as much detail as possible to reproduce the issue, I could help you with a solution. We made a fix in the code; however, it needs to be tested, and we can't do that because we were not able to reproduce the issue.

Hi @damslo, I appreciate your help. Here are some details:

  • the cluster is deployed on a kubernetes cluster v1.26
  • it is deployed using the opensearch operator v2.5.1
  • here is the OpenSearchCluster manifest:
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: opensearch
  namespace: test
spec:
  bootstrap:
    resources: {}
  confMgmt:
    smartScaler: true
  dashboards:
    opensearchCredentialsSecret: {}
    replicas: 0
    resources: {}
    service:
      type: ClusterIP
    version: ''
  general:
    additionalConfig:
      indices.query.bool.max_clause_count: '32768'
    additionalVolumes:
      - name: aws-iam-token
        path: /usr/share/opensearch/config/aws-iam-token
    httpPort: 9200
    monitoring:
      enable: true
      pluginUrl: >-
        https://github.com/aiven/prometheus-exporter-plugin-for-opensearch/releases/download/2.12.0.0/prometheus-exporter-2.12.0.0.zip
      scrapeInterval: 30s
      tlsConfig:
        insecureSkipVerify: true
    pluginsList:
      - repository-s3
      - >-
        https://github.com/aiven/prometheus-exporter-plugin-for-opensearch/releases/download/2.12.0.0/prometheus-exporter-2.12.0.0.zip
    serviceAccount: opensearch-s3-access
    serviceName: opensearch
    vendor: opensearch
    version: 2.12.0
  initHelper:
    resources: {}
  nodePools:
    - additionalConfig:
        cluster.filecache.remote_data_ratio: '5'
        node.search.cache.size: 18gb
        s3.client.default.endpoint: s3.us-east-1.amazonaws.com
        s3.client.default.identity_token_file: /usr/share/opensearch/config/aws-iam-token/token
      component: master-data
      diskSize: 100Gi
      labels:
        name: test
      nodeSelector:
        karpenter.sh/capacity-type: on-demand
      pdb:
        enable: true
        maxUnavailable: 1
      replicas: 3
      resources:
        limits:
          cpu: 2
          memory: 16Gi
        requests:
          cpu: 2
          memory: 16Gi
      roles:
        - master
        - data
        - search
  security:
    config:
      adminCredentialsSecret:
        name: opensearch-admin-credentials
      adminSecret: {}
      securityConfigSecret:
        name: opensearch-securityconfig
    tls:
      http:
        caSecret: {}
        generate: true
        secret: {}
      transport:
        caSecret: {}
        generate: true
        secret: {}
  • I could restore an index successfully once before I set cluster.filecache.remote_data_ratio. I'm, however, not sure whether it was adding this config or restarting the cluster that broke it. (Update: I can confirm that removing this setting resolves the issue.)

Please let me know if you need more details.


LiorF-BDBQ commented May 20, 2024

We did some debugging and this is what I gathered from looking at the code and stack trace:

This is where the exception is thrown

private void validateSearchableSnapshotRestorable(long totalRestorableRemoteIndexesSize) {
    ClusterInfo clusterInfo = clusterInfoSupplier.get();
    double remoteDataToFileCacheRatio = DATA_TO_FILE_CACHE_SIZE_RATIO_SETTING.get(clusterService.getSettings());
    Map<String, FileCacheStats> nodeFileCacheStats = clusterInfo.getNodeFileCacheStats();
    if (nodeFileCacheStats.isEmpty() || remoteDataToFileCacheRatio <= 0.01f) {
        return;
    }
    long totalNodeFileCacheSize = clusterInfo.getNodeFileCacheStats()
        .values()
        .stream()
        .map(fileCacheStats -> fileCacheStats.getTotal().getBytes())
        .mapToLong(Long::longValue)
        .sum();
    Predicate<ShardRouting> isRemoteSnapshotShard = shardRouting -> shardRouting.primary()
        && indicesService.indexService(shardRouting.index()).getIndexSettings().isRemoteSnapshot();
    ShardsIterator shardsIterator = clusterService.state()
        .routingTable()
        .allShardsSatisfyingPredicate(isRemoteSnapshotShard);

In validateSearchableSnapshotRestorable there is a predicate evaluation in allShardsSatisfyingPredicate that finds all remote snapshot shards. During this loop, the call to indicesService.indexService(shardRouting.index()) returns null.

We deployed a version with some extra logging and printed out whenever that evaluation returned null; it was basically returning null for all indices.
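
A minimal sketch of that kind of logging (a reconstruction for illustration only, not the exact patch we deployed), placed where the predicate performs the lookup:

// Illustration only: log whenever IndicesService has no IndexService for the shard's index,
// which is exactly what the predicate in validateSearchableSnapshotRestorable trips over.
IndexService indexService = indicesService.indexService(shardRouting.index());
if (indexService == null) {
    logger.warn("no IndexService for index [{}] on node [{}]",
        shardRouting.index(), clusterService.localNode().getName());
}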

The reason the two workarounds work is this "if" block, which prevents the rest of the validateSearchableSnapshotRestorable function from running:

if (nodeFileCacheStats.isEmpty() || remoteDataToFileCacheRatio <= 0.01f) {
    return;
}

In the first workaround described, removing all search nodes from the cluster, nodeFileCacheStats is empty and the function returns early.

In the second workaround, removing cluster.filecache.remote_data_ratio or keeping it at its default of 0.0, the function returns early as well.

Either way, the existence of the config probably isn't the reason that indicesService.indexService(shardRouting.index()) would return null?

I tried creating a fresh cluster on docker compose with the same config but couldn't replicate the error.

@ahmadbabaeimoghadam can you confirm this happens on new clusters that are deployed with this manifest without any existing data?
Can you share the generated opensearch.yaml and env vars from the nodes? Maybe there'll be another hint there.

@andrross
Member

Thanks for all the details here. @LiorF-BDBQ it looks like you've found some potential leads about what is returning null. We would definitely welcome any contributions if you can further isolate the problem. A reproduction in an integration test can be super helpful as well.

Tagging @kotwanikunal and @bugmakerrrrrr as well, as I know they have worked in this area before.

@bugmakerrrrrr
Contributor

@andrross I have just gone over this issue. The logic in the method org.opensearch.cluster.ClusterStateUpdateTask#validateSearchableSnapshotRestorable checks whether the size of the remote indices to be restored exceeds the file cache bounds. If the user configures DATA_TO_FILE_CACHE_SIZE_RATIO_SETTING, this logic is executed.
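
As a rough worked example (my own numbers, taken from the manifest above, and assuming the check compares the restorable remote index size against ratio × total file cache): with node.search.cache.size: 18gb on 3 search-eligible nodes the total file cache is about 54gb, so a ratio of 5 would allow roughly 5 × 54gb = 270gb of remote index data before the restore is rejected.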

I think the reason why indicesService.indexService(shardRouting.index()) may return null is that IndicesService only keeps track of the indices that have been assigned to the current node, which in this scenario is the master node. We may need to use clusterService.state().metadata().index(index) to check the index settings instead. It should be noted that these are all my guesses; I will try to reproduce the problem and fix it later.
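
A minimal sketch of that suggestion (a rough illustration of the idea, not necessarily what the eventual fix does): look up the index settings from the cluster state metadata, which the cluster-manager node always has, instead of from the node-local IndicesService:

// Hypothetical sketch: decide "is this a remote snapshot shard?" from cluster state metadata
// rather than the node-local IndicesService, which only tracks indices assigned to this node.
Predicate<ShardRouting> isRemoteSnapshotShard = shardRouting -> {
    if (shardRouting.primary() == false) {
        return false;
    }
    IndexMetadata indexMetadata = clusterService.state().metadata().index(shardRouting.index());
    return indexMetadata != null
        && "remote_snapshot".equals(indexMetadata.getSettings().get(IndexModule.INDEX_STORE_TYPE_SETTING.getKey()));
};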

@bugmakerrrrrr
Contributor

Here is an integration test (IT) to reproduce this issue.

public void testRestoreNPE() throws Exception {
    final String snapshotName = "test-snap";
    final String repoName = "test-repo";
    final String indexName1 = "test-idx-1";
    final String indexName2 = "test-idx-2";
    final int numReplicasIndex1 = 1;
    final int numReplicasIndex2 = 1;

    // Start dedicated cluster-manager nodes and data nodes, then create two indices with docs.
    Settings clusterManagerNodeSettings = clusterManagerOnlyNode();
    internalCluster().startNodes(2, clusterManagerNodeSettings);
    Settings dataNodeSettings = dataNode();
    internalCluster().startNodes(2, dataNodeSettings);
    createIndexWithDocsAndEnsureGreen(numReplicasIndex1, 100, indexName1);
    createIndexWithDocsAndEnsureGreen(numReplicasIndex2, 100, indexName2);

    // Set a non-zero remote-data-to-file-cache ratio so the restore validation path runs.
    final Client client = client();
    assertAcked(
        client.admin()
            .cluster()
            .prepareUpdateSettings()
            .setTransientSettings(Settings.builder().put(DATA_TO_FILE_CACHE_SIZE_RATIO_SETTING.getKey(), 5))
    );

    createRepositoryWithSettings(null, repoName);
    takeSnapshot(client, snapshotName, repoName, indexName1, indexName2);

    // Add search nodes and restore the snapshot as a searchable (remote) snapshot.
    internalCluster().ensureAtLeastNumSearchNodes(Math.max(numReplicasIndex1, numReplicasIndex2) + 1);
    restoreSnapshotAndEnsureGreen(client, snapshotName, repoName);
}

@After
public void cleanup() throws Exception {
    // Clear the transient ratio setting after the test.
    assertAcked(
        client().admin()
            .cluster()
            .prepareUpdateSettings()
            .setTransientSettings(Settings.builder().putNull(DATA_TO_FILE_CACHE_SIZE_RATIO_SETTING.getKey()))
    );
}

I found that changing the DATA_TO_FILE_CACHE_SIZE_RATIO_SETTING setting does not take effect (this is another bug), so you need to manually change the default value of this setting and then run the IT above.
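
For illustration only, and purely a guess at what "manually change the default value" could look like (the real declaration lives in the OpenSearch source and its exact location, properties, and signature may differ), the idea is to temporarily bump the registered default so the validation path runs even though the dynamic update is ignored:

// Hypothetical, local-testing-only tweak: raise the shipped default (normally 0.0) so that
// validateSearchableSnapshotRestorable actually executes during the IT above.
public static final Setting<Double> DATA_TO_FILE_CACHE_SIZE_RATIO_SETTING = Setting.doubleSetting(
    "cluster.filecache.remote_data_ratio",
    5.0,   // default changed from 0.0 for the test run only
    0.0,
    Setting.Property.NodeScope,
    Setting.Property.Dynamic
);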
