[BUG] Searchable snapshot restore - null_pointer_exception #12966

Closed
xodat2 opened this issue Mar 28, 2024 · 12 comments · Fixed by #13911
Comments

@xodat2

xodat2 commented Mar 28, 2024

Describe the bug

Hello,
Recently, restoring a snapshot as searchable has not been working on our cluster.
I've tried to reproduce it on my local cluster, but it worked there...
When I change storage_type from remote_snapshot to local, the snapshot is restored without issue.
Can you please help with this?

POST /_snapshot/azure_data_repository/retail-legacy-ppe-000337-2024.03.27-06:44:53.545/_restore
{
  "indices": "*",
  "storage_type": "remote_snapshot",
  "ignore_unavailable": true,
  "include_global_state": false,
  "rename_pattern": "(.+)",
  "rename_replacement": "observability-searchable-sna2p-test2-$1",
  "include_aliases": false
}

Response:

{ "error": { "root_cause": [ { "type": "null_pointer_exception", "reason": "Cannot invoke \"org.opensearch.index.IndexService.getIndexSettings()\" because the return value of \"org.opensearch.indices.IndicesService.indexService(org.opensearch.core.index.Index)\" is null" } ], "type": "null_pointer_exception", "reason": "Cannot invoke \"org.opensearch.index.IndexService.getIndexSettings()\" because the return value of \"org.opensearch.indices.IndicesService.indexService(org.opensearch.core.index.Index)\" is null" }, "status": 500 }

Logs:
org.opensearch.transport.RemoteTransportException: [observability-opensearch-master-1][***][cluster:admin/snapshot/restore]
Caused by: java.lang.NullPointerException: Cannot invoke "org.opensearch.index.IndexService.getIndexSettings()" because the return value of "org.opensearch.indices.IndicesService.indexService(org.opensearch.core.index.Index)" is null
	at org.opensearch.snapshots.RestoreService$1.lambda$validateSearchableSnapshotRestorable$5(RestoreService.java:809) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.routing.RoutingTable.allShardsSatisfyingPredicate(RoutingTable.java:325) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.routing.RoutingTable.allShardsSatisfyingPredicate(RoutingTable.java:307) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.snapshots.RestoreService$1.validateSearchableSnapshotRestorable(RestoreService.java:813) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.snapshots.RestoreService$1.execute(RestoreService.java:591) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:67) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.service.MasterService.executeTasks(MasterService.java:882) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:434) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.service.MasterService.runTasks(MasterService.java:301) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.service.MasterService$Batcher.run(MasterService.java:212) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:204) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:242) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246) ~[opensearch-2.12.0.jar:2.12.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1583) [?:?]

Related component

Storage:Snapshots

To Reproduce

Restore a snapshot as remote storage (storage_type: remote_snapshot).

Expected behavior

Snapshot is restored as a remote (searchable) snapshot.

Additional Details


Host/Environment (please complete the following information):

  • OS: k8s
  • Version: 2.12.0


@peternied
Member

Thanks for writing up this bug; we should fix all exceptions like this.

@xodat2
Author

xodat2 commented Apr 5, 2024

Thanks for writing up this bug; we should fix all exceptions like this.

Hello,
Do you have any clue how I can debug this issue? Are there debug/trace logs available for snapshots? I've tried changing the log level for the restore service logger, but it didn't emit any debug or trace logs.

@xodat2
Author

xodat2 commented Apr 16, 2024

Hello,
I've discovered that when I stop our "search" nodes, it is possible to restore the searchable snapshot, and the shards are assigned once those nodes are started again.

@ahmadbabaeimoghadam

ahmadbabaeimoghadam commented May 4, 2024

We are facing the same issue. Any solution or workaround other than stopping and starting search nodes?

@xodat2
Author

xodat2 commented May 4, 2024

We are facing the same issue. Any solution or workaround other than stopping and starting search nodes?

Hello, were you able to reproduce this issue on a different cluster?
We haven't found any solution, but it's hard to debug it on a production cluster.

@ahmadbabaeimoghadam

Hello, were you able to reproduce this issue on a different cluster? We haven't found any solution, but it's hard to debug it on a production cluster.

Yes, we use the OpenSearch operator to provision a cluster on K8s. I have provisioned multiple v2.12 clusters with ‘search’ nodes and all had the same issue. Even with v2.13 the issue exists. Wondering how you are handling that on your prod cluster?

@damslo

damslo commented May 15, 2024

Hello @ahmadbabaeimoghadam,
If you could provide as much detail as possible to reproduce the issue, I could help you with a solution. We made a fix in the code, but it needs to be tested, and we can't do that because we were not able to reproduce the issue.

@ahmadbabaeimoghadam

ahmadbabaeimoghadam commented May 16, 2024

Hello @ahmadbabaeimoghadam, If you could provide as much detail as possible to reproduce the issue, I could help you with a solution. We made a fix in the code, but it needs to be tested, and we can't do that because we were not able to reproduce the issue.

Hi @damslo, I appreciate your help. Here are some details:

  • the cluster is deployed on a kubernetes cluster v1.26
  • it is deployed using the opensearch operator v2.5.1
  • here is the OpenSearchCluster manifest:
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: opensearch
  namespace: test
spec:
  bootstrap:
    resources: {}
  confMgmt:
    smartScaler: true
  dashboards:
    opensearchCredentialsSecret: {}
    replicas: 0
    resources: {}
    service:
      type: ClusterIP
    version: ''
  general:
    additionalConfig:
      indices.query.bool.max_clause_count: '32768'
    additionalVolumes:
      - name: aws-iam-token
        path: /usr/share/opensearch/config/aws-iam-token
    httpPort: 9200
    monitoring:
      enable: true
      pluginUrl: >-
        https://github.com/aiven/prometheus-exporter-plugin-for-opensearch/releases/download/2.12.0.0/prometheus-exporter-2.12.0.0.zip
      scrapeInterval: 30s
      tlsConfig:
        insecureSkipVerify: true
    pluginsList:
      - repository-s3
      - >-
        https://github.com/aiven/prometheus-exporter-plugin-for-opensearch/releases/download/2.12.0.0/prometheus-exporter-2.12.0.0.zip
    serviceAccount: opensearch-s3-access
    serviceName: opensearch
    vendor: opensearch
    version: 2.12.0
  initHelper:
    resources: {}
  nodePools:
    - additionalConfig:
        cluster.filecache.remote_data_ratio: '5'
        node.search.cache.size: 18gb
        s3.client.default.endpoint: s3.us-east-1.amazonaws.com
        s3.client.default.identity_token_file: /usr/share/opensearch/config/aws-iam-token/token
      component: master-data
      diskSize: 100Gi
      labels:
        name: test
      nodeSelector:
        karpenter.sh/capacity-type: on-demand
      pdb:
        enable: true
        maxUnavailable: 1
      replicas: 3
      resources:
        limits:
          cpu: 2
          memory: 16Gi
        requests:
          cpu: 2
          memory: 16Gi
      roles:
        - master
        - data
        - search
  security:
    config:
      adminCredentialsSecret:
        name: opensearch-admin-credentials
      adminSecret: {}
      securityConfigSecret:
        name: opensearch-securityconfig
    tls:
      http:
        caSecret: {}
        generate: true
        secret: {}
      transport:
        caSecret: {}
        generate: true
        secret: {}
  • I could restore an index successfully once before I set cluster.filecache.remote_data_ratio. I'm not sure, however, whether adding this config or restarting the cluster broke it. (Update: I can confirm that removing this setting works around the issue.)

Please let me know if you need more details.

@LiorF-BDBQ

LiorF-BDBQ commented May 20, 2024

We did some debugging and this is what I gathered from looking at the code and stack trace:

This is where the exception is thrown

private void validateSearchableSnapshotRestorable(long totalRestorableRemoteIndexesSize) {
    ClusterInfo clusterInfo = clusterInfoSupplier.get();
    double remoteDataToFileCacheRatio = DATA_TO_FILE_CACHE_SIZE_RATIO_SETTING.get(clusterService.getSettings());
    Map<String, FileCacheStats> nodeFileCacheStats = clusterInfo.getNodeFileCacheStats();
    if (nodeFileCacheStats.isEmpty() || remoteDataToFileCacheRatio <= 0.01f) {
        return;
    }
    long totalNodeFileCacheSize = clusterInfo.getNodeFileCacheStats()
        .values()
        .stream()
        .map(fileCacheStats -> fileCacheStats.getTotal().getBytes())
        .mapToLong(Long::longValue)
        .sum();
    Predicate<ShardRouting> isRemoteSnapshotShard = shardRouting -> shardRouting.primary()
        && indicesService.indexService(shardRouting.index()).getIndexSettings().isRemoteSnapshot();
    ShardsIterator shardsIterator = clusterService.state()
        .routingTable()
        .allShardsSatisfyingPredicate(isRemoteSnapshotShard);

In validateSearchableSnapshotRestorable there is a predicate evaluation in allShardsSatisfyingPredicate that finds all remote snapshot shards; during this loop, the call to indicesService.indexService(shardRouting.index()) returns null.

We deployed a version with some extra logging and printed a message whenever that evaluation returned null; it was basically returning null for all indices.
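For illustration, the instrumentation was roughly of the following shape (a sketch only, not the exact patch we deployed; the log message is an assumption):

// Sketch of the debugging instrumentation: guard the lookup and log instead of
// dereferencing a possibly-null IndexService.
Predicate<ShardRouting> isRemoteSnapshotShard = shardRouting -> {
    if (shardRouting.primary() == false) {
        return false;
    }
    IndexService indexService = indicesService.indexService(shardRouting.index());
    if (indexService == null) {
        logger.warn("indexService is null for index [{}] during restore validation", shardRouting.index());
        return false;
    }
    return indexService.getIndexSettings().isRemoteSnapshot();
};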

The reason the two workarounds work is this "if" block, which prevents the rest of the validateSearchableSnapshotRestorable function from running:

if (nodeFileCacheStats.isEmpty() || remoteDataToFileCacheRatio <= 0.01f) {
    return;
}

In the first workaround described (removing all search nodes from the cluster), nodeFileCacheStats would be empty and the function returns early.

In the second workaround (removing cluster.filecache.remote_data_ratio or keeping it at the default of 0.0), the function returns early as well.

Either way, the existence of the config probably isn't the reason that indicesService.indexService(shardRouting.index()) would return null?

I tried creating a fresh cluster on docker compose with the same config but couldn't replicate the error.

@ahmadbabaeimoghadam can you confirm this happens on new clusters that are deployed with this manifest without any existing data?
Can you share the generated opensearch.yaml and env vars from the nodes? Maybe there'll be another hint there.

@andrross
Member

Thanks for all the details here. @LiorF-BDBQ it looks like you've found some potential leads about what is returning null. We would definitely welcome any contributions if you can further isolate the problem. A reproduction in an integration test can be super helpful as well.

Tagging @kotwanikunal @bugmakerrrrrr as well as I know they have worked in this area before.

@bugmakerrrrrr
Contributor

@andrross I have just gone over this issue. The logic in the method org.opensearch.cluster.ClusterStateUpdateTask#validateSearchableSnapshotRestorable is used to check whether the size of the remote indices to be restored exceeds the file cache bounds. If the user configures DATA_TO_FILE_CACHE_SIZE_RATIO_SETTING, this logic is executed.

I think the reason why indicesService.indexService(shardRouting.index()) may return null is that IndicesService only keeps track of the indices that have shards assigned to the current node, which in this scenario is the master node. We may need to use clusterService.state().metadata().index(index) to check the index settings instead. It should be noted that these are all my guesses, and I will try to reproduce the problem and fix it later.
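For illustration, the direction suggested above would look roughly like the sketch below (a sketch of the idea only, not the actual change merged in #13911; the store-type check is an assumption):

// Sketch only: read the index settings from the cluster state metadata, which the
// cluster manager always has, instead of from IndicesService, which only knows
// about shards assigned to the local node.
Predicate<ShardRouting> isRemoteSnapshotShard = shardRouting -> {
    if (shardRouting.primary() == false) {
        return false;
    }
    IndexMetadata indexMetadata = clusterService.state().metadata().index(shardRouting.index());
    return indexMetadata != null
        && "remote_snapshot".equals(indexMetadata.getSettings().get("index.store.type"));
};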

@bugmakerrrrrr
Contributor

Here is the IT to reproduce this issue.

public void testRestoreNPE() throws Exception {
        final String snapshotName = "test-snap";
        final String repoName = "test-repo";
        final String indexName1 = "test-idx-1";
        final String indexName2 = "test-idx-2";
        final int numReplicasIndex1 = 1;
        final int numReplicasIndex2 = 1;

        Settings clusterManagerNodeSettings = clusterManagerOnlyNode();
        internalCluster().startNodes(2, clusterManagerNodeSettings);
        Settings dataNodeSettings = dataNode();
        internalCluster().startNodes(2, dataNodeSettings);
        createIndexWithDocsAndEnsureGreen(numReplicasIndex1, 100, indexName1);
        createIndexWithDocsAndEnsureGreen(numReplicasIndex2, 100, indexName2);

        final Client client = client();
        assertAcked(
            client.admin()
                .cluster()
                .prepareUpdateSettings()
                .setTransientSettings(Settings.builder().put(DATA_TO_FILE_CACHE_SIZE_RATIO_SETTING.getKey(), 5))
        );

        createRepositoryWithSettings(null, repoName);
        takeSnapshot(client, snapshotName, repoName, indexName1, indexName2);

        internalCluster().ensureAtLeastNumSearchNodes(Math.max(numReplicasIndex1, numReplicasIndex2) + 1);
        restoreSnapshotAndEnsureGreen(client, snapshotName, repoName);
    }

    @After
    public void cleanup() throws Exception {
        assertAcked(
            client().admin()
                .cluster()
                .prepareUpdateSettings()
                .setTransientSettings(Settings.builder().putNull(DATA_TO_FILE_CACHE_SIZE_RATIO_SETTING.getKey()))
        );
    }

I found that changing the DATA_TO_FILE_CACHE_SIZE_RATIO_SETTING setting does not take effect (this is another bug), so you need to manually change the default value of this setting and then run the IT above.
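For illustration only, the manual change meant here is to raise the default in the setting definition itself, roughly as below. The overload and properties shown are assumptions, not the real declaration; only the changed default value is the point.

// Assumed shape of the setting declaration; the default is temporarily raised from
// 0.0 to 5.0 for local testing only, so that the validation path actually runs.
public static final Setting<Double> DATA_TO_FILE_CACHE_SIZE_RATIO_SETTING = Setting.doubleSetting(
    "cluster.filecache.remote_data_ratio",
    5.0,
    0.0,
    Setting.Property.NodeScope,
    Setting.Property.Dynamic
);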
