-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Searchable snapshot restore - null_pointer_exception #12966
Comments
Thanks for writing up this bug, we should fix all exceptions like this |
Hello, |
Hello, |
We are facing the same issue. Any solution or workaround other than stopping and starting search nodes? |
Hello, were you able to reproduce this issue on different cluster? |
Yes, We use the opensearch operator to provision a cluster on K8s. I have provisioned multiple v2.12 clusters with ‘search’ nodes and all had the same issue. Even with v2.13 the issue exists. Wondering how are you handling that on your prod cluster? |
Hello @ahmadbabaeimoghadam, |
Hi @damslo, I appreciate your help. Here are some details:
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
name: opensearch
namespace: test
spec:
bootstrap:
resources: {}
confMgmt:
smartScaler: true
dashboards:
opensearchCredentialsSecret: {}
replicas: 0
resources: {}
service:
type: ClusterIP
version: ''
general:
additionalConfig:
indices.query.bool.max_clause_count: '32768'
additionalVolumes:
- name: aws-iam-token
path: /usr/share/opensearch/config/aws-iam-token
httpPort: 9200
monitoring:
enable: true
pluginUrl: >-
https://github.com/aiven/prometheus-exporter-plugin-for-opensearch/releases/download/2.12.0.0/prometheus-exporter-2.12.0.0.zip
scrapeInterval: 30s
tlsConfig:
insecureSkipVerify: true
pluginsList:
- repository-s3
- >-
https://github.com/aiven/prometheus-exporter-plugin-for-opensearch/releases/download/2.12.0.0/prometheus-exporter-2.12.0.0.zip
serviceAccount: opensearch-s3-access
serviceName: opensearch
vendor: opensearch
version: 2.12.0
initHelper:
resources: {}
nodePools:
- additionalConfig:
cluster.filecache.remote_data_ratio: '5'
node.search.cache.size: 18gb
s3.client.default.endpoint: s3.us-east-1.amazonaws.com
s3.client.default.identity_token_file: /usr/share/opensearch/config/aws-iam-token/token
component: master-data
diskSize: 100Gi
labels:
name: test
nodeSelector:
karpenter.sh/capacity-type: on-demand
pdb:
enable: true
maxUnavailable: 1
replicas: 3
resources:
limits:
cpu: 2
memory: 16Gi
requests:
cpu: 2
memory: 16Gi
roles:
- master
- data
- search
security:
config:
adminCredentialsSecret:
name: opensearch-admin-credentials
adminSecret: {}
securityConfigSecret:
name: opensearch-securityconfig
tls:
http:
caSecret: {}
generate: true
secret: {}
transport:
caSecret: {}
generate: true
secret: {}
Please let me know if you need more details. |
We did some debugging and this is what I gathered from looking at the code and stack trace: This is where the exception is thrown OpenSearch/server/src/main/java/org/opensearch/snapshots/RestoreService.java Lines 856 to 876 in a254aa9
In We deployed a version with some extra logging and printed out whenever that evaluation returned null, it was basically returning null for all indices. The reason the 2 workarounds work is because of this "if" block the prevents the rest of the OpenSearch/server/src/main/java/org/opensearch/snapshots/RestoreService.java Lines 860 to 862 in a254aa9
In the first workaround described, removing all search nodes from the cluster the nodeFileCacheStats would be empty and the function will return. In the second workaround, removing Either way, the existence of the config probably isn't the reason that I tried creating a fresh cluster on docker compose with the same config but couldn't replicate the error. @ahmadbabaeimoghadam can you confirm this happens on new clusters that are deployed with this manifest without any existing data? |
Thanks for all the details here. @LiorF-BDBQ it looks like you've found some potential leads about what is returning null. We would definitely welcome any contributions if you can further isolate the problem. A reproduction in an integration test can be super helpful as well. Tagging @kotwanikunal @bugmakerrrrrr as well as I know they have worked in this area before. |
@andrross I have just went over this issue, the logic in the method I think that the reason why indicesService.indexService(shardRouting.index()) may return null is that the IndicesService only keeps track of the indices that have been assigned to the current node, which in this scenario is the master node. We may need use |
Here is the IT to reproduce this isssue.
I found that change the |
Describe the bug
Hello,
Recently restoring snapshot as searchable is not working on our cluster.
I've tried to reproduce it on my local cluster, but it worked there...
When I change storage_type from remote_snapshot to local, snapshot is being restored without issue.
Can you please help with this ?
POST /_snapshot/azure_data_repository/retail-legacy-ppe-000337-2024.03.27-06:44:53.545/_restore { "indices": "*", "storage_type": "remote_snapshot", "ignore_unavailable": true, "include_global_state": false, "rename_pattern": "(.+)", "rename_replacement": "observability-searchable-sna2p-test2-$1", "include_aliases": false }
Response:
{ "error": { "root_cause": [ { "type": "null_pointer_exception", "reason": "Cannot invoke \"org.opensearch.index.IndexService.getIndexSettings()\" because the return value of \"org.opensearch.indices.IndicesService.indexService(org.opensearch.core.index.Index)\" is null" } ], "type": "null_pointer_exception", "reason": "Cannot invoke \"org.opensearch.index.IndexService.getIndexSettings()\" because the return value of \"org.opensearch.indices.IndicesService.indexService(org.opensearch.core.index.Index)\" is null" }, "status": 500 }
Logs:
org.opensearch.transport.RemoteTransportException: [observability-opensearch-master-1][***][cluster:admin/snapshot/restore] Caused by: java.lang.NullPointerException: Cannot invoke "org.opensearch.index.IndexService.getIndexSettings()" because the return value of "org.opensearch.indices.IndicesService.indexService(org.opensearch.core.index.Index)" is null at org.opensearch.snapshots.RestoreService$1.lambda$validateSearchableSnapshotRestorable$5(RestoreService.java:809) ~[opensearch-2.12.0.jar:2.12.0] at org.opensearch.cluster.routing.RoutingTable.allShardsSatisfyingPredicate(RoutingTable.java:325) ~[opensearch-2.12.0.jar:2.12.0] at org.opensearch.cluster.routing.RoutingTable.allShardsSatisfyingPredicate(RoutingTable.java:307) ~[opensearch-2.12.0.jar:2.12.0] at org.opensearch.snapshots.RestoreService$1.validateSearchableSnapshotRestorable(RestoreService.java:813) ~[opensearch-2.12.0.jar:2.12.0] at org.opensearch.snapshots.RestoreService$1.execute(RestoreService.java:591) ~[opensearch-2.12.0.jar:2.12.0] at org.opensearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:67) ~[opensearch-2.12.0.jar:2.12.0] at org.opensearch.cluster.service.MasterService.executeTasks(MasterService.java:882) ~[opensearch-2.12.0.jar:2.12.0] at org.opensearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:434) ~[opensearch-2.12.0.jar:2.12.0] at org.opensearch.cluster.service.MasterService.runTasks(MasterService.java:301) ~[opensearch-2.12.0.jar:2.12.0] at org.opensearch.cluster.service.MasterService$Batcher.run(MasterService.java:212) ~[opensearch-2.12.0.jar:2.12.0] at org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:204) ~[opensearch-2.12.0.jar:2.12.0] at org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:242) ~[opensearch-2.12.0.jar:2.12.0] at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854) ~[opensearch-2.12.0.jar:2.12.0] at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283) ~[opensearch-2.12.0.jar:2.12.0] at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246) ~[opensearch-2.12.0.jar:2.12.0] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?] at java.lang.Thread.run(Thread.java:1583) [?:?]
Related component
Storage:Snapshots
To Reproduce
Restore snapshot as remove storage.
Expected behavior
Snapshot restored as remote snap
Additional Details
Plugins
Please list all plugins currently enabled.
Screenshots
If applicable, add screenshots to help explain your problem.
Host/Environment (please complete the following information):
Additional context
Add any other context about the problem here.
The text was updated successfully, but these errors were encountered: