
[BUG] Searchable Snapshot not working with an NPE error #9291

Closed
ryanqin01 opened this issue Aug 14, 2023 · 11 comments · Fixed by #9470
Labels: bug (Something isn't working), distributed framework

Comments


ryanqin01 commented Aug 14, 2023

Describe the bug
I created a snapshot stored in HDFS. When the snapshot is restored as a regular index, it works fine. When I restore it as a searchable snapshot, it reports an error. The OpenSearch version is 2.9.
To Reproduce
Create the repository:

{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://es1:9000",
    "path": "/searchable_snapshots",
    "conf.dfs.client.read.shortcircuit": "false"
  }
}
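
(The report omits the endpoint; presumably the body above was sent to the standard repository registration API, with the repository name taken from the failure log further down:)

PUT _snapshot/searchable_hdfs_repository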

Create the index:

{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "age": {
        "type": "integer"
      }
    }
  }
}
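
(Presumably created with the body above via:)

PUT /my_index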

Create the snapshot:

{
  "indices": "my_index",
  "ignore_unavailable": true,
  "include_global_state": false,
  "partial": false
}
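
(Presumably sent to the snapshot API, with the snapshot name taken from the failure log below:)

PUT _snapshot/searchable_hdfs_repository/searchable_snapshot?wait_for_completion=true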

Restore the snapshot as a searchable index:

{
  "indices": "my_index",
  "storage_type": "remote_snapshot",
  "rename_pattern": "(.+)",
  "rename_replacement": "restored_my_index"
}
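
(Presumably sent to the restore API:)

POST _snapshot/searchable_hdfs_repository/searchable_snapshot/_restore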

Search the restored index (GET /restored_my_index/_search, per the request path in the log below). The error reports:

{
  "error" : {
    "root_cause" : [ ],
    "type" : "search_phase_execution_exception",
    "reason" : "all shards failed",
    "phase" : "query",
    "grouped" : true,
    "failed_shards" : [ ]
  },
  "status" : 503
}

Error in log:

[2023-08-14T14:36:06,713][WARN ][o.o.c.r.a.AllocationService] [es2] failing shard [failed shard, shard [restored_my_index][0], node[jMJDygdtQC6QqlAi043HUg], [P], recovery_source[snapshot recovery [YpS3Jwc8RROsfw7z1Wn7uQ] from searchable_hdfs_repository:searchable_snapshot/sEJNO-07T2Sc_9sO9b0UYA], s[INITIALIZING], a[id=t7IsZGhZQUKEOn1c4L5GnA], unassigned_info[[reason=ALLOCATION_FAILED], at[2023-08-14T06:36:06.588Z], failed_attempts[4], failed_nodes[[jMJDygdtQC6QqlAi043HUg]], delayed=false, details[failed shard on node [jMJDygdtQC6QqlAi043HUg]: failed recovery, failure RecoveryFailedException[[restored_my_index][0]: Recovery failed on {es2}{jMJDygdtQC6QqlAi043HUg}{leroNMlxRsOMDwkuxlyRcg}{192.168.56.103}{192.168.56.103:9300}{dms}{shard_indexing_pressure_enabled=true}]; nested: IndexShardRecoveryException[failed recovery]; nested: NullPointerException[Cannot invoke "org.opensearch.common.unit.ByteSizeValue.getBytes()" because the return value of "org.opensearch.index.snapshots.blobstore.BlobStoreIndexShardSnapshot$FileInfo.partSize()" is null]; ], allocation_status[fetching_shard_data]], message [failed recovery], failure [RecoveryFailedException[[restored_my_index][0]: Recovery failed on {es2}{jMJDygdtQC6QqlAi043HUg}{leroNMlxRsOMDwkuxlyRcg}{192.168.56.103}{192.168.56.103:9300}{dms}{shard_indexing_pressure_enabled=true}]; nested: IndexShardRecoveryException[failed recovery]; nested: NullPointerException[Cannot invoke "org.opensearch.common.unit.ByteSizeValue.getBytes()" because the return value of "org.opensearch.index.snapshots.blobstore.BlobStoreIndexShardSnapshot$FileInfo.partSize()" is null]; ], markAsStale [true]]
org.opensearch.indices.recovery.RecoveryFailedException: [restored_my_index][0]: Recovery failed on {es2}{jMJDygdtQC6QqlAi043HUg}{leroNMlxRsOMDwkuxlyRcg}{192.168.56.103}{192.168.56.103:9300}{dms}{shard_indexing_pressure_enabled=true}
        at org.opensearch.index.shard.IndexShard.lambda$executeRecovery$30(IndexShard.java:3554) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.index.shard.StoreRecovery.lambda$recoveryListener$8(StoreRecovery.java:510) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.action.ActionListener.completeWith(ActionListener.java:345) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:113) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:2620) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:88) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.9.0.jar:2.9.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.index.shard.IndexShardRecoveryException: failed recovery
        ... 11 more
Caused by: java.lang.NullPointerException: Cannot invoke "org.opensearch.common.unit.ByteSizeValue.getBytes()" because the return value of "org.opensearch.index.snapshots.blobstore.BlobStoreIndexShardSnapshot$FileInfo.partSize()" is null
        at org.opensearch.index.store.remote.file.OnDemandBlockSnapshotIndexInput.<init>(OnDemandBlockSnapshotIndexInput.java:107) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.index.store.remote.file.OnDemandBlockSnapshotIndexInput.<init>(OnDemandBlockSnapshotIndexInput.java:90) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.index.store.remote.file.OnDemandBlockSnapshotIndexInput.<init>(OnDemandBlockSnapshotIndexInput.java:61) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.index.store.remote.directory.RemoteSnapshotDirectory.openInput(RemoteSnapshotDirectory.java:77) ~[opensearch-2.9.0.jar:2.9.0]
        at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
        at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
        at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:156) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
        at org.apache.lucene.codecs.lucene90.Lucene90CompoundReader.readEntries(Lucene90CompoundReader.java:110) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
        at org.apache.lucene.codecs.lucene90.Lucene90CompoundReader.<init>(Lucene90CompoundReader.java:67) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
        at org.apache.lucene.codecs.lucene90.Lucene90CompoundFormat.getCompoundReader(Lucene90CompoundFormat.java:86) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]

Host/Environment (please complete the following information):

  • OS: CentOS 7


ryanqin01 added the bug (Something isn't working) and untriaged labels on Aug 14, 2023
ryanqin01 changed the title from "[BUG] Searchable Snapshot not working with an NPE" to "[BUG] Searchable Snapshot not working with an NPE error" on Aug 14, 2023
@ryanqin01 (Author)

If I remove "storage_type": "remote_snapshot" when creating the restored index, the restored index is created correctly. So I am fairly sure this is a bug in the searchable snapshot feature.

@ryanqin01 (Author)

The config file is:

cluster.name: my-cluster
discovery.type: single-node
node.roles: [ master, data, search ]
node.search.cache.size: 10mb

@andrross (Member)

Thanks @ryanqin01. This looks like a bug with the HDFS integration with the searchable snapshot feature.

@ryanqin01 (Author)

> Thanks @ryanqin01. This looks like a bug with the HDFS integration with the searchable snapshot feature.

Thanks for the reply. My guess is that "partSize" is a parameter of Amazon S3, but the HDFS integration uses it incorrectly?

@andrross (Member)

@ryanqin01 That seems to be the issue. "partSize" isn't specifically a parameter of S3, but for whatever reason it appears not to be set by the HDFS repository. I'm honestly not yet sure whether the HDFS repository is wrong or if it is wrong to assume that field will never be null. This still needs some more investigation.

@andrross (Member)

I believe I have traced the bug to the fact that the searchable snapshot code does not correctly allow for "partSize" to be null. null is a valid value, as this part size ultimately comes from the chunkSize() property on a repository, which per the contract can be null when no chunking is needed. In practice, it appears the other repository implementations use a very large value as a default, whereas the HDFS repository uses null as the default. The fix should be simple here.

@ryanqin01 Is there any chance you can validate this by supplying the following setting when creating your HDFS repository?

"chunk_size": "5tb"

Any implausibly large value is fine as it will have the effect of "no chunking". This should be useful as a work-around until the fix is made available in a subsequent release.
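
(A minimal sketch, not from the thread, of the kind of null-safe handling described above, assuming a null partSize() can simply fall back to the file's full length, i.e. a single part; the helper name is illustrative:)

import org.opensearch.common.unit.ByteSizeValue;
import org.opensearch.index.snapshots.blobstore.BlobStoreIndexShardSnapshot.FileInfo;

// Sketch only: resolve the per-part size while tolerating a null partSize(),
// as returned by repositories that do not chunk files (e.g. the HDFS repository).
static long resolvePartSize(FileInfo fileInfo) {
    ByteSizeValue partSize = fileInfo.partSize();
    // No chunking configured: treat the entire file as a single part.
    return partSize != null ? partSize.getBytes() : fileInfo.length();
}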


ryanqin01 commented Aug 22, 2023

> I believe I have traced the bug to the fact that the searchable snapshot code is not correctly allowing for "partSize" to be null. [...]
> @ryanqin01 Is there any chance you can validate this by supplying "chunk_size": "5tb" when creating your HDFS repository?

Hi Andrew,

It's frustrating that some new errors occurred:

[2023-08-22T14:57:59,049][WARN ][r.suppressed             ] [es2] path: /restored_my_index/_search, params: {pretty=, index=restored_my_index}
org.opensearch.action.search.SearchPhaseExecutionException: all shards failed
        at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:665) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:373) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:704) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:473) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$0(AbstractSearchAsyncAction.java:274) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.action.search.AbstractSearchAsyncAction$2.doRun(AbstractSearchAsyncAction.java:351) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.threadpool.TaskAwareRunnable.doRun(TaskAwareRunnable.java:78) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:59) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.9.0.jar:2.9.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]

The output is:

{
  "error" : {
    "root_cause" : [ ],
    "type" : "search_phase_execution_exception",
    "reason" : "all shards failed",
    "phase" : "query",
    "grouped" : true,
    "failed_shards" : [ ]
  },
  "status" : 503
}

@ryanqin01 (Author)

The repository settings are now:

{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://es1:9000",
    "path": "/searchable_snapshots",
    "conf.dfs.client.read.shortcircuit": "false",
    "chunk_size": "5tb"
  }
}

The HDFS cluster is a single-node cluster on my VM:
[root@es1 bin]# ./hdfs dfs -ls /searchable_snapshots
Found 5 items
-rw-r--r-- 1 wing supergroup 424 2023-08-22 14:55 /searchable_snapshots/index-2
-rw-r--r-- 1 wing supergroup 8 2023-08-22 14:55 /searchable_snapshots/index.latest
drwxr-xr-x - wing supergroup 0 2023-08-22 14:55 /searchable_snapshots/indices
-rw-r--r-- 1 wing supergroup 234 2023-08-22 14:55 /searchable_snapshots/meta-WnboontfQeOXbJKZMNSUmQ.dat
-rw-r--r-- 1 wing supergroup 313 2023-08-22 14:55 /searchable_snapshots/snap-WnboontfQeOXbJKZMNSUmQ.dat

@andrross (Member)

@ryanqin01 It turns out the key interface method that the searchable snapshot feature uses to fetch partial files is not implemented by the HDFS repository. Unfortunately, that means HDFS doesn't currently support searchable snapshots. I'm going to look into what it would take to add support and, in the meantime, update our documentation appropriately.
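
(For context, a rough sketch, not from the thread, of what a ranged blob read from HDFS could look like; it assumes the missing piece is the position/length variant of BlobContainer#readBlob, and fileSystem/basePath are purely illustrative names rather than the repository-hdfs plugin's actual internals:)

import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.io.input.BoundedInputStream;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative only: return a stream over bytes [position, position + length)
// of a blob stored in HDFS.
static InputStream readBlobRange(FileSystem fileSystem, Path basePath, String blobName,
                                 long position, long length) throws IOException {
    FSDataInputStream in = fileSystem.open(new Path(basePath, blobName));
    in.seek(position);                          // jump to the requested offset
    return new BoundedInputStream(in, length);  // cap the stream at the requested length
}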

@andrross (Member)

Closed inadvertently, reopening.


andrross commented Sep 5, 2023

This has been fixed and will be included in the upcoming 2.10 release.

andrross closed this as completed on Sep 5, 2023