
[BUG] Searchable Snapshot not working with an NPE error #9291

Closed
ryanqin01 opened this issue Aug 14, 2023 · 11 comments · Fixed by #9470
Labels: bug (Something isn't working), distributed framework

Comments


ryanqin01 commented Aug 14, 2023

Describe the bug
I created a snapshot stored in HDFS. When the snapshot is restored as a regular index, it works fine. When I restore it as a searchable snapshot, it reports an error. The OpenSearch version is 2.9.
To Reproduce
Create the repository:

{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://es1:9000",
    "path": "/searchable_snapshots",
    "conf.dfs.client.read.shortcircuit": "false"
  }
}
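
(The report omits the endpoint; presumably the body above was sent to the standard repository registration API, with the repository name taken from the failure log further down:)

PUT _snapshot/searchable_hdfs_repository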

Create the index:

{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "age": {
        "type": "integer"
      }
    }
  }
}
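
(Presumably created with the body above via:)

PUT /my_index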

Create the snapshot:

{
  "indices": "my_index",
  "ignore_unavailable": true,
  "include_global_state": false,
  "partial": false
}
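
(Presumably sent to the snapshot API, with the snapshot name taken from the failure log below:)

PUT _snapshot/searchable_hdfs_repository/searchable_snapshot?wait_for_completion=true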

Restore the snapshot as a searchable index:

{
  "indices": "my_index",
  "storage_type": "remote_snapshot",
  "rename_pattern": "(.+)",
  "rename_replacement": "restored_my_index"
}
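
(Presumably sent to the restore API:)

POST _snapshot/searchable_hdfs_repository/searchable_snapshot/_restore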

Search the restored index (GET /restored_my_index/_search, per the request path in the log below). The error reports:

{
  "error" : {
    "root_cause" : [ ],
    "type" : "search_phase_execution_exception",
    "reason" : "all shards failed",
    "phase" : "query",
    "grouped" : true,
    "failed_shards" : [ ]
  },
  "status" : 503
}

Error in log:

[2023-08-14T14:36:06,713][WARN ][o.o.c.r.a.AllocationService] [es2] failing shard [failed shard, shard [restored_my_index][0], node[jMJDygdtQC6QqlAi043HUg], [P], recovery_source[snapshot recovery [YpS3Jwc8RROsfw7z1Wn7uQ] from searchable_hdfs_repository:searchable_snapshot/sEJNO-07T2Sc_9sO9b0UYA], s[INITIALIZING], a[id=t7IsZGhZQUKEOn1c4L5GnA], unassigned_info[[reason=ALLOCATION_FAILED], at[2023-08-14T06:36:06.588Z], failed_attempts[4], failed_nodes[[jMJDygdtQC6QqlAi043HUg]], delayed=false, details[failed shard on node [jMJDygdtQC6QqlAi043HUg]: failed recovery, failure RecoveryFailedException[[restored_my_index][0]: Recovery failed on {es2}{jMJDygdtQC6QqlAi043HUg}{leroNMlxRsOMDwkuxlyRcg}{192.168.56.103}{192.168.56.103:9300}{dms}{shard_indexing_pressure_enabled=true}]; nested: IndexShardRecoveryException[failed recovery]; nested: NullPointerException[Cannot invoke "org.opensearch.common.unit.ByteSizeValue.getBytes()" because the return value of "org.opensearch.index.snapshots.blobstore.BlobStoreIndexShardSnapshot$FileInfo.partSize()" is null]; ], allocation_status[fetching_shard_data]], message [failed recovery], failure [RecoveryFailedException[[restored_my_index][0]: Recovery failed on {es2}{jMJDygdtQC6QqlAi043HUg}{leroNMlxRsOMDwkuxlyRcg}{192.168.56.103}{192.168.56.103:9300}{dms}{shard_indexing_pressure_enabled=true}]; nested: IndexShardRecoveryException[failed recovery]; nested: NullPointerException[Cannot invoke "org.opensearch.common.unit.ByteSizeValue.getBytes()" because the return value of "org.opensearch.index.snapshots.blobstore.BlobStoreIndexShardSnapshot$FileInfo.partSize()" is null]; ], markAsStale [true]]
org.opensearch.indices.recovery.RecoveryFailedException: [restored_my_index][0]: Recovery failed on {es2}{jMJDygdtQC6QqlAi043HUg}{leroNMlxRsOMDwkuxlyRcg}{192.168.56.103}{192.168.56.103:9300}{dms}{shard_indexing_pressure_enabled=true}
        at org.opensearch.index.shard.IndexShard.lambda$executeRecovery$30(IndexShard.java:3554) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.index.shard.StoreRecovery.lambda$recoveryListener$8(StoreRecovery.java:510) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.action.ActionListener.completeWith(ActionListener.java:345) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:113) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:2620) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:88) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.9.0.jar:2.9.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.index.shard.IndexShardRecoveryException: failed recovery
        ... 11 more
Caused by: java.lang.NullPointerException: Cannot invoke "org.opensearch.common.unit.ByteSizeValue.getBytes()" because the return value of "org.opensearch.index.snapshots.blobstore.BlobStoreIndexShardSnapshot$FileInfo.partSize()" is null
        at org.opensearch.index.store.remote.file.OnDemandBlockSnapshotIndexInput.<init>(OnDemandBlockSnapshotIndexInput.java:107) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.index.store.remote.file.OnDemandBlockSnapshotIndexInput.<init>(OnDemandBlockSnapshotIndexInput.java:90) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.index.store.remote.file.OnDemandBlockSnapshotIndexInput.<init>(OnDemandBlockSnapshotIndexInput.java:61) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.index.store.remote.directory.RemoteSnapshotDirectory.openInput(RemoteSnapshotDirectory.java:77) ~[opensearch-2.9.0.jar:2.9.0]
        at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
        at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
        at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:156) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
        at org.apache.lucene.codecs.lucene90.Lucene90CompoundReader.readEntries(Lucene90CompoundReader.java:110) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
        at org.apache.lucene.codecs.lucene90.Lucene90CompoundReader.<init>(Lucene90CompoundReader.java:67) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
        at org.apache.lucene.codecs.lucene90.Lucene90CompoundFormat.getCompoundReader(Lucene90CompoundFormat.java:86) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]

Host/Environment (please complete the following information):

  • OS: CentOS 7


ryanqin01 added the bug (Something isn't working) and untriaged labels on Aug 14, 2023
ryanqin01 changed the title from "[BUG] Searchable Snapshot not working with an NPE" to "[BUG] Searchable Snapshot not working with an NPE error" on Aug 14, 2023
@ryanqin01 (Author)

If I remove "storage_type": "remote_snapshot" when creating the restored index, the restored index is created correctly. So I am fairly sure this is a bug in the searchable snapshot feature.

@ryanqin01 (Author)

The config file is:

cluster.name: my-cluster
discovery.type: single-node
node.roles: [ master, data, search ]
node.search.cache.size: 10mb

@andrross (Member)

Thanks @ryanqin01. This looks like a bug with the HDFS integration with the searchable snapshot feature.

@ryanqin01 (Author)

> Thanks @ryanqin01. This looks like a bug with the HDFS integration with the searchable snapshot feature.

Thanks for the reply. My guess is that "partSize" is a parameter of Amazon S3, but the HDFS integration uses it incorrectly?

@andrross (Member)

@ryanqin01 That seems to be the issue. "partSize" isn't specifically a parameter of S3, but for whatever reason it appears not to be set by the HDFS repository. I'm honestly not yet sure whether the HDFS repository is wrong or if it is wrong to assume that field will never be null. This still needs some more investigation.

@andrross (Member)

I believe I have traced the bug to the fact that the searchable snapshot code does not correctly allow for "partSize" to be null. null is a valid value, as this part size ultimately comes from the chunkSize() property on a repository, which per the contract can be null when no chunking is needed. In practice, it appears the other repository implementations use a very large value as a default, whereas the HDFS repository uses null as the default. The fix should be simple here.

@ryanqin01 Is there any chance you can validate this by supplying the following setting when creating your HDFS repository?

"chunk_size": "5tb"

Any implausibly large value is fine as it will have the effect of "no chunking". This should be useful as a work-around until the fix is made available in a subsequent release.
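
(A minimal sketch, not from the thread, of the kind of null-safe handling described above, assuming a null partSize() can simply fall back to the file's full length, i.e. a single part; the helper name is illustrative:)

import org.opensearch.common.unit.ByteSizeValue;
import org.opensearch.index.snapshots.blobstore.BlobStoreIndexShardSnapshot.FileInfo;

// Sketch only: resolve the per-part size while tolerating a null partSize(),
// as returned by repositories that do not chunk files (e.g. the HDFS repository).
static long resolvePartSize(FileInfo fileInfo) {
    ByteSizeValue partSize = fileInfo.partSize();
    // No chunking configured: treat the entire file as a single part.
    return partSize != null ? partSize.getBytes() : fileInfo.length();
}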


ryanqin01 commented Aug 22, 2023

> I believe I have traced the bug to the fact that the searchable snapshot code is not correctly allowing for "partSize" to be null. [...]
> @ryanqin01 Is there any chance you can validate this by supplying "chunk_size": "5tb" when creating your HDFS repository?

Hi Andrew,

It's frustrating that some new errors occurred:

[2023-08-22T14:57:59,049][WARN ][r.suppressed             ] [es2] path: /restored_my_index/_search, params: {pretty=, index=restored_my_index}
org.opensearch.action.search.SearchPhaseExecutionException: all shards failed
        at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:665) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:373) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:704) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:473) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$0(AbstractSearchAsyncAction.java:274) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.action.search.AbstractSearchAsyncAction$2.doRun(AbstractSearchAsyncAction.java:351) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.threadpool.TaskAwareRunnable.doRun(TaskAwareRunnable.java:78) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:59) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.9.0.jar:2.9.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]

The output is:

{
  "error" : {
    "root_cause" : [ ],
    "type" : "search_phase_execution_exception",
    "reason" : "all shards failed",
    "phase" : "query",
    "grouped" : true,
    "failed_shards" : [ ]
  },
  "status" : 503
}

@ryanqin01 (Author)

The repository settings are now:

{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://es1:9000",
    "path": "/searchable_snapshots",
    "conf.dfs.client.read.shortcircuit": "false",
    "chunk_size": "5tb"
  }
}

The HDFS cluster is a single-node cluster on my VM:
[root@es1 bin]# ./hdfs dfs -ls /searchable_snapshots
Found 5 items
-rw-r--r-- 1 wing supergroup 424 2023-08-22 14:55 /searchable_snapshots/index-2
-rw-r--r-- 1 wing supergroup 8 2023-08-22 14:55 /searchable_snapshots/index.latest
drwxr-xr-x - wing supergroup 0 2023-08-22 14:55 /searchable_snapshots/indices
-rw-r--r-- 1 wing supergroup 234 2023-08-22 14:55 /searchable_snapshots/meta-WnboontfQeOXbJKZMNSUmQ.dat
-rw-r--r-- 1 wing supergroup 313 2023-08-22 14:55 /searchable_snapshots/snap-WnboontfQeOXbJKZMNSUmQ.dat

@andrross (Member)

@ryanqin01 It turns out the key interface method that the searchable snapshot feature uses to fetch partial files is not implemented by the HDFS repository. Unfortunately, that means HDFS doesn't currently support searchable snapshots. I'm going to look into what it would take to add support and, in the meantime, update our documentation appropriately.
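
(For context, a rough sketch, not from the thread, of what a ranged blob read from HDFS could look like; it assumes the missing piece is the position/length variant of BlobContainer#readBlob, and fileSystem/basePath are purely illustrative names rather than the repository-hdfs plugin's actual internals:)

import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.io.input.BoundedInputStream;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative only: return a stream over bytes [position, position + length)
// of a blob stored in HDFS.
static InputStream readBlobRange(FileSystem fileSystem, Path basePath, String blobName,
                                 long position, long length) throws IOException {
    FSDataInputStream in = fileSystem.open(new Path(basePath, blobName));
    in.seek(position);                          // jump to the requested offset
    return new BoundedInputStream(in, length);  // cap the stream at the requested length
}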

@andrross (Member)

Closed inadvertently, reopening.


andrross commented Sep 5, 2023

This has been fixed and will be included in the upcoming 2.10 release.

andrross closed this as completed on Sep 5, 2023