
[Remote Store] Transient CorruptIndexException while reading Segment Store Metadata #8491

Open

gbbafna opened this issue Jul 6, 2023 · 1 comment

Labels: bug (Something isn't working), Storage:Durability (Issues and PRs related to the durability framework), Storage (Issues and PRs relating to data and metadata storage)

gbbafna (Collaborator) commented Jul 6, 2023

Issue

I ran the OSB `so` workload on a 3-node cluster. On termination of a node, recovery kicked in on the primaries. While recovering, I saw a transient CorruptIndexException while reading metadata. The read is retried and succeeds on the next attempt.

Additional context

https://gist.github.com/gbbafna/a0522da858abac1795f80ccfd426c375

[2023-07-06T11:16:09,554][INFO ][o.o.i.t.t.TranslogTransferManager] [node-2] Downloading translog files with: Primary Term = 1, Generation = 550, Location = /home/ec2-user/opensearch-3.0.0-SNAPSHOT/data/nodes/0/indices/Pr3vqheZQZWm_6oMZap2Lw/2/translog
[2023-07-06T11:16:09,754][INFO ][o.o.i.t.RemoteFsTranslog ] [node-2] Downloaded translog files from remote for shard [so][2]
[2023-07-06T11:16:11,644][INFO ][o.o.i.s.RemoteSegmentStoreDirectory] [node-2] Reading latest Metadata file metadata__1__7__gizjKokBHKE7CGwW6h4e
[2023-07-06T11:16:11,702][WARN ][o.o.i.c.IndicesClusterStateService] [node-2] [so][1] marking and sending shard failed due to [failed to create shard]
org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=0 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(metadata__1__7__gizjKokBHKE7CGwW6h4e))
        at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:584) ~[lucene-core-9.7.0-snapshot-204acc3.jar:9.7.0-snapshot-204acc3 204acc3570bcbe81af8509ad2b1f8d104e03b8f4 - 2023-06-02 13:55:21]
        at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:432) ~[lucene-core-9.7.0-snapshot-204acc3.jar:9.7.0-snapshot-204acc3 204acc3570bcbe81af8509ad2b1f8d104e03b8f4 - 2023-06-02 13:55:21]
        at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:619) ~[lucene-core-9.7.0-snapshot-204acc3.jar:9.7.0-snapshot-204acc3 204acc3570bcbe81af8509ad2b1f8d104e03b8f4 - 2023-06-02 13:55:21]
        at org.opensearch.common.io.VersionedCodecStreamWrapper.readStream(VersionedCodecStreamWrapper.java:49) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.store.RemoteSegmentStoreDirectory.readMetadataFile(RemoteSegmentStoreDirectory.java:184) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.store.RemoteSegmentStoreDirectory.readLatestMetadataFile(RemoteSegmentStoreDirectory.java:172) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.store.RemoteSegmentStoreDirectory.init(RemoteSegmentStoreDirectory.java:143) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.store.RemoteSegmentStoreDirectory.<init>(RemoteSegmentStoreDirectory.java:131) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.store.RemoteSegmentStoreDirectoryFactory.newDirectory(RemoteSegmentStoreDirectoryFactory.java:60) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.IndexService.createShard(IndexService.java:475) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.indices.IndicesService.createShard(IndicesService.java:948) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.indices.IndicesService.createShard(IndicesService.java:208) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]

[2023-07-06T11:16:11,813][INFO ][o.o.i.s.RemoteSegmentStoreDirectory] [node-2] Reading latest Metadata file metadata__1__7__gizjKokBHKE7CGwW6h4e
[2023-07-06T11:16:11,873][INFO ][o.o.i.s.IndexShard       ] [node-2] [so][1] Downloading segments from remote segment store
[2023-07-06T11:16:11,890][INFO ][o.o.i.s.RemoteSegmentStoreDirectory] [node-2] Reading latest Metadata file metadata__1__7__gizjKokBHKE7CGwW6h4e
[2023-07-06T11:16:11,952][INFO ][o.o.i.s.RemoteStoreRefreshListener] [node-2] uploadBytes=36472895 uploadBytesPerSec=21594372 uploadTime=1689
[2023-07-06T11:16:21,800][INFO ][o.o.i.s.RemoteStoreRefreshListener] [node-2] uploadBytes=63181437 uploadBytesPerSec=8539185 uploadTime=7399
[2023-07-06T11:16:23,438][INFO ][o.o.i.s.RemoteStoreRefreshListener] [node-2] uploadBytes=54177246 uploadBytesPerSec=86961871 uploadTime=623
[2023-07-06T11:16:34,099][INFO ][o.o.i.s.IndexShard       ] [node-2] [so][2] org.opensearch.index.shard.IndexShard@63517857: ApplyTrans took took 24.149 seconds
[2023-07-06T11:16:34,100][INFO ][o.o.i.s.IndexShard       ] [node-2] [so][2] org.opensearch.index.shard.IndexShard@63517857: FirstRefresh took took 0.001 seconds
[2023-07-06T11:16:34,222][INFO ][o.o.i.s.IndexShard       ] [node-2] [so][2] org.opensearch.index.shard.IndexShard@63517857: Failover took took 82.54 seconds
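For reference, the expected footer value -1071082520 in the exception is Lucene's CodecUtil.FOOTER_MAGIC (0xC02893E8), while actual footer=0 means the last four bytes read back as zeros; that pattern points to a truncated or short read of the metadata file rather than bit corruption. Below is a minimal sketch of the validation step that fails, runnable against a local copy of a downloaded metadata file (the class name and the directory/file arguments are placeholders, not part of the repro):

```java
import java.nio.file.Paths;

import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

public class FooterCheck {
    public static void main(String[] args) throws Exception {
        // args[0] = directory holding a local copy of the metadata file,
        // args[1] = file name, e.g. metadata__1__7__gizjKokBHKE7CGwW6h4e
        try (Directory dir = FSDirectory.open(Paths.get(args[0]));
             IndexInput in = dir.openInput(args[1], IOContext.READONCE)) {
            // Re-reads the whole file and validates the 16-byte footer:
            // FOOTER_MAGIC, an algorithm ID of 0, and the CRC32 checksum.
            // A truncated file (or a short read that comes back as zeros)
            // fails here with "codec footer mismatch", as in the stack trace above.
            CodecUtil.checksumEntireFile(in);
            System.out.println("footer and checksum OK");
        } catch (CorruptIndexException e) {
            System.out.println("corrupt: " + e.getMessage());
        }
    }
}
```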
@gbbafna added the enhancement and Storage:Durability labels and removed untriaged on Jul 6, 2023
@gbbafna changed the title from "[Remote Store]Transient CorruptIndexException Exception while reading Segment Store Metadata" to "[Remote Store] Transient CorruptIndexException while reading Segment Store Metadata" on Jul 6, 2023
@gbbafna added the bug label and removed enhancement on Jul 11, 2023
linuxpi (Collaborator) commented Jul 11, 2023

Previously we faced an issue with footer reading (#6824), although that issue was different: there the contents of the metadata file were intact, and the problem was extra bytes left at the end even after reading the footer.

In the current scenario it is hard to say whether the issue is the same. To debug such failures we need information about the number of bytes read during the failed attempt and the successful retry; that would help us understand why it worked after a retry.
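As a starting point, instrumentation along these lines around the checksum call in RemoteSegmentStoreDirectory.readMetadataFile could capture that: it logs the reported file length against how far the read actually progressed before footer validation failed. This is an illustrative sketch, not the existing code; the helper name and log fields are hypothetical.

```java
import java.io.IOException;

import org.apache.logging.log4j.Logger;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.store.IndexInput;

final class MetadataReadDebug {
    // Hypothetical wrapper: validate the file the way readMetadataFile does, but
    // on failure record how many bytes the store reported vs. how many were consumed.
    static void checksumWithDiagnostics(IndexInput in, Logger logger) throws IOException {
        try {
            CodecUtil.checksumEntireFile(in);
        } catch (CorruptIndexException e) {
            // length() is the size the underlying store reported for the file;
            // getFilePointer() is how far checksumming got before the footer
            // check failed. A large gap would indicate a short/partial read.
            logger.warn("metadata footer validation failed: file={} length={} bytesRead={}",
                in, in.length(), in.getFilePointer(), e);
            throw e;
        }
    }
}
```

Comparing these numbers between the failing attempt and the successful retry would show whether the repository returned a truncated stream on the first read.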

@Bukhtawar added the Storage label on Jul 27, 2023