[Remote Store] Exception in RemoteStoreRefreshListener.afterRefresh() during relocation for remote-backed indexes #5844

Closed
ashking94 opened this issue Jan 12, 2023 · 3 comments
Labels: enhancement, Storage:Durability, v2.6.0

Comments

@ashking94
Member

Is your feature request related to a problem? Please describe.
During primary relocation, the new primary is bootstrapped with NRTReplicationEngine. The check for primary shard routing with remote store enabled also evaluates to true during primary relocation, so RemoteStoreRefreshListener.afterRefresh() can be invoked with either InternalEngine or NRTReplicationEngine. However, within afterRefresh() we cast the engine to InternalEngine without knowing the exact implementation:

((InternalEngine) indexShard.getEngine()).lastRefreshedCheckpoint();

Exception thrown:

[2023-01-12T10:01:48,118][ERROR][o.o.i.s.RemoteStoreRefreshListener] [opensearch-node1] Exception in RemoteStoreRefreshListener.afterRefresh()
java.lang.ClassCastException: class org.opensearch.index.engine.NRTReplicationEngine cannot be cast to class org.opensearch.index.engine.InternalEngine (org.opensearch.index.engine.NRTReplicationEngine and org.opensearch.index.engine.InternalEngine are in unnamed module of loader 'app')
	at org.opensearch.index.shard.RemoteStoreRefreshListener.uploadSegmentInfosSnapshot(RemoteStoreRefreshListener.java:191) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.afterRefresh(RemoteStoreRefreshListener.java:133) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275) [lucene-core-9.5.0-snapshot-0878271.jar:9.5.0-snapshot-0878271 08782710435618f15825f777ae2a5bee9b6f681a - runner - 2022-12-27 14:43:13]
	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182) [lucene-core-9.5.0-snapshot-0878271.jar:9.5.0-snapshot-0878271 08782710435618f15825f777ae2a5bee9b6f681a - runner - 2022-12-27 14:43:13]
	at org.apache.lucene.search.ReferenceManager.maybeRefresh(ReferenceManager.java:213) [lucene-core-9.5.0-snapshot-0878271.jar:9.5.0-snapshot-0878271 08782710435618f15825f777ae2a5bee9b6f681a - runner - 2022-12-27 14:43:13]
	at org.opensearch.index.engine.NRTReplicationReaderManager.updateSegments(NRTReplicationReaderManager.java:81) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.engine.NRTReplicationEngine.updateSegments(NRTReplicationEngine.java:130) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.shard.IndexShard.finalizeReplication(IndexShard.java:1412) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$finalizeReplication$5(SegmentReplicationTarget.java:217) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.action.ActionListener.completeWith(ActionListener.java:342) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.replication.SegmentReplicationTarget.finalizeReplication(SegmentReplicationTarget.java:202) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$3(SegmentReplicationTarget.java:166) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.action.StepListener.innerOnResponse(StepListener.java:77) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:55) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.action.ActionListener$4.onResponse(ActionListener.java:180) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.action.ActionListener$6.onResponse(ActionListener.java:299) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:181) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:69) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1381) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:393) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:387) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
	at java.lang.Thread.run(Thread.java:1589) [?:?]

Describe the solution you'd like
The cast to InternalEngine is used for cleaning up translogs on the local machine and in the remote store. We need to handle this by skipping setMinSeqNoToKeep if the underlying engine is NRTReplicationEngine.
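
A minimal sketch of the proposed guard, for illustration only. The types below are simplified stand-ins that borrow the names from the stack trace; they are not the real OpenSearch classes, and the helper `EngineGuardSketch.lastRefreshedCheckpoint` is hypothetical. The idea is to check the engine type before casting and let the caller skip the translog cleanup (the setMinSeqNoToKeep step) when the result is empty.

```java
import java.util.OptionalLong;

// Simplified stand-ins for the engine types named in this issue.
// These are NOT the real OpenSearch classes.
interface Engine {}

final class InternalEngine implements Engine {
    private final long checkpoint;

    InternalEngine(long checkpoint) {
        this.checkpoint = checkpoint;
    }

    long lastRefreshedCheckpoint() {
        return checkpoint;
    }
}

final class NRTReplicationEngine implements Engine {}

final class EngineGuardSketch {
    /**
     * Hypothetical helper: returns the last refreshed checkpoint only when the
     * engine really is an InternalEngine. Any other implementation (for example
     * the NRTReplicationEngine used while a primary is relocating) yields an
     * empty result instead of a ClassCastException.
     */
    static OptionalLong lastRefreshedCheckpoint(Engine engine) {
        if (engine instanceof InternalEngine) {
            return OptionalLong.of(((InternalEngine) engine).lastRefreshedCheckpoint());
        }
        return OptionalLong.empty();
    }

    public static void main(String[] args) {
        Engine[] engines = { new InternalEngine(42L), new NRTReplicationEngine() };
        for (Engine engine : engines) {
            OptionalLong checkpoint = lastRefreshedCheckpoint(engine);
            if (checkpoint.isPresent()) {
                // Here the real listener would go on to trim translogs.
                System.out.println("cleanup up to checkpoint " + checkpoint.getAsLong());
            } else {
                System.out.println("skipping translog cleanup for " + engine.getClass().getSimpleName());
            }
        }
    }
}
```

Returning an empty value rather than throwing keeps the refresh path quiet during relocation; whether the actual fix takes this shape is not stated in this issue.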

@ashking94 ashking94 added enhancement Enhancement or improvement to existing feature or request untriaged Storage:Durability Issues and PRs related to the durability framework v2.6.0 'Issues and PRs related to version v2.6.0' and removed untriaged labels Jan 12, 2023
@sachinpkale
Member

> However, within the afterRefresh() we are casting the engine to InternalEngine without knowing the exact implementation.

We have an assert that checks whether the engine is an instance of InternalEngine, but it is bypassed when running as a binary, where Java assertions are not enabled.
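
A small sketch of why an assert does not protect the cast, assuming the JVM is started without `-ea` (the Java default). The class and method names are hypothetical, and `String`/`Integer` stand in for the engine types. With assertions disabled, the `assert` statement is skipped and the unchecked cast fails; with `-ea`, the assert fails first with an AssertionError.

```java
final class AssertBypassSketch {
    /** Common idiom: the assignment inside the assert only runs when -ea is set. */
    static boolean assertionsEnabled() {
        boolean enabled = false;
        assert enabled = true;
        return enabled;
    }

    /**
     * Mirrors the pattern described above: an assert guards the type, then the
     * value is cast unconditionally. Returns a label naming what happened.
     */
    static String castAfterAssert(Object engine) {
        try {
            assert engine instanceof String : "unexpected type";
            String cast = (String) engine; // unchecked by any runtime guard
            return "cast ok: " + cast;
        } catch (ClassCastException e) {
            return "ClassCastException";
        } catch (AssertionError e) {
            return "AssertionError";
        }
    }

    public static void main(String[] args) {
        System.out.println("assertions enabled: " + assertionsEnabled());
        System.out.println(castAfterAssert(Integer.valueOf(1)));
    }
}
```

An explicit `instanceof` check in regular code behaves the same way regardless of JVM flags, which is why it is the safer guard here.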

@ashking94 ashking94 self-assigned this Jan 12, 2023
ashking94 added a commit to ashking94/OpenSearch that referenced this issue Jan 12, 2023
Signed-off-by: Ashish Singh <ssashish@amazon.com>
@ashking94
Member Author

> However, within the afterRefresh() we are casting the engine to InternalEngine without knowing the exact implementation.
>
> We have an assert which checks if the engine is instanceof InternalEngine but it will be bypassed while running as binary.

I have made the changes that fix this issue in the linked PR.

ashking94 added a commit to ashking94/OpenSearch that referenced this issue Jan 13, 2023
Signed-off-by: Ashish Singh <ssashish@amazon.com>
sachinpkale pushed a commit to sachinpkale/OpenSearch that referenced this issue Jan 13, 2023
Signed-off-by: Ashish Singh <ssashish@amazon.com>
ashking94 added a commit to ashking94/OpenSearch that referenced this issue Jan 17, 2023
Signed-off-by: Ashish Singh <ssashish@amazon.com>
ashking94 added a commit to ashking94/OpenSearch that referenced this issue Jan 17, 2023
Signed-off-by: Ashish Singh <ssashish@amazon.com>
@sachinpkale
Member

Fixed as part of #5804
