[BUG] Unable to enable remote state repository independently of data repositories #13523

Open
BhumikaSaini-Amazon opened this issue May 3, 2024 · 4 comments · May be fixed by #13611
Labels: bug, good first issue, low hanging fruit, Storage:Remote, Storage, v2.15.0

Comments

@BhumikaSaini-Amazon
Contributor

Describe the bug

Enabling only the remote state repository (support added in PR #11858) triggers a remote store migration. The migration cannot complete, and the shards stay unassigned.

Related component

Storage

To Reproduce

  1. Launch a docrep cluster with remote state enabled. Do not configure the remote segment or remote translog repositories.
  2. Create an index.
  3. Shard creation fails and shards stay unassigned.

Expected behavior

Enabling only the remote state repository should not start a migration to remote store.

Additional Details

Exception stack trace

[2024-05-02T13:17:25,436][INFO ][o.o.i.IndexService       ] [node-1] [idx1] DocRep shard [idx1][3] is migrating to remote
[2024-05-02T13:17:25,436][WARN ][o.o.i.c.IndicesClusterStateService] [node-1] [idx1][3] marking and sending shard failed due to [failed to create shard]
java.lang.NullPointerException: Cannot invoke "Object.hashCode()" because "key" is null
    at java.base/java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936) ~[?:?]
    at org.opensearch.repositories.RepositoriesService.repository(RepositoriesService.java:568) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.index.store.RemoteSegmentStoreDirectoryFactory.newDirectory(RemoteSegmentStoreDirectoryFactory.java:61) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.index.IndexService.createShard(IndexService.java:512) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.indices.IndicesService.createShard(IndicesService.java:1025) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.indices.IndicesService.createShard(IndicesService.java:213) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:672) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:649) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:294) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:608) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:595) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:563) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:486) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:188) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
    at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
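
The top frames suggest that RepositoriesService.repository passed a null repository name to a ConcurrentHashMap, which rejects null keys. Presumably the name is null because no segment repository is configured on the node. A minimal sketch of that failure mode follows; the class and the registry contents are hypothetical stand-ins, not the real RepositoriesService:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a repository registry; not the real RepositoriesService. */
final class NullRepositoryLookup {
    // Assumed contents: only a state repository is registered, no segment repository.
    private static final Map<String, String> REPOSITORIES =
        new ConcurrentHashMap<>(Map.of("state-repo", "fs"));

    /** Returns true if looking up the given repository name throws a NullPointerException. */
    static boolean lookupThrowsNpe(String repositoryName) {
        try {
            // ConcurrentHashMap.get calls key.hashCode(), so a null key fails here.
            REPOSITORIES.get(repositoryName);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }
}
```

A missing (but non-null) name simply returns null from the map, so only a null name reproduces the exception in the trace.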

Proposed solution
The migration flow uses the isRemoteStoreNode() method, which returns true if the node has any node attribute whose key starts with the remote_store prefix:

    /**
     * Returns whether the node is a remote store node.
     *
     * @return true if the node contains remote store node attributes, false otherwise
     */
    public boolean isRemoteStoreNode() {
        return this.getAttributes().keySet().stream().anyMatch(key -> key.startsWith(REMOTE_STORE_NODE_ATTRIBUTE_KEY_PREFIX));
    }

Given that the state repo should be independent now, we should have distinct methods to identify whether the cluster state or data is remote-backed.
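
As an illustration only, the split could look roughly like the sketch below. The class, the method names (isRemoteDataNode, isRemoteStateNode), and the attribute key constants are assumptions made for this example; they are not the actual OpenSearch API and not necessarily what the linked fix does.

```java
import java.util.Map;

/** Hypothetical helper operating on a node's attribute map; not the actual DiscoveryNode class. */
final class RemoteAttributeChecks {
    // Prefix name comes from the snippet above; its value is assumed here.
    static final String REMOTE_STORE_NODE_ATTRIBUTE_KEY_PREFIX = "remote_store";
    // Assumed attribute keys for the three repositories.
    static final String SEGMENT_REPOSITORY_KEY = "remote_store.segment.repository";
    static final String TRANSLOG_REPOSITORY_KEY = "remote_store.translog.repository";
    static final String STATE_REPOSITORY_KEY = "remote_store.state.repository";

    /** Mirrors the current check: true if any remote_store attribute is present. */
    static boolean hasAnyRemoteStoreAttribute(Map<String, String> attributes) {
        return attributes.keySet().stream()
            .anyMatch(key -> key.startsWith(REMOTE_STORE_NODE_ATTRIBUTE_KEY_PREFIX));
    }

    /** True only if both data repositories (segment and translog) are configured. */
    static boolean isRemoteDataNode(Map<String, String> attributes) {
        return attributes.containsKey(SEGMENT_REPOSITORY_KEY)
            && attributes.containsKey(TRANSLOG_REPOSITORY_KEY);
    }

    /** True if a remote cluster state repository is configured. */
    static boolean isRemoteStateNode(Map<String, String> attributes) {
        return attributes.containsKey(STATE_REPOSITORY_KEY);
    }
}
```

With a node that only sets the state repository attribute, the prefix-based check still returns true, while isRemoteDataNode returns false. The migration flow could then key off the data check instead of the prefix check.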

@BhumikaSaini-Amazon BhumikaSaini-Amazon added bug Something isn't working untriaged labels May 3, 2024
@github-actions github-actions bot added the Storage Issues and PRs relating to data and metadata storage label May 3, 2024
@BhumikaSaini-Amazon BhumikaSaini-Amazon moved this from 🆕 New to Now(This Quarter) in Storage Project Board May 3, 2024
@BhumikaSaini-Amazon BhumikaSaini-Amazon added v2.15.0 Issues and PRs related to version 2.15.0 good first issue Good for newcomers low hanging fruit Storage:Remote and removed untriaged labels May 3, 2024
@sulthan309

Hi, I would like to contribute a fix for this bug. Could you please assign it to me?

@BhumikaSaini-Amazon
Contributor Author

Thank you @sulthan309 for volunteering!

We are tracking this bugfix for the 2.15 release. We want to get the fix merged to main and backported to 2.x by the code freeze date of 10th June (calendar).

Please check how the method is used in various places; that will help identify the changes we need. If you need more information at any time, please let us know.

Looking forward to your contribution!

@sulthan309

Thank you for assigning this ticket to me.

@BhumikaSaini-Amazon Sure, I will go through the code and reach out if needed.

@rookuu

rookuu commented Jul 10, 2024

We're also running into this bug in 2.15. Is there a new target release?

Projects
Status: Now(This Quarter)
3 participants