
TRA waits when an index doesn't exist but fails immediately when shard is not found #20279

Closed
bleskes opened this issue Sep 1, 2016 · 3 comments · Fixed by #49647 or #49685
Labels
>bug :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search.

Comments

@bleskes
Contributor

bleskes commented Sep 1, 2016

TransportReplicationAction currently has an inconsistency in how it deals with requests that refer to things that don't exist (which is different from being unavailable).

  1. When an index is not found in the cluster state, we go into a retry loop where we wait for the index to appear.
  2. When a request comes in for a shard that doesn't exist (i.e., the shard id is at least the number of shards), we fail immediately, as it will never appear.

This is surprising and we should fix it.
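
To make the contrast concrete, here is a minimal, self-contained Java model of the two paths; the types, the map-based cluster-state view, and the polling loop are simplified stand-ins, not the actual TransportReplicationAction code:

```java
// Illustrative model only: simplified stand-ins for the reroute logic, not the real
// TransportReplicationAction implementation.
import java.util.Map;
import java.util.concurrent.TimeUnit;

class RerouteBehavior {
    // Hypothetical cluster-state view: index name -> number of shards.
    private final Map<String, Integer> indexShardCounts;

    RerouteBehavior(Map<String, Integer> indexShardCounts) {
        this.indexShardCounts = indexShardCounts;
    }

    void route(String index, int shardId) throws InterruptedException {
        // Path 1: unknown index -> retry loop, waiting for it to appear in the cluster state.
        while (!indexShardCounts.containsKey(index)) {
            TimeUnit.MILLISECONDS.sleep(100); // in reality: an observer on cluster-state updates
        }
        // Path 2: out-of-range shard id -> fail immediately; it can never appear.
        int numberOfShards = indexShardCounts.get(index);
        if (shardId >= numberOfShards) {
            throw new IllegalArgumentException("no shard [" + shardId + "] in index [" + index
                    + "], which has " + numberOfShards + " shards");
        }
        // ... otherwise forward the request to the shard's primary ...
    }
}
```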

In my opinion we should:

  1. Require ReplicationRequests to have a complete ShardId when they reach the reroute phase in TRA (a rough sketch of points 1 and 2 follows this list).
  2. Fail immediately when that shard id cannot be resolved.
  3. Change TransportIndexAction and similar write actions to resolve the incoming requests and set their proper shard id (with index uuid). If they need to create the index, they can go ahead, but then it's up to them to also wait until the current (data) node knows about the index that was just created. We can have a shared utility method for this on AutoCreateIndex.
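
A rough sketch of points 1 and 2, using hypothetical stand-in types (this ShardId is a local record, not org.elasticsearch.index.shard.ShardId):

```java
// Hypothetical stand-ins: a request must carry a complete ShardId (index name, uuid, shard)
// before it reaches the reroute phase.
record Index(String name, String uuid) {}
record ShardId(Index index, int id) {}

final class ShardIdResolution {
    // Resolve up front and fail immediately if the target shard can never exist.
    static ShardId resolveOrFail(Index index, int requestedShard, int numberOfShards) {
        if (requestedShard < 0 || requestedShard >= numberOfShards) {
            throw new IllegalArgumentException("shard [" + requestedShard + "] cannot be resolved: index ["
                    + index.name() + "] has only " + numberOfShards + " shards");
        }
        return new ShardId(index, requestedShard);
    }
}
```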
@bleskes bleskes added help wanted adoptme :Core/Infra/Core Core issues without another label labels Sep 1, 2016
@bleskes
Contributor Author

bleskes commented Sep 1, 2016

one more note - when converting an index name to an Index instance, we always have to wait in the case the index is not there. The reason is that, sadly, a create index operation can finish successfully without all nodes knowing about it. Since we can then accept a follow-up indexing request on an arbitrary node, that node should also give a lagging cluster state a chance to arrive. Practically this means that TransportIndexAction and friends need to wait even when they don't create a missing index. I expect this to be a common pattern when we move all index resolving to the REST layer (once the transport client has been removed, so there is time...)
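
For illustration, a minimal sketch of such a shared wait utility, assuming a simple polling model; the names are hypothetical, and a real implementation would register a cluster-state listener rather than poll:

```java
// Minimal sketch of "wait for the local node to learn about the index"; all names hypothetical.
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

final class WaitForIndex {
    // Wait until the local cluster state knows about the index, or give up after the timeout.
    static void waitUntilIndexIsKnown(BooleanSupplier indexKnownLocally, long timeoutMillis)
            throws InterruptedException {
        long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!indexKnownLocally.getAsBoolean()) {
            if (System.nanoTime() >= deadlineNanos) {
                throw new IllegalStateException("timed out waiting for index to appear in the local cluster state");
            }
            TimeUnit.MILLISECONDS.sleep(50); // polling keeps the sketch short; real code would use a listener
        }
    }
}
```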

@jasontedor jasontedor added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. and removed :Core/Infra/Core Core issues without another label labels Mar 14, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@bleskes bleskes added :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. discuss and removed :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. labels Mar 20, 2018
@colings86 colings86 added the >bug label Apr 24, 2018
ywelsch added a commit to ywelsch/elasticsearch that referenced this issue Nov 27, 2019
ywelsch added a commit that referenced this issue Nov 28, 2019
This stems from a time when index requests were forwarded directly to
TransportReplicationAction. Nowadays they are wrapped in a BulkShardRequest, and this logic is
obsolete.

Closes #20279
@ywelsch
Contributor

ywelsch commented Nov 28, 2019

Require ReplicationRequests to have a complete ShardId when they reach the reroute phase in TRA.

Addressed by #40424

Fail immediately when that shard id cannot be resolved.

The shard id is no longer resolved in TRA.

Change TransportIndexAction and similar write actions to resolve the incoming requests and set their proper shard id (with index uuid). If they need to create the index, they can go ahead, but then it's up to them to also wait until the current (data) node knows about the index that was just created. We can have a shared utility method for this on AutoCreateIndex.

This is taken care of by TransportBulkAction now.

Remaining points addressed by #49647
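
In other words, the coordinating bulk layer now resolves a document to a concrete shard before handing off, so the replication layer never resolves anything itself. A hypothetical one-method sketch of that division of labor (the hash here is a stand-in, not Elasticsearch's actual routing function):

```java
// Hypothetical sketch: the bulk layer maps a routing value to a concrete shard up front,
// so the replication layer only ever sees a fully resolved target.
final class BulkRouting {
    // Stand-in for the real routing hash; Elasticsearch uses a different function.
    static int shardFor(String routing, int numberOfShards) {
        return Math.floorMod(routing.hashCode(), numberOfShards);
    }
}
```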

ywelsch added a commit that referenced this issue Nov 28, 2019
ywelsch added a commit to ywelsch/elasticsearch that referenced this issue Nov 28, 2019
ywelsch added a commit that referenced this issue Nov 29, 2019
This stems from a time when index requests were forwarded directly to
TransportReplicationAction. Nowadays they are wrapped in a BulkShardRequest, and this logic is
obsolete.

In contrast to the prior PR (#49647), this PR also fixes (see b3697cc) a situation where the
previous index expression logic had an interesting side effect. For bulk requests (which had
resolveIndex = false), the reroute phase waited for the index to appear in case it was not
present, while all other replication requests (resolveIndex = true) would right away throw an
IndexNotFoundException while resolving the name and exit. With #49647, every replication
request waited for the index to appear, which was problematic when the given index had just
been deleted (e.g. deleting a follower index while it is still receiving requests from the
leader, where these requests would now wait up to a minute for the index to appear). This PR
adds b3697cc on top of the prior PR to reestablish some of the earlier behavior, where the
reroute phase of a bulk request waits for the index to appear. That logic was in place to
ensure that when an index was created and not all nodes had learned about it yet, the bulk
would not fail somewhere in the reroute phase. The wait is now restricted to the situation
where the current node has an older cluster state than the one that coordinated the bulk
request (which checked that the index is present). This also means that when an index is
deleted, we no longer unnecessarily wait up to the timeout for the index to appear, and
instead fail the request.

Closes #20279
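
A compact sketch of the decision b3697cc describes, assuming the coordinating node's cluster-state version travels with the request; names and types are illustrative, not the actual change:

```java
// Illustrative decision logic: wait only if this node's cluster state may simply be lagging
// behind the coordinating node that already saw the index; otherwise fail fast.
final class IndexWaitDecision {
    static boolean shouldWaitForIndex(boolean indexPresentLocally,
                                      long localStateVersion,
                                      long coordinatingStateVersion) {
        if (indexPresentLocally) {
            return false; // nothing to wait for
        }
        // Older local state: the index the coordinator saw may still arrive here, so wait.
        // Equal or newer local state without the index: it was deleted, so fail with an
        // IndexNotFoundException instead of waiting out the timeout.
        return localStateVersion < coordinatingStateVersion;
    }
}
```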
ywelsch added a commit that referenced this issue Nov 29, 2019
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020