
TRA waits when an index doesn't exist but fails immediately when shard is not found #20279

Closed
bleskes opened this issue Sep 1, 2016 · 3 comments · Fixed by #49647 or #49685
Labels
>bug :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search.

Comments

@bleskes
Contributor

bleskes commented Sep 1, 2016

TransportReplicationAction currently has an inconsistency in how it deals with requests that refer to things that don't exist (which is different from being unavailable).

  1. When an index is not found in the cluster state, we go into a retry loop where we wait for the index to appear.
  2. When a request comes in for a shard that doesn't exist (i.e., the shard id is at least the number of shards), we fail immediately, as it will never appear.

This is surprising and we should fix it.
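
To make the contrast concrete, here is a minimal, self-contained Java model of the two paths; the types, the map-based cluster-state view, and the polling loop are simplified stand-ins, not the actual TransportReplicationAction code:

```java
// Illustrative model only: simplified stand-ins for the reroute logic, not the real
// TransportReplicationAction implementation.
import java.util.Map;
import java.util.concurrent.TimeUnit;

class RerouteBehavior {
    // Hypothetical cluster-state view: index name -> number of shards.
    private final Map<String, Integer> indexShardCounts;

    RerouteBehavior(Map<String, Integer> indexShardCounts) {
        this.indexShardCounts = indexShardCounts;
    }

    void route(String index, int shardId) throws InterruptedException {
        // Path 1: unknown index -> retry loop, waiting for it to appear in the cluster state.
        while (!indexShardCounts.containsKey(index)) {
            TimeUnit.MILLISECONDS.sleep(100); // in reality: an observer on cluster-state updates
        }
        // Path 2: out-of-range shard id -> fail immediately; it can never appear.
        int numberOfShards = indexShardCounts.get(index);
        if (shardId >= numberOfShards) {
            throw new IllegalArgumentException("no shard [" + shardId + "] in index [" + index
                    + "], which has " + numberOfShards + " shards");
        }
        // ... otherwise forward the request to the shard's primary ...
    }
}
```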

In my opinion we should:

  1. Require ReplicationRequests to have a complete ShardId when they reach the reroute phase in TRA (a rough sketch of points 1 and 2 follows this list).
  2. Fail immediately when that shard id cannot be resolved.
  3. Change TransportIndexAction and similar write actions to resolve the incoming requests and set their proper shard id (with index uuid). If they need to create the index, they can go ahead, but then it's up to them to also wait until the current (data) node knows about the index that was just created. We can have a shared utility method for this on AutoCreateIndex.
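
A rough sketch of points 1 and 2, using hypothetical stand-in types (this ShardId is a local record, not org.elasticsearch.index.shard.ShardId):

```java
// Hypothetical stand-ins: a request must carry a complete ShardId (index name, uuid, shard)
// before it reaches the reroute phase.
record Index(String name, String uuid) {}
record ShardId(Index index, int id) {}

final class ShardIdResolution {
    // Resolve up front and fail immediately if the target shard can never exist.
    static ShardId resolveOrFail(Index index, int requestedShard, int numberOfShards) {
        if (requestedShard < 0 || requestedShard >= numberOfShards) {
            throw new IllegalArgumentException("shard [" + requestedShard + "] cannot be resolved: index ["
                    + index.name() + "] has only " + numberOfShards + " shards");
        }
        return new ShardId(index, requestedShard);
    }
}
```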
@bleskes bleskes added help wanted adoptme :Core/Infra/Core Core issues without another label labels Sep 1, 2016
@bleskes
Contributor Author

bleskes commented Sep 1, 2016

one more note - when converting an index name to an Index instance, we always have to wait in the case the index is not there. The reason is that, sadly, a create index operation can finish successfully without all nodes knowing about it. Since we can then accept a follow-up indexing request on an arbitrary node, that node should also give a lagging cluster state a chance to arrive. Practically this means that TransportIndexAction and friends need to wait even when they don't create a missing index. I expect this to be a common pattern when we move all index resolving to the REST layer (once the transport client has been removed, so there is time...)
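
For illustration, a minimal sketch of such a shared wait utility, assuming a simple polling model; the names are hypothetical, and a real implementation would register a cluster-state listener rather than poll:

```java
// Minimal sketch of "wait for the local node to learn about the index"; all names hypothetical.
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

final class WaitForIndex {
    // Wait until the local cluster state knows about the index, or give up after the timeout.
    static void waitUntilIndexIsKnown(BooleanSupplier indexKnownLocally, long timeoutMillis)
            throws InterruptedException {
        long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!indexKnownLocally.getAsBoolean()) {
            if (System.nanoTime() >= deadlineNanos) {
                throw new IllegalStateException("timed out waiting for index to appear in the local cluster state");
            }
            TimeUnit.MILLISECONDS.sleep(50); // polling keeps the sketch short; real code would use a listener
        }
    }
}
```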

@jasontedor jasontedor added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. and removed :Core/Infra/Core Core issues without another label labels Mar 14, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@bleskes bleskes added :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. discuss and removed :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. labels Mar 20, 2018
@colings86 colings86 added the >bug label Apr 24, 2018
ywelsch added a commit to ywelsch/elasticsearch that referenced this issue Nov 27, 2019
ywelsch added a commit that referenced this issue Nov 28, 2019
This stems from a time when index requests were forwarded directly to
TransportReplicationAction. Nowadays they are wrapped in a BulkShardRequest, and this logic is
obsolete.

Closes #20279
@ywelsch
Contributor

ywelsch commented Nov 28, 2019

Require ReplicationRequests to have a complete ShardId when they reach the reroute phase in TRA.

Addressed by #40424

Fail immediately when that shard id cannot be resolved.

The shard id is no longer resolved in TRA.

Change TransportIndexAction and similar write actions to resolve the incoming requests and set their proper shard id (with index uuid). If they need to create the index, they can go ahead, but then it's up to them to also wait until the current (data) node knows about the index that was just created. We can have a shared utility method for this on AutoCreateIndex.

This is taken care of by TransportBulkAction now.

Remaining points addressed by #49647
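
In other words, the coordinating bulk layer now resolves a document to a concrete shard before handing off, so the replication layer never resolves anything itself. A hypothetical one-method sketch of that division of labor (the hash here is a stand-in, not Elasticsearch's actual routing function):

```java
// Hypothetical sketch: the bulk layer maps a routing value to a concrete shard up front,
// so the replication layer only ever sees a fully resolved target.
final class BulkRouting {
    // Stand-in for the real routing hash; Elasticsearch uses a different function.
    static int shardFor(String routing, int numberOfShards) {
        return Math.floorMod(routing.hashCode(), numberOfShards);
    }
}
```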

ywelsch added a commit that referenced this issue Nov 28, 2019
ywelsch added a commit to ywelsch/elasticsearch that referenced this issue Nov 28, 2019
ywelsch added a commit that referenced this issue Nov 29, 2019
This stems from a time when index requests were forwarded directly to
TransportReplicationAction. Nowadays they are wrapped in a BulkShardRequest, and this logic is
obsolete.

In contrast to the prior PR (#49647), this PR also fixes (see b3697cc) a situation where the
previous index expression logic had an interesting side effect. For bulk requests (which had
resolveIndex = false), the reroute phase waited for the index to appear in case it was not
present, while all other replication requests (resolveIndex = true) would right away throw an
IndexNotFoundException while resolving the name and exit. With #49647, every replication
request waited for the index to appear, which was problematic when the given index had just
been deleted (e.g. deleting a follower index while it is still receiving requests from the
leader, where these requests would now wait up to a minute for the index to appear). This PR
adds b3697cc on top of the prior PR to reestablish some of the earlier behavior, where the
reroute phase of a bulk request waits for the index to appear. That logic was in place to
ensure that when an index was created and not all nodes had learned about it yet, the bulk
would not fail somewhere in the reroute phase. The wait is now restricted to the situation
where the current node has an older cluster state than the one that coordinated the bulk
request (which checked that the index is present). This also means that when an index is
deleted, we no longer unnecessarily wait up to the timeout for the index to appear, and
instead fail the request.

Closes #20279
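
A compact sketch of the decision b3697cc describes, assuming the coordinating node's cluster-state version travels with the request; names and types are illustrative, not the actual change:

```java
// Illustrative decision logic: wait only if this node's cluster state may simply be lagging
// behind the coordinating node that already saw the index; otherwise fail fast.
final class IndexWaitDecision {
    static boolean shouldWaitForIndex(boolean indexPresentLocally,
                                      long localStateVersion,
                                      long coordinatingStateVersion) {
        if (indexPresentLocally) {
            return false; // nothing to wait for
        }
        // Older local state: the index the coordinator saw may still arrive here, so wait.
        // Equal or newer local state without the index: it was deleted, so fail with an
        // IndexNotFoundException instead of waiting out the timeout.
        return localStateVersion < coordinatingStateVersion;
    }
}
```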
ywelsch added a commit that referenced this issue Nov 29, 2019
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020