Skip to content

Conversation

@dnhatn
Copy link
Member

@dnhatn dnhatn commented Jun 26, 2018

If the recovery and indexing are concurrently happening, it's possible for a replica to receive the same operation twice: one from the replication, and the other from the recovery. However, these operations are not identical because we don't store the versionType of operations in the Lucene index.

The TranslogWriter#assertNoSeqNumberConflict assertion has been tripped several times in the CCR branch since we use Lucene in peer-recovery. This commit relaxes that assertion by excluding the versionType from the check.

If the recovery and indexing are concurrently happening, it's possible for a replica to receive the same operation twice. One from the replication, and the other from the recovery. These operations will be different because we don't store the versionType of operations in the Lucene index.

The TranslogWriter#assertNoSeqNumberConflict assertion has been tripped several times in the CCR branch since we use Lucene in peer-recovery. This commit relaxes that assertion by excluding the versionType from that check.
@dnhatn dnhatn added >feature :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. labels Jun 26, 2018
@dnhatn dnhatn requested review from bleskes and s1monw June 26, 2018 03:12
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@dnhatn
Copy link
Member Author

dnhatn commented Jun 26, 2018

@bleskes
Copy link
Contributor

bleskes commented Jun 27, 2018

do we still need this now that recovery ships an external version type (like replication)?

@dnhatn
Copy link
Member Author

dnhatn commented Jun 27, 2018

@bleskes We still need this for now. We do not convert EXTERNAL_GTE to EXTERNAL in the replication requests. Should we always use EXTERNAL in both replication and recovery?

[2018-06-27T18:39:26,830][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node-0] fatal error in thread [elasticsearch[node-0][generic][T#4]], exiting
java.lang.AssertionError: seqNo [0] was processed twice in generation [2], with different data. 

prvOp [Index{id='1', type='test', seqNo=0, primaryTerm=1, 
version=5, versionType=EXTERNAL_GTE, autoGeneratedIdTimestamp=-1}], // <-- from replication

newOp [Index{id='1', type='test', seqNo=0, primaryTerm=1,
version=5, versionType=EXTERNAL, autoGeneratedIdTimestamp=-1}]  // <-- from recovery

Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bleskes
Copy link
Contributor

bleskes commented Jun 28, 2018

We still need this for now.
fair enough.

One more thing - maybe add a test for this so we won't forget external_gte again?

@dnhatn dnhatn merged commit 573df2d into elastic:ccr Jun 28, 2018
@dnhatn
Copy link
Member Author

dnhatn commented Jun 28, 2018

Thanks @bleskes and @s1monw

@dnhatn dnhatn deleted the relax-version-type-check branch June 28, 2018 21:42
dnhatn added a commit that referenced this pull request Jun 28, 2018
If the recovery and indexing are concurrently happening, it's possible
for a replica to receive the same operation twice: one from the
replication, and the other from the recovery. However, these operations
are not identical because we don't store the versionType of operations
in the Lucene index.

The TranslogWriter#assertNoSeqNumberConflict assertion has been tripped
several times in the CCR branch since we use Lucene in peer-recovery.
This commit relaxes that assertion by excluding the versionType from the
check.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. >feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants