Introduce Primary Terms #14062
@brwe @jasontedor care to take a look?
@@ -637,6 +662,9 @@ public boolean equals(Object o) {
        if (unassignedInfo != null ? !unassignedInfo.equals(that.unassignedInfo) : that.unassignedInfo != null) {
            return false;
        }
        if (primaryTerm != that.primaryTerm) {
do we need a change in hashCode() too?
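For illustration, here is a minimal sketch of keeping `equals()` and `hashCode()` in step when a field such as the primary term is added. `RoutingEntry` is a hypothetical stand-in, not the real `ShardRouting`, and the field set is an assumption. Leaving the new field out of `hashCode()` would not break the equals/hashCode contract, but entries differing only in their term would then always collide.

```java
// Hypothetical stand-in for a routing entry; not the real ShardRouting class.
final class RoutingEntry {
    private final String index;
    private final int shardId;
    private final long primaryTerm;

    RoutingEntry(String index, int shardId, long primaryTerm) {
        this.index = index;
        this.shardId = shardId;
        this.primaryTerm = primaryTerm;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof RoutingEntry)) {
            return false;
        }
        RoutingEntry that = (RoutingEntry) o;
        // Every field compared here is also mixed into hashCode() below.
        return shardId == that.shardId
                && primaryTerm == that.primaryTerm
                && index.equals(that.index);
    }

    @Override
    public int hashCode() {
        int result = index.hashCode();
        result = 31 * result + shardId;
        result = 31 * result + Long.hashCode(primaryTerm);
        return result;
    }

    public static void main(String[] args) {
        RoutingEntry a = new RoutingEntry("test", 0, 1L);
        RoutingEntry b = new RoutingEntry("test", 0, 2L);
        System.out.println("equal: " + a.equals(b));
    }
}
```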
Should the primary term also increase when we move a primary from one node to another?
When I restart a node, the primaryTerm of primaries is incremented by 2. Is this intended?
@@ -580,6 +605,7 @@ public void writeTo(StreamOutput out) throws IOException {
        out.writeLong(version);
        out.writeByte(state.id);
        Settings.writeSettingsToStream(settings, out);
        out.writeIntArray(primaryTerms);
Since we expect these to be non-negative and "not large", I wonder if it'd be better to serialize these using a variable-length encoding? See this PR.
we can - (and yeah, reviewed your PR :) )
updated to writeVIntArray
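To show why a variable-length encoding helps for small non-negative values, here is a rough standalone sketch of a 7-bits-per-byte scheme. It is an illustration under that assumption only, and `VIntSketch` is a hypothetical class, not the Elasticsearch stream implementation.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper illustrating a variable-length int encoding.
final class VIntSketch {

    // Encodes a non-negative int using 7 data bits per byte.
    // The high bit of each byte signals that another byte follows.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Decodes a single value produced by encode().
    static int decode(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;
            }
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        for (int term : new int[] {0, 3, 127, 128, 100_000}) {
            System.out.println(term + " -> " + encode(term).length + " byte(s)");
        }
    }
}
```

With this scheme a small term fits in one byte instead of the fixed four bytes of a plain int, at the cost of up to five bytes for very large values.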
We can, maybe later. My feeling now is that this is not needed, since this is the "same" primary - it just moved.
Good catch!! Fixed and added some testing.
        assertThat(testAllocator.needToFindPrimaryCopy(shard), equalTo(false));
    }

    @Test
    public void testNoProcessPrimayNotAllcoatedBefore() {
        ShardRouting shard = TestShardRouting.newShardRouting("test", 0, null, null, null, true, ShardRoutingState.UNASSIGNED, 0, new UnassignedInfo(UnassignedInfo.Reason.INDEX_CREATED, null));
    public void testNoProcessPrimacyNotAllocatedBefore() {
typo?
yep.. fixed.
I left some nitpicks, but in general I wonder about this: shard version and primary term should always be the same for all copies. We add the primary term to the index meta data but not the shard version. Also, we write the shard version when we persist the shard meta data but not the primary term. Why do we treat them differently?
        return true;
    }

    @Override
    public int hashCode() {
        int result = index.hashCode();
        result = 31 * result + (int) (version ^ (version >>> 32));
There's a built-in (`int Long#hashCode(long)`) for computing the hash code of a `long` since Java 8.
sure thing.
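As a quick check of the suggestion above, this small sketch compares the hand-written fold from the diff with `Long.hashCode(long)`. `LongHashSketch` and `manualHash` are hypothetical names used only for this comparison.

```java
// Hypothetical helper comparing the manual fold with the Java 8 built-in.
final class LongHashSketch {

    // The expression used in the diff: XOR the upper and lower 32 bits.
    static int manualHash(long value) {
        return (int) (value ^ (value >>> 32));
    }

    public static void main(String[] args) {
        long[] samples = {0L, 42L, -1L, Long.MAX_VALUE, Long.MIN_VALUE};
        for (long v : samples) {
            System.out.println(v + ": same = " + (manualHash(v) == Long.hashCode(v)));
        }
    }
}
```

Both forms compute the same value, so the built-in is a drop-in replacement that is easier to read.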
What I meant was that they are both always the same for each copy (although primary term and version can of course differ). Shard version is only in the ShardRoutings but shard term is in both, and that seems redundant to me. However, this is not a problem with this pull request but more with how versioning of MetaData, IndexMetaData, shards etc. works now. I have no good idea how to make this easier to read, but I opened an issue here to discuss: #14158
I looked at the PR and I wonder if we should introduce a dedicated class for this, for several reasons:
WDYT?
Pushed another commit with a fix for the double version increment issue @brwe found and somewhat beefed-up Javadocs. @s1monw I gave it some more thought and I still think, at least as things stand now, that a wrapper class for the PrimaryTerm will add complexity instead of making things clearer. It would just be a wrapper around an int and would obscure a simple operation behind a method. Since it's a gut-feeling thing, I asked the group today and @jasontedor tends to agree. We do totally see the importance of documentation. I've beefed up what I could in the current PR and added an explicit docs todo on the seq no meta data issue. I suggest we proceed as is. This is the very first step in a longer journey; as soon as there is more complex logic around the primary term that needs a home, we'll wrap it up in a class. I also changed primary terms to be long. I made them int originally to address concerns people voiced about 16 bytes (term + counter) per doc, but I agree we can review it later on and maybe just encode it differently.
        }

        private void primaryTerms(long[] primaryTerms) {
            this.primaryTerms = primaryTerms;
Should this be a copy?
Sure, I'll add a copy for safety (though it's called with freshly constructed arrays).
Maybe I misread, but I think there's one place where it's not in IndexMetaDataDiff.apply?
You didn’t misread. The only thing is that the diffs are read off the network and are then discarded. That one pulled me over the line to actually change it and copy the array.
On 21 Oct 2015, at 18:15, Jason Tedor notifications@github.com wrote:
In core/src/main/java/org/elasticsearch/cluster/metadata/IndexMetaData.java:
}
/**
 * sets the primary term for the given shard.
 * See {@link IndexMetaData#primaryTerm(int)} for more information.
 */
public Builder primaryTerm(int shardId, long primaryTerm) {
if (primaryTerms == null) {
initializePrimaryTerms();
}
this.primaryTerms[shardId] = primaryTerm;
return this;
}
private void primaryTerms(long[] primaryTerms) {
this.primaryTerms = primaryTerms;
Maybe I misread, but I think there's one place where it's not in IndexMetaDataDiff.apply?
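The defensive copy discussed in this thread can be sketched roughly as follows. `TermsBuilderSketch` is a hypothetical builder written for illustration, not the real `IndexMetaData.Builder`, and it assumes the terms are held in a plain `long[]` as in the quoted setter.

```java
// Hypothetical builder used only for illustration; not the real IndexMetaData.Builder.
final class TermsBuilderSketch {
    private long[] primaryTerms;

    // Stores a private copy so later changes to the caller's array
    // are not visible through this builder.
    TermsBuilderSketch primaryTerms(long[] primaryTerms) {
        this.primaryTerms = primaryTerms.clone();
        return this;
    }

    long primaryTerm(int shardId) {
        return primaryTerms[shardId];
    }

    public static void main(String[] args) {
        long[] terms = {1L, 2L};
        TermsBuilderSketch builder = new TermsBuilderSketch().primaryTerms(terms);
        terms[0] = 99L; // mutating the caller's array after the call
        System.out.println("term for shard 0: " + builder.primaryTerm(0));
    }
}
```

The copy costs one small allocation per call, which is cheap next to the risk of a shared array being mutated elsewhere.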
I left a few more comments: I have reservations about the conversion from |
Every shard group in Elasticsearch has a selected copy called a primary. When a primary shard fails, a new primary is selected from the existing replica copies. This PR introduces `primary terms` to track the number of times this has happened. This will allow us, as follow-up work and among other things, to identify operations that come from old stale primaries. It is also the first step on the road towards sequence numbers. Relates to #10708 Closes #14062
This is pushed to the feature/seq_no branch. Thanks @jasontedor @brwe and @s1monw for the reviews.
Primary terms are a way to make sure that operations replicated from a stale primary are rejected by shards following a newly elected primary. The original PRs adding this to the seq# feature branch are elastic#14062 and elastic#14651. Unlike those PRs, here we take a different approach (based on newer code in master) where the primary terms are stored in the meta data only (and not in `ShardRouting` objects). Relates to elastic#17038 Closes elastic#17044
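The idea of rejecting operations from a stale primary can be sketched as below. This is only an illustration of the concept, not the code from either PR: `PrimaryTermSketch`, its method names, and the assumption that terms start at 0 are all hypothetical.

```java
// Hypothetical illustration of primary terms; not the Elasticsearch implementation.
final class PrimaryTermSketch {
    // One term per shard id; assumed to start at 0 for a new index.
    private final long[] primaryTerms;

    PrimaryTermSketch(int numberOfShards) {
        this.primaryTerms = new long[numberOfShards];
    }

    long primaryTerm(int shardId) {
        return primaryTerms[shardId];
    }

    // Called when a replica is promoted after the old primary failed.
    long promoteNewPrimary(int shardId) {
        return ++primaryTerms[shardId];
    }

    // An operation stamped with an older term came from a stale primary.
    boolean acceptOperation(int shardId, long operationTerm) {
        return operationTerm >= primaryTerms[shardId];
    }

    public static void main(String[] args) {
        PrimaryTermSketch terms = new PrimaryTermSketch(2);
        long oldTerm = terms.primaryTerm(0);
        terms.promoteNewPrimary(0);
        System.out.println("stale op accepted: " + terms.acceptOperation(0, oldTerm));
    }
}
```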