-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HBASE-24871 Replication may loss data when refresh recovered replicat… #2249
Conversation
Add a UT for this case? |
🎊 +1 overall
This message was automatically generated. |
OK, I'll try it. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
As @infraio said, it will be great to add an UT to show the issue w/o change. Otherwise, looks good to me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a UT for this case?
@@ -516,7 +516,7 @@ public void refreshSources(String peerId) throws IOException { | |||
ReplicationSourceInterface replicationSource = createSource(queueId, peer); | |||
this.oldsources.add(replicationSource); | |||
for (SortedSet<String> walsByGroup : walsByIdRecoveredQueues.get(queueId).values()) { | |||
walsByGroup.forEach(wal -> src.enqueueLog(new Path(wal))); | |||
walsByGroup.forEach(wal -> replicationSource.enqueueLog(new Path(wal))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the problem is that logs from recovered queues were getting added to the new "normal" queue? And these logs are never read, then?
Nit: maybe worth rename the variable from line #516 from replicationSource
to recoveredReplicationSource
, for further clarity?
I endorse the call for an additional UT, here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the problem is that logs from recovered queues were getting added to the new "normal" queue? And these logs are never read, then?
Yes. These log will be added to normal source. And will read by normal source. So this should not loss data but only affect the replicate order.
@ddupg I thought the issue title need to be changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying, @infraio . Yes, let's change the title, please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank @wchevreuil @infraio for reviewing. In this case, it will loss data and duplicate data at the same time.
After we added these log to normal source, the normal source replicate the first WAL entries batch and updateLogPosition via zk.
It will successfully replicate first WAL entries batch, but data may be out-of date. And then failed to updateLogPosition in zk, which aborting RS, so we will also loss data.
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
Thank @huaxiangsun for reviewing, I've added an UT for this case. |
Thanks @ddupg. One quick question, will the new UT fail without changes in refreshSources()? Just want to make sure that the UT really does its job. +1 pending on your answer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pending on the question I have for the new UT.
Yes, the UT fail without changes in this patch. Because it failed to updateLogPosition in zk after replicating first WAL edits batch, which aborting RS. |
@infraio and @wchevreuil, any more comments before I merge the changes? Thanks. |
!replication.getReplicationManager().getOldSources().isEmpty()); | ||
|
||
// disable peer to trigger refreshSources | ||
hbaseAdmin.disableReplicationPeer(PEER_ID2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which assert will failed if no this fix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The UT will timeout in last UTIL2.waitFor
without this fix because of losing data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
…ion sources (#2249) Signed-off-by: huaxiangsun <huaxiangsun@apache.org> Signed-off-by: Guanghao Zhang <zghao@apache.org> Conflicts: hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java
…ion sources (#2249) Signed-off-by: huaxiangsun <huaxiangsun@apache.org> Signed-off-by: Guanghao Zhang <zghao@apache.org> Conflicts: hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java
…ion sources (#2249) Signed-off-by: huaxiangsun <huaxiangsun@apache.org> Signed-off-by: Guanghao Zhang <zghao@apache.org> Conflicts: hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java Conflicts: hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java
…ion sources (apache#2249) Signed-off-by: huaxiangsun <huaxiangsun@apache.org> Signed-off-by: Guanghao Zhang <zghao@apache.org> Conflicts: hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java
…ion sources (apache#2249) Signed-off-by: huaxiangsun <huaxiangsun@apache.org> Signed-off-by: Guanghao Zhang <zghao@apache.org> Conflicts: hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java Conflicts: hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java (cherry picked from commit e300ae3) Change-Id: I5b2aa29597811d77ce99732019a2ac3217c93dab
…ion sources