refactor: implement the distributed lock of shard #706
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main     #706      +/-   ##
==========================================
- Coverage   68.49%   68.08%   -0.41%
==========================================
  Files         292      293       +1
  Lines       45413    45644     +231
==========================================
- Hits        31104    31076      -28
- Misses      14309    14568     +259
... and 3 files with indirect coverage changes
Maybe we should create a branch for merging changes concerning the new cluster implementation.
Capitalize the first letter of the log message.
LGTM
let res = handle_close_shard(new_ctx, close_shard_req).await;
match res {
    Ok(_) => info!("Close shard success, shard_id:{shard_id}"),
    Err(e) => error!("Close shard failed, shard_id:{shard_id}, err:{e}"),
Is there nothing to do when the close fails? Will this leave some state dirty?
Yes. In this case the lease has expired, which means a network partition has occurred between ceresdb-server and the etcd cluster, and there is nothing more we can do than hope the shard closes successfully.
})?;

if !granted_by_this_call {
When will this happen?
When retrying close/open shard.
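For context, a minimal sketch of why `granted_by_this_call` can be false: if the grant is idempotent per shard, a retried open-shard request finds the lock already held by this node and observes `false`. This is illustrative only; the type and field names below are assumptions, not the PR's actual code.

```rust
use std::collections::HashSet;

/// Illustrative only: a per-shard grant that is idempotent, so a retried
/// open-shard call sees `false` because an earlier call already granted
/// the lock. All names here are hypothetical.
struct ShardLockManager {
    granted: HashSet<u32>,
}

impl ShardLockManager {
    /// Returns whether the lock was granted by this call.
    fn grant(&mut self, shard_id: u32) -> bool {
        // `insert` returns false if the shard is already present, i.e. the
        // lock was granted by a previous (possibly retried) call.
        self.granted.insert(shard_id)
    }
}
```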
/// Check whether lease is expired.
fn is_expired(&self) -> bool {
    let now = common_util::time::current_time_millis();
This is not a monotonic clock; you should use Instant instead.
I guess the expiry check interval is not very accurate anyway, so some clock offset is tolerable.
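For reference, a minimal sketch of an expiry check based on a monotonic clock, as suggested above. The struct and field names are assumptions for illustration, not the PR's implementation:

```rust
use std::time::{Duration, Instant};

/// Hypothetical lease state tracked with a monotonic clock.
struct LeaseState {
    granted_at: Instant,
    ttl: Duration,
}

impl LeaseState {
    /// `Instant` is monotonic, so this check is unaffected by wall-clock
    /// adjustments, unlike a check based on `current_time_millis`.
    fn is_expired(&self) -> bool {
        self.granted_at.elapsed() >= self.ttl
    }
}
```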
match shard_lock {
    Some(mut v) => {
        let mut etcd_client = self.etcd_client.clone();
        v.revoke(&mut etcd_client).await?;
What if this revoke fails?
The error will eventually be propagated up to CeresMeta. There, the close shard call will fail and the subsequent operations will be cancelled.
/// The temporary key in etcd
key: Bytes,
/// The value of the key in etcd
value: Bytes,
When will this value be used?
CeresMeta will read it to derive the whole topology: shard -> node.
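To illustrate the point, a hypothetical sketch of what such a key/value pair could look like; the actual key layout and value format in the PR may differ:

```rust
/// Hypothetical key layout: one key per shard under the cluster's root path.
fn shard_lock_key(root_path: &str, cluster: &str, shard_id: u32) -> String {
    format!("{root_path}/{cluster}/shard_lock/{shard_id}")
}

/// Hypothetical value: the endpoint of the node holding the shard, which
/// lets CeresMeta rebuild the shard -> node topology by scanning the prefix.
fn shard_lock_value(node_endpoint: &str) -> Vec<u8> {
    node_endpoint.as_bytes().to_vec()
}
```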
* refactor: add etcd shard lock
* feat: upgrade etcd-client to v0.10.4 and enable tls feature
* refactor: new implementation of the shard lock
* refactor: clippy compliants
* chore: use latest ceresdbproto
* chore: fix format error
* feat: operate shard lock in meta event server
* feat: enable cluster integration tests
* feat: use ceresdbproto in crates.io
* feat: use ceresdbproto in crates.io
* feat: add some basic unit tests
* refactor: rename and comment
* chore: check the root path and cluster name

---------

Co-authored-by: xikai.wxk <xikai.wxk@antgroup.com>
Which issue does this PR close?
Closes #
Rationale for this change
In order to ensure that user data in distributed mode will not be lost, we need a mechanism that guarantees CeresDB never has multiple leader shards under any circumstances.
What changes are included in this PR?
* Introduce the etcd client in CeresDB.
* Implement the shard lock based on etcd; the corresponding lock needs to be acquired every time a shard is operated on.

Are there any user-facing changes?
None
How is this change tested?
Pass all unit tests and integration tests.
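For readers unfamiliar with lease-based locking, below is a minimal sketch of the general idea using the etcd-client crate mentioned in the commit list: grant a lease, write the shard key bound to that lease, and keep the lease alive in the background. This is not the PR's implementation; the key format, 10s TTL, tokio runtime, and error handling are assumptions, and a real lock additionally needs a conditional create (an etcd transaction) so two nodes cannot both claim the same shard.

```rust
use std::time::Duration;

use etcd_client::{Client, PutOptions};

/// Illustrative sketch of a lease-backed shard key. If the keep-alive loop
/// stops (process death, network partition), etcd deletes the key once the
/// lease expires, and another node may acquire the shard.
async fn acquire_shard_lock(
    endpoints: &[&str],
    shard_id: u32,
    node_endpoint: &str,
) -> Result<i64, etcd_client::Error> {
    let mut client = Client::connect(endpoints, None).await?;

    // Grant a lease; the lock only lives as long as this lease keeps being renewed.
    let lease = client.lease_grant(10, None).await?;
    let lease_id = lease.id();

    // Bind the shard key to the lease so it disappears when the lease expires.
    let key = format!("/shard_lock/{shard_id}");
    client
        .put(key, node_endpoint, Some(PutOptions::new().with_lease(lease_id)))
        .await?;

    // Renew the lease in the background; a real implementation would treat
    // a failed renewal as lock loss and close the shard.
    let (mut keeper, mut responses) = client.lease_keep_alive(lease_id).await?;
    tokio::spawn(async move {
        loop {
            if keeper.keep_alive().await.is_err() || responses.message().await.is_err() {
                break;
            }
            tokio::time::sleep(Duration::from_secs(3)).await;
        }
    });

    Ok(lease_id)
}
```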