fix(meta): refuse to start cluster if data directory is used by another instance #9642
Codecov Report
@@ Coverage Diff @@
## main #9642 +/- ##
==========================================
- Coverage 70.77% 70.76% -0.02%
==========================================
Files 1237 1237
Lines 207190 207221 +31
==========================================
- Hits 146648 146637 -11
- Misses 60542 60584 +42
... and 7 files with indirect coverage changes
src/meta/src/hummock/manager/mod.rs
.into())
} else {
// FIXME: Can't distinguish no such item from other errors.
object_store
So until this is fixed, a transient object store read error could cause the cluster ID object to be overwritten unexpectedly, which is risky.
I tried to use `list`, but opendal requires the path to be a directory, which behaves differently from S3.
I have put the file in a separate directory called `cluster_id` and use `list` to check for the file's existence.
LGTM.
We may replace the `list` trick after refactoring the object store's `get` error.
LGTM
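The `list` workaround discussed in this thread can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `MockStore`, `StoreError`, `cluster_id_exists`, and the `data` directory name are hypothetical stand-ins, and the real object store API may differ. It assumes a read error carries no information about whether the object is missing, while listing a prefix returns an empty result when nothing is there.

```rust
use std::collections::BTreeMap;

/// Hypothetical error type; like the `get` error discussed above, it does not
/// say whether the object was missing or the request itself failed.
#[derive(Debug, PartialEq)]
struct StoreError;

/// Hypothetical in-memory stand-in for an object store.
struct MockStore {
    objects: BTreeMap<String, Vec<u8>>,
    /// When true, every request fails, simulating a transient outage.
    unavailable: bool,
}

impl MockStore {
    fn new() -> Self {
        MockStore { objects: BTreeMap::new(), unavailable: false }
    }

    fn put(&mut self, path: &str, data: &[u8]) {
        self.objects.insert(path.to_string(), data.to_vec());
    }

    /// A missing object and an outage produce the same error.
    fn read(&self, path: &str) -> Result<Vec<u8>, StoreError> {
        if self.unavailable {
            return Err(StoreError);
        }
        self.objects.get(path).cloned().ok_or(StoreError)
    }

    /// Listing an empty prefix succeeds with no entries; only a failed
    /// request returns an error.
    fn list(&self, prefix: &str) -> Result<Vec<String>, StoreError> {
        if self.unavailable {
            return Err(StoreError);
        }
        Ok(self.objects.keys().filter(|k| k.starts_with(prefix)).cloned().collect())
    }
}

/// Existence check via `list`: absence is `Ok(false)`, failure stays an error.
fn cluster_id_exists(store: &MockStore, data_dir: &str) -> Result<bool, StoreError> {
    let prefix = format!("{data_dir}/cluster_id/");
    Ok(!store.list(&prefix)?.is_empty())
}

fn main() {
    let mut store = MockStore::new();
    // With `read`, "not found" and "outage" are indistinguishable.
    assert_eq!(store.read("data/cluster_id/0"), Err(StoreError));
    store.unavailable = true;
    assert_eq!(store.read("data/cluster_id/0"), Err(StoreError));

    // With `list`, the two cases differ.
    assert_eq!(cluster_id_exists(&store, "data"), Err(StoreError));
    store.unavailable = false;
    assert_eq!(cluster_id_exists(&store, "data"), Ok(false));
    store.put("data/cluster_id/0", b"some-id");
    assert_eq!(cluster_id_exists(&store, "data"), Ok(true));
}
```

Because a failed listing propagates as an error instead of being read as "no cluster ID yet", a transient outage can no longer lead to the marker being overwritten.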
/// Column in meta store
pub const TELEMETRY_CF: &str = "cf/telemetry";
/// `telemetry` in bytes
pub const TELEMETRY_KEY: &[u8] = &[74, 65, 0x6c, 65, 0x6d, 65, 74, 72, 79];
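One observation on the constant as quoted: the literal mixes decimal and hexadecimal entries, so its bytes are not the ASCII encoding of `telemetry`. The check below is a small standalone sketch; the `decode` helper and `TELEMETRY_ASCII` are hypothetical names introduced only for illustration. Since the key is only an identifier in the meta store, this does not affect behavior, and changing the literal would change the key under which the value is stored.

```rust
/// The literal exactly as quoted in the diff above.
const TELEMETRY_KEY: &[u8] = &[74, 65, 0x6c, 65, 0x6d, 65, 74, 72, 79];

/// ASCII bytes of "telemetry", written out in hexadecimal for comparison.
const TELEMETRY_ASCII: [u8; 9] = [0x74, 0x65, 0x6c, 0x65, 0x6d, 0x65, 0x74, 0x72, 0x79];

/// Hypothetical helper: render a byte slice as text for inspection.
fn decode(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    // The hexadecimal spelling matches the string literal.
    assert_eq!(&TELEMETRY_ASCII[..], &b"telemetry"[..]);
    // The quoted constant reads 74, 65, 72, 79 as decimal, so it differs.
    assert_ne!(TELEMETRY_KEY, &b"telemetry"[..]);
    println!("{}", decode(TELEMETRY_KEY)); // prints "JAlAmAJHO"
}
```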
The only concern is that clusters may lose their previously persisted `tracking_id` after upgrading. Not a big problem.
What happens when upgrading a cluster deployed prior to this PR? Will the cluster fail to start up because the cluster ID is not found in the new CF?
If the cluster ID is not found, the meta node will treat the cluster as being launched for the first time, which results in overriding the existing system parameters. In addition, since the ID is newly generated, telemetry's report ID will be different.
I see. As long as the cluster can start up without any issue, creating a new telemetry report ID sounds okay to me.
I'm more worried about this, though: since some parameters cannot be modified, the override will be irreversible.
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
Reused the `TrackingId` from telemetry as `ClusterId`. Its presence/absence in meta store is used to determine if the cluster is being created. If so, the meta node will write a `cluster_id/0` file in the specified data directory, signifying its occupancy. If this file is present on cluster creation, the cluster will refuse to start.

Defer system parameter persistence to before starting RPC services, so that if invalid parameters prevent the meta node from starting, they will not be persisted.
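The startup rule described above can be summarized as a small decision table. This is a sketch of the described behavior only, not the PR's implementation: the `Startup` enum and `decide` function are hypothetical names, and the sketch does not model comparing the stored ID against the contents of the marker file.

```rust
/// Hypothetical outcome of the startup check described above.
#[derive(Debug, PartialEq)]
enum Startup {
    /// Cluster ID found in the meta store: an existing cluster is restarting.
    Resume,
    /// No cluster ID and no marker file: create the cluster and write `cluster_id/0`.
    CreateAndClaim,
    /// No cluster ID but the marker exists: the data directory is already occupied.
    Refuse,
}

/// Decide what to do from the two observations the description mentions.
fn decide(id_in_meta_store: bool, marker_in_data_dir: bool) -> Startup {
    match (id_in_meta_store, marker_in_data_dir) {
        (true, _) => Startup::Resume,
        (false, false) => Startup::CreateAndClaim,
        (false, true) => Startup::Refuse,
    }
}

fn main() {
    assert_eq!(decide(true, true), Startup::Resume);
    assert_eq!(decide(false, false), Startup::CreateAndClaim);
    assert_eq!(decide(false, true), Startup::Refuse);
}
```

Under this rule, a cluster deployed before this change has no ID in the new column family and no marker file, so it falls into the `CreateAndClaim` case, which matches the upgrade behavior raised in the review discussion.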
The problem with reusing `TrackingId` is that backward compatibility is broken because its CF has changed.
Checklist For Contributors
`./risedev check` (or alias, `./risedev c`)
Documentation
Types of user-facing changes
Please keep the types that apply to your changes, and remove the others.
Release note
The object store URL identified by state_store and data_directory must not be shared by multiple clusters, or later clusters will refuse to start.