Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(meta): refuse to start cluster if data directory is used by another instance #9642

Merged
merged 11 commits into from
May 9, 2023

Conversation

Gun9niR
Copy link
Contributor

@Gun9niR Gun9niR commented May 6, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Reused the TrackingId from telemetry as ClusterId. Its presence/absence in meta store is used to determine if the cluster is being created. If so, the meta node will write a cluster_id/0 file in the specified data directory, signifying its occupancy. If this file is present on cluster creation, the cluster will refuse to start.

Defer system parameter persistence to before starting RPC services, so that if invalid parameters prevent the meta node from starting, they will not be persisted.

The problem of reusing TrackingId is that backward compatibility is broken because its cf is changed.

Checklist For Contributors

  • I have written necessary rustdoc comments
  • All checks passed in ./risedev check (or alias, ./risedev c)

Documentation

Types of user-facing changes

Please keep the types that apply to your changes, and remove the others.

  • Installation and deployment

Release note

The object store URL identified by state_store and data_directory must not be shared by multiple clusters, or later clusters will refuse to start.

@Gun9niR Gun9niR requested review from odysa and zwang28 May 6, 2023 10:07
@codecov
Copy link

codecov bot commented May 6, 2023

Codecov Report

Merging #9642 (64b61cf) into main (300823c) will decrease coverage by 0.02%.
The diff coverage is 61.53%.

@@            Coverage Diff             @@
##             main    #9642      +/-   ##
==========================================
- Coverage   70.77%   70.76%   -0.02%     
==========================================
  Files        1237     1237              
  Lines      207190   207221      +31     
==========================================
- Hits       146648   146637      -11     
- Misses      60542    60584      +42     
Flag Coverage Δ
rust 70.76% <61.53%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/meta/src/rpc/server.rs 0.00% <0.00%> (ø)
src/meta/src/rpc/service/telemetry_service.rs 0.00% <0.00%> (ø)
src/meta/src/telemetry.rs 0.00% <0.00%> (-55.12%) ⬇️
src/meta/src/manager/system_param/mod.rs 18.55% <26.66%> (-1.90%) ⬇️
src/meta/src/manager/env.rs 68.84% <58.82%> (-1.41%) ⬇️
src/storage/backup/src/meta_snapshot.rs 79.38% <60.00%> (-0.83%) ⬇️
src/meta/src/model/cluster.rs 56.84% <69.76%> (+10.68%) ⬆️
src/meta/src/hummock/manager/mod.rs 71.75% <73.52%> (+<0.01%) ⬆️
...c/meta/src/backup_restore/meta_snapshot_builder.rs 87.50% <83.33%> (ø)
src/common/src/system_param/mod.rs 85.81% <100.00%> (+0.41%) ⬆️

... and 7 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

.into())
} else {
// FIXME: Can't distinguish no such item from other errors.
object_store
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So before this is fixed, a temporary object store read error may result in unexpected cluster id object overwriting, which is risky.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to use list, but opendal requires the path to be a directory, which behaves differently from s3.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have put the file in another directory called cluster_id and uses list to check the file's existence.

@Gun9niR Gun9niR requested a review from zwang28 May 8, 2023 09:56
@github-actions github-actions bot added the user-facing-changes Contains changes that are visible to users label May 8, 2023
Copy link
Contributor

@zwang28 zwang28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
We may replace the list trick after refactoring object store's get error.

@Gun9niR Gun9niR enabled auto-merge May 9, 2023 03:06
Copy link
Contributor

@odysa odysa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

/// Column in meta store
pub const TELEMETRY_CF: &str = "cf/telemetry";
/// `telemetry` in bytes
pub const TELEMETRY_KEY: &[u8] = &[74, 65, 0x6c, 65, 0x6d, 65, 74, 72, 79];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only concern is that clusters may lose the previous persistent tracking_id after upgradation. Not a big problem.

@Gun9niR Gun9niR added this pull request to the merge queue May 9, 2023
Merged via the queue into main with commit a1db618 May 9, 2023
@Gun9niR Gun9niR deleted the zhidong/data-dir branch May 9, 2023 03:53
@hzxa21
Copy link
Collaborator

hzxa21 commented May 9, 2023

The problem of reusing TrackingId is that backward compatibility is broken because its cf is changed.

What happen when upgrading the cluster deployed prior to this PR? Will the cluster fail to startup because cluster id is not found in the new CF?

@Gun9niR
Copy link
Contributor Author

Gun9niR commented May 9, 2023

If the cluster id is not found, meta node will consider the cluster to be launched for the first time, resulting in overriding existing system parameters.

In addition, since the id is newly generated, telemetry's report id will be different.

@hzxa21
Copy link
Collaborator

hzxa21 commented May 9, 2023

If the cluster id is not found, meta node will consider the cluster to be launched for the first time, resulting in overriding existing system parameters.

In addition, since the id is newly generated, telemetry's report id will be different.

I see. As long as the cluster can start up without any issue, creating a new telemetry report id sounds okay to me.

@Gun9niR
Copy link
Contributor Author

Gun9niR commented May 9, 2023

overriding existing system parameters.

Though I'm more worried about this, since some parameters cannot be modified, the override will be irreversible.

@CharlieSYH CharlieSYH added the 📖✓ Covered or will be covered in the user docs. label May 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change type/fix Bug fix user-facing-changes Contains changes that are visible to users 📖✓ Covered or will be covered in the user docs.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants