fix(meta): refuse to start cluster if data directory is used by another instance #9642

Gun9niR · 2023-05-06T10:06:59Z

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Reused the TrackingId from telemetry as ClusterId. Its presence/absence in meta store is used to determine if the cluster is being created. If so, the meta node will write a cluster_id/0 file in the specified data directory, signifying its occupancy. If this file is present on cluster creation, the cluster will refuse to start.

Defer system parameter persistence to before starting RPC services, so that if invalid parameters prevent the meta node from starting, they will not be persisted.

The problem of reusing TrackingId is that backward compatibility is broken because its cf is changed.

Checklist For Contributors

I have written necessary rustdoc comments
All checks passed in ./risedev check (or alias, ./risedev c)

Documentation

Types of user-facing changes

Please keep the types that apply to your changes, and remove the others.

Installation and deployment

Release note

The object store URL identified by state_store and data_directory must not be shared by multiple clusters, or later clusters will refuse to start.

…nto zhidong/data-dir

codecov · 2023-05-06T11:22:10Z

Codecov Report

Merging #9642 (64b61cf) into main (300823c) will decrease coverage by 0.02%.
The diff coverage is 61.53%.

@@            Coverage Diff             @@
##             main    #9642      +/-   ##
==========================================
- Coverage   70.77%   70.76%   -0.02%     
==========================================
  Files        1237     1237              
  Lines      207190   207221      +31     
==========================================
- Hits       146648   146637      -11     
- Misses      60542    60584      +42

Flag	Coverage Δ
rust	`70.76% <61.53%> (-0.02%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
src/meta/src/rpc/server.rs	`0.00% <0.00%> (ø)`
src/meta/src/rpc/service/telemetry_service.rs	`0.00% <0.00%> (ø)`
src/meta/src/telemetry.rs	`0.00% <0.00%> (-55.12%)`	⬇️
src/meta/src/manager/system_param/mod.rs	`18.55% <26.66%> (-1.90%)`	⬇️
src/meta/src/manager/env.rs	`68.84% <58.82%> (-1.41%)`	⬇️
src/storage/backup/src/meta_snapshot.rs	`79.38% <60.00%> (-0.83%)`	⬇️
src/meta/src/model/cluster.rs	`56.84% <69.76%> (+10.68%)`	⬆️
src/meta/src/hummock/manager/mod.rs	`71.75% <73.52%> (+<0.01%)`	⬆️
...c/meta/src/backup_restore/meta_snapshot_builder.rs	`87.50% <83.33%> (ø)`
src/common/src/system_param/mod.rs	`85.81% <100.00%> (+0.41%)`	⬆️

... and 7 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

zwang28 · 2023-05-08T02:14:11Z

src/meta/src/hummock/manager/mod.rs

+        .into())
+    } else {
+        // FIXME: Can't distinguish no such item from other errors.
+        object_store


So before this is fixed, a temporary object store read error may result in unexpected cluster id object overwriting, which is risky.

I tried to use list, but opendal requires the path to be a directory, which behaves differently from s3.

I have put the file in another directory called cluster_id and uses list to check the file's existence.

zwang28

LGTM.
We may replace the list trick after refactoring object store's get error.

odysa

LGTM

odysa · 2023-05-09T03:06:00Z

src/meta/src/telemetry.rs

-/// Column in meta store
-pub const TELEMETRY_CF: &str = "cf/telemetry";
-/// `telemetry` in bytes
-pub const TELEMETRY_KEY: &[u8] = &[74, 65, 0x6c, 65, 0x6d, 65, 74, 72, 79];


The only concern is that clusters may lose the previous persistent tracking_id after upgradation. Not a big problem.

hzxa21 · 2023-05-09T04:11:39Z

The problem of reusing TrackingId is that backward compatibility is broken because its cf is changed.

What happen when upgrading the cluster deployed prior to this PR? Will the cluster fail to startup because cluster id is not found in the new CF?

Gun9niR · 2023-05-09T10:29:13Z

If the cluster id is not found, meta node will consider the cluster to be launched for the first time, resulting in overriding existing system parameters.

In addition, since the id is newly generated, telemetry's report id will be different.

hzxa21 · 2023-05-09T10:45:10Z

If the cluster id is not found, meta node will consider the cluster to be launched for the first time, resulting in overriding existing system parameters.

In addition, since the id is newly generated, telemetry's report id will be different.

I see. As long as the cluster can start up without any issue, creating a new telemetry report id sounds okay to me.

Gun9niR · 2023-05-09T11:07:48Z

overriding existing system parameters.

Though I'm more worried about this, since some parameters cannot be modified, the override will be irreversible.

Gun9niR added 4 commits April 20, 2023 10:17

make sure data dir is empty when creating cluster

9eb5f47

check data dir instead of bucket

067bee1

log system param on meta node

26409ba

persist cluster id in etcd and write __cluster_id file to data dir

2955a83

Gun9niR requested review from odysa and zwang28 May 6, 2023 10:07

Gun9niR added 5 commits May 6, 2023 10:09

Merge branch 'main' into zhidong/data-dir

88b7196

Merge branch 'main' of https://github.com/risingwavelabs/risingwave i…

dce0467

…nto zhidong/data-dir

fix compile error

9561739

fix ut

a5c9091

write cluster id in data dir instead of bucket root

bdf0edc

Gun9niR added type/fix Bug fix breaking-change labels May 6, 2023

zwang28 reviewed May 8, 2023

View reviewed changes

put cluster id in a dir so list can be used to check if the file exists

ae47890

Gun9niR requested a review from zwang28 May 8, 2023 09:56

github-actions bot added the user-facing-changes Contains changes that are visible to users label May 8, 2023

zwang28 approved these changes May 8, 2023

View reviewed changes

fix clippy

64b61cf

Gun9niR enabled auto-merge May 9, 2023 03:06

odysa reviewed May 9, 2023

View reviewed changes

odysa approved these changes May 9, 2023

View reviewed changes

Gun9niR added this pull request to the merge queue May 9, 2023

Merged via the queue into main with commit a1db618 May 9, 2023

Gun9niR deleted the zhidong/data-dir branch May 9, 2023 03:53

CharlieSYH added the 📖✓ Covered or will be covered in the user docs. label May 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(meta): refuse to start cluster if data directory is used by another instance #9642

fix(meta): refuse to start cluster if data directory is used by another instance #9642

Gun9niR commented May 6, 2023 •

edited

Loading

codecov bot commented May 6, 2023 •

edited

Loading

zwang28 May 8, 2023

Gun9niR May 8, 2023

Gun9niR May 8, 2023

zwang28 left a comment

odysa left a comment

odysa May 9, 2023

hzxa21 commented May 9, 2023

Gun9niR commented May 9, 2023

hzxa21 commented May 9, 2023

Gun9niR commented May 9, 2023

fix(meta): refuse to start cluster if data directory is used by another instance #9642

fix(meta): refuse to start cluster if data directory is used by another instance #9642

Conversation

Gun9niR commented May 6, 2023 • edited Loading

What's changed and what's your intention?

Checklist For Contributors

Documentation

Types of user-facing changes

Release note

codecov bot commented May 6, 2023 • edited Loading

Codecov Report

zwang28 May 8, 2023

Choose a reason for hiding this comment

Gun9niR May 8, 2023

Choose a reason for hiding this comment

Gun9niR May 8, 2023

Choose a reason for hiding this comment

zwang28 left a comment

Choose a reason for hiding this comment

odysa left a comment

Choose a reason for hiding this comment

odysa May 9, 2023

Choose a reason for hiding this comment

hzxa21 commented May 9, 2023

Gun9niR commented May 9, 2023

hzxa21 commented May 9, 2023

Gun9niR commented May 9, 2023

Gun9niR commented May 6, 2023 •

edited

Loading

codecov bot commented May 6, 2023 •

edited

Loading