-
Notifications
You must be signed in to change notification settings - Fork 652
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Safe DB migrations using RocksDB Checkpoints #6282
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the test plan?
@bowenwang1996 Manual testing. Updated description. Please let me know if you think we need an automated test for it. |
nearcore/src/lib.rs
Outdated
match existing_checkpoint_path { | ||
Ok(existing_checkpoint_path) => match existing_checkpoint_path { | ||
Some(existing_checkpoint_path) => { | ||
panic!("Detected an existing DB migration checkpoint: '{}'. Probably a DB migration got interrupted and your DB is corrupted. Please replace the contents of '{}' with data from that checkpoint, delete the checkpoint and try again.",existing_checkpoint_path.display(), path.display()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if here, instead of a long instruction - we shouldn't point people to some README on our webpage (near.org/dev/recover-from-failed-migration) - where we can have more detailed steps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I will follow up after I write that README.
nearcore/src/config.rs
Outdated
/// Absolute path to the root directory for DB checkpoints. | ||
/// If not set, defaults to the database location, i.e. '$home/data/'. | ||
#[serde(skip_serializing_if = "Option::is_none")] | ||
pub db_checkpoints_path: Option<PathBuf>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that other files here (genesis_file, validator_key_file) are treated as relative to the nearhome directory, we might want to do the same treatment here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Treating it as a relative path is complicated as the path needs to exist.
If a user provides "./sub/dir", then ~/.near/sub/dir needs to exist, which can be confusing.
Instead I added a check that the given path is an absolute path.
nearcore/src/lib.rs
Outdated
if near_config.config.use_checkpoints_for_db_migration { | ||
if let Some(checkpoint_path) = checkpoint_path { | ||
info!(target: "near", "Deleting DB checkpoint at '{}'", checkpoint_path.display()); | ||
match std::fs::remove_dir_all(&checkpoint_path) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that std::fs::remove_dir_all
is broken on widows for large dirs, and https://docs.rs/remove_dir_all/latest/remove_dir_all/ should be preferred instead. We probably shouldn't care about this at this point though.
Update user-visible messages to consistently refer to these snapshots as "database migration snapshot". Merge checkpoint creation with a check if it already exists.
nearcore/src/lib.rs
Outdated
panic!( | ||
"Detected an existing database migration snapshot: '{}'.\n\ | ||
Probably a database migration got interrupted and your database is corrupted.\n\ | ||
Please replace the contents of '{}' with data from that checkpoint, delete the checkpoint and try again.",checkpoint_path.display(), path.display()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already return Result
, no need to panic here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to return an Err
. The only concern is that this error message is verbose, and the callsite handling error messages is verbose too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RocksDB checkpoints hardlink files and therefore are free as long as the files are on the same filesystem. Because we usually a separate disk into ~/.near/data, this PR by default writes checkpoints inside that directory.
The PR detects if a checkpoint already exists and if it does refuses a DB migration.
Checkpoints are deleted after successful migrations.
Test plan:
use_checkpoints_for_db_migration = false
and see that the DB gets corrupted and can't be recovered.