fix: incr sync shouldn't be established after full sync corrupted #2756

cheniujh · 2024-06-24T03:38:54Z

这个PR修复了 Issue #2742 中的第二个问题：

问题描述: 从节点在全量同步失败，RsynClient异常退出之后尝试进行增量连接竟然成功了，这是不对的。

原因：一方面，RsyncClient如果异常退出，对于上层主从状态机来说没有别的信号告知，所以才会走了继续尝试增量连接的链路。另一方面，按理说如果全量失败，那么使用全量文件打开一个新RocksDB实例应当也是失败的（RocksDB Apply完manifest文件后会检查内存中的current version中的fileset是否和磁盘上的一致）。但这次case中，全量同步发生中断，恰好只拉取了部分SST文件，RocksDB的CURRENT, MANIFEST文件都没有拉过来，于是在Replace DB的阶段，RocksDB打开新实例时，因为找不到CURRENT文件，会直接起一个空实例，所以没有报错。

解决方案：

1 在RsynClient内部增加error_stop_标志位，如果RsyncClient异常退出（也就是全量同步异常退出，文件没有拉取完毕），就直接删除snapshot所对应的文件夹（./dbsync/dbx）
2 通过1中删除文件夹的做法，能在不提高RsyncClient和上层耦合的情况下，将错误状态以文件夹不存在的形式传递给上层的从节点状态机，从节点状态机发现snapshot文件夹不存在的话会将SlaveDB状态转为TryConnect进行连接重试

This PR fixes the second issue in Issue #2742:

Problem Description: The issue was that after a full synchronization was successful, the attempt to perform an incremental connection succeeded, which was incorrect.

Reason: In this case, the full synchronization was interrupted, and only some SST files were fetched; the CURRENT and MANIFEST files were not fetched. Therefore, during the Replace DB phase, when RocksDB opened a new instance, it started an empty instance directly because it couldn't find the CURRENT file, so no error was reported.

Solution:

1. Add an error_stop_ flag inside RsyncClient, which will directly delete the folder corresponding to the snapshot (./dbsync/dbx) if RsyncClient exits abnormally (i.e., if the full synchronization exits abnormally and the files were not completely fetched).
1. By deleting the folder as mentioned in point 1, RsyncClient can convey the error information to the upper-layer slave state machine by making the folder non-existent. If the slave node finds that the snapshot folder does not exist, it will switch the slave state to TryConnect and attempt to reconnect.

Summary by CodeRabbit

Bug Fixes
- Enhanced error logging and state management for missing slave database scenarios.
- Improved error handling and logging for various methods to provide clearer diagnostics.
New Features
- Introduced state setting to trigger sync restart when certain conditions are not met, improving reliability in data synchronization scenarios.

…t_exit

coderabbitai · 2024-06-24T03:39:03Z

Walkthrough

The main changes enhance error handling and state management in the DB::TryUpdateMasterOffset and RsyncClient classes. Specifically, logging levels were adjusted, new error conditions were added, and additional actions were taken to improve synchronization and stability of these components.

Changes

Files	Change Summary
`src/pika_db.cc`	Adjusted logging levels, error handling, and synchronization actions in `DB::TryUpdateMasterOffset`
`src/rsync_client.cc`	Enhanced error handling, logging, and state management methods: `Copy`, `ThreadMain`, `CopyRemoteFile`, and `ComparisonUpdate`

Poem

In code's vast plains, changes fly high,
Where errors once whispered, now logs loudly cry.
Sync's gentle dance with states refined,
By clever tweaks, stability defined.
Oh, code so bright, in structure tight,
Rabbits cheer as bugs take flight!
🐇✨

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share

Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai generate interesting stats about this repository and render them as a table.
- @coderabbitai show all the console.log statements in this repository.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between e131567 and ff17053.

Files selected for processing (5)

include/pika_rm.h (1 hunks)
include/rsync_client.h (2 hunks)
src/pika_db.cc (1 hunks)
src/pika_rm.cc (1 hunks)
src/rsync_client.cc (7 hunks)

Additional comments not posted (10)

include/rsync_client.h (2)

65-67: New method IsExitedFromRunning added for improved state checking

This method provides a clear way to check if the Rsync client has stopped and all workers have exited, which is crucial for ensuring that resources are cleaned up properly after operations. This change aligns with the PR's goal to handle errors and state transitions more effectively.

98-99: Introduction of error_stopped_ and all_worker_exited_ atomic variables

These new atomic variables enhance the management of the Rsync client's state, particularly in error scenarios and worker management. This is a positive change as it supports robust state tracking and error handling, which is essential for the reliability of the synchronization process.
[APROVED]

include/pika_rm.h (1)

119-119: Method IsRsyncExited correctly delegates to IsExitedFromRunning of RsyncClient

This change is critical as it ensures that the slave database synchronization state is accurately assessed based on the Rsync client's state. It helps in making the slave state machine more robust by providing accurate information about the Rsync process's status.

src/rsync_client.cc (4)

52-54: Improved logging condition based on error_stopped_

The conditional logging based on the error_stopped_ state is a good addition. It prevents unnecessary log entries when an error has already been flagged, thus keeping the log cleaner and more relevant.

82-85: Enhanced error handling and state management in ThreadMain

The changes in this method significantly improve how errors and the operational state are managed. Specifically:

Directories are cleaned up if no files need copying.

The state and worker exit flags are appropriately managed to reflect the current operational status.

Error conditions lead to immediate state changes and directory cleanup, which is crucial for maintaining system integrity in failure scenarios.

These changes are well-aligned with the PR's goals of robust error handling and state management.

Also applies to: 96-107, 152-163

238-238: Proper setting of error_stopped_ on snapshot UUID mismatch

This line ensures that the error_stopped_ flag is set when there is a mismatch in the snapshot UUIDs, which is a critical error scenario. This change is essential for preventing incorrect data synchronization and potential data corruption.

334-335: Resetting state and error flags before metadata copying

Resetting the state_ to RUNNING and error_stopped_ to false before starting a new operation ensures that the Rsync client is in a clean state. This is a good practice, especially in a complex multi-threaded environment, to avoid carrying over previous errors or states into new operations.

src/pika_db.cc (2)

404-406: Elevated logging level from WARNING to ERROR for missing slave DB.

This change is appropriate as it correctly emphasizes the severity of a missing slave database in a replication setup, which could lead to significant data consistency issues.

408-412: Enhanced error handling by setting replication state to kTryConnect on missing snapshot file.

This change ensures that the system attempts to re-establish a connection when critical snapshot files are missing, promoting robustness in error scenarios.
src/pika_rm.cc (1)
993-993: Ensure that the IsRsyncExited method is robust and correctly implemented.

The method IsRsyncExited is used here to decide whether to update the master offset based on the rsync process's state. This is critical for ensuring that synchronization does not proceed based on incomplete or corrupt data.

Verification successful

Ensure that the IsRsyncExited method is robust and correctly implemented.

The method IsExitedFromRunning, which is called by IsRsyncExited, checks if the state_ is STOP and if all workers have exited. This implementation is robust and accurately reflects the state of the rsync process.

include/rsync_client.h: bool IsExitedFromRunning() { return state_.load() == STOP && all_worker_exited_.load(); }
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the implementation of `IsRsyncExited` method in `rsync_client.cc` to ensure it accurately reflects the rsync process state.

# Test: Search for the method implementation. Expect: Accurate state reflection of the rsync process.
rg --type cpp --multiline "bool RsyncClient::IsRsyncExited()"
Length of output: 63

Script:
#!/bin/bash
# Description: Locate the class `RsyncClient` and list all its methods to identify if `IsRsyncExited` or a similar method exists.

# Step 1: Search for the class `RsyncClient` to find its definition.
rg --type cpp --multiline "class RsyncClient"

# Step 2: List all methods within the `RsyncClient` class to identify the presence of `IsRsyncExited` or similar.
rg --type cpp --multiline "RsyncClient::" --context 10
Length of output: 13128

Script:
#!/bin/bash
# Description: Search for any methods in the RsyncClient class that might indicate the state of the rsync process.

# Step 1: Search for methods in RsyncClient that relate to the state of the rsync process.
rg --type cpp --multiline "RsyncClient::" --context 5 | grep -i "state"

# Step 2: Inspect the usage of IsRsyncExited in the codebase to understand its context.
rg --type cpp "IsRsyncExited"
Length of output: 668

Script:
#!/bin/bash
# Description: Locate and review the implementation of `IsExitedFromRunning` to ensure it accurately reflects the rsync process state.

# Step 1: Search for the `IsExitedFromRunning` method in the codebase.
rg --type cpp "IsExitedFromRunning"
Length of output: 176

Script:
#!/bin/bash
# Description: Locate and review the implementation of `IsExitedFromRunning` to ensure it accurately reflects the rsync process state.

# Step 1: Search for the implementation of `IsExitedFromRunning` in `rsync_client.h` or related `.cc` files.
rg --type cpp --multiline "bool IsExitedFromRunning" --context 10
Length of output: 1204

…o fix_full_sync_interupt_exit

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between ff17053 and 249be82.

Files selected for processing (2)

src/pika_db.cc (1 hunks)
src/rsync_client.cc (7 hunks)

Files skipped from review as they are similar to previous changes (2)

src/pika_db.cc
src/rsync_client.cc

src/pika_db.cc

wangshao1 · 2024-06-26T09:10:52Z

src/rsync_client.cc

+  if (!error_stopped_.load()) {
+    LOG(INFO) << "RsyncClient copy remote files done";
+  } else {
+    if (DeleteDirIfExist(dir_)) {


创建/删除dir的逻辑我印象中外层也有，这个可能你也要梳理下，看会不会出现删了之后，没创建就直接用的情况。或者也可以评估下把创建/删除dir的逻辑统一收敛在rsync_client里。

创建/删除dir的逻辑我印象中外层也有，这个可能你也要梳理下，看会不会出现删了之后，没创建就直接用的情况。或者也可以评估下把创建/删除dir的逻辑统一收敛在rsync_client里。

该PR中删除的文件夹为DB粒度，即：./dbsync/dbname, 如./dbsync/db0,经过梳理，相关的操作有两类：

1 RsyncClient的dir_成员，其值就是./dbsync/dbname，经过梳理可以确定，RsyncClient内部有几处打开/新建文件的逻辑都要求./dbsync/dbname路径已经存在，否则会出现非预期的文件打开/创建失败的错误。所以 ./dbsync/dname需要在RsynClient执行Init()时，或者之前就创建好。

2 DB的dbsync_path_成员，其值也是./dbsync/dbname，被用在两个地方：1 在SlaveDB发出DBSync请求时，用于创建./dbsync/dbname目录 2 在全量结束，RsyncClient结束以后，用于拼接info文件路径来检查是否全量成功以及在替换DB实例时作为ChangeDB函数的入参。

综合1, 2可以得到结论：RsynClient执行全量传输时必须要求./dbsync/dbname路径已经存在，而该路径是每次在Slave发出全量同步请求时就会建立，所以当前PR的改法正确性是没有问题的，因为该路径（./dbsync/dbname）一定会先于RsynClient开始工作前被创建。

但是似乎存在另外一个问题，就是全量同步的断点续传功能按照目前这个逻辑，似乎就不能起作用（每次发出DBSync请求前，都会删除./dbsync/dbname目录以后再重新建，这样也会把本来能复用于续传的文件也清空），这个问题后面进一步测试再做考虑，如果该问题真的存在，则另起一个PR处理。

具体的梳理如下：

1.1 RsyncClient::dir_被用于：CopyRemoteFile函数，rsync_worker拷贝文件时，在新建文件时会去使用dir_拼接文件名来新文件，使用open系统调用新建文件，此时的前提是dir_目录要存在，否则会建立文件失败（要求./dbsync/dbname）

1.2 RsyncClient::dir_被用于：在RsyncClient的ThreadMain中，GetLocalMetaFilepath也使用了dir_用于拼接DUMP_META的文件名，此处也和1类似，需要目录先存在，才能成功打开/创建DUMP_META，所以如果目录不存在会导致非预期的错误（文件建立失败）

1.3 RsyncClient::dir_被用于：LoadLocalMeta<-ComparisionUpdate<-RsyncCli::Init()<-SyncSlaveDB::ActivateRsync，在这里dir_被用于拼接DUMP_META文件的路径，这里试图打开DUMP_META文件来实现断点续传，如果文件夹都不存在，文件自然也不存在，但这里有相应的处理链路，如果文件夹都不存在就执行这里也没有问题

1.4 RsyncClient::dir_被用于：在CleanUpExPireFiles<-ComparisionUpdate中，这里使用dir_用于拼接过期的dump文件，如果文件夹都不存在，自然不会发生删除，这里也没有问题

1.5 RsyncClient::dir_被用于：UpdateLocalMeta<-ComparisionUpdate，这里也会使用dir_去拼接DUMP_META的文件名，并且新建一个DUMP_META, 如果此时文件夹不存在，也是要出问题的（文件建立失败）

2.1 DB::dbsync_path_被用于: DB::PrepareRsync函数，负责创建./dbsync/db_name目录，他会先将./dbsync/db_name目录删除，然后重建该目录，并且也会将其下的inst目录创建好（0,1,2...)。从这里可以看出全量同步的断点续传逻辑目前应该没法工作，因为每次发DBSync请求之前都会把./dbsync/db_name目录清空。

2.2 DB::dbsync_path_被用于：在RsyncClient退出以后，于TryUpdateMasterOffset中用于拼接info文件的地址，如果发现info文件不存在就不继续往下走（本PR也在这个分支加了状态流转）

2.3 DB::dbsync_path_被用于：ChangeDB(dbsync_path_)<-TryUpdateMasterOffset，也是RsyncClient结束以后,去用这个db_sync_目录来做RocksDB实例的替换

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 249be82 and 6087180.

Files selected for processing (1)

src/pika_db.cc (2 hunks)

Additional comments not posted (3)

src/pika_db.cc (3)

404-406: Correctly updated logging and state setting for missing slave_db.

Changing the logging level to ERROR and setting the replication state to kError ensures that the system correctly identifies and logs critical issues with the slave database.

409-413: Properly added check for info_path existence and updated replication state.

Introducing the info_path variable and checking its existence ensures that the synchronization process only proceeds if the necessary info file is present. Setting the replication state to kTryConnect allows the system to attempt reconnection if the file is missing.

481-482: Correctly updated replication state for missing master_db.

Setting the replication state to kError if the master database is missing ensures that the system correctly identifies and logs critical issues with the master database.

…t_exit

) * add error_exited for rsyncClient * add some logs * add some state transition based on reviewer's opinion --------- Co-authored-by: cheniujh <1271435567@qq.com>

…enAtomFoundation#2756) * add error_exited for rsyncClient * add some logs * add some state transition based on reviewer's opinion --------- Co-authored-by: cheniujh <1271435567@qq.com>

cheniujh and others added 2 commits June 20, 2024 21:48

add error_exited for rsyncClient

db096d7

Merge branch 'OpenAtomFoundation:unstable' into fix_full_sync_interup…

ff17053

…t_exit

github-actions bot added the ☢️ Bug Something isn't working label Jun 24, 2024

cheniujh mentioned this pull request Jun 24, 2024

fix: unsigned int overflow when master-slave sync data #2746

Merged

coderabbitai bot reviewed Jun 24, 2024

View reviewed changes

cheniujh added 2 commits June 24, 2024 15:53

add some logs

5fca1d8

Merge remote-tracking branch 'origin/fix_full_sync_interupt_exit' int…

249be82

…o fix_full_sync_interupt_exit

coderabbitai bot reviewed Jun 24, 2024

View reviewed changes

cheniujh requested a review from wangshao1 June 25, 2024 02:20

wangshao1 reviewed Jun 26, 2024

View reviewed changes

AlexStocks added 3.5.5 4.0.1 labels Jun 28, 2024

add some state transition based on reviewer's opinion

6087180

cheniujh requested a review from wangshao1 July 1, 2024 10:42

coderabbitai bot reviewed Jul 1, 2024

View reviewed changes

wangshao1 approved these changes Jul 2, 2024

View reviewed changes

Merge branch 'OpenAtomFoundation:unstable' into fix_full_sync_interup…

a8212df

…t_exit

baixin01 approved these changes Jul 4, 2024

View reviewed changes

AlexStocks merged commit fb67dbb into OpenAtomFoundation:unstable Jul 6, 2024
12 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: incr sync shouldn't be established after full sync corrupted #2756

fix: incr sync shouldn't be established after full sync corrupted #2756

cheniujh commented Jun 24, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jun 24, 2024 •

edited

Loading

Chat

CodeRabbit Commands (invoked as PR comments)

CodeRabbit Configration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

coderabbitai bot left a comment

wangshao1 Jun 26, 2024

cheniujh Jul 1, 2024 •

edited

Loading

coderabbitai bot left a comment

fix: incr sync shouldn't be established after full sync corrupted #2756

fix: incr sync shouldn't be established after full sync corrupted #2756

Conversation

cheniujh commented Jun 24, 2024 • edited by coderabbitai bot Loading

Summary by CodeRabbit

coderabbitai bot commented Jun 24, 2024 • edited Loading

Walkthrough

Changes

Poem

Chat

CodeRabbit Commands (invoked as PR comments)

CodeRabbit Configration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

wangshao1 Jun 26, 2024

Choose a reason for hiding this comment

cheniujh Jul 1, 2024 • edited Loading

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

cheniujh commented Jun 24, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jun 24, 2024 •

edited

Loading

CodeRabbit Configration File (`.coderabbit.yaml`)

cheniujh Jul 1, 2024 •

edited

Loading