-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: incr sync shouldn't be established after full sync corrupted #2756
fix: incr sync shouldn't be established after full sync corrupted #2756
Conversation
WalkthroughThe main changes enhance error handling and state management in the Changes
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (5)
- include/pika_rm.h (1 hunks)
- include/rsync_client.h (2 hunks)
- src/pika_db.cc (1 hunks)
- src/pika_rm.cc (1 hunks)
- src/rsync_client.cc (7 hunks)
Additional comments not posted (10)
include/rsync_client.h (2)
65-67
: New methodIsExitedFromRunning
added for improved state checkingThis method provides a clear way to check if the Rsync client has stopped and all workers have exited, which is crucial for ensuring that resources are cleaned up properly after operations. This change aligns with the PR's goal to handle errors and state transitions more effectively.
98-99
: Introduction oferror_stopped_
andall_worker_exited_
atomic variablesThese new atomic variables enhance the management of the Rsync client's state, particularly in error scenarios and worker management. This is a positive change as it supports robust state tracking and error handling, which is essential for the reliability of the synchronization process.
[APROVED]include/pika_rm.h (1)
119-119
: MethodIsRsyncExited
correctly delegates toIsExitedFromRunning
ofRsyncClient
This change is critical as it ensures that the slave database synchronization state is accurately assessed based on the Rsync client's state. It helps in making the slave state machine more robust by providing accurate information about the Rsync process's status.
src/rsync_client.cc (4)
52-54
: Improved logging condition based onerror_stopped_
The conditional logging based on the
error_stopped_
state is a good addition. It prevents unnecessary log entries when an error has already been flagged, thus keeping the log cleaner and more relevant.
82-85
: Enhanced error handling and state management inThreadMain
The changes in this method significantly improve how errors and the operational state are managed. Specifically:
- Directories are cleaned up if no files need copying.
- The state and worker exit flags are appropriately managed to reflect the current operational status.
- Error conditions lead to immediate state changes and directory cleanup, which is crucial for maintaining system integrity in failure scenarios.
These changes are well-aligned with the PR's goals of robust error handling and state management.
Also applies to: 96-107, 152-163
238-238
: Proper setting oferror_stopped_
on snapshot UUID mismatchThis line ensures that the
error_stopped_
flag is set when there is a mismatch in the snapshot UUIDs, which is a critical error scenario. This change is essential for preventing incorrect data synchronization and potential data corruption.
334-335
: Resetting state and error flags before metadata copyingResetting the
state_
toRUNNING
anderror_stopped_
tofalse
before starting a new operation ensures that the Rsync client is in a clean state. This is a good practice, especially in a complex multi-threaded environment, to avoid carrying over previous errors or states into new operations.src/pika_db.cc (2)
404-406
: Elevated logging level from WARNING to ERROR for missing slave DB.This change is appropriate as it correctly emphasizes the severity of a missing slave database in a replication setup, which could lead to significant data consistency issues.
408-412
: Enhanced error handling by setting replication state tokTryConnect
on missing snapshot file.This change ensures that the system attempts to re-establish a connection when critical snapshot files are missing, promoting robustness in error scenarios.
src/pika_rm.cc (1)
993-993
: Ensure that theIsRsyncExited
method is robust and correctly implemented.The method
IsRsyncExited
is used here to decide whether to update the master offset based on the rsync process's state. This is critical for ensuring that synchronization does not proceed based on incomplete or corrupt data.Verification successful
Ensure that the
IsRsyncExited
method is robust and correctly implemented.The method
IsExitedFromRunning
, which is called byIsRsyncExited
, checks if thestate_
isSTOP
and if all workers have exited. This implementation is robust and accurately reflects the state of the rsync process.
include/rsync_client.h: bool IsExitedFromRunning() { return state_.load() == STOP && all_worker_exited_.load(); }
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the implementation of `IsRsyncExited` method in `rsync_client.cc` to ensure it accurately reflects the rsync process state. # Test: Search for the method implementation. Expect: Accurate state reflection of the rsync process. rg --type cpp --multiline "bool RsyncClient::IsRsyncExited()"Length of output: 63
Script:
#!/bin/bash # Description: Locate the class `RsyncClient` and list all its methods to identify if `IsRsyncExited` or a similar method exists. # Step 1: Search for the class `RsyncClient` to find its definition. rg --type cpp --multiline "class RsyncClient" # Step 2: List all methods within the `RsyncClient` class to identify the presence of `IsRsyncExited` or similar. rg --type cpp --multiline "RsyncClient::" --context 10Length of output: 13128
Script:
#!/bin/bash # Description: Search for any methods in the RsyncClient class that might indicate the state of the rsync process. # Step 1: Search for methods in RsyncClient that relate to the state of the rsync process. rg --type cpp --multiline "RsyncClient::" --context 5 | grep -i "state" # Step 2: Inspect the usage of IsRsyncExited in the codebase to understand its context. rg --type cpp "IsRsyncExited"Length of output: 668
Script:
#!/bin/bash # Description: Locate and review the implementation of `IsExitedFromRunning` to ensure it accurately reflects the rsync process state. # Step 1: Search for the `IsExitedFromRunning` method in the codebase. rg --type cpp "IsExitedFromRunning"Length of output: 176
Script:
#!/bin/bash # Description: Locate and review the implementation of `IsExitedFromRunning` to ensure it accurately reflects the rsync process state. # Step 1: Search for the implementation of `IsExitedFromRunning` in `rsync_client.h` or related `.cc` files. rg --type cpp --multiline "bool IsExitedFromRunning" --context 10Length of output: 1204
…o fix_full_sync_interupt_exit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- src/pika_db.cc (1 hunks)
- src/rsync_client.cc (7 hunks)
Files skipped from review as they are similar to previous changes (2)
- src/pika_db.cc
- src/rsync_client.cc
if (!error_stopped_.load()) { | ||
LOG(INFO) << "RsyncClient copy remote files done"; | ||
} else { | ||
if (DeleteDirIfExist(dir_)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
创建/删除dir的逻辑我印象中外层也有,这个可能你也要梳理下,看会不会出现删了之后,没创建就直接用的情况。或者也可以评估下把创建/删除dir的逻辑统一收敛在rsync_client里。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
创建/删除dir的逻辑我印象中外层也有,这个可能你也要梳理下,看会不会出现删了之后,没创建就直接用的情况。或者也可以评估下把创建/删除dir的逻辑统一收敛在rsync_client里。
该PR中删除的文件夹为DB粒度,即:./dbsync/dbname, 如./dbsync/db0,经过梳理,相关的操作有两类:
1 RsyncClient的dir_成员,其值就是./dbsync/dbname,经过梳理可以确定,RsyncClient内部有几处打开/新建文件的逻辑都要求./dbsync/dbname路径已经存在,否则会出现非预期的文件打开/创建失败的错误。所以 ./dbsync/dname需要在RsynClient执行Init()时,或者之前就创建好。
2 DB的dbsync_path_成员,其值也是./dbsync/dbname, 被用在两个地方:1 在SlaveDB发出DBSync请求时,用于创建./dbsync/dbname目录 2 在全量结束,RsyncClient结束以后,用于拼接info文件路径来检查是否全量成功以及在替换DB实例时作为ChangeDB函数的入参。
综合1, 2可以得到结论:RsynClient执行全量传输时必须要求./dbsync/dbname路径已经存在,而该路径是每次在Slave发出全量同步请求时就会建立,所以当前PR的改法正确性是没有问题的,因为该路径(./dbsync/dbname)一定会先于RsynClient开始工作前被创建。
但是似乎存在另外一个问题,就是全量同步的断点续传功能按照目前这个逻辑,似乎就不能起作用(每次发出DBSync请求前,都会删除./dbsync/dbname目录以后再重新建,这样也会把本来能复用于续传的文件也清空),这个问题后面进一步测试再做考虑,如果该问题真的存在,则另起一个PR处理。
具体的梳理如下:
1.1 RsyncClient::dir_被用于:CopyRemoteFile函数,rsync_worker拷贝文件时,在新建文件时会去使用dir_拼接文件名来新文件,使用open系统调用新建文件,此时的前提是dir_目录要存在,否则会建立文件失败(要求./dbsync/dbname)
1.2 RsyncClient::dir_被用于:在RsyncClient的ThreadMain中,GetLocalMetaFilepath也使用了dir_用于拼接DUMP_META的文件名,此处也和1类似,需要目录先存在,才能成功打开/创建DUMP_META,所以如果目录不存在会导致非预期的错误(文件建立失败)
1.3 RsyncClient::dir_被用于:LoadLocalMeta<-ComparisionUpdate<-RsyncCli::Init()<-SyncSlaveDB::ActivateRsync,在这里dir_被用于拼接DUMP_META文件的路径,这里试图打开DUMP_META文件来实现断点续传,如果文件夹都不存在,文件自然也不存在,但这里有相应的处理链路,如果文件夹都不存在就执行这里也没有问题
1.4 RsyncClient::dir_被用于:在CleanUpExPireFiles<-ComparisionUpdate中,这里使用dir_用于拼接过期的dump文件,如果文件夹都不存在,自然不会发生删除,这里也没有问题
1.5 RsyncClient::dir_被用于:UpdateLocalMeta<-ComparisionUpdate,这里也会使用dir_去拼接DUMP_META的文件名,并且新建一个DUMP_META, 如果此时文件夹不存在,也是要出问题的 (文件建立失败)
2.1 DB::dbsync_path_被用于: DB::PrepareRsync函数,负责创建./dbsync/db_name目录,他会先将./dbsync/db_name目录删除,然后重建该目录,并且也会将其下的inst目录创建好(0,1,2...)。 从这里可以看出全量同步的断点续传逻辑目前应该没法工作,因为每次发DBSync请求之前都会把./dbsync/db_name目录清空。
2.2 DB::dbsync_path_被用于:在RsyncClient退出以后,于TryUpdateMasterOffset中用于拼接info文件的地址,如果发现info文件不存在就不继续往下走(本PR也在这个分支加了状态流转)
2.3 DB::dbsync_path_被用于:ChangeDB(dbsync_path_)<-TryUpdateMasterOffset,也是RsyncClient结束以后,去用这个db_sync_目录来做RocksDB实例的替换
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- src/pika_db.cc (2 hunks)
Additional comments not posted (3)
src/pika_db.cc (3)
404-406
: Correctly updated logging and state setting for missingslave_db
.Changing the logging level to ERROR and setting the replication state to
kError
ensures that the system correctly identifies and logs critical issues with the slave database.
409-413
: Properly added check forinfo_path
existence and updated replication state.Introducing the
info_path
variable and checking its existence ensures that the synchronization process only proceeds if the necessary info file is present. Setting the replication state tokTryConnect
allows the system to attempt reconnection if the file is missing.
481-482
: Correctly updated replication state for missingmaster_db
.Setting the replication state to
kError
if the master database is missing ensures that the system correctly identifies and logs critical issues with the master database.
…enAtomFoundation#2756) * add error_exited for rsyncClient * add some logs * add some state transition based on reviewer's opinion --------- Co-authored-by: cheniujh <1271435567@qq.com>
…enAtomFoundation#2756) * add error_exited for rsyncClient * add some logs * add some state transition based on reviewer's opinion --------- Co-authored-by: cheniujh <1271435567@qq.com>
这个PR修复了 Issue #2742 中的第二个问题:
问题描述: 从节点在全量同步失败,RsynClient异常退出之后尝试进行增量连接竟然成功了,这是不对的。
原因: 一方面,RsyncClient如果异常退出,对于上层主从状态机来说没有别的信号告知,所以才会走了继续尝试增量连接的链路。另一方面,按理说如果全量失败,那么使用全量文件打开一个新RocksDB实例应当也是失败的(RocksDB Apply完manifest文件后会检查内存中的current version中的fileset是否和磁盘上的一致)。但这次case中,全量同步发生中断,恰好只拉取了部分SST文件,RocksDB的CURRENT, MANIFEST文件都没有拉过来,于是在Replace DB的阶段,RocksDB打开新实例时,因为找不到CURRENT文件,会直接起一个空实例,所以没有报错。
解决方案:
1 在RsynClient内部增加error_stop_标志位,如果RsyncClient异常退出(也就是全量同步异常退出,文件没有拉取完毕),就直接删除snapshot所对应的文件夹(./dbsync/dbx)
2 通过1中删除文件夹的做法,能在不提高RsyncClient和上层耦合的情况下,将错误状态以文件夹不存在的形式传递给上层的从节点状态机,从节点状态机发现snapshot文件夹不存在的话会将SlaveDB状态转为TryConnect进行连接重试
This PR fixes the second issue in Issue #2742:
Problem Description: The issue was that after a full synchronization was successful, the attempt to perform an incremental connection succeeded, which was incorrect.
Reason: In this case, the full synchronization was interrupted, and only some SST files were fetched; the CURRENT and MANIFEST files were not fetched. Therefore, during the Replace DB phase, when RocksDB opened a new instance, it started an empty instance directly because it couldn't find the CURRENT file, so no error was reported.
Solution:
error_stop_
flag insideRsyncClient
, which will directly delete the folder corresponding to the snapshot (./dbsync/dbx
) ifRsyncClient
exits abnormally (i.e., if the full synchronization exits abnormally and the files were not completely fetched).RsyncClient
can convey the error information to the upper-layer slave state machine by making the folder non-existent. If the slave node finds that the snapshot folder does not exist, it will switch the slave state toTryConnect
and attempt to reconnect.Summary by CodeRabbit
Bug Fixes
New Features