[fix](move-memtable) multi replica tables should tolerate minority failures #38003
Conversation
Thank you for your contribution to Apache Doris. Since 2024-03-18, the Document has been moved to doris-website.
run buildall
clang-tidy review says "All clean, LGTM! 👍"
TPC-H: Total hot run time: 39987 ms

run buildall
clang-tidy review says "All clean, LGTM! 👍"
TPC-H: Total hot run time: 40164 ms
TPC-DS: Total hot run time: 173918 ms
ClickBench: Total hot run time: 30.55 s

run buildall
clang-tidy review says "All clean, LGTM! 👍"
TPC-H: Total hot run time: 39787 ms
TPC-DS: Total hot run time: 173924 ms
ClickBench: Total hot run time: 30.63 s

run buildall
clang-tidy review says "All clean, LGTM! 👍"
TPC-H: Total hot run time: 39944 ms
TPC-DS: Total hot run time: 173034 ms
ClickBench: Total hot run time: 30.69 s

run buildall
run buildall
clang-tidy review says "All clean, LGTM! 👍"
clang-tidy review says "All clean, LGTM! 👍"
TPC-H: Total hot run time: 41739 ms
TPC-DS: Total hot run time: 169763 ms
ClickBench: Total hot run time: 29.9 s
LGTM
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
LGTM
…ilures (#38003) Load jobs for multi-replica tables shouldn't fail immediately on any single-replica error. Errors should be recorded and reported for individual replicas of tablets, and checked on commit info.
…44344) Problem Summary: #38003 introduced a problem where the last sink node could report success even when close_wait timed out, which may cause data loss. Previously we made that change hoping to tolerate minority replica failure in this step. However, it turns out the last sink node could miss tablet reports from downstreams in case of close_wait failure. This PR fixes the problem by returning the close_wait error immediately. The most common error in close_wait is a timeout, and it should not be fault-tolerant on a per-replica basis anyway.
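The #44344 fix described above can be illustrated with a minimal sketch. All names here are invented for illustration (this is not the actual Doris code): a close_wait error short-circuits the finish path instead of falling through to the per-replica fault-tolerance check, because after a timeout the sink may have missed downstream tablet reports.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical status type standing in for the engine's Status class.
struct Status {
    bool ok;
    std::string msg;
};

struct ReplicaResult {
    int64_t replica_id;
    bool ok;  // whether this replica's sink succeeded
};

// Sketch of the fixed finish path: a close_wait error is returned
// immediately; only when close_wait succeeded do we apply the
// majority-based per-replica tolerance.
Status finish_load(const Status& close_wait_status,
                   const std::vector<ReplicaResult>& replica_results) {
    if (!close_wait_status.ok) {
        // Fix: do NOT treat a close_wait failure (typically a timeout)
        // as a single-replica failure; propagate it right away.
        return close_wait_status;
    }
    int success = 0;
    for (const auto& r : replica_results) {
        if (r.ok) ++success;
    }
    if (2 * success > static_cast<int>(replica_results.size())) {
        return {true, ""};
    }
    return {false, "majority of replicas failed"};
}
```

With this shape, a load that saw all replicas succeed but timed out in close_wait still fails, which is the safe outcome the PR argues for.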
Proposed changes
Load jobs for multi-replica tables shouldn't fail immediately on any single-replica error.
Errors should be recorded and reported for individual replicas of tablets, and checked on commit info.
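The record-then-check-on-commit idea above can be sketched as follows. This is a hypothetical illustration, not the actual Doris implementation: per-replica outcomes are collected during the load, and at commit time each tablet must still have a strict majority of successful replicas.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Hypothetical per-replica outcome reported into the commit info.
struct TabletCommitInfo {
    int64_t tablet_id;
    int64_t replica_id;
    bool ok;  // whether this replica's write succeeded
};

// Returns true if every tablet involved in the load retains a strict
// majority of successful replicas, i.e. only minority failures occurred.
bool commit_allowed(const std::vector<TabletCommitInfo>& infos,
                    int num_replicas) {
    std::map<int64_t, int> success_count;
    std::set<int64_t> tablets;
    for (const auto& info : infos) {
        tablets.insert(info.tablet_id);
        if (info.ok) ++success_count[info.tablet_id];
    }
    for (int64_t tablet : tablets) {
        if (success_count[tablet] <= num_replicas / 2) {
            return false;  // a majority of this tablet's replicas failed
        }
    }
    return true;
}
```

For a 3-replica table, one failed replica per tablet is tolerated (2 of 3 succeeded), while two failures block the commit.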