Fix UpgradeAndBackupRestore test failures #9240
Otherwise, the Attrition workload can RebootAndDelete tlogs in the remote DC, leaving the remote unusable and blocking recovery to the fully_recovered state. In fact, the FirstCycleTest can only reach the accepting_commits state. In part 2 of the restarting test, runTests() waits for quietDatabase() to reach the fully_recovered state, but it was stuck in the accepting_commits state.
This is needed for UpgradeAndBackupRestore-1 to make sure the DB is recoverable so that part 2 can start.
Because the new option "disabledFailureInjectionWorkloads" is not available until 7.2.4.
I think this patch works, yet I am wondering if doing a fix like
    [[test]]
    testTitle = 'SubmitBackup'
    simBackupAgents = 'BackupToFile'
    clearAfterTest = false
    runConsistencyCheck = false

        [[test.workload]]
        testName = 'SubmitBackup'
        delayFor = 0
        stopWhenDone = false

        [[test.workload]]
        testName = 'Attrition'
        machinesToKill = 10
        machinesToLeave = 3
        reboot = true
        testDuration = 30.0
        disableFaultInjection = true
would bring more flexibility.
This doesn't work. The problem this PR addresses is that the simulator will add additional 'Attrition' workloads even if one is already specified. And when the simulator adds additional Attrition workloads, it can use types other than ...
It seems the issue is much more complicated than I thought, as FailureInjection is an invisible way of adding workloads, outside the existing toml file. Several factors now affect those failure injection workloads:
1. They can be explicitly added in the toml file.
2. Certain workloads can declare that they dislike some or all of the failure injection workloads, e.g. LowLatency hates Attrition (this is defined in the workload file).
3. The workload itself can prevent its injection, based on the use of the database, the count of existing workloads of the same type, and/or a random number (this is defined in tester.actor.cpp by default, or overridden in the actual workload .actor.cpp file).
4. With this new rule, a compound workload can also prevent the creation of certain types of workloads (this is defined in the toml file).
5. Possibly other places.
I feel that, if possible, it would be better to unify this kind of configuration into some kind of TOML/DSL file so there is a SPOT rather than three or four different places. But that would be a much bigger change and would take too long.
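To make the layering above concrete, here is a rough Python model of how those factors might combine into one decision. It is only a sketch: `InjectionContext`, `injection_decision`, every field name, and the `instance_limit` default are hypothetical and do not correspond to anything in tester.actor.cpp. The random-number factor is left out to keep the result deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class InjectionContext:
    """Hypothetical inputs mirroring the factors listed above."""
    explicit_workloads: list = field(default_factory=list)    # named in the toml file
    disliked_by_workloads: set = field(default_factory=set)   # declared in workload code
    disabled_in_spec: set = field(default_factory=set)        # disabled via the toml spec
    instance_limit: int = 3                                    # assumed per-workload self-limit


def injection_decision(candidate: str, ctx: InjectionContext) -> tuple:
    """Return (allowed, reason) for injecting the workload named `candidate`."""
    # Factor 4: the test spec disables this workload outright.
    if candidate in ctx.disabled_in_spec:
        return False, "disabled in test spec"
    # Factor 2: some workload in the test has declared it dislikes the candidate.
    if candidate in ctx.disliked_by_workloads:
        return False, "disliked by a running workload"
    # Factors 1 and 3: the candidate itself declines once enough copies exist.
    if ctx.explicit_workloads.count(candidate) >= ctx.instance_limit:
        return False, "workload declined: instance limit reached"
    return True, "allowed"


if __name__ == "__main__":
    ctx = InjectionContext(explicit_workloads=["Attrition"],
                           disabled_in_spec={"Attrition"})
    print(injection_decision("Attrition", ctx))
```

Each rule in this model has a different owner (toml, workload code, tester defaults), so explaining one decision means reading several files. That is the argument for a single point of truth.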
I know what the issue for this is; I'll fix it in #9261.
The Attrition workload can RebootAndDelete tlogs in the remote DC, leaving the remote unusable and blocking recovery to the fully_recovered state. In fact, the FirstCycleTest can only reach the accepting_commits state.
In part 2 of the restarting test, runTests() waits for quietDatabase() to reach the fully_recovered state, but it was stuck in the accepting_commits state, so part 2 timed out.
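The timeout can be pictured with a small polling loop. This is an illustrative Python sketch only, not the actual runTests() or quietDatabase() logic: `wait_for_fully_recovered`, `read_state`, and `max_polls` are hypothetical names, and only the two state strings come from this description.

```python
def wait_for_fully_recovered(read_state, max_polls: int) -> bool:
    """Poll `read_state()` up to `max_polls` times.

    Returns True once the reported state is "fully_recovered" and False if the
    poll budget runs out first, which models the test timing out.
    """
    for _ in range(max_polls):
        if read_state() == "fully_recovered":
            return True
    return False


if __name__ == "__main__":
    # A cluster that eventually finishes recovery.
    states = iter(["accepting_commits", "accepting_commits", "fully_recovered"])
    print(wait_for_fully_recovered(lambda: next(states), max_polls=5))

    # A cluster whose remote tlogs were deleted never leaves accepting_commits.
    print(wait_for_fully_recovered(lambda: "accepting_commits", max_polls=5))
```

In the second case no amount of waiting helps, so the fix has to prevent the cluster from entering that state in part 1.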
Added a new option "disabledFailureInjectionWorkloads" to the toml test spec so that, for this test, we can disallow the random injection of Attrition workloads that may RebootAndDelete tlogs. As a result, part 2 can now reach the fully_recovered state and begin testing the workload.
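The intended effect of the option can be sketched as a filter applied before random injection. This Python model rests on assumptions: the comma-separated value format, the helper names `parse_disabled` and `pick_injected_workloads`, the `probability` parameter, and the placeholder workload name `OtherInjector` are all hypothetical and not taken from the tester code.

```python
import random


def parse_disabled(value: str) -> set:
    """Split an assumed comma-separated option value into workload names."""
    return {name.strip() for name in value.split(",") if name.strip()}


def pick_injected_workloads(candidates, disabled_value, rng, probability=0.5):
    """Randomly choose workloads to inject, skipping any disabled by the spec."""
    disabled = parse_disabled(disabled_value)
    return [
        name
        for name in candidates
        if name not in disabled and rng.random() < probability
    ]


if __name__ == "__main__":
    rng = random.Random(1)
    # "OtherInjector" is a placeholder, not a real workload name.
    picked = pick_injected_workloads(["Attrition", "OtherInjector"], "Attrition", rng)
    print(picked)
```

Whatever the random draw, Attrition is never selected once it is on the disabled list, which is the property part 1 of the restarting test relies on.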
Tests were done in #9231