Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix UpgradeAndBackupRestore test failures #9240

Merged
merged 9 commits into from
Feb 1, 2023
Merged

Fix UpgradeAndBackupRestore test failures #9240

merged 9 commits into from
Feb 1, 2023

Conversation

jzhou77
Copy link
Contributor

@jzhou77 jzhou77 commented Jan 26, 2023

The Attrition can RebootAndDelete tlogs in remote DC such that the remote is unusable and blocking recovery to fully_recovered state. In fact, the FirstCycleTest can only reach the accepting_commits state.

In the part 2 of the restarting test, the runTests() wait for quietDatabase() to reach fully fully_recovered state, but was stuck in the accepting_commits state. So part 2 timed out.

Added a new option "disabledFailureInjectionWorkloads" to toml test spec so that we can disallow randomly injecting Attrition workload that may RebootAndDelete tlogs for this test. As a result, part 2 can now reach fully_recovered state and begin testing the workload.

Tests were done in #9231

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

Otherwise, the Attrition can RebootAndDelete tlogs in remote DC such that the
remote is unusable and blocking recovery to fully_recovered state. In fact,
the FirstCycleTest can only reach the accepting_commits state.

In the part 2 of the restarting test, the runTests() wait for quietDatabase()
to reach fully fully_recovered state, but was stuck in the accepting_commits
state.
This is needed for UpgradeAndBackupRestore-1 to make sure the DB is recoverable
so that the part 2 can start.
Because of the new option "disabledFailureInjectionWorkloads" is not available
until 7.2.4.
@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos-m1 on macOS Monterey 12.x

  • Commit ID: afcca4a
  • Duration 0:09:05
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /opt/homebrew/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos on macOS Monterey 12.x

  • Commit ID: afcca4a
  • Duration 0:10:59
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: afcca4a
  • Duration 0:13:33
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: afcca4a
  • Duration 0:16:06
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: afcca4a
  • Duration 0:16:42
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: afcca4a
  • Duration 0:18:05
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: 7ce6486
  • Duration 0:16:09
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: 7ce6486
  • Duration 0:35:20
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: 7ce6486
  • Duration 0:38:34
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 7ce6486
  • Duration 1:13:17
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: 4641c9b
  • Duration 0:17:21
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: 4641c9b
  • Duration 0:37:02
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: 4641c9b
  • Duration 0:41:34
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 4641c9b
  • Duration 1:25:33
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@fdb-windows-ci
Copy link
Collaborator

Doxense CI Report for Windows 10

Copy link
Collaborator

@xis19 xis19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this patch works, yet I am wondering if doing a fix like

[[test]]
 testTitle = 'SubmitBackup'
 simBackupAgents= 'BackupToFile'
 clearAfterTest = false
 runConsistencyCheck=false

     [[test.workload]]
     testName = 'SubmitBackup'
     delayFor = 0
     stopWhenDone = false

     [[test.workload]]
     testName = 'Attrition'
     machinesToKill = 10
     machinesToLeave = 3
     reboot = true
     testDuration = 30.0
     disableFaultInjection = true

would bring more flexibility.

fdbserver/tester.actor.cpp Show resolved Hide resolved
@jzhou77
Copy link
Contributor Author

jzhou77 commented Jan 30, 2023

I think this patch works, yet I am wondering if doing a fix like

[[test]]
 testTitle = 'SubmitBackup'
 simBackupAgents= 'BackupToFile'
 clearAfterTest = false
 runConsistencyCheck=false

     [[test.workload]]
     testName = 'SubmitBackup'
     delayFor = 0
     stopWhenDone = false

     [[test.workload]]
     testName = 'Attrition'
     machinesToKill = 10
     machinesToLeave = 3
     reboot = true
     testDuration = 30.0
     disableFaultInjection = true

would bring more flexibility.

This doesn't work. [[test.workload]] already specifies reboot = true, which is what we want.

The problem this PR addresses is that the simulator will add additional 'Attrition' even if it's already specified. And when the simulator adds additional Attrition workloads, it can do types other than reboot = true, which is what we want to prevent.

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: 65b0b75
  • Duration 0:10:34
  • Result: ❌ FAILED
  • Error: reference not found for primary source and source version 65b0b75abee0981fee1960bd710966a124672565
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: 65b0b75
  • Duration 0:10:40
  • Result: ❌ FAILED
  • Error: reference not found for primary source and source version 65b0b75abee0981fee1960bd710966a124672565
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: 65b0b75
  • Duration 0:11:46
  • Result: ❌ FAILED
  • Error: reference not found for primary source and source version 65b0b75abee0981fee1960bd710966a124672565
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 65b0b75
  • Duration 0:11:59
  • Result: ❌ FAILED
  • Error: reference not found for primary source and source version 65b0b75abee0981fee1960bd710966a124672565
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: 58fcc90
  • Duration 0:16:25
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@fdb-windows-ci
Copy link
Collaborator

Doxense CI Report for Windows 10

@jzhou77 jzhou77 requested a review from xis19 January 31, 2023 18:51
@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: 58fcc90
  • Duration 0:47:17
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: 58fcc90
  • Duration 0:48:18
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 58fcc90
  • Duration 1:16:46
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: fcab17c
  • Duration 0:22:32
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: fcab17c
  • Duration 1:01:09
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: fcab17c
  • Duration 1:07:18
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@fdb-windows-ci
Copy link
Collaborator

Doxense CI Report for Windows 10

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: fcab17c
  • Duration 1:14:14
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

Copy link
Collaborator

@xis19 xis19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the issue is much more complicated than I thought as FailureInjection is an invisible way of adding workloads, outside the existing toml file. Now there are certain factors impacting those failure injection workloads:

  1. Explicitly added in the toml file
  2. Certain workloads can claim they dislike some/all of the failure injection workloads, e.g. LowLatency hates Attrition. (and this is defined in the workload file)
  3. The workload itself can prevent its injection, base on the use of the database, the count of the existing same workloads, and/or a random number. (and this is defined in tester.actor.cpp, by default; or overridden in the actual workload .actor.cpp file)
  4. With this new rule, a compound workload can also prevent creating certain type of workloads. (and this is defined in the toml file)
    (5. Possibly other places.)

I feel if possible, it might be better to unify this kind of configuration into some kind of TOML/DSL file so there is a SPOT rather than three or four different places. But it would be a much bigger change that would take too long time.

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: 9c5ce68
  • Duration 0:17:52
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@fdb-windows-ci
Copy link
Collaborator

Doxense CI Report for Windows 10

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 9c5ce68
  • Duration 0:28:30
  • Result: ❌ FAILED
  • Error: Error while executing command: docker build --label "org.foundationdb.version=${FDB_VERSION}" --label "org.foundationdb.build_date=${BUILD_DATE}" --label "org.foundationdb.commit=${COMMIT_SHA}" --progress plain --build-arg FDB_VERSION="${FDB_VERSION}" --build-arg FDB_LIBRARY_VERSIONS="${FDB_VERSION}" --build-arg FDB_WEBSITE="${FDB_WEBSITE}" --tag foundationdb/ycsb:${FDB_VERSION}-${COMMIT_SHA} --file Dockerfile --target ycsb .. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos-m1 on macOS Monterey 12.x

  • Commit ID: 9c5ce68
  • Duration 0:32:24
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos on macOS Monterey 12.x

  • Commit ID: 9c5ce68
  • Duration 0:44:44
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: 9c5ce68
  • Duration 1:08:39
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: 9c5ce68
  • Duration 1:11:01
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@jzhou77
Copy link
Contributor Author

jzhou77 commented Feb 1, 2023

TestFailure" Machine="[abcd::3:4:3:1]:1" ID="e39eda725f336b1d" Error="restore_destination_not_empty" in the second part is not related to this change.

@jzhou77 jzhou77 merged commit b911bcf into apple:main Feb 1, 2023
@sfc-gh-nwijetunga
Copy link
Collaborator

sfc-gh-nwijetunga commented Feb 1, 2023

TestFailure" Machine="[abcd::3:4:3:1]:1" ID="e39eda725f336b1d" Error="restore_destination_not_empty" in the second part is not related to this change.

I know what the issue for this is, i'll fix it in: #9261

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants