-
Notifications
You must be signed in to change notification settings - Fork 0
VV Bugs
20220331-021512-jzhou-c8671c6b4b13a7c7 at commit e9659b5dd40237272faf096e0f83aa2f1a0b954d, 4 failures:
- MutationLogReaderCorrectness.toml, Consistency check: Storage server unavailable
- KillRegionCycle.toml, "Operation timed out" ErrorCode="1004" Reason="Unable to perform consistency check"
- ConfigureTest.toml, Assertion self->commitProxies.size() <= self->configuration.getDesiredCommitProxies() failed
- restarting/from_7.0.0/SnapCycleRestart-1.txt, timed_out
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/restarting/from_7.0.0/SnapCycleRestart-1.txt -s 396753356 -b off -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/ConfigureTest.toml -s 571850063 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/KillRegionCycle.toml -s 694565167 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/MutationLogReaderCorrectness.toml -s 764761817 -b on
- MutationLogReaderCorrectness.toml, "Consistency check: Key server unavailable
- MutationLogReaderCorrectness.toml, Consistency check: Storage server unavailable
- MutationLogReaderCorrectness.toml, "Consistency check: Storage server unavailable"
- MemoryLifetime.toml, "Consistency check: Key server unavailable"
- Watches.toml, Consistency check: Key server unavailable
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/MutationLogReaderCorrectness.toml -s 126497979 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/MutationLogReaderCorrectness.toml -s 985737887 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/MutationLogReaderCorrectness.toml -s 703271462 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/MemoryLifetime.toml -s 748077049 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/Watches.toml -s 966816761 -b on
- CycleAndLock.toml, "Operation timed out" ErrorCode="1004" Reason="Error checking workload"
- MemoryLifetime.toml, "Unable to perform consistency check" timeout
- CycleRollbackClogged.toml, "Unable to perform consistency check"
- MoveKeysCycle.toml, "Unable to perform consistency check"
- FastTriggeredWatches.toml, "Unable to perform consistency check"
- DDBalanceAndRemoveStatus.toml, TryCommitError Severity="40" ErrorKind="Unset" DateTime="2022-03-31T05:32:40Z" Error="stale_version_vector"
- LocalRatekeeper.toml, "Consistency check: Key server unavailable"
- DDBalanceAndRemoveStatus.toml, TryCommitError Severity="40" ErrorKind="Unset" DateTime="2022-03-31T05:39:23Z" Error="stale_version_vector"
- restarting/from_5.1.7_until_6.3.0/DrUpgradeRestart-1.txt, OldBinary="fdbserver-6.1.11"
- Sideband.toml, "Unable to perform consistency check"
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/CycleAndLock.toml -s 177868687 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/MemoryLifetime.toml -s 569598442 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/rare/CycleRollbackClogged.toml -s 71110679 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/MoveKeysCycle.toml -s 737584918 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/FastTriggeredWatches.toml -s 538257192 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/DDBalanceAndRemoveStatus.toml -s 676734407 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/LocalRatekeeper.toml -s 635334347 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/DDBalanceAndRemoveStatus.toml -s 619864410 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/restarting/from_5.1.7_until_6.3.0/DrUpgradeRestart-1.txt -s 859892844 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/restarting/from_5.1.7_until_6.3.0/DrUpgradeRestart-2.txt -s 859892845 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/Sideband.toml -s 781729016 -b on
Exclude all BlobGranuleVerify* tests
- ddbalance.toml, TryCommitError stale_version_vector
- ddbalance.toml, TryCommitError stale_version_vector
- ChangeFeeds.toml, timed_out SetupAndRunError
- MemoryLifetime.toml, timed_out "Unable to perform consistency check"
- LocalRatekeeper.toml, "Error checking workload" timed_out
- SwizzledDdBalance.toml, "Consistency check: Storage server unavailable"
- CacheTest.toml, "Unable to perform consistency check"
- DDBalanceAndRemoveStatus.toml, TryCommitError stale_version_vector
- WriteDuringReadClean.toml, "Consistency check: Storage server unavailable"
- MemoryLifetime.toml, "Consistency check: Key server unavailable"
- SwizzledRollbackSideband, "Unable to perform consistency check"
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/ddbalance.toml -s 935272145 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/ddbalance.toml -s 935272145 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/ChangeFeeds.toml -s 959310315 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/MemoryLifetime.toml -s 767590165 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/LocalRatekeeper.toml -s 906848119 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/SwizzledDdBalance.toml -s 133872915 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/CacheTest.toml -s 74071928 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/DDBalanceAndRemoveStatus.toml -s 670800871 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/WriteDuringReadClean.toml -s 227133419 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/MemoryLifetime.toml -s 436433704 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/SwizzledRollbackSideband.toml -s 505011098 -b on
Exclude all BlobGranuleVerify* tests
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/MutationLogReaderCorrectness.toml -s 724174212 -b off -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/TxnStateStoreCycleTest.toml -s 172048999 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/BackupCorrectness.toml -s 236650744 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/ApiCorrectnessSwitchover.toml -s 740867558 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/CloggedSideband.toml -s 529363907 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/SwizzledCycleTest.toml -s 722943714 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/Unreadable.toml -s 199412940 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/Watches.toml -s 233627689 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/KillRegionCycle.toml -s 464781503 -b on -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/MutationLogReaderCorrectness.toml -s 793190728 -b on
Disable VV 20220331-174218-jzhou-d102bd86742dcc3f BlobGranuleVerify* tests failed a lot.
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/TimeKeeperCorrectness.toml -s 625661473 -b on commit: 64d465803
RatekeeperGetSSListError transaction_too_old, which throttles traffic
The transaction_too_old
comes from readRange()
, specifically
data->storageVersion() > version
107.423116 VVDebug Machine=[abcd::2:0:1:0]:1 ID=cea5dbef902ea032 CommitVersion=-1 Version=83960984 ReqVersion=83960984 DataOldest=83417393 DataVersion=84561488 StorageVersion=83417393 CD,SS
107.423116 VVDebugBeforeReadRange Machine=[abcd::2:0:1:0]:1 ID=cea5dbef902ea032 Version=83960984 DataOldest=83417393 StorageVersion=83417393 DataVersion=84561488 CD,SS
107.423116 VVDebugReadRange Machine=[abcd::2:0:1:0]:1 ID=cea5dbef902ea032 Version=83960984 StorageVersion=84461488 CD,SS
107.423116 VVDebugErr Machine=[abcd::2:0:1:0]:1 ID=cea5dbef902ea032 Error=transaction_too_old ErrorDescription=Transaction is too old to perform reads or be committed ErrorCode=1007 CD,SS
--knob_proxy_use_resolver_private_mutations false 53.694048 SendingPrivatized_ChangeFeed ID=12b8e7fee8cc42f1 M=code: SetValue param1: \xff\xff\x02/feed/@8x\xd0\xbe7\xa8n\x84\xe1\x11\xd1!\x83\x93W param2: \x00\x00\x01q\xb0\x00\xdb\x0f\x00\x00\x00\x00\x01\x00\x00\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\x00\x00\x00\x00 BK,CC,CD,CP,RV,SS,TL
--knob_proxy_use_resolver_private_mutations true
has no private mutations.
The problem is that change feed metadata mutations use \xff\x02/feed/ prefix, which was not considered as "metadata mutations", thus not sent to the resolvers. This makes the private mutation generation not possible for change feed if the knob PROXY_USE_RESOLVER_PRIVATE_MUTATIONS
is on. Fix by making it a metadata mutation.
commit 402b463a2 and earlier at fe0c137
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/rare/SwizzledLargeApiCorrectness.toml -s 220943799 -b on
In repairDeadDatacenter()
, the first ChangeConfig finished. But the recovery only reached "accepting_commits" 11, and didn't reach the RecoveryState::STORAGE_RECOVERED
state. Thus blocking waitForQuietDatabase()
from completing.
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/KillRegionCycle.toml -s 343565765 -b on
The workload kills the primary region, and DB starts with usable_regions:1
on the originally remote region. The test timeout because QuietDatabase tries to getMaxStorageServerQueueSize from SS in the killed region
The machine was killed and thus QuietDatabase can't reach the Worker 2.0.1.1:1.
517.016568 KillMachineProcess ID=0000000000000000 KillType=0 Process=name: Server address: 2.0.1.1:1 zone: ee2ab9a490f3c19bbe10203fe03c0e66 datahall: 0 class: storage excluded: 0 cleared: 0 StartingClass=storage Failed=0 Excluded=0 Cleared=0 Rebooting=0 TS
517.016568 FailMachine ID=0000000000000000 Name=Server Address=2.0.1.1:1 ZoneId=ee2ab9a490f3c19bbe10203fe03c0e66 Process=name: Server address: 2.0.1.1:1 zone: ee2ab9a490f3c19bbe10203fe03c0e66 datahall: 0 class: storage excluded: 0 cleared: 0 Rebooting=0 Protected=0 Backtrace=addr2line -e fdbserver.debug -p -C -f -i 0x32f1ab9 0x31d2683 0x31a54b3 0x31a7a06 0x25d4608 0x25d694a 0x1035748 0x31d0936 0x31d0774 0x31d03f4 0x31d07c2 0x31d0e8c 0x1035748 0x32a0c16 0x3290fad 0x31a1ddf 0x188d495 0x7f8f6bdfe555 TS
543.357291 QuietDatabaseWorker ID=0000000000000000 Addr=2.3.1.1:1 TS
543.357291 QuietDatabaseWorker ID=0000000000000000 Addr=2.1.1.0:1 TS
543.357291 QuietDatabaseWorker ID=0000000000000000 Addr=2.5.1.1:1 TS
543.357291 QuietDatabaseWorker ID=0000000000000000 Addr=2.5.1.0:1 TS
543.357291 QuietDatabaseWorker ID=0000000000000000 Addr=3.4.3.1:1 TS
543.357291 QuietDatabaseWorker ID=0000000000000000 Addr=3.4.3.6:1 TS
543.357291 QuietDatabaseWorker ID=0000000000000000 Addr=3.4.3.4:1 TS
543.357291 QuietDatabaseWorker ID=0000000000000000 Addr=3.4.3.2:1 TS
543.357291 QuietDatabaseWorker ID=0000000000000000 Addr=3.4.3.5:1 TS
543.357291 QuietDatabaseWorker ID=0000000000000000 Addr=2.4.1.1:1 TS
543.357291 QuietDatabaseWorker ID=0000000000000000 Addr=2.1.1.1:1 TS
543.357291 QuietDatabaseWorker ID=0000000000000000 Addr=2.3.1.0:1 TS
543.357291 QuietDatabaseWorker ID=0000000000000000 Addr=3.4.3.3:1 TS
543.357291 QuietDatabaseWorker ID=0000000000000000 Addr=3.4.3.7:1 TS
543.357291 QuietDatabaseFailure ID=0000000000000000 Reason=Could not find worker for storage server SS=c70f451fa0d30ea4 TS
543.357291 QuietDatabaseConsistencyCheckStartError ID=0000000000000000 Error=attribute_not_found ErrorDescription=Attribute not found ErrorCode=2014 TS
543.357291 QuietDatabaseConsistencyCheckStartRetry ID=0000000000000000 Error=attribute_not_found ErrorDescription=Attribute not found ErrorCode=2014 NotReady0=storageQueueSize TS
...
20533.496186 QuietDatabaseFailure ID=0000000000000000 Reason=Could not find worker for storage server SS=c70f451fa0d30ea4 TS
20533.496186 QuietDatabaseConsistencyCheckStartError ID=0000000000000000 Error=attribute_not_found ErrorDescription=Attribute not found ErrorCode=2014 TS
20533.496186 QuietDatabaseConsistencyCheckStartRetry ID=0000000000000000 Error=attribute_not_found ErrorDescription=Attribute not found ErrorCode=2014 NotReady0=storageQueueSize TS
11.999627 Role ID=c70f451fa0d30ea4 Machine="2.0.1.1:1" As=StorageServer Transition=Begin Origination=Recruited OnWorker=172fee3d8abd0df9 IsTSS=0 StorageEngine=ssd-2 SS
514.274568 Last StorageMetrics
event.
504.148461 2.0.1.0:1 ID=9c32fad3248da583 12 all_logs_recruited
521.035861 2.1.1.0:1 ID=5f4ac01671b8c109 0 reading_coordinated_state
521.036061 2.1.1.0:1 ID=5f4ac01671b8c109 1 locking_coordinated_state TLogs=2 ActiveGenerations=4 MyRecoveryCount=16 ForceRecovery=0
526.006188 2.1.1.0:1 ID=282524297a0788b3 0 reading_coordinated_state
526.006388 2.1.1.0:1 ID=282524297a0788b3 1 locking_coordinated_state TLogs=2 ActiveGenerations=4 MyRecoveryCount=17 ForceRecovery=1
526.013814 2.1.1.0:1 ID=282524297a0788b3 3 reading_transaction_system_state
526.020392 2.1.1.0:1 ID=282524297a0788b3 7 recruiting_transaction_servers Conf={"backup_worker_enabled":0,"log_spill":2,"log_version":7,"logs":3,"perpetual_storage_wiggle":0,"perpetual_storage_wiggle_locality":"data_hall:3","redundancy_mode":"single","regions":[{"datacenters":[{"id":"1","priority":1},{"id":"3","priority":1,"satellite":1,"satellite_logs":4},{"id":"5","priority":1,"satellite":1,"satellite_logs":6}]},{"datacenters":[{"id":"0","priority":-1},{"id":"2","priority":1,"satellite":1},{"id":"4","priority":1,"satellite":1,"satellite_logs":3}],"satellite_logs":4}... RequiredCommitProxies=1 RequiredGrvProxies=1 RequiredResolvers=1
526.020392 2.1.1.0:1 ID=282524297a0788b3 8 initializing_transaction_servers CommitProxies=2 GrvProxies=1 TLogs=2 Resolvers=1 SatelliteTLogs=0 OldLogRouters=1 StorageServers=0 BackupWorkers=0 PrimaryDcIds=1 RemoteDcIds=0
526.030265 2.1.1.0:1 ID=282524297a0788b3 9 recovery_transaction PrimaryLocality=-1 DcId=1 ClusterId=a10b741ac53f2775
526.045199 2.1.1.0:1 ID=282524297a0788b3 10 writing_coordinated_state TLogList=0: deee4e8847f592e9e9fe79123f4655c6, b4ff06e4416a12093266f19033babc0a
526.045599 2.1.1.0:1 ID=282524297a0788b3 11 accepting_commits StoreType=ssd-2 RecoveryDuration=0.0394102
526.098814 2.1.1.0:1 ID=282524297a0788b3 12 all_logs_recruited
535.051277 2.1.1.0:1 ID=282524297a0788b3 14 fully_recovered FullyRecoveredAtVersion=-1 ClusterId=a10b741ac53f2775
FullyRecoveredAtVersion=-1 looks suspicious.
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/DDBalanceAndRemoveStatus.toml -s 559021579 -b on
221.599731 2.0.1.1:1 ID=a4d6ee6aa90f08e7 11 accepting_commits StoreType=ssd-redwood-1-experimental RecoveryDuration=1.20294 225.000635 2.0.1.1:1 ID=0f89a860f81fa1b7 0 reading_coordinated_state 225.008237 2.0.1.1:1 ID=0f89a860f81fa1b7 1 locking_coordinated_state TLogs=2 ActiveGenerations=9 MyRecoveryCount=27 ForceRecovery=0 225.019047 2.0.1.1:1 ID=0f89a860f81fa1b7 3 reading_transaction_system_state 225.021363 2.0.1.1:1 ID=0f89a860f81fa1b7 7 recruiting_transaction_servers Conf={"backup_worker_enabled":0,"commit_proxies":1,"grv_proxies":1,"log_routers":3,"log_spill":2,"log_version":7,"perpetual_storage_wiggle":0,"perpetual_storage_wiggle_locality":"0","proxies":2,"redundancy_mode":"double","regions":[{"datacenters":[{"id":"0","priority":2},{"id":"2","priority":1,"satellite":1},{"id":"4","priority":1,"satellite":1,"satellite_logs":3}],"satellite_redundancy_mode":"two_satellite_fast"},{"datacenters":[{"id":"1","priority":-1},{"id":"3","priority":1,"satellite":1},{"i... RequiredCommitProxies=1 RequiredGrvProxies=1 RequiredResolvers=1 225.021363 2.0.1.1:1 ID=0f89a860f81fa1b7 8 initializing_transaction_servers CommitProxies=1 GrvProxies=1 TLogs=2 Resolvers=1 SatelliteTLogs=4 OldLogRouters=1 StorageServers=0 BackupWorkers=0 PrimaryDcIds=0 RemoteDcIds=1 225.042557 2.0.1.1:1 ID=0f89a860f81fa1b7 9 recovery_transaction PrimaryLocality=0 DcId=0 ClusterId=d7a1d81209da25ee 225.052305 2.0.1.1:1 ID=0f89a860f81fa1b7 10 writing_coordinated_state TLogList=0: f0912f5214e38a75b85fa6f39439504c, 1469451238ecb02065b9131ea6a9b902 1: 0b4a2c239c25017be45dbfafed43f415, 7b52f2ba69d8dcd5adbcbd5f43bcab84, de5d10bf205dd1fea64a6fcf9ba87587, 1a63ea4f046628cc77834153816645c6 225.062368 2.0.1.1:1 ID=0f89a860f81fa1b7 11 accepting_commits StoreType=ssd-redwood-1-experimental RecoveryDuration=0.0617324
225.062368 MasterRecoveryCommit ID=0f89a860f81fa1b7 TLogs=0: f0912f5214e38a75b85fa6f39439504c, 1469451238ecb02065b9131ea6a9b902 1: 0b4a2c239c25017be45dbfafed43f415, 7b52f2b69d8dcd5adbcbd5f43bcab84, de5d10bf205dd1fea64a6fcf9ba87587, 1a63ea4f046628cc77834153816645c6 RecoveryCount=27 RecoveryTransactionVersion=1010030061 CC,CP,RV,SS,TL
225.105047 SSPrivateMutation ID=4566b55785b9a4c2 Mutation=code: SetValue param1: \xff\xff/globals/lastEpochEnd param2: \xed\xf4=6\x00\x00\x00\x00 Version=1010030061 CC,CP,R,SS,TL 225.105047 SSPrivateMutation ID=4566b55785b9a4c2 Mutation=code: SetValue param1: \xff\xff/globals/primaryLocality param2: \x00 Version=1010030061 CC,CP,RV,SS,TL 1,010,030,061
226.326914 GetLatestCommitVersions ID=0000000000000000 ReadVersion=906655772 Version vector=[{0:0,906651408}{0:1,906651408}{1:0,906651408}{1:1,906651408} maxversion: 906651408] TS 906,655,772 906,651,408
GRV read version 906,655,772 < RecoveryTransactionVersion=1,010,030,061 GRV 7d9676587c594617 is the old one:
221.467340 GrvProxyServerCore ID=7d9676587c594617 MasterId=a4d6ee6aa90f08e7 MasterLifetime=d73b001c36134fe9#1 RecoveryCount=24 CD,DD,GP,MS,RK,SS,TL 225.013383 RoleRemove ID=0000000000000000 Address=2.0.1.0:1 Role=GrvProxyServer NumRoles=9 Value=0 Result=Removed Role CD,DD,GP,LR,MS,RK,SS,TL
221.645999 VVGrv ID=7d9676587c594617 Delta=[{0:0,906651408}{0:1,906651408}{1:0,906651408}{1:1,906651408} maxversion: 906651408] ReqMaxVersion=906600958 CD,DD,GP,LR,MS,RK,SS,TL 221.667009 VVGrv ID=7d9676587c594617 Delta=[{0:0,906651408}{0:1,906651408}{1:0,906651408}{1:1,906651408} maxversion: 906651408] ReqMaxVersion=906600958 CD,DD,GP,LR,MS,RK,SS,TL
225.052305 BWToCore ID=0000000000000000 Epoch=25 TotalTags=2 BeginVersion=805618443 EndVersion=909995022 CC,CP,RV,SS,TL
New GRV: 225.022051 GrvProxyServerCore ID=51a0eb458dcf5bf1 MasterId=0f89a860f81fa1b7 MasterLifetime=d73b001c36134fe9#2 RecoveryCount=26 CD,DD,GP,MS,RK,SS,TL
Many txn started, some get ReadVersion1, some get ReadVersion2. Recovery happened. GRV proxy change detected and version cache is cleared. Old GRV respond with ReadVersion1, version cache is updated to ReadVersion1. A txn starts reading with ReadVersion2, triggering the assertion.