
WaitForPrev Optimization

Jingyu Zhou edited this page Mar 2, 2022 · 3 revisions

-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/MutationLogReaderCorrectness.toml -s 203110408 -b on

135.730840 ConsistencyCheck_NoStorage ID=0000000000000000 Address=2.5.1.0:1 ProcessId=0f55ad97e60416cdc0d2065aa118d545 ProcessClassEqualToStorageClass=0 TS
135.730840 ConsistencyCheck_NoStorage ID=0000000000000000 Address=2.3.1.1:1 ProcessId=289319c726c07c2b1a831fd00735584a ProcessClassEqualToStorageClass=0 TS
135.730840 ConsistencyCheck_NoStorage ID=0000000000000000 Address=2.4.1.1:1 ProcessId=471c9870a5e3e14d9d2c19a817fecccb ProcessClassEqualToStorageClass=0 TS
135.730840 ConsistencyCheck_NoStorage ID=0000000000000000 Address=2.2.1.1:1 ProcessId=4d86646f0bc1e81e21080adffc863463 ProcessClassEqualToStorageClass=0 TS
135.730840 ConsistencyCheck_NoStorage ID=0000000000000000 Address=2.1.1.1:1 ProcessId=53f7426e3ac54472a822e5e654e61fa3 ProcessClassEqualToStorageClass=0 TS
135.730840 ConsistencyCheck_NoStorage ID=0000000000000000 Address=2.3.1.0:1 ProcessId=583b6529f402edbe8e8c6243664af533 ProcessClassEqualToStorageClass=0 TS
135.730840 ConsistencyCheck_NoStorage ID=0000000000000000 Address=2.1.1.0:1 ProcessId=64f72c2f54243144f9fa4f850153fc8b ProcessClassEqualToStorageClass=0 TS
135.730840 ConsistencyCheck_NoStorage ID=0000000000000000 Address=2.4.1.0:1 ProcessId=8b44fd176747e78ae38afe1b395fff7f ProcessClassEqualToStorageClass=0 TS
135.730840 ConsistencyCheck_NoStorage ID=0000000000000000 Address=2.2.1.0:1 ProcessId=bdc84fea8155153c5c6418ae838f1e62 ProcessClassEqualToStorageClass=0 TS
135.730840 ConsistencyCheck_NoStorage ID=0000000000000000 Address=2.5.1.1:1 ProcessId=f1f11a88c83ce15ef04a7c5feb9747c3 ProcessClassEqualToStorageClass=0 TS

142.216093 EndpointNotFound ID=0000000000000000 SuppressedEventCount=1 Address=2.0.1.0:1 Token=6565e9053dd1b289 CC,CD,CP,RV,SS,TL

The consistency check failed with a "StorageServerUnavailable" error, which traces back to the SS throwing transaction_too_old in readRange(), where StorageVersion=411567957 > readVersion=410739741.

142.218124 WaitForPrevTooOld6 ID=e80e5b37098784a1 Version=410739741 StorageVersion=411567957 DD,GP,MS,RK,SS,TL 
142.218124 WaitForPrevTooOld7 ID=e80e5b37098784a1 Error=transaction_too_old ErrorDescription=Transaction is too old to perform reads or be committed ErrorCode=1007 Version=410739741 SS=[ maxversion: -1 staleness: 0] Oldest=411567957
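
To make the size of the gap in these two events concrete, here is a minimal sketch that pulls the Version and StorageVersion fields out of a trace line and subtracts them. The helper names traceField and versionGap are hypothetical and not part of FoundationDB; the sketch only assumes the space-separated Key=Value layout visible in the lines above.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper (not a FoundationDB API): return the integer value of a
// "Key=Value" field in a space-separated trace line, if it is present.
std::optional<int64_t> traceField(const std::string& line, const std::string& key) {
    std::istringstream tokens(line);
    std::string token;
    const std::string prefix = key + "=";
    while (tokens >> token) {
        // Match at the start of the token so "Version=" does not
        // accidentally match "StorageVersion=".
        if (token.compare(0, prefix.size(), prefix) == 0) {
            try {
                return std::stoll(token.substr(prefix.size()));
            } catch (...) {
                return std::nullopt; // value is not an integer
            }
        }
    }
    return std::nullopt;
}

// How far the storage server's version is ahead of the read version.
std::optional<int64_t> versionGap(const std::string& line) {
    auto readVersion = traceField(line, "Version");
    auto storageVersion = traceField(line, "StorageVersion");
    if (!readVersion || !storageVersion) {
        return std::nullopt;
    }
    return *storageVersion - *readVersion;
}
```

Applied to the WaitForPrevTooOld6 line, versionGap yields 411567957 - 410739741 = 828216, so the storage version is well ahead of the read version.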

142.220600 ConsistencyCheck_StorageServerUnavailable ID=0000000000000000 SuppressedEventCount=0 StorageServer=6d67ff09522b6693 ShardBegin= ShardEnd=\\xff\\x02/blog/L8\\xd2\\xac\\x85\\xcb\\x0d\\x87uq\\xc5\\xc5U\\xf6\\x83\\xb1\\x03* Address=2.0.1.0:1 UID=6d67ff09522b6693 GetKeyValuesToken=fbcb51bd0621af57 IsTSS=False Error=transaction_too_old ErrorDescription=Transaction is too old to perform reads or be committed ErrorCode=1007 TS
142.220600 TestFailure ID=0000000000000000 Workload=QuiescentCheck Reason=Consistency check: Storage server unavailable TS 
142.220600 ConsistencyCheck_FinishedCheck ID=0000000000000000 Repetitions=0 TS 

I traced getVersion() in ConsistencyCheck and found that the read version comes from the master. Then why is the SS's version larger? The problem seems to be at:

			if (data->storageVersion() > version)
				throw transaction_too_old();

MAX_READ_TRANSACTION_LIFE_VERSIONS is set to 100000 in this test (from the trace file: "max_read_transaction_life_versions" Value="100000"). While ConsistencyCheck is doing reads, the storage servers keep receiving empty commit versions and advance their oldestVersion to (empty-commit-version-number - 100000), which happens to be greater than the read version of the transaction. The transaction therefore receives a transaction_too_old error (thrown by waitForVersion() on a storage server) and fails, and ConsistencyCheck retries. This keeps repeating.
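
The retry loop can be illustrated with a toy model. This is only a sketch under stated assumptions: ToyStorageServer and countTooOld are hypothetical names, not FoundationDB code, and the model assumes each read version trails the newest applied commit by a fixed lag, which the trace does not confirm.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Simplified stand-in for a storage server; NOT the real FoundationDB class.
struct ToyStorageServer {
    int64_t maxReadLifeVersions; // models MAX_READ_TRANSACTION_LIFE_VERSIONS
    int64_t oldestVersion;

    explicit ToyStorageServer(int64_t knob) : maxReadLifeVersions(knob), oldestVersion(0) {}

    // A (possibly empty) commit at commitVersion moves the read window forward.
    void applyCommit(int64_t commitVersion) {
        oldestVersion = std::max(oldestVersion, commitVersion - maxReadLifeVersions);
    }

    // Same comparison as the check quoted above: a read is rejected when the
    // server's oldest version is greater than the read version.
    bool readIsTooOld(int64_t readVersion) const { return oldestVersion > readVersion; }
};

// Count how many of `attempts` reads are rejected when every read version
// trails the newest commit by `lag` versions (assumed, hypothetical parameter).
int countTooOld(int64_t knob, int64_t lag, int attempts, int64_t versionsPerAttempt) {
    assert(knob >= 0 && lag >= 0 && versionsPerAttempt > 0);
    ToyStorageServer ss(knob);
    int failures = 0;
    int64_t commitVersion = 0;
    for (int i = 0; i < attempts; ++i) {
        commitVersion += versionsPerAttempt; // empty commits keep arriving
        ss.applyCommit(commitVersion);
        const int64_t readVersion = commitVersion - lag;
        if (ss.readIsTooOld(readVersion)) {
            ++failures; // the caller would retry and hit the same condition
        }
    }
    return failures;
}
```

In this model every attempt fails as long as the lag exceeds the knob, so retrying alone does not help; a lag at or below the knob never fails.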

Unseed mismatch

With knob MAX_READ_TRANSACTION_LIFE_VERSIONS changed to 500k, 20220301-043111-jzhou-69f606f4acbd96c3 has one unseed mismatch error:

-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/SharedBackupCorrectness.toml -s 100688279 -b on

20140 92342

Enable Unicast

20220301-174003-jzhou-4d930e776be61254 passed, probably because the dirty read is very hard to reproduce in correctness tests.