Description
Repros with main as of ea32e86, where our submodule is llvm/llvm-project@b8d38e8.
We use LLVM lit's "sharding" feature to split our large test suite across 8 VMs per architecture:
STL/azure-devops/native-build-test.yml (line 21 at ea32e86):
`shardFlags: '--num-shards=$(System.TotalJobsInPhase);--run-shard=$(System.JobPositionInPhase)'`
We expected that, for a given set of tests, this would exactly partition them across the VMs, with no duplicates and no missed tests. That is, we expected the sharding algorithm to be deterministic, even across different machines (these VMs run independently), as long as the set of tests is the same on each machine (which it is, because they've all checked out the same commit). There should be no sensitivity to filesystem enumeration order, time of day, or anything else. (However, it's okay if adding or removing a single test radically changes which tests land in each shard, and it's okay for each shard's tests to be run in a totally randomized/shuffled order.)
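The expected property can be written down as a check. This is a minimal Python sketch, assuming the "every Nth test" selection that lit's own note describes (shard r of n takes tests #(n*k)+r, counting from 1); `select_shard` and `is_exact_partition` are hypothetical helpers, not lit's API. The property only holds when every shard is cut from the same ordered list.

```python
def select_shard(tests, run, num_shards):
    """Return the tests for 1-based shard `run` out of `num_shards`.

    Hypothetical helper mirroring lit's "tests #(n*k)+r" note:
    0-based positions run-1, run-1+n, run-1+2n, ...
    """
    if not 1 <= run <= num_shards:
        raise ValueError("run must be in [1, num_shards]")
    return tests[run - 1::num_shards]


def is_exact_partition(all_tests, shards):
    """True if every test appears in exactly one shard (no duplicates, no misses)."""
    seen = []
    for shard in shards:
        seen.extend(shard)
    return len(seen) == len(set(seen)) and set(seen) == set(all_tests)


if __name__ == "__main__":
    tests = [f"test{i:03}" for i in range(114)]
    shards = [select_shard(tests, run, 8) for run in range(1, 9)]
    print([len(s) for s in shards])  # [15, 15, 14, 14, 14, 14, 14, 14]
    print(is_exact_partition(tests, shards))  # True
```

With 114 tests and 8 shards, shard 1 gets 15 tests, which matches the `size 15/114` that lit reports in the example runs.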
We're observing non-deterministic behavior from lit. This was first noticed in #2793, whose initial version mistakenly had 2 tests that were XPASSing, but the XPASSes showed up only for x86, even though nothing involved was architecture-sensitive. Looking at x64, we didn't see the affected tests run in any of the 8 shards, even though other tests from that subdirectory did run.
Eventually, we found that this repros locally! No machine variation is needed: simply two consecutive runs on the same machine, at the same commit.
Example (the same command, run twice in a row):
D:\GitHub\STL\out\build\x64>python tests\utils\stl-lit\stl-lit.py ..\..\..\llvm-project\libcxx\test\std\language.support\support.limits\support.limits.general --num-shards=8 --run-shard=1 -o testing_x64.log
stl-lit.py: D:\GitHub\STL\llvm-project\llvm\utils\lit\lit\main.py:193: note: Selecting shard 1/8 = size 15/114 = tests #(8*k)+1 = [1, 9, 17, ...]
-- Testing: 15 of 114 tests, 15 workers --
XFAIL: libc++ :: std/language.support/support.limits/support.limits.general/version.version.pass.cpp:0 (1 of 15)
XFAIL: libc++ :: std/language.support/support.limits/support.limits.general/tuple.version.pass.cpp:0 (2 of 15)
XFAIL: libc++ :: std/language.support/support.limits/support.limits.general/algorithm.version.pass.cpp:0 (3 of 15)
XFAIL: libc++ :: std/language.support/support.limits/support.limits.general/functional.version.pass.cpp:0 (4 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/numbers.version.pass.cpp:0 (5 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/latch.version.pass.cpp:0 (6 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/coroutine.version.pass.cpp:0 (7 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/cmath.version.pass.cpp:0 (8 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/semaphore.version.pass.cpp:0 (9 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/stack.version.pass.cpp:0 (10 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/map.version.pass.cpp:0 (11 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/barrier.version.pass.cpp:0 (12 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/queue.version.pass.cpp:0 (13 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/unordered_set.version.pass.cpp:0 (14 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/execution.version.pass.cpp:0 (15 of 15)
Testing Time: 2.70s
Excluded : 99
Passed : 11
Expectedly Failed: 4
D:\GitHub\STL\out\build\x64>python tests\utils\stl-lit\stl-lit.py ..\..\..\llvm-project\libcxx\test\std\language.support\support.limits\support.limits.general --num-shards=8 --run-shard=1 -o testing_x64.log
stl-lit.py: D:\GitHub\STL\llvm-project\llvm\utils\lit\lit\main.py:193: note: Selecting shard 1/8 = size 15/114 = tests #(8*k)+1 = [1, 9, 17, ...]
-- Testing: 15 of 114 tests, 15 workers --
XFAIL: libc++ :: std/language.support/support.limits/support.limits.general/typeinfo.version.pass.cpp:0 (1 of 15)
XFAIL: libc++ :: std/language.support/support.limits/support.limits.general/iterator.version.pass.cpp:0 (2 of 15)
XFAIL: libc++ :: std/language.support/support.limits/support.limits.general/optional.version.pass.cpp:0 (3 of 15)
XFAIL: libc++ :: std/language.support/support.limits/support.limits.general/string.version.pass.cpp:0 (4 of 15)
XFAIL: libc++ :: std/language.support/support.limits/support.limits.general/algorithm.version.pass.cpp:0 (5 of 15)
XFAIL: libc++ :: std/language.support/support.limits/support.limits.general/memory.version.pass.cpp:0 (6 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/cstddef.version.pass.cpp:0 (7 of 15)
XFAIL: libc++ :: std/language.support/support.limits/support.limits.general/format.version.pass.cpp:0 (8 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/coroutine.version.pass.cpp:0 (9 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/array.version.pass.cpp:0 (10 of 15)
XFAIL: libc++ :: std/language.support/support.limits/support.limits.general/chrono.version.pass.cpp:0 (11 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/scoped_allocator.version.pass.cpp:0 (12 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/map.version.pass.cpp:0 (13 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/vector.version.pass.cpp:0 (14 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/execution.version.pass.cpp:0 (15 of 15)
Testing Time: 2.69s
Excluded : 99
Passed : 7
Expectedly Failed: 8
Note that both runs are "Selecting shard 1/8" of the same 114 tests, in a directory containing a mix of PASSing and XFAILing tests. Yet the counts differ (11 passed and 4 expectedly failed, versus 7 and 8). Sorting and diffing the test lists shows that they are partially overlapping subsets (e.g. both runs included execution.version.pass.cpp, but barrier.version.pass.cpp ran only in the first run and array.version.pass.cpp only in the second).
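The every-Nth selection is only as deterministic as the list it is applied to. Here is a minimal Python sketch, assuming lit's selection works as its note describes and using one swapped pair as an arbitrary stand-in for whatever makes the order differ between runs (the actual cause is not identified here); `select_shard` and the `testNNN` names are hypothetical.

```python
from collections import Counter


def select_shard(tests, run, num_shards):
    # Hypothetical every-Nth selection: 1-based shard `run` takes the
    # 0-based positions run-1, run-1+num_shards, run-1+2*num_shards, ...
    return tests[run - 1::num_shards]


tests = [f"test{i:03}" for i in range(114)]

# Two enumeration orders of the SAME tests, differing by one swapped pair.
order_a = list(tests)
order_b = list(tests)
order_b[0], order_b[1] = order_b[1], order_b[0]

# "Selecting shard 1/8" twice: partially overlapping subsets.
shard1_a = set(select_shard(order_a, 1, 8))
shard1_b = set(select_shard(order_b, 1, 8))
common = shard1_a & shard1_b
only_a = sorted(shard1_a - shard1_b)
only_b = sorted(shard1_b - shard1_a)
print(len(common), only_a, only_b)  # 14 ['test000'] ['test001']

# Eight independent VMs: every VM except VM 2 happens to see order_a.
runs = Counter()
for run in range(1, 9):
    order = order_b if run == 2 else order_a
    runs.update(select_shard(order, run, 8))
missed = sorted(t for t in tests if runs[t] == 0)
duplicated = sorted(t for t in tests if runs[t] > 1)
print("missed:", missed)  # missed: ['test001']
print("duplicated:", duplicated)  # duplicated: ['test000']
```

In this sketch, a single transposition is enough to make two shard-1 runs partially overlap and, across independent VMs, to run one test twice while never running another, consistent with the symptoms seen in #2793.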
I haven't located exactly where in lit's Python implementation this happens, but I would expect a step, after enumerating all available tests and before running the "select every Nth test" shard algorithm (which I do see in the code), that sorts the available tests so that the result is always deterministic.
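A minimal sketch of such a sorting step, assuming each test can be keyed by a name that is identical on every machine (plain strings stand in for lit's full test names here); `stable_shard` is a hypothetical helper, not a proposed patch to lit's code.

```python
import random
from collections import Counter


def stable_shard(tests, run, num_shards):
    """Hypothetical deterministic sharding: sort by name, then take every Nth test.

    Sorting first makes the selection independent of enumeration order,
    so independent machines agree on the partition.
    """
    if not 1 <= run <= num_shards:
        raise ValueError("run must be in [1, num_shards]")
    ordered = sorted(tests)
    return ordered[run - 1::num_shards]


tests = [f"test{i:03}" for i in range(114)]

# Each simulated VM enumerates the tests in its own arbitrary order.
counts = Counter()
for run in range(1, 9):
    order = list(tests)
    random.Random(run).shuffle(order)
    counts.update(stable_shard(order, run, 8))

print(all(counts[t] == 1 for t in tests))  # True: no misses, no duplicates
```

Sorting by name is just one choice; any total order computed identically on every machine (for example, a hash of the name that does not depend on per-process seeding) would also work. Shuffling each shard's tests after selection would still be fine, since only the selection needs a stable order.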