LLVM lit sharding is non-deterministic #2794

@StephanTLavavej

Repros with main as of ea32e86, where our submodule is llvm/llvm-project@b8d38e8.

We use LLVM lit's "sharding" feature to split our large test suite across 8 VMs per architecture:

shardFlags: '--num-shards=$(System.TotalJobsInPhase);--run-shard=$(System.JobPositionInPhase)'

We expected that for a given set of tests, this would exactly partition them across the VMs, with no duplicates and no missed tests. That is, we expected the sharding algorithm to be deterministic, even across different machines (because these VMs run independently), as long as the set of tests is the same on each machine (which it is, because they've checked out the same commit). There should be no sensitivity to filesystem enumeration order, time of day, or anything else. (However, it's okay if adding/removing a single test radically changes which tests land in each shard, and it's okay for the tests within each shard to be run in a totally randomized/shuffled order.)
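To make the "exact partition" expectation concrete, here is a minimal sketch of a check over per-shard test lists. The function name `check_partition` and the short test names are hypothetical, made up for illustration; nothing here is part of lit.

```python
from collections import Counter

def check_partition(all_tests, shards):
    """Compare the per-shard test lists against the full test set.

    Returns (missed, duplicated): tests that no shard ran, and tests
    that were run more than once. Both are empty for an exact partition.
    """
    counts = Counter(test for shard in shards for test in shard)
    missed = sorted(set(all_tests) - set(counts))
    duplicated = sorted(test for test, n in counts.items() if n > 1)
    return missed, duplicated

# Hypothetical test names, for illustration only.
all_tests = ["a.pass.cpp", "b.pass.cpp", "c.pass.cpp", "d.pass.cpp"]
good = [["a.pass.cpp", "c.pass.cpp"], ["b.pass.cpp", "d.pass.cpp"]]
bad = [["a.pass.cpp", "c.pass.cpp"], ["c.pass.cpp", "d.pass.cpp"]]

print(check_partition(all_tests, good))  # ([], [])
print(check_partition(all_tests, bad))   # (['b.pass.cpp'], ['c.pass.cpp'])
```

The `bad` case has the same shape as the symptom in this issue: one test is run by two shards while another is run by none.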

We're observing non-deterministic behavior from lit. This was originally observed in #2793, whose initial version mistakenly had 2 tests that were XPASSing - but the XPASSes showed up only for x86, when nothing was architecture-sensitive. Looking at x64, we didn't see the affected tests running in any of the 8 shards, even though other tests from that subdirectory ran.

Eventually, we found that this repros locally! No machine variation is needed - simply two consecutive runs on the same machine, at the same commit.

Example:
D:\GitHub\STL\out\build\x64>python tests\utils\stl-lit\stl-lit.py ..\..\..\llvm-project\libcxx\test\std\language.support\support.limits\support.limits.general --num-shards=8 --run-shard=1 -o testing_x64.log
stl-lit.py: D:\GitHub\STL\llvm-project\llvm\utils\lit\lit\main.py:193: note: Selecting shard 1/8 = size 15/114 = tests #(8*k)+1 = [1, 9, 17, ...]
-- Testing: 15 of 114 tests, 15 workers --
XFAIL: libc++ :: std/language.support/support.limits/support.limits.general/version.version.pass.cpp:0 (1 of 15)
XFAIL: libc++ :: std/language.support/support.limits/support.limits.general/tuple.version.pass.cpp:0 (2 of 15)
XFAIL: libc++ :: std/language.support/support.limits/support.limits.general/algorithm.version.pass.cpp:0 (3 of 15)
XFAIL: libc++ :: std/language.support/support.limits/support.limits.general/functional.version.pass.cpp:0 (4 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/numbers.version.pass.cpp:0 (5 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/latch.version.pass.cpp:0 (6 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/coroutine.version.pass.cpp:0 (7 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/cmath.version.pass.cpp:0 (8 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/semaphore.version.pass.cpp:0 (9 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/stack.version.pass.cpp:0 (10 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/map.version.pass.cpp:0 (11 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/barrier.version.pass.cpp:0 (12 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/queue.version.pass.cpp:0 (13 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/unordered_set.version.pass.cpp:0 (14 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/execution.version.pass.cpp:0 (15 of 15)

Testing Time: 2.70s
  Excluded         : 99
  Passed           : 11
  Expectedly Failed:  4

D:\GitHub\STL\out\build\x64>python tests\utils\stl-lit\stl-lit.py ..\..\..\llvm-project\libcxx\test\std\language.support\support.limits\support.limits.general --num-shards=8 --run-shard=1 -o testing_x64.log
stl-lit.py: D:\GitHub\STL\llvm-project\llvm\utils\lit\lit\main.py:193: note: Selecting shard 1/8 = size 15/114 = tests #(8*k)+1 = [1, 9, 17, ...]
-- Testing: 15 of 114 tests, 15 workers --
XFAIL: libc++ :: std/language.support/support.limits/support.limits.general/typeinfo.version.pass.cpp:0 (1 of 15)
XFAIL: libc++ :: std/language.support/support.limits/support.limits.general/iterator.version.pass.cpp:0 (2 of 15)
XFAIL: libc++ :: std/language.support/support.limits/support.limits.general/optional.version.pass.cpp:0 (3 of 15)
XFAIL: libc++ :: std/language.support/support.limits/support.limits.general/string.version.pass.cpp:0 (4 of 15)
XFAIL: libc++ :: std/language.support/support.limits/support.limits.general/algorithm.version.pass.cpp:0 (5 of 15)
XFAIL: libc++ :: std/language.support/support.limits/support.limits.general/memory.version.pass.cpp:0 (6 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/cstddef.version.pass.cpp:0 (7 of 15)
XFAIL: libc++ :: std/language.support/support.limits/support.limits.general/format.version.pass.cpp:0 (8 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/coroutine.version.pass.cpp:0 (9 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/array.version.pass.cpp:0 (10 of 15)
XFAIL: libc++ :: std/language.support/support.limits/support.limits.general/chrono.version.pass.cpp:0 (11 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/scoped_allocator.version.pass.cpp:0 (12 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/map.version.pass.cpp:0 (13 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/vector.version.pass.cpp:0 (14 of 15)
PASS: libc++ :: std/language.support/support.limits/support.limits.general/execution.version.pass.cpp:0 (15 of 15)

Testing Time: 2.69s
  Excluded         : 99
  Passed           :  7
  Expectedly Failed:  8

Note that each time, this is "Selecting shard 1/8", in a directory with variously PASSing and XFAILing tests. Yet the number of PASSes and XFAILs varies. Sorting and diffing the tests reveals how these are partially-overlapping subsets (e.g. both of them ran execution.version.pass.cpp, but array.version.pass.cpp and barrier.version.pass.cpp were run in only one of the subsets).

I haven't located exactly where in lit's Python implementation this is happening, but I would expect that after enumerating all available tests and before running the "select every Nth test" shard algorithm (I do see the latter in the code), there should be a step that sorts the available tests, so that the result is always deterministic.
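A minimal sketch of the behavior I'd expect, assuming the 1-based shard numbering and the "tests #(8*k)+1" stride reported in the log note above. The function `select_shard` and the generated test names are hypothetical; this is not lit's actual code.

```python
import random

def select_shard(tests, num_shards, run_shard):
    """Return the tests for 1-based shard `run_shard` of `num_shards`.

    Sorting first makes the result independent of discovery order.
    """
    if not 1 <= run_shard <= num_shards:
        raise ValueError("run_shard must be in [1, num_shards]")
    ordered = sorted(tests)
    # Take every num_shards-th test, starting at this shard's offset.
    return ordered[run_shard - 1::num_shards]

# 114 hypothetical test names, matching the directory size in the log.
tests = [f"header{i:03}.version.pass.cpp" for i in range(114)]

# Simulate two runs that discover the same tests in different orders.
discovered_a = random.sample(tests, len(tests))
discovered_b = random.sample(tests, len(tests))

shard_a = select_shard(discovered_a, 8, 1)
shard_b = select_shard(discovered_b, 8, 1)
print(len(shard_a), shard_a == shard_b)  # 15 True
```

Because the stride is applied to a sorted list, the 8 shards taken together cover every test exactly once, regardless of the order in which each machine enumerated them.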

Labels: fixed, high priority, test
