Add test/bench runners, benchmarks, additional scripts #752
Merged
Conversation
This tries a different design for testing. The goals are to make the test infrastructure a bit simpler, with clear stages for building and running, and faster, by avoiding rebuilding lfs.c n times.
This moves defines entirely into the runtime of the test_runner, simplifying things and reducing the amount of generated code that needs to be built, at the cost of limiting test defines to uintmax_t types. This is implemented using a set of index-based scopes (created by test.py) that allow different layers to override defines from other layers, accessible through the global `test_define` function. The layers, in order of precedence:
1. command-line overrides
2. per-case defines
3. per-geometry defines
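A minimal sketch of how such a layered lookup might work; the types and names here are hypothetical, for illustration only, not the actual test_runner code:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// one layer of defines: a sparse override of define index -> value
// (hypothetical types, for illustration only)
typedef struct test_layer {
    const bool *defined;      // which indices this layer overrides
    const uintmax_t *values;  // values for overridden indices
} test_layer_t;

// layers in precedence order: 0. command-line overrides,
// 1. per-case defines, 2. per-geometry defines
static const test_layer_t *test_layers[3];

// look up a define by index, falling through the layers in order
uintmax_t test_define(size_t define) {
    for (int i = 0; i < 3; i++) {
        const test_layer_t *l = test_layers[i];
        if (l && l->defined[define]) {
            return l->values[define];
        }
    }
    return 0;  // undefined, a real runner would likely error here
}
```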
- Indirect index map instead of bitmap+sparse array
- test_define_t and test_type_t
- Added back conditional filtering
- Added suite-level defines and filtering
- Added filtering based on suite, case, perm, type, geometry
- Added --skip, --count, and --every (will be used for parallelism)
- Implemented --list-defines
- Better helptext for flags with arguments
- Other minor tweaks
In the test-runner, defines are parameterized constants (limited to integers) that are generated from the test suite tomls, resulting in many permutations of each test. In order to make this efficient, these defines are implemented as multi-layered lookup tables, using per-layer/per-scope indirect mappings. This lets the test-runner and test suites define their own defines with compile-time indexes independently. It also makes building of the lookup tables very efficient, since they can be incrementally populated as we expand the test permutations.

The four current define layers and when we need to build them:

| layer                   | defines      | predefine_map | define_map |
|-------------------------|--------------|---------------|------------|
| user-provided overrides | per-run      | per-run       | per-suite  |
| per-permutation defines | per-perm     | per-case      | per-perm   |
| per-geometry defines    | per-perm     | compile-time  | -          |
| default defines         | compile-time | compile-time  | -          |
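As a rough illustration of the indirect-mapping idea (hypothetical names, not the actual implementation): each scope carries a small map from its own compile-time define indexes to slots in the runner's global define table, so suites and the runner can number their defines independently.

```c
#include <stddef.h>
#include <stdint.h>

// hypothetical: a scope (suite/case) numbers its defines with its own
// compile-time indexes; an indirect map translates them into slots in
// the runner's global define table
typedef struct test_define_map {
    const size_t *map;  // scope-local index -> global index
    size_t count;
} test_define_map_t;

// global table of currently-active define values (built per-run/per-perm)
extern uintmax_t test_defines[];

// resolve a scope-local define through the scope's indirect map
static inline uintmax_t test_scope_define(
        const test_define_map_t *scope, size_t local) {
    return (local < scope->count)
            ? test_defines[scope->map[local]]
            : 0;  // out of range, real code would likely assert
}
```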
- Added --disk/--trace/--output options for information-heavy debugging
- Renamed --skip/--count/--every to --start/--stop/--step. This matches common terms for ranges, and frees --skip for being used to skip test cases in the future.
- Better handling of SIGTERM: now all tests are killed, reported as failures, and testing is halted regardless of -k. This is a compromise; you throw away the rest of the tests, which is normally what -k is for, but it prevents annoying-to-terminate processes when debugging, which is a very interactive process.
- Expanded test defines to allow for lists of configurations. These are useful for changing multi-dimensional test configurations without leading to extremely large and less useful configuration combinations.
- Made warnings more visible during test parsing
- Added lfs_testbd.h to implicit test includes
- Fixed issue with not closing files in ./scripts/explode_asserts.py
- Added `make test_runner` and `make test_list` build rules for convenience
- Added internal tests, which can run tests inside other source files, allowing access to "private" functions and data. Note this required a special bit of handling for defining and later undefining test configurations to not pollute the namespace of the source file, since it can end up with test cases from different suites/configuration namespaces.
- Removed unnecessary/unused permutation argument to generated test functions.
- Some cleanup to progress output of test.py.
Previously test defines were implemented using layers of index-mapped uintmax_t arrays. This worked well for lookup, but limited defines to constants computed at compile-time. Since test defines themselves are actually calculated at _run-time_ (yeah, they have deviated quite a bit from the original, compile-time evaluated defines, which makes the name make less sense), this meant defines couldn't depend on other defines. This was limiting, since a lot of test defines relied on defines generated from the geometry being tested.

This new implementation uses callbacks for the per-case defines. This means they can easily contain full C statements, which can depend on other test defines. This does mean you can create infinitely-recursive defines, but the test-runner will just break at run-time, so don't do that.

One concern is that there might be a performance hit for evaluating all defines through callbacks, but if there is, it is well below the noise floor:
- constants: 43.55s
- callbacks: 42.05s
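A hedged sketch of what a callback-based define might look like (the names and indexes are illustrative, not the actual generated code); the point is that one define can be an expression over other defines, evaluated at run-time:

```c
#include <stddef.h>
#include <stdint.h>

// hypothetical callback type: a define is evaluated on demand at run-time
typedef uintmax_t (*test_define_cb_t)(void *data);

// assume this resolves another define through the runner's lookup tables
extern uintmax_t test_define(size_t define);

// hypothetical indexes for the defines used below
enum { BLOCK_SIZE_i, BLOCK_COUNT_i, DISK_SIZE_i };

// a per-case define that depends on other defines, something a
// compile-time constant table could not express
static uintmax_t disk_size_cb(void *data) {
    (void)data;
    return test_define(BLOCK_SIZE_i) * test_define(BLOCK_COUNT_i);
}
```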
- Added --exec for wrapping the test-runner with external commands, such as QEMU or Valgrind.
- Added --valgrind, which just aliases --exec=valgrind with a few extra flags useful during testing.
- Dropped the "valgrind" type for tests. These aren't separate tests that run in the test-runner, and I don't see a need for disabling Valgrind for any tests. This can be added back later if needed.
- Readded support for dropping directly into gdb after a test failure, either at the assert failure, the entry point of the test case, or the entry point of the test runner, with --gdb, --gdb-case, or --gdb-main.
- Added --isolate for running each test permutation in its own process; this is required for associating Valgrind errors with the right test case.
- Fixed an issue where explicit test identifiers conflicted with per-stage test identifiers generated as a part of --by-suite and --by-case.
This mostly required names for each test case, declarations of previously-implicit variables since the new test framework is more conservative with what it declares (the small extra effort to add declarations is well worth the simplicity and improved readability), and tweaks to work with not-really-constant defines.

Also renamed test_ -> test, replacing the old ./scripts/test.py; unfortunately git seems to have had a hard time with this.
This simplifies the interaction between code generation and the test-runner. In theory it also reduces compilation dependencies, but internal tests make this difficult.
A small mistake in test.py's control flow meant the failing test job would successfully kill all other test jobs, but then humorously start up a new process to continue testing.
GCC is a bit annoying here: it can't generate .cgi files without generating the related .o files, though I suppose the alternative risks duplicating a large amount of compilation work (littlefs is really a small project).

Previously we rebuilt the .o files any time we needed .cgi files (callgraph info used for stack.py). This changes it so we always build .cgi files as a side effect of compilation. This is similar to the .d file generation, though it may be annoying if the system cc doesn't support --callgraph-info.
This also adds coverage support to the new test framework, which, due to its reduced scope, no longer needs aggregation and can be much simpler. Really all we need to do is pass --coverage to GCC, which builds its .gcda files during testing in a multi-process-safe manner.

The addition of branch coverage leverages information that was available in both lcov and gcov. This was made easier with the addition of --json-format to gcov in GCC 9.0; however, the lax backwards compatibility for gcov's intermediary options is a bit concerning. Hopefully --json-format sticks around for a while.
These scripts can't easily share the common logic, but separating field details from the print/merge/csv logic should make the common part of these scripts much easier to create/modify going forward. This also tweaked the behavior of summary.py slightly.
On one hand this isn't very different than the source annotation in gcov, on the other hand I find it a bit more readable after a bit of experimentation.
Also renamed GCI -> CI; this holds .ci files, though there is a risk of confusion with continuous integration. Also added unused-but-generated .ci files to the clean rule.
- Renamed explode_asserts.py -> pretty_asserts.py, this name is hopefully a bit more descriptive
- Small cleanup of the parser rules
- Added recognition of memcmp/strcmp => 0 statements and generation of the relevant memory-inspecting assert messages

I attempted to fix the incorrect column numbers for the generated asserts, but unfortunately this didn't go anywhere, and I don't think it's actually possible. There is no column control analogous to the #line directive. I thought you might be able to intermix #line directives to put arguments at the right column like so:

    assert(a == b);
    __PRETTY_ASSERT_INT_EQ(
    #line 1
    a,
    #line 1
    b);

But this doesn't work, as preprocessor directives are not allowed in macro arguments in standard C. Unfortunately this is probably not possible to fix without better support in the language.
Yes this is more expensive, since small programs need to rewrite the whole block in order to conform to the block device API. However, it reduces code duplication and keeps all of the test-related block device emulation in lfs_testbd. Some people have used lfs_filebd/lfs_rambd as a starting point for new block devices and I think it should be clear that erase does not need to have side effects. Though to be fair this also just means we should have more examples of block devices...
On one hand this seems like the wrong place for these tests, on the other hand, it's good to know that the block device is behaving as expected when debugging the filesystem. Maybe this should be moved to an external program for users to test their block devices in the future?
The main changes here from the previous test framework design are:
1. Powerloss testing remains in-process, speeding up testing.
2. The state of a test, including all powerlosses, is encoded in the test id + a leb16-encoded powerloss string. This means exhaustive testing can be run in CI, but then easily reproduced locally with full debugger support.

For example:

    ./scripts/test.py test_dirs#reentrant_many_dir#10#1248g1g2 --gdb

will run the suite test_dirs, case reentrant_many_dir, permutation #10, with powerlosses at 1, 2, 4, 8, 16, and 32 cycles, dropping into gdb if an assert fails.

The changes to the block-device are a work-in-progress for a lazily-allocated/copy-on-write block device that I'm hoping will keep exhaustive testing relatively low-cost.
With more features being added to test.py, the one-line status is starting to get quite long and surpass the ~80-column readability heuristic. To make this worse, it clobbers the terminal output when the terminal is not wide enough. The simple solution is to disable line-wrapping, potentially printing some garbage if line-wrapping-disable is not supported, but also printing a final status update to fix any garbage and avoid a race condition where the script would show a non-final status. Also added --color which disables any of this attempting-to-be-clever stuff.
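For reference, disabling and re-enabling terminal line-wrapping is typically done with the DECAWM escape sequences; a minimal C sketch of the idea (the actual logic lives in test.py, and the status text below is made up):

```c
#include <stdio.h>

int main(void) {
    // DECAWM off: long status lines get truncated instead of wrapping
    // and clobbering previous output
    printf("\x1b[?7l");

    printf("\rrunning tests: 42/1000 passed ...");
    fflush(stdout);

    // DECAWM back on, plus a final status line so no garbage is left
    // behind if the escape sequence wasn't supported
    printf("\x1b[?7h\n");
    printf("done: 1000/1000 passed\n");
    return 0;
}
```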
Before this was available implicitly by supporting both rambd and filebd as backends, but now that testbd is a bit more complicated and no longer maps directly to a block-device, this needs to be explicitly supported.
These have no real purpose other than slowing down the simulation for inspection/fun. Note this did reveal an issue in pretty_asserts.py which was clobbering feature macros. Added explicit, and maybe a bit hacky, #undef _FEATURE_H to avoid this.
As expected this takes a significant amount of time (~10 minutes for all 1-deep powerlosses, >10 hours for all 2-deep powerlosses), but this may be reducible in the future by optimizing tests for powerloss testing. Currently test_files does a lot of work that doesn't really have testing value.
… fifos

This mostly involved futzing around with some of the less intuitive parts of Unix's named-pipe behavior. This is a bit important since the tests can quickly generate several gigabytes of trace output.
Based on a handful of local hacky variations, this sort of trace rendering is surprisingly useful for getting an understanding of how different filesystem operations interact with the underlying block-device.

At some point it would probably be good to reimplement this in a compiled language. Parsing and tracking the trace output quickly becomes a bottleneck with the amount of trace output the tests generate.

Note also that since tracebd.py runs on trace output, it can also be used to debug logged block-device operations post-run.
geky added the "needs minor version" (new functionality only allowed in minor versions) and "tooling" labels on Dec 2, 2022
After only 4 days, 20 hours, with 144,437,889 powerlosses, the exhaustive powerloss testing with all 2-deep powerlosses finished successfully:
- Moved to Ubuntu 22.04. This notably means we no longer have to bend over backwards to install GCC 10!
- Changed the shell in gha to include the verbose/undefined flags, making debugging gha a bit less painful.
- Adopted the new test.py/test_runners framework, which means no more heavy recompilation for different configurations. This reduces the test job runtime from >1 hour to ~15 minutes, while increasing the number of geometries we are testing.
- Added exhaustive powerloss testing. Because of time constraints this is at most 1pls for general tests, 2pls for a subset of useful tests.
- Limited coverage measurements to `make test`. Originally I tried to maximize coverage numbers by including coverage from every possible source, including the more elaborate CI jobs which provide an extra level of fuzzing. But this missed the purpose of coverage measurements, which is to find areas where test cases can be improved. We don't want to improve coverage by just shoving more fuzz tests into CI, we want to improve coverage by adding specific, intentioned test cases that, if they fail, highlight the reason for the failure. With this perspective, maximizing coverage measurement in CI is counter-productive. This change makes it so the reported coverage is always less than actual CI coverage, but acts as a more useful metric. This also simplifies coverage collection, so that's an extra plus.
- Added benchmarks to CI. Note this doesn't suffer from inconsistent CPU performance because our benchmarks are based on purely simulated read/prog/erase measurements.
- Updated the generated markdown table to include line+branch coverage info and benchmark results.
For long running processes (testing with >1pls) these logs can grow into multiple gigabytes, humorously we never access more than the last n lines as requested by --context. Piping the stdout with --stdout does not use additional RAM.
The littlefs CI is actually in a nice state that generates a lot of information about PRs (code/stack/struct changes, line/branch coverage changes, benchmark changes), but GitHub's UI has changed over time to make CI statuses harder to find, for some reason. This bot comment should hopefully make this information easy to find without creating too much noise in the discussion. If not, this can always be changed later.
changeprefix.py only works on prefixes, which is a bit of a problem for flags in the workflow scripts, requiring extra handling to not hide the prefix from changeprefix.py
Two flags introduced here, -fcallgraph-info=su for stack analysis and -ftrack-macro-expansion=0 for cleaner prettyasserts.py warnings, are unfortunately not supported in Clang. The override vars in the Makefile meant it wasn't actually possible to remove these flags for Clang testing, so this commit changes those vars to normal, non-overriding vars. This means `make CFLAGS=-Werror` and `CFLAGS=-Werror make` behave _very_ differently, but this is just an unfortunate quirk of make that needs to be worked around.
- Renamed struct_.py -> structs.py again.
- Removed lfs.csv, instead preferring script-specific csv files.
- Added *-diff make rules for quick comparison against a previous result; results are now implicitly written on each run. For example, `make code` creates lfs.code.csv and prints the summary, which can be followed by `make code-diff` to compare changes against the saved lfs.code.csv without overwriting.
- Added nargs=? support for -s and -S, which now use a per-result _sort attribute to decide the sort if fields are unspecified.
Mostly for benchmarking, this makes it easy to view and compare runner results similarly to other csv results.
The linear powerloss heuristic provides very good powerloss coverage without a significant runtime hit, so there's really no reason to run the tests without -Plinear. Previous behavior can be accomplished with an explicit -Pnone.
- lfs_emubd_getreaded -> lfs_emubd_readed
- lfs_emubd_getproged -> lfs_emubd_proged
- lfs_emubd_geterased -> lfs_emubd_erased
- lfs_emubd_getwear -> lfs_emubd_wear
- lfs_emubd_getpowercycles -> lfs_emubd_powercycles
When you add a function to every benchmark suite, you know it should probably be provided by the benchmark runner itself. That being said, randomness in tests/benchmarks is a bit tricky because it needs to be strictly controlled and reproducible.

No global state is used, allowing tests/benches to maintain multiple randomness streams, which can be useful for checking results during a run.

There's an argument for having global prng state, in that the prng could be preserved across power-loss, but I have yet to see a use for this, and it would add a significant requirement to any future test/bench runner.
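A minimal sketch of a seedable, state-in-a-struct PRNG with no globals, so a test/bench can keep several independent, reproducible streams; this is a generic xorshift example, not the runner's actual prng:

```c
#include <stdint.h>

// each stream owns its state, so a test/bench can keep several
// independent, reproducible streams at once
typedef struct prng {
    uint64_t state;
} prng_t;

static inline void prng_seed(prng_t *p, uint64_t seed) {
    // avoid the all-zero state, which xorshift can't escape
    p->state = seed ? seed : 1;
}

static inline uint64_t prng_next(prng_t *p) {
    // xorshift64: small, fast, and deterministic
    uint64_t x = p->state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    p->state = x;
    return x;
}
```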
geky force-pushed the test-and-bench-runners branch from 78ebd80 to d677a96 on December 7, 2022
…ground

The difference between ggplot's gray and GitHub's gray was a bit jarring. This also adds --foreground and --font-color for this sort of additional color control, without needing to add a new flag for every color scheme out there.
geky force-pushed the test-and-bench-runners branch 2 times, most recently from 076f871 to 17c9665 on December 16, 2022
Driven primarily by a want to compare measurements of different runtime complexities (it's difficult to fit O(n) and O(log n) on the same plot), this adds the ability to nest subplots in the same .svg, which try to align with each other as much as possible. This turned out to be surprisingly complicated.

As a part of this, adopted matplotlib's relatively recent constrained_layout, which behaves much more consistently. Also dropped --legend-left; no one should really be using that.
As well as --legend* and --*ticklabels. Mostly for close feature parity, making it easier to move plots between plot.py and plotmpl.py.
geky force-pushed the test-and-bench-runners branch from 17c9665 to 1f37eb5 on December 16, 2022
- Added support for negative numbers in the leb16 encoding, with an optional 'w' prefix.
- Changed the prettyasserts.py rule to .a.c => .c, allowing other .a.c files in the future.
- Updated .gitignore with missing generated files (tags, .csv).
- Removed suite-namespacing of test symbols; these are no longer needed.
- Changed test define overrides to have higher priority than explicit defines encoded in test ids. So:

      ./runners/bench_runner bench_dir_open:0f1g12gg2b8c8dgg4e0 -DREAD_SIZE=16

  behaves as expected. Otherwise it's not easy to experiment with known failing test cases.
- Fixed an issue where the -b flag ignored explicit test/bench ids.
This allows debugging strategies such as binary searching for the point of "failure", which may be more complex than simply failing an assert.
geky removed the "needs minor version" (new functionality only allowed in minor versions) label on Apr 26, 2023
This PR brings in a number of changes to how littlefs is tested and measured.
Originally, the motivation was to add a method for benchmarking the filesystem, to lay the groundwork for future performance improvements, but the scope ended up growing to include a number of fixes/improvements to general littlefs testing.
Reworked test framework, no. 3
The test framework gets a rework again, taking what worked well in the current test framework and throwing out what doesn't.
The main goals behind this rework were to 1. simplify the framework, even if it means more boilerplate, as this should make it easier to extend with new features, and 2. run the tests as fast as possible.
Previously I've disregarded test performance, worried that a focus on test performance risks complexity and difficulty in understanding the system being debugged, but my perspective is changing, as faster tests => more tests => more confidence => ~~the dark side~~ a safer filesystem. If you've told me previously to parallelize the tests, etc, this is the part where you can say you told me so.

Tests incrementally compile, and we don't rebuild lfs.c for every suite
Previously the test's build system and runner was all self-contained in test.py. On one hand this meant you only needed test.py to build/run the tests, but on the other hand this design was confusing, limiting, and just all around problematic. One big issue was that, being outside of the build system, tests couldn't be built incrementally and every test suite needed a custom built version of lfs.c. This led to a slow debugging experience as each change to lfs.c needed at least 16 recompilations.
Now the test framework is integrated into the Makefile with separate build steps for applying prettyasserts.py and other scripts, all of which can be built incrementally, significantly reducing the time spent waiting for tests to recompile.
runners/test_runner is now its own standalone application
Previously any extra features/configuration had to be built into the test binaries during compilation. Now there is an explicit test_runner.c which can contain high-level test features that can be engaged at runtime through additional flags.
This makes it easier to add new test features, but also makes it easier to debug the test_runner itself, as it's no longer hidden inside test.py.
The actual tests are provided at link-time using a custom linker section, and are still generated by
./scripts/test.py -c
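As a rough sketch of the link-time registration idea (hypothetical struct and section names, not the actual generated code), each generated test case can drop a descriptor into a dedicated section, and the runner walks that section at startup; the `__start_`/`__stop_` bound symbols are the ones GNU ld generates for sections with C-identifier names:

```c
#include <stddef.h>

// hypothetical descriptor emitted by the test-generation script
struct test_case {
    const char *name;
    void (*run)(void);
};

// linker-provided bounds of the custom section
extern const struct test_case __start_test_cases[];
extern const struct test_case __stop_test_cases[];

// a generated test case registers itself by placing its descriptor
// in the "test_cases" section
__attribute__((section("test_cases"), used))
static const struct test_case test_hello = {
    .name = "test_hello",
    .run = NULL,  // would point at the generated test function
};

// the runner can then iterate over every linked-in test case
static void run_all(void) {
    for (const struct test_case *t = __start_test_cases;
            t != __stop_test_cases;
            t++) {
        if (t->run) {
            t->run();
        }
    }
}
```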
Tests now avoid spawning processes as much as possible
When you find a bug in C, it often leads to undefined behavior, memory corruption, etc., making the current test process no longer sound. But you also often want to keep running tests to see if there are any trends among the test failures. To accomplish this, the previous test framework ran each test in its own process.
Unfortunately, process spawning is not really a cheap operation. And with most tests not failing (hopefully), this ends up wasting a significant amount of time just spawning processes.
Now, with a more powerful test_runner, the test framework tries to run as many tests as possible in a single process, only spawning a new process when a test fails. This is all handled by scripts/test.py, which interacts with runners/test_runner, telling it which tests to run via the low-level `--step` flag.

Powerloss is now simulated with setjmp/longjmp
As a part of reducing process spawning, powerloss is directly simulated in the test_runner using setjmp/longjmp. Previously powerloss was simulated by killing and restarting the process, which is a simple, heavy-handed solution that works. Slowly.
Since there can be thousands of powerlosses in a single test, this needed to be moved into the test_runner, especially since powerloss testing is arguably the most important feature of littlefs's test framework.
As an added plus, the simulated block-device no longer needs to be persisted in the host's filesystem when powerloss testing, and can stay comfortably in the test_runner's RAM. The cost of persisting the block-device could be mitigated by using a RAM-backed tmpfs disk, but this still incurred a cost as all block-device operations would need to go through the OS.
Using setjmp/longjmp can lead to memory leaks when reentrant tests call malloc, but since littlefs uses malloc in only a handful of convenience functions (littlefs's whole goal is minimal RAM, after all), this doesn't seem to have been a problem so far.
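A minimal sketch of the setjmp/longjmp idea (hypothetical names, and the real runner's powerloss handling is more involved): the runner sets a jump point before running the test body, and the simulated block device longjmps out when its power-cycle budget is exhausted, after which the test body is re-entered to check recovery:

```c
#include <setjmp.h>

// hypothetical per-run powerloss state
static jmp_buf powerloss_jmp;
static int power_cycles;

// called by the simulated block device on every prog/erase
static void bd_maybe_powerloss(void) {
    if (power_cycles > 0 && --power_cycles == 0) {
        // "power is lost": unwind back to the runner
        longjmp(powerloss_jmp, 1);
    }
}

// the runner wraps each reentrant test body with a jump point
static void run_with_powerloss(void (*test_body)(void), int cycles) {
    power_cycles = cycles;
    if (setjmp(powerloss_jmp) == 0) {
        test_body();  // ran to completion with no powerloss
    } else {
        // powerloss happened; re-enter the test body so it can verify
        // the filesystem mounts and recovers correctly (a real runner
        // would loop here to inject further powerlosses)
        test_body();
    }
}
```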
Tests now run in parallel
Perhaps the lowest-hanging fruit, tests now run in parallel.
The exact implementation here is a bit naive/suboptimal, giving each process n/m tests to run for n tests and m cores, but this keeps the process/thread management in the high-level test.py python layer, simplifying thread management and avoiding a multi-threaded test_runner.
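Conceptually the split is just a strided range over the flat list of test permutations, along the lines of this sketch (a hypothetical illustration of how --start/--step could divide the work, not the actual runner code):

```c
#include <stddef.h>

// run every "step"-th permutation beginning at "start"; m runner
// processes with start = 0..m-1 and step = m cover all permutations
// exactly once between them
static void run_slice(size_t total, size_t start, size_t step,
        void (*run_perm)(size_t)) {
    for (size_t i = start; i < total; i += step) {
        run_perm(i);
    }
}
```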
The combination of the above improvements allows us to run the tests a lot faster, and/or cram in a lot more tests:
(Most of the new permutations are from moving the different test geometries out of CI and into the test_runner. Note the previous test framework does parallelize builds, which are included.)
Exhaustive powerloss testing
In addition to the heuristic-based powerloss testing, the new test_runner can also exhaustively search all possible powerloss scenarios for a given reentrant test.
To speed this up, the test_runner uses a simulated, copy-on-write block-device (reintroducing emubd), such that all possible code-paths in all possible powerloss scenarios are executed at most once. And, because most of the block-device's state can be shared via copy-on-write operations, each powerloss branch needs at most one additional block of memory in RAM.
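A rough sketch of the copy-on-write block sharing idea (hypothetical types, not lfs_emubd's actual implementation): blocks are reference-counted, forking a powerloss branch just copies the pointer array and bumps refcounts, and a block is only duplicated when written while shared:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// a reference-counted block of the simulated disk
typedef struct cow_block {
    uint32_t refs;
    uint8_t data[];  // block_size bytes
} cow_block_t;

// a disk snapshot is just an array of block pointers; forking a
// powerloss branch copies the pointer array and bumps refcounts,
// costing no extra block memory until a shared block is written
typedef struct cow_disk {
    size_t block_size;
    size_t block_count;
    cow_block_t **blocks;
} cow_disk_t;

static void cow_write(cow_disk_t *d, size_t i,
        const void *buf, size_t off, size_t size) {
    cow_block_t *b = d->blocks[i];
    if (b->refs > 1) {
        // block is shared with another branch, copy before writing
        // (error handling omitted in this sketch)
        cow_block_t *copy = malloc(sizeof(cow_block_t) + d->block_size);
        copy->refs = 1;
        memcpy(copy->data, b->data, d->block_size);
        b->refs -= 1;
        d->blocks[i] = b = copy;
    }
    memcpy(&b->data[off], buf, size);
}
```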
The runtime still grows exponentially, and we each have a finite lifetime, so it will be more useful to exhaustively search a bounded number of powerlosses. Here's a run of all possible 5-deep powerlosses in the test_move test suite:
Since it can be a bit annoying to wait 15 minutes to reproduce a test failure, each powerloss scenario is encoded in a leb16 suffix appended to the current test identifier. This, combined with a leb16-encoding of the test's configuration and the test's name, can uniquely identify and reproduce any test run in the test_runner:
So once a failing test scenario is found, the exact state of the failure can be quickly reproduced for debugging:
Unfortunately, the current tests are not the most well designed for exhaustive powerloss testing. Some of them, test_files and test_interspersed specifically, write large files a byte at a time. Under exhaustive powerloss testing, these result in, well, a lot of powerlosses, but outside of the writes with data-structure changes, don't reveal anything interesting. This is something that can probably be improved over time.
Exhaustively testing all powerlosses at a depth of 1 takes 12.79 minutes with 84,484 total powerlosses.
Exhaustively testing all powerlosses at a depth of 2 takes at least 4 days, and is still running... I'll let you know when it finishes...
scripts/bench.py and runners/bench_runner
This PR introduces scripts/bench.py and runners/bench_runner, siblings to scripts/test.py and runners/test_runner, for measuring the performance of littlefs. Instead of reporting pass/fail, the bench_runner reports the total number of bytes read, programmed, and erased during a bench case. This can be useful for comparing different littlefs implementations, as these numbers map directly to hardware-dependent performance in IO-bound applications.
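The measurement itself can be as simple as accumulating byte counts in the simulated block device's hooks; a hedged sketch of the idea (hypothetical struct and helpers, not the actual emubd code):

```c
#include <stdint.h>

// hypothetical counters accumulated by the simulated block device
typedef struct bench_io {
    uint64_t readed;  // total bytes read
    uint64_t proged;  // total bytes programmed
    uint64_t erased;  // total bytes erased
} bench_io_t;

// called from the simulated bd's read/prog/erase implementations
static inline void bench_count_read(bench_io_t *io, uint64_t size)  { io->readed += size; }
static inline void bench_count_prog(bench_io_t *io, uint64_t size)  { io->proged += size; }
static inline void bench_count_erase(bench_io_t *io, uint64_t size) { io->erased += size; }
```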
One feature that makes this useful, added to both the bench_runner and test_runner, is a flexible configuration system evaluated at runtime. This has the downside of limiting configurable bench/test defines to `uintmax_t` integers, but makes it easy to quickly test/compare/reproduce different configurations:

At the moment I've only added a handful of benchmarks, though the number may increase in the future. The goal isn't to maintain a fully cohesive benchmark suite, as much as it is to have a set of tools for analyzing specific performance bottlenecks.
Reworked scripts/summary.py and other scripts to be a bit more flexible
This mainly means scripts/summary.py is no longer hard-wired to work with the compile-time measurements, allowing it to be used with other results such as benchmarks, though this comes with the cost of a large number of flags for controlling the output.
Each measurement script also now comes with a `*-diff` Makefile rule for quick comparisons.

Reworked scripts/cov.py to take advantage of the --json-format introduced in GCC 9
It's a bit concerning that this was a breaking change in gcov's API, albeit on a major version, but the new --json-format is much easier to work with.
It's also worth noting this PR includes a change in ideology around coverage measurement. Instead of collecting coverage from as many sources as possible in CI, coverage is only collected from the central `make test` run. This will result in lower coverage numbers than previously, but these are the coverage numbers we actually care about: test coverage via easy-to-reproduce and easy-to-isolate tests.

This also simplifies coverage collection in CI, which is a plus.
scripts/perf.py and scripts/perfbd.py
perf.py was added as an experiment with Linux's perf tool, which uses an interesting method of sampling performance counters to build an understanding of the performance of a system. Unfortunately this isn't the most useful measurement for littlefs, as we should expect littlefs's performance to be dominated by IO overhead. But it may still be useful for tracking down CPU bottlenecks.
perfbd.py takes the ideas in Linux's perf tool and applies them to the bench_runner. Instead of sampling performance counters, we can sample littlefs's trace output to find low-level block-device operations. Combining this with stack traces provided by the backtrace function, we can propagate IO cost to their callers, building a useful map of the source of IO operations in a given benchmark run:
It's worth noting that these numbers are samples. They are a subset and don't add up to the total IO cost of the benchmark. But they are still useful as a metric for understanding benchmark performance.
You could parse the entire trace output without sampling, but this would be quite slow and not really show you any more info.
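As a hedged illustration of how a runner could attach call-stack context to trace output for later sampling, using glibc's backtrace/backtrace_symbols_fd (not necessarily how the bench_runner actually does it):

```c
#include <execinfo.h>
#include <stdio.h>

// emit a small stack trace alongside a block-device trace event, so a
// post-processing script can attribute the IO cost to its callers
static void trace_bd_read(unsigned block, unsigned off, unsigned size) {
    printf("bd_read(%u, %u, %u)\n", block, off, size);

    void *frames[16];
    int n = backtrace(frames, 16);
    backtrace_symbols_fd(frames, n, fileno(stdout));
}
```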
scripts/plot.py and scripts/plotmpl.py
Added plot.py and plotmpl.py for quick plotting of littlefs measurements in the terminal and with Matplotlib. I think these will mostly be useful for looking for growth rates in benchmark results. And also future documentation.
scripts/tracebd.py, scripts/tailpipe.py, scripts/teepipe.py
These are some extra scripts for interacting with/viewing littlefs's trace output.
tailpipe.py and teepipe.py behave similarly to Unix's tail and tee programs, but work a bit better with Unix pipes, with resumability and fast paging.
The most interesting script is tracebd.py, which parses littlefs's trace output for block-device operations and renders it as ascii art. I've used this sort of block-device operation rendering previously for a quick demo and it can be surprisingly useful for understanding how filesystem operations interact with the block-device.
    $ mkfifo trace
    $ ./scripts/bench.py ./runners/bench_runner bench_file_write -Gnor -DORDER=0 -DSIZE="range(0,24576,64)" -t trace
    ...
Changed lfs.a -> liblfs.a in default build rule
The `lib*` prefix is usually required by the linker, so I suspect this won't break anything. But it's worth mentioning this change in case someone relies on the current build target.

Added a `make help` rule

I think I first saw this here; this self-documenting Makefile rule gives some of the useful Makefile rules a bit more discoverability.
Adopted script changes in CI, added a bot comment on PRs
Thanks to GitHub Actions, we have a lot of info about builds in CI. Unfortunately, statuses on GitHub have been becoming harder to find with each UI change. To help keep this info discoverable I've added an automatically generated comment that @geky-bot should post after a successful CI run. Hopefully this will contribute to PRs without being too annoying.
You can see some example comments on the PR I created on my test fork:
WIP NULL test pr geky/littlefs-test-repo#4
The increased testing did find a couple bugs: eba5553 and 0b11ce0. Their commit messages have more details on the bugs and their fixes. And with the new test identifiers I can tell you the exact state that will trigger the failures:
test_relocations_reentrant_renames:112gg261dk1e3f3:123456789abcdefg1h1i1j1k1l1m1n1o1p1q1r1s1t1u1v1g2h2i2j2k2l2m2n2o2p2q2r2s2t2
- eba5553 - found with linear heuristic powerlosses

test_dirs_many_reentrant:2gg2cb:k4o6
- 0b11ce0 - found with 2-deep exhaustive powerlosses