build: Parallelize 'run-tests' #1408

Open
wants to merge 9 commits into base: I0fced9dd8d763c87c33bca3931284ef13765ebca
Conversation

ronh-rs
Collaborator

@ronh-rs ronh-rs commented Jan 3, 2025

Because nextest does not support running doctests, we are parallelizing
the run-tests.sh run in Buildkite so that nextest runs in one job and
doctests in another. This costs a little more CPU time, but it shortens
the total wall-clock duration of a build (plus we gain the features
provided by nextest).

vassili-zarouba and others added 2 commits January 2, 2025 20:53
Fix issue with joining on textual vs. numerical or temporal columns.
We always use the left-side column to scan and the right-side column
to do the lookup, so when the right-side column is textual, the
non-textual column is converted into text in order to perform the
lookup, and the result might not be matched properly.
Consider the textual values '100', '0100', and '000100', which should
all match the numerical value 100; but when we convert the number 100
to text, we get '100', so only one match is found. If we instead
select the textual column to scan and the numerical one to do the
lookup, all the matches in the example above are found.
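The asymmetry above can be sketched as follows. This is an illustrative example, not ReadySet's actual join code; the function names are hypothetical:

```rust
// Converting the numeric side to text finds at most one spelling of the
// value, while parsing each textual value to a number matches them all.
fn matches_by_text(number: i64, text_values: &[&str]) -> Vec<String> {
    let needle = number.to_string(); // 100 -> "100"
    text_values
        .iter()
        .filter(|v| **v == needle)
        .map(|v| v.to_string())
        .collect()
}

fn matches_by_number(number: i64, text_values: &[&str]) -> Vec<String> {
    text_values
        .iter()
        .filter(|v| v.parse::<i64>() == Ok(number))
        .map(|v| v.to_string())
        .collect()
}

fn main() {
    let texts = ["100", "0100", "000100"];
    // Number-to-text lookup misses the zero-padded spellings:
    assert_eq!(matches_by_text(100, &texts), vec!["100"]);
    // Text-to-number lookup finds all three:
    assert_eq!(matches_by_number(100, &texts).len(), 3);
}
```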

Fixes: READ-4974
Change-Id: I5f956c9692e9d859de38fb783da749a02c1d191c
Reviewed-on: https://gerrit.readyset.name/c/readyset/+/8454
Tested-by: Buildkite CI
Reviewed-by: Jason Brown <jason.b@readyset.io>
During a restart of the replicator, we need to find the correct
position from which to start streaming events. We have two concepts of
position: the schema position, which is updated after snapshot, and
the per-table position, which is updated as each table receives
events, or when we raise the minimum lower bound using
ReplicationAction::LogPosition.
Our restart happens in three steps: first we attempt to snapshot new
tables, then we catch up on events between the minimum applied
position and the current upstream position, and finally we start
streaming events from the current upstream position.
To find the minimum position, we were using
ReplicationOffset::max_offset, whose name says it returns the max
offset while its documentation says it returns the min offset. The
function actually returns the max offset.

When starting ReadySet, we were getting the max_offset, then running
the snapshot, which updated the schema offset to the current position,
and then running catch-up. If an error occurred during catch-up, the
replicator restarted, which caused us to fetch the max_offset again;
this time it was equal to the current upstream position, so catch-up
was skipped on the second attempt.

This patch changes the logic to properly use the min offset when
starting the replicator.
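The intended behavior can be sketched like this. The names below are illustrative, not ReadySet's actual API: catch-up must resume from the minimum applied position across the schema offset and all per-table offsets, so no table's events are skipped:

```rust
// Resume point for catch-up: the smallest offset among the schema
// offset and every per-table offset.
fn min_offset(schema_offset: u64, table_offsets: &[u64]) -> u64 {
    table_offsets
        .iter()
        .copied()
        .chain(std::iter::once(schema_offset))
        .min()
        .expect("at least the schema offset is present")
}

fn main() {
    // Snapshot moved the schema offset to the upstream position (50),
    // but one table is still behind at 10. Taking the max would return
    // 50 and skip catch-up entirely; the min correctly returns 10.
    assert_eq!(min_offset(50, &[10, 42]), 10);
}
```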

Fixes: REA-5098

Change-Id: I79b28005ce53964508319222a2f45807a9cd0ac0
Reviewed-on: https://gerrit.readyset.name/c/readyset/+/8541
Tested-by: Buildkite CI
Reviewed-by: Johnathan Davis <jcd@readyset.io>
@readysetbot readysetbot force-pushed the I8211a10ba41179d2b0ed1d888588863011df5afc branch 2 times, most recently from e22980a to c37868c on January 3, 2025 16:53
mvzink added 4 commits January 3, 2025 17:31
Note that we actually will have multiple threads with the same name
(e.g. tokio threads), but we can't use the thread ID (turning it into a
`u64` is behind an unstable feature flag), so they get collected
together into one counter for all threads with the same name. This
mainly means that multiple adapter threads will be aggregated, and at
our current stage I don't think we would get anything meaningful out of
breaking that up anyhow (i.e. we couldn't identify which tokio task was
responsible anyway). So this mainly enables separating out domains to be
able to identify if the problem is in dataflow, since domains do have a
1-1 thread-name correspondence.
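The name-keyed aggregation described above can be sketched as follows; this is a simplified illustration, not the actual metrics plumbing:

```rust
use std::collections::HashMap;

// Counters are keyed by thread name rather than thread ID, so threads
// that share a name (e.g. tokio workers) collapse into one counter,
// while uniquely named domain threads stay separate.
fn aggregate_by_name(samples: &[(&str, u64)]) -> HashMap<String, u64> {
    let mut counters: HashMap<String, u64> = HashMap::new();
    for (name, value) in samples {
        *counters.entry((*name).to_string()).or_default() += value;
    }
    counters
}

fn main() {
    let samples = [
        ("tokio-runtime-worker", 5),
        ("tokio-runtime-worker", 7),
        ("domain-0", 3),
    ];
    let counters = aggregate_by_name(&samples);
    // Both tokio workers are aggregated into a single counter:
    assert_eq!(counters["tokio-runtime-worker"], 12);
    // The domain thread keeps its own counter:
    assert_eq!(counters["domain-0"], 3);
}
```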

Also note that although we drop the counters for threads that are no
longer reported, the metrics crate has no facility for "unregistering"
metrics, so those counters will be reset to 0 and then continuously
reported. This will likely lead to clutter as multiple compaction
threads, specifically, will continue to be reported after they have
ended.

Change-Id: Idf7a08b18fb85681e368c07c3d94ef5d4f4ac0bf
Reviewed-on: https://gerrit.readyset.name/c/readyset/+/8418
Reviewed-by: Jason Brown <jason.b@readyset.io>
Tested-by: Buildkite CI
This also adds a default `MALLOC_CONF` which enables profiling on
startup, but leaves it inactive, with a default sample rate and interval
that in my limited testing results in reasonable fidelity with about 3
dumps per minute during a small workload with lots of upqueries and
evictions.

Change-Id: I2e6b41e137e152ba21f2fda5cdf9e1b973127f7a
Reviewed-on: https://gerrit.readyset.name/c/readyset/+/8502
Tested-by: Buildkite CI
Reviewed-by: Jason Brown <jason.b@readyset.io>
Currently can only be enabled, disabled, and manually dumped; other
options like sample rate and dumping to a file at an interval can still
only be configured via the `MALLOC_CONF` environment variable (or
`_RJEM_MALLOC_CONF` on macOS).

Can be used like so:

```sh
$ curl -X POST localhost:6033/jemalloc/profiling/activate
Memory profiling activated
$ curl localhost:6033/jemalloc/profiling/dump > jeprof.heap
$ jeprof --svg target/release/readyset jeprof.heap > jeprof.heap.svg
$ curl -X POST localhost:6033/jemalloc/profiling/deactivate
Memory profiling deactivated
```

Change-Id: Ifc8efa3ccd67d01123b42e0588a45f2c26c947cd
Reviewed-on: https://gerrit.readyset.name/c/readyset/+/8503
Reviewed-by: Jason Brown <jason.b@readyset.io>
Tested-by: Buildkite CI
I don't see a worthwhile way to deduplicate this other than putting the
HTTP handling in the readyset-alloc crate, which doesn't seem ideal, so
this is just copied and pasted.

As explained in the comment, having this in both places (port 6033 and
6034, effectively) might give the false impression that you can toggle
profiling on or off just for the adapter and not the server, and vice
versa, but that is not the case in standalone deployments where they are
in the same process and share a jemalloc instance.

Despite the duplication and the risk of confusion, I think it's worth
having it in both places so that one day when we have a distributed
deployment, we aren't asking ourselves why we need a restart to turn on
profiling in the adapter but not the server.

Change-Id: I89ecf1b284893148f67933cd3f12acbccdbf983d
Reviewed-on: https://gerrit.readyset.name/c/readyset/+/8504
Tested-by: Buildkite CI
Reviewed-by: Jason Brown <jason.b@readyset.io>
@readysetbot readysetbot force-pushed the I8211a10ba41179d2b0ed1d888588863011df5afc branch from c37868c to 02e4941 on January 3, 2025 17:34
mvzink and others added 3 commits January 3, 2025 20:56
`cargo test` only runs tests in parallel within one test binary. [`cargo
nextest`] can execute tests in parallel across crates, but in doing so
it executes each test in its own process (supposedly for reduced
flakiness). This means our existing in-process mutexes provided by the
`serial_test` crate don't work when running under nextest.

We write our own `#[serial]` procedural macro which, in addition to
adding the same existing in-process mutex from the `serial_test` crate,
also renames tests such that they can be identified by the nextest
filterset DSL. (Although it has a limited ability to parse `cfg` attrs,
there is no facility for inspecting arbitrary attrs attached to the test
or anything like that; otherwise we could just use the existing
`#[serial]` and somehow specify it in the filterset.) This allows
creating [test groups] which don't allow parallel execution within them.
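A rough sketch of what such a `#[serial(mysql)]` might expand to follows. The naming scheme, mutex layout, and identifiers here are hypothetical, not the real macro's internals:

```rust
use std::sync::{Mutex, MutexGuard};

// One in-process lock per serial group, preserving `cargo test` behavior.
static MYSQL_SERIAL: Mutex<()> = Mutex::new(());

// The test is renamed with a recognizable prefix so a nextest filterset
// such as `test(/^serial_mysql_/)` can route it into a serial test group.
fn serial_mysql_example_test() {
    let _guard: MutexGuard<()> =
        MYSQL_SERIAL.lock().unwrap_or_else(|e| e.into_inner());
    // ... test body runs while no other serial_mysql_* test holds the lock
}

fn main() {
    serial_mysql_example_test();
    // Once the guard is dropped, the lock is free for the next test.
    assert!(MYSQL_SERIAL.try_lock().is_ok());
}
```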

We then get the benefits of parallel test execution across crates; so
for the bulk of the time, 1 test in the mysql group is running and 1
test in the postgres group is running, getting us to (in a simple test
on an M3 Max) 181% CPU usage and a commensurate 81% speedup in clock
time.

```sh
$ time TZ=UTC RUN_SLOW_TESTS=1 cargo nextest run \
    -E 'not package(readyset-clustertest)'
...
     Summary [ 393.304s] 2667 tests run: 2667 passed, 84 skipped
620.70s user 207.41s system 181% cpu 7:35.10 total

$ time TZ=UTC RUN_SLOW_TESTS=1 cargo test --all \
    --exclude readyset-clustertest
...
710.37s user 154.68s system 104% cpu 13:44.92 total
```

[`cargo nextest`]: https://nexte.st/
[test groups]: https://nexte.st/docs/configuration/test-groups/

Change-Id: I75a6b762bb18b4ebe7927cfe60eb7cefea038ee8
Reviewed-on: https://gerrit.readyset.name/c/readyset/+/8507
Tested-by: Buildkite CI
Reviewed-by: Johnathan Davis <jcd@readyset.io>
This won't make a huge runtime difference in CI, but it exercises the
new nextest configuration.

Change-Id: I0fced9dd8d763c87c33bca3931284ef13765ebca
Reviewed-on: https://gerrit.readyset.name/c/readyset/+/8544
Tested-by: Buildkite CI
Reviewed-by: Ron Hough <ron@readyset.io>
Because nextest does not support running doctests, we are parallelizing
the run-tests.sh run in Buildkite so that nextest runs in one job and
doctests in another. This costs a little more CPU time, but it shortens
the total wall-clock duration of a build (plus we gain the features
provided by nextest).

Change-Id: I8211a10ba41179d2b0ed1d888588863011df5afc
Reviewed-on: https://gerrit.readyset.name/c/readyset/+/8547
Tested-by: Buildkite CI
Reviewed-by: Michael Zink <michael.z@readyset.io>
@readysetbot readysetbot force-pushed the I8211a10ba41179d2b0ed1d888588863011df5afc branch from 02e4941 to 57c4997 on January 3, 2025 22:27
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 3 committers has signed the CLA.

✅ altmannmarcelo
❌ vassili-zarouba
❌ mvzink
