
refactor: fix flaky unit tests #3338

Merged
17 commits merged on Jan 27, 2025

Conversation


@jarhodes314 (Contributor) commented Jan 17, 2025

Proposed changes

This fixes some flaky unit tests I observed locally. One set was the mqtt_channel tests, which were failing due to a genuine bug: the connection was closed before all messages were published. The other was a mapper test that implicitly assumed messages would be delivered in a fixed order.

A few other things that have changed in this PR:

  • I've added --status-level fail to cargo nextest run in just test. This stops cargo nextest from listing every passing test name, which was previously obscuring the error output in my terminal.
  • I've removed serial-test from mqtt_channel in favour of using distinct topic/session names against a single broker in each test.
  • Replaced some std::env::var calls with env! and some relative file paths with absolute paths, so the test processes can be invoked without running them under cargo test. This is important for using tools like cargo stress (see the sketch below).
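To illustrate that last point, here's a minimal sketch (not the PR's exact code; the `test_data` path is hypothetical). `env!` resolves `CARGO_MANIFEST_DIR` at compile time, baking the path into the binary, so the test process no longer needs cargo to set the variable at run time:

```rust
use std::path::PathBuf;

// Compile-time lookup: the manifest path is baked into the test binary,
// so it works even when the binary is invoked directly (e.g. by cargo stress).
fn test_data_dir() -> PathBuf {
    PathBuf::from(env!("CARGO_MANIFEST_DIR")).join("test_data")
}

// Run-time lookup: fails unless the invoking process (normally cargo)
// has set CARGO_MANIFEST_DIR in the environment.
fn test_data_dir_runtime() -> Option<PathBuf> {
    std::env::var("CARGO_MANIFEST_DIR")
        .ok()
        .map(|dir| PathBuf::from(dir).join("test_data"))
}

fn main() {
    println!("{}", test_data_dir().display());
    println!("{:?}", test_data_dir_runtime());
}
```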

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

I don't believe either of these had associated issues.

Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s)
  • I ran cargo fmt as mentioned in CODING_GUIDELINES
  • I used cargo clippy as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

```diff
  let mut servers = HashMap::new();
  servers.insert("1".to_string(), server_config);

  rumqttd::Config {
      id: 0,
      router: router_config,
      cluster: None,
-     console: Some(console_settings),
+     console: None,
```
jarhodes314 (Contributor, Author):

I don't know why this was ever enabled; I haven't observed us using it anywhere.

albinsuresh (Contributor) commented Jan 20, 2025:

It appears to be a defensive diagnostic mechanism, kept in place to debug the server when a rare flaky failure occurs, since re-running the test with the console turned on may not reproduce the failure.

Contributor:

We can safely ignore these console settings.


```rust
// The messages might get processed out of order; we don't care about the
// ordering of the messages
requests.sort();
```
jarhodes314 (Contributor, Author):

This probably isn't technically necessary, as I imagine the requests are always sent in the order of the original c8y message, but I think the point still stands that the ordering is irrelevant.
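As a hypothetical illustration of the pattern this enables (not the PR's code), sorting both sides makes the assertion insensitive to delivery order:

```rust
// Hypothetical helper: assert two collections contain the same elements,
// regardless of the order in which the messages were processed.
fn assert_same_elements(mut actual: Vec<String>, mut expected: Vec<String>) {
    actual.sort();
    expected.sort();
    assert_eq!(actual, expected);
}

fn main() {
    let received = vec!["msg 2".to_string(), "msg 1".to_string()];
    let expected = vec!["msg 1".to_string(), "msg 2".to_string()];
    assert_same_elements(received, expected); // passes despite differing order
}
```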


codecov bot commented Jan 17, 2025

Codecov Report

Attention: Patch coverage is 93.16770% with 11 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| crates/common/mqtt_channel/src/tests.rs | 90.00% | 0 Missing and 9 partials ⚠️ |
| crates/common/mqtt_channel/src/connection.rs | 93.54% | 1 Missing and 1 partial ⚠️ |



github-actions bot commented Jan 17, 2025

Robot Results

| ✅ Passed | ❌ Failed | ⏭️ Skipped | Total | Pass % | ⏱️ Duration |
| --- | --- | --- | --- | --- | --- |
| 556 | 0 | 2 | 556 | 100 | 1h30m17.154075999s |

albinsuresh (Contributor) left a comment:

The main fix, which ensures that all the messages to be published are properly flushed with acks, looks fine. But the test failures look concerning. I have some queries/comments on the other bits as well.

```rust
Ok(Ok(())) => {
    // I don't know why it happened, but I have observed this once while testing
    // So just log the error and retry starting the broker on a new port
    eprintln!("MQTT-TEST ERROR: `broker.start()` should not terminate until after `spawn_broker` returns")
```
Contributor:

But when this happens and we retry the broker start on the next iteration, the client thread that was started (line 137) in the last iteration must also be aborted somehow, right? Because once that client thread enters the loop, I don't see how it can break out, even on a connection error, because of the very specific if let check.

jarhodes314 (Contributor, Author):

The client thread is not part of the loop; it's only started once we have a healthy broker.
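A sketch of that shape (hypothetical; `try_start_broker` is a stand-in, not the PR's helper): the retry loop spawns nothing, and the client thread only exists once a broker has started successfully:

```rust
use std::thread;

// Stand-in for starting a rumqttd broker: pretend even-numbered ports work.
fn try_start_broker(port: u16) -> Result<u16, String> {
    if port % 2 == 0 {
        Ok(port)
    } else {
        Err(format!("broker terminated unexpectedly on port {port}"))
    }
}

fn main() {
    let mut port = 1883;
    // The retry loop concerns only the broker; a failed iteration leaves
    // nothing running behind it.
    let broker_port = loop {
        match try_start_broker(port) {
            Ok(p) => break p,
            Err(e) => {
                eprintln!("MQTT-TEST ERROR: {e}; retrying on a new port");
                port += 1;
            }
        }
    };
    // Only now, with a healthy broker, is the client thread started.
    let client = thread::spawn(move || {
        println!("client connected to broker on port {broker_port}");
    });
    client.join().unwrap();
}
```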


didier-wenzek (Contributor) left a comment:

Thank you for finding and fixing this bug where the connection was closed a bit too soon. However, one has to do the same for QoS 2.
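For background (MQTT spec terminology, not a quote of the PR's code): which broker packet marks a publish as fully complete depends on the QoS level, so a flush that waits only for QoS 1 acks would still close too early under QoS 2:

```rust
/// Sketch: the packet that completes a publish, per QoS level.
#[derive(Debug, Clone, Copy)]
enum QoS {
    AtMostOnce,  // QoS 0
    AtLeastOnce, // QoS 1
    ExactlyOnce, // QoS 2
}

fn completion_packet(qos: QoS) -> Option<&'static str> {
    match qos {
        QoS::AtMostOnce => None,             // fire and forget: nothing to await
        QoS::AtLeastOnce => Some("PUBACK"),
        QoS::ExactlyOnce => Some("PUBCOMP"), // end of PUBREC/PUBREL/PUBCOMP
    }
}

fn main() {
    for qos in [QoS::AtMostOnce, QoS::AtLeastOnce, QoS::ExactlyOnce] {
        println!("{qos:?}: wait for {:?}", completion_packet(qos));
    }
}
```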

Co-authored-by: Didier Wenzek <didier.wenzek@free.fr>
jarhodes314 requested a review from a team as a code owner, January 24, 2025 16:22
didier-wenzek (Contributor) left a comment:

Using a semaphore with a unique permit is a nice idea to notify the receiver loop from the sender loop that the connection has to be closed. However, I don't fully get how the receiver loop closes when there is a stream of input messages.
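A self-contained sketch of the pattern under discussion (illustrative only, with stand-ins for the MQTT event loop; not the PR's actual code). The semaphore starts with zero permits; the sender adds the unique permit once everything is published, and the receiver's `tokio::select!` notices it between incoming events:

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    // Zero permits: the receiver cannot acquire until the sender is done.
    let done = Arc::new(Semaphore::new(0));

    let sender_done = done.clone();
    let sender = tokio::spawn(async move {
        // Stand-in for publishing the remaining messages and awaiting acks.
        tokio::time::sleep(Duration::from_millis(50)).await;
        // Releasing the unique permit signals that the connection can close.
        sender_done.add_permits(1);
    });

    let receiver = tokio::spawn(async move {
        let mut poll = tokio::time::interval(Duration::from_millis(10));
        loop {
            tokio::select! {
                // Stand-in for polling the MQTT event loop for input messages.
                _ = poll.tick() => { /* handle an incoming event */ }
                // The sender released its permit: break out and close.
                _ = done.acquire() => break,
            }
        }
    });

    sender.await.unwrap();
    receiver.await.unwrap();
}
```

This is where tokio::select! (the commit referenced below) helps: even with a steady stream of input messages, the acquire arm is polled on every iteration and fires as soon as the permit appears.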

```diff
@@ -17,14 +17,13 @@
  log = { workspace = true }
  rumqttc = { workspace = true }
  serde = { workspace = true }
  thiserror = { workspace = true }
- tokio = { workspace = true, features = ["rt", "time"] }
+ tokio = { workspace = true, features = ["rt", "time", "rt-multi-thread"] }
```
Contributor:

This is not used.

Suggested change:
```diff
- tokio = { workspace = true, features = ["rt", "time", "rt-multi-thread"] }
+ tokio = { workspace = true, features = ["rt", "time"] }
```

jarhodes314 (Contributor, Author):

It is used when running doctests, as we use #[tokio::main] there. Trying to test in isolation without it breaks, and that made checking for flakiness hard.
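For context, `#[tokio::main]` defaults to the multi-threaded runtime flavour, which only compiles when the `rt-multi-thread` feature is enabled. Roughly:

```rust
// What a doctest using #[tokio::main] relies on: the default (multi_thread)
// flavour of the macro requires the rt-multi-thread feature.
#[tokio::main]
async fn main() {
    println!("running on tokio's multi-thread runtime");
}

// The attribute expands to approximately:
// fn main() {
//     tokio::runtime::Builder::new_multi_thread()
//         .enable_all()
//         .build()
//         .unwrap()
//         .block_on(async { println!("...") })
// }
```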

```diff
  zeroize = { workspace = true }

  [dev-dependencies]
  anyhow = { workspace = true }
  mqtt_tests = { workspace = true }
  serde_json = { workspace = true }
- serial_test = { workspace = true }
```
Contributor:

Nice!

Comment on lines +9 to +13:
```rust
/// Prefixes a topic/session name with a module path and line number
///
/// This allows multiple tests to share an MQTT broker, allowing them to
/// run concurrently within a single test process.
macro_rules! uniquify {
```
Contributor:

Nice idea!
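A minimal sketch of how such a macro can work (illustrative; the PR's actual body may differ): built-in macros like `module_path!` and `line!` expand to literals, so `concat!` can join them into a unique compile-time string per call site:

```rust
// Prefix a literal name with the call site's module path and line number,
// producing a distinct &'static str for every call site.
macro_rules! uniquify {
    ($name:literal) => {
        concat!(module_path!(), "-", line!(), "-", $name)
    };
}

fn main() {
    let topic_a = uniquify!("some/topic");
    let topic_b = uniquify!("some/topic");
    // Different line numbers make the two names distinct, so two tests can
    // safely share one broker.
    assert_ne!(topic_a, topic_b);
    println!("{topic_a}\n{topic_b}");
}
```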

broker.publish(topic, "msg 1").await?;
broker.publish(topic, "msg 2").await?;
broker.publish(topic, "msg 3").await?;
broker.publish(topic, "msg 1").await.unwrap();
Contributor:

Curious: why do you prefer to unwrap() errors in tests?

On my side, I prefer to raise such errors, to keep the test closer to real code.

jarhodes314 (Contributor, Author):

Because there's no panic, ? gives no indication of which line failed, and that makes cases like this impossible to debug (if I don't know which message didn't publish, I've got almost no clue as to the nature of the bug).
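A tiny illustration of the difference (hypothetical example):

```rust
fn main() {
    let result: Result<(), &str> = Err("publish failed");
    // unwrap() panics with the call site's location, e.g.:
    //   thread 'main' panicked at src/main.rs:6:12:
    //   called `Result::unwrap()` on an `Err` value: "publish failed"
    // whereas `?` in a test returning Result would surface only the error
    // value, with no line number pointing at the failing publish.
    result.unwrap();
}
```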

didier-wenzek (Contributor) left a comment:

Approved. Thank you for your perseverance in finding the most appropriate solution to properly detect the end of the connection.

didier-wenzek (Contributor):

I confirm my approval after this commit using tokio::select!: 2483809

jarhodes314 added this pull request to the merge queue, Jan 27, 2025
github-merge-queue bot removed this pull request from the merge queue due to failed status checks, Jan 27, 2025
jarhodes314 added this pull request to the merge queue, Jan 27, 2025
Merged via the queue into thin-edge:main with commit 09a0d00, Jan 27, 2025
33 checks passed
Labels: theme:testing (Theme: Testing)
Projects: None yet
3 participants