Investigate and fix sidechain integration test failures #1087

murerfel · 2022-11-01T14:38:34Z

This PR investigates and fixes the intermittent failures we see in our integration tests.
Related GH issue: #1023

Observations:

Getters in the integration test scripts return wrong results.
We have nonce clashes on rare occasions.
Slot times are exceeded in about 5% of cases, with one of two effects:
- The proposed block is discarded before it is broadcast, or
- the block is broadcast but reaches the import queue after production of the next block has already started. This results in a 'block is already imported' warning.
- Neither effect is a fork, and sidechain integrity is maintained. What we lose is efficiency and responsiveness.
Calls from the integration test scripts are executed simultaneously.

Conclusions

I found that we removed the sleeps from our integration test scripts at some point. Those sleeps were placed between executing calls and getters.
- When multiple distinct calls are sent to 2 workers simultaneously, a nonce clash can occur: both workers try to execute a call and increment the same account nonce at the same time (before the block can be imported).
A couple of weeks ago, we changed getter execution to be immediate, so getters no longer go through the TOP pool.
- As a result, when we send a call and a getter almost simultaneously, the getter can return a result before the call has been executed and included in a sidechain block.
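The nonce clash can be illustrated with a deliberately simplified model. This is only a sketch: `Call`, `State`, `next_nonce` and `import_block` are hypothetical names made up for the example and are not the worker's actual types or nonce handling. It assumes that a call is only accepted when its nonce equals the account's current nonce.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a signed call; only the fields needed here.
#[derive(Clone, Debug, PartialEq)]
pub struct Call {
    pub sender: &'static str,
    pub nonce: u32,
}

/// Toy state: the committed nonce per account.
#[derive(Default)]
pub struct State {
    nonces: HashMap<&'static str, u32>,
}

impl State {
    /// Nonce that the next call of `account` must carry (0 for unknown accounts).
    pub fn next_nonce(&self, account: &'static str) -> u32 {
        *self.nonces.get(account).unwrap_or(&0)
    }

    /// Apply calls in order, as a block import would in this model.
    /// Returns the calls rejected because their nonce no longer matches.
    pub fn import_block(&mut self, calls: &[Call]) -> Vec<Call> {
        let mut rejected = Vec::new();
        for call in calls {
            let expected = self.next_nonce(call.sender);
            if call.nonce == expected {
                self.nonces.insert(call.sender, expected + 1);
            } else {
                rejected.push(call.clone());
            }
        }
        rejected
    }
}
```

In this model, two calls created back to back both read nonce 0 from `next_nonce`, and the second one is rejected by `import_block`. If the second call is only created after the first one has been imported, it reads nonce 1 and is accepted.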

Solution

Re-introduce the sleeps in our integration test scripts, so that we wait for calls to be executed and included in a sidechain block before calling a getter to verify their effect.
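Why the wait matters can be shown with a toy model. It is a sketch under assumptions: `Sidechain`, `submit_call`, `get_balance` and `produce_block` are hypothetical names, and the `pending` vector only stands in for the TOP pool. Calls are queued until the next block, while getters read the current state immediately.

```rust
/// Toy model: calls wait in a pool until the next block is produced,
/// getters read the current state at once.
#[derive(Default)]
pub struct Sidechain {
    balance: u64,
    /// Queued transfer amounts (stand-in for the TOP pool).
    pending: Vec<u64>,
}

impl Sidechain {
    /// Queue a call; it has no effect on state until a block is produced.
    pub fn submit_call(&mut self, amount: u64) {
        self.pending.push(amount);
    }

    /// Getter: answers immediately from the current state.
    pub fn get_balance(&self) -> u64 {
        self.balance
    }

    /// Execute all queued calls, as block production would in this model.
    pub fn produce_block(&mut self) {
        for amount in self.pending.drain(..) {
            self.balance += amount;
        }
    }
}
```

A getter issued right after `submit_call` still sees the old balance. Only after `produce_block` has run does it see the effect of the call, which is the window the sleeps in the test scripts are meant to cover.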

Closes #1023

murerfel · 2022-11-02T10:35:31Z

docker/docker-compose.yml

-    restart: always
+    restart: "no"


Just a minor change, discovered while running the tests in docker. It's confusing in a test setting when the worker automatically restarts after a crash.

murerfel · 2022-11-02T10:36:23Z

enclave-runtime/src/top_pool_execution.rs

+			if slot.duration_remaining().is_none() {
+				warn!("No time remaining in slot, skipping AURA execution");
+				return Ok(())
+			}
+
+			log_remaining_slot_duration(&slot, "Before AURA");


I added some log messages and timers to help diagnose the timings and phases of the sidechain slot

murerfel · 2022-11-02T10:37:50Z

sidechain/consensus/aura/src/lib.rs

-pub const BLOCK_PROPOSAL_SLOT_PORTION: f32 = 0.8;
+pub const BLOCK_PROPOSAL_SLOT_PORTION: f32 = 0.7;


Small reduction in the fraction of the slot time we use for executing calls. When running on CI, I found that we often don't have enough time to broadcast a sidechain block before the next slot starts (resulting in duplicate block numbers and discarded blocks)
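For a feeling of what the fraction means in time, here is a small sketch. The slot length of 1000 ms is an assumption for illustration only (the actual `SLOT_DURATION` is not shown here), and `slot_budget` is a hypothetical helper.

```rust
use std::time::Duration;

/// Assumed slot length, for illustration only.
pub const SLOT_DURATION: Duration = Duration::from_millis(1000);

/// Split a slot into the time used for executing calls (proposing)
/// and the time left for everything else, e.g. broadcasting the block.
/// `portion` is expected to be between 0.0 and 1.0.
pub fn slot_budget(slot: Duration, portion: f32) -> (Duration, Duration) {
    let proposing = slot.mul_f32(portion);
    // saturating_sub avoids a panic should rounding push `proposing` above `slot`.
    let rest = slot.saturating_sub(proposing);
    (proposing, rest)
}
```

With these assumed numbers, going from 0.8 to 0.7 raises the time left after call execution from roughly 200 ms to roughly 300 ms per slot. The values are approximate because `mul_f32` works with `f32` precision.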

murerfel · 2022-11-02T10:39:05Z

sidechain/consensus/aura/src/lib.rs

-			proposing_remaining_duration(&slot_info, duration_now()) > SLOT_DURATION / 2
-				&& proposing_remaining_duration(&slot_info, duration_now())
-					< SLOT_DURATION.mul_f32(BLOCK_PROPOSAL_SLOT_PORTION + 0.01)
+			proposing_remaining_duration(&slot_info, duration_now())
+				< SLOT_DURATION.mul_f32(BLOCK_PROPOSAL_SLOT_PORTION + 0.01)


The first condition of this assertion (> SLOT_DURATION / 2) only holds if BLOCK_PROPOSAL_SLOT_PORTION is > 0.5, which is not guaranteed (the test failed when I lowered the value to 0.4 for testing).
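The dependency on the portion can be checked with a simplified model. This is a sketch under assumptions: `proposing_remaining` only approximates `proposing_remaining_duration` as the smaller of the remaining slot time and the proposing budget, and the function names and the 1000 ms slot used in the note below are made up for the example.

```rust
use std::cmp::min;
use std::time::Duration;

/// Assumed model: the time left for proposing is the smaller of the
/// remaining slot time and the proposing budget (slot * portion).
pub fn proposing_remaining(slot: Duration, elapsed: Duration, portion: f32) -> Duration {
    let budget = slot.mul_f32(portion);
    let slot_remaining = slot.saturating_sub(elapsed);
    min(slot_remaining, budget)
}

/// The old test condition, including the lower bound of half a slot.
pub fn old_assertion(slot: Duration, remaining: Duration, portion: f32) -> bool {
    remaining > slot / 2 && remaining < slot.mul_f32(portion + 0.01)
}

/// The new test condition, with the upper bound only.
pub fn new_assertion(slot: Duration, remaining: Duration, portion: f32) -> bool {
    remaining < slot.mul_f32(portion + 0.01)
}
```

At the start of a 1000 ms slot, a portion of 0.7 gives about 700 ms and satisfies both conditions. A portion of 0.4 gives about 400 ms, which is below half the slot, so only the new condition holds.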

murerfel · 2022-11-02T10:39:51Z

sidechain/consensus/slots/Cargo.toml

-tokio = "*"
+tokio = { version = "1.6.1", features = ["full"] }


Necessary change to allow running cargo test in this crate alone.

murerfel · 2022-11-02T10:40:35Z

sidechain/consensus/slots/src/slots.rs

+	pub fn duration_remaining(&self) -> Option<Duration> {
+		let duration_now = duration_now();
+		if self.ends_at <= duration_now {
+			return None
+		}
+		Some(self.ends_at - duration_now)
+	}


Added a convenience function to the SlotInfo struct to get the remaining duration in this slot.
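For reference, the same logic can be written as a standalone sketch that takes the current time as a parameter, which makes it easy to test without a clock. The reduced `SlotInfo` struct and the name `duration_remaining_at` are hypothetical and not part of this PR.

```rust
use std::time::Duration;

/// Reduced stand-in for SlotInfo with only the field needed here.
pub struct SlotInfo {
    pub ends_at: Duration,
}

impl SlotInfo {
    /// Remaining time in the slot at `now`, or None once the slot has ended.
    /// checked_sub returns None when `now` is past `ends_at`; the filter
    /// maps an exactly elapsed slot (zero remaining) to None as well.
    pub fn duration_remaining_at(&self, now: Duration) -> Option<Duration> {
        self.ends_at.checked_sub(now).filter(|d| !d.is_zero())
    }
}
```

The explicit check in the original is needed because subtracting a larger `Duration` from a smaller one panics. `checked_sub` expresses the same guard without a separate comparison.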

OverOrion

Thank you very much for this and ouch for those re-introduced sleeps. 😆

OverOrion

LGTM, thank you!

Felix Müller added 6 commits November 1, 2022 15:05

Check remaining time before running AURA

003763b

Fix calculation of remaining slot time for log output (cherry picked from commit 03b8616)

Reduce block proposal / calls execution fraction to 0.6 (60%)

4098596

(cherry picked from commit 1b52a2e)

Add more log output and tooling to debug sidechain timing behavior

e9817a5

(cherry picked from commit 185030d8178bafee0fd04b0760d7006cc6bd857b)

Reduce block production fraction down to 0.4 (40%)

3ba64d0

(cherry picked from commit 4299c20351ef891f1a1173333a3da74051de5971)

add more docker log output for sidechain timing

aef2437

(cherry picked from commit c814f99fd647670281c38b7aa21d3769c27f77c0)

clippy fix

4d70a0c

murerfel self-assigned this Nov 1, 2022

Felix Müller added 2 commits November 1, 2022 17:37

More log output for debugging timings on sidechain

626eb77

Re-introduce sleeps on integration tests again

1f50df6

Since getters are executed immediately and are not put into the TOP pool anymore (where order is guaranteed), we can get wrong results because of timings

murerfel mentioned this pull request Nov 2, 2022

CI sidechain integration test failed randomly #1023

Closed

Revert some of the investigation changes (log levels for example)

7f26996

Also increased the slot fraction for calls again to 0.7 (from 0.4)

murerfel marked this pull request as ready for review November 2, 2022 10:32

murerfel added the A0-core, C1-low 📌, P1-asap, F3-test and E0-breaksnothing labels Nov 2, 2022

murerfel requested review from clangenb and OverOrion November 2, 2022 10:33

murerfel commented Nov 2, 2022

murerfel changed the title ~~Investigate sidechain timings (test failures)~~ Investigate and fix sidechain integration test failures Nov 2, 2022

murerfel added the B1-releasenotes label Nov 2, 2022

OverOrion reviewed Nov 2, 2022

cli/demo_direct_call.sh (resolved)

sidechain/consensus/aura/src/lib.rs (resolved)

add comment to sleeps in integration test script

cef4ad6

murerfel requested a review from OverOrion November 7, 2022 07:29

OverOrion approved these changes Nov 7, 2022

murerfel merged commit 0ad314f into master Nov 7, 2022

murerfel deleted the fm/investigate-sidechain-timings branch November 7, 2022 07:40

murerfel mentioned this pull request Nov 7, 2022

Worker spams warnings in CI logs #1088

Open

clangenb mentioned this pull request Dec 24, 2022

Investigate flakiness of the sidechain integration test #1132

Open
