Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate and fix sidechain integration test failures #1087

Merged
merged 10 commits into from
Nov 7, 2022

Conversation

murerfel
Copy link
Contributor

@murerfel murerfel commented Nov 1, 2022

This issue is to investigate the intermittent failures we see in our integration tests.
Related GH issue: #1023

Observations:

  • Getters in the integration test scripts return wrong results
  • We have nonce clashes on rare occasions
  • Slot times in about 5% of the cases are exceeded, with 2 effects:
    • The proposed block is discarded before it's broadcasted, or
    • the block is broadcasted and reaches the import queue AFTER the next block is being produced. This will result in a 'block is already imported' warning.
    • Both of the above effects are NOT forks, and will maintain sidechain integrity. What we lose is efficiency and responsiveness.
  • Calls from the integration test scripts are executed simultaneously

Conclusions

  • I found that we removed the sleeps in our integration tests scripts at some point. Those sleeps were placed between executing calls and getters.
    • When multiple distinct calls are sent to 2 workers simultaneously, it can happen that we have a nonce clash, because they both try to execute a call and increase the same account nonce at the same time (before that block can be imported)
  • A couple of weeks ago, we changed the getter execution to be immediate and not part of the TOP pool anymore.
    • As a result, when we send a call and a getter almost simultaneously, the getter will return a result before the call was executed and included in a sidechain block.

Solution

  • Simply re-introduce the sleeps in our integration test scripts again, so that we wait for the calls to be executed and included in a sidechain block before calling a getter to verify that effect.

Closes #1023

Felix Müller added 6 commits November 1, 2022 15:05
Fix calculation of remaining slot time for log output

(cherry picked from commit 03b8616)
(cherry picked from commit 185030d8178bafee0fd04b0760d7006cc6bd857b)
(cherry picked from commit 4299c20351ef891f1a1173333a3da74051de5971)
(cherry picked from commit c814f99fd647670281c38b7aa21d3769c27f77c0)
@murerfel murerfel self-assigned this Nov 1, 2022
Felix Müller added 2 commits November 1, 2022 17:37
Since getters are executed immediately and are not put into the
TOP pool anymore (where order is guaranteed), we can get wrong
results because of timings
Also increased the slot fraction for calls again to 0.7 (from 0.4)
@murerfel murerfel marked this pull request as ready for review November 2, 2022 10:32
@murerfel murerfel added A0-core Affects a core part C1-low 📌 Does not elevate a release containing this beyond "low priority" P1-asap F3-test E0-breaksnothing labels Nov 2, 2022
cli/demo_direct_call.sh Show resolved Hide resolved
restart: always
restart: "no"
Copy link
Contributor Author

@murerfel murerfel Nov 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a minor change, discovered while running the tests in docker. It's confusing in a test setting when the worker automatically restarts after a crash.

Comment on lines +140 to +145
if slot.duration_remaining().is_none() {
warn!("No time remaining in slot, skipping AURA execution");
return Ok(())
}

log_remaining_slot_duration(&slot, "Before AURA");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some log messages and timers to help diagnose the timings and phases of the sidechain slot

pub const BLOCK_PROPOSAL_SLOT_PORTION: f32 = 0.8;
pub const BLOCK_PROPOSAL_SLOT_PORTION: f32 = 0.7;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small reduction in the fraction of the slot time we use for executing calls. When running on CI, I found that we often don't have enough time to broadcast a sidechain block before the next slot starts (resulting in duplicate block numbers and discarded blocks)

Comment on lines -538 to +539
proposing_remaining_duration(&slot_info, duration_now()) > SLOT_DURATION / 2
&& proposing_remaining_duration(&slot_info, duration_now())
< SLOT_DURATION.mul_f32(BLOCK_PROPOSAL_SLOT_PORTION + 0.01)
proposing_remaining_duration(&slot_info, duration_now())
< SLOT_DURATION.mul_f32(BLOCK_PROPOSAL_SLOT_PORTION + 0.01)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first condition of this assertion > SLOT_DURATION / 2 is only true if the BLOCK_PROPOSAL_SLOT_PORTION is > 0.5, which is not guaranteed (and so this test failed when I lowered the value to 0.4 for testing)

tokio = "*"
tokio = { version = "1.6.1", features = ["full"] }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Necessary change to allow running cargo test in this crate alone.

Comment on lines +84 to +90
pub fn duration_remaining(&self) -> Option<Duration> {
let duration_now = duration_now();
if self.ends_at <= duration_now {
return None
}
Some(self.ends_at - duration_now)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a convenience function to the SlotInfo struct to get the remaining duration in this slot.

@murerfel murerfel changed the title Investigate sidechain timings (test failures) Investigate and fix sidechain integration test failures Nov 2, 2022
Copy link
Contributor

@OverOrion OverOrion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for this and ouch for those re-introduced sleeps. 😆

cli/demo_direct_call.sh Show resolved Hide resolved
sidechain/consensus/aura/src/lib.rs Show resolved Hide resolved
@murerfel murerfel requested a review from OverOrion November 7, 2022 07:29
Copy link
Contributor

@OverOrion OverOrion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

@murerfel murerfel merged commit 0ad314f into master Nov 7, 2022
@murerfel murerfel deleted the fm/investigate-sidechain-timings branch November 7, 2022 07:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A0-core Affects a core part B1-releasenotes C1-low 📌 Does not elevate a release containing this beyond "low priority" E0-breaksnothing F3-test P1-asap
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI sidechain integration test failed randomly
2 participants