Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

workaround random SIGSEGV on macOS aarch64 CI #5757

Merged
merged 3 commits into from
Jan 16, 2024
Merged

Conversation

etan-status
Copy link
Contributor

Separate a let block into multiple let statements to reduce probability of hitting random SIGSEGV during flaky CI tests.

whatever... 🤯

Separate a `let` block into multiple `let` statements to reduce
probability of hitting random `SIGSEGV` during flaky CI tests.

whatever... 🤯
@etan-status
Copy link
Contributor Author

workaround runs here in endless loop until fail: https://ci.status.im/job/nimbus-eth2/job/platforms/job/macos/job/aarch64/job/dev%252Fetan%252Fzz-perpetual/

Seems to be more stable than without it (~10 successful runs so far, none failed)

Copy link

github-actions bot commented Jan 16, 2024

Unit Test Results

         9 files  ±0    1 101 suites  ±0   27m 29s ⏱️ -55s
  4 213 tests ±0    3 866 ✔️ ±0  347 💤 ±0  0 ±0 
16 834 runs  ±0  16 436 ✔️ ±0  398 💤 ±0  0 ±0 

Results for commit 970e965. ± Comparison against base commit 7443a4a.

♻️ This comment has been updated with latest results.

# Until a shared pattern can be identified, it is better to apply
# workarounds that at least avoid the known to be reachable triggers.
# The solution is hacky and far from desirable; it is what it is.
let
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have you compared the generated C code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the extra let in this PR introduces an additional state to the closure iterator code. it is still roughly 60 states for this proc, and no I have no clue why it works.

For the other one I refer to here, no, haven't checked, but tried randomly reducing complexity by splitting among statements and using a temporary variable there made it work for Wasm. But as it's only on experimental project, have not followed up on it. But it also points to inline results unwrapping being a possible trigger for random bugs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the extra let in this PR introduces an additional state t

this is an important thing to add to the doc comment - the change in size of the environment changes a number of things such as alignment and which parts of an environment contain pointers and so on, which in turn may have surprising behavioural effects, ie most likely this extra state masks some underlying issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 1225 already includes that remark

@arnetheduck
Copy link
Member

arnetheduck commented Jan 16, 2024

Seems to be more stable than without it (~10 successful runs so far, none failed)

and this is not an artefact of the time of the day (load on server) or something?

@etan-status
Copy link
Contributor Author

There are 50 successful runs in a row with the extra let now. While before, it would SIGSEGV every ~3 times.

You can verify on dev/etan/zz-perpetual branch, push to it, then go to https://ci.status.im/job/nimbus-eth2/job/platforms/job/macos/job/aarch64/job/dev%252Fetan%252Fzz-perpetual/ and click on left side "Build with Parameters". It will automatically repeat the keymanager test until it fails.

@etan-status
Copy link
Contributor Author

I have removed the extra let once more on the perpetual testing branch, and it gave SIGSEGV after just two attempts. After more than 60 attempts in a row were successful with the extra let.

image

So, yes, this is most likely not time of day dependent.

@etan-status etan-status enabled auto-merge (squash) January 16, 2024 11:11
@etan-status etan-status merged commit b382833 into unstable Jan 16, 2024
11 checks passed
@etan-status etan-status deleted the dev/etan/ts-segv branch January 16, 2024 12:41
etan-status added a commit that referenced this pull request Feb 25, 2024
The weird `let` bug from #5757 appeared again :-) Document findings.
etan-status added a commit that referenced this pull request Feb 26, 2024
The weird `let` bug from #5757 appeared again :-) Document findings.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants