Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix multi-threaded access to non-TLS servers (eg - SNP nodes fetching endorsements from THIM) #6836

Merged
merged 11 commits into from
Feb 20, 2025

Conversation

eddyashton
Copy link
Member

#6830 includes a bunch of debugging hacks to try and get a repro of the failure in virtual, to isolate the segfault. Includes a tiny THIM shim server.

This is a cleaner version of that, hopefully including the minimal changes for a CI repro (multi-threaded SNP CI) and the minimal fixes (dispatch send_data by threads, and add a Mutex in QuoteEndorsementsClient).

@eddyashton
Copy link
Member Author

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@eddyashton
Copy link
Member Author

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@eddyashton
Copy link
Member Author

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@eddyashton eddyashton changed the title Run SNP CI with multiple worker threads Fix multi-threaded access to non-TLS servers (eg - SNP nodes fetching endorsements from THIM) Feb 20, 2025
@eddyashton eddyashton marked this pull request as ready for review February 20, 2025 09:01
@eddyashton eddyashton requested a review from a team as a code owner February 20, 2025 09:01
@eddyashton
Copy link
Member Author

2 open questions on this:

I think this is a sufficient fix, we have a green run where we previously had consistent reds. But also had a single red run on this final version of the code. I think that's something unrelatedly flaky in modules_test:

57:   1) polyfill
57:        isValidX509CertChain
57:          returns true for valid cert chains:
57:      Error: Timeout of 2000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/CCF/js/ccf-app/test/polyfill.test.ts)
57:       at process.processImmediate (node:internal/timers:478:21)

But will try a few more runs to confirm.

I was hoping to leave on the -DWORKER_THREADS=2, so that we maintain CI coverage of multi-threaded SNP runs, but it causes a significant slowdown in the tests. The CTest step takes ~35 minutes rather than ~23 minutes, pushing the whole Deploy ACI job worryingly close to the 1hr timeout. Almost all of the e2e tests run slower, I'm guessing because we're overloading the 4-core C-ACI instances with multiple parallel services and multiple nodes. If anyone has a quick bright idea here let me know, otherwise I'll turn off the -DWORKER_THREADS=2 and we'll lose automated test of this again, for now.

@eddyashton
Copy link
Member Author

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@eddyashton
Copy link
Member Author

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

eddyashton added a commit that referenced this pull request Feb 20, 2025
…rs (eg - SNP nodes fetching endorsements from THIM) (#6836) (#6839)
@eddyashton eddyashton enabled auto-merge February 20, 2025 17:16
@eddyashton
Copy link
Member Author

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@eddyashton eddyashton added this pull request to the merge queue Feb 20, 2025
Merged via the queue into microsoft:main with commit b231961 Feb 20, 2025
21 checks passed
@eddyashton eddyashton deleted the multi_threaded_snp_take2 branch February 20, 2025 18:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5.x-todo PRs which should be backported to 5.x
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants