Update CPU topology initialization (Fixes subsequent hang in CmiBarrier) #2838

Closed
wants to merge 12 commits into from

minitu
Contributor

@minitu minitu commented May 6, 2020

A hang can be observed when CmiBarrier is called after CmiInitCPUTopology in _initCharm (src/ck-core/init.C). The hang is caused by comm threads prematurely exiting the CmiNetworkProgress loop (lines 485-490) and thus failing to pass the CmiReduce message (line 538) to PE 0.

I believe the original code appeared to work (when CmiBarrier was not used) because the comm threads would resume processing their message queues at some later point, including the remaining CmiReduce messages.
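To make the failure mode described above concrete, here is a minimal, single-threaded sketch of a progress loop that stops while messages are still queued. It is a toy model, not Charm++ code: `CommThreadModel`, `runUntilFlag`, and `runUntilDrained` are hypothetical names, and the fixed step count merely stands in for whatever condition makes the real comm thread leave its CmiNetworkProgress loop.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical stand-in for a comm thread; not a Charm++ type.
struct CommThreadModel {
    std::deque<int> outbox;     // messages still waiting to be forwarded toward PE 0
    std::size_t delivered = 0;  // messages that have reached PE 0

    // One progress step: forward at most one queued message.
    void progressOnce() {
        if (!outbox.empty()) {
            outbox.pop_front();
            ++delivered;
        }
    }
};

// Models a loop that exits on an external condition (here: a step budget),
// regardless of whether the outbox is empty. Returns the number of
// messages left stranded in the queue.
std::size_t runUntilFlag(CommThreadModel& t, std::size_t stepsBeforeFlag) {
    for (std::size_t i = 0; i < stepsBeforeFlag; ++i) {
        t.progressOnce();
    }
    return t.outbox.size();
}

// Contrast case: a loop that keeps making progress until nothing is queued.
std::size_t runUntilDrained(CommThreadModel& t) {
    while (!t.outbox.empty()) {
        t.progressOnce();
    }
    return t.outbox.size();
}
```

In this model, a message left in `outbox` after `runUntilFlag` corresponds to a CmiReduce message that never reaches PE 0 unless progress resumes later. `runUntilDrained` is shown only for contrast and is not necessarily the change made in this PR.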

This PR modifies the CPU topology init process, in particular how messages are passed, modeling it on the CPU affinity init code, which did not exhibit issues with subsequent CmiBarrier calls.

@minitu minitu added this to the 6.11 milestone May 6, 2020
@minitu minitu self-assigned this May 6, 2020
Contributor Author

minitu commented May 6, 2020

The hang can be observed with comm thread builds (e.g. UCX), with a CmiBarrier placed after CPU topology init.
Test program: hello/1darray
On LLNL Lassen: jsrun -n2 -a1 -c2 -K1 -r2 ./hello +ppn 1 +setcpuaffinity

@minitu minitu changed the title Fix CmiBarrier hang after CPU topology init Update CPU Topology Initialization (Fixes subsequent hang in CmiBarrier) May 28, 2020
@minitu minitu changed the title Update CPU Topology Initialization (Fixes subsequent hang in CmiBarrier) Update CPU topology initialization (Fixes subsequent hang in CmiBarrier) May 28, 2020
@minitu minitu removed this from the 6.11 milestone Jul 9, 2020
Contributor Author

minitu commented Jul 9, 2020

Removing from 6.11 for now, as it is unclear how this should be fixed.

@minitu minitu marked this pull request as draft August 4, 2020 21:33
Contributor Author

minitu commented Jan 28, 2021

Closing; will reopen or create a new PR once we figure out how to solve the CmiBarrier hang.

@minitu minitu closed this Jan 28, 2021
@stwhite91 stwhite91 deleted the cputopology-fix branch July 25, 2024 20:47