Fix hang in tc microbenchmark + fix overlapping prep timers (supersede #127 + #129) #137

Merged
elgarten merged 3 commits into main from brenden/timer-barrier-tc-hang
Aug 12, 2024


elgarten

Fix a hang in the TC microbenchmark (originally #127):
Add a shared WaitGroup between the vertex doAll and the nested, per-vertex
edge doAll. The original code hung because the two doAlls used separate wait
groups: after enqueueing the inner doAll in edge_tc_couting, a hart waits for
it to complete (tc_algos.cpp:42). Because the outer doAll in tc_no_chunk runs
on every hart, every hart ends up waiting and none is available to execute
the work being waited on. With one combined wait group, an outer doAll task
can finish after enqueueing its inner doAll tasks but before they complete.
This frees harts to run the inner doAll tasks, so the benchmark makes forward
progress.
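
The shared wait group pattern described above can be illustrated with a minimal, standalone sketch. This is not the pando-rt code: `WaitGroup`, `Pool`, and `run_shared_wait_group` are hypothetical names, and plain `std::thread` workers stand in for harts. The point it shows is that each outer (per-vertex) task registers its inner (per-edge) tasks on the same wait group and then returns without blocking, so only the caller waits.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a wait group: counts outstanding tasks.
class WaitGroup {
public:
  void add(std::size_t n) { std::lock_guard<std::mutex> l(m_); pending_ += n; }
  void done() {
    std::lock_guard<std::mutex> l(m_);
    assert(pending_ > 0);
    if (--pending_ == 0) cv_.notify_all();
  }
  void wait() {
    std::unique_lock<std::mutex> l(m_);
    cv_.wait(l, [this] { return pending_ == 0; });
  }
private:
  std::mutex m_;
  std::condition_variable cv_;
  std::size_t pending_ = 0;
};

// Hypothetical fixed-size worker pool; each worker plays the role of a hart.
class Pool {
public:
  explicit Pool(std::size_t workers) {
    for (std::size_t i = 0; i < workers; ++i) threads_.emplace_back([this] { run(); });
  }
  ~Pool() {
    { std::lock_guard<std::mutex> l(m_); stop_ = true; }
    cv_.notify_all();
    for (auto& t : threads_) t.join();
  }
  void enqueue(std::function<void()> task) {
    { std::lock_guard<std::mutex> l(m_); queue_.push_back(std::move(task)); }
    cv_.notify_one();
  }
private:
  void run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> l(m_);
        cv_.wait(l, [this] { return stop_ || !queue_.empty(); });
        if (queue_.empty()) return;  // stop requested and nothing left to run
        task = std::move(queue_.front());
        queue_.pop_front();
      }
      task();  // run outside the lock so tasks may enqueue more work
    }
  }
  std::mutex m_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;
  std::vector<std::thread> threads_;
  bool stop_ = false;
};

// Outer task per vertex enqueues one inner task per edge on the SAME wait
// group and returns immediately; only the caller blocks in wait().
std::size_t run_shared_wait_group(std::size_t harts, std::size_t vertices,
                                  std::size_t edges_per_vertex) {
  std::atomic<std::size_t> edges_visited{0};
  WaitGroup wg;
  Pool pool(harts);  // declared last so workers are joined before wg is destroyed
  wg.add(vertices);
  for (std::size_t v = 0; v < vertices; ++v) {
    pool.enqueue([&] {
      wg.add(edges_per_vertex);  // register inner work before the outer task ends
      for (std::size_t e = 0; e < edges_per_vertex; ++e)
        pool.enqueue([&] { edges_visited.fetch_add(1); wg.done(); });
      wg.done();  // outer task completes without waiting on its inner tasks
    });
  }
  wg.wait();
  return edges_visited.load();
}
```

Calling `run_shared_wait_group(1, 4, 4)` completes even with a single worker. If each outer task instead blocked on its own inner wait group, that single worker would wait forever on work that only it could run, which is the hang described above.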

Fix overlapping prep timers (originally #129):
Add a barrier to sequence the output of each node.
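
As a rough illustration of sequencing per-node output with a barrier, the sketch below lets each node report only on its own turn, with a barrier between turns so reports cannot interleave. It is a standalone model, not the pando-rt change: `Barrier` and `report_in_node_order` are hypothetical names, threads stand in for nodes, and the reported string is a placeholder rather than a real timer value.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical reusable barrier for a fixed number of participants.
class Barrier {
public:
  explicit Barrier(std::size_t n) : n_(n) {}
  void arrive_and_wait() {
    std::unique_lock<std::mutex> lock(m_);
    const std::size_t gen = generation_;
    if (++arrived_ == n_) {
      arrived_ = 0;       // last arrival resets the barrier for the next turn
      ++generation_;
      cv_.notify_all();
    } else {
      cv_.wait(lock, [&] { return generation_ != gen; });
    }
  }
private:
  std::mutex m_;
  std::condition_variable cv_;
  std::size_t n_;
  std::size_t arrived_ = 0;
  std::size_t generation_ = 0;
};

// Each node appends its report only on its own turn; the barrier separates
// turns, so reports appear in node order and never overlap.
std::vector<std::string> report_in_node_order(std::size_t nodes) {
  std::vector<std::string> log;
  Barrier barrier(nodes);
  std::vector<std::thread> threads;
  for (std::size_t id = 0; id < nodes; ++id) {
    threads.emplace_back([&, id] {
      for (std::size_t turn = 0; turn < nodes; ++turn) {
        if (turn == id) log.push_back("node " + std::to_string(id) + " prep time");
        barrier.arrive_and_wait();  // everyone waits until this turn's report is written
      }
    });
  }
  for (auto& t : threads) t.join();
  return log;
}
```

The design trades a little latency (one barrier per node) for deterministic, non-overlapping output.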

@elgarten elgarten requested a review from tewaro July 30, 2024 15:41
@elgarten elgarten force-pushed the brenden/timer-barrier-tc-hang branch from c35c4f4 to 676c8a7 Compare July 30, 2024 15:46
pando-rt/src/init.cpp (review thread, resolved)

tewaro commented Aug 5, 2024

Add a comment about drvx, for when we refactor.

@tewaro tewaro self-requested a review August 5, 2024 18:51

elgarten commented Aug 6, 2024

@AdityaAtulTewari added synchronization via pando::ControlProcessor::barrier() in pando-rt/src/drvx/cp.cpp.

@elgarten elgarten requested a review from tewaro August 6, 2024 19:16
@elgarten elgarten merged commit 30140ba into main Aug 12, 2024
20 checks passed
@elgarten elgarten deleted the brenden/timer-barrier-tc-hang branch August 12, 2024 16:50
2 participants