Fix hang in tc microbenchmark + fix overlapping prep timers (supersede #127 + #129) #137

elgarten · 2024-07-30T15:41:02Z

Fix a hang in the TC microbenchmark (originally #127):
Added a shared WaitGroup between the vertex doAll and the nested, per-vertex
edge doAll. The original code would hang because of the separate wait groups:
after enqueueing a doAll in edge_tc_couting, a hart waits for it to complete
(tc_algos.cpp:42). However, this happens on every hart because of the outer
doAll in tc_no_chunk, so every hart is waiting and none is available to
complete the work being waited on. With one combined wait group, the outer
doAll tasks can complete after enqueueing the inner doAll tasks but before
those tasks finish. Harts are thus freed to run the inner doAll, and the
benchmark makes forward progress.
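
The pattern described above can be sketched outside the runtime. This is a minimal illustration only, assuming a fixed pool of threads as a stand-in for harts; SimpleWaitGroup, WorkerPool, and count_with_shared_wait_group are hypothetical names, not the pando or galois API. Each outer task registers its inner tasks on the shared wait group before signalling its own completion, and never blocks on them, so a worker is always free to pick up inner work:

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a wait group: counts outstanding tasks.
class SimpleWaitGroup {
public:
  void add(std::size_t n) {
    std::lock_guard<std::mutex> lock(m_);
    pending_ += n;
  }
  void done() {
    std::lock_guard<std::mutex> lock(m_);
    if (--pending_ == 0) cv_.notify_all();
  }
  void wait() {
    std::unique_lock<std::mutex> lock(m_);
    cv_.wait(lock, [this] { return pending_ == 0; });
  }
private:
  std::mutex m_;
  std::condition_variable cv_;
  std::size_t pending_ = 0;
};

// Hypothetical fixed-size worker pool; each thread plays the role of a hart.
class WorkerPool {
public:
  explicit WorkerPool(std::size_t n) {
    assert(n > 0);
    for (std::size_t i = 0; i < n; ++i) threads_.emplace_back([this] { run(); });
  }
  ~WorkerPool() {
    {
      std::lock_guard<std::mutex> lock(m_);
      stop_ = true;
    }
    cv_.notify_all();
    for (auto& t : threads_) t.join();
  }
  void enqueue(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(m_);
      queue_.push_back(std::move(task));
    }
    cv_.notify_one();
  }
private:
  void run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return stop_ || !queue_.empty(); });
        if (queue_.empty()) return;  // stop requested and no work left
        task = std::move(queue_.front());
        queue_.pop_front();
      }
      task();
    }
  }
  std::mutex m_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;
  std::vector<std::thread> threads_;
  bool stop_ = false;
};

// Outer "vertex" tasks enqueue inner "edge" tasks on ONE shared wait group.
std::size_t count_with_shared_wait_group(std::size_t workers, std::size_t vertices,
                                         std::size_t edges_per_vertex) {
  SimpleWaitGroup wg;
  std::atomic<std::size_t> processed{0};
  WorkerPool pool(workers);  // declared last so it is joined before wg is destroyed
  wg.add(vertices);
  for (std::size_t v = 0; v < vertices; ++v) {
    pool.enqueue([&wg, &processed, &pool, edges_per_vertex] {
      wg.add(edges_per_vertex);  // register inner work before the outer task is done
      for (std::size_t e = 0; e < edges_per_vertex; ++e) {
        pool.enqueue([&wg, &processed] {
          processed.fetch_add(1);
          wg.done();
        });
      }
      wg.done();  // outer task returns without waiting on its inner tasks
    });
  }
  wg.wait();  // only the caller blocks, outside the pool
  return processed.load();
}
```

Calling count_with_shared_wait_group(1, 8, 4) completes even with a single worker. If each outer task instead waited on its own inner wait group from inside the pool, that single worker would block on work only it could run, which is the hang described above.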

Fix overlapping prep timers (originally #129):
Add a barrier to sequence the output of each node.
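
One way to picture barrier-sequenced output is the turn-taking loop below. It is a sketch under assumptions, not the code in this PR: threads stand in for nodes, and SimpleBarrier and report_in_node_order are hypothetical names. Each node reports only on its own turn, and every node reaches the barrier before the next turn begins, so reports cannot overlap:

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical reusable barrier; stands in for a runtime-provided barrier.
class SimpleBarrier {
public:
  explicit SimpleBarrier(std::size_t parties) : parties_(parties) {}
  void arrive_and_wait() {
    std::unique_lock<std::mutex> lock(m_);
    const std::size_t gen = generation_;
    if (++arrived_ == parties_) {
      arrived_ = 0;
      ++generation_;  // release everyone waiting on this generation
      cv_.notify_all();
    } else {
      cv_.wait(lock, [this, gen] { return generation_ != gen; });
    }
  }
private:
  std::mutex m_;
  std::condition_variable cv_;
  const std::size_t parties_;
  std::size_t arrived_ = 0;
  std::size_t generation_ = 0;
};

// Each "node" appends its id on its own turn; the vector records report order.
std::vector<std::size_t> report_in_node_order(std::size_t nodes) {
  assert(nodes > 0);
  SimpleBarrier barrier(nodes);
  std::vector<std::size_t> log;
  std::mutex log_mutex;
  std::vector<std::thread> threads;
  for (std::size_t id = 0; id < nodes; ++id) {
    threads.emplace_back([&, id] {
      for (std::size_t turn = 0; turn < nodes; ++turn) {
        if (turn == id) {
          std::lock_guard<std::mutex> lock(log_mutex);
          log.push_back(id);  // stand-in for printing this node's timer
        }
        barrier.arrive_and_wait();  // nobody starts the next turn early
      }
    });
  }
  for (auto& t : threads) t.join();
  return log;
}
```

With four nodes, the returned log is always 0, 1, 2, 3 regardless of thread scheduling.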

pando-rt/src/init.cpp

tewaro · 2024-08-05T18:51:13Z

Add a comment about drvx, for when we refactor.

elgarten · 2024-08-06T19:16:17Z

@AdityaAtulTewari I added synchronization via pando::ControlProcessor::barrier() in pando-rt/src/drvx/cp.cpp.

elgarten requested a review from tewaro July 30, 2024 15:41

This was referenced Jul 30, 2024

fix overlapping prep timers #129

Closed

fix hang in tc microbenchmark #127

Closed

fix overlapping prep timers

676c8a7

elgarten force-pushed the brenden/timer-barrier-tc-hang branch from c35c4f4 to 676c8a7 Compare July 30, 2024 15:46

tewaro requested changes Jul 31, 2024

tewaro self-requested a review August 5, 2024 18:51

tewaro approved these changes Aug 5, 2024

add synchronization to drv timers

6c79fb4

elgarten requested a review from tewaro August 6, 2024 19:16

elgarten merged commit 30140ba into main Aug 12, 2024
20 checks passed

elgarten deleted the brenden/timer-barrier-tc-hang branch August 12, 2024 16:50
