CORE: Fix bug in MT progress queue #147
Conversation
@alex--m regarding moving the structure outside of core - one of the structure's features is the "was_progressed" flag. I think it is a feature that is specific to task objects, not to arbitrary objects. What do you think?
@lappazos I think even if we consider it a "task queue" and not a generic locked list, it may still be valuable to other components and be better placed in a separate section outside the "core" folder. This also makes it easier to test it separately and evaluate that threshold Manju asked about, without making it too progress-logic-specific. Regarding "was_progressed" - yes, it makes sense to me to keep it even if the entire implementation moves to a common area.
We could do a generic implementation of a LF queue. We need to define a simple struct - a queue element. Then in the task struct, instead of was_progressed, you will have a ucc_lf_queue_elem. This way it will be generic and reusable if needed; a sketch follows below.
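A minimal sketch of what such a generic element could look like. Only ucc_lf_queue_elem and was_progressed come from the discussion above; the type and field spellings, and the task struct name, are illustrative assumptions, not the final API:

#include <stdint.h>

/* Hypothetical sketch: the generic element carries the per-element state
 * the queue needs, including the was_progressed flag. */
typedef struct ucc_lf_queue_elem {
    uint8_t was_progressed;
} ucc_lf_queue_elem_t;

/* The task then embeds the element instead of keeping its own flag
 * (assuming the task struct is named ucc_coll_task): */
typedef struct ucc_coll_task {
    /* ... other task fields ... */
    ucc_lf_queue_elem_t lf_elem; /* replaces the standalone was_progressed */
} ucc_coll_task_t;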
Force-pushed from ed67f8d to e4f1673
Force-pushed from e4f1673 to df151d2
In general I guess it should fix the hang that we saw in torch. However, we need the blessing from @Sergei-Lebedev - need to check that it works (we had a slightly different fix).
Secondly, @lappazos since we are adding a generic data structure - plz add a gtest and test it for correctness. We need to design stressful test cases: 1. one producer, one consumer; 2. one producer, multiple consumers; 3. multiple producers, multiple consumers. We need to stress-test the fallback to the list specifically; a rough sketch of the last case is below.
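For reference, a pthreads sketch of case 3 (multiple producers, multiple consumers). The ucc_lf_queue_* calls and types are assumptions about the new API, not its final names:

#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <assert.h>

#define NUM_THREADS 4
#define NUM_ELEMS   100000 /* well above LINE_SIZE, to stress the locked-list fallback */

static atomic_int consumed;

static void *producer(void *arg)
{
    ucc_lf_queue_t *q = (ucc_lf_queue_t *)arg;      /* assumed queue type */
    for (int i = 0; i < NUM_ELEMS; i++) {
        ucc_lf_queue_elem_t *e = malloc(sizeof(*e)); /* assumed element type */
        ucc_lf_queue_enqueue(q, e);                  /* assumed API */
    }
    return NULL;
}

static void *consumer(void *arg)
{
    ucc_lf_queue_t *q = (ucc_lf_queue_t *)arg;
    while (atomic_load(&consumed) < NUM_THREADS * NUM_ELEMS) {
        ucc_lf_queue_elem_t *e = ucc_lf_queue_dequeue(q); /* assumed API */
        if (e) {
            free(e);
            atomic_fetch_add(&consumed, 1);
        }
    }
    return NULL;
}

int main(void)
{
    ucc_lf_queue_t q;
    ucc_lf_queue_init(&q); /* assumed API */
    pthread_t prod[NUM_THREADS], cons[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_create(&prod[i], NULL, producer, &q);
        pthread_create(&cons[i], NULL, consumer, &q);
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(prod[i], NULL);
        pthread_join(cons[i], NULL);
    }
    /* correctness check: every enqueued element was dequeued exactly once */
    assert(atomic_load(&consumed) == NUM_THREADS * NUM_ELEMS);
    return 0;
}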
/* This data structure is thread safe */

// Number of elements in a single lock-free pool - could be changed, but performed well in tests
#define LINE_SIZE 8
do we want LINE_SIZE to be a runtime parameter? @Sergei-Lebedev
    }
}
elem = NULL;
ucc_spin_lock(&queue->locked_queue_lock[which_pool]);
So, an element from the locked list can only be extracted when LINE_SIZE elements are placed in the alternative pool. Looks like there is a big imbalance in favor of the pools there; this might lead to stalls in the progress. I don't know if we can improve that and make the dequeue extract elements with the same probability?
I'm not sure what you meant. The locked queue is a direct continuation of the LINE_SIZE pool. When inserting, you search for an empty spot from the beginning of LINE_SIZE to the end of the locked queue. When you pop, you search for an element from the beginning of LINE_SIZE to the end of the locked queue (see the sketch below).
Of course, as we know, it will not maintain order, meaning new elements that were inserted after some other element could be popped before it, regardless of whether they sit in LINE_SIZE or in the locked queue. But we do have a guarantee that a re-inserted element will never be popped before an element that was already waiting: if two objects A and B are in the pool, and B is popped and reinserted, we can be sure that B will never be popped again before A is.
I'll add the tests.
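To make that scan order concrete, here is an illustrative pop path. queue, which_pool, LINE_SIZE and locked_queue_lock appear in the diff above; try_take_slot and locked_list_pop_head are hypothetical helpers, and the real code differs:

static ucc_lf_queue_elem_t *pop_sketch(ucc_lf_queue_t *queue, int which_pool)
{
    ucc_lf_queue_elem_t *elem = NULL;
    /* 1. scan the LINE_SIZE lock-free slots of the current pool */
    for (int i = 0; i < LINE_SIZE; i++) {
        elem = try_take_slot(queue, which_pool, i); /* hypothetical atomic take */
        if (elem) {
            return elem;
        }
    }
    /* 2. the locked queue is scanned as a direct continuation of the pool */
    ucc_spin_lock(&queue->locked_queue_lock[which_pool]);
    elem = locked_list_pop_head(queue, which_pool); /* hypothetical */
    ucc_spin_unlock(&queue->locked_queue_lock[which_pool]);
    return elem;
}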
Force-pushed from 161af1a to f7fe8e2
Done
@Sergei-Lebedev when you have time, could you plz check that this MT implementation works with PyTorch (where we saw the hang last time) - this will be the last check.
@lappazos the test container build failed with this PR with the following error
Fixed @Sergei-Lebedev
{
    ucc_test_queue_t *test = (ucc_test_queue_t *)arg;
    for (int j = 0; j < 5000000; j++) {
        // TODO: should we use ucc_mc_alloc instead of ucc_malloc?
no
Also, add #define NUM_ITERS 500000
Param comms alltoall test passed
@Sergei-Lebedev Performance-wise: if we are to make the lock-free implementation the default, which tests do we need to run to make sure there is no perf degradation?
I usually run nonblocking mode in the PyTorch param benchmark; it stresses the progress queue more. Do we have other use cases?
Force-pushed from dc480a9 to e075530
Replaced with #210
What
Fix bug in MT progress queue
Why?
Hang while using the MT progress queue after PR #125.
How?
Create a locked queue per pool instead of a global one. Also, organize the structs in a more accurate way.
In addition, remove the counters (it seems they only hurt performance by adding unnecessary actions per call). When the queue is empty, the dequeue function simply finishes the iteration. A sketch of the resulting layout is below.
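A hedged sketch of the per-pool layout described above. LINE_SIZE, locked_queue_lock and the per-pool indexing appear in the diff; the field names, the ucc_spinlock_t/ucc_list_link_t spellings, and the two-pool count are assumptions:

#define NUM_POOLS 2 /* assumption: enqueue/dequeue alternate between two pools */

typedef struct ucc_lf_queue {
    /* fixed lock-free slots per pool */
    ucc_lf_queue_elem_t *elements[NUM_POOLS][LINE_SIZE];
    /* previously one global lock + list; now one per pool, as described above */
    ucc_spinlock_t  locked_queue_lock[NUM_POOLS];
    ucc_list_link_t locked_queue[NUM_POOLS];
} ucc_lf_queue_t;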