
CORE: Fix bug in MT progress queue #147

Closed
lappazos wants to merge 5 commits from the MT_Progress_Engine_Fix branch

Conversation

@lappazos (Contributor) commented Apr 6, 2021:

What

Fix bug in MT progress queue

Why?

A hang occurred while using the MT progress queue after PR 125.

How?

Create a locked queue per pool instead of a single global one, and organize the structs more accurately. In addition, remove the counters (they seem to only hurt performance, adding unnecessary actions per call); when the queue is empty, the dequeue function simply finishes its iteration.

@manjugv added the "Target: v0.1.x" label (PRs/Issue for the v0.1.x release) Apr 6, 2021
@lappazos (Contributor, Author) commented Apr 8, 2021:

@alex--m regarding moving the structure outside of core: one of the structure's features is the "was_progressed" flag, and I think that feature is specific to task objects, not to arbitrary objects. What do you think?

@alex--m (Contributor) commented Apr 8, 2021:

@lappazos I think even if we consider it a "task queue" and not a generic locked list, it may still both be valuable to other components and be better placed in a separate section outside the "core" folder. This also makes it easier to test it separately and to evaluate the threshold Manju asked about, without making it too progress-logic-specific. Regarding "was_progressed": yes, it makes sense to me to keep it even if the entire implementation moves to a common area.

@vspetrov (Collaborator) commented Apr 9, 2021:

> @lappazos I think even if we consider it a "task queue" and not a generic locked list, it may still both be valuable to other components and be better placed outside the "core" folder. [...]

We could do a generic implementation of an LF queue. We need to define a simple struct for the queue element:

struct ucc_lf_queue_elem {
    uint8_t was_progressed;
};

Then, in the task struct, instead of was_progressed you would have a ucc_lf_queue_elem. You can then do

ucc_lf_enqueue(ctx->pq, &task->lf_elem);

and

lf_elem = ucc_lf_dequeue(ctx->pq);
task = container_of(lf_elem, ...);

This way it will be generic and reusable if needed.
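[Editor's note: a self-contained sketch of the intrusive-element pattern described above; the enqueue/dequeue side is elided, and my_task plus the container_of definition are illustrative assumptions, not the PR's code.]

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* container_of in its usual form, defined here for self-containment */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

typedef struct ucc_lf_queue_elem {
    uint8_t was_progressed;
} ucc_lf_queue_elem_t;

typedef struct my_task {          /* hypothetical task type */
    int                 id;
    ucc_lf_queue_elem_t lf_elem;  /* embedded queue element */
} my_task_t;

int main(void)
{
    my_task_t task = { .id = 42 };

    /* what ucc_lf_enqueue would store: a pointer to the embedded element */
    ucc_lf_queue_elem_t *elem = &task.lf_elem;

    /* what a consumer does after ucc_lf_dequeue: recover the owning task */
    my_task_t *owner = container_of(elem, my_task_t, lf_elem);
    printf("recovered task id = %d\n", owner->id); /* prints 42 */
    return 0;
}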

@lappazos force-pushed the MT_Progress_Engine_Fix branch from ed67f8d to e4f1673 on April 11, 2021 13:36
@lappazos (Contributor, Author) commented:

@alex--m @vspetrov - Done

@lappazos force-pushed the MT_Progress_Engine_Fix branch from e4f1673 to df151d2 on April 11, 2021 13:38
@vspetrov (Collaborator) left a comment:

In general I guess it should fix the hang that we saw in torch. However, we need the blessing from @Sergei-Lebedev - need to check that it works (we had a slightly different fix).
Secondly, @lappazos, since we are adding a generic data structure, please add a gtest and test it for correctness. We need to design stressful test cases: 1. one producer, one consumer; 2. one producer, multiple consumers; 3. multiple producers, multiple consumers. We also need to stress-test the fallback to the list specifically. (A sketch of case 3 follows below.)
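[Editor's note: the PR's gtest is not shown here, so the following is only a hedged sketch of stress case 3 (multiple producers, multiple consumers) in plain C with pthreads; the mutex-protected ring is a stand-in for the queue under test, and every name is illustrative.]

#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4       /* producers and consumers each           */
#define N_ITEMS   100000  /* items per producer (and per consumer)  */
#define CAP       1024    /* stand-in queue capacity                */

/* Trivial mutex-protected ring standing in for the queue under test. */
static long            ring[CAP];
static int             head, tail, count;
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;

static void q_enqueue(long v)
{
    for (;;) {
        pthread_mutex_lock(&mu);
        if (count < CAP) {
            ring[tail] = v;
            tail = (tail + 1) % CAP;
            count++;
            pthread_mutex_unlock(&mu);
            return;
        }
        pthread_mutex_unlock(&mu); /* full: retry until a consumer drains */
    }
}

static int q_dequeue(long *v)
{
    pthread_mutex_lock(&mu);
    if (count == 0) {
        pthread_mutex_unlock(&mu);
        return 0; /* empty: caller just retries */
    }
    *v = ring[head];
    head = (head + 1) % CAP;
    count--;
    pthread_mutex_unlock(&mu);
    return 1;
}

static void *producer(void *arg)
{
    (void)arg;
    for (long i = 0; i < N_ITEMS; i++) {
        q_enqueue(i);
    }
    return NULL;
}

static void *consumer(void *arg)
{
    long v, got = 0;
    (void)arg;
    while (got < N_ITEMS) { /* totals balance: N_THREADS * N_ITEMS each way */
        got += q_dequeue(&v);
    }
    return NULL;
}

int main(void)
{
    pthread_t p[N_THREADS], c[N_THREADS];
    for (int i = 0; i < N_THREADS; i++) {
        pthread_create(&p[i], NULL, producer, NULL);
        pthread_create(&c[i], NULL, consumer, NULL);
    }
    for (int i = 0; i < N_THREADS; i++) {
        pthread_join(p[i], NULL);
        pthread_join(c[i], NULL);
    }
    puts("stress run completed without hanging");
    return 0;
}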

/* This data structure is thread safe */

// Number of elements in a single lock-free pool; could be changed, but this value performed well in tests
#define LINE_SIZE 8
@vspetrov (Collaborator) commented on the diff:

Do we want LINE_SIZE to be a runtime parameter? @Sergei-Lebedev

@lappazos (Contributor, Author) replied on src/utils/ucc_lock_free_queue.h (thread resolved).
        }
    }
    elem = NULL;
    ucc_spin_lock(&queue->locked_queue_lock[which_pool]);
@vspetrov (Collaborator) commented on the diff:

So an element from the locked list can only be extracted once LINE_SIZE elements are placed in the alternative pool. It looks like there is a big imbalance in favor of the pools there, which might lead to stalls in the progress. I don't know if we can improve that and make the dequeue extract elements with the same probability?

@lappazos (Contributor, Author) replied:

I'm not sure what you mean. The locked queue is a direct continuation of the LINE_SIZE pool: when inserting, you search for an empty spot from the beginning of the LINE_SIZE array to the end of the locked queue, and when you pop, you search for an element over that same range (see the sketch below).
Of course, as we know, it will not maintain order - new elements inserted after some other element may be popped before it, whether they sit in the LINE_SIZE array or in the locked queue. But we do have a guarantee that a re-inserted element cannot starve an older one: if two objects A and B are in the pool and B is popped and reinserted, B will never be popped again before A is.
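[Editor's note: a hedged, self-contained sketch of that scan order; LINE_SIZE and locked_queue_lock appear in the PR's diff, everything else here is hypothetical and not the PR's actual code.]

#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define LINE_SIZE     8
#define OVERFLOW_SIZE 64 /* hypothetical capacity of the locked part */

typedef struct sketch_queue {
    _Atomic(void *)    pool[LINE_SIZE];         /* lock-free fast path */
    void              *overflow[OVERFLOW_SIZE]; /* locked continuation */
    pthread_spinlock_t locked_queue_lock;
} sketch_queue_t;

static void *sketch_dequeue(sketch_queue_t *q)
{
    void *elem;

    /* 1. scan the LINE_SIZE slots from the beginning... */
    for (int i = 0; i < LINE_SIZE; i++) {
        elem = atomic_exchange(&q->pool[i], NULL);
        if (elem) {
            return elem;
        }
    }
    /* 2. ...and continue the same scan into the locked queue */
    pthread_spin_lock(&q->locked_queue_lock);
    for (int i = 0; i < OVERFLOW_SIZE; i++) {
        if (q->overflow[i]) {
            elem = q->overflow[i];
            q->overflow[i] = NULL;
            pthread_spin_unlock(&q->locked_queue_lock);
            return elem;
        }
    }
    pthread_spin_unlock(&q->locked_queue_lock);
    return NULL; /* empty: the caller simply finishes the iteration */
}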

@lappazos (Contributor, Author) commented:

> Secondly, @lappazos, since we are adding a generic data structure, please add a gtest and test it for correctness. [...]

I'll add the tests.

@lappazos force-pushed the MT_Progress_Engine_Fix branch 2 times, most recently from 161af1a to f7fe8e2, on May 10, 2021 09:14
@lappazos requested a review from vspetrov May 10, 2021 09:15
@lappazos (Contributor, Author) commented:

> > Secondly, @lappazos, since we are adding a generic data structure, please add a gtest and test it for correctness. [...]
>
> I'll add the tests.

Done

@vspetrov (Collaborator) commented:

@Sergei-Lebedev when you have time, could you please check that this MT implementation works with PyTorch (where we saw the hang last time)? This will be the last check.

@Sergei-Lebedev (Contributor) commented:

@lappazos the test container build failed with this PR with the following error:
/usr/bin/ld: ucc_info-ucc_info.o: undefined reference to symbol 'pthread_spin_init@@GLIBC_2.2.5'
/usr/bin/ld: /lib/x86_64-linux-gnu/libpthread.so.0: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
make[2]: *** [Makefile:493: ucc_info] Error 1
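[Editor's note: background on this class of error, not necessarily this PR's fix. On older glibc, pthread_spin_init lives in libpthread.so, so the final link line must pass -pthread (or -lpthread) explicitly; when libpthread is only an indirect dependency, GNU ld emits exactly this "DSO missing from command line" error. A minimal reproducer:]

/* Build with:  cc spin.c           -> fails on older glibc as above
 *              cc spin.c -pthread  -> links fine                    */
#include <pthread.h>

int main(void)
{
    pthread_spinlock_t lock;
    pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
    pthread_spin_lock(&lock);
    pthread_spin_unlock(&lock);
    pthread_spin_destroy(&lock);
    return 0;
}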

@lappazos (Contributor, Author) commented May 25, 2021:

> @lappazos the test container build failed with this PR with the following error:
> /usr/bin/ld: ucc_info-ucc_info.o: undefined reference to symbol 'pthread_spin_init@@GLIBC_2.2.5' [...]

Fixed @Sergei-Lebedev

{
    ucc_test_queue_t *test = (ucc_test_queue_t *)arg;
    for (int j = 0; j < 5000000; j++) {
        // TODO: should we use ucc_mc_alloc instead of ucc_malloc?
@vspetrov (Collaborator) commented on the diff:

No. Add #define NUM_ITERS 500000

@Sergei-Lebedev (Contributor) commented May 27, 2021:

> In general I guess it should fix the hang that we saw in torch. However, we need the blessing from @Sergei-Lebedev [...]

Param comms alltoall test passed.

@vspetrov (Collaborator) commented:

> Param comms alltoall test passed.

@Sergei-Lebedev Performance-wise: if we are to make the lock-free implementation the default, which tests do we need to run to make sure there is no perf degradation?

@Sergei-Lebedev (Contributor) commented:

> @Sergei-Lebedev Performance-wise: if we are to make the lock-free implementation the default, which tests do we need to run to make sure there is no perf degradation?

I usually run nonblocking mode in the PyTorch param benchmark; it stresses the progress queue more. Do we have other use cases?

@lappazos force-pushed the MT_Progress_Engine_Fix branch from dc480a9 to e075530 on May 30, 2021 14:05
@lappazos (Contributor, Author) commented:

Replaced by #210.

@lappazos closed this May 31, 2021
@lappazos deleted the MT_Progress_Engine_Fix branch May 31, 2021 15:36
Labels: Ready-for-Review; Target: v0.1.x (PRs/Issue for the v0.1.x release)