
CORE: Fix bug in MT progress queue #147

Closed
lappazos wants to merge 5 commits from the MT_Progress_Engine_Fix branch

Conversation

@lappazos (Contributor) commented Apr 6, 2021:

What

Fix bug in MT progress queue

Why?

A hang occurred while using the MT progress queue after PR 125.

How?

Create a locked queue per pool instead of a single global one, and organize the structs more accurately. In addition, remove the counters (they seem to only hurt performance, adding unnecessary actions per call); when the queue is empty, the dequeue function simply finishes its iteration.

@manjugv added the "Target: v0.1.x" label (PRs/Issue for the v0.1.x release) Apr 6, 2021
@lappazos (Contributor, Author) commented Apr 8, 2021:

@alex--m regarding moving the structure outside of core: one of the structure's features is the "was_progressed" flag, and I think that feature is specific to task objects, not to arbitrary objects. What do you think?

@alex--m (Contributor) commented Apr 8, 2021:

@lappazos I think even if we consider it a "task queue" and not a generic locked list, it may still both be valuable to other components and be better placed in a separate section outside the "core" folder. This also makes it easier to test it separately and to evaluate the threshold Manju asked about, without making it too progress-logic-specific. Regarding "was_progressed": yes, it makes sense to me to keep it even if the entire implementation moves to a common area.

@vspetrov (Collaborator) commented Apr 9, 2021:

> @lappazos I think even if we consider it a "task queue" and not a generic locked list, it may still both be valuable to other components and be better placed outside the "core" folder. [...]

We could do a generic implementation of an LF queue. We need to define a simple struct for the queue element:

struct ucc_lf_queue_elem {
    uint8_t was_progressed;
};

Then, in the task struct, instead of was_progressed you would have a ucc_lf_queue_elem. You can then do

ucc_lf_enqueue(ctx->pq, &task->lf_elem);

and

lf_elem = ucc_lf_dequeue(ctx->pq);
task = container_of(lf_elem, ...);

This way it will be generic and reusable if needed.
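[Editor's note: a self-contained sketch of the intrusive-element pattern described above; the enqueue/dequeue side is elided, and my_task plus the container_of definition are illustrative assumptions, not the PR's code.]

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* container_of in its usual form, defined here for self-containment */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

typedef struct ucc_lf_queue_elem {
    uint8_t was_progressed;
} ucc_lf_queue_elem_t;

typedef struct my_task {          /* hypothetical task type */
    int                 id;
    ucc_lf_queue_elem_t lf_elem;  /* embedded queue element */
} my_task_t;

int main(void)
{
    my_task_t task = { .id = 42 };

    /* what ucc_lf_enqueue would store: a pointer to the embedded element */
    ucc_lf_queue_elem_t *elem = &task.lf_elem;

    /* what a consumer does after ucc_lf_dequeue: recover the owning task */
    my_task_t *owner = container_of(elem, my_task_t, lf_elem);
    printf("recovered task id = %d\n", owner->id); /* prints 42 */
    return 0;
}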

@lappazos force-pushed the MT_Progress_Engine_Fix branch from ed67f8d to e4f1673 on April 11, 2021 13:36
@lappazos (Contributor, Author) commented:

@alex--m @vspetrov - Done

@lappazos force-pushed the MT_Progress_Engine_Fix branch from e4f1673 to df151d2 on April 11, 2021 13:38
@vspetrov (Collaborator) left a comment:

In general I guess it should fix the hang that we saw in torch. However, we need the blessing from @Sergei-Lebedev - need to check that it works (we had a slightly different fix).
Secondly, @lappazos, since we are adding a generic data structure, please add a gtest and test it for correctness. We need to design stressful test cases: 1. one producer, one consumer; 2. one producer, multiple consumers; 3. multiple producers, multiple consumers. We also need to stress-test the fallback to the list specifically. (A sketch of case 3 follows below.)
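[Editor's note: the PR's gtest is not shown here, so the following is only a hedged sketch of stress case 3 (multiple producers, multiple consumers) in plain C with pthreads; the mutex-protected ring is a stand-in for the queue under test, and every name is illustrative.]

#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4       /* producers and consumers each           */
#define N_ITEMS   100000  /* items per producer (and per consumer)  */
#define CAP       1024    /* stand-in queue capacity                */

/* Trivial mutex-protected ring standing in for the queue under test. */
static long            ring[CAP];
static int             head, tail, count;
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;

static void q_enqueue(long v)
{
    for (;;) {
        pthread_mutex_lock(&mu);
        if (count < CAP) {
            ring[tail] = v;
            tail = (tail + 1) % CAP;
            count++;
            pthread_mutex_unlock(&mu);
            return;
        }
        pthread_mutex_unlock(&mu); /* full: retry until a consumer drains */
    }
}

static int q_dequeue(long *v)
{
    pthread_mutex_lock(&mu);
    if (count == 0) {
        pthread_mutex_unlock(&mu);
        return 0; /* empty: caller just retries */
    }
    *v = ring[head];
    head = (head + 1) % CAP;
    count--;
    pthread_mutex_unlock(&mu);
    return 1;
}

static void *producer(void *arg)
{
    (void)arg;
    for (long i = 0; i < N_ITEMS; i++) {
        q_enqueue(i);
    }
    return NULL;
}

static void *consumer(void *arg)
{
    long v, got = 0;
    (void)arg;
    while (got < N_ITEMS) { /* totals balance: N_THREADS * N_ITEMS each way */
        got += q_dequeue(&v);
    }
    return NULL;
}

int main(void)
{
    pthread_t p[N_THREADS], c[N_THREADS];
    for (int i = 0; i < N_THREADS; i++) {
        pthread_create(&p[i], NULL, producer, NULL);
        pthread_create(&c[i], NULL, consumer, NULL);
    }
    for (int i = 0; i < N_THREADS; i++) {
        pthread_join(p[i], NULL);
        pthread_join(c[i], NULL);
    }
    puts("stress run completed without hanging");
    return 0;
}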

/* This data structure is thread safe */

// Number of elements in a single lock-free pool; could be changed, but this value performed well in tests
#define LINE_SIZE 8
@vspetrov (Collaborator) commented on the diff:

Do we want LINE_SIZE to be a runtime parameter? @Sergei-Lebedev

@lappazos (Contributor, Author) replied on src/utils/ucc_lock_free_queue.h (thread resolved).
        }
    }
    elem = NULL;
    ucc_spin_lock(&queue->locked_queue_lock[which_pool]);
@vspetrov (Collaborator) commented on the diff:

So an element from the locked list can only be extracted once LINE_SIZE elements are placed in the alternative pool. It looks like there is a big imbalance in favor of the pools there, which might lead to stalls in the progress. I don't know if we can improve that and make the dequeue extract elements with the same probability?

@lappazos (Contributor, Author) replied:

I'm not sure what you mean. The locked queue is a direct continuation of the LINE_SIZE pool: when inserting, you search for an empty spot from the beginning of the LINE_SIZE array to the end of the locked queue, and when you pop, you search for an element over that same range (see the sketch below).
Of course, as we know, it will not maintain order - new elements inserted after some other element may be popped before it, whether they sit in the LINE_SIZE array or in the locked queue. But we do have a guarantee that a re-inserted element cannot starve an older one: if two objects A and B are in the pool and B is popped and reinserted, B will never be popped again before A is.
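[Editor's note: a hedged, self-contained sketch of that scan order; LINE_SIZE and locked_queue_lock appear in the PR's diff, everything else here is hypothetical and not the PR's actual code.]

#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define LINE_SIZE     8
#define OVERFLOW_SIZE 64 /* hypothetical capacity of the locked part */

typedef struct sketch_queue {
    _Atomic(void *)    pool[LINE_SIZE];         /* lock-free fast path */
    void              *overflow[OVERFLOW_SIZE]; /* locked continuation */
    pthread_spinlock_t locked_queue_lock;
} sketch_queue_t;

static void *sketch_dequeue(sketch_queue_t *q)
{
    void *elem;

    /* 1. scan the LINE_SIZE slots from the beginning... */
    for (int i = 0; i < LINE_SIZE; i++) {
        elem = atomic_exchange(&q->pool[i], NULL);
        if (elem) {
            return elem;
        }
    }
    /* 2. ...and continue the same scan into the locked queue */
    pthread_spin_lock(&q->locked_queue_lock);
    for (int i = 0; i < OVERFLOW_SIZE; i++) {
        if (q->overflow[i]) {
            elem = q->overflow[i];
            q->overflow[i] = NULL;
            pthread_spin_unlock(&q->locked_queue_lock);
            return elem;
        }
    }
    pthread_spin_unlock(&q->locked_queue_lock);
    return NULL; /* empty: the caller simply finishes the iteration */
}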

@lappazos (Contributor, Author) commented:

> Secondly, @lappazos, since we are adding a generic data structure, please add a gtest and test it for correctness. [...]

I'll add the tests.

@lappazos force-pushed the MT_Progress_Engine_Fix branch 2 times, most recently from 161af1a to f7fe8e2, on May 10, 2021 09:14
@lappazos requested a review from vspetrov May 10, 2021 09:15
@lappazos (Contributor, Author) commented:

> > Secondly, @lappazos, since we are adding a generic data structure, please add a gtest and test it for correctness. [...]
>
> I'll add the tests.

Done

@vspetrov (Collaborator) commented:

@Sergei-Lebedev when you have time, could you please check that this MT implementation works with PyTorch (where we saw the hang last time)? This will be the last check.

@Sergei-Lebedev (Contributor) commented:

@lappazos the test container build failed with this PR with the following error:
/usr/bin/ld: ucc_info-ucc_info.o: undefined reference to symbol 'pthread_spin_init@@GLIBC_2.2.5'
/usr/bin/ld: /lib/x86_64-linux-gnu/libpthread.so.0: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
make[2]: *** [Makefile:493: ucc_info] Error 1
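[Editor's note: background on this class of error, not necessarily this PR's fix. On older glibc, pthread_spin_init lives in libpthread.so, so the final link line must pass -pthread (or -lpthread) explicitly; when libpthread is only an indirect dependency, GNU ld emits exactly this "DSO missing from command line" error. A minimal reproducer:]

/* Build with:  cc spin.c           -> fails on older glibc as above
 *              cc spin.c -pthread  -> links fine                    */
#include <pthread.h>

int main(void)
{
    pthread_spinlock_t lock;
    pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
    pthread_spin_lock(&lock);
    pthread_spin_unlock(&lock);
    pthread_spin_destroy(&lock);
    return 0;
}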

@lappazos (Contributor, Author) commented May 25, 2021:

> @lappazos the test container build failed with this PR with the following error:
> /usr/bin/ld: ucc_info-ucc_info.o: undefined reference to symbol 'pthread_spin_init@@GLIBC_2.2.5' [...]

Fixed @Sergei-Lebedev

{
    ucc_test_queue_t *test = (ucc_test_queue_t *)arg;
    for (int j = 0; j < 5000000; j++) {
        // TODO: should we use ucc_mc_alloc instead of ucc_malloc?
@vspetrov (Collaborator) commented on the diff:

No. Add #define NUM_ITERS 500000

@Sergei-Lebedev (Contributor) commented May 27, 2021:

> In general I guess it should fix the hang that we saw in torch. However, we need the blessing from @Sergei-Lebedev [...]

Param comms alltoall test passed.

@vspetrov (Collaborator) commented:

> Param comms alltoall test passed.

@Sergei-Lebedev Performance-wise: if we are to make the lock-free implementation the default, which tests do we need to run to make sure there is no perf degradation?

@Sergei-Lebedev (Contributor) commented:

> @Sergei-Lebedev Performance-wise: if we are to make the lock-free implementation the default, which tests do we need to run to make sure there is no perf degradation?

I usually run nonblocking mode in the PyTorch param benchmark; it stresses the progress queue more. Do we have other use cases?

@lappazos force-pushed the MT_Progress_Engine_Fix branch from dc480a9 to e075530 on May 30, 2021 14:05
@lappazos (Contributor, Author) commented:

Replaced by #210.

@lappazos closed this May 31, 2021
@lappazos deleted the MT_Progress_Engine_Fix branch May 31, 2021 15:36
Labels: Ready-for-Review; Target: v0.1.x (PRs/Issue for the v0.1.x release)