Execute user's callback after setting request status #32

Merged
merged 10 commits into from
May 2, 2023

Conversation

pentschev
Member

When the user-defined callback executes, the request status must already be set, since the callback itself may use it. For example, ucxx::RequestTagMulti must know the request status to determine whether the request resulted in an error.

@pentschev pentschev requested a review from a team as a code owner April 19, 2023 15:18
@pentschev pentschev added bug Something isn't working non-breaking Introduces a non-breaking change labels Apr 19, 2023
When a request completes immediately, the callback is executed before
the request method returns, and thus it isn't possible to get the status
of the object for which we don't have a reference yet. For now we just
hack around those cases (which happen infrequently), but in the future
we should find a better way to test this.
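
The immediate-completion problem described in the commit message above can be illustrated with a small single-threaded sketch. Everything here is hypothetical: `ToyRequest`, `submitImmediate` and `callbackRanBeforeHandleWasAssigned` are illustration-only names and are not part of ucxx. The sketch only shows why a callback that fires inside the submitting call cannot query the request through the caller's handle.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical stand-in for a request object; not the ucxx API.
struct ToyRequest {
  bool completed = false;
};

using ToyCallback = std::function<void()>;

// Simulates an operation that completes immediately: the callback runs
// inside this function, before the caller receives the returned pointer.
std::shared_ptr<ToyRequest> submitImmediate(const ToyCallback& callback)
{
  auto request = std::make_shared<ToyRequest>();
  request->completed = true;
  if (callback) callback();
  return request;
}

// Returns true if the callback observed a null handle, i.e. it ran before
// the caller had a reference to the request.
bool callbackRanBeforeHandleWasAssigned()
{
  std::shared_ptr<ToyRequest> handle;  // still null while the callback runs
  bool handleWasNull = false;
  handle = submitImmediate(
    [&handle, &handleWasNull]() { handleWasNull = (handle == nullptr); });
  assert(handle != nullptr);  // the handle only becomes valid after the call
  return handleWasNull && handle->completed;
}
```

Calling `callbackRanBeforeHandleWasAssigned()` from a test or `main` shows that the callback sees a null handle even though the request has already completed, which is the situation the tests have to work around.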
Contributor

@wence- wence- left a comment

Thanks.

cpp/tests/request.cpp (outdated review thread, resolved)
cpp/src/request.cpp (outdated review thread, resolved)
@pentschev pentschev changed the base branch from branch-0.31 to branch-0.32 April 26, 2023 21:16
@pentschev pentschev mentioned this pull request Apr 28, 2023
4 tasks
To ensure that a user-defined request callback has access to the status,
an argument containing the status is now added to the callback function
prototype. This allows the callback to know the request status before
the `ucxx::Request`'s status is set, thus preventing the application
from believing the request has completed before the callback has run.

To make this change more evident and organized, new types
`RequestCallbackUserFunction` and `RequestCallbackUserData` are added.
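
A minimal single-threaded sketch of the ordering described in this commit message follows. It is not the ucxx implementation: `ToyStatus`, `ToyRequest`, `ToyCallbackUserFunction`, `ToyCallbackUserData`, `Observed` and `runToyRequest` are hypothetical names, and the callback signature (a status plus a `std::shared_ptr<void>` for user data) is an assumption modeled on the type names mentioned above. The point is only that the callback receives the final status as an argument while the request itself still reports "in progress".

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical status codes standing in for the real status type.
enum class ToyStatus { InProgress, Ok, Error };

// Hypothetical aliases modeled on the names in the commit message; the real
// definitions in ucxx may differ.
using ToyCallbackUserData     = std::shared_ptr<void>;
using ToyCallbackUserFunction = std::function<void(ToyStatus, ToyCallbackUserData)>;

class ToyRequest {
 public:
  ToyRequest(ToyCallbackUserFunction callback, ToyCallbackUserData data)
    : _callback(std::move(callback)), _data(std::move(data))
  {
  }

  // Run the callback with the final status first, then publish the status.
  void complete(ToyStatus status)
  {
    if (_callback) _callback(status, _data);
    _status = status;
  }

  bool isCompleted() const { return _status != ToyStatus::InProgress; }

 private:
  ToyStatus _status{ToyStatus::InProgress};
  ToyCallbackUserFunction _callback;
  ToyCallbackUserData _data;
};

// What the callback saw while it was running.
struct Observed {
  ToyStatus fromArgument;
  bool completedDuringCallback;
};

Observed runToyRequest(ToyStatus finalStatus)
{
  auto observed    = std::make_shared<Observed>(Observed{ToyStatus::InProgress, true});
  ToyRequest* self = nullptr;
  ToyRequest request(
    [&self](ToyStatus status, ToyCallbackUserData data) {
      auto* out                    = static_cast<Observed*>(data.get());
      out->fromArgument            = status;               // status arrives as an argument
      out->completedDuringCallback = self->isCompleted();  // request not yet marked complete
    },
    observed);
  self = &request;
  request.complete(finalStatus);
  assert(request.isCompleted());  // status is published only after the callback returns
  return *observed;
}
```

With this ordering, code polling `isCompleted()` cannot observe completion before the callback has finished. A real multi-threaded implementation would also need appropriate synchronization, which this sketch leaves out.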
@pentschev
Member Author

@wence- I pushed the changes we discussed previously in #26 (comment) via caa129f. The issue of the request result being set before the callback is executed now seems to be resolved, and things seem more stable. However, I still do see segfaults if I run tests in a loop, I'm not sure if they are related to changes in this PR or if they are just showing up now because other issues were resolved. Also, the issue I was attempting to work around with #37 is still present. It is quite annoying, since it seems to happen frequently only in the 11.2.2, centos7, amd64, 3.9, v100, earliest build. I'm not sure why that is: the conda packages installed in other builds are the same, and the differences really seem to be limited to OS and CUDA versions.

@wence-
Contributor

wence- commented May 2, 2023

However, I still do see segfaults if I run tests in a loop, I'm not sure if they are related to changes in this PR or if they are just showing up now because other issues were resolved.

Can you provide a reproducer? Just while 1; ./UCXX_TESTS ?

FWIW, the issues I saw in #26 are not reproducible with this new change.

@pentschev
Member Author

Can you provide a reproducer? Just while 1; ./UCXX_TESTS ?

Yes, exactly that.

@wence-
Contributor

wence- commented May 2, 2023

Can you provide a reproducer? Just while 1; ./UCXX_TESTS ?

Yes, exactly that.

I guess my system is slightly different (no nvlink, but not sure if that is relevant for these tests). Using

# Library version: 1.14.0
# Library path: /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/bin/../lib/libucs.so.0
# API headers version: 1.14.0
# Git branch '', revision 8a97995

@pentschev
Member Author

I can't remember exactly which tests were segfaulting, but I can't reproduce that now. Furthermore, if I skip the same tests that we're already skipping in CI, they all complete. On a DGX-1, running the following in a loop seems to always complete successfully:

while true; do UCX_TLS=tcp,cuda_copy UCX_TCP_CM_REUSEADDR=y timeout 10m ./cpp/build/gtests/UCXX_TEST --gtest_filter=-*DelayedSubmission*ProgressTagMulti*:ListenerTest.CloseCallback:ListenerTest.IsAlive:ListenerTest.RaiseOnError  ; done

I believe those issues may all be related to #24. We should tackle them in the near future, but they are probably not related to this PR.

Contributor

@wence- wence- left a comment

Minor docstring updates for the new callback interface, otherwise LGTM!

cpp/include/ucxx/request_tag_multi.h (review thread, resolved)
cpp/include/ucxx/request_tag_multi.h (review thread, resolved)
Co-authored-by: Lawrence Mitchell <wence@gmx.li>
Member Author

@pentschev pentschev left a comment

Suggestions are now in, thanks @wence- !

cpp/include/ucxx/request_tag_multi.h (review thread, resolved)
Contributor

@wence- wence- left a comment

Thanks Peter!

@pentschev pentschev added breaking Introduces a breaking change and removed non-breaking Introduces a non-breaking change labels May 2, 2023
@pentschev pentschev requested a review from a team as a code owner May 2, 2023 12:37
Member

@ajschmidt8 ajschmidt8 left a comment

Approving ops-codeowner file changes

@pentschev
Member Author

Thanks @wence- and @ajschmidt8 for reviews!

@pentschev
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 4acbcf2 into rapidsai:branch-0.32 May 2, 2023
@pentschev pentschev deleted the fix-request-user-callback branch May 2, 2023 18:57
Labels
breaking Introduces a breaking change bug Something isn't working