Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CORE: fix timeout handle for triggered post #679

Merged

Conversation

Sergei-Lebedev
Copy link
Contributor

What

Correctly handle time out error for triggered task

  • For TL NCCL need to use commAbort, otherwise commDestroy could hang
  • For other TLs need to destroy executor during finalize

@Sergei-Lebedev
Copy link
Contributor Author

cc @Fuzzkatt

@Fuzzkatt
Copy link

Fuzzkatt commented Nov 17, 2022

Fixes the following unit tests on upstream Pytorch CI:

Sample command to reproduce: BACKEND="ucc" WORLD_SIZE=3 python test/distributed/test_distributed_spawn.py -v -k test_verify_model_across_rank_with_logger

In general, when passing in mismatched inputs across ranks, UCC would originally fail to catch the error within the function itself and propagate the error into a barrier timeout.

For example, when passing in the mismatched inputs on https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/distributed/distributed_test.py#L7690, it would propagate the error into https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/distributed/distributed_test.py#L7707 instead of logging it properly within https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/distributed/distributed_test.py#L7695.

With this fix, UCC catches the error within the function correctly.

src/components/tl/nccl/tl_nccl.h Show resolved Hide resolved
src/core/ucc_coll.c Show resolved Hide resolved
@Sergei-Lebedev Sergei-Lebedev force-pushed the topic/triggered_timeout branch from de81b7d to 39e24c7 Compare November 18, 2022 13:23
@Sergei-Lebedev Sergei-Lebedev merged commit 1c7a712 into openucx:master Nov 18, 2022
@Sergei-Lebedev Sergei-Lebedev deleted the topic/triggered_timeout branch November 18, 2022 17:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants