Better error when device mismatches when calling gather() on CUDA #2180

muellerzr · 2023-11-22T15:25:25Z

What does this PR do?

This PR adds a new explicit error when a user calls .gather() in a GPU scenario and the device of the passed-in tensor differs from the device in PartialState (i.e. CUDA). This avoids users getting the following error:

  File "/home/student/Experiemnts/MultiLabelClassification_LLMs/.mlc/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2897, in all_gather_into_tensor
    return honor_type(
  File "/home/student/Experiemnts/MultiLabelClassification_LLMs/.mlc/lib/python3.10/site-packages/accelerate/utils/operations.py", line 83, in honor_type
    return type(obj)(generator)
  File "/home/student/Experiemnts/MultiLabelClassification_LLMs/.mlc/lib/python3.10/site-packages/accelerate/utils/operations.py", line 112, in <genexpr>
    recursively_apply(
  File "/home/student/Experiemnts/MultiLabelClassification_LLMs/.mlc/lib/python3.10/site-packages/accelerate/utils/operations.py", line 128, in recursively_apply
    return func(data, *args, **kwargs)
  File "/home/student/Experiemnts/MultiLabelClassification_LLMs/.mlc/lib/python3.10/site-packages/accelerate/utils/operations.py", line 307, in _gpu_gather_one
    gather_op(output_tensors, tensor)
  File "/home/student/Experiemnts/MultiLabelClassification_LLMs/.mlc/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/student/Experiemnts/MultiLabelClassification_LLMs/.mlc/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2897, in all_gather_into_tensor
    work = group._allgather_base(output_tensor, input_tensor)
RuntimeError: Tensors must be CUDA and dense

Instead, it gives them a much clearer error:

"One or more of the tensors passed to `gather` was not on the GPU while the `Accelerator` is configured for CUDA. "
                "Please move it to the GPU before calling `gather`."

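The check described above can be sketched in isolation. This is a minimal illustration, not the actual patch: `check_gather_device` is a hypothetical helper name, the `RuntimeError` type is an assumption (the excerpt here does not show which exception is raised), and `SimpleNamespace` objects stand in for the real state and tensors so the snippet runs without a GPU.

```python
from types import SimpleNamespace

GATHER_DEVICE_ERROR = (
    "One or more of the tensors passed to `gather` was not on the GPU while the "
    "`Accelerator` is configured for CUDA. "
    "Please move it to the GPU before calling `gather`."
)


def check_gather_device(state, tensor):
    """Raise early if `state` is on CUDA but `tensor` is not (hypothetical helper)."""
    if state.device.type == "cuda" and tensor.device.type != "cuda":
        # Assumption: RuntimeError; the excerpt does not show the exception type.
        raise RuntimeError(GATHER_DEVICE_ERROR)


# Stand-ins that only mimic the `.device.type` attribute of the real objects.
cuda_state = SimpleNamespace(device=SimpleNamespace(type="cuda"))
cuda_tensor = SimpleNamespace(device=SimpleNamespace(type="cuda"))
cpu_tensor = SimpleNamespace(device=SimpleNamespace(type="cpu"))

check_gather_device(cuda_state, cuda_tensor)  # matching devices: no error

try:
    check_gather_device(cuda_state, cpu_tensor)  # CPU tensor with a CUDA state
except RuntimeError as err:
    print(err)
```

On the user side, the fix the message asks for is usually a single call such as `tensor = tensor.to(accelerator.device)` before `accelerator.gather(tensor)`.
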
Fixes # (issue)

https://discuss.huggingface.co/t/problem-with-model-inference-using-accelerate/63078

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@BenjaminBossan @SunMarc

HuggingFaceDocBuilderDev · 2023-11-22T15:29:08Z

The documentation is not available anymore as the PR was closed or merged.

BenjaminBossan

👍 for better error messages.

I have two nits, not blockers for the PR.

src/accelerate/utils/operations.py

BenjaminBossan · 2023-11-23T12:43:48Z

@@ -298,6 +298,13 @@ def _gpu_gather_one(tensor):
        if not tensor.is_contiguous():
            tensor = tensor.contiguous()

+        # Check if `tensor` is not on CUDA
+        if state.device.type == "cuda" and tensor.device.type != "cuda":


Are there other device mismatches that could be checked here?

abhilash1910 · 2023-11-28

@muellerzr @BenjaminBossan Can this logic be extended to other devices? This seems like a generic exception-handling case.

muellerzr · 2023-11-28

Yes, it can. For now it only covers CUDA, but if it's useful for other devices, that can be added. This is just a known base case.
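For illustration, the generalization discussed in this thread might look roughly like the sketch below. It is not code from this PR or from the follow-up: `check_device_types` is a hypothetical helper, the `RuntimeError` type and the message wording are assumptions, and `SimpleNamespace` stand-ins replace real tensors so the snippet runs without any accelerator.

```python
from types import SimpleNamespace


def check_device_types(state, tensors, op_name="gather"):
    """Raise if any tensor's device type differs from the state's (hypothetical helper)."""
    expected = state.device.type
    # Collect every device type that does not match the configured one.
    found = sorted({t.device.type for t in tensors} - {expected})
    if found:
        # Assumption: RuntimeError and this wording are illustrative only.
        raise RuntimeError(
            f"One or more of the tensors passed to `{op_name}` is on {found} while "
            f"the state is configured for '{expected}'. "
            f"Please move it to '{expected}' before calling `{op_name}`."
        )


# Stand-ins that only mimic the `.device.type` attribute of the real objects.
xpu_state = SimpleNamespace(device=SimpleNamespace(type="xpu"))
xpu_tensor = SimpleNamespace(device=SimpleNamespace(type="xpu"))
cpu_tensor = SimpleNamespace(device=SimpleNamespace(type="cpu"))

check_device_types(xpu_state, [xpu_tensor])  # matching devices: no error

try:
    check_device_types(xpu_state, [xpu_tensor, cpu_tensor])
except RuntimeError as err:
    print(err)
```

Unlike the CUDA-only base case, a check written this way would also reject a GPU tensor when the state is configured for CPU, which may or may not be desirable.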

SunMarc

LGTM !

Better err

7332442

muellerzr requested review from BenjaminBossan and SunMarc November 22, 2023 15:25

BenjaminBossan approved these changes Nov 23, 2023

SunMarc approved these changes Nov 24, 2023

Update src/accelerate/utils/operations.py

6de8384

Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>

abhilash1910 mentioned this pull request Nov 29, 2023

Add allgather check for xpu #2199

Merged

muellerzr merged commit 1516379 into main Nov 29, 2023
25 checks passed

muellerzr deleted the check-device branch November 29, 2023 17:11
