Fix device issue in OpenLlamaModelTest::test_model_parallelism #24195

Merged: 1 commit merged into main from fix_openllama on Jun 12, 2023

Conversation

@ydshieh (Collaborator) commented on Jun 12, 2023

What does this PR do?

See the comments in the changes.

Currently, CI has a failure:

```
src/transformers/models/open_llama/modeling_open_llama.py:740: in forward
    logits = torch.einsum("blh,vh->blv", hidden_states, self.model.embed_tokens.weight)
...
...
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
```
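For context, a minimal sketch of how this mismatch arises under model parallelism and how the fix resolves it (hypothetical shapes; assumes a machine with at least two CUDA devices):

```python
import torch

# Under model parallelism, layers are spread across GPUs, so the final
# hidden states can end up on a different device than the tied embedding.
hidden_states = torch.randn(1, 4, 8, device="cuda:1")  # last-layer output on cuda:1
embed_weight = torch.randn(16, 8, device="cuda:0")     # embed_tokens.weight on cuda:0

# This line raises:
# RuntimeError: Expected all tensors to be on the same device ...
# logits = torch.einsum("blh,vh->blv", hidden_states, embed_weight)

# The fix: move the (lighter) hidden states onto the embedding's device first.
logits = torch.einsum("blh,vh->blv", hidden_states.to(embed_weight.device), embed_weight)
print(logits.shape)  # torch.Size([1, 4, 16]), on cuda:0
```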

src/transformers/models/open_llama/modeling_open_llama.py

```diff
@@ -736,12 +736,16 @@ def forward(
 
         hidden_states = outputs[0]
         if self.config.shared_input_output_embedding:
-            logits = torch.einsum("blh,vh->blv", hidden_states, self.model.embed_tokens.weight)
+            logits = torch.einsum(
+                "blh,vh->blv", hidden_states.to(self.model.embed_tokens.weight.device), self.model.embed_tokens.weight
+            )
```

@ydshieh (Collaborator, Author) commented on this change: send hidden_states (lighter) to the embedding's (heavy) device.
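To make the "lighter vs. heavy" point concrete, a back-of-the-envelope sketch (illustrative numbers, not from the PR): hidden_states has shape (batch, seq_len, hidden) while the tied embedding weight has shape (vocab_size, hidden), and vocab_size typically dwarfs batch * seq_len, so moving hidden_states copies far less data across devices:

```python
# Illustrative sizes (assumptions, not from the PR): small batch, typical vocab.
batch, seq_len, hidden, vocab = 2, 512, 4096, 32000

hidden_numel = batch * seq_len * hidden  # 4,194,304 elements in hidden_states
weight_numel = vocab * hidden            # 131,072,000 elements in embed_tokens.weight

# ~31x more data would cross devices if we moved the weight instead,
# hence hidden_states.to(weight.device) rather than the other way around.
print(weight_numel / hidden_numel)
```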

```diff
         else:
             logits = self.lm_head(hidden_states)
 
         loss = None
         if labels is not None:
+            # move labels to correct device to enable model parallelism
+            labels = labels.to(logits.device)
```
@ydshieh (Collaborator, Author) commented on this change: just copied from other modeling files.
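For reference, a sketch of that pattern as it appears in other causal-LM modeling files in the library (stand-in tensors here; the exact shift-and-flatten details vary by model):

```python
import torch
from torch.nn import CrossEntropyLoss

# Stand-ins for the model outputs (hypothetical shapes).
logits = torch.randn(2, 16, 32000)         # (batch, seq_len, vocab_size)
labels = torch.randint(0, 32000, (2, 16))  # (batch, seq_len)

loss = None
if labels is not None:
    # move labels to correct device to enable model parallelism
    labels = labels.to(logits.device)
    # shift so that tokens < n predict n
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    loss_fct = CrossEntropyLoss()
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
```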

@ydshieh requested a review from sgugger on Jun 12, 2023 at 12:25
@HuggingFaceDocBuilderDev commented on Jun 12, 2023

The documentation is not available anymore as the PR was closed or merged.

@sgugger (Collaborator) left a comment

Thanks for the fixes!

@ydshieh merged commit a9cdb05 into main on Jun 12, 2023
@ydshieh deleted the fix_openllama branch on Jun 12, 2023 at 13:21
novice03 pushed a commit to novice03/transformers that referenced this pull request on Jun 23, 2023

Fix device issue in OpenLlamaModelTest::test_model_parallelism (huggingface#24195)

fix

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>