Place metric functions for BLEU and ROUGE on correct devices when using multiple GPUs #3671

Merged
arnavgarg1 merged 3 commits into master from dist_text_metrics on Sep 27, 2023

Conversation

arnavgarg1
Contributor

The issue was that the BLEU and ROUGE metric functions weren't being placed on the right device. This led to odd behavior for metrics that use the response prediction key, because the inputs passed to them are lists of strings rather than tensors. It seems these metric functions need to be moved to CUDA (instead of staying on the CPU) so that when metric_fn.compute() is called to gather evaluation metric summaries, it does not run into this error:

RuntimeError: Tensors must be CUDA and dense
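
As a rough illustration of the idea (not the code from this PR), the sketch below assumes the metric is a `torch.nn.Module`-style object that keeps its running state in tensors. `ToyTextMetric` is a hypothetical stand-in for a BLEU/ROUGE-style metric, and its token-overlap score is made up for the example. Because the inputs are plain strings, nothing moves the metric state implicitly; it has to be placed on the target device explicitly with `.to(device)`.

```python
import torch
from torch import nn


class ToyTextMetric(nn.Module):
    """Hypothetical stand-in for a text metric: string inputs, tensor state."""

    def __init__(self):
        super().__init__()
        # Keep state in buffers so that .to(device) moves it with the module.
        self.register_buffer("matches", torch.tensor(0.0))
        self.register_buffer("total", torch.tensor(0.0))

    def update(self, preds, targets):
        # preds/targets are lists of strings, so they carry no device info.
        for pred, target in zip(preds, targets):
            pred_tokens = pred.split()
            target_tokens = set(target.split())
            self.matches += sum(tok in target_tokens for tok in pred_tokens)
            self.total += len(pred_tokens)

    def compute(self):
        # The result lives on whatever device the state buffers are on.
        return self.matches / self.total.clamp(min=1.0)


# Use the GPU when one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Explicit placement: the string inputs will never trigger a device move.
metric = ToyTextMetric().to(device)
metric.update(["the cat sat"], ["the cat sat down"])
score = metric.compute()
```

On a CPU-only machine this leaves the state on the CPU, while on a GPU machine `score` is a CUDA tensor, which is the form that GPU-only collectives expect when metric state is gathered across processes.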

Tested successfully with:

  1. Only CPU machine
  2. Multi-GPU Quantized training (4 GPUs)
  3. Multi-GPU DeepSpeed Stage 3 training (4 GPUs)

@github-actions

Unit Test Results

       6 files  ±0         6 suites  ±0   53m 16s ⏱️ + 2m 5s
2 807 tests ±0  2 793 ✔️ ±0  12 💤 ±0  2 ±0 
2 847 runs  ±0  2 824 ✔️ ±0  21 💤 ±0  2 ±0 

For more details on these failures, see this check.

Results for commit e7d0f6f. ± Comparison against base commit 4af5331.

@arnavgarg1 arnavgarg1 merged commit 1286123 into master Sep 27, 2023
@arnavgarg1 arnavgarg1 deleted the dist_text_metrics branch September 27, 2023 18:25