Replies: 1 comment
-
Hi, if you try a newer container (e.g. >= 24.05) there shouldn't be an issue. If there is, please re-open this ticket, and apologies for the late response.
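A minimal sketch of what moving to a newer container might look like, assuming a 24.05 framework tag is published on NGC (the exact tag name below is an assumption and should be verified there), with a hypothetical model mount path:

```bash
# Pull a newer NeMo framework container (tag name assumed; check NGC for the exact tag).
docker pull nvcr.io/nvidia/nemo:24.05

# Run it interactively with GPU access; --ipc=host avoids shared-memory limits
# that can interfere with multi-worker data loading. The mount path is a placeholder.
docker run --gpus all --ipc=host -it --rm \
    -v /path/to/models:/workspace/models \
    nvcr.io/nvidia/nemo:24.05 bash
```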
-
I am trying to run an example that looks similar to this one: https://github.com/NVIDIA/NeMo/blob/48b8204d57e59c8790aaa6eaa20384b046b1a574/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py
I am using the Docker container nvcr.io/nvidia/nemo:24.01.framework with the torchrun command. My initial model is a Mistral 7B converted to NeMo format. Execution looks like this:

When I use an L4 GPU everything is OK (although it eventually ends with a CUDA OOM), and an H100 also works. However, when I switch to an A100 80GB, initialization hangs before checkpoint loading. Below is the screenshot for the L4 run; on the A100 it hangs earlier, and I never see the blue "Loading distributed checkpoint ..." message. Any ideas how to fix this?
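For reference, a launch of this example is typically along the lines of the sketch below. The model path, dataset paths, and Hydra overrides are hypothetical placeholders rather than the exact command from the screenshot, and the NCCL environment variables are only a common way to get more information when distributed initialization hangs on a particular GPU type:

```bash
# Illustrative only: paths and overrides are placeholders, not the original command.
# NCCL_DEBUG=INFO makes NCCL log its initialization steps, which usually shows
# where the ranks get stuck when a hang happens before checkpoint loading.
export NCCL_DEBUG=INFO
# If the hang persists, NCCL_P2P_DISABLE=1 is a blunt but common test to rule out
# peer-to-peer (NVLink/PCIe) transport issues on a given machine.
# export NCCL_P2P_DISABLE=1

torchrun --nproc_per_node=1 --nnodes=1 \
    examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    trainer.precision=bf16 \
    model.restore_from_path=/workspace/models/mistral-7b.nemo \
    'model.data.train_ds.file_names=[/workspace/data/train.jsonl]' \
    'model.data.validation_ds.file_names=[/workspace/data/val.jsonl]'
```

The overrides shown are a small illustrative subset; the conf/ directory next to the example script in the NeMo repository lists the full set of configuration options.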