Experiments on some nodes are 10x slower than others #19

Open
h4duan opened this issue Feb 21, 2024 · 1 comment

h4duan commented Feb 21, 2024

Hi,

I just launched a 4-node experiment on mi1008x (t006-[009-010],t007-[009-010]) and found that it ran significantly slower (more than 10 times) than before. I then ran the exact same experiment on another set of 4 nodes (t004-007,t006-007,t008-[007,009]), and the speed was the same as before. I haven't experienced this issue before. I'm wondering if there's something wrong with the nodes in (t006-[009-010],t007-[009-010]). Thanks!

jordap commented Feb 22, 2024

Hello @h4duan. Did you try to reproduce the issue again on the same nodes? I noticed that one of the GPUs (t006-009, ID 0) remained unused during the execution you mention, but I've been able to run successfully on that same GPU.
