[HANDS-ON BUG] #560

Open
MojtabaAbdi opened this issue Sep 8, 2024 · 13 comments

@MojtabaAbdi

MojtabaAbdi commented Sep 8, 2024

### Bonus Unit 1 Notebook Error
Hello. I have a problem running the Bonus Unit 1 notebook. The error comes from this line, which I have not modified at all:

!mlagents-learn ./config/ppo/Huggy.yaml --env=./trained-envs-executables/linux/Huggy/Huggy --run-id="Huggy2" --no-graphics

Below is a screenshot of the cell's execution:
[Screenshot: HuggyBuggy]

I copied the Bonus Unit 1 notebook to my Google Drive and ran it there.

@RubSevian

I have the same problem.

@RubSevian

image
I fixed this problem with a quick fix: changing line 56 to use torch.float32 in the file /content/ml-agents/ml-agents/mlagents/torch_utils/torch.py.
P.S. This line is already fixed in the screenshot.

@simoninithomas
Member

Hi, I think the solution provided by @RubSevian is the best for now (thanks 🤗). I'm going to check with the MLAgents team to see where this error comes from.

@MojtabaAbdi
Author

@RubSevian @simoninithomas Thank you very much. It worked for me too.

@iyaijuil

iyaijuil commented Sep 9, 2024

> Hi, I think the solution provided by @RubSevian is the best for now (thanks 🤗). I'm going to check with the MLAgents team to see where this error comes from.

Hi, I hit the same problem in Unit 5 (SnowballTarget). I tried the same solution from @RubSevian, but it did not fix the error there (it did work for the Unit 1 problem).

Here is a screenshot of the cell's execution after I applied @RubSevian's solution:

Screenshot 2024-09-09 at 12 52 29 PM

"RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)"

@RubSevian

@iyaijuil Based on your error, I assume the problem is in device selection. Perhaps you need to specify explicitly whether to use the CPU or the GPU (CUDA).

@MrPark97

> > Hi, I think the solution provided by @RubSevian is the best for now (thanks 🤗). I'm going to check with the MLAgents team to see where this error comes from.
>
> Hi, I hit the same problem in Unit 5 (SnowballTarget). I tried the same solution from @RubSevian, but it did not fix the error there (it did work for the Unit 1 problem).
>
> Here is a screenshot of the cell's execution after I applied @RubSevian's solution:
>
> Screenshot 2024-09-09 at 12 52 29 PM "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)"

I've encountered the same problem with the 5th unit.

@iyaijuil

> @iyaijuil Based on your error, I assume the problem is in device selection. Perhaps you need to specify explicitly whether to use the CPU or the GPU (CUDA).

Thanks for your reply. I used Google Colab to train the model. Following the tutorial, I selected the T4 GPU as my runtime type, and I am working from a MacBook Pro M3. Could there be a conflict within this setup?

@maartenx01

I'm encountering the same issue on Unit 5 of the Deep RL Course: "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)". No issues with Units 1-4.

@Andimeo

Andimeo commented Sep 18, 2024

Same for me. I don't know how to set the device explicitly.

I even tried adding a .to(device) call in each forward function in mlagents/trainers/torch_entities/networks.py, but then a different error appears (something about an ambiguous bool).

@MojtabaAbdi
Author

Actually, you don't need a GPU to train. It took me 12 minutes to train the model on a CPU in Colab, and that way you won't encounter these errors.

@maartenx01

> Actually, you don't need a GPU to train. It took me 12 minutes to train the model on a CPU in Colab, and that way you won't encounter these errors.

Thank you so much! This worked!

@grib0ed0v
Contributor

It looks like the proposed fix (changing torch.cuda.FloatTensor to torch.float32) was merged upstream in ml-agents.

But it doesn't work for me either: I hit the same error that @iyaijuil described.

In the end I just ran the experiment on the CPU by setting an environment variable:

!CUDA_VISIBLE_DEVICES='' mlagents-learn ./config/ppo/SnowballTarget.yaml --env=./training-envs-executables/linux/SnowballTarget/SnowballTarget --run-id="SnowballTarget1" --no-graphics

For me, training for 200k steps took around 8 minutes on a Colab CPU, so I agree with @MojtabaAbdi: just run on the CPU and that's it.

[INFO] SnowballTarget. Step: 200000. Time Elapsed: 443.264 s. Mean Reward: 25.114. Std of Reward: 2.328. Training.
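
For anyone who would rather apply the same workaround from Python than prefix the shell line, here is a minimal sketch. It assumes only what the command above already shows: an empty `CUDA_VISIBLE_DEVICES` hides the GPU from the training process, and the `mlagents-learn` flags are copied from that command. The helper names `cpu_only_env`, `build_command`, and `run_on_cpu` are hypothetical and are not part of ml-agents.

```python
import os
import shlex
import subprocess


def cpu_only_env(base_env=None):
    """Return a copy of the environment with every CUDA device hidden."""
    env = dict(os.environ if base_env is None else base_env)
    # An empty CUDA_VISIBLE_DEVICES exposes no GPUs to the child process,
    # so training there runs on the CPU.
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env


def build_command(config, executable, run_id):
    """Assemble the mlagents-learn call using the flags quoted in this thread."""
    return [
        "mlagents-learn",
        config,
        f"--env={executable}",
        f"--run-id={run_id}",
        "--no-graphics",
    ]


def run_on_cpu(config, executable, run_id):
    """Hypothetical helper: launch training with the GPU hidden."""
    return subprocess.run(
        build_command(config, executable, run_id),
        env=cpu_only_env(),
        check=True,
    )


# Print the command for inspection; nothing is launched at import time.
command = build_command(
    "./config/ppo/SnowballTarget.yaml",
    "./training-envs-executables/linux/SnowballTarget/SnowballTarget",
    "SnowballTarget1",
)
print(shlex.join(command))
```

Calling `run_on_cpu(...)` with the same three arguments should behave like the prefixed shell command above, provided `mlagents-learn` is installed and on the PATH. Only the child process sees the modified environment, so the notebook kernel itself is left unchanged.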
