Training details about MineAgent #9
Comments
Hello, did you manage to reimplement the training code for the agents with PPO? I'm getting some issues with the nested dicts despite using the multi-input policy.
@iSach Hi. I tried to reimplement PPO from the CleanRL code. I use Gym's vectorized env to speed up rollouts.
I'm not very familiar with running more complex environments like these (I've only run very basic envs from Gym's tutorials). Do you have a repo or a gist to look at? My main issue is dealing with the nested dicts in the env's observation space. I tried to implement a custom features extractor based on SimpleFeatureFusion, but I can't get anything running at all.
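If the blocker is that the observation space is a nested Dict (which, to my knowledge, Stable-Baselines3's MultiInputPolicy does not accept), one workaround is a wrapper that flattens it into a single-level Dict before the features extractor sees it. A minimal sketch, not taken from MineCLIP or the SB3 docs; the wrapper name and key scheme are invented here:

```python
import gym
import numpy as np
from gym import spaces


def _flatten_space(space, prefix=""):
    """Recursively flatten a (possibly nested) Dict space into {flat_key: leaf_space}."""
    if isinstance(space, spaces.Dict):
        flat = {}
        for key, sub in space.spaces.items():
            flat.update(_flatten_space(sub, f"{prefix}{key}."))
        return flat
    return {prefix.rstrip("."): space}


def _flatten_obs(obs, prefix=""):
    """Flatten a nested dict observation using the same keys as _flatten_space."""
    if isinstance(obs, dict):
        flat = {}
        for key, sub in obs.items():
            flat.update(_flatten_obs(sub, f"{prefix}{key}."))
        return flat
    return {prefix.rstrip("."): np.asarray(obs)}


class FlattenDictWrapper(gym.ObservationWrapper):
    """Expose a nested Dict observation space as a single-level Dict
    so a multi-input policy or custom features extractor can consume it."""

    def __init__(self, env):
        super().__init__(env)
        self.observation_space = spaces.Dict(_flatten_space(env.observation_space))

    def observation(self, observation):
        return _flatten_obs(observation)
```

With this in place, the policy (or a custom extractor in the style of SimpleFeatureFusion) only has to handle one flat Dict of leaf spaces.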
Unfortunately, not at the moment. I don't think my previous code is bug-free or worth referencing. However, I do suggest starting from their provided code, such as `main/mineagent/run_env_in_loop.py`.
I tried, but I'm running into many problems with PPO because of the unusual environment, and I can't get a clean training script working. I don't understand why they would release everything except the code for reproducing the results, especially considering how few tasks are demonstrated in the code.
About the policy algorithm training:
I would appreciate it if you could clarify the points above. It would also be helpful if you released the policy training code in the future.
Hi @elcajas, since the authors have not replied to this issue, I did not continue reimplementing PPO in MineDojo. Here is what I can share: I implemented PPO based on the CleanRL version and adopted a vectorized env to speed things up. The network backbone is similar to the FeatureFusion from this repo. On your points:
- After a fixed number of env steps.
- I refer to the CleanRL code and Table A.3 from the MineDojo paper.
- Yes.
- No. Using the default discrete version of PPO is okay.
- Unfortunately, I haven't tried that.
- I'm not clear about this question. Can you provide some details?

Generally, these are just some of my experiences, and I have not worked on this recently. I sincerely hope the authors and our community can open-source some RL approaches to this benchmark.
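For anyone trying to reproduce this setup, here is a minimal sketch (not the commenter's actual code) of a CleanRL-style actor-critic whose backbone fuses per-modality embeddings, in the spirit of FeatureFusion, and emits a single Discrete action head. The class name, observation keys, and dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


class FusionAgent(nn.Module):
    """Actor-critic with one small encoder per observation modality,
    concatenated into a shared trunk (FeatureFusion-style)."""

    def __init__(self, obs_dims: dict, n_actions: int, hidden: int = 256):
        super().__init__()
        self.encoders = nn.ModuleDict({
            key: nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
            for key, dim in obs_dims.items()
        })
        self.trunk = nn.Sequential(nn.Linear(hidden * len(obs_dims), hidden), nn.ReLU())
        self.actor = nn.Linear(hidden, n_actions)   # single Discrete action head
        self.critic = nn.Linear(hidden, 1)

    def forward(self, obs: dict):
        # Concatenate per-modality features, then share the trunk for actor and critic.
        feats = torch.cat([self.encoders[k](obs[k]) for k in self.encoders], dim=-1)
        h = self.trunk(feats)
        return Categorical(logits=self.actor(h)), self.critic(h)


# Example (assumed dims): a 512-d image embedding plus a 4-d compass vector, 89 actions.
agent = FusionAgent({"rgb_emb": 512, "compass": 4}, n_actions=89)
dist, value = agent({"rgb_emb": torch.randn(8, 512), "compass": torch.randn(8, 4)})
action = dist.sample()  # plug into a CleanRL-style PPO rollout/update loop
```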
Also found a bug in the example code. See #11.
@elcajas I have the same questions as you. Did you get any further?
Hi. Thank you for releasing this precious benchmark! I'm working on implementing the PPO agent you reported in the paper. However, I found some misalignments between the code and the paper.

Trimmed action space

As mentioned in #4, the code below does not correspond to the 89 action dims in Appendix G.2:

MineCLIP/main/mineagent/run_env_in_loop.py (line 75 in e6c06a0)
About the `compass` observation

In the paper, the compass has a shape of `(2,)`. However, I see an input of shape `(4,)` in your code:

MineCLIP/main/mineagent/run_env_in_loop.py (line 25 in e6c06a0)
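One possible explanation for the extra dimensions, which is purely a guess and not confirmed by the repo, is that the two compass angles (yaw, pitch) are encoded as sine/cosine pairs, turning a `(2,)` reading into a `(4,)` vector. A minimal sketch of that encoding:

```python
# Hypothetical sketch: encoding a (yaw, pitch) compass reading as sin/cos pairs.
# This is an assumption about the (4,) shape; it is not taken from run_env_in_loop.py.
import numpy as np

def encode_compass(yaw_deg: float, pitch_deg: float) -> np.ndarray:
    """Map two angles to a 4-dim vector that stays continuous across the +/-180 wrap."""
    yaw, pitch = np.deg2rad(yaw_deg), np.deg2rad(pitch_deg)
    return np.array([np.sin(yaw), np.cos(yaw), np.sin(pitch), np.cos(pitch)],
                    dtype=np.float32)

print(encode_compass(90.0, -30.0).shape)  # (4,)
```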
Training on a `MultiDiscrete` action space

Is the 89-dimension action space in the paper a `MultiDiscrete` action space like the original MineDojo action space, or do you simply treat it as a `Discrete` action space?

In addition, can you release the training code for the three task groups in the paper (or share it via my GitHub email)? It would be beneficial for baseline comparisons!
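On the `MultiDiscrete` vs. `Discrete` point, the reply above suggests the default discrete PPO is enough. One common way to do that is to enumerate the allowed action combinations once and let the policy pick a single index. A sketch under assumed dims; the particular trimming below is made up for illustration and does not reproduce the 89 actions from Appendix G.2:

```python
# Hypothetical sketch: build a lookup table from a Discrete index to a full
# MineDojo MultiDiscrete action vector, varying only a few dims and keeping
# the rest at their no-op values.
import itertools
import numpy as np

NOOP = [0, 0, 0, 12, 12, 0, 0, 0]  # assumed no-op for MineDojo's 8-dim action

move_choices = [(0, 0), (1, 0), (2, 0), (0, 1), (0, 2)]             # stay / fwd / back / left / right
camera_choices = [(12, 12), (10, 12), (14, 12), (12, 10), (12, 14)]  # small pitch/yaw steps around center
fn_choices = [0, 3]                                                  # e.g. no-op and attack (assumed indices)

ACTION_TABLE = []
for (fwd, strafe), (pitch, yaw), fn in itertools.product(move_choices, camera_choices, fn_choices):
    a = list(NOOP)
    a[0], a[1], a[3], a[4], a[5] = fwd, strafe, pitch, yaw, fn
    ACTION_TABLE.append(np.array(a))

def discrete_to_minedojo(idx: int) -> np.ndarray:
    """Map a Discrete(len(ACTION_TABLE)) index back to a MineDojo action vector."""
    return ACTION_TABLE[idx]

print(len(ACTION_TABLE))  # 50 combinations in this toy trimming
```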