Failed to eval finetuned model on aloha-sim-cube gym environment #34

Open
nicehiro opened this issue Jan 15, 2024 · 5 comments

Comments

@nicehiro

Hi, thanks for your great work!

I finetuned the model using examples/02_finetune_new_observation_action.py, and I'm now running examples/03_eval_finetuned.py to evaluate the finetuned model.

I followed the instructions:

Finally modify the sys.path.append statement below to add the ACT repo to your path and start a virtual display:
Xvfb :1 -screen 0 1024x768x16 &
export DISPLAY=:1

and added sys.path.append("/path/to/act"), but gym.make("aloha-sim-cube-v0") still fails.
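
For context, here is a minimal sketch of what I'm running (the ACT path is a placeholder for my local clone, and I'm assuming the env registration from the example code happens on import):

# Minimal sketch of my setup; "/path/to/act" is a placeholder for my ACT clone.
import sys

sys.path.append("/path/to/act")  # make the ACT repo (sim_env etc.) importable

import gym

# Assumes the example code has registered the env on import.
env = gym.make("aloha-sim-cube-v0")  # this is the call that fails for me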

Another problem is that I cannot load the finetuned model. Here's the traceback:

Traceback (most recent call last):
  File "/code/octo/examples/03_eval_finetuned.py", line 101, in <module>
    app.run(main)
  File "/opt/conda/envs/octo/lib/python3.10/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/opt/conda/envs/octo/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/code/octo/examples/03_eval_finetuned.py", line 35, in main
    model = OctoModel.load_pretrained(FLAGS.finetuned_path)
  File "/code/octo/octo/model/octo_model.py", line 274, in load_pretrained
    params = checkpointer.restore(step, params_shape)
  File "/opt/conda/envs/octo/lib/python3.10/site-packages/orbax/checkpoint/checkpoint_manager.py", line 550, in restore
    restored_items = self._restore_impl(
  File "/opt/conda/envs/octo/lib/python3.10/site-packages/orbax/checkpoint/checkpoint_manager.py", line 582, in _restore_impl
    restored[item_name] = self._checkpointers[item_name].restore(
  File "/opt/conda/envs/octo/lib/python3.10/site-packages/orbax/checkpoint/checkpointer.py", line 165, in restore
    restored = self._restore_with_args(directory, *args, **kwargs)
  File "/opt/conda/envs/octo/lib/python3.10/site-packages/orbax/checkpoint/checkpointer.py", line 103, in _restore_with_args
    restored = self._handler.restore(directory, args=ckpt_args)
  File "/opt/conda/envs/octo/lib/python3.10/site-packages/orbax/checkpoint/pytree_checkpoint_handler.py", line 1063, in restore
    restored_item = _transform_checkpoint(
  File "/opt/conda/envs/octo/lib/python3.10/site-packages/orbax/checkpoint/pytree_checkpoint_handler.py", line 601, in _transform_checkpoint
    item = utils.deserialize_tree(restored, item)
  File "/opt/conda/envs/octo/lib/python3.10/site-packages/orbax/checkpoint/utils.py", line 281, in deserialize_tree
    return jax.tree_util.tree_map_with_path(
  File "/opt/conda/envs/octo/lib/python3.10/site-packages/jax/_src/tree_util.py", line 857, in tree_map_with_path
    return treedef.unflatten(f(*xs) for xs in zip(*all_keypath_leaves))
  File "/opt/conda/envs/octo/lib/python3.10/site-packages/jax/_src/tree_util.py", line 857, in <genexpr>
    return treedef.unflatten(f(*xs) for xs in zip(*all_keypath_leaves))
  File "/opt/conda/envs/octo/lib/python3.10/site-packages/orbax/checkpoint/utils.py", line 278, in _reconstruct_from_keypath
    result = result[key_name]
KeyError: 'diffusion_model'

It looks like the diffusion model was not saved during training. Did I miss something in the configuration?

Thanks.

@kpertsch
Collaborator

Thanks for giving the model a try!
Sorry about the issues with the eval_finetuned example -- it seems some lines got deleted in our cleanup. This should hopefully be fixed in #40.
Once it's merged, can you try gym.make-ing the environment again?

For the model loading: it's surprising that it tries to load a key "diffusion_model", since the 02_finetune_new_observation_action.py example replaces the diffusion head with an L1 head, so there should be no diffusion components left in the model. Can you inspect the config saved alongside the finetuned checkpoint and check whether the diffusion head was correctly replaced with the L1 head, or whether some other diffusion head is still in there? Just to make sure: you set the finetuned_path argument to where the finetuning checkpoint from example (2) was saved, correct?
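
Something along these lines should show it (a rough sketch; it assumes the finetuning run wrote a config.json into the checkpoint directory, and the exact key nesting may differ):

# Rough sketch: inspect the config saved with the finetuned checkpoint.
# Assumes a config.json was written to the checkpoint directory; the
# path and the key nesting are placeholders.
import json

with open("/path/to/finetuned_checkpoint/config.json") as f:
    config = json.load(f)

# The action head spec should reference L1ActionHead, not a diffusion head.
print(json.dumps(config["model"]["heads"], indent=2))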

@nicehiro
Author

Once it's merged, can you try again to gym.make the environment?

Yes, I'd be happy to.

Just to make sure: you set the finetuned_path argument to where the finetuning checkpoint from example (2) was saved, correct?

Yes. I'm using the following command, where /output/finetuned_model is the directory containing the saved finetuned model.

python examples/03_eval_finetuned.py --finetuned_path="/output/finetuned_model"

The action_head in config.json is:

[screenshot of the action_head section of config.json omitted]

@safsin

safsin commented Mar 5, 2024

I'm able to import sim_env, but the 03_eval_finetuned example throws KeyError: 'proprio' at line 328 of gym_wrappers.py.

After changing line 72 in 03_eval_finetuned.py to ...model.dataset_statistics['bridge_dataset']..., it instead throws ValueError: operands could not be broadcast together with shapes (1,14) (8,). I get the same error when trying the other datasets. Please help with running this example code.
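
For reference, this is roughly how I'm looking at the shapes (a sketch; I'm assuming the statistics of a finetuned checkpoint are keyed by dataset name like the pre-trained ones, which may not hold, and the checkpoint path is a placeholder):

# Sketch: compare the action statistics' shape with the env's action dim.
# Assumes dataset_statistics is keyed by dataset name, as for the
# pre-trained models; a finetuned checkpoint may store it differently.
from octo.model.octo_model import OctoModel

model = OctoModel.load_pretrained("/path/to/finetuned_checkpoint")
print(model.dataset_statistics.keys())  # list the available dataset keys

stats = model.dataset_statistics["bridge_dataset"]["action"]
print(stats["mean"].shape)  # (8,) here, while the ALOHA env emits 14-dim actions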

@BUAAZhangHaonan

BUAAZhangHaonan commented Apr 9, 2024

I'm able to import sim_env, but the 03_eval_finetuned example throws KeyError: 'proprio' at line 328 of gym_wrappers.py.

After changing line 72 in 03_eval_finetuned.py to ...model.dataset_statistics['bridge_dataset']..., it instead throws ValueError: operands could not be broadcast together with shapes (1,14) (8,). I get the same error when trying the other datasets. Please help with running this example code.

I encountered the same problem. My device did not have enough GPU memory to fine-tune in the ALOHA environment, so I don't have finetuned results and can't rule out that the error comes from running inference without a finetuned checkpoint. But I checked the dataset_statistics.json file and found that proprio has 8 dimensions for every dataset, so I assumed it would also be 8-dimensional after fine-tuning. However, the post-fine-tuning config shown in issue #42 (comment) has action_dim 14, not 8.

@kpertsch
Collaborator

kpertsch commented Apr 9, 2024

Yes, ALOHA is a bimanual setup, so its action space is 14-dimensional, while our pre-training data is all single-arm data with an 8-dimensional action space.
You can therefore only evaluate the Octo model on the ALOHA setup after fine-tuning, since a new action head with the correct action dimensionality needs to be trained.
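
For reference, the head replacement in the finetuning example looks roughly like this (a sketch in the spirit of 02_finetune_new_observation_action.py; exact argument names, e.g. pred_horizon vs. action_horizon, may differ between versions):

# Sketch: swap the pre-trained (8-dim) action head for a fresh 14-dim L1
# head, roughly as done in examples/02_finetune_new_observation_action.py.
# Exact kwargs may differ between octo versions.
from octo.model.octo_model import OctoModel
from octo.model.components.action_heads import L1ActionHead
from octo.utils.spec import ModuleSpec

pretrained_model = OctoModel.load_pretrained("hf://rail-berkeley/octo-small")
config = pretrained_model.config

config["model"]["heads"]["action"] = ModuleSpec.create(
    L1ActionHead,
    action_dim=14,                 # bimanual ALOHA: 2 arms x 7 DoF each
    pred_horizon=50,               # action chunk length used for ALOHA
    readout_key="readout_action",
)
# Fine-tuning then re-initializes this head and trains it on 14-dim ALOHA data.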
