Marius Memmel, Roman Bachmann, Amir Zamir
CVPR 2023
Visualization of an agent using the proposed Visual Odometry Transformer (VOT) as a GPS+Compass substitute. The backbone is a ViT-B with MultiMAE pre-training and depth input.
- Setup
- Training
- Evaluation
- Visualizations
- Visualizations Examples
- Privileged Information
- Privileged Information Examples
- References
This repository provides a Dockerfile that can be used to set up an environment for running the code. It installs the required versions of habitat-lab, habitat-sim, timm, and their dependencies. Note that running the code requires at least one GPU supporting CUDA 11.0.
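As a starting point, a minimal sketch of building and running the container; the image tag, mount path, and working directory are arbitrary choices, not fixed by this repository:

```bash
# Build the image from the provided Dockerfile (image tag is arbitrary).
docker build -t vot .

# Run with GPU access and the repository mounted into the container
# (mount path and working directory are assumptions, not fixed by the Dockerfile).
docker run --gpus all -it \
    -v "$(pwd)":/workspace/vot \
    -w /workspace/vot \
    vot bash
```

`--gpus all` exposes the host GPUs to the container, which the CUDA 11.0 requirement above implies.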
This repository requires two datasets to train and evaluate the VOT models:
- Gibson scene dataset
- PointGoal Navigation splits, specifically `pointnav_gibson_v2.zip`.
Please follow Habitat's instructions to download them. The following data structure is assumed under `./dataset`:
.
+-- dataset
| +-- Gibson
| | +-- gibson
| | | +-- Adrian.glb
| | | +-- Adrian.navmesh
| | | ...
| +-- habitat_datasets
| | +-- pointnav
| | | +-- gibson
| | | | +-- v2
| | | | | +-- train
| | | | | +-- val
| | | | | +-- valmini
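For convenience, a hedged sketch of arranging the downloads into this layout; the `/path/to/...` sources are placeholders for wherever the Gibson scenes and PointNav splits were extracted:

```bash
# Create the expected directory skeleton under ./dataset.
mkdir -p dataset/Gibson dataset/habitat_datasets/pointnav/gibson

# Link (or move) the downloaded assets into place; the /path/to/... sources are
# placeholders for the actual download locations.
ln -s /path/to/gibson dataset/Gibson/gibson
ln -s /path/to/pointnav_gibson_v2/v2 dataset/habitat_datasets/pointnav/gibson/v2
```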
This repository provides a script to generate the proposed training and validation datasets. Run `./generate_data.sh` and specify the following arguments; an example invocation is given below the table. A dataset of 250k samples takes approx. 120 GB of disk space.
Argument | Usage |
---|---|
`--act_type` | Type of actions to be saved; `-1` saves all actions |
`--N_list` | Sizes of the train and validation datasets; the thesis uses `250000` and `25000` |
`--name_list` | Names of the train and validation datasets; defaults are `train` and `val` |
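For reference, a possible invocation mirroring the thesis setup; the exact list syntax expected by the script's argument parser may differ, so treat this as a sketch:

```bash
# Generate a 250k-sample training set and a 25k-sample validation set,
# saving all action types (-1), as in the thesis.
./generate_data.sh \
    --act_type -1 \
    --N_list 250000 25000 \
    --name_list train val
```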
When generating the data, habitat-sim sometimes raises an "isNvidiaGpuReadable(eglDevId) [EGL] EGL device 0, CUDA device 0 is not readable" error. To fix it, follow this issue: overwrite `habitat-sim/src/esp/gfx/WindowlessContext.cpp` with the provided `WindowlessContext.cpp`.
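A sketch of applying the fix, assuming the habitat-sim source checkout sits next to this repository and the provided `WindowlessContext.cpp` lives in the repository root; rebuild habitat-sim afterwards so the change takes effect:

```bash
# Replace the upstream source file with the patched version provided here,
# then rebuild habitat-sim following its headless build instructions.
cp WindowlessContext.cpp habitat-sim/src/esp/gfx/WindowlessContext.cpp
```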
Download the pre-trained MultiMAE checkpoint from this link, rename it to `MultiMAE-B-1600.pth`, and place it in `./pretrained`.
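For example (the source path is a placeholder for the downloaded file):

```bash
# Place the renamed MultiMAE checkpoint where the training configs expect it.
mkdir -p pretrained
mv /path/to/downloaded_multimae_checkpoint.pth pretrained/MultiMAE-B-1600.pth
```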
Download the pre-trained checkpoint of the RL navigation policy from PointNav-VO (this link) and place `rl_tune_vo.pth` under `pretrained_ckpts/rl/no_tune.pth`.
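For example (again, the source path is a placeholder for the downloaded file):

```bash
# Place the RL navigation policy checkpoint under the expected name.
mkdir -p pretrained_ckpts/rl
mv /path/to/rl_tune_vo.pth pretrained_ckpts/rl/no_tune.pth
```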
To train a VOT model, specify the experiment configuration in a yaml file similar to here. Then run
./start_vo.sh --config-yaml PATH/TO/CONFIG/FILE.yaml
To evaluate a trained VOT model, specify the evaluation configuration in a yaml file similar to here. Then run
./start_rl.sh --run-type eval --config-yaml PATH/TO/CONFIG/FILE.yaml
Note that passing `--run-type train` fine-tunes the navigation policy to the VOT model; the thesis does not make use of this functionality.
To visualize agent behavior, the evaluation configuration provides a `VIDEO_OPTION` (see here) that renders videos directly to a logging platform or to disk.
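As a reference, a sketch of the relevant yaml keys; the key names and accepted values follow habitat-lab conventions and should be verified against the linked example configuration:

```bash
# Print the relevant yaml keys for reference only; "disk" writes videos locally,
# "tensorboard" sends them to the logging platform (habitat-lab conventions).
cat <<'EOF'
VIDEO_OPTION: ["disk"]   # or ["tensorboard"]
VIDEO_DIR: "videos"      # output directory when "disk" is selected
EOF
```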
To visualize attention maps conditioned on the action, refer to the visualize_attention_maps notebook that provides functionality to plot all attention heads of a trained VOT.
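A possible way to launch it, assuming a standard Jupyter installation and that the notebook file carries the `.ipynb` extension:

```bash
# Open the notebook that plots the attention heads of a trained VOT
# (the filename is assumed from the notebook's name).
jupyter notebook visualize_attention_maps.ipynb
```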
Attention maps of the last attention layer of the VOT trained on RGBD and pre-trained with MultiMAE. The model focuses on regions present in both time steps t and t+1. The action taken by the agent is `left`.
Impact of the action token on a VOT trained on RGBD and pre-trained with MultiMAE. Ground-truth action: `fwd`. Injected actions: `fwd`, `left`, `right`. Embedding `fwd` causes the attention to focus on the image center, while both `left` and `right` move attention towards regions of the image that would be consistent across time steps t and t+1 in the case of a rotation.
To run modality ablations and privileged-information experiments, define the modality to strip in the evaluation configuration as `VO.REGRESS.visual_strip=["rgb"]` or `VO.REGRESS.visual_strip=["depth"]`. `VO.REGRESS.visual_strip_proba` defines the probability of dropping that modality; setting it to `1.0` drops the modality at every step.
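A sketch of the corresponding yaml excerpt, assuming the dotted keys above map to this nesting; merge the values into the existing `VO.REGRESS` block of the evaluation configuration rather than duplicating it:

```bash
# Print the assumed yaml nesting for the modality-stripping options (reference only).
cat <<'EOF'
VO:
  REGRESS:
    visual_strip: ["rgb"]       # or ["depth"]
    visual_strip_proba: 1.0     # probability of dropping the listed modality per step
EOF
```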
Visualization of an agent using the Visual Odometry Transformer (VOT) as a GPS+Compass substitute. The backbone is a ViT-B with MultiMAE pre-training and RGB-D input. The scene is from the evaluation split of the Gibson4+ dataset. A red image boundary indicates collisions of the agent with its environment. All agents navigate close to the goal even though important modalities are not available.
Training: RGBD, Test: Depth, RGB dropped 50% of the time
Training: RGBD, Test: RGB, Depth dropped 50% of the time
Training: RGBD, Test: RGB, Depth dropped 100% of the time
Training: RGBD, Test: Depth, RGB dropped 100% of the time
This repository is a fork of PointNav-VO by Xiaoming Zhao. Please refer to the code and the thesis for changes made to the original repository.