D3Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Robotic Manipulation
Yixuan Wang1*, Zhuoran Li2, 3*, Mingtong Zhang1, Katherine Driggs-Campbell1, Jiajun Wu2, Li Fei-Fei2, Yunzhu Li1, 2
1University of Illinois Urbana-Champaign,
2Stanford University,
3National University of Singapore
*(Teaser video: `teaser_capcut.mp4`)*
In this notebook, we show how to build D3Fields and visualize the reconstructed mesh, mask fields, and descriptor fields. We also demonstrate how to track keypoints across a video.
We recommend Mambaforge instead of the standard Anaconda distribution for faster installation:

```bash
# create conda environment
mamba env create -f env.yaml
conda activate d3fields

# download pretrained models and data
bash scripts/download_ckpts.sh
bash scripts/download_data.sh

python vis_repr.py      # visualize the representation
python vis_tracking.py  # visualize the tracking
```
`Fusion` is the core class of D3Fields. It contains the following key functions:
- `update`: takes in the observation and updates the internal states.
- `text_queries_for_inst_mask`: queries the instance mask according to the text query and thresholds.
- `text_queries_for_inst_mask_no_track`: similar to `text_queries_for_inst_mask`, but does not invoke the underlying XMem tracking module.
- `eval`: evaluates the associated features for arbitrary 3D points.
- `batch_eval`: for a large batch of points, evaluates them batch by batch to avoid out-of-memory errors.

The important attributes of `Fusion` are:
- `curr_obs_torch`: a dictionary containing the following keys:
  - `color`: multi-view color images as np.uint8 BGR numpy arrays
  - `color_tensor`: multi-view color images as float32 BGR torch tensors
  - `depth`: multi-view depth images as float32 torch tensors, in meters
  - `mask`: multi-view instance mask images as uint8 torch tensors of shape (V, H, W, num_inst)
  - `consensus_mask_label`: mask labels aggregated from all views, as a list of strings
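Putting these pieces together, a typical call pattern might look like the sketch below. The constructor arguments, the observation dictionary keys, and the `return_names` parameter are assumptions for illustration only; see `vis_repr.py` and `vis_tracking.py` for the actual interface.

```python
# Illustrative sketch of the Fusion API described above (not the exact interface).
import numpy as np
import torch
from fusion import Fusion  # module path is an assumption

num_cam = 4
fusion = Fusion(num_cam=num_cam, device='cuda')  # constructor arguments are assumptions

# Multi-view observation: BGR color (uint8) and metric depth (float32), plus camera poses.
obs = {
    'color': np.zeros((num_cam, 480, 640, 3), dtype=np.uint8),        # key names are assumptions
    'depth': np.zeros((num_cam, 480, 640), dtype=np.float32),
    'pose': np.tile(np.eye(4, dtype=np.float32)[:3], (num_cam, 1, 1)),
    'K': np.tile(np.eye(3, dtype=np.float32), (num_cam, 1, 1)),
}
fusion.update(obs)  # update internal states from the observation

# Query instance masks for text prompts (threshold value is illustrative).
fusion.text_queries_for_inst_mask(['mug'], [0.3])

# Evaluate descriptor features at arbitrary 3D points; batch_eval avoids OOM for large point sets.
pts = torch.rand(4096, 3, device='cuda')
out = fusion.batch_eval(pts, return_names=['dino_feats', 'mask'])  # return_names is an assumption

print(fusion.curr_obs_torch['consensus_mask_label'])  # mask labels aggregated from all views
```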
To run D3Fields on your own dataset, follow these steps:
- Prepare the dataset in the following structure:

```
dataset_name
├── camera_0
│   ├── color
│   │   ├── 0.png
│   │   ├── 1.png
│   │   ├── ...
│   ├── depth
│   │   ├── 0.png
│   │   ├── 1.png
│   │   ├── ...
│   ├── camera_extrinsics.npy
│   ├── camera_params.npy
├── camera_1
├── ...
```
The camera files are defined as follows (see the sketch after these steps for how they might be written):
  - `camera_extrinsics.npy`: (4, 4) numpy array, the camera extrinsics, which transform a point from world coordinates to camera coordinates.
  - `camera_params.npy`: (4,) numpy array, the camera intrinsic parameters in the order fx, fy, cx, cy.
- Prepare the PCA pickle file for the query texts. Find four images of the query texts (e.g. mug) with a clean background and a centered object. Change `obj_type` within `scripts/prepare_pca.py` and run it.
- Specify the workspace boundary as x_lower, x_upper, y_lower, y_upper, z_lower, z_upper.
- Run `python vis_repr_custom.py`, for example:

```bash
python vis_repr_custom.py --data_path data/2023-09-15-13-21-56-171587 --pca_path pca_model/mug.pkl --query_texts mug --query_thresholds 0.3 --x_lower -0.4 --x_upper 0.4 --y_upper 0.3 --y_lower -0.4 --z_upper 0.02 --z_lower -0.2
```
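For reference, here is a minimal sketch of writing the two camera files for one view. The pose and intrinsic values are placeholders; only the array shapes and the (fx, fy, cx, cy) ordering follow the definitions above.

```python
# Sketch: write camera_extrinsics.npy and camera_params.npy for one camera.
# Numeric values are placeholders; only shapes and ordering follow the README.
import numpy as np

# 4x4 world-to-camera transform (identity used here as a placeholder).
extrinsics = np.eye(4, dtype=np.float32)
np.save('dataset_name/camera_0/camera_extrinsics.npy', extrinsics)

# Intrinsics in the order fx, fy, cx, cy (placeholder values for a 640x480 image).
fx, fy, cx, cy = 608.0, 608.0, 320.0, 240.0
np.save('dataset_name/camera_0/camera_params.npy',
        np.array([fx, fy, cx, cy], dtype=np.float32))

# The corresponding 3x3 intrinsic matrix, if you need to project points yourself:
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]], dtype=np.float32)
```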
Tips for debugging:
- Make sure the transformation is correct by visualizing `pcd` within `vis_repr_custom.py` using Open3D (see the snippet below).
- If the GPU runs out of memory, run `vis_repr_custom.py` with a smaller `step`. This generates a sparser voxel grid.
- Make sure Grounded SAM outputs reasonable results by checking `curr_obs_torch['mask']` and `curr_obs_torch['consensus_mask_label']` of the `Fusion` class.
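For the first tip, a minimal Open3D snippet for inspecting a point cloud is sketched below. The variable name `pcd` and where the points come from inside `vis_repr_custom.py` are assumptions; drop the snippet wherever the fused point cloud is actually constructed.

```python
# Sketch: visualize a point cloud with Open3D to sanity-check camera transforms.
# Assumes `pts` is an (N, 3) numpy array in world coordinates (placeholder here).
import numpy as np
import open3d as o3d

pts = np.random.rand(1000, 3)  # placeholder; replace with the actual points

pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(pts)

# A world-frame axis helps verify that the extrinsics are applied correctly.
frame = o3d.geometry.TriangleMesh.create_coordinate_frame(size=0.1)
o3d.visualization.draw_geometries([pcd, frame])
```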
If you find this repo useful for your research, please consider citing the paper:
```bibtex
@article{wang2023d3fields,
  title={D$^3$Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Robotic Manipulation},
  author={Wang, Yixuan and Li, Zhuoran and Zhang, Mingtong and Driggs-Campbell, Katherine and Wu, Jiajun and Fei-Fei, Li and Li, Yunzhu},
  journal={arXiv preprint arXiv:2309.16118},
  year={2023}
}
```