(Project Page) (PDF) (Slides) (Video)
Learning effective representations of visual data that generalize to a variety of downstream tasks has been a long quest for computer vision. Most representation learning approaches rely solely on visual data such as images or videos. In this paper, we explore a novel approach, where we use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations. For this study, we collect a dataset of human interactions capturing body part movements and gaze in their daily lives. Our experiments show that our self-supervised representation that encodes interaction and attention cues outperforms a visual-only state-of-the-art method MoCo, on a variety of target tasks:
- Scene classification (semantic)
- Action recognition (temporal)
- Depth estimation (geometric)
- Dynamics prediction (physics)
- Walkable surface estimation (affordance)
- Clone the repository using the command:
git clone https://github.com/ehsanik/muscleTorch
cd muscleTorch
- Install requirements:
pip3 install -r requirements.txt
- Download the images from here and extract it to HumanDataset/images.
- Download the sensor data from here and extract it to HumanDataset/annotation_h5.
- Download pretrained weights from here for reproducing the numbers in the paper, extract it to HumanDataset/saved_weights.
We introduce a new dataset of human interactions for our representation learning framework. We record egocentric videos from a GoPro camera attached to the subjects' forehead. We simultaneously capture body movements, as well as the gaze. We use Tobii Pro2 eye-tracking to track the center of the gaze in the camera frame. We record the body part movements using BNO055 Inertial Measurement Units (IMUs) in 10 different locations (torso, neck, 2 triceps, 2 forearms, 2 thighs, and 2 legs).
The structure of the dataset is as follows:
HumanDataset
└── images
│ └── <video_stamp>
│ └── images_<video_stamp>_<INDEX>.jpg
└── annotation_h5
│ ├── [test/train]_<feature_name>.h5
│ ├── [test/train]_image_name.json
│ ├── [test/train]_h5pyind_2_frameind.json
│ └── [test/train]_timestamp.json
└── saved_weights
├── trained_representations
| └── <Learned_Representations>.pytar
└── trained_end_tasks
├── Action_Recognition
├── Depth_Estimation
├── Dynamic_Prediction
├── Scene_Classification
└── Walkable_Surface_Estimation
└── <Trained_End_Tasks_Weights>.pytar
To train your own model:
python3 main.py --gpu-ids 0 --arch MoCoGazeIMUModel --input_length 5 --sequence_length 5 --output_length 5 \
--dataset HumanContrastiveCombinedDataset --workers 20 --num_classes -1 --loss MoCoGazeIMULoss \
--num_imus 6 --imu_names neck body llegu rlegu larmu rarmu \
--input_feature_type gaze_points move_label --base-lr 0.0005 --dropout 0.5 --data PATHTODATA/human_data
See scripts/training_representation.sh
for additional training scripts.
To test using the pretrained model and reproduce the results in the paper refer to scripts/end_task_representation.sh
.
If you find this project useful in your research, please consider citing:
@article{ehsani2020learning,
title={Learning Visual Representation from Human Interactions},
author={Ehsani, Kiana and Gordon, Daniel and Nguyen, Thomas and Mottaghi, Roozbeh and Farhadi, Ali},
journal={arXiv},
year={2020}
}