CookAR: Affordance Augmentations in Wearable AR to Support Kitchen Tool Interactions for People with Low Vision
Jaewook Lee1,
Andrew D. Tjahjadi1,
Jiho Kim1,
Junpu Yu1,
Minji Park2,
Jiawen Zhang1,
Jon E. Froehlich1,
Yapeng Tian3,
Yuhang Zhao4
1University of Washington,
2Sungkyunkwan University,
3University of Texas at Dallas,
4University of Wisconsin-Madison
CookAR is a computer vision-powered prototype AR system that renders real-time object affordance augmentations to support safe and efficient interactions with kitchen tools for people with low vision (LV). In this repo, we present the fine-tuned instance segmentation model used for affordance augmentations, along with the first egocentric dataset of kitchen tool affordances collected and annotated by the research team.
To use CookAR, we recommend using Conda. CookAR also depends on the MMDetection toolbox and PyTorch. If your GPU supports CUDA, please install it first.
```bash
conda create --name=CookAR python=3.8
conda activate CookAR
# Change the index URL below to match your CUDA version.
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -U openmim
mim install mmengine
mim install "mmcv>=2.0.0"
git clone https://github.com/open-mmlab/mmdetection.git
cd mmdetection
pip install -v -e .
```
It is recommended that you install PyTorch first and then MMDetection; otherwise, MMDetection might not be correctly compiled against CUDA.
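After the installation, a quick sanity check like the following (our own sketch, not a repo script) can confirm that PyTorch sees your GPU and that MMCV and MMDetection import cleanly:

```python
# Sanity check after installation: verify versions and CUDA visibility.
import torch
import mmcv
import mmdet

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("MMCV:", mmcv.__version__)
print("MMDetection:", mmdet.__version__)
```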
- Once everything is installed, first create three folders inside the `mmdetection` directory, namely `./data`, `./checkpoints`, and `./work_dir`, either manually or with `mkdir` (see the optional sketch after this list).
- Download the pre-trained config and weights files from MMDetection by running `mim download mmdet --config rtmdet-ins_l_8xb32-300e_coco --dest ./checkpoints`, then run `python test_install.py` to check that things are working correctly. You should see an image with segmentation masks pop up.
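If you prefer to set up the folders from Python, the following sketch (run from inside `mmdetection`; not a repo script) creates the three directories and lists whatever the `mim download` command placed in `./checkpoints`:

```python
# Create the working folders and confirm the downloaded RTMDet-Ins files are present.
from pathlib import Path

for name in ("data", "checkpoints", "work_dir"):
    Path(name).mkdir(exist_ok=True)

downloaded = sorted(Path("checkpoints").glob("rtmdet-ins_l_8xb32-300e_coco*"))
print("Found in ./checkpoints:", [p.name for p in downloaded])  # expect a .py config and a .pth weights file
```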
Along with CookAR, we present the first kitchen tool affordance image dataset, which contains 10,152 images (8,346 for training, 1,193 for validation, and 596 for testing) covering the 18 categories listed below. Raw images were extracted from the EPIC-KITCHENS video dataset. A quick way to inspect the annotations is sketched after the table.
|                |                |
|----------------|----------------|
| Carafe Base    | Carafe Handle  |
| Cup Base       | Cup Handle     |
| Fork Tines     | Fork Handle    |
| Knife Blade    | Knife Handle   |
| Ladle Bowl     | Ladle Handle   |
| Pan Base       | Pan Handle     |
| Scissor Blade  | Scissor Handle |
| Spatula Head   | Spatula Handle |
| Spoon Bowl     | Spoon Handle   |
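To verify the splits and categories after downloading the dataset, a small sketch like this can help (assuming standard COCO-style JSON annotations; the path below is a placeholder, not the actual file name in the release):

```python
# Inspect a COCO-style annotation file: image count and per-category instance counts.
from pycocotools.coco import COCO

coco = COCO("./data/CookAR/annotations/train.json")  # placeholder path
print("images:", len(coco.imgs), "| annotations:", len(coco.anns))
for cat in coco.loadCats(coco.getCatIds()):
    n = len(coco.getAnnIds(catIds=[cat["id"]]))
    print(f'{cat["id"]:2d}  {cat["name"]:<16s} {n} instances')
```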
In this section, we provide a brief guide on how to fine-tune the CookAR models on your customized dataset and how to run them on an image or video of your choice. Specifically, we break this section into five parts:
- Download the checkpoints
- Download and check the dataset
- Download and edit configuration file
- Start training
- Run on image or video
CookAR was initially fine-tuned from RTMDet-Ins-L with frozen backbone stages; the base model can be found at the official repo. You can find a more detailed tutorial on fine-tuning RTMDet-related models here.
- Vanilla CookAR: Use this link to download our fine-tuned weights. You can use them directly for your tasks (jump to step 3) or build upon them with your own data.
- CookAR Dataset: Use this link to download our self-built dataset in COCO-MMDetection format. If you are fine-tuning with your own dataset, make sure it is also in COCO-MMDetection format (a minimal example of the expected annotation structure is sketched after this list); it is recommended to run `coco_classcheck.py` in the fine-tuning folder to check the classes it contains.
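For reference, here is a minimal illustration (our own sketch, with placeholder values) of the COCO-style instance segmentation structure that MMDetection expects your annotation JSON to follow:

```python
# Skeleton of a COCO-style annotation file for instance segmentation (placeholder values).
annotation_file = {
    "images": [
        {"id": 1, "file_name": "frame_000001.jpg", "width": 1920, "height": 1080},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,                                              # index into "categories"
            "segmentation": [[310.5, 220.0, 355.0, 231.5, 340.0, 280.0]],  # polygon(s)
            "bbox": [310.5, 220.0, 44.5, 60.0],                            # [x, y, width, height]
            "area": 1335.0,
            "iscrowd": 0,
        },
    ],
    "categories": [
        {"id": 1, "name": "Carafe Base"},
        {"id": 2, "name": "Carafe Handle"},
        # ... remaining 16 categories
    ],
}
```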
In this repo, we also provide the config file used in our fine-tuning process, which can be found in the configs folder. To use the model on your tasks directly, no modification is required; jump to step 5.
Before starting your own training, check and run `config_setup.py` in the fine-tuning folder to edit the config file. Make sure that the number of classes is updated to reflect your dataset and that all classes are listed in the same order shown by `coco_classcheck.py`.
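For orientation, the fragment below shows, in the standard MMDetection 3.x config style, the kinds of fields that typically have to match your dataset (class list, `num_classes`, annotation paths). It is only an illustrative sketch with placeholder paths; the exact keys edited by `config_setup.py` and the repo's provided config file are authoritative:

```python
# Illustrative MMDetection-style overrides for a custom 18-class dataset (placeholder paths).
_base_ = "./checkpoints/rtmdet-ins_l_8xb32-300e_coco.py"  # base config downloaded earlier

# Classes must appear in the same order reported by coco_classcheck.py.
metainfo = dict(classes=("Carafe Base", "Carafe Handle", "Cup Base", "Cup Handle"))  # ... 18 in total

model = dict(bbox_head=dict(num_classes=18))

train_dataloader = dict(
    dataset=dict(
        data_root="./data/CookAR/",         # placeholder
        ann_file="annotations/train.json",  # placeholder
        data_prefix=dict(img="train/"),     # placeholder
        metainfo=metainfo))
val_dataloader = dict(
    dataset=dict(
        data_root="./data/CookAR/",
        ann_file="annotations/val.json",
        data_prefix=dict(img="val/"),
        metainfo=metainfo))
val_evaluator = dict(ann_file="./data/CookAR/annotations/val.json")
```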
Simply run `python tools/train.py PATH/TO/CONFIG`.
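Equivalently, the same training run can be launched from Python via MMEngine, which is what `tools/train.py` wraps (the config path below is a placeholder):

```python
# Launch training programmatically with MMEngine's Runner.
from mmengine.config import Config
from mmengine.runner import Runner

cfg = Config.fromfile("PATH/TO/CONFIG")
cfg.work_dir = "./work_dir"  # where logs and checkpoints are written
runner = Runner.from_cfg(cfg)
runner.train()
```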
Use the provided scripts `infer_img.py` and `infer_video.py` to run inference on a single image or video.
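If you want a quick test outside those scripts, MMDetection's `DetInferencer` provides a similar single-image workflow (the paths below are placeholders for the config and the downloaded CookAR weights):

```python
# Run single-image inference with MMDetection's high-level inferencer.
from mmdet.apis import DetInferencer

inferencer = DetInferencer(model="PATH/TO/CONFIG", weights="PATH/TO/CookAR_weights.pth")
inferencer("PATH/TO/IMAGE.jpg", out_dir="./work_dir/vis")  # writes visualizations and predictions
```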
We thank Yang Li (University of Washington), Sieun Kim (Seoul National University), and XunMei Liu (University of Washington) for their assistance with this repo.
@inproceedings{10.1145/3654777.3676449,
author = {Lee, Jaewook and Tjahjadi, Andrew D. and Kim, Jiho and Yu, Junpu and Park, Minji and Zhang, Jiawen and Froehlich, Jon E. and Tian, Yapeng and Zhao, Yuhang},
title = {CookAR: Affordance Augmentations in Wearable AR to Support Kitchen Tool Interactions for People with Low Vision},
year = {2024},
isbn = {9798400706288},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3654777.3676449},
doi = {10.1145/3654777.3676449},
abstract = {Cooking is a central activity of daily living, supporting independence as well as mental and physical health. However, prior work has highlighted key barriers for people with low vision (LV) to cook, particularly around safely interacting with tools, such as sharp knives or hot pans. Drawing on recent advancements in computer vision (CV), we present CookAR, a head-mounted AR system with real-time object affordance augmentations to support safe and efficient interactions with kitchen tools. To design and implement CookAR, we collected and annotated the first egocentric dataset of kitchen tool affordances, fine-tuned an affordance segmentation model, and developed an AR system with a stereo camera to generate visual augmentations. To validate CookAR, we conducted a technical evaluation of our fine-tuned model as well as a qualitative lab study with 10 LV participants for suitable augmentation design. Our technical evaluation demonstrates that our model outperforms the baseline on our tool affordance dataset, while our user study indicates a preference for affordance augmentations over the traditional whole object augmentations.},
booktitle = {Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology},
articleno = {141},
numpages = {16},
keywords = {accessibility, affordance segmentation, augmented reality, visual augmentation},
location = {Pittsburgh, PA, USA},
series = {UIST '24}
}