We used LanceDB to store a frame captured every thirty seconds, plus the title, of 13,000+ videos: 5 random videos from each top-level category of the YouTube 8M dataset. We then used the CLIP model to embed the frames and titles into a shared space. With LanceDB, we can perform embedding, keyword, and SQL search on these videos.
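To give a feel for the three query styles, here is a minimal sketch of running them against the table once it is downloaded (this is not the exact code in `main.py`; the `openai/clip-vit-base-patch32` checkpoint and the `vector` and `video_id` column names are assumptions, while the table name and the `text` column come from the dataset itself):

```python
import lancedb
import torch
from transformers import CLIPModel, CLIPTokenizerFast

MODEL_ID = "openai/clip-vit-base-patch32"  # assumption: CLIP variant used

tokenizer = CLIPTokenizerFast.from_pretrained(MODEL_ID)
model = CLIPModel.from_pretrained(MODEL_ID)

def embed_text(query: str) -> list[float]:
    """Embed a text query into the same space as the stored frames."""
    inputs = tokenizer(query, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats.squeeze(0).tolist()

db = lancedb.connect("data/video-lancedb")
tbl = db.open_table("multimodal_video")

# 1. Embedding search: frames semantically close to a natural-language query.
frames = tbl.search(embed_text("a dog catching a frisbee")).limit(5).to_pandas()

# 2. Keyword search against the tantivy full-text index on the "text" column.
titles = tbl.search("frisbee", query_type="fts").limit(5).to_pandas()

# 3. Embedding search combined with a SQL filter.
filtered = (
    tbl.search(embed_text("cooking tutorial"))
    .where("video_id IS NOT NULL")  # hypothetical column; match the real schema
    .limit(5)
    .to_pandas()
)
```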
Download and extract the prepared dataset:

```bash
wget https://vectordb-recipes.s3.us-west-2.amazonaws.com/multimodal_video_lance.tar.gz
tar -xvf multimodal_video_lance.tar.gz
mkdir -p data/video-lancedb
mv multimodal_video.lance data/video-lancedb/
```
Then run the script:

```bash
python main.py
```
Here is how the `multimodal_video` dataset (the raw data) was generated:
- `downloadcategoryids.sh` - Uses the YouTube8M dataset to retrieve 5 video ids from each category
- `downloadvideos.py` - Uses youtube-dl to download the videos and take a screenshot every 30 seconds
- `insert.py` - Uses the CLIP embedding model to embed each screenshot and insert it into LanceDB (see the sketch after this list)
- `insert_titles.py` - Retrieves the video titles and embeds them into LanceDB as well; we also create a full-text search index over the titles using tantivy with `tbl.create_fts_index("text")`
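For reference, this is roughly what the insert step does, condensed into one sketch (the `openai/clip-vit-base-patch32` checkpoint, the `screenshots/` directory, and the exact row schema are assumptions; see `insert.py` and `insert_titles.py` for the real code):

```python
from pathlib import Path

import lancedb
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"  # assumption: CLIP variant used

model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)

def embed_image(path: Path) -> list[float]:
    """Embed one screenshot with CLIP's image tower."""
    inputs = processor(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats.squeeze(0).tolist()

db = lancedb.connect("data/video-lancedb")

# One row per screenshot; "text" is searchable via the full-text index.
rows = [
    {
        "vector": embed_image(p),
        "text": p.stem,  # stand-in for the video title
        "frame_path": str(p),
    }
    for p in Path("screenshots").glob("*.jpg")  # hypothetical directory layout
]
tbl = db.create_table("multimodal_video", data=rows, mode="overwrite")

# Build the tantivy-backed full-text index, as insert_titles.py does.
tbl.create_fts_index("text")
```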
The dataset is available in our S3 bucket: https://vectordb-recipes.s3.us-west-2.amazonaws.com/multimodal_video_lance.tar.gz