1. Download Source Videos

We thank the authors of Music-AVQA and AVQA for providing the data. In this step, download the original videos from the links provided by those two projects.

2. Align with the AVinstruct

A sample of AVinstruct is shown below:

    {
        "video_id": "00005580",
        "multi_choice": "",
        "correct_answer": "",
        "video": "/mnt/sda/imagebind_v_feats/00005580.npy", # visual-features
        "audio": "/mnt/sda/imagebind_a_feats/00005580.mp3.npy", # audio-features
        "conversations": [
            {
                "from": "human",
                "value": "Summarize the video to help answer:\n<Q>How many instruments are sounding in the video?<Q>\n<video>"
            },
            {
                "from": "gpt",
                "value": "The video shows three people playing violin and cello in a park. Two instruments are sounding in the video, the violin and the cello."
            }
        ]
    },

Here, video_id is used to look up the file name of the downloaded video, and the video and audio fields point to the saved visual and audio features. We recommend using a frozen encoder to extract the corresponding audiovisual embeddings and saving them as .npy files.
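
The alignment step can be sketched roughly as follows. This is a minimal illustration, not the project's own script: it assumes each downloaded video is named after its video_id, and the helper names (`find_video`, `align_sample`), the extension list, and the `visual/` and `audio/` subfolders are all hypothetical. Feature extraction with the frozen encoder and the actual writing of the .npy files are not shown.

```python
from pathlib import Path

# Assumed container formats for the downloaded videos; adjust as needed.
VIDEO_EXTS = (".mp4", ".mkv", ".webm")


def find_video(video_id, video_dir):
    """Return the downloaded video whose file stem equals video_id, or None."""
    for ext in VIDEO_EXTS:
        candidate = Path(video_dir) / f"{video_id}{ext}"
        if candidate.exists():
            return candidate
    return None


def align_sample(sample, video_dir, feat_dir):
    """Return a copy of an AVinstruct sample whose "video" and "audio"
    fields point at local .npy feature paths, or None if the source
    video has not been downloaded.

    The feature layout (feat_dir/visual, feat_dir/audio) is an assumption
    made for this sketch.
    """
    video_id = sample["video_id"]
    if find_video(video_id, video_dir) is None:
        return None
    feat_dir = Path(feat_dir)
    aligned = dict(sample)  # shallow copy; the input sample is left untouched
    aligned["video"] = str(feat_dir / "visual" / f"{video_id}.npy")
    aligned["audio"] = str(feat_dir / "audio" / f"{video_id}.npy")
    return aligned
```

Samples for which `align_sample` returns None can be skipped or logged, so that the final annotation file only references videos that were actually downloaded.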
