Hello, thanks for your great work. I have some questions from reading your paper and implementing the work. In the paper, you state that "you trained your model on a dataset of approximately 750,000 videos sampled from AudioSet." As we know, AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos.
So is the AudioSet you used for training made up of those human-labeled 10-second clips drawn from YouTube videos?
Can we download the full videos according to the YouTube IDs provided in AudioSet and split them into video clips for training? We have actually trained several models with this dataset, but some of them always predict the label "1" (aligned), and others always predict the label "0" (not aligned).
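For reference, here is a minimal sketch of the download-and-split step described above. It assumes the standard AudioSet segment CSV format (`# YTID, start_seconds, end_seconds, positive_labels`, with leading comment lines) and that `yt-dlp` and `ffmpeg` are available on the system; the helper name `audioset_clip_commands` is hypothetical, not from the paper's code:

```python
import csv
import shlex

def audioset_clip_commands(csv_path):
    """Build shell commands that download each AudioSet video and trim
    it to the labeled 10-second segment.

    Assumes the AudioSet segment CSV layout:
        # YTID, start_seconds, end_seconds, positive_labels
    where the first few lines are '#' comments.
    """
    cmds = []
    with open(csv_path) as f:
        for row in csv.reader(f, skipinitialspace=True):
            # Skip empty lines and the '#'-prefixed header comments.
            if not row or row[0].startswith("#"):
                continue
            ytid, start, end = row[0], float(row[1]), float(row[2])
            url = f"https://www.youtube.com/watch?v={ytid}"
            out = f"{ytid}_{int(start)}.mp4"
            # yt-dlp fetches the full video; ffmpeg then cuts out the
            # labeled window without re-encoding (-c copy).
            cmds.append(
                f"yt-dlp -f mp4 -o full_{ytid}.mp4 {shlex.quote(url)} && "
                f"ffmpeg -ss {start} -to {end} -i full_{ytid}.mp4 -c copy {out}"
            )
    return cmds
```

This only constructs the commands; running them is left to a shell or `subprocess` loop, since many AudioSet videos have since been removed from YouTube and the downloads will partially fail.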