I'm glad to see this great work! When you mention handling multiple sound sources during raw data processing, I’d like to ask: how do you detect multiple sound sources, or identify individual sound events? Could you share any methods for data preparation?
Our dataset is enhanced by GPT and builds upon existing audio-caption datasets. Consequently, the problem you mention becomes much simpler when starting from the original captions.
For strongly labeled data such as AudioSet_SL and AudioCaps, the sound sources can be identified effectively by feeding the original descriptions into GPT.
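As a rough illustration of this step, here is a minimal sketch of asking a GPT model to split a caption into individual sound sources. It assumes an OpenAI-style chat API; the prompt wording, model name, and the `extract_sources` helper are illustrative assumptions, not our actual pipeline or prompt.

```python
# Sketch: extract individual sound sources from an existing caption via GPT.
# Prompt wording and model choice are illustrative, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "List each distinct sound source mentioned in the following audio "
    "caption, one per line, with no extra text.\n\nCaption: {caption}"
)

def extract_sources(caption: str) -> list[str]:
    """Ask the model to split a caption into individual sound sources."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat-capable model would do here
        messages=[{"role": "user", "content": PROMPT.format(caption=caption)}],
        temperature=0,
    )
    text = response.choices[0].message.content
    # One source per line; strip list markers and blank lines.
    return [line.strip("- ").strip() for line in text.splitlines() if line.strip()]

print(extract_sources("A dog barks while rain falls and a car passes by"))
# e.g. ['dog barking', 'rain falling', 'car passing by']
```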
For weakly labeled data, we recommend consulting WavCaps, whose exemplary efforts have significantly aided our work.
If you mean temporal-level event detection: active-event detection is applied after the single-source pool is constructed, in order to enhance temporal diversity.
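For reference, a simple energy-threshold detector is one common way to locate active segments in a single-source clip. The sketch below is a generic baseline under that assumption (the `thresh_db` value and the `active_segments` helper are illustrative), not necessarily the exact detector used in our pipeline.

```python
# Sketch: energy-based active-segment detection for a single-source clip.
# Generic baseline, not necessarily the detector used in the paper.
import librosa
import numpy as np

def active_segments(path: str, hop: int = 512, thresh_db: float = -35.0):
    """Return (onset, offset) times, in seconds, where the clip is active."""
    y, sr = librosa.load(path, sr=None)
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]
    db = librosa.amplitude_to_db(rms, ref=np.max(rms))
    active = db > thresh_db  # frame-wise activity mask
    # Rising/falling edges of the mask give segment boundaries.
    edges = np.diff(active.astype(int), prepend=0, append=0)
    onsets = np.where(edges == 1)[0]
    offsets = np.where(edges == -1)[0]
    times = librosa.frames_to_time(
        np.stack([onsets, offsets]), sr=sr, hop_length=hop
    )
    return list(zip(times[0], times[1]))

# e.g. active_segments("dog_bark.wav") -> [(0.3, 1.1), (2.4, 2.9)]
```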
Although we cannot guarantee that the sound sources in our dataset are as good as those in the strongly labeled data, the trade-off is acceptable given the larger scale. Detailed descriptions and quality analysis are provided in our paper's appendix.
Thanks for your attention; I hope these answers address your questions.