Official repository for "Unveiling and Mitigating Bias in Audio Visual Segmentation" in ACM MM 2024.
Paper Title: "Unveiling and Mitigating Bias in Audio Visual Segmentation"
Authors: Peiwen Sun, Honggang Zhang and Di Hu
Accepted by: The 32nd ACM International Conference on Multimedia (ACM MM 2024)
🚀 Project page: Project Page
📄 Paper: Paper
🔍 Supplementary material: Supplementary
Community researchers have developed a range of advanced audio-visual segmentation models aimed at improving the quality of sounding-object masks. While the masks created by these models may initially appear plausible, they occasionally exhibit anomalies with incorrect grounding logic. We attribute this to real-world inherent preferences and distributions, which serve as a simpler learning signal than complex audio-visual grounding and thereby lead the model to disregard important modality information. These anomalous phenomena are often complex and cannot be observed directly in a systematic way. In this study, we make a pioneering effort with proper synthetic data to categorize and analyze the phenomena as two types, "audio priming bias" and "visual prior", according to the source of the anomaly. For audio priming bias, to enhance sensitivity to audio of different intensities and semantics, a dedicated audio perception module perceives the latent semantic information and incorporates it into a limited set of queries, namely active queries. Moreover, the interaction mechanism of these active queries in the transformer decoder is customized to regulate interaction among audio semantics. For visual prior, multiple contrastive training strategies are explored to optimize the model by incorporating a biased branch, without changing the structure of the model. Our experiments demonstrate the presence and impact of the biases in existing models. Finally, through experimental evaluation on the AVS benchmarks, we demonstrate the effectiveness of our methods in handling both types of biases, achieving competitive performance across all three subsets.
The overall training pipeline follows the list below.
- Data preparation
  - Prepare the training data
  - Download the pretrained ckpt
  - (Optional) Download the well-trained ckpt for finetuning
  - Audio pretransform
  - (Optional) When training on the AVSS dataset, we gradually add `v1s`, `v2`, and `v1m` into the data pool. This curriculum training strategy brings minor performance benefits.
- Training the model with the debias strategy.
- Evaluating on the AVS Benchmark.
Please refer to the AVSBenchmark link to download the datasets. You can put the data under the `data` folder or use your own folder name. Remember to modify the paths in the config files. The `data` directory is organized as below:
|--data
   |--v2
   |--v1m
   |--v1s
   |--metadata.csv
Note: v1s is also known as S4, and v1m is also known as MS3. The AVSBench benchmark is strictly followed.
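As a quick sanity check (not part of the official codebase), a minimal sketch like the following can verify that the expected layout is in place; the `data` root path is an assumption and should match your config:

```python
import os

# Hypothetical sanity check for the AVSBench layout described above.
DATA_ROOT = "data"  # adjust to the path used in your config files
expected = ["v1s", "v1m", "v2", "metadata.csv"]

for name in expected:
    path = os.path.join(DATA_ROOT, name)
    status = "ok" if os.path.exists(path) else "MISSING"
    print(f"{path}: {status}")
```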
We use the Mask2Former model with a Swin-B backbone pre-trained on ADE20k, which can be downloaded from this link in the original repo. Don't forget to modify the placeholder in the Python files to your own path.
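For reference, a minimal sketch of loading the downloaded checkpoint from a local path (the file name and variable below are illustrative assumptions, not the repo's actual identifiers):

```python
import torch

# Hypothetical local path to the downloaded Mask2Former (Swin-B, ADE20k) weights.
PRETRAINED_CKPT = "/path/to/mask2former_swin_b_ade20k.pth"

# Load on CPU first; the training script moves parameters to the GPU later.
state_dict = torch.load(PRETRAINED_CKPT, map_location="cpu")
print(f"Loaded a checkpoint with {len(state_dict)} top-level entries.")
```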
Our well-trained model can be downloaded from this link. Don't forget to modify the placeholder in the Python files to your own path.
Before everything, if you want to set all queries as active queries, simply use `torch.ones` here, and then skip the audio pretransform below.
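A minimal sketch of that shortcut, assuming a boolean per-query mask (shapes and variable names are placeholders, not the repo's actual identifiers):

```python
import torch

# Hypothetical shapes: one entry per query for each sample in the batch.
batch_size, num_queries = 4, 100

# Marking every query as active bypasses the semantic-aware selection,
# so no clustering/classification pretransform is needed.
active_query_mask = torch.ones(batch_size, num_queries, dtype=torch.bool)
```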
Audio clustering follows the HuBERT clustering pipeline on GitHub.
Audio classification follows the BEATs pipeline on GitHub.
Save the cluster or class information as `.pth` files.
As an example:
|--preprocess/classification_avs/threshold_0.4
   |--V9JdDs7RK3c_1.pth   # torch.Tensor(7, 21)
   |--...
|--preprocess/classification_avs/threshold_0.4
   |--A7N2Japi3-A_5.pth   # torch.Tensor(7, 45, 51)
   |--...
We encourage researchers to extract the clustering and classification information themselves.
Note: Replace all placeholders with your own paths.
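For reference, a minimal sketch of writing and reading one per-video tensor in the format listed above (the shape and file name come from the example; the helper function and everything else are assumptions):

```python
import os
import torch

# Hypothetical per-video classification result: 7 audio frames x 21 classes.
out_dir = "preprocess/classification_avs/threshold_0.4"
os.makedirs(out_dir, exist_ok=True)

video_id = "V9JdDs7RK3c_1"
class_probs = torch.rand(7, 21)  # stand-in for real classification outputs
torch.save(class_probs, os.path.join(out_dir, f"{video_id}.pth"))

def load_pretransform(video_id, pth_dir=out_dir):
    """Load the pre-extracted cluster/class tensor for one video id."""
    return torch.load(os.path.join(pth_dir, f"{video_id}.pth"))

print(load_pretransform("V9JdDs7RK3c_1").shape)  # torch.Size([7, 21])
```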
For the S4 and MS3 subtasks, you can simply modify the config in the Python files and replace the `.pth` path of the clustering or classification pre-transform:
cd AVS
sh run_avs_m2f.sh       # for training
sh run_avs_m2f_test.sh  # for testing
For the AVSS subtask, the procedure is basically the same:
cd AVSS
sh run.sh        # for training
# sh run_test.sh # for testing; you can simply comment out the training part of the code
Note: Before applying the debias strategy, the vanilla model needs to reach adequate performance first; do not start the debias training from pure scratch.
Normally, as in prior works, testing can be done during training. However, you can also make small changes to the training code: for example, comment out the training part, and the remaining code performs testing only.
We also provide pre-trained models for all three subtasks. You can download them from the following links.
- The contrastive debias strategy requires a SMALLER learning rate than the original one to reach adequate performance (see the sketch after this list).
- Since the active queries require clustering and classification that depend on the dataset distribution, we tested the performance on unseen data (AVS-V3 in GAVS), which proved to be limited.
- The debias strategy costs nearly 2x FLOPs and 1.5x training time. However, mitigating the bias it deals with is still worth the cost.
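As an illustration of the learning-rate note above (the optimizer choice, values, and model stand-in are assumptions, not the repo's settings):

```python
import torch

# Stand-in module; in practice this is the adequately trained vanilla AVS model.
model = torch.nn.Linear(256, 256)

base_lr = 1e-4              # assumed learning rate of the vanilla training stage
debias_lr = base_lr * 0.1   # the contrastive debias stage uses a SMALLER learning rate

optimizer = torch.optim.AdamW(model.parameters(), lr=debias_lr)
```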
If you find this work useful, please consider citing it.
@article{sun2024unveiling,
title={Unveiling and Mitigating Bias in Audio Visual Segmentation},
author={Sun, Peiwen and Zhang, Honggang and Hu, Di},
journal={arXiv preprint arXiv:2407.16638},
year={2024}
}
Frequently asked questions received via e-mail are answered here.
- Why assign $C$ classes and $1$ cluster in Semantic-aware Active Queries?
  - The original audio tends to be multi-source in MS3 and AVSS.
  - The original HuBERT clustering implementation assigns a single cluster to each time slot. This raises the question of how to ensemble the clusters, and approaches such as voting and temporal pooling have been explored (see the sketch below). However, in our experiments, assigning multiple clusters per audio behaved strangely and was unstable. Personally, I believe that if the underlying clustering were more robust, the performance would be more stable.
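A minimal sketch of the two ensembling options mentioned above, assuming per-time-slot HuBERT cluster IDs are already available (shapes and names are illustrative):

```python
import torch

# Hypothetical per-time-slot cluster assignments for one audio clip:
# 7 time slots, each assigned one of K clusters by the HuBERT pipeline.
K = 50
cluster_ids = torch.randint(0, K, (7,))

# Option 1: majority voting -> a single cluster id for the whole clip.
voted_cluster = torch.bincount(cluster_ids, minlength=K).argmax()

# Option 2: temporal pooling -> a normalized histogram over clusters.
pooled = torch.bincount(cluster_ids, minlength=K).float()
pooled = pooled / pooled.sum()

print(voted_cluster.item(), pooled.shape)  # e.g. 17 torch.Size([50])
```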
- Where is the code for the classification and clustering?
  - Since this framework is rather simple and straightforward, my original implementation was written in Jupyter notebooks. Because the memory management and execution order of those notebooks are relatively chaotic, and the pipeline is relatively easy to implement, we encourage the community to implement it independently; the subsequent loading only needs to maintain a consistent format (see the sketch below).
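For those reimplementing it, a minimal clustering sketch under assumed tool choices (HuggingFace `transformers` HuBERT features plus scikit-learn k-means; this is not the paper's exact pipeline, which follows the HuBERT clustering repo):

```python
import torch
import torchaudio
from sklearn.cluster import KMeans
from transformers import Wav2Vec2FeatureExtractor, HubertModel

# Assumed checkpoint; any HuBERT variant with frame-level features works similarly.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

waveform, sr = torchaudio.load("example.wav")  # hypothetical input clip
waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(0)

with torch.no_grad():
    inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
    feats = model(**inputs).last_hidden_state.squeeze(0)  # (frames, 768)

# In practice, fit k-means on features from the whole training set, not one clip.
ids = KMeans(n_clusters=50, n_init=10).fit_predict(feats.numpy())
torch.save(torch.from_numpy(ids), "example_cluster_ids.pth")
```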
Apologies for some hard-coded parts in this repo. I will keep updating it if necessary.
- Part of the code is adapted from transformers
- Part of the code is adapted from GAVS by Yaoting Wang