Is a Good Description Worth a Thousand Pictures? Reducing Multimodal Alignment to Text-Based, Unimodal Alignment
In this paper, we investigated whether the multimodal alignment problem can be effectively reduced to the unimodal alignment problem, wherein a language model makes a moral judgment based purely on a textual description of an image. Focusing on GPT-4 and LLaVA as two prominent examples of multimodal systems, we demonstrated, rather surprisingly, that this reduction can be achieved with a relatively small loss in moral judgment performance in the case of LLaVA, and virtually no loss in the case of GPT-4.
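The sketch below illustrates this reduction in spirit; it is a minimal, hypothetical example rather than the repository's actual pipeline. It assumes the OpenAI Python SDK and an API key in the environment, and the model names, prompts, and helper functions are placeholders for illustration: a vision-language model first produces a textual description of an image, and a text-only model then makes a moral judgment from that description alone.

```python
# Minimal sketch of the text-based reduction: describe the image, then judge the description.
# Assumes the OpenAI Python SDK (>= 1.0) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def describe_image(image_url: str) -> str:
    """Ask a multimodal model for a plain-text description of the image."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content

def judge_from_description(description: str) -> str:
    """Ask a text-only model for a moral judgment based solely on the description."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{
            "role": "user",
            "content": f"Scenario: {description}\n"
                       "Is the behaviour depicted in this scenario morally acceptable? Answer yes or no.",
        }],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    description = describe_image("https://example.com/scene.jpg")  # placeholder image URL
    print(judge_from_description(description))
```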
This setup is specifically for the Mila cluster. Revision will be needed for other infrastructures.
- Ensure you have access to the cluster.
- Ensure you have the necessary permissions to run interactive jobs.
Get an interactive job:
- Follow the cluster's documentation to start an interactive job (an example command is shown below).
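On a SLURM-based cluster such as Mila, an interactive session can typically be requested with salloc; the resource values below are placeholders to adjust for your needs and the cluster's policies:
salloc --gres=gpu:1 -c 4 --mem=16G -t 2:00:00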
Clone the repository:
cd ~
git clone https://github.com/amimem/alignment.git
Set permissions and run the setup script:
cd alignment
chmod +x scripts/*.sh
source scripts/setup.sh
For future interactive jobs, simply run:
source scripts/int.sh
To submit a batch job, run:
source scripts/slurm.sh
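Assuming the script submits the job through SLURM's sbatch, you can then monitor it with standard SLURM tooling, for example:
squeue -u $USER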