The increasing prevalence of multimodal data in our society has created a growing need for machines to make sense of such data holistically. However, data scientists and machine learning engineers who aspire to work on such data face challenges piecing together knowledge from existing tutorials, which often deal with each modality separately. Drawing on our experience classifying multimodal municipal issue feedback in the Singapore government, we conduct a hands-on tutorial to help flatten the learning curve for practitioners who want to apply machine learning to multimodal data.
Unfortunately, we are not able to conduct the tutorial using the municipal issue feedback data due to its sensitivity. Instead, we use a subset of the WebVision dataset, which consists of labelled images, together with their text descriptions, crawled from the web. We chose this dataset because its characteristics are similar to those of our municipal issue feedback data: the text descriptions correlate highly with the labels, but the associated images provide even better context.
In this tutorial, we teach participants how to classify multimodal data consisting of both text and images using Transformers. It is targeted at an audience who have some familiarity with neural networks and are comfortable with writing code.
The outline of the tutorial is as follows:
- Sharing of Experience: Municipal issue feedback classification in the Singapore government
- Text Classification: Train a text classification model using BERT
- Text and Image Classification (v1): Train a dual-encoder text and image classification model using BERT and ResNet-50
- Text and Image Classification (v2): Train a joint-encoder text and image classification model using ALign BEfore Fuse (ALBEF)
- Question and Answer/Discussion
The tutorial will be conducted using Google Colab. We will be using the file `multimodal_training.ipynb` for the session. To run the notebook on Colab:
- Go to the GitHub option and search for `dsaidgovsg/multimodal-learning-hands-on-tutorial`
- Select the `main` branch
- Open `multimodal_training.ipynb`
- Follow the instructions in the cells
The content in the notebook is meant to be a step-by-step guide that shows the differences between the model architectures, so the code can be quite repetitive.
We have streamlined the code into a Python script which you can run from the terminal to train the models or run predictions from pretrained models.
Steps to run the scripts are as follows:
- If you have not already done so, clone this repo to your working directory: `git clone https://github.com/dsaidgovsg/multimodal-learning-hands-on-tutorial.git`
- Inside your working directory, run `bash prepare_folders_and_download_files.sh`. The script will create the folder structure and download the files used during the tutorial into these folders.
- Install the required libraries via `pip install -r requirements.txt`
- To run predictions on the test set using the downloaded pretrained models (trained for 20 iterations), run `python3 multimodal_testing.py`
- To do your own training and prediction, run `python3 multimodal_training.py`. Edit the `args` dictionary in the `main` function if you want to change the training parameters (an illustrative sketch is shown below).
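For reference, `args` is a plain Python dictionary of training parameters. The sketch below is purely illustrative; the key names used here (`model_type`, `batch_size`, `learning_rate`, and so on) are hypothetical and may not match the actual keys in `multimodal_training.py`, so check the `main` function for the real ones.

```python
# Purely illustrative: hypothetical training parameters for multimodal_training.py.
# The actual keys are defined in the `args` dictionary inside the script's `main` function.
args = {
    "model_type": "dual_encoder",  # hypothetical: e.g. "text_only", "dual_encoder", or "albef"
    "batch_size": 32,
    "learning_rate": 2e-5,
    "num_iterations": 20,          # the provided pretrained models were trained for 20 iterations
    "output_dir": "models/",       # hypothetical: where checkpoints are written
}
```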
Disclaimer
The following source files in this repo were copied from ALBEF's GitHub repo (click on filename to go to the original file location in ALBEF's GitHub repo):
We copied the files so that our code to train the ALBEF models can be run without having to download and copy source files from another site. We also made minor modifications so that the files are compatible with the latest version of Hugging Face Transformers. The rights to and ownership of the code belong to Salesforce and ALBEF's author, Junnan Li.
We will be using three different model architectures in the tutorial. Their architecture diagrams are shown below.
A text-encoder model which uses only the text to predict the label.
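As a rough idea of what this looks like in code, the minimal sketch below fine-tunes a BERT encoder with a classification head using Hugging Face Transformers. It is a simplified stand-in for the notebook's implementation; the number of classes and the example text are placeholders.

```python
# Minimal sketch of a text-only classifier: BERT encoder + linear classification head.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

NUM_CLASSES = 12  # placeholder: set this to the number of labels in your dataset

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=NUM_CLASSES)

# Tokenise a (placeholder) text description and predict its label.
inputs = tokenizer(
    "A brown tabby cat sitting on a window sill",
    return_tensors="pt", truncation=True, padding=True,
)
with torch.no_grad():
    logits = model(**inputs).logits
predicted_label = logits.argmax(dim=-1).item()
```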
A dual encoder which comprises a separate text encoder (BERT) and an image encoder (ResNet-50).
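To make the dual-encoder idea concrete, here is a minimal sketch (not the tutorial's exact code): the pooled BERT and ResNet-50 embeddings are concatenated and passed through a linear classification head. The pretrained weights and the simple concatenation-based fusion are assumptions made for illustration.

```python
# Minimal sketch of a dual-encoder classifier: BERT for text, ResNet-50 for images,
# with the two pooled embeddings concatenated and fed to a linear classification head.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights
from transformers import BertModel

class DualEncoderClassifier(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        resnet = resnet50(weights=ResNet50_Weights.DEFAULT)
        self.image_encoder = nn.Sequential(*list(resnet.children())[:-1])  # drop the final fc layer
        self.classifier = nn.Linear(768 + 2048, num_classes)  # BERT dim + ResNet-50 dim

    def forward(self, input_ids, attention_mask, pixel_values):
        text_emb = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).pooler_output                                          # (batch, 768)
        image_emb = self.image_encoder(pixel_values).flatten(1)  # (batch, 2048)
        return self.classifier(torch.cat([text_emb, image_emb], dim=-1))
```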
A joint text-image encoder which aligns the BERT text encoder's embeddings with those of the image encoder (a Vision Transformer) before fusing them.
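ALBEF fuses the two modalities through cross-attention layers inside the text encoder, which is why the modified source files mentioned in the disclaimer are needed. The sketch below is a deliberately simplified stand-in rather than ALBEF itself: a single cross-attention layer lets the BERT token embeddings attend to ViT patch embeddings before classification. The checkpoint names and the single fusion layer are assumptions made for illustration.

```python
# Highly simplified sketch of joint text-image fusion (not ALBEF itself): text tokens
# attend to image patch embeddings through one cross-attention layer before pooling.
import torch
import torch.nn as nn
from transformers import BertModel, ViTModel

class SimpleFusionClassifier(nn.Module):
    def __init__(self, num_classes: int, hidden_dim: int = 768):
        super().__init__()
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        self.image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=12, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, input_ids, attention_mask, pixel_values):
        text_tokens = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                                                # (batch, seq_len, 768)
        image_tokens = self.image_encoder(pixel_values).last_hidden_state  # (batch, patches+1, 768)
        fused, _ = self.cross_attn(query=text_tokens, key=image_tokens, value=image_tokens)
        return self.classifier(fused[:, 0, :])  # classify on the [CLS] position of the text
```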
The slides for the KDD'22 hands-on tutorial session are here.