Most of the recent work in sarcasm detection has been carried out on textual data. In this paper, we argue that incorporating multimodal cues can improve the automatic classification of sarcasm.
We propose a new sarcasm dataset, Multimodal Sarcasm Detection Dataset, MUStARD, compiled from popular TV shows. It consists of audiovisual utterances annotated with sarcasm labels. Each utterance is accompanied by its context of historical utterances in the dialogue. Our initial results show that the use of multimodal information can reduce the relative error rate of sarcasm detection by up to 12.9% in F1-score.
We also exemplify various scenarios where incongruity in sarcasm is evident across different modalities, thus stressing the role of multimodal approaches to solve this problem. We also introduce several baselines and show that multimodal models are significantly more effective when compared to their unimodal variants.
Most prior research has been carried out in the textual domain, primarily using Twitter datasets. In our dataset, we capitalize on the conversational format and provide context by including the preceding utterances along with speaker identities.
Research on sarcasm in speech has mainly focused on identifying prosodic cues in the form of acoustic patterns related to sarcastic behavior. Studied features include mean amplitude, amplitude range, speech rate, etc. In general, prosodic features such as intonation and stress are considered important indicators of sarcasm. We take motivation from this previous research and include similar speech parameters as features in our dataset and baseline experiments.
Multimodal sarcasm: prior work couples textual features with cognitive features such as the gaze behavior of readers or EEG/MEG signals. However, little work has explored multimodal avenues for understanding sarcasm as conveyed by the opinion holder. To the best of our knowledge, ours is the first work to propose a resource for video-level sarcasm.
We conduct web searches on YouTube with queries such as "Friends sarcasm", "Chandler sarcasm", etc. We obtain videos from three TV shows: Friends, The Golden Girls, and Sarcasmaholics Anonymous. To obtain non-sarcastic videos, we select a subset of 400 videos from MELD, a multimodal emotion recognition dataset derived from Friends. We also collect videos from The Big Bang Theory. The collected dataset consists of 6.4k videos, the majority of which are unlabelled; two annotators then label the videos for sarcasm.
Of the 5.8k utterances from The Big Bang Theory (TBBT), 98% were initially annotated as non-sarcastic, and the initial inter-annotator agreement was low. We therefore paused the annotation process and reconciled the annotation differences before proceeding: the annotators discussed their disagreements on a subset of 20 videos and re-annotated them, after which the inter-annotator agreement improved.
The resulting set consists of 345 videos labelled as sarcastic and 6k labelled as non-sarcastic. To balance the dataset, we randomly select 345 non-sarcastic videos, yielding a final dataset of 690 videos.
61% of the utterances (i.e., videos) consist of a single sentence. Each utterance and its context are available in three modalities: video, audio, and transcription (text).
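For concreteness, a single dataset entry could be represented as below; this is a minimal sketch with illustrative field names and placeholder strings, not necessarily the released file format.

```python
# A minimal sketch of one MUStARD entry (illustrative field names and
# placeholder strings; not necessarily the released serialization format).
example_entry = {
    "utterance": "<target utterance transcript>",
    "speaker": "<speaker of the target utterance>",
    "context": ["<preceding utterance 1>", "<preceding utterance 2>"],
    "context_speakers": ["<speaker of context 1>", "<speaker of context 2>"],
    "show": "<source TV show>",
    "sarcasm": True,  # binary sarcasm label
}
```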
The vocal tonality of the speaker often indicates sarcasm: text that otherwise looks straightforward is recognized as sarcastic only when the associated audio is heard. Another marker of sarcasm is undue stress on particular words, such as 'really'.
- Text features: we represent the textual utterances using BERT, averaging the representations of the first token in the utterance across the last four transformer layers (see the sketch after this list).
- Speech features: we obtain low-level features from the audio data stream for each utterance, intended to capture pitch, intonation, and other tonal-specific details (further specifics are given on page 6 of the paper).
- Video features: we extract visual features for each of the f frames in the utterance video using the pool5 layer of an ImageNet-pretrained ResNet-152 image classification model.
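A minimal sketch of how such per-modality features could be extracted, assuming HuggingFace Transformers for BERT, librosa for the audio descriptors, and torchvision for ResNet-152; the hyper-parameters and the exact audio feature set are illustrative and may differ from those used in the paper.

```python
import numpy as np
import torch
import torch.nn as nn
import librosa
import torchvision
from torchvision import transforms
from transformers import BertModel, BertTokenizer

# --- Text: average the first token's representation over the last 4 BERT layers ---
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True).eval()

def text_features(utterance: str) -> torch.Tensor:
    inputs = tokenizer(utterance, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden_states = bert(**inputs).hidden_states      # tuple of (1, seq_len, 768)
    last_four = torch.stack(hidden_states[-4:])            # (4, 1, seq_len, 768)
    return last_four[:, 0, 0, :].mean(dim=0)               # first token, averaged -> (768,)

# --- Audio: low-level tonal descriptors (an illustrative subset, not the full list) ---
def audio_features(wav_path: str) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)           # (13, frames)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)     # (1, frames)
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)                # frame-level pitch estimate
    # summarize frame-level descriptors by their mean over time
    return np.concatenate([mfcc.mean(axis=1), centroid.mean(axis=1), [np.nanmean(f0)]])

# --- Video: pool5 activations of an ImageNet-pretrained ResNet-152, pooled over frames ---
resnet = torchvision.models.resnet152(weights="IMAGENET1K_V1")
pool5 = nn.Sequential(*list(resnet.children())[:-1]).eval()      # drop the classification head
preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def video_features(frames) -> torch.Tensor:      # frames: list of PIL images from the clip
    batch = torch.stack([preprocess(f) for f in frames])         # (f, 3, 224, 224)
    with torch.no_grad():
        feats = pool5(batch).squeeze(-1).squeeze(-1)             # (f, 2048)
    return feats.mean(dim=0)                                     # one vector per utterance
```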
We evaluate each modality separately as well as all modalities combined. We also investigate the role of context and speaker information in improving predictions.
The experiments are conducted using three main baseline methods:
- Majority: this baseline assigns all instances to the majority class, i.e., non-sarcastic.
- Random: makes random predictions sampled uniformly across the test set.
- SVM: support vector machines are strong predictors for small datasets (a minimal sketch of this baseline follows the list).
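As an illustration, a minimal SVM baseline over (optionally concatenated) modality features could look as follows; the kernel and hyper-parameters are our own choices rather than the paper's exact configuration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def run_svm_baseline(X_train, y_train, X_test, y_test):
    """X_* are per-utterance feature matrices; y_* are binary sarcasm labels."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale"))
    clf.fit(X_train, y_train)
    return f1_score(y_test, clf.predict(X_test), average="weighted")

def early_fusion(*modalities):
    """Concatenate per-modality feature matrices column-wise (simple early fusion)."""
    return np.concatenate(modalities, axis=1)

# Usage (assuming Xt_*, Xa_*, Xv_* text/audio/video features from the step above):
#   f1_text     = run_svm_baseline(Xt_train, y_train, Xt_test, y_test)
#   f1_trimodal = run_svm_baseline(early_fusion(Xt_train, Xa_train, Xv_train), y_train,
#                                  early_fusion(Xt_test, Xa_test, Xv_test), y_test)
```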
The lowest performance is obtained with the majority baseline, with a 66.7% F1-score for the non-sarcastic class and 0% for the sarcastic class. Among the unimodal variants, the pre-trained visual features provide the best performance. Adding textual features through concatenation achieves the best overall performance, while the tri-modal variant falls short of the best score due to the slightly sub-optimal performance of the audio modality.
Overall, the combination of visual and textual signals significantly improves over the unimodal variants, with a relative error rate reduction of up to 12.9%.
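To make the metric explicit, the following sketch computes the relative error rate reduction under the assumption that the error rate is taken as 1 − F1; the scores in the example are illustrative, not the paper's exact numbers.

```python
def relative_error_reduction(f1_unimodal: float, f1_multimodal: float) -> float:
    # error rate taken as 1 - F1; reduction is expressed relative to the unimodal error
    e_uni, e_multi = 1.0 - f1_unimodal, 1.0 - f1_multimodal
    return (e_uni - e_multi) / e_uni

# e.g., an F1 of 0.720 improving to 0.756 is a ~12.9% relative error reduction
print(round(relative_error_reduction(0.720, 0.756), 3))   # 0.129
```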
In cases where the bimodal textual-visual model correctly predicts sarcasm but the unimodal textual model fails, the textual component alone does not reveal any explicit sarcasm.
The presence of new speakers in the testing set requires a higher degree of generalization from the model.
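This corresponds to a speaker-independent evaluation. A minimal sketch of such a split, assuming each example carries a speaker identity (not necessarily the paper's exact protocol):

```python
from sklearn.model_selection import GroupShuffleSplit

# Speakers appearing in the test fold never appear in training, forcing the
# model to generalize to unseen speakers. X, y, speakers are the feature
# matrix, labels, and per-example speaker names from the steps above.
def speaker_independent_split(X, y, speakers, test_size=0.2, seed=0):
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(X, y, groups=speakers))
    return train_idx, test_idx
```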
We provide a systematic introduction to multimodal learning for sarcasm detection. We introduce a novel dataset, MUStARD, consisting of sarcastic and non-sarcastic videos from different sources. We demonstrate the need for multimodal learning for sarcasm detection, and we develop models that leverage three different modalities: text, speech, and video.
The results of the baseline experiments supported the hypothesis that multimodality is important for sarcasm detection. In multiple evaluations, the multimodal variants were shown to significantly outperform their unimodal counterparts, with relative error rate reductions of up to 12.9%.
We strove to create a high-quality dataset with rich annotations, trading off corpus size in the process. We chose a balanced version of the dataset; however, with this amount of data, complex neural networks overfit, which is why SVMs outperform CNNs in our experiments. Future work should explore pre-training, transfer learning, domain adaptation, etc.
As gestures and facial expressions are important cues for sarcasm analysis, we believe that the capability of models to identify the speaker in multi-party videos is likely to be beneficial for the task.