
Misclassification of Out-of-Domain Videos in TC-CLIP #7

Open
ooza opened this issue Dec 4, 2024 · 1 comment
ooza commented Dec 4, 2024

Hi,
Thanks again for this interesting model. I tested the demo notebook on a small custom dataset.
(tc_clip_model_path = "pretrained/zero_shot_k400_llm_tc_clip.pth" # pretrained model path)
I'm encountering an issue where TC-CLIP misclassifies videos that do not belong to any of the defined action classes. For example, I added a neutral video of a puppy (completely unrelated to the action classes) to my dataset, which consists of the following classes:

  • stealing
  • robbery
  • violence

Despite the video's irrelevance, the model assigns it the label "stealing" based on the highest logit value:
{'stealing': 24.42, 'robbery': 22.50, 'violence': 23.47}
This behavior is problematic because it suggests that the model always outputs one of the predefined classes, even when the input does not fit any of them.

What I Tried

  • Rejection Threshold: I implemented a threshold to reject predictions where the highest logit is below a certain value. However, this approach did not generalize well and led to poor performance when legitimate action videos had logits close to the threshold.

  • Neutral Class: I added an "other" class, but this approach was not effective either:

[Screenshot: classification results with the added "other" class]
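For reference, the rejection-threshold idea can also be applied to the softmax confidence rather than the raw logit, so the near-tied logits from the puppy example fall below the cut-off. This is only an illustrative sketch: `classify_with_rejection` and the threshold `tau` are hypothetical names, and `tau` would need tuning on held-out in-domain and out-of-domain clips.

```python
import math

def classify_with_rejection(logits, tau=0.7):
    """Reject a prediction when the top softmax probability is below tau.

    `logits` maps class name -> raw logit (as printed above); `tau` is a
    hypothetical confidence threshold that must be tuned per dataset.
    """
    # Numerically stable softmax over the raw logits.
    m = max(logits.values())
    exps = {c: math.exp(v - m) for c, v in logits.items()}
    total = sum(exps.values())
    probs = {c: e / total for c, e in exps.items()}
    best = max(probs, key=probs.get)
    return ("unknown" if probs[best] < tau else best), probs

# The puppy video's logits from the issue: the classes are nearly tied,
# so the top softmax probability (~0.65) is below tau and the clip is rejected.
label, probs = classify_with_rejection(
    {"stealing": 24.42, "robbery": 22.50, "violence": 23.47})
# label == "unknown"
```

Thresholding the normalized probability instead of the raw logit makes the criterion invariant to an overall logit offset, though it still inherits the original problem that borderline in-domain clips can fall near the cut-off.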

Expected Behavior
The model should ideally:

  • Provide an "unknown" or "no action" output for videos that do not belong to any defined class.
  • Avoid forcing a prediction into one of the predefined classes when the input is irrelevant.

Could you please provide guidance or suggest strategies to handle out-of-distribution inputs effectively in TC-CLIP?


byminji commented Dec 9, 2024

Hi @ooza, Thank you so much for your interest in our work! 🤗

This is expected behavior, as our model wasn't explicitly trained to reject OOD samples as an <unknown> class. Simply adding an "other" class might not help, since it pulls the video embeddings toward the text embedding of the word "other" (literally, its semantics).

The most straightforward way to achieve OOD detection is to fine-tune our model on a task like anomaly detection, teaching it to reject OOD samples. If you want to achieve this without fine-tuning, you can try using an "easy and familiar" bag of words seen during K400 training as a criterion for the neutral classes.
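A minimal sketch of the bag-of-familiar-words idea: score the target classes together with neutral K400-style class names, and report "unknown" whenever a distractor wins. The function name, the distractor prompts, and the similarity values below are all hypothetical.

```python
def classify_with_distractors(sims, target_classes):
    """Reject when a familiar neutral class scores highest.

    `sims` maps class name -> similarity logit, covering both the target
    classes and a bag of neutral class names drawn from the training
    vocabulary (hypothetical K400-style examples below). If the winner
    is not a target class, report "unknown" instead of forcing a label.
    """
    best = max(sims, key=sims.get)
    return best if best in target_classes else "unknown"

targets = {"stealing", "robbery", "violence"}
# Hypothetical similarities: for a puppy video, neutral prompts such as
# "petting animal" should outrank the crime classes seen at test time.
puppy_sims = {"stealing": 24.4, "robbery": 22.5, "violence": 23.5,
              "petting animal": 27.1, "walking the dog": 25.8}
```

The neutral prompts act as calibrated competitors: because they were seen during training, the model scores them reliably, and an irrelevant clip is more likely to match one of them than any of the three target classes.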

Another minor suggestion is to change the weight-averaging ratio $wise_ft. If you think your data distribution is fairly similar to the training source distribution (in our case, Kinetics-400), increasing $wise_ft might help. For more details, refer to Appendix E in our paper.
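The WiSE-FT-style weight averaging referred to above can be sketched as a linear interpolation between zero-shot and fine-tuned weights. This is a hand-rolled illustration with plain lists standing in for parameter tensors; `wise_ft_merge` and `alpha` are hypothetical names, and which side of the interpolation the $wise_ft ratio weights in TC-CLIP's implementation should be checked against the repository.

```python
def wise_ft_merge(theta_zs, theta_ft, alpha):
    """Interpolate zero-shot and fine-tuned weights per parameter.

    theta_zs / theta_ft map parameter name -> flat list of values
    (stand-ins for tensors). alpha = 0 keeps the zero-shot weights,
    alpha = 1 keeps the fully fine-tuned weights; intermediate values
    trade zero-shot robustness against in-distribution accuracy.
    """
    return {name: [(1 - alpha) * z + alpha * f
                   for z, f in zip(zs, theta_ft[name])]
            for name, zs in theta_zs.items()}

# Midpoint average of two toy "checkpoints".
merged = wise_ft_merge({"w": [0.0, 2.0]}, {"w": [1.0, 4.0]}, alpha=0.5)
# merged["w"] == [0.5, 3.0]
```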
