In the sample code provided, features are concatenated before being processed by the encoder.
features = torch.concat([video_tokenizer(video), audio_tokenizer(audio), time_series_tokenizer(time_data)], dim=1)
However, when I ran the tokenizers for different modalities, the tokenized shapes were not identical.
For example, an image is tokenized as [B, HW, C] while text is tokenized as [B, 1, C], where C is the embedding dimension 768.
How are we supposed to process this? Also, is there sample code using text_tokenizer? It seems like the text_encoder is loading the wrong tokenizer: CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
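For reference, my understanding is that concatenating along dim=1 does not require identical sequence lengths, only matching batch and embedding dimensions. A minimal sketch with random tensors standing in for the tokenizer outputs (the names and the 14x14 patch grid are just illustrative; the shapes are the ones mentioned above):

import torch

B, C = 2, 768  # batch size and shared embedding dimension (768, as noted above)

# Hypothetical stand-ins for the per-modality tokenizer outputs:
image_tokens = torch.randn(B, 14 * 14, C)  # image tokenized as [B, HW, C]
text_tokens = torch.randn(B, 1, C)         # text tokenized as [B, 1, C]

# torch.cat along dim=1 only requires the batch (dim 0) and embedding (dim 2)
# sizes to match; the per-modality sequence lengths may differ.
features = torch.cat([image_tokens, text_tokens], dim=1)
print(features.shape)  # torch.Size([2, 197, 768])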