In the sample code provided, features are concatenated before being processed by the encoder.
features = torch.concat([video_tokenizer(video), audio_tokenizer(audio), time_series_tokenizer(time_data)], dim=1)
However, when I ran the tokenizers for different modalities, the tokenized shapes were not identical.
For example, an image is tokenized as [B, HW, C] while text is tokenized as [B, 1, C], where C is the embedding dimension 768.
How are we supposed to process this? Also, is there sample code using text_tokenizer? It seems like the text_encoder is loading the wrong tokenizer: CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
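For reference, my understanding is that concatenating along dim=1 does not require identical sequence lengths, only matching batch and embedding dimensions. A minimal sketch with random tensors standing in for the tokenizer outputs (the names and the 14x14 patch grid are just illustrative; the shapes are the ones mentioned above):

import torch

B, C = 2, 768  # batch size and shared embedding dimension (768, as noted above)

# Hypothetical stand-ins for the per-modality tokenizer outputs:
image_tokens = torch.randn(B, 14 * 14, C)  # image tokenized as [B, HW, C]
text_tokens = torch.randn(B, 1, C)         # text tokenized as [B, 1, C]

# torch.cat along dim=1 only requires the batch (dim 0) and embedding (dim 2)
# sizes to match; the per-modality sequence lengths may differ.
features = torch.cat([image_tokens, text_tokens], dim=1)
print(features.shape)  # torch.Size([2, 197, 768])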