Does the tokenizer support Chinese? #7
Comments
Hello! The tokenizer is based on byte-pair encoding and can tokenize any valid UTF-8 string, but we don't recommend using it for non-English sentences, since its vocabulary is primarily based on English. Furthermore, CLIP models are trained mostly on English data, so we don't expect particularly good performance from CLIP with non-English text inputs. Multilingual CLIP would be interesting future work.
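To illustrate why a byte-level tokenizer can accept Chinese text at all, here is a minimal sketch using only the Python standard library. It does not call the CLIP tokenizer; the helper name `utf8_byte_counts` and the sample strings are made up for illustration. It only shows that every character in a valid UTF-8 string reduces to a sequence of bytes, which is the level a byte-based BPE vocabulary can fall back to when it has few merges for a script.

```python
# Illustrative only: this is NOT the CLIP tokenizer, just the UTF-8 byte view
# that a byte-level BPE tokenizer starts from.

def utf8_byte_counts(text: str) -> list[tuple[str, int]]:
    """Return (character, number of UTF-8 bytes) for each character in text."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]

english = "a photo of a cat"   # hypothetical sample caption
chinese = "一张猫的照片"          # hypothetical sample caption ("a photo of a cat")

# ASCII characters take one byte each; these CJK characters take three each.
print(len(english), len(english.encode("utf-8")))
print(len(chinese), len(chinese.encode("utf-8")))
print(utf8_byte_counts(chinese))
```

With a vocabulary learned mostly from English text, common English words tend to merge into a few tokens, while Chinese characters are more likely to stay split into smaller byte-level pieces. How many tokens a given string actually produces depends on the real vocabulary, which this sketch does not model.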
Hi @jongwook, you said "CLIP models are trained with mostly English data". So can it be re-trained with a new image-text pair dataset in other languages?
Yes; I presume it would need to be trained from scratch in that case (or at least the text encoder would).
Multilingual CLIP 666
Hi, maybe you can refer to this repo! https://github.com/OFA-Sys/Chinese-CLIP @zhanxlin
Hello, does the tokenizer support Chinese?