Does the tokenizer support Chinese? #7

Closed
zhanxlin opened this issue Jan 8, 2021 · 5 comments

Comments

@zhanxlin

zhanxlin commented Jan 8, 2021

Hello, does the tokenizer support Chinese?

@jongwook
Collaborator

jongwook commented Jan 8, 2021

Hello!

The tokenizer is based on byte-pair encoding and can tokenize any valid UTF-8 string, but we don't recommend using it for non-English sentences, since its vocabulary is built primarily from English. Furthermore, CLIP models are trained with mostly English data, so we don't expect particularly good performance from CLIP with non-English text inputs. A multilingual CLIP would be interesting future work.
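
To illustrate the point about UTF-8 coverage, here is a minimal sketch. It is not CLIP's actual tokenizer: it assumes only that a byte-level BPE starts from the 256 possible byte values and then applies learned merges. The helper names `byte_level_symbols` and `roundtrip` are hypothetical and exist only for this example. Any valid UTF-8 string can be represented this way, but text that the learned merges do not cover well is likely to stay close to its raw byte count.

```python
from typing import List


def byte_level_symbols(text: str) -> List[int]:
    """Return the UTF-8 byte values of `text`.

    These are the base symbols a byte-level BPE would start from,
    before any learned merges are applied (hypothetical helper).
    """
    return list(text.encode("utf-8"))


def roundtrip(symbols: List[int]) -> str:
    """Decode the byte values back into the original string."""
    return bytes(symbols).decode("utf-8")


if __name__ == "__main__":
    for sample in ["a photo of a cat", "一张猫的照片"]:
        symbols = byte_level_symbols(sample)
        # Every valid UTF-8 string survives the round trip unchanged.
        assert roundtrip(symbols) == sample
        print(f"{sample!r}: {len(sample)} characters, {len(symbols)} base symbols")
```

Each Chinese character in the sample occupies three bytes in UTF-8, so six characters start out as eighteen base symbols. A vocabulary whose merges were learned mostly from English text will probably compress such a sequence far less than it compresses the English caption.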

@ruby0101

Hi @jongwook, you said "CLIP models are trained with mostly English data". Can it be re-trained on a new image-text pair dataset in other languages?

@jongwook
Collaborator

Yes; I presume the model (or at least the text encoder) would need to be trained from scratch in that case.

@datalee

datalee commented Mar 15, 2022

Multilingual CLIP would be awesome!

@yangapku

yangapku commented Nov 18, 2022

Hi @zhanxlin, you may want to take a look at this repo: https://github.com/OFA-Sys/Chinese-CLIP
