Does the tokenizer support Chinese? #7

zhanxlin · 2021-01-08T08:37:32Z

Hello, does the tokenizer support Chinese?

jongwook · 2021-01-08T18:05:41Z

Hello!

The tokenizer is based on byte-pair encoding and can tokenize any valid UTF-8 string, but we don't recommend using it for non-English sentences, since its vocabulary is primarily based on English. Furthermore, CLIP models are trained with mostly English data, so we don't expect particularly good performance from CLIP with non-English text inputs. Multilingual CLIP would be interesting future work.
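To illustrate why a byte-level BPE tokenizer can accept any UTF-8 string yet handles Chinese inefficiently, here is a minimal sketch using only the Python standard library. It does not call the CLIP tokenizer. The helper `utf8_byte_fallback` is a hypothetical name introduced for this example, and it models only the worst case, where no learned merge applies and every UTF-8 byte becomes its own unit. The real tokenizer applies learned merges and adds special tokens, so its actual counts will differ.

```python
from typing import List


def utf8_byte_fallback(text: str) -> List[int]:
    """Hypothetical helper: split text into its raw UTF-8 byte values.

    This models the worst case for a byte-level BPE tokenizer, where no
    learned merge applies and every byte stays a separate unit.
    """
    return list(text.encode("utf-8"))


english = "a photo of a cat"
chinese = "一张猫的照片"  # roughly "a photo of a cat" in Chinese

for sentence in (english, chinese):
    units = utf8_byte_fallback(sentence)
    # Any valid string round-trips through its bytes, which is why a
    # byte-level vocabulary can represent every UTF-8 input.
    assert bytes(units).decode("utf-8") == sentence
    print(f"{sentence!r}: {len(sentence)} characters -> {len(units)} byte units")
```

Each of these Chinese characters occupies three bytes in UTF-8, so the six-character sentence yields 18 byte units, while the 16-character English sentence yields 16. With an English-centric merge table, few of those Chinese byte sequences are likely to be merged, so the text would tend to consume far more of the model's fixed context length per character than English does.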

ruby0101 · 2021-02-17T15:41:44Z

Hi @jongwook, you wrote that "CLIP models are trained with mostly English data". Can the model be re-trained with a new image-text pair dataset in other languages?

jongwook · 2021-02-18T06:17:13Z

Yes; I presume the model would need to be trained from scratch (or at least the text encoder) in that case.

datalee · 2022-03-15T07:24:35Z

Multilingual CLIP would be awesome!

yangapku · 2022-11-18T15:27:20Z

Hi, maybe you can refer to this repo! https://github.com/OFA-Sys/Chinese-CLIP @zhanxlin

jongwook closed this as completed Jan 13, 2021

jongwook mentioned this issue Aug 20, 2021

how to use clip on chinese dataset? #142

Closed

jongwook mentioned this issue Dec 3, 2021

About chinese #192

Closed

jongwook mentioned this issue Nov 11, 2022

whether CLIP can embedding chinese text ? #302

Closed

Young-Flash mentioned this issue Aug 15, 2023

Android support mazzzystar/Queryable#12

Closed

HYLcool mentioned this issue Nov 14, 2024

How to calculate the image_text_similarity scores for both Chinese and English? modelscope/data-juicer#473

Open
