How do CLIP and the tokenizer actually read the caption? Which tokens are actually affected and trained?
I'm trying out a custom CLIP tokenizer, but I'm not quite sure how the tokenizer actually works, or how the caption affects the tokens. Information has become really scarce since AI chatbots appeared — everyone asks them instead of posting on forums, so you can't find the answers by searching.
My workflow is to caption my files first, then use those caption files to add their tokens to the model. Only tokens that don't already exist are added automatically.
Assuming I have the default tokenizer and then add another token, like `green_hair` and `green_hair</w>`: what happens when I then type `green_hair` in my caption? Will only `green_hair` be affected, or will every token that matches be affected, like `g`, `r`, `e`, `n`, `_`, `h`, `a`, `i`, `green`, `hair`, `green_hair`, `green_hair</w>`, etc.?
What does `</w>` do, and should I include it in the caption? What if I affix the token? Will `light_green_hair` no longer affect `green_hair</w>`, but still affect all the other tokens? How does the tokenizer know to separate `green_hair` from `green_hair</w>`? I'm not quite sure why the tokenizer doubles up the tokens using `</w>`.
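For anyone landing here with the same questions, here is my understanding as a toy sketch (my own code, not CLIP's actual implementation — real CLIP BPE applies learned merge rules in rank order rather than longest-match, but for a whole-word vocabulary entry the effect is similar). The tokenizer appends `</w>` to the end of each whitespace-delimited word before segmenting it, which is why vocab entries are doubled up: `green_hair</w>` can only appear at the end of a word, while plain `green_hair` can only appear mid-word. Each word is segmented into exactly one sequence of pieces, and only the pieces that appear in that sequence receive gradient updates — `g`, `green`, `hair`, etc. are untouched unless they actually show up in the segmentation:

```python
# Toy longest-match tokenizer with a '</w>' end-of-word marker.
# A sketch only: real CLIP uses BPE merges, but like here, each word
# maps to ONE piece sequence, and only those pieces get trained.

def tokenize_word(word, vocab):
    """Append '</w>', then greedily match the longest vocab piece."""
    word = word + "</w>"
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

vocab = {"green</w>", "hair</w>", "green", "hair", "_", "green_hair</w>"}

print(tokenize_word("green_hair", vocab))
# ['green_hair</w>'] -- one token; nothing else is "affected"

print(tokenize_word("light_green_hair", vocab))
# ['l', 'i', 'g', 'h', 't', '_', 'green_hair</w>'] -- with this toy
# matcher the suffix still hits green_hair</w>; with real BPE, whether
# the same piece forms inside a longer word depends on the merge rules.
```

So no, you should not type `</w>` in the caption — it is added internally during tokenization.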
Does it matter if I duplicate my training data and replace the underscores with spaces? Does this benefit training in any way, or does it happen automatically, as I asked above?
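As far as I understand, it does not happen automatically: CLIP-style tokenizers split the caption on whitespace before anything else, so `green hair` and `green_hair` produce different token sequences backed by different embeddings. A tiny sketch of that pre-tokenization step (toy code, assuming the usual lowercase-and-split behavior):

```python
# Sketch of CLIP-style pre-tokenization: lowercase, split on
# whitespace, mark the end of each word with '</w>' before BPE runs.
# Spaces vs. underscores therefore yield genuinely different tokens.

def pre_tokenize(caption):
    return [word + "</w>" for word in caption.lower().split()]

print(pre_tokenize("green hair"))   # ['green</w>', 'hair</w>']
print(pre_tokenize("green_hair"))   # ['green_hair</w>']
```

Duplicating the data with spaces swapped in therefore trains two distinct caption variants; whether that helps depends on whether you want the concept to respond to both spellings at inference time.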
How does the `tokenizer_config` come into play? It lets you specify whether tokens are "lstrip", "normalized", "rstrip", "single_word", or "special".
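Those flags control how an *added* token is matched against the raw text, at least in the Hugging Face `transformers` format. An illustrative fragment (assuming a transformers-style `tokenizer_config.json`; the ID 49408 is hypothetical):

```json
{
  "added_tokens_decoder": {
    "49408": {
      "content": "green_hair</w>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": true,
      "special": false
    }
  }
}
```

Roughly: `lstrip`/`rstrip` let the match consume whitespace to the token's left/right; `normalized` applies the tokenizer's normalization (e.g. lowercasing) before matching; `single_word` means the token only matches as a standalone word, so with it set, `light_green_hair` would not trigger this token; `special` marks it as a control token that decoding can skip.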