How do CLIP and the tokenizer actually read the caption? Which tokens are actually affected and trained?
I'm trying out a custom CLIP tokenizer, but I'm not quite sure how the tokenizer actually works, or how the caption affects the tokens. Information has become really scarce since AI chatbots appeared — everyone asks them instead of posting on forums, so you can't find the answers by searching.
My workflow is to caption my files first, then use those caption files to add their tokens to the model. Only tokens that don't already exist are added automatically.
Assuming I have the default tokenizer and then add another token, like `green_hair` and `green_hair</w>`: what happens when I then type `green_hair` in my caption? Will only `green_hair` be affected, or will every token that matches be affected, like `g`, `r`, `e`, `n`, `_`, `h`, `a`, `i`, `green`, `hair`, `green_hair`, `green_hair</w>`, etc.?
What does `</w>` do, and should I include it in the caption? What if I affix the token? Will `light_green_hair` no longer affect `green_hair</w>`, but still affect all the other tokens? How does the tokenizer know to separate `green_hair` from `green_hair</w>`? I'm not quite sure why the tokenizer doubles up the tokens using `</w>`.
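For anyone landing here with the same questions, here is my understanding as a toy sketch (my own code, not CLIP's actual implementation — real CLIP BPE applies learned merge rules in rank order rather than longest-match, but for a whole-word vocabulary entry the effect is similar). The tokenizer appends `</w>` to the end of each whitespace-delimited word before segmenting it, which is why vocab entries are doubled up: `green_hair</w>` can only appear at the end of a word, while plain `green_hair` can only appear mid-word. Each word is segmented into exactly one sequence of pieces, and only the pieces that appear in that sequence receive gradient updates — `g`, `green`, `hair`, etc. are untouched unless they actually show up in the segmentation:

```python
# Toy longest-match tokenizer with a '</w>' end-of-word marker.
# A sketch only: real CLIP uses BPE merges, but like here, each word
# maps to ONE piece sequence, and only those pieces get trained.

def tokenize_word(word, vocab):
    """Append '</w>', then greedily match the longest vocab piece."""
    word = word + "</w>"
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

vocab = {"green</w>", "hair</w>", "green", "hair", "_", "green_hair</w>"}

print(tokenize_word("green_hair", vocab))
# ['green_hair</w>'] -- one token; nothing else is "affected"

print(tokenize_word("light_green_hair", vocab))
# ['l', 'i', 'g', 'h', 't', '_', 'green_hair</w>'] -- with this toy
# matcher the suffix still hits green_hair</w>; with real BPE, whether
# the same piece forms inside a longer word depends on the merge rules.
```

So no, you should not type `</w>` in the caption — it is added internally during tokenization.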
Does it matter if I duplicate my training data and replace the underscores with spaces? Does this benefit training in any way, or does it happen automatically, as I asked above?
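As far as I understand, it does not happen automatically: CLIP-style tokenizers split the caption on whitespace before anything else, so `green hair` and `green_hair` produce different token sequences backed by different embeddings. A tiny sketch of that pre-tokenization step (toy code, assuming the usual lowercase-and-split behavior):

```python
# Sketch of CLIP-style pre-tokenization: lowercase, split on
# whitespace, mark the end of each word with '</w>' before BPE runs.
# Spaces vs. underscores therefore yield genuinely different tokens.

def pre_tokenize(caption):
    return [word + "</w>" for word in caption.lower().split()]

print(pre_tokenize("green hair"))   # ['green</w>', 'hair</w>']
print(pre_tokenize("green_hair"))   # ['green_hair</w>']
```

Duplicating the data with spaces swapped in therefore trains two distinct caption variants; whether that helps depends on whether you want the concept to respond to both spellings at inference time.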
How does the `tokenizer_config` come into play? It lets you specify whether tokens are "lstrip", "normalized", "rstrip", "single_word", or "special".
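Those flags control how an *added* token is matched against the raw text, at least in the Hugging Face `transformers` format. An illustrative fragment (assuming a transformers-style `tokenizer_config.json`; the ID 49408 is hypothetical):

```json
{
  "added_tokens_decoder": {
    "49408": {
      "content": "green_hair</w>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": true,
      "special": false
    }
  }
}
```

Roughly: `lstrip`/`rstrip` let the match consume whitespace to the token's left/right; `normalized` applies the tokenizer's normalization (e.g. lowercasing) before matching; `single_word` means the token only matches as a standalone word, so with it set, `light_green_hair` would not trigger this token; `special` marks it as a control token that decoding can skip.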