Add unigram bytefallback #1217
Something like this could initialize initial vocabulary for byte_fallback.
@ArthurZucker
Hey! Sorry, I just got back from holidays! Will be updating soon.
No apology necessary! Hope you had a good vacation. I am interested in how you plan to address the alphabet/initial tokens situation. It would have to be ensured that these tokens are never produced by ordinary tokenization (they are not meant to be tokenized, but rather processed internally as a fallback during tokenization and generation). AFAICT there are no mechanisms like spm's control symbols, which are guaranteed not to have a surface representation. One way would be to assign those tokens an i32::MIN log prob so that they are very unlikely to be tokenized.
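As a rough sketch of the idea in this comment (not the code from the PR), the 256 byte-fallback pieces could be generated up front and given a very low score. The `<0xNN>` piece format follows the SentencePiece convention; `byte_fallback_vocab` is a hypothetical helper name, and the `(String, f64)` entry shape and the choice of score are assumptions made for illustration.

```rust
/// Format a byte as a SentencePiece-style byte piece, e.g. 0x41 -> "<0x41>".
fn byte_to_piece(byte: u8) -> String {
    format!("<0x{:02X}>", byte)
}

/// Hypothetical helper: build one vocabulary entry per possible byte value,
/// all sharing the same (very low) log probability so that the pieces are
/// unlikely to be chosen over regular tokens.
fn byte_fallback_vocab(score: f64) -> Vec<(String, f64)> {
    (0..=u8::MAX).map(|b| (byte_to_piece(b), score)).collect()
}
```

Calling `byte_fallback_vocab(f64::from(i32::MIN))` would yield 256 entries, from `<0x00>` to `<0xFF>`, each with the log prob suggested above. A low score only makes these pieces unlikely, it does not strictly forbid them.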
… add-unigram-byte-fallback
let ids = if self.token_to_ids.contains_key(&string) {
    vec![*self.token_to_ids.get(&string).unwrap()]
} else if self.byte_fallback {
    string
        .bytes()
        .map(|b| self.token_to_id(&byte_to_piece(b)).unwrap())
        .collect()
} else {
    vec![self.unk_id.ok_or(UnigramError::MissingUnkId)? as u32]
};
let len = string.len();
let offsets = (offset, offset + len);
let len = string.len() - ids.len() + 1;
for id in ids {
    let offsets = (offset, offset + len);
    tokens.push(Token::new(id, self.id_to_token(id).unwrap(), offsets));
}
Remove every unwrap and every vec.
There is 1 collect tolerated (I think it's done that way in BPE) and it's only to check that ALL bytes have a token id (you're allowed to use a single vec or collect in that branch, not in the others). :)
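One possible shape for what this review asks for, as a self-contained sketch rather than the code that was eventually merged: ids are pushed straight into the output, and the only collect sits in the byte-fallback branch, where gathering into an Option checks that every byte has an id. `push_ids`, `LookupError`, and the plain HashMap vocabulary are hypothetical stand-ins for the model's own types, and falling back to the unknown token when a byte piece is missing is an assumption.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum LookupError {
    MissingUnkId,
}

/// Format a byte as a SentencePiece-style byte piece, e.g. 0x41 -> "<0x41>".
fn byte_to_piece(byte: u8) -> String {
    format!("<0x{:02X}>", byte)
}

/// Hypothetical helper: append the id(s) for `piece` to `out` without unwrap.
fn push_ids(
    token_to_ids: &HashMap<String, u32>,
    piece: &str,
    byte_fallback: bool,
    unk_id: Option<u32>,
    out: &mut Vec<u32>,
) -> Result<(), LookupError> {
    // Known piece: push its id directly, no temporary vector.
    if let Some(&id) = token_to_ids.get(piece) {
        out.push(id);
        return Ok(());
    }
    if byte_fallback {
        // The single collect: it yields None as soon as one byte has no id.
        let byte_ids: Option<Vec<u32>> = piece
            .bytes()
            .map(|b| token_to_ids.get(&byte_to_piece(b)).copied())
            .collect();
        if let Some(ids) = byte_ids {
            out.extend(ids);
            return Ok(());
        }
    }
    // Otherwise use the unknown token, or report that none is configured.
    let unk = unk_id.ok_or(LookupError::MissingUnkId)?;
    out.push(unk);
    Ok(())
}
```

With a vocabulary holding `<0xC3>` and `<0xA9>`, an unknown "é" would expand to those two ids, while an unknown piece with a missing byte id would produce the unknown id instead.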
I understand the unwrap part, since it means that potential errors are not being handled.
Can you share the reasoning regarding vec? Is it because this focuses on the byte-fallback pieces, which are known at compile time (256 of them) and should therefore be handled with arrays?
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
LGTM !
Excited for this feature! If there are plans for a formal release anytime soon, I'll wait for it and test this function thoroughly! If a new version of "tokenizers" isn't expected to land soon, I'll just get on trying it out asap.
Mmmm, maybe we'll wait until a transformers release includes umT5, see huggingface/transformers#24477! But if you figure out the bug that's breaking nodes, kudos 😅
I noticed byte_fallback is only implemented if you import from a model trained using Google's sentencepiece library (so using SentencePieceUnigramTokenizer.from_spm is required). Are there plans to add this to SentencePieceUnigramTokenizer.train to eliminate the dependency on sentencepiece?
I hope to see that as well.
Yep, we discussed having this in a follow-up PR! Did not have time to do it yet!
Ok, sg! Do you have a minimal example for coaxing the current implementation to work with byte fallback without importing from an SPM file?
Let me know if you need any assistance in developing or testing!
Also still interested in having
On my TODO list. @chris-ha458, I'd be happy to review a PR if you want to tackle this!
The last time I solved this, it was through an ugly hack using scripts with the Python bindings.
Adds support for byte fallback with the unigram model