
Add unigram bytefallback #1217

Merged: 52 commits from add-unigram-byte-fallback into huggingface:main, Jun 26, 2023

Conversation

@ArthurZucker (Collaborator) commented Apr 12, 2023

Adds support for byte fallback with the Unigram model.

@chris-ha458 (Contributor) commented May 1, 2023

Something like this could initialize the vocabulary for byte_fallback. This could also be useful for
#1183 (comment):

// Number of distinct byte values; used to seed the byte-fallback pieces.
const BYTE_CAPACITY: usize = 256;

/// Builds the 256 byte-fallback pieces, "<0x00>" through "<0xFF>".
pub fn create_encoded_bytes() -> Vec<String> {
    (0..BYTE_CAPACITY)
        .map(|i| format!("<0x{:02X}>", i))
        .collect()
}
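Going the other way, decoding needs to map a byte piece back to its raw byte. A minimal sketch of the inverse (a hypothetical helper for illustration, not part of this PR):

```rust
// Hypothetical inverse of the generator above: parse a byte piece such
// as "<0xAB>" back to its raw byte. Returns None for anything that is
// not a well-formed two-hex-digit byte piece.
fn piece_to_byte(piece: &str) -> Option<u8> {
    let hex = piece.strip_prefix("<0x")?.strip_suffix('>')?;
    if hex.len() != 2 {
        return None;
    }
    u8::from_str_radix(hex, 16).ok()
}
```

For example, `piece_to_byte("<0xFF>")` yields `Some(255)`, while a non-piece string like `"hello"` yields `None`.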

@chris-ha458 (Contributor)

@ArthurZucker
How's the implementation going? If necessary, I'd be happy to provide assistance.

@ArthurZucker (Collaborator, Author)

Hey! Sorry, I just got back from holidays! I'll be updating soon.

@chris-ha458 (Contributor)

No apology necessary! Hope you had a good vacation.

I am interested in how you plan to address the alphabet/initial-tokens situation (like my code snippet above).

These tokens would need to be guaranteed never to be produced by normal tokenization: they are not meant to match surface text, but to be used internally as a fallback during tokenization and generation.

AFAICT there is no mechanism like spm's control symbols, which are guaranteed not to have a surface representation.

One way would be to assign those tokens an i32::MIN log prob so as to make them very unlikely to be selected.
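A minimal sketch of that idea, assuming the Unigram vocabulary is a list of (piece, log-prob) pairs; the sentinel score is an arbitrary stand-in for "effectively never chosen", not what the PR actually uses:

```rust
// Sketch: seed the Unigram vocab with the 256 byte-fallback pieces
// pinned to an extremely low log-prob, so the lattice search only
// selects them when no learned piece covers the input. The sentinel
// value below is an assumption for illustration.
const BYTE_FALLBACK_SCORE: f64 = -1e9;

fn seed_byte_fallback_vocab() -> Vec<(String, f64)> {
    (0u32..256)
        .map(|b| (format!("<0x{:02X}>", b), BYTE_FALLBACK_SCORE))
        .collect()
}
```

The learned pieces (with their real log-probs) would then be appended after these entries.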

Comment on lines 436 to 450
let ids = if self.token_to_ids.contains_key(&string) {
    vec![*self.token_to_ids.get(&string).unwrap()]
} else if self.byte_fallback {
    string
        .bytes()
        .map(|b| self.token_to_id(&byte_to_piece(b)).unwrap())
        .collect()
} else {
    vec![self.unk_id.ok_or(UnigramError::MissingUnkId)? as u32]
};
let len = string.len();
let offsets = (offset, offset + len);
let len = string.len() - ids.len() + 1;
for id in ids {
    let offsets = (offset, offset + len);
    tokens.push(Token::new(id, self.id_to_token(id).unwrap(), offsets));
}
Collaborator
Remove every unwrap and every vec.

There is one collect tolerated (I think it's done that way in BPE), and it's only to check that ALL bytes have a token id (you're allowed to use a single vec or collect in that branch, not in the others).

:)

Contributor

I understand the unwrap part, since it implies that potential errors are not being handled.

Can you share the reasoning regarding vec? Is it because this focuses on the byte-fallback pieces, which are known at compile time (256 of them) and could therefore be addressed with arrays?
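One way to read the reviewer's point: because the 256 byte pieces are fixed, their ids can be resolved once into a fixed-size array, and the per-token hot path can then index it with no Vec allocation and no unwrap. A hedged sketch (the function name and the id source are assumptions for illustration, not the PR's final code):

```rust
// Sketch: `byte_ids` stands in for the 256 byte-piece ids resolved once
// from the model's token_to_ids map. Indexing with a u8-derived index
// can never be out of bounds for a 256-entry array, so the hot path
// needs neither bounds handling nor heap allocation.
fn encode_bytes<'a>(s: &'a str, byte_ids: &'a [u32; 256]) -> impl Iterator<Item = u32> + 'a {
    s.bytes().map(move |b| byte_ids[b as usize])
}
```

The caller can then extend an existing token buffer from the iterator instead of materializing an intermediate Vec.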

@HuggingFaceDocBuilderDev commented Jun 21, 2023

The documentation is not available anymore as the PR was closed or merged.

@ArthurZucker ArthurZucker requested a review from Narsil June 22, 2023 12:56
5 review threads on tokenizers/src/models/unigram/model.rs (outdated, resolved)
ArthurZucker and others added 4 commits June 23, 2023 14:06
@Narsil (Collaborator) left a comment

LGTM!

@Narsil Narsil merged commit 864135b into huggingface:main Jun 26, 2023
@ArthurZucker ArthurZucker deleted the add-unigram-byte-fallback branch June 26, 2023 08:49
@chris-ha458 (Contributor)

Excited for this feature!

If a formal release is planned anytime soon, I'll wait for it and test this function thoroughly! (There might be odd edge cases, especially regarding the surface representation of byte-fallback tokens, i.e. what the tokenizer does when the corpus includes text that looks like "<0x03>" in the original text.)

If a new version of tokenizers isn't expected to hit soon, I'll just get on trying it out ASAP.
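To illustrate that edge case: a decoder that recognizes byte-fallback tokens purely by their surface pattern cannot tell a genuine fallback token apart from the literal characters "<0x41>" occurring in the corpus. A toy sketch of the failure mode (not the library's actual decoder; the byte-to-char step is an ASCII-only simplification):

```rust
// Toy decoder that treats any piece matching "<0xHH>" as a byte-fallback
// token. A vocabulary piece that happens to be the literal six-character
// string "<0x41>" decodes to "A" instead of itself, which is exactly the
// ambiguity that needs careful handling.
fn naive_decode(pieces: &[&str]) -> String {
    pieces
        .iter()
        .map(|p| {
            p.strip_prefix("<0x")
                .and_then(|rest| rest.strip_suffix('>'))
                .and_then(|hex| u8::from_str_radix(hex, 16).ok())
                .map(|b| (b as char).to_string()) // ASCII-only simplification
                .unwrap_or_else(|| p.to_string())
        })
        .collect::<String>()
}
```

Disambiguating requires knowing which token ids are real byte-fallback tokens, rather than matching on their string form.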

@ArthurZucker (Collaborator, Author)

Mmmm, maybe we'll wait until a transformers release includes umT5, see huggingface/transformers#24477! But if you figure out the bug that's breaking nodes, kudos 😅

@kellymarchisio (Contributor) commented Jul 17, 2023

I noticed byte_fallback is only implemented if you import from a model trained using Google's sentencepiece library (so using SentencePieceUnigramTokenizer.from_spm is required). Are there plans to add this to SentencePieceUnigramTokenizer.train to eliminate the dependency on sentencepiece?

@chris-ha458 (Contributor)

I hope to see that as well.
I've tried a very ugly workaround: training with byte_fallback on and then adding the byte pieces (<0x00> to <0xFF>) individually as added tokens, but it is a very brittle solution.

@ArthurZucker (Collaborator, Author)

Yep, we discussed having this in a follow-up PR! Haven't had time to do it yet!

@kellymarchisio (Contributor)

Ok, sg! Do you have a minimal example for coaxing the current implementation to work with byte fallback without importing from an SPM file?

@chris-ha458 (Contributor)

Let me know if you need any assistance in developing or testing!

@gautierdag

Also still interested in having byte_fallback properly implemented in Unigram, such that unigram models can be trained/used without unk tokens. I haven't managed to find a workaround and even manually adding the byte pieces to the vocabulary does not work.

@ArthurZucker (Collaborator, Author)

On my TODO list! @chris-ha458, I'd be happy to review a PR if you want to tackle this!

@chris-ha458 (Contributor) commented Aug 30, 2023

The last time I solved this, it was through an ugly hack using scripts with the Python bindings.
I have since learned a bit more Rust and have more confidence using it, so I might have a go at it in the Rust code at the trainer level, but it might take a while.

6 participants