
Add unigram bytefallback #1217

Merged: 52 commits from add-unigram-byte-fallback into huggingface:main, Jun 26, 2023

Conversation

@ArthurZucker (Collaborator) commented Apr 12, 2023

Adds support for byte fallback with the Unigram model.

@chris-ha458 (Contributor) commented May 1, 2023

Something like this could initialize the vocabulary for byte_fallback. This could also be useful for
#1183 (comment):

// Number of distinct byte values; used to seed the byte-fallback pieces.
const BYTE_CAPACITY: usize = 256;

/// Builds the 256 byte-fallback pieces, "<0x00>" through "<0xFF>".
pub fn create_encoded_bytes() -> Vec<String> {
    (0..BYTE_CAPACITY)
        .map(|i| format!("<0x{:02X}>", i))
        .collect()
}
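Going the other way, decoding needs to map a byte piece back to its raw byte. A minimal sketch of the inverse (a hypothetical helper for illustration, not part of this PR):

```rust
// Hypothetical inverse of the generator above: parse a byte piece such
// as "<0xAB>" back to its raw byte. Returns None for anything that is
// not a well-formed two-hex-digit byte piece.
fn piece_to_byte(piece: &str) -> Option<u8> {
    let hex = piece.strip_prefix("<0x")?.strip_suffix('>')?;
    if hex.len() != 2 {
        return None;
    }
    u8::from_str_radix(hex, 16).ok()
}
```

For example, `piece_to_byte("<0xFF>")` yields `Some(255)`, while a non-piece string like `"hello"` yields `None`.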

@chris-ha458 (Contributor)

@ArthurZucker
How's the implementation going? If necessary, I'd be happy to provide assistance.

@ArthurZucker (Collaborator, Author)

Hey! Sorry, I just got back from holidays! I'll be updating soon.

@chris-ha458 (Contributor)

No apology necessary! Hope you had a good vacation.

I am interested in how you plan to address the alphabet/initial-tokens situation (like my code snippet above).

These tokens would need to be guaranteed never to be produced by normal tokenization: they are not meant to match surface text, but to be used internally as a fallback during tokenization and generation.

AFAICT there is no mechanism like spm's control symbols, which are guaranteed not to have a surface representation.

One way would be to assign those tokens an i32::MIN log prob so as to make them very unlikely to be selected.
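A minimal sketch of that idea, assuming the Unigram vocabulary is a list of (piece, log-prob) pairs; the sentinel score is an arbitrary stand-in for "effectively never chosen", not what the PR actually uses:

```rust
// Sketch: seed the Unigram vocab with the 256 byte-fallback pieces
// pinned to an extremely low log-prob, so the lattice search only
// selects them when no learned piece covers the input. The sentinel
// value below is an assumption for illustration.
const BYTE_FALLBACK_SCORE: f64 = -1e9;

fn seed_byte_fallback_vocab() -> Vec<(String, f64)> {
    (0u32..256)
        .map(|b| (format!("<0x{:02X}>", b), BYTE_FALLBACK_SCORE))
        .collect()
}
```

The learned pieces (with their real log-probs) would then be appended after these entries.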

Comment on lines 436 to 450
let ids = if self.token_to_ids.contains_key(&string) {
    vec![*self.token_to_ids.get(&string).unwrap()]
} else if self.byte_fallback {
    string
        .bytes()
        .map(|b| self.token_to_id(&byte_to_piece(b)).unwrap())
        .collect()
} else {
    vec![self.unk_id.ok_or(UnigramError::MissingUnkId)? as u32]
};
let len = string.len();
let offsets = (offset, offset + len);
let len = string.len() - ids.len() + 1;
for id in ids {
    let offsets = (offset, offset + len);
    tokens.push(Token::new(id, self.id_to_token(id).unwrap(), offsets));
}
Collaborator
Remove every unwrap and every vec.

There is one collect tolerated (I think it's done that way in BPE), and it's only to check that ALL bytes have a token id (you're allowed to use a single vec or collect in that branch, not in the others).

:)

Contributor

I understand the unwrap part, since it implies that potential errors are not being handled.

Can you share the reasoning regarding vec? Is it because this focuses on the byte-fallback pieces, which are known at compile time (256 of them) and could therefore be addressed with arrays?
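One way to read the reviewer's point: because the 256 byte pieces are fixed, their ids can be resolved once into a fixed-size array, and the per-token hot path can then index it with no Vec allocation and no unwrap. A hedged sketch (the function name and the id source are assumptions for illustration, not the PR's final code):

```rust
// Sketch: `byte_ids` stands in for the 256 byte-piece ids resolved once
// from the model's token_to_ids map. Indexing with a u8-derived index
// can never be out of bounds for a 256-entry array, so the hot path
// needs neither bounds handling nor heap allocation.
fn encode_bytes<'a>(s: &'a str, byte_ids: &'a [u32; 256]) -> impl Iterator<Item = u32> + 'a {
    s.bytes().map(move |b| byte_ids[b as usize])
}
```

The caller can then extend an existing token buffer from the iterator instead of materializing an intermediate Vec.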

@HuggingFaceDocBuilderDev commented Jun 21, 2023

The documentation is not available anymore as the PR was closed or merged.

@ArthurZucker ArthurZucker requested a review from Narsil June 22, 2023 12:56
5 review threads on tokenizers/src/models/unigram/model.rs (outdated, resolved)
ArthurZucker and others added 4 commits June 23, 2023 14:06
@Narsil (Collaborator) left a comment

LGTM!

@Narsil Narsil merged commit 864135b into huggingface:main Jun 26, 2023
@ArthurZucker ArthurZucker deleted the add-unigram-byte-fallback branch June 26, 2023 08:49
@chris-ha458 (Contributor)

Excited for this feature!

If a formal release is planned anytime soon, I'll wait for it and test this function thoroughly! (There might be odd edge cases, especially regarding the surface representation of byte-fallback tokens, i.e. what the tokenizer does when the corpus includes text that looks like "<0x03>" in the original text.)

If a new version of tokenizers isn't expected to hit soon, I'll just get on trying it out ASAP.
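To illustrate that edge case: a decoder that recognizes byte-fallback tokens purely by their surface pattern cannot tell a genuine fallback token apart from the literal characters "<0x41>" occurring in the corpus. A toy sketch of the failure mode (not the library's actual decoder; the byte-to-char step is an ASCII-only simplification):

```rust
// Toy decoder that treats any piece matching "<0xHH>" as a byte-fallback
// token. A vocabulary piece that happens to be the literal six-character
// string "<0x41>" decodes to "A" instead of itself, which is exactly the
// ambiguity that needs careful handling.
fn naive_decode(pieces: &[&str]) -> String {
    pieces
        .iter()
        .map(|p| {
            p.strip_prefix("<0x")
                .and_then(|rest| rest.strip_suffix('>'))
                .and_then(|hex| u8::from_str_radix(hex, 16).ok())
                .map(|b| (b as char).to_string()) // ASCII-only simplification
                .unwrap_or_else(|| p.to_string())
        })
        .collect::<String>()
}
```

Disambiguating requires knowing which token ids are real byte-fallback tokens, rather than matching on their string form.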

@ArthurZucker (Collaborator, Author)

Mmmm, maybe we'll wait until a transformers release includes umT5, see huggingface/transformers#24477! But if you figure out the bug that's breaking nodes, kudos 😅

@kellymarchisio (Contributor) commented Jul 17, 2023

I noticed byte_fallback is only implemented if you import from a model trained using Google's sentencepiece library (so using SentencePieceUnigramTokenizer.from_spm is required). Are there plans to add this to SentencePieceUnigramTokenizer.train to eliminate the dependency on sentencepiece?

@chris-ha458 (Contributor)

I hope to see that as well.
I've tried a very ugly workaround: training with byte_fallback on and then adding the byte pieces (<0x00> to <0xFF>) individually as added tokens, but it is a very brittle solution.

@ArthurZucker (Collaborator, Author)

Yep, we discussed having this in a follow-up PR! Haven't had time to do it yet!

@kellymarchisio (Contributor)

Ok, sg! Do you have a minimal example for coaxing the current implementation to work with byte fallback without importing from an SPM file?

@chris-ha458 (Contributor)

Let me know if you need any assistance in developing or testing!

@gautierdag

Also still interested in having byte_fallback properly implemented in Unigram, such that unigram models can be trained/used without unk tokens. I haven't managed to find a workaround and even manually adding the byte pieces to the vocabulary does not work.

@ArthurZucker (Collaborator, Author)

On my TODO list! @chris-ha458, I'd be happy to review a PR if you want to tackle this!

@chris-ha458 (Contributor) commented Aug 30, 2023

The last time I solved this, it was through an ugly hack using scripts with the Python bindings.
I have since learned a bit more Rust and have more confidence using it, so I might have a go at it in the Rust code at the trainer level, but it might take a while.

6 participants