Add a visualization utility to render tokens and annotations in a notebook #508
Conversation
@n1t0 can you give me some guidance on what's wrong with the docs build?
I should be able to have a deeper look today! Will let you know
Thank you @talolard, this is really nice and clean, I love it!
I'm not entirely sure about the namespacing (tokenizers.viz vs tokenizers.notebooks or tokenizers.tools) and will need to think about it, but that's a detail.
For the error in the CI about the documentation, I think it is because we need to modify the setup.py to have it include what's necessary when we run python setup.py install|develop. I was having the same problem locally when trying to do import tokenizers in a Python shell.
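For reference, a minimal sketch of the kind of setup.py change this usually requires, assuming the new module ends up as a tokenizers.tools subpackage; the package names and the CSS filename here are illustrative, not the PR's actual diff:

```python
# Abridged sketch of a setup.py that also packages the new subpackage and its
# static assets (names are illustrative, not the actual diff).
from setuptools import setup

setup(
    name="tokenizers",
    # ... existing arguments (version, rust extension config, etc.) ...
    packages=["tokenizers", "tokenizers.tools"],  # include the new subpackage
    package_data={
        "tokenizers.tools": ["visualizer-styles.css"],  # ship non-Python assets too
    },
)
```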
Another little detail: we actually use Google-style for docstrings. If you're not familiar with this syntax, don't worry I'll take care of it. We can also include everything in the API Reference in the Sphinx docs.
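For anyone unfamiliar with it, here is a short example of the Google docstring style being referred to, shown on a made-up helper:

```python
def visualize(text, annotations=None):
    """Render the tokenization of a text inside a notebook.

    Args:
        text (str): The raw text to tokenize and display.
        annotations (List[Annotation], optional): Spans to overlay on top of
            the tokens. Defaults to None.

    Returns:
        str: The generated HTML snippet.
    """
```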
Last thing that we'll need to check is if everything works as expected when using a tokenizer like the one from GPT-2 or Roberta. Since they use a byte-level technique, we can have multiple tokens that have overlapping spans over the input, for example with emojis or other Unicode characters that don't have their own token.
My inclination would be towards tools or viz, but I have no strong preference.
Could you handle that? I'm not sure what to do. When I run setup.py develop it works, presumably because of something I don't understand.
I made an attempt to use Google-style docstrings. There were some places where I wasn't sure how to write out the typings. If you comment on things to fix, I'll learn and fix them.
I added something to the notebook, but per my comment above, I'm not sure exactly what to test for.
The current version of the input text is great, I think, to check if it works as expected. I just ran some tests with a last cell containing the following code:

encoding = roberta_tokenizer.encode(text)
[(token, offset, text[offset[0]:offset[1]]) for (token, offset) in zip(encoding.tokens, encoding.offsets)]

which gives this kind of output:
As you can see, there are actually a lot of tokens that we can't see because they are representing sub-parts of an actual Unicode code point. I'd be curious to see what it looks like with a ByteLevelBPETokenizer trained on a language like Hebrew and see if the visualization actually makes sense in this case. Is this something you'd like to try?
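For completeness, a self-contained version of that check; the vocab/merges paths and the sample text are placeholders:

```python
from tokenizers import ByteLevelBPETokenizer

# Placeholder paths: any byte-level BPE vocab/merges pair (e.g. Roberta's) works here.
roberta_tokenizer = ByteLevelBPETokenizer("roberta-vocab.json", "roberta-merges.txt")

text = "Some text with an emoji 🤗 and other Unicode characters"
encoding = roberta_tokenizer.encode(text)

# Print each token alongside its character offsets and the slice of the
# original text those offsets cover, to spot overlapping spans.
for token, (start, end) in zip(encoding.tokens, encoding.offsets):
    print(token, (start, end), repr(text[start:end]))
```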
Sure, don't worry, I'll take care of anything left related to the integration!
I actually did that and took it out because it was too much text. What do you think of a "gallery" notebook with a mix of languages and tokenizers?
Sure! That'd be a great way to check that everything works as expected.
I added a notebook with some examples in different languages. @n1t0 I actually noticed something strange. When I use the BPE tokenizer, whitespaces are included in the following token. I'm not sure if that's how it's supposed to be, or a bug in my code or in the tokenizers. Could you take a look at these pics and give guidance?
Thank you @talolard! This is actually expected, yes. The byte-level BPE also encodes the whitespace because that is what allows it to decode back to the original sentence. From your pictures and the various examples in the notebooks, I think everything looks as expected for English (and probably other Latin languages). My current concern is about the other languages. I just tried checking the generated tokens with byte-level BPE using the example in Hebrew, and they don't seem to match the visualization at all.

>>> encoding = roberta_tokenizer.encode(texts["Hebrew"])
>>> [(token, offset, texts["Hebrew"][offset[0]:offset[1]]) for (token, offset) in zip(encoding.tokens, encoding.offsets)]
[('×ij', (0, 1), 'ב'),
('×', (1, 2), 'נ'),
('ł', (1, 2), 'נ'),
('×Ļ', (2, 3), 'י'),
('Ġ×', (3, 5), ' א'),
('IJ', (4, 5), 'א'),
('×', (5, 6), 'ד'),
('ĵ', (5, 6), 'ד'),
('×', (6, 7), 'ם'),
('Ŀ', (6, 7), 'ם'),
('Ġ×', (7, 9), ' ז'),
('ĸ', (8, 9), 'ז'),
('×ķ', (9, 10), 'ו'),
('ר', (10, 11), 'ר'),
('×Ļ×', (11, 13), 'ים'),
('Ŀ', (12, 13), 'ם'),
('Ġ×', (13, 15), ' ב'),
('ij', (14, 15), 'ב'),
('×', (15, 16), 'כ'),
('Ľ', (15, 16), 'כ'),
('׾', (16, 17), 'ל'),
('Ġ×', (17, 19), ' י'),
('Ļ', (18, 19), 'י'),
('×ķ', (19, 20), 'ו'),
('×', (20, 21), 'ם'),
('Ŀ', (20, 21), 'ם'),
('Ġ×', (21, 23), ' ל'),
('ľ', (22, 23), 'ל'),
('ר', (23, 24), 'ר'),
('×ķ', (24, 25), 'ו'),
('×', (25, 26), 'ח'),
('Ĺ', (25, 26), 'ח'),
(',', (26, 27), ','),
...
]

As you can see, most tokens represent one character at most, and often there are two tokens for one character. Yet, in the visualization, they appear to be long tokens, which seems wrong. I think the
I honestly don't know how we should handle this. I expect the BPE algorithm to learn the most common tokens without having any overlap in most cases, and small overlaps with rarely seen tokens that end up being decomposed, but I'm not sure at all. Maybe just making sure it alternates between the two shades of grey, while excluding any "multi token single chars" from the next tokens could be enough.
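One possible shape for that, just as a sketch and not what the PR implements: group consecutive tokens whose character spans overlap into a single visual unit, and alternate the shading per group rather than per token.

```python
def group_overlapping_tokens(offsets):
    """Group token indices whose (start, end) character spans overlap.

    Args:
        offsets (List[Tuple[int, int]]): The offsets of each token, as
            returned by Encoding.offsets.

    Returns:
        List[List[int]]: Token indices grouped into visual units.
    """
    groups, current, current_end = [], [], 0
    for i, (start, end) in enumerate(offsets):
        if current and start < current_end:
            # This token covers characters already claimed by the current
            # group (e.g. the second half of a multi-byte code point).
            current.append(i)
            current_end = max(current_end, end)
        else:
            if current:
                groups.append(current)
            current, current_end = [i], end
    if current:
        groups.append(current)
    return groups
```

Applied to the Hebrew offsets above, this would merge e.g. ('Ġ×', (3, 5)) and ('IJ', (4, 5)) into one unit, so the alternating shades line up with what the reader actually sees.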
Yes, in this case, the
Maybe this post can help in understanding how the byte-level works: #203 (comment)
OK this requires actual thinking. I'll tinker with it on the weekend and come back with something |
How would this compare to spaCy's displacy? A while back I did something to visualize (token classification) outputs that way. Something similar can be done just for tokens + annotations (just gotta write the huggingface->spacy align|formatter).
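For reference, the displacy route would look roughly like this, reusing this PR's Annotation fields (start, end, label) with spaCy's manual entity-rendering mode; this is a sketch, not part of the PR:

```python
from spacy import displacy

def render_with_displacy(text, annotations):
    # Build the dict format that displacy's manual mode expects: character
    # offsets plus a label for each span, no spaCy Doc required.
    doc_like = {
        "text": text,
        "ents": [
            {"start": a.start, "end": a.end, "label": a.label}
            for a in annotations
        ],
        "title": None,
    }
    # style="ent" renders highlighted spans; in a notebook this displays inline.
    return displacy.render(doc_like, style="ent", manual=True)
```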
Thanks!
Thank you @talolard, that looks great!
@talolard I don't think I have the authorization to push on the branch used for this PR. Maybe you disabled the option while opening the PR?
Fixed |
This is now ready to be merged! Sorry it took me so long to finalize it, I was a bit overwhelmed with things left to do last week, and was off this week.
Here is a summary of the little things I changed:
- Everything now lives in a single file, under tools. So in order to import the visualizer and the annotations we can do: from tokenizers.tools import EncodingVisualizer, Annotation (see the usage sketch after this list).
- Updated the setup.py file to help it package the lib with the newly added files.
- Updated the docstrings a bit, and included everything in the API Reference part of the docs.
- I finally removed the language gallery. This notebook has been a great help in debugging what was happening with the various languages, but I fear that it might be misleading for the end-user. BERT and Roberta are both trained on English and so it does not represent the end result that a tokenizer trained on each specific language would produce.
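For context, end-to-end usage looks roughly like this; the vocab path and annotation offsets are placeholders, so double-check the exact signatures against the API Reference:

```python
from tokenizers import BertWordPieceTokenizer
from tokenizers.tools import EncodingVisualizer, Annotation

# Placeholder vocab path: any tokenizer from the library should work here.
tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
visualizer = EncodingVisualizer(tokenizer)

text = "Hugging Face is based in NYC and Paris."
annotations = [
    Annotation(start=0, end=12, label="ORG"),
    Annotation(start=25, end=28, label="LOC"),
]

# In a notebook cell, this renders the tokens with the annotations overlaid.
visualizer(text, annotations=annotations)
```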
Thanks again @talolard, this is a really great addition to the library and will be very helpful in understanding the tokenization. It will be included in the next release!
Yay!!
This follows on from the discussion we had here.
What It Does
Users can get a visualization of the tokenized output with annotations.
Cool Features
Missing Stuff
It renders multiple UNK tags on top.
Notebook
There is a notebook in the examples folder.