Wrong model is used in example, should be character instead of subword model #12676

Merged 6 commits on Jul 13, 2021
Changes from all commits
12 changes: 9 additions & 3 deletions docs/source/model_doc/canine.rst
@@ -48,6 +48,12 @@ Tips:
(which has a predefined Unicode code point). For token classification tasks however, the downsampled sequence of
tokens needs to be upsampled again to match the length of the original character sequence (which is 2048). The
details for this can be found in the paper.
+- Models:
+
+    - `google/canine-c <https://huggingface.co/google/canine-c>`__: Pre-trained with autoregressive character loss,
+      12-layer, 768-hidden, 12-heads, 121M parameters (size ~500 MB).
+    - `google/canine-s <https://huggingface.co/google/canine-s>`__: Pre-trained with subword loss, 12-layer,
+      768-hidden, 12-heads, 121M parameters (size ~500 MB).

This model was contributed by `nielsr <https://huggingface.co/nielsr>`__. The original code can be found `here
<https://github.com/google-research/language/tree/master/language/canine>`__.
@@ -63,7 +69,7 @@ CANINE works on raw characters, so it can be used without a tokenizer:
from transformers import CanineModel
import torch

-model = CanineModel.from_pretrained('google/canine-s') # model pre-trained with autoregressive character loss
+model = CanineModel.from_pretrained('google/canine-c') # model pre-trained with autoregressive character loss

text = "hello world"
# use Python's built-in ord() function to turn each character into its unicode code point id
@@ -81,8 +87,8 @@ sequences to the same length):

from transformers import CanineTokenizer, CanineModel

-model = CanineModel.from_pretrained('google/canine-s')
-tokenizer = CanineTokenizer.from_pretrained('google/canine-s')
+model = CanineModel.from_pretrained('google/canine-c')
+tokenizer = CanineTokenizer.from_pretrained('google/canine-c')

inputs = ["Life is like a box of chocolates.", "You never know what you gonna get."]
encoding = tokenizer(inputs, padding="longest", truncation=True, return_tensors="pt")
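
For reference, this is roughly how the corrected snippets fit together once the change is applied. This is a minimal sketch, assuming a `transformers` release that includes CANINE; the forward-pass and output lines beyond the diff excerpt follow the surrounding documentation of `CanineModel`, not the diff itself.

```python
from transformers import CanineModel, CanineTokenizer
import torch

# model pre-trained with autoregressive character loss, matching the comment in the docs
model = CanineModel.from_pretrained('google/canine-c')

# tokenizer-free usage: CANINE consumes raw Unicode code points
text = "hello world"
input_ids = torch.tensor([[ord(char) for char in text]])
outputs = model(input_ids)
sequence_output = outputs.last_hidden_state  # shape (batch_size, num_chars, 768)

# batched usage: the tokenizer pads/truncates all sequences to the same length
tokenizer = CanineTokenizer.from_pretrained('google/canine-c')
inputs = ["Life is like a box of chocolates.", "You never know what you gonna get."]
encoding = tokenizer(inputs, padding="longest", truncation=True, return_tensors="pt")
outputs = model(**encoding)
pooled_output = outputs.pooler_output  # shape (batch_size, 768)
```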