<FrameworkSwitchCourse {fw} />
1. What is the order of the language modeling pipeline?
<Question choices={[ { text: "First, the model, which handles text and returns raw predictions. The tokenizer then makes sense of these predictions and converts them back to text when needed.", explain: "The model cannot understand text! The tokenizer must first tokenize the text and convert it to IDs so that it is understandable by the model." }, { text: "First, the tokenizer, which handles text and returns IDs. The model handles these IDs and outputs a prediction, which can be some text.", explain: "The model's prediction cannot be text straight away. The tokenizer has to be used in order to convert the prediction back to text!" }, { text: "The tokenizer handles text and returns IDs. The model handles these IDs and outputs a prediction. The tokenizer can then be used once again to convert these predictions back to some text.", explain: "Correct! The tokenizer can be used for both tokenizing and de-tokenizing.", correct: true } ]} />
2. How many dimensions does the tensor output by the base Transformer model have, and what are they?
<Question choices={[ { text: "2: The sequence length and the batch size", explain: "False! The tensor output by the model has a third dimension: hidden size." }, { text: "2: The sequence length and the hidden size", explain: "False! All Transformer models handle batches, even with a single sequence; that would be a batch size of 1!" }, { text: "3: The sequence length, the batch size, and the hidden size", explain: "Correct!", correct: true } ]} />
3. Which of the following is an example of subword tokenization?
<Question choices={[ { text: "WordPiece", explain: "Yes, that's one example of subword tokenization!", correct: true }, { text: "Character-based tokenization", explain: "Character-based tokenization is not a type of subword tokenization." }, { text: "Splitting on whitespace and punctuation", explain: "That's a word-based tokenization scheme!" }, { text: "BPE", explain: "Yes, that's one example of subword tokenization!", correct: true }, { text: "Unigram", explain: "Yes, that's one example of subword tokenization!", correct: true }, { text: "None of the above", explain: "Incorrect!" } ]} />
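To make the contrast with word-based and character-based schemes concrete, here is a minimal sketch of greedy longest-match subword splitting in the style of WordPiece. The `toy_wordpiece` helper and its vocabulary are hypothetical and written by hand for illustration; real tokenizers learn their vocabulary from a corpus, and this is not how the library implements them.

```python
def toy_wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical vocabulary, chosen by hand for this example.
toy_vocab = {"token", "##ization", "##izer", "un", "##related"}

print(toy_wordpiece("tokenization", toy_vocab))  # ['token', '##ization']
print(toy_wordpiece("tokenizer", toy_vocab))     # ['token', '##izer']
print(toy_wordpiece("xyz", toy_vocab))           # ['[UNK]']
```

The two related words share the `token` piece, which is the main appeal of subword schemes: frequent stems stay whole while rarer words are assembled from smaller parts.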
4. What is a model head?
<Question choices={[ { text: "A component of the base Transformer network that redirects tensors to their correct layers", explain: "Incorrect! There's no such component." }, { text: "Also known as the self-attention mechanism, it adapts the representation of a token according to the other tokens of the sequence", explain: "Incorrect! The self-attention layer does contain attention &quot;heads,&quot; but these are not adaptation heads." }, { text: "An additional component, usually made up of one or a few layers, to convert the transformer predictions to a task-specific output", explain: "That's right. Adaptation heads, also known simply as heads, come up in different forms: language modeling heads, question answering heads, sequence classification heads...", correct: true } ]} />
{#if fw === 'pt'}
5. What is an AutoModel?
<Question
choices={[
{
text: "A model that automatically trains on your data",
explain: "Incorrect. Are you confusing this with our AutoTrain product?"
},
{
text: "An object that returns the correct architecture based on the checkpoint",
explain: "Exactly: the <code>AutoModel</code> only needs to know the checkpoint from which to initialize to return the correct architecture.",
correct: true
},
{
text: "A model that automatically detects the language used for its inputs to load the correct weights",
explain: "Incorrect; while some checkpoints and models are capable of handling multiple languages, there are no built-in tools for automatic checkpoint selection according to language. You should head over to the Model Hub to find the best checkpoint for your task!"
}
]}
/>
{:else}
5. What is a TFAutoModel?
<Question
choices={[
{
text: "A model that automatically trains on your data",
explain: "Incorrect. Are you confusing this with our AutoTrain product?"
},
{
text: "An object that returns the correct architecture based on the checkpoint",
explain: "Exactly: the <code>TFAutoModel</code> only needs to know the checkpoint from which to initialize to return the correct architecture.",
correct: true
},
{
text: "A model that automatically detects the language used for its inputs to load the correct weights",
explain: "Incorrect; while some checkpoints and models are capable of handling multiple languages, there are no built-in tools for automatic checkpoint selection according to language. You should head over to the Model Hub to find the best checkpoint for your task!"
}
]}
/>
{/if}
6. What are the techniques to be aware of when batching sequences of different lengths together?
<Question choices={[ { text: "Truncating", explain: "Yes, truncation is a correct way of evening out sequences so that they fit in a rectangular shape. Is it the only one, though?", correct: true }, { text: "Returning tensors", explain: "While the other techniques allow you to return rectangular tensors, returning tensors isn't helpful when batching sequences together." }, { text: "Padding", explain: "Yes, padding is a correct way of evening out sequences so that they fit in a rectangular shape. Is it the only one, though?", correct: true }, { text: "Attention masking", explain: "Absolutely! Attention masks are of prime importance when handling sequences of different lengths. That's not the only technique to be aware of, however.", correct: true } ]} />
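As a rough illustration of how these techniques fit together, the sketch below truncates and pads lists of token IDs into a rectangle and builds the matching attention mask in plain Python. The `pad_and_truncate` helper, the ID values, and the padding ID of 0 are assumptions made for this example; in practice the tokenizer handles this when called with `padding=True` and `truncation=True`.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Return rectangular input IDs plus an attention mask (1 = real token)."""
    # Truncation: cut every sequence down to at most max_length tokens.
    # (Simplified: a real tokenizer also keeps special tokens in place.)
    truncated = [seq[:max_length] for seq in sequences]
    # Padding: extend shorter sequences up to the longest one in the batch.
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # Attention mask: tell the model which positions hold padding.
    attention_mask = [
        [1] * len(seq) + [0] * (width - len(seq)) for seq in truncated
    ]
    return input_ids, attention_mask


# Made-up token IDs for three sequences of different lengths.
batch = [[101, 7, 8, 9, 10, 11, 102], [101, 7, 102], [101, 7, 8, 9, 102]]
input_ids, attention_mask = pad_and_truncate(batch, max_length=5)

for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

Every row now has the same length, so the batch can be turned into a single tensor, and the mask lets the attention layers ignore the padded positions.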
7. What is the point of applying a SoftMax function to the logits output by a sequence classification model?
<Question choices={[ { text: "It softens the logits so that they're more reliable.", explain: "No, the SoftMax function does not affect the reliability of results." }, { text: "It applies a lower and upper bound so that they're understandable.", explain: "Correct! The resulting values are bounded between 0 and 1. That's not the only reason we use a SoftMax function, though.", correct: true }, { text: "The total sum of the output is then 1, resulting in a possible probabilistic interpretation.", explain: "Correct! That's not the only reason we use a SoftMax function, though.", correct: true } ]} />
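Both correct answers can be checked numerically. The sketch below is a minimal SoftMax in plain Python applied to made-up logits; the `softmax` helper and the logit values are illustrative and are not the output of any particular model.

```python
import math


def softmax(logits):
    """Turn raw scores into values that lie in [0, 1] and sum to 1."""
    # Subtracting the max does not change the result but avoids overflow.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-1.5, 2.8]  # hypothetical logits for a two-label classifier
probs = softmax(logits)

print(probs)
print(all(0.0 <= p <= 1.0 for p in probs))  # every value lies in [0, 1]
print(math.isclose(sum(probs), 1.0))        # the values sum to 1
```

The ordering of the scores is preserved, so the predicted label is the same before and after; only the scale changes.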
8. What method is most of the tokenizer API centered around?
<Question
choices={[
{
text: "<code>encode</code>, as it can encode text into IDs and IDs into predictions",
explain: "Wrong! While the <code>encode</code> method does exist on tokenizers, it does not exist on models."
},
{
text: "Calling the tokenizer object directly.",
explain: "Exactly! The <code>__call__</code> method of the tokenizer is a very powerful method which can handle pretty much anything. It is also the method used to retrieve predictions from a model.",
correct: true
},
{
text: "<code>pad</code>",
explain: "Wrong! Padding is very useful, but it's just one part of the tokenizer API."
},
{
text: "<code>tokenize</code>",
explain: "The <code>tokenize</code> method is arguably one of the most useful methods, but it isn't the core of the tokenizer API."
}
]}
/>
9. What does the `result` variable contain in this code sample?
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
result = tokenizer.tokenize("Hello!")
```
<Question
choices={[
{
text: "A list of strings, each string being a token",
explain: "Absolutely! Convert this to IDs, and send them to a model!",
correct: true
},
{
text: "A list of IDs",
explain: "Incorrect; that's what the <code>__call__</code> or <code>convert_tokens_to_ids</code> method is for!"
},
{
text: "A string containing all of the tokens",
explain: "This would be suboptimal, as the goal is to split the string into multiple tokens."
}
]}
/>
10. Is there something wrong with the following code?
{#if fw === 'pt'}
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("gpt2")

encoded = tokenizer("Hey!", return_tensors="pt")
result = model(**encoded)
```
<Question choices={[ { text: "No, it seems correct.", explain: "Unfortunately, coupling a model with a tokenizer that was trained with a different checkpoint is rarely a good idea. The model was not trained to make sense out of this tokenizer's output, so the model output (if it can even run!) will not make any sense." }, { text: "The tokenizer and model should always be from the same checkpoint.", explain: "Right!", correct: true }, { text: "It's good practice to pad and truncate with the tokenizer as every input is a batch.", explain: "It's true that every model input needs to be a batch. However, truncating or padding this sequence wouldn't necessarily make sense, as there is only one sequence here and those are techniques for batching together a list of sentences." } ]} />
{:else}
```python
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = TFAutoModel.from_pretrained("gpt2")

encoded = tokenizer("Hey!", return_tensors="tf")
result = model(**encoded)
```
<Question choices={[ { text: "No, it seems correct.", explain: "Unfortunately, coupling a model with a tokenizer that was trained with a different checkpoint is rarely a good idea. The model was not trained to make sense out of this tokenizer's output, so the model output (if it can even run!) will not make any sense." }, { text: "The tokenizer and model should always be from the same checkpoint.", explain: "Right!", correct: true }, { text: "It's good practice to pad and truncate with the tokenizer as every input is a batch.", explain: "It's true that every model input needs to be a batch. However, truncating or padding this sequence wouldn't necessarily make sense, as there is only one sequence here and those are techniques for batching together a list of sentences." } ]} />
{/if}