Added missing code in exemplary notebook - custom datasets fine-tuning #15300
Conversation
Added missing code to the `tokenize_and_align_labels` function in the example notebook on custom datasets (token classification). The missing code adds labels for all but the first token of each word. The added code was taken directly from the official Hugging Face example, this [colab notebook](https://github.com/huggingface/notebooks/blob/master/transformers_doc/custom_datasets.ipynb).
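For reference, here is a minimal sketch of the function with the branch the PR adds, as in the linked colab notebook; the `distilbert-base-uncased` checkpoint and the `tokens`/`ner_tags` column names are assumptions borrowed from the usual token-classification example, not part of the PR diff itself:

```python
from transformers import AutoTokenizer

# Sketch of the function under discussion; the checkpoint and column names are
# illustrative assumptions.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
label_all_tokens = False


def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # map each subtoken to its word
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:  # special tokens such as [CLS] and [SEP]
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # only label the first token of a given word
                label_ids.append(label[word_idx])
            else:  # the branch added by this PR: remaining subtokens of the same word
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
```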
The documentation is not available anymore as the PR was closed or merged.
I don't understand what you mean; the notebook has the same code as the example: they are automatically synced at each merge in the Transformers repo.
You're right. I pasted the wrong link, so the comparison made no sense. The link should have been to the colab notebook.
Those are two different tutorials, so it's normal that they have different code. The one in the main documentation is kept as simple as possible on purpose.
In the main documentation, there is a mention of "Only labeling the first token of a given word. Assign -100 to the other subtokens from the same word." However, that is not the case in the code below.
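As a tiny illustration of what "subtokens from the same word" means (the checkpoint name is an assumption; the printed values depend on the tokenizer):

```python
from transformers import AutoTokenizer

# One word can split into several subtokens; word_ids() exposes the mapping
# that the alignment code relies on. Checkpoint choice is illustrative.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoding = tokenizer(["HuggingFace", "is", "great"], is_split_into_words=True)
print(encoding.tokens())    # e.g. ['[CLS]', 'hugging', '##face', 'is', 'great', '[SEP]']
print(encoding.word_ids())  # e.g. [None, 0, 0, 1, 2, None]
```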
Aaaah, now I understand. Thank you for taking the time to explain the problem. I left a couple of comments to adjust the PR accordingly, so the code matches the text.
docs/source/custom_datasets.mdx
Outdated
```python
label_all_tokens = False
```
No need for this to keep things simple.
docs/source/custom_datasets.mdx
Outdated
```diff
@@ -326,7 +330,9 @@ def tokenize_and_align_labels(examples):
             label_ids.append(-100)
         elif word_idx != previous_word_idx:  # Only label the first token of a given word.
             label_ids.append(label[word_idx])
+        else:
+            label_ids.append(label[word_idx] if label_all_tokens else -100)
```
```diff
-            label_ids.append(label[word_idx] if label_all_tokens else -100)
+            label_ids.append(-100)
```
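A toy check of the simplified rule (pure Python, values are illustrative): every subtoken after the first one in a word ends up with -100, which the loss ignores.

```python
# Illustrative values: word_ids as a fast tokenizer would return them,
# and one label per original word.
word_ids = [None, 0, 0, 1, 2, None]
label = [3, 0, 7]

label_ids, previous_word_idx = [], None
for word_idx in word_ids:
    if word_idx is None:                      # special tokens
        label_ids.append(-100)
    elif word_idx != previous_word_idx:       # first subtoken of a word keeps the label
        label_ids.append(label[word_idx])
    else:                                     # later subtokens of the same word are ignored
        label_ids.append(-100)
    previous_word_idx = word_idx

print(label_ids)  # [-100, 3, -100, 0, 7, -100]
```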
I adjusted the code per your comments. I guess I could have been more precise from the beginning. Thanks.
Thanks for amending your PR!
What does this PR do?
Added missing code to the `tokenize_and_align_labels` function in the example notebook on custom datasets (token classification).
The missing code adds labels for all but the first token of each word.
The added code was taken directly from the official Hugging Face example, this [colab notebook](https://github.com/huggingface/notebooks/blob/master/transformers_doc/custom_datasets.ipynb).
Fixes # (issue)
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?