
Added missing code in exemplary notebook - custom datasets fine-tuning #15300

Merged
merged 2 commits into huggingface:master on Jan 25, 2022
Conversation

@Pawloch247 (Contributor) commented Jan 23, 2022

What does this PR do?

Added missing code in tokenize_and_align_labels function in the exemplary notebook on custom datasets - token classification.
The missing code concerns adding labels for all but the first token in a single word.
The added code was taken directly from the official Hugging Face example, this [colab notebook](https://github.com/huggingface/notebooks/blob/master/transformers_doc/custom_datasets.ipynb).
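The alignment logic being discussed can be illustrated with a minimal pure-Python sketch. `align_labels` is a hypothetical standalone helper that mirrors the structure of `tokenize_and_align_labels` (in the real notebook, the `word_ids` list comes from a fast tokenizer's `word_ids()` method, where `None` marks special tokens and repeated indices mark subwords of the same word):

```python
# Hypothetical standalone sketch of the label-alignment logic in this PR.
# word_ids mimics the output of a Hugging Face fast tokenizer's word_ids().

def align_labels(word_ids, labels, label_all_tokens=False):
    """Map word-level labels onto subword tokens.

    word_ids: e.g. [None, 0, 1, 1, None]; None = special token,
              repeated indices = subwords of the same word.
    labels:   one label per original word.
    """
    label_ids = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            label_ids.append(-100)           # special tokens: ignored by the loss
        elif word_idx != previous_word_idx:  # first subword of a word
            label_ids.append(labels[word_idx])
        else:                                # the else clause this PR adds
            label_ids.append(labels[word_idx] if label_all_tokens else -100)
        previous_word_idx = word_idx
    return label_ids

# Suppose word 1 is split into three subwords (indices repeat):
print(align_labels([None, 0, 1, 1, 1, None], labels=[3, 5]))
# -> [-100, 3, 5, -100, -100, -100]
```

With `label_all_tokens=True`, the non-first subwords would receive their word's label instead of `-100`.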

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@HuggingFaceDocBuilder commented Jan 23, 2022

The documentation is not available anymore as the PR was closed or merged.

@sgugger (Collaborator) commented Jan 24, 2022

I don't understand what you mean; the notebook has the same code as the example: they are automatically synced at each merge into the Transformers repo.

@Pawloch247 (Contributor, Author) commented:

You're right. I pasted the wrong link, so the comparison made no sense. The link should have been to the colab notebook:
https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb

In the colab notebook, tokenize_and_align_labels has the else clause which is missing from the notebook on GitHub, and hence is missing here: https://huggingface.co/docs/transformers/custom_datasets#token-classification-with-wnut-emerging-entities

@sgugger (Collaborator) commented Jan 25, 2022

Those are two different tutorials, so it's normal that they have different code. The one in the main documentation is kept as simple as possible on purpose.

@Pawloch247 (Contributor, Author) commented:

The main documentation says: "Only labeling the first token of a given word. Assign -100 to the other subtokens from the same word." However, the code below it, the tokenize_and_align_labels function, does not match: not only are the other subtokens assigned the true labels, but previous_word_idx is also never updated. This contradiction was confusing to me. Only after digging deeper (into the colab notebook) did I understand that this part of the code was missing from the official documentation. I do not think these few lines were omitted on purpose. If you don't think it makes a difference, feel free to close this PR (or let me know if I should be the one to close it).
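The two behaviours described above can be contrasted in a small hypothetical sketch (pure Python; `align_buggy` and `align_fixed` are illustrative names, and the `word_ids` list stands in for a fast tokenizer's `word_ids()` output). Because `previous_word_idx` is never updated in the documented version, the `elif` branch fires for every non-special token, so subwords after the first are labelled too:

```python
# Illustrative contrast: documented code vs. the corrected version.

def align_buggy(word_ids, labels):
    label_ids, previous_word_idx = [], None
    for word_idx in word_ids:
        if word_idx is None:
            label_ids.append(-100)
        elif word_idx != previous_word_idx:  # previous_word_idx stays None...
            label_ids.append(labels[word_idx])
        # ...so every non-special token is labelled, including
        # non-first subwords, contradicting the documented rule.
    return label_ids

def align_fixed(word_ids, labels):
    label_ids, previous_word_idx = [], None
    for word_idx in word_ids:
        if word_idx is None:
            label_ids.append(-100)
        elif word_idx != previous_word_idx:
            label_ids.append(labels[word_idx])
        else:
            label_ids.append(-100)  # ignore non-first subwords in the loss
        previous_word_idx = word_idx
    return label_ids

word_ids = [None, 0, 0, 1, None]      # word 0 split into two subwords
print(align_buggy(word_ids, [9, 2]))  # -> [-100, 9, 9, 2, -100]
print(align_fixed(word_ids, [9, 2]))  # -> [-100, 9, -100, 2, -100]
```

The `-100` value matters because PyTorch's cross-entropy loss ignores targets equal to its default `ignore_index` of -100, so those subword positions do not contribute to training.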

@sgugger (Collaborator) left a comment:

Aaaah, now I understand. Thank you for taking the time to explain the problem. I left a couple of comments to adjust the PR accordingly, so the code matches the text.

Comment on lines 313 to 315
```python
label_all_tokens = False
```
@sgugger (Collaborator): No need for this to keep things simple.

```
@@ -326,7 +330,9 @@ def tokenize_and_align_labels(examples):
             label_ids.append(-100)
         elif word_idx != previous_word_idx:  # Only label the first token of a given word.
             label_ids.append(label[word_idx])
+        else:
+            label_ids.append(label[word_idx] if label_all_tokens else -100)
```
@sgugger (Collaborator):

Suggested change:
```
-            label_ids.append(label[word_idx] if label_all_tokens else -100)
+            label_ids.append(-100)
```

@Pawloch247 (Contributor, Author) commented:

I applied your suggested changes. I guess I could have been more precise from the beginning. Thanks.

@sgugger (Collaborator) left a comment:

Thanks for amending your PR!

@sgugger sgugger merged commit e79a0fa into huggingface:master Jan 25, 2022
3 participants