
Added missing code in exemplary notebook - custom datasets fine-tuning #15300

Merged
merged 2 commits into huggingface:master on Jan 25, 2022
Conversation

@Pawloch247 (Contributor) commented Jan 23, 2022

What does this PR do?

Added missing code in tokenize_and_align_labels function in the exemplary notebook on custom datasets - token classification.
The missing code concerns adding labels for all but the first token in a single word.
The added code was taken directly from the official Hugging Face example, this [colab notebook](https://github.com/huggingface/notebooks/blob/master/transformers_doc/custom_datasets.ipynb).
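The alignment logic being discussed can be illustrated with a minimal pure-Python sketch. `align_labels` is a hypothetical standalone helper that mirrors the structure of `tokenize_and_align_labels` (in the real notebook, the `word_ids` list comes from a fast tokenizer's `word_ids()` method, where `None` marks special tokens and repeated indices mark subwords of the same word):

```python
# Hypothetical standalone sketch of the label-alignment logic in this PR.
# word_ids mimics the output of a Hugging Face fast tokenizer's word_ids().

def align_labels(word_ids, labels, label_all_tokens=False):
    """Map word-level labels onto subword tokens.

    word_ids: e.g. [None, 0, 1, 1, None]; None = special token,
              repeated indices = subwords of the same word.
    labels:   one label per original word.
    """
    label_ids = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            label_ids.append(-100)           # special tokens: ignored by the loss
        elif word_idx != previous_word_idx:  # first subword of a word
            label_ids.append(labels[word_idx])
        else:                                # the else clause this PR adds
            label_ids.append(labels[word_idx] if label_all_tokens else -100)
        previous_word_idx = word_idx
    return label_ids

# Suppose word 1 is split into three subwords (indices repeat):
print(align_labels([None, 0, 1, 1, 1, None], labels=[3, 5]))
# -> [-100, 3, 5, -100, -100, -100]
```

With `label_all_tokens=True`, the non-first subwords would receive their word's label instead of `-100`.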

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@HuggingFaceDocBuilder commented Jan 23, 2022

The documentation is not available anymore as the PR was closed or merged.

@sgugger (Collaborator) commented Jan 24, 2022

I don't understand what you mean; the notebook has the same code as the example: they are automatically synced at each merge into the Transformers repo.

@Pawloch247 (Contributor, Author) commented:

You're right. I pasted the wrong link, so the comparison made no sense. The link should have been to the colab notebook:
https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb

In the colab notebook, tokenize_and_align_labels has the else clause which is missing from the notebook on GitHub, and hence is missing here: https://huggingface.co/docs/transformers/custom_datasets#token-classification-with-wnut-emerging-entities

@sgugger (Collaborator) commented Jan 25, 2022

Those are two different tutorials, so it's normal that they have different code. The one in the main documentation is kept as simple as possible on purpose.

@Pawloch247 (Contributor, Author) commented:

The main documentation says: "Only labeling the first token of a given word. Assign -100 to the other subtokens from the same word." However, the code below it, the tokenize_and_align_labels function, does not match: not only are the other subtokens assigned the true labels, but previous_word_idx is also never updated. This contradiction was confusing to me. Only after digging deeper (into the colab notebook) did I understand that this part of the code was missing from the official documentation. I do not think these few lines were omitted on purpose. If you don't think it makes a difference, feel free to close this PR (or let me know if I should be the one to close it).
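The two behaviours described above can be contrasted in a small hypothetical sketch (pure Python; `align_buggy` and `align_fixed` are illustrative names, and the `word_ids` list stands in for a fast tokenizer's `word_ids()` output). Because `previous_word_idx` is never updated in the documented version, the `elif` branch fires for every non-special token, so subwords after the first are labelled too:

```python
# Illustrative contrast: documented code vs. the corrected version.

def align_buggy(word_ids, labels):
    label_ids, previous_word_idx = [], None
    for word_idx in word_ids:
        if word_idx is None:
            label_ids.append(-100)
        elif word_idx != previous_word_idx:  # previous_word_idx stays None...
            label_ids.append(labels[word_idx])
        # ...so every non-special token is labelled, including
        # non-first subwords, contradicting the documented rule.
    return label_ids

def align_fixed(word_ids, labels):
    label_ids, previous_word_idx = [], None
    for word_idx in word_ids:
        if word_idx is None:
            label_ids.append(-100)
        elif word_idx != previous_word_idx:
            label_ids.append(labels[word_idx])
        else:
            label_ids.append(-100)  # ignore non-first subwords in the loss
        previous_word_idx = word_idx
    return label_ids

word_ids = [None, 0, 0, 1, None]      # word 0 split into two subwords
print(align_buggy(word_ids, [9, 2]))  # -> [-100, 9, 9, 2, -100]
print(align_fixed(word_ids, [9, 2]))  # -> [-100, 9, -100, 2, -100]
```

The `-100` value matters because PyTorch's cross-entropy loss ignores targets equal to its default `ignore_index` of -100, so those subword positions do not contribute to training.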

@sgugger (Collaborator) left a comment:

Aaaah, now I understand. Thank you for taking the time to explain the problem. I left a couple of comments to adjust the PR accordingly, so the code matches the text.

Comment on lines 313 to 315
```python
label_all_tokens = False
```
@sgugger (Collaborator): No need for this to keep things simple.

```
@@ -326,7 +330,9 @@ def tokenize_and_align_labels(examples):
             label_ids.append(-100)
         elif word_idx != previous_word_idx:  # Only label the first token of a given word.
             label_ids.append(label[word_idx])
+        else:
+            label_ids.append(label[word_idx] if label_all_tokens else -100)
```
@sgugger (Collaborator):

Suggested change:
```
-            label_ids.append(label[word_idx] if label_all_tokens else -100)
+            label_ids.append(-100)
```

@Pawloch247 (Contributor, Author) commented:

I applied your suggested changes. I guess I could have been more precise from the beginning. Thanks.

@sgugger (Collaborator) left a comment:

Thanks for amending your PR!

@sgugger sgugger merged commit e79a0fa into huggingface:master Jan 25, 2022
3 participants