
Add labels padding in tokenization_utils_base.py #8116

Closed
wants to merge 1 commit

Conversation

@cccntu (Contributor) commented Oct 28, 2020

What does this PR do?

This PR makes tokenizer.pad() also pad 'labels'.

I tried to use this:

class DataCollatorWithPadding:

But since labels are not padded, the result cannot be converted into a tensor: ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
This patch solves the problem.

It seems logical to me that tokenizer.pad() should also pad 'labels'.
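For context, here is a minimal sketch of the setup that triggers the error, assuming a token-classification style dataset where each example carries a variable-length labels list (the model name and token IDs are illustrative only):

from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
collator = DataCollatorWithPadding(tokenizer)

# Two examples whose input_ids and labels have different lengths.
features = [
    {"input_ids": [101, 7592, 102], "labels": [0, 1, 0]},
    {"input_ids": [101, 7592, 2088, 999, 102], "labels": [0, 1, 2, 1, 0]},
]

# tokenizer.pad() pads input_ids/attention_mask but leaves labels untouched,
# so converting the batch to tensors raises the ValueError quoted above.
batch = collator(features)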

This portion of code was last changed in #4015 @n1t0 @thomwolf @LysandreJik

@sgugger (Collaborator) commented Oct 29, 2020

Hi there! Thanks for your PR! I see a few problems with this approach.

  1. Not all labels need to be padded. If you are doing classification (with one or multiple labels), you don't want to pad them.
  2. I imagine you are in a token classification problem, and in those, the number of labels is not necessarily the same as the number of tokens, as the labels are for words and tokens can be parts of words.

I think the proper fix is to create an option in DataCollatorWithPadding to activate label padding (so a flag pad_labels_too or something like that) that then pads the labels to the maximum length of the labels (so the difference you compute here might be a different number for the labels).
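A rough sketch of what that opt-in flag could look like, written as a collator rather than a change to tokenizer.pad(); the class name, pad_labels_too, and label_pad_token_id below are illustrative only and not part of the library or of this PR:

from dataclasses import dataclass
from typing import Any, Dict, List

import torch
from transformers import DataCollatorWithPadding


@dataclass
class CollatorWithOptionalLabelPadding(DataCollatorWithPadding):
    # Hypothetical flag discussed above; off by default so classification labels stay scalar.
    pad_labels_too: bool = False
    label_pad_token_id: int = -100

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
        # Pull labels out first so tokenizer.pad() never sees ragged label lists.
        labels = [f.pop("labels") for f in features] if "labels" in features[0] else None
        batch = super().__call__(features)
        if labels is not None:
            if self.pad_labels_too:
                # Pad to the longest label sequence, which may differ from the token length.
                max_len = max(len(lab) for lab in labels)
                labels = [lab + [self.label_pad_token_id] * (max_len - len(lab)) for lab in labels]
            batch["labels"] = torch.tensor(labels)
        return batch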

@cccntu (Contributor, Author) commented Oct 29, 2020

Thanks for the reply!

Considering that different problems may pad labels differently, I think it may be better to leave it as is and use something like this:

from typing import Dict, List, Union
import torch
from transformers import DataCollatorWithPadding

class MyDataCollatorWithPadding(DataCollatorWithPadding):
    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        batch = super().__call__(features)
        # add custom label padding here
        return batch

Just came up with this. 😃 Not sure if it works.

@cccntu (Contributor, Author) commented Nov 4, 2020

Just tried it; the above code does not work, because the error is raised inside self.tokenizer.pad().
Here is the truncated trace:

src/transformers/data/data_collator.py", line 103, in __call__
    batch = self.tokenizer.pad(
src/transformers/tokenization_utils_base.py", line 2408, in pad
    return BatchEncoding(batch_outputs, tensor_type=return_tensors)
src/transformers/tokenization_utils_base.py", line 186, in __init__
    self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
src/transformers/tokenization_utils_base.py", line 571, in convert_to_tensors
    raise ValueError(
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

Therefore pad_labels_too needs to be in tokenizer.pad().
@sgugger

the number of labels is not necessarily the same as the number of tokens, as the labels are for words and tokens can be parts of words.

Maybe we will need a LabelPaddingStrategy similar to PaddingStrategy. But I don't know what other kinds of label padding strategies need to be added.
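To make that idea concrete, such a LabelPaddingStrategy might mirror the existing PaddingStrategy enum; the sketch below is purely hypothetical and never existed in the library:

from enum import Enum


class LabelPaddingStrategy(Enum):
    # Hypothetical counterpart to PaddingStrategy; not part of transformers.
    DO_NOT_PAD = "do_not_pad"          # classification: labels are scalars, leave them alone
    LONGEST_LABELS = "longest_labels"  # token classification: pad to the longest label list
    MATCH_INPUT = "match_input"        # pad labels to the padded input_ids length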

@sgugger (Collaborator) commented Nov 4, 2020

I think you should use the newly pushed DataCollatorForTokenClassification from #8274.
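For reference, that collator pads labels itself (with -100 by default, which the cross-entropy loss ignores), so no subclassing is needed. A typical usage, to the best of my understanding of that API:

from transformers import AutoTokenizer, DataCollatorForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# Pads input_ids/attention_mask via tokenizer.pad() and pads "labels" separately
# with label_pad_token_id (default -100).
collator = DataCollatorForTokenClassification(tokenizer)

features = [
    {"input_ids": [101, 7592, 102], "labels": [0, 1, 0]},
    {"input_ids": [101, 7592, 2088, 999, 102], "labels": [0, 1, 2, 1, 0]},
]
batch = collator(features)  # all tensors now share the same sequence length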

@cccntu (Contributor, Author) commented Nov 4, 2020

Very nice! I guess I will close this PR.

@cccntu closed this Nov 4, 2020