
Add labels padding in tokenization_utils_base.py #8116

Closed
wants to merge 1 commit

Conversation

@cccntu (Contributor) commented Oct 28, 2020

What does this PR do?

This PR makes tokenizer.pad() also pad 'labels'.

I tried to use this:

class DataCollatorWithPadding:

But since labels are not padded, the result cannot be converted into a tensor: ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
This patch solves the problem.

It seems logical to me that tokenizer.pad() should also pad 'labels'.
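For context, here is a minimal sketch of the setup that triggers the error, assuming a token-classification style dataset where each example carries a variable-length labels list (the model name and token IDs are illustrative only):

from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
collator = DataCollatorWithPadding(tokenizer)

# Two examples whose input_ids and labels have different lengths.
features = [
    {"input_ids": [101, 7592, 102], "labels": [0, 1, 0]},
    {"input_ids": [101, 7592, 2088, 999, 102], "labels": [0, 1, 2, 1, 0]},
]

# tokenizer.pad() pads input_ids/attention_mask but leaves labels untouched,
# so converting the batch to tensors raises the ValueError quoted above.
batch = collator(features)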

This portion of code was last changed in #4015 @n1t0 @thomwolf @LysandreJik

@sgugger (Collaborator) commented Oct 29, 2020

Hi there! Thanks for your PR! I see a few problems with this approach.

  1. Not all labels need to be padded. If you are doing classification (with one or multiple labels), you don't want to pad them.
  2. I imagine you are in a token classification problem, and in those, the number of labels is not necessarily the same as the number of tokens, as the labels are for words and tokens can be parts of words.

I think the proper fix is to create an option in DataCollatorWithPadding to activate label padding (so a flag pad_labels_too or something like that) that then pads the labels to the maximum length of the labels (so the difference you compute here might be a different number for the labels).
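A rough sketch of what that opt-in flag could look like, written as a collator rather than a change to tokenizer.pad(); the class name, pad_labels_too, and label_pad_token_id below are illustrative only and not part of the library or of this PR:

from dataclasses import dataclass
from typing import Any, Dict, List

import torch
from transformers import DataCollatorWithPadding


@dataclass
class CollatorWithOptionalLabelPadding(DataCollatorWithPadding):
    # Hypothetical flag discussed above; off by default so classification labels stay scalar.
    pad_labels_too: bool = False
    label_pad_token_id: int = -100

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
        # Pull labels out first so tokenizer.pad() never sees ragged label lists.
        labels = [f.pop("labels") for f in features] if "labels" in features[0] else None
        batch = super().__call__(features)
        if labels is not None:
            if self.pad_labels_too:
                # Pad to the longest label sequence, which may differ from the token length.
                max_len = max(len(lab) for lab in labels)
                labels = [lab + [self.label_pad_token_id] * (max_len - len(lab)) for lab in labels]
            batch["labels"] = torch.tensor(labels)
        return batch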

@cccntu (Contributor, Author) commented Oct 29, 2020

Thanks for the reply!

Considering that different problems may pad labels differently, I think it may be better to leave it as is and use something like this:

from typing import Dict, List, Union
import torch
from transformers import DataCollatorWithPadding

class MyDataCollatorWithPadding(DataCollatorWithPadding):
    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        batch = super().__call__(features)
        # add custom label padding here
        return batch

Just came up with this. 😃 Not sure if it works.

@cccntu (Contributor, Author) commented Nov 4, 2020

Just tried it; the above code does not work, because the error is raised inside self.tokenizer.pad().
Here is the truncated trace:

src/transformers/data/data_collator.py", line 103, in __call__
    batch = self.tokenizer.pad(
src/transformers/tokenization_utils_base.py", line 2408, in pad
    return BatchEncoding(batch_outputs, tensor_type=return_tensors)
src/transformers/tokenization_utils_base.py", line 186, in __init__
    self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
src/transformers/tokenization_utils_base.py", line 571, in convert_to_tensors
    raise ValueError(
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

Therefore pad_labels_too needs to be in tokenizer.pad().
@sgugger

the number of labels is not necessarily the same as the number of tokens, as the labels are for words and tokens can be parts of words.

Maybe we will need a LabelPaddingStrategy similar to PaddingStrategy. But I don't know what other kinds of label padding strategies need to be added.
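To make that idea concrete, such a LabelPaddingStrategy might mirror the existing PaddingStrategy enum; the sketch below is purely hypothetical and never existed in the library:

from enum import Enum


class LabelPaddingStrategy(Enum):
    # Hypothetical counterpart to PaddingStrategy; not part of transformers.
    DO_NOT_PAD = "do_not_pad"          # classification: labels are scalars, leave them alone
    LONGEST_LABELS = "longest_labels"  # token classification: pad to the longest label list
    MATCH_INPUT = "match_input"        # pad labels to the padded input_ids length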

@sgugger (Collaborator) commented Nov 4, 2020

I think you should use the newly pushed DataCollatorForTokenClassification from #8274.
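For reference, that collator pads labels itself (with -100 by default, which the cross-entropy loss ignores), so no subclassing is needed. A typical usage, to the best of my understanding of that API:

from transformers import AutoTokenizer, DataCollatorForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# Pads input_ids/attention_mask via tokenizer.pad() and pads "labels" separately
# with label_pad_token_id (default -100).
collator = DataCollatorForTokenClassification(tokenizer)

features = [
    {"input_ids": [101, 7592, 102], "labels": [0, 1, 0]},
    {"input_ids": [101, 7592, 2088, 999, 102], "labels": [0, 1, 2, 1, 0]},
]
batch = collator(features)  # all tensors now share the same sequence length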

@cccntu (Contributor, Author) commented Nov 4, 2020

Very nice! I guess I will close this PR.

@cccntu closed this Nov 4, 2020