Hi,
**DataCollatorForWholeWordMask** does not output `attention_mask`. According to the `__call__` method, it only returns:
`return {"input_ids": inputs, "labels": labels}`
Is there a particular motivation behind this, or is it a small bug? From what I can see, during pre-training most instances will not have the same length, and attending over all tokens (including the padding) may produce imprecise results.
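For now I am working around it with something like the sketch below: a subclass that derives `attention_mask` from `input_ids` after the parent collator has built the batch. The subclass name is my own, and I am assuming the batch comes back as PyTorch tensors and that the tokenizer defines `pad_token_id`.

```python
from transformers import BertTokenizerFast, DataCollatorForWholeWordMask

# Hypothetical workaround (not part of transformers): add an attention_mask
# derived from the padded input_ids returned by the parent collator.
class WholeWordMaskCollatorWithAttentionMask(DataCollatorForWholeWordMask):
    def __call__(self, examples):
        batch = super().__call__(examples)  # {"input_ids": ..., "labels": ...}
        # Mark non-pad positions with 1 and pad positions with 0 so the model
        # can ignore padding during attention. Assumes torch tensors and a
        # tokenizer with a defined pad_token_id.
        batch["attention_mask"] = (
            batch["input_ids"] != self.tokenizer.pad_token_id
        ).long()
        return batch

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
collator = WholeWordMaskCollatorWithAttentionMask(
    tokenizer=tokenizer, mlm_probability=0.15
)
```

Is this the intended way to handle it, or should the collator return `attention_mask` itself?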