Hi,

I have a question about the input construction. I get the basic idea from the clear figure in the paper, but I guess the "original text" is not actually a single sentence; rather, it is a run of consecutive tokens that presumably crosses sentence boundaries. One option described in the RoBERTa paper is to pack each input sequence with consecutive tokens up to max_seq_len (say, 512), which can cross sentence or document boundaries.

Could you explain how the input is actually constructed, or point to the relevant code? This self-contained library looks great, but it is hard for me to pinpoint where the input construction happens.

Replies: 1 comment

We do concatenate across documents. The code is here:
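As a rough illustration of what "concatenate across documents" can mean in practice, here is a minimal, hypothetical packing sketch, not the library's actual implementation: tokenized documents are joined into one token stream, with an assumed separator id marking document boundaries, and the stream is sliced into fixed-length blocks of max_seq_len tokens, so a single block can cross sentence and document boundaries. The function name `pack_sequences` and the `sep_id` and `drop_last` parameters are illustrative assumptions.

```python
# Hypothetical sketch of cross-document packing (not this library's actual code).
# Tokenized documents are concatenated into one stream, with a separator token
# between documents, then sliced into fixed-length blocks of max_seq_len tokens.
from typing import Iterable, List


def pack_sequences(
    tokenized_docs: Iterable[List[int]],
    max_seq_len: int = 512,
    sep_id: int = 2,          # assumed id of the separator token ([SEP] / </s>)
    drop_last: bool = True,   # whether to discard the final short block
) -> List[List[int]]:
    stream: List[int] = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(sep_id)  # mark the document boundary, then keep packing

    blocks = [
        stream[i : i + max_seq_len]
        for i in range(0, len(stream), max_seq_len)
    ]
    if drop_last and blocks and len(blocks[-1]) < max_seq_len:
        blocks.pop()  # the trailing block is shorter than max_seq_len
    return blocks


# Example: two toy "documents" packed into blocks of 8 tokens.
docs = [[5, 6, 7, 8, 9], [10, 11, 12, 13, 14, 15, 16]]
for block in pack_sequences(docs, max_seq_len=8, drop_last=False):
    print(block)
```

In this toy example the first printed block spans the boundary between the two documents (the separator id sits in the middle of the block), which is the behavior the question asks about: sequences are filled with consecutive tokens rather than aligned to sentence or document boundaries.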