Batchwise padding dataset #121

mrghofrani · 2022-07-11T18:38:10Z

Hello
I'm pretty new to PyTorch, so sorry if this question is too simple. Because of memory limits, I can't pad my whole dataset at once. What is the simplest way to move the pad_dataset function into the training process, that is, to pad each batch separately instead? For ease of reference, I've added pad_dataset below.
Thanks.

def pad_dataset(dataset, padding=0):
    """ Pad the dataset. This could be optimized by defining a Dataset class and padding at the batch level, but this is simpler. """
    max_l = max(len(x) for x in dataset["input_ids"])
    for name in PADDED_INPUTS:
        dataset[name] = [x + [padding if name != "lm_labels" else -100] * (max_l - len(x)) for x in dataset[name]]
    return dataset
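
To make the question concrete, here is a rough sketch of what padding at the batch level might look like. It assumes each example is a dict mapping field names to lists of token ids, and that PADDED_INPUTS holds the three names shown (the real list is not included in this post, so these names are a guess). pad_batch is a hypothetical helper, not an existing function.

```python
# Assumed field names; the actual PADDED_INPUTS is not shown in the post.
PADDED_INPUTS = ["input_ids", "lm_labels", "token_type_ids"]


def pad_batch(batch, padding=0):
    """Pad each field to the longest input_ids in this batch only (hypothetical helper)."""
    # The maximum length is computed per batch, not over the whole dataset.
    max_l = max(len(example["input_ids"]) for example in batch)
    padded = {}
    for name in PADDED_INPUTS:
        # Labels are padded with -100, matching pad_dataset.
        fill = -100 if name == "lm_labels" else padding
        padded[name] = [
            example[name] + [fill] * (max_l - len(example[name]))
            for example in batch
        ]
    return padded


# Two made-up examples of different lengths.
batch = [
    {"input_ids": [5, 6, 7], "lm_labels": [-100, 6, 7], "token_type_ids": [1, 1, 2]},
    {"input_ids": [8], "lm_labels": [8], "token_type_ids": [1]},
]
padded = pad_batch(batch)
print(padded["input_ids"])
```

A function like this could probably be passed as collate_fn to torch.utils.data.DataLoader, with each padded list converted via torch.tensor before it is returned, so that only one batch is padded in memory at a time.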
