[seq2seq] memory regression #9261

Closed
stas00 opened this issue Dec 22, 2020 · 7 comments · Fixed by #9713

Comments

@stas00
Contributor

stas00 commented Dec 22, 2020

#9241 introduced a memory regression - found out via git bisect.

I was able to use BS=12 before this PR got merged, and now only BS=8 works, with:

 export BS=12; rm -r output_dir; CUDA_VISIBLE_DEVICES=0,1 PYTHONPATH=../../src USE_TF=0   python -m torch.distributed.launch --nproc_per_node=2 --master_port=9910  ./finetune_trainer.py --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size $BS --sortish_sampler --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 --n_train 500 --sharded_ddp --fp16

We really need to go back to that issue of memory benchmarks in CI and figure out how to make it happen.

The problem is that I started working on it some months back but gave up, since each GPU gave different numbers...

For details please see: #6045
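
One rough direction for such a check, as a minimal sketch (the baseline file, the key name, and the 5% tolerance below are made up; recording baselines per GPU model would be one way around different GPUs giving different numbers):

import json

import torch


def measure_peak_mb(train_fn):
    """Run a short, fixed training loop and return the peak allocated GPU memory in MB."""
    torch.cuda.reset_peak_memory_stats()
    train_fn()  # e.g. the finetune_trainer.py command above with --n_train 500
    return torch.cuda.max_memory_allocated() / 2**20


def check_memory_regression(train_fn, baseline_file="memory_baselines.json", key="mbart_en_ro_bs12"):
    """Hypothetical CI check (not an existing test): fail if peak memory grew >5% over the baseline recorded for this GPU model."""
    peak = measure_peak_mb(train_fn)
    device_name = torch.cuda.get_device_name(0)
    baseline = json.load(open(baseline_file))[device_name][key]
    assert peak <= baseline * 1.05, f"peak {peak:.0f}MB exceeds baseline {baseline:.0f}MB on {device_name}"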

edit: we should also make sure that --label_smoothing 0.1 --fp16 --fp16_backend apex works; see #9261 (comment)

@patrickvonplaten, should we figure this out in the new year?

@patrickvonplaten
Contributor

patrickvonplaten commented Dec 22, 2020

Yes, we really should take a stab at better speed and memory regression testing. Big new years resolution!

@stas00
Contributor Author

stas00 commented Dec 22, 2020

This specific commit introduced the regression:
fe7960b

@stas00
Contributor Author

stas00 commented Dec 22, 2020

There is a second problem:

Same as above but with apex:

--label_smoothing 0.1 --fp16 --fp16_backend apex

hangs about 5% into training - spinning CPU (not OOMing) - I had to kill it.

I checked before this PR was merged - no hanging.

Full command:

export BS=12; rm -r output_dir; CUDA_VISIBLE_DEVICES=0,1 PYTHONPATH=../../src USE_TF=0   python -m torch.distributed.launch --nproc_per_node=2 --master_port=9910  ./finetune_trainer.py --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_train --freeze_embeds --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size $BS --sortish_sampler --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 --n_train 500 --sharded_ddp --label_smoothing 0.1 --fp16 --fp16_backend apex

(Pre this PR the run does OOM some time later into training, but it doesn't hang.)

@stas00
Contributor Author

stas00 commented Dec 22, 2020

So both problems seem to be related to label smoothing. @sgugger has been testing hypotheses, and this one worked:

# trainer.py (top)
def label_smoothed_nll_loss(lprobs, target, epsilon, ignore_index=-100):
    """From fairseq"""
    if target.dim() == lprobs.dim() - 1:
        target = target.unsqueeze(-1)
    nll_loss = -lprobs.gather(dim=-1, index=target)
    smooth_loss = -lprobs.sum(dim=-1, keepdim=True)
    if ignore_index is not None:
        pad_mask = target.eq(ignore_index)
        nll_loss.masked_fill_(pad_mask, 0.0)
        smooth_loss.masked_fill_(pad_mask, 0.0)
    else:
        nll_loss = nll_loss.squeeze(-1)
        smooth_loss = smooth_loss.squeeze(-1)
    nll_loss = nll_loss.sum()  # mean()? Scared to break other math.
    smooth_loss = smooth_loss.sum()
    eps_i = epsilon / lprobs.size(-1)
    loss = (1.0 - epsilon) * nll_loss + eps_i * smooth_loss
    return loss, nll_loss

    # trainer.py (in Trainer class)
    def compute_loss(self, model, inputs):
        labels = inputs.pop("labels")
        logits = model(**inputs)[0]
        return label_smoothed_nll_loss(logits.view(-1, logits.shape[-1]), labels.view(-1), self.args.label_smoothing_factor)[0]

edit: @sgugger says that this code wasn't right, so we currently don't have a solution yet. Will keep on experimenting.
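
For quick iteration outside the Trainer, a standalone harness for a loss with this signature (with the function above in scope; the dummy shapes, vocab size, and pad id below are made up, and this only exercises shapes and gradients, not the memory behaviour):

import torch
import torch.nn.functional as F

# dummy data just to sanity-check the loss, not a real batch
batch, seq_len, vocab, pad_id = 2, 5, 11, 1
logits = torch.randn(batch * seq_len, vocab, requires_grad=True)
labels = torch.randint(2, vocab, (batch * seq_len,))
labels[-3:] = pad_id  # simulate padded target positions

lprobs = F.log_softmax(logits, dim=-1)
loss, nll_loss = label_smoothed_nll_loss(lprobs, labels, epsilon=0.1, ignore_index=pad_id)
loss.backward()
print(loss.item(), nll_loss.item())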

@ghost

ghost commented Dec 26, 2020

Hi.
Related to this bug is my bug report here: #9311
Is there a workaround that would let me move past the memory issue for now? Thanks.

@stas00
Contributor Author

stas00 commented Dec 26, 2020

Well, I don't think it's related other than both using up more RAM ;) This regression happened in a very recent change, but you're using a much older transformers version.

I will follow up in the issue you linked to.

@stas00
Contributor Author

stas00 commented Jan 6, 2021

So --fp16 seems to be involved: if I remove it, the regression goes away.
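
To isolate the fp16 effect outside finetune_trainer.py, a minimal sketch (the toy model and sizes are made up; this only compares autocast vs. fp32 peak memory on a single linear layer, not the full sharded_ddp setup):

import torch

def peak_mb(use_fp16):
    """Forward + backward on a toy model, return peak allocated CUDA memory in MB."""
    torch.cuda.reset_peak_memory_stats()
    model = torch.nn.Linear(4096, 4096).cuda()
    x = torch.randn(64, 4096, device="cuda")
    with torch.cuda.amp.autocast(enabled=use_fp16):
        loss = model(x).float().pow(2).mean()
    loss.backward()
    return torch.cuda.max_memory_allocated() / 2**20

print("fp32 peak MB:", peak_mb(False))
print("fp16 peak MB:", peak_mb(True))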
