[seq2seq] memory regression #9261

Closed
stas00 opened this issue Dec 22, 2020 · 7 comments · Fixed by #9713

Comments

@stas00
Contributor

stas00 commented Dec 22, 2020

#9241 introduced a memory regression - found out via git bisect.

I was able to use BS=12 before this PR got merged, and now only BS=8 works, with:

 export BS=12; rm -r output_dir; CUDA_VISIBLE_DEVICES=0,1 PYTHONPATH=../../src USE_TF=0   python -m torch.distributed.launch --nproc_per_node=2 --master_port=9910  ./finetune_trainer.py --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size $BS --sortish_sampler --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 --n_train 500 --sharded_ddp --fp16

We really need to go back to that issue of memory benchmarks in CI and figure out how to make it happen.

The problem is that I started working on it some months back but gave up, since each GPU gave different numbers...

For details please see: #6045
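
One rough direction for such a check, as a minimal sketch (the baseline file, the key name, and the 5% tolerance below are made up; recording baselines per GPU model would be one way around different GPUs giving different numbers):

import json

import torch


def measure_peak_mb(train_fn):
    """Run a short, fixed training loop and return the peak allocated GPU memory in MB."""
    torch.cuda.reset_peak_memory_stats()
    train_fn()  # e.g. the finetune_trainer.py command above with --n_train 500
    return torch.cuda.max_memory_allocated() / 2**20


def check_memory_regression(train_fn, baseline_file="memory_baselines.json", key="mbart_en_ro_bs12"):
    """Hypothetical CI check (not an existing test): fail if peak memory grew >5% over the baseline recorded for this GPU model."""
    peak = measure_peak_mb(train_fn)
    device_name = torch.cuda.get_device_name(0)
    baseline = json.load(open(baseline_file))[device_name][key]
    assert peak <= baseline * 1.05, f"peak {peak:.0f}MB exceeds baseline {baseline:.0f}MB on {device_name}"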

edit: we should also make sure that --label_smoothing 0.1 --fp16 --fp16_backend apex works; see #9261 (comment)

@patrickvonplaten, should we figure this out in the new year?

@patrickvonplaten
Contributor

patrickvonplaten commented Dec 22, 2020

Yes, we really should take a stab at better speed and memory regression testing. Big new years resolution!

@stas00
Contributor Author

stas00 commented Dec 22, 2020

This specific commit introduced the regression:
fe7960b

@stas00
Contributor Author

stas00 commented Dec 22, 2020

There is a second problem:

Same as above but with apex:

--label_smoothing 0.1 --fp16 --fp16_backend apex

hangs about 5% into training - spinning CPU (not OOMing) - I had to kill it.

I checked before this PR was merged - no hanging.

Full command:

export BS=12; rm -r output_dir; CUDA_VISIBLE_DEVICES=0,1 PYTHONPATH=../../src USE_TF=0   python -m torch.distributed.launch --nproc_per_node=2 --master_port=9910  ./finetune_trainer.py --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_train --freeze_embeds --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size $BS --sortish_sampler --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 --n_train 500 --sharded_ddp --label_smoothing 0.1 --fp16 --fp16_backend apex

(Pre this PR the run does OOM some time later into training, but it doesn't hang.)

@stas00
Contributor Author

stas00 commented Dec 22, 2020

So both problems seem to be related to label smoothing. @sgugger has been testing hypotheses, and this one worked:

# trainer.py (top)
def label_smoothed_nll_loss(lprobs, target, epsilon, ignore_index=-100):
    """From fairseq"""
    if target.dim() == lprobs.dim() - 1:
        target = target.unsqueeze(-1)
    nll_loss = -lprobs.gather(dim=-1, index=target)
    smooth_loss = -lprobs.sum(dim=-1, keepdim=True)
    if ignore_index is not None:
        pad_mask = target.eq(ignore_index)
        nll_loss.masked_fill_(pad_mask, 0.0)
        smooth_loss.masked_fill_(pad_mask, 0.0)
    else:
        nll_loss = nll_loss.squeeze(-1)
        smooth_loss = smooth_loss.squeeze(-1)
    nll_loss = nll_loss.sum()  # mean()? Scared to break other math.
    smooth_loss = smooth_loss.sum()
    eps_i = epsilon / lprobs.size(-1)
    loss = (1.0 - epsilon) * nll_loss + eps_i * smooth_loss
    return loss, nll_loss

    # trainer.py (in Trainer class)
    def compute_loss(self, model, inputs):
        labels = inputs.pop("labels")
        logits = model(**inputs)[0]
        return label_smoothed_nll_loss(logits.view(-1, logits.shape[-1]), labels.view(-1), self.args.label_smoothing_factor)[0]

edit: @sgugger says that this code wasn't right, so we currently don't have a solution yet. Will keep on experimenting.
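
For quick iteration outside the Trainer, a standalone harness for a loss with this signature (with the function above in scope; the dummy shapes, vocab size, and pad id below are made up, and this only exercises shapes and gradients, not the memory behaviour):

import torch
import torch.nn.functional as F

# dummy data just to sanity-check the loss, not a real batch
batch, seq_len, vocab, pad_id = 2, 5, 11, 1
logits = torch.randn(batch * seq_len, vocab, requires_grad=True)
labels = torch.randint(2, vocab, (batch * seq_len,))
labels[-3:] = pad_id  # simulate padded target positions

lprobs = F.log_softmax(logits, dim=-1)
loss, nll_loss = label_smoothed_nll_loss(lprobs, labels, epsilon=0.1, ignore_index=pad_id)
loss.backward()
print(loss.item(), nll_loss.item())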

@ghost

ghost commented Dec 26, 2020

Hi.
Related to this bug is my bug report here: #9311
Is there a workaround that would let me move past the memory issue for now? Thanks.

@stas00
Contributor Author

stas00 commented Dec 26, 2020

Well, I don't think it's related other than both using up more RAM ;) This regression happened in a very recent change, but you're using a much older transformers version.

I will follow up in the issue you linked to.

@stas00
Contributor Author

stas00 commented Jan 6, 2021

So --fp16 seems to be involved: if I remove it, the regression goes away.
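
To isolate the fp16 effect outside finetune_trainer.py, a minimal sketch (the toy model and sizes are made up; this only compares autocast vs. fp32 peak memory on a single linear layer, not the full sharded_ddp setup):

import torch

def peak_mb(use_fp16):
    """Forward + backward on a toy model, return peak allocated CUDA memory in MB."""
    torch.cuda.reset_peak_memory_stats()
    model = torch.nn.Linear(4096, 4096).cuda()
    x = torch.randn(64, 4096, device="cuda")
    with torch.cuda.amp.autocast(enabled=use_fp16):
        loss = model(x).float().pow(2).mean()
    loss.backward()
    return torch.cuda.max_memory_allocated() / 2**20

print("fp32 peak MB:", peak_mb(False))
print("fp16 peak MB:", peak_mb(True))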
