
T5 Gradient Checkpointing #6564

Closed
agemagician opened this issue Aug 18, 2020 · 18 comments · Fixed by #11353
Labels
Good Second Issue: Issues that are more difficult to do than "Good First" issues - give it a try if you want!

Comments

@agemagician
Contributor

🚀 Feature request

Currently, only Bert supports gradient checkpointing, which allows the model to be fine-tuned on GPUs with limited memory.
It would be great for T5 to support gradient checkpointing as well.

Code:

if getattr(self.config, "gradient_checkpointing", False):
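
For reference, a minimal sketch of how that flag is used on the user side today, assuming the current config-flag API that Bert exposes (the model name is just an example):

from transformers import BertConfig, BertModel

# Sketch: set the flag on the config before loading the model; the modeling
# code then routes each layer's forward pass through
# torch.utils.checkpoint.checkpoint instead of calling the layer directly.
config = BertConfig.from_pretrained("bert-base-uncased")
config.gradient_checkpointing = True
model = BertModel.from_pretrained("bert-base-uncased", config=config)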

Motivation

T5 comes in very large variants with 3B and 11B parameters, which makes it impossible to fine-tune them on most GPUs. Gradient checkpointing would allow these huge models to be fine-tuned on GPUs. This would lead to much better results on downstream tasks using in-house GPUs, without the need to fine-tune them on TPUs.

Your contribution

If I am not mistaken, all that needs to be changed is the following block:
https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_t5.py#L752

for i, (layer_module, past_key_value_state) in enumerate(zip(self.block, past_key_value_states)):
            if output_hidden_states:
                all_hidden_states = all_hidden_states + (hidden_states,)

            if getattr(self.config, "gradient_checkpointing", False):

                def create_custom_forward(module):
                    def custom_forward(*inputs):
                        return module(*inputs, output_attentions)

                    return custom_forward

                layer_outputs = torch.utils.checkpoint.checkpoint(
                    create_custom_forward(layer_module),
                    hidden_states,
                    extended_attention_mask,
                    position_bias,
                    encoder_hidden_states,
                    encoder_extended_attention_mask,
                    encoder_decoder_position_bias,
                    head_mask[i],
                    past_key_value_state,
                    use_cache,
                    output_attentions,
                )

            else:
                layer_outputs = layer_module(
                    hidden_states,
                    attention_mask=extended_attention_mask,
                    position_bias=position_bias,
                    encoder_hidden_states=encoder_hidden_states,
                    encoder_attention_mask=encoder_extended_attention_mask,
                    encoder_decoder_position_bias=encoder_decoder_position_bias,
                    head_mask=head_mask[i],
                    past_key_value_state=past_key_value_state,
                    use_cache=use_cache,
                    output_attentions=output_attentions,
                )
                # layer_outputs is a tuple with:
                # hidden-states, key-value-states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)
            hidden_states, present_key_value_state = layer_outputs[:2]

            if i == 0:
                # We share the position biases between the layers - the first layer stores them
                # layer_outputs = hidden-states, key-value-states (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)
                position_bias = layer_outputs[3 if output_attentions else 2]
                if self.is_decoder and encoder_hidden_states is not None:
                    encoder_decoder_position_bias = layer_outputs[5 if output_attentions else 3]
            # append next layer key value states
            if use_cache:
                present_key_value_states = present_key_value_states + (present_key_value_state,)

            if output_attentions:
                all_attentions = all_attentions + (layer_outputs[2],)  # We keep only self-attention weights for now

@patrickvonplaten thanks in advance for looking into it.

patrickvonplaten self-assigned this Aug 18, 2020
patrickvonplaten changed the title from "T5 Checkpointing" to "T5 Gradient Checkpointing" Sep 20, 2020
@patrickvonplaten
Contributor

Also pinging @LysandreJik for notification in case this is easy to implement

@stale

stale bot commented Dec 24, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Dec 24, 2020
@agemagician
Contributor Author

Keep it alive :)

stale bot removed the wontfix label Dec 24, 2020
@patrickvonplaten
Contributor

That's an important feature indeed! Will try to tackle this with @LysandreJik @VictorSanh in the new year :-)

@ssss1029

ssss1029 commented Jan 8, 2021

Hi, I'm not too familiar with T5 internals, but I crudely tried modifying modeling_t5.py as the OP suggested and ran into some issues with unsupported return values for torch.utils.checkpoint.checkpoint, so it seems something else besides that block needs changing?

  File "/data/sauravkadavath/miniconda3/envs/transformers-4.0.0/lib/python3.7/site-packages/torch/utils/checkpoint.py", line 163, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
TypeError: CheckpointFunctionBackward.forward: expected Tensor or tuple of Tensor (got NoneType) for return value 1
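
The error seems to come from torch.utils.checkpoint only accepting tensors in the checkpointed function's return value, while T5Block returns tuples that can contain None entries (e.g. no present key/value states when use_cache=False). A minimal sketch of that failure mode, assuming a PyTorch version from around that time (1.6/1.7); block_like is just a stand-in, not real T5 code:

import torch
from torch.utils.checkpoint import checkpoint

def block_like(x):
    # Stand-in for a T5Block whose output tuple contains a None entry,
    # e.g. no present_key_value_state when use_cache=False.
    return x * 2, None

x = torch.randn(2, 4, requires_grad=True)
try:
    out = checkpoint(block_like, x)
except TypeError as err:
    # On PyTorch around 1.6/1.7 this should raise the same kind of error as the
    # traceback above ("expected Tensor or tuple of Tensor (got NoneType)").
    print(err)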

@patrickvonplaten
Contributor

Hey @ssss1029,

Thanks for playing around with the feature! Would you mind opening a PR with your code? I'll help you get it merged. It is quite possible that we will have to change some more code in T5 to make it work. Ideally, I'd try to base the T5 gradient checkpointing code as much as possible on how Bart does it.

@patrickvonplaten
Contributor

Lots of people have been asking for T5 checkpointing, so your PR would be a great contribution if you want to give it a try :-)

@ssss1029

ssss1029 commented Jan 11, 2021

Hi Patrick, unfortunately, I'm pretty new to Huggingface internals and I won't have the bandwidth to implement this.

@xFinal

xFinal commented Jan 14, 2021

@patrickvonplaten @ssss1029
Just a straightforward workaround, not meant as a PR.
I modified the torch.utils.checkpoint file to overcome its limitation. See the code below; all the modifications are marked with comments.
Training with t5-base, I observe the loss dropping just as it does with gradient_checkpointing off, and memory usage drops as well. But I don't have time to do a full verification right now.

1. checkpoint.CheckpointFunction

class CheckpointFunction(torch.autograd.Function):

    @staticmethod
    def forward(ctx, run_function, preserve_rng_state, *args):
        check_backward_validity(args)
        ctx.run_function = run_function
        ctx.preserve_rng_state = preserve_rng_state
        if preserve_rng_state:
            ctx.fwd_cpu_state = torch.get_rng_state()
            ctx.had_cuda_in_fwd = False
            if torch.cuda._initialized:
                ctx.had_cuda_in_fwd = True
                ctx.fwd_gpu_devices, ctx.fwd_gpu_states = get_device_states(*args)
        ctx.save_for_backward(*args)
        with torch.no_grad():
            outputs = run_function(*args)
        # return outputs

        #
        # Lie to torch that there are no None items, to avoid the assert
        #
        result = []
        for o in outputs:
            if o is None:
                o = torch.zeros(0).cuda()
            result.append(o)

        return tuple(result)

    @staticmethod
    def backward(ctx, *args):
        if not torch.autograd._is_checkpoint_valid():
            raise RuntimeError("Checkpointing is not compatible with .grad(), please use .backward() if possible")
        inputs = ctx.saved_tensors
        rng_devices = []
        if ctx.preserve_rng_state and ctx.had_cuda_in_fwd:
            rng_devices = ctx.fwd_gpu_devices
        with torch.random.fork_rng(devices=rng_devices, enabled=ctx.preserve_rng_state):
            if ctx.preserve_rng_state:
                torch.set_rng_state(ctx.fwd_cpu_state)
                if ctx.had_cuda_in_fwd:
                    set_device_states(ctx.fwd_gpu_devices, ctx.fwd_gpu_states)
            detached_inputs = detach_variable(inputs)
            with torch.enable_grad():
                outputs = ctx.run_function(*detached_inputs)

        if isinstance(outputs, torch.Tensor):
            outputs = (outputs,)
        
        #
        # Skip None items and tensors whose requires_grad is False when doing the backward pass
        #
        backward_outputs = []
        backward_args = []
        for o, a in zip(outputs, args):
            if o is not None and o.requires_grad:
                backward_outputs.append(o)
                backward_args.append(a)
        torch.autograd.backward(backward_outputs, backward_args)

        # torch.autograd.backward(outputs, args)
        grads = tuple(inp.grad if isinstance(inp, torch.Tensor) else inp
                      for inp in detached_inputs)
        return (None, None) + grads

2. checkpoint.checkpoint()

def checkpoint(function, *args, **kwargs):
    preserve = kwargs.pop('preserve_rng_state', True)
    if kwargs:
        raise ValueError("Unexpected keyword arguments: " + ",".join(arg for arg in kwargs))

    outputs = CheckpointFunction.apply(function, preserve, *args)

    #
    # Restore None items in the result
    #
    result = []
    for o in outputs:
        if len(o) == 0:
            o = None
        result.append(o)

    return tuple(result)

3. modeling_t5.T5Stack.forward(), just the common way

            if getattr(self.config, "gradient_checkpointing", False):

                def create_custom_forward(module):
                    def custom_forward(*inputs):
                        return tuple(module(*inputs, use_cache, output_attentions))

                    return custom_forward

                layer_outputs = checkpoint(
                    create_custom_forward(layer_module),
                    hidden_states,
                    extended_attention_mask,
                    position_bias,
                    encoder_hidden_states,
                    encoder_extended_attention_mask,
                    encoder_decoder_position_bias,
                    head_mask[i],
                    past_key_value,
                )

            else:
                layer_outputs = layer_module(
                    hidden_states,
                    attention_mask=extended_attention_mask,
                    position_bias=position_bias,
                    encoder_hidden_states=encoder_hidden_states,
                    encoder_attention_mask=encoder_extended_attention_mask,
                    encoder_decoder_position_bias=encoder_decoder_position_bias,
                    head_mask=head_mask[i],
                    past_key_value=past_key_value,
                    use_cache=use_cache,
                    output_attentions=output_attentions,
                )

@patrickvonplaten
Contributor

Hey @xFinal ,

Your 3rd approach is definitely the one we'd be super happy to integrate into Transformers. Thanks a mille for your contribution already. If anyone in the community wants to give it a shot and add @xFinal's 3rd proposed solution to modeling_t5.py, that would be awesome :-)

patrickvonplaten added the Good Second Issue label Jan 14, 2021
@xFinal

xFinal commented Jan 14, 2021

Hi @patrickvonplaten ,

Glad to hear it's helpful! But I have two worries about the integration:

  1. It has not been tested with a full training run yet.
  2. The current approach modifies the torch.utils.checkpoint file, which is part of PyTorch, so it is probably not suitable for integration. Maybe there is a more elegant way, like adjusting T5 itself?

@dwaydwaydway

Hi @xFinal ,
I tried your solution and got the following error:

TypeError('CheckpointFunctionBackward.forward: expected Tensor or tuple of Tensor (got tuple) for return value 1')

/share/home/dwaydwaydway/t5/src/transformers/src/transformers/models/t5/modified_gradient_ckpt.py(124)checkpoint()
    123
--> 124 outputs = CheckpointFunction.apply(function, preserve, *args)
    125

May I ask which PyTorch version you used?

@xFinal

xFinal commented Jan 26, 2021

@dwaydwaydway,
The version is 1.7.1.
Make sure you return a tuple from CheckpointFunction.forward().

@github-actions

github-actions bot commented Mar 6, 2021

This issue has been stale for 1 month.

@ceshine
Contributor

ceshine commented Apr 7, 2021

Inspired by @xFinal's solution, I implemented another workaround that doesn't require modifying the Checkpoint class (by returning a dummy Tensor instead of None in T5Block.forward).

It seems to work, but my tests might not be comprehensive enough.
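
For anyone curious, the rough idea looks something like this (an illustrative sketch with made-up helper names, not necessarily what the final PR will look like): T5Block.forward substitutes an empty tensor for every None in its output tuple so that torch.utils.checkpoint only ever sees tensors, and T5Stack turns the placeholders back into None after the checkpointed call.

import torch

def hide_nones(outputs):
    # Inside T5Block.forward: replace None entries (e.g. absent key/value
    # states or attention weights) with an empty dummy tensor.
    return tuple(torch.tensor([]) if o is None else o for o in outputs)

def restore_nones(outputs):
    # Inside T5Stack.forward, right after torch.utils.checkpoint.checkpoint:
    # convert the empty-tensor placeholders back into None.
    return tuple(
        None if isinstance(o, torch.Tensor) and o.numel() == 0 else o
        for o in outputs
    )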

@patrickvonplaten
Contributor

Hey @ceshine - do you mind opening a PR for it? :-)

@ceshine
Contributor

ceshine commented Apr 21, 2021

> Hey @ceshine - do you mind opening a PR for it? :-)

Not at all. I'll open a PR after a bit more polishing.

@xFinal

xFinal commented Apr 21, 2021

@ceshine that's great! :)
