
[RFC] Scan & Gradient checkpointing in Flax #17399

Open
patrickvonplaten opened this issue May 24, 2022 · 5 comments · May be fixed by #18341
Labels
WIP Label your PR/Issue with WIP for some long outstanding Issues/PRs that are work in progress

Comments

@patrickvonplaten
Contributor

Feature request

We should add scan and remat (gradient checkpointing) to the most important Flax/JAX models (BERT, GPT2, OPT, T5, BART, Wav2Vec2).

Motivation

Scan allows for much faster compilation and memory savings, and remat is the JAX equivalent of gradient_checkpointing in PyTorch.
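
For background, this is roughly what the two transforms look like in raw Flax - a minimal, illustrative sketch (plain flax.linen usage, not the Transformers API proposed below):

import jax
import jax.numpy as jnp
import flax.linen as nn

class Block(nn.Module):
    # scan-compatible signature: (carry, xs) in, (carry, ys) out
    @nn.compact
    def __call__(self, hidden_states, _):
        hidden_states = nn.gelu(nn.Dense(hidden_states.shape[-1])(hidden_states))
        return hidden_states, None

class ScannedStack(nn.Module):
    num_layers: int = 4

    @nn.compact
    def __call__(self, hidden_states):
        # nn.remat recomputes activations in the backward pass (gradient
        # checkpointing); nn.scan traces a single layer and loops it,
        # stacking the per-layer params along a leading axis.
        ScanBlock = nn.scan(
            nn.remat(Block),
            variable_axes={"params": 0},
            split_rngs={"params": True},
            length=self.num_layers,
        )
        hidden_states, _ = ScanBlock()(hidden_states, None)
        return hidden_states

params = ScannedStack().init(jax.random.PRNGKey(0), jnp.ones((2, 16)))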

@sanchit-gandhi already uses both features in the Flax Seq2Seq Speech project - see: https://github.com/sanchit-gandhi/seq2seq-speech - so it'd be quite trivial to get them working.

Implementation details:

Given that both scan and remat are not related to the model architecture, they should IMO not be in the model's config (we made this mistake in PyTorch and don't want to repeat it here).

I would advocate for the following API:

model = FlaxBertForMaskedLM.from_pretrained("bert-base-cased")
model.scan()  # or model.scan_enable()
model.unscan()  # or model.scan_disable()

and

model = FlaxBertForMaskedLM.from_pretrained("bert-base-cased")
model.gradient_checkpoint_enable()
model.gradient_checkpoint_disable()
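
Put together, a training/eval loop under these proposed (hypothetical, not yet merged) methods would read:

model = FlaxBertForMaskedLM.from_pretrained("bert-base-cased")
model.scan_enable()
model.gradient_checkpoint_enable()
# ... train: faster compilation, larger batch sizes ...
model.scan_disable()
model.gradient_checkpoint_disable()
# ... evaluate ...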

As can be seen here: https://github.com/sanchit-gandhi/seq2seq-speech/blob/b28d0c25c8fad0f9ffa6707f91f7aba320d44a4b/models/modeling_flax_wav2vec2.py#L504, we'll need to re-initialize the flax.linen.Module inside the model. However, this should be fine, since it just means that we do

self.module = self.module_class(config=config, dtype=dtype, use_scan=True, **kwargs)
self._is_scan_enabled = True

similar to this line:

module = self.module_class(config=config, dtype=dtype, **kwargs)
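
Concretely, the toggle could look something like the following sketch (assuming the inner module class takes a use_scan flag, as in the snippet above; not a final implementation):

class FlaxPreTrainedModel:  # simplified stand-in for the real base class
    def scan_enable(self):
        # Re-instantiate the inner flax.linen.Module with scan turned on.
        # Flax modules hold no parameters, so self.params is untouched
        # (modulo the scanned vs. unscanned weight layout).
        self.module = self.module_class(config=self.config, dtype=self.dtype, use_scan=True)
        self._is_scan_enabled = True

    def scan_disable(self):
        self.module = self.module_class(config=self.config, dtype=self.dtype, use_scan=False)
        self._is_scan_enabled = False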

We can figure out over the course of the PR how much logic can reside in modeling_flax_utils.py and how much has to go into the specific models, e.g. modeling_flax_wav2vec2.py.

The same API / logic could be used for the gradient_checkpointing.

Your contribution

Happy to give this implementation a shot with @sanchit-gandhi and @patil-suraj.

Would also love to hear feedback from @borisdayma and @marcvanzee on the API.

@patrickvonplaten patrickvonplaten changed the title RFC [RFC] Scan & Gradient checkpointing in Flax May 24, 2022
@borisdayma
Contributor

I'm not sure you would need both versions within the same script (scanned and unscanned, or with and without checkpointing, which affects only training anyway).

Then maybe you could just add it directly as an arg to model.from_pretrained(..., scan=False, gradient_checkpointing=False).

You would just have to use some naming conventions on your params to see if you need to scan/unscan when loading a checkpoint.
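
For illustration, a conversion between the two layouts could look something like this sketch (assumes plain dicts with per-layer keys like layers_0, layers_1, ... vs. one stacked layers entry; not code from this thread):

import jax.numpy as jnp
from jax import tree_util

def stack_layers(params, num_layers, prefix="layers_"):
    # unscanned -> scanned: stack per-layer subtrees along a new leading axis
    layers = [params.pop(f"{prefix}{i}") for i in range(num_layers)]
    params["layers"] = tree_util.tree_map(lambda *xs: jnp.stack(xs), *layers)
    return params

def unstack_layers(params, prefix="layers_"):
    # scanned -> unscanned: slice the leading layer axis back into subtrees
    stacked = params.pop("layers")
    num_layers = tree_util.tree_leaves(stacked)[0].shape[0]
    for i in range(num_layers):
        params[f"{prefix}{i}"] = tree_util.tree_map(lambda x: x[i], stacked)
    return params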

@sanchit-gandhi
Contributor

Suppose you have a training script: wouldn't it be useful to be able to use scan and remat during training for faster compile times and larger batch sizes, and then switch to unscanned weights and no remat during evaluation for faster inference?

@borisdayma
Contributor

I'm not sure it would be worth it:

  • Most of the time evaluation is relatively fast
  • You would have to reformat your parameters each time between eval and train, potentially leading to memory fragmentation

@huggingface huggingface deleted a comment from github-actions bot Jun 27, 2022
@patrickvonplaten patrickvonplaten added the WIP Label your PR/Issue with WIP for some long outstanding Issues/PRs that are work in progress label Jun 27, 2022
@KMFODA
Contributor

KMFODA commented Jun 28, 2022

Hey @patrickvonplaten, I'm keen to get gradient checkpointing working in JAX for long-t5. If this is not on the cards to be added soon, I'm happy to work on a PR for it if that works with you all.

@sanchit-gandhi
Contributor

Hey @KMFODA! There's a PR that is close to being merged: #17843. I'll let you know once it's complete, and you can then copy the logic across to Flax T5 in a new PR, if that sounds good to you!

@KMFODA KMFODA mentioned this issue Jul 2, 2022
@sanchit-gandhi sanchit-gandhi linked a pull request Jul 28, 2022 that will close this issue