
[Flax] improve large model init and loading #16148

Merged
merged 35 commits into huggingface:main on Apr 19, 2022

Conversation

patil-suraj
Contributor

@patil-suraj patil-suraj commented Mar 14, 2022

What does this PR do?

As discussed in #15766, this PR adds the `_do_init` argument to handle large model loading in Flax. By default `_do_init=True` and the API stays the same.

  1. `__init__`:
  • When `_do_init=False` is passed to `__init__`, the params are not initialised and the user should manually call the `model.init_weights` method to do the initialisation.
  • When `_do_init` is `False`, accessing `model.params` is not allowed and the params must always be kept outside of the model.
  • This PR also adds the `params_shape_tree` property to `FlaxPreTrainedModel`, which is a PyTree with shape and dtype information for each param (see the sketch after the example below).

This is what the API looks like:

import jax
from transformers import BertConfig, FlaxBertModel

config = BertConfig()
model = FlaxBertModel(config, _do_init=False)

# accessing model.params will raise a ValueError
model.params 

# to init the params
params = model.init_weights(model.key, model.input_shape)

# setting model.params will also raise a ValueError
model.params = params

# model.init_weights can be used with `jit` and `pjit` to init the params on CPU or in a sharded way
params = jax.jit(model.init_weights, static_argnums=1, backend="cpu")(model.key, model.input_shape)
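
As a rough illustration of the `params_shape_tree` property mentioned above (a sketch only, not code from this PR; it assumes the leaves expose `.shape` and `.dtype`, e.g. as `jax.ShapeDtypeStruct` objects), the shape PyTree can be inspected without ever allocating the weights:

import math
import jax

# Sketch: count parameters from the shape/dtype PyTree without materialising them.
shape_tree = model.params_shape_tree
num_params = sum(math.prod(leaf.shape) for leaf in jax.tree_util.tree_leaves(shape_tree))
print(f"~{num_params / 1e6:.1f}M parameters")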
  2. `from_pretrained`:
  • When `_do_init=False` is passed to `from_pretrained`, the model won't be randomly initialised; only the weights will be loaded and returned along with the model instance.
  • When `_do_init=False`, the params will always be loaded on CPU.
  • If `_do_init=False` and some keys are missing, the user should call `init_weights` and pass it the loaded params. `init_weights` will then take care of adding the missing keys.
  • As described above, getting and setting `model.params` will raise an error.
model, params = FlaxBertModel.from_pretrained("...", _do_init=False)
# if keys are missing
params = model.init_weights(model.key, model.input_shape, params)
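
For completeness, here is a small usage sketch (not from the PR description; it assumes `model` is e.g. the `FlaxBertModel` loaded above): since the params now live outside the model, they are passed explicitly to the forward call via the `params` argument that Flax models accept.

import jax.numpy as jnp

# Sketch: with _do_init=False the params are kept outside the model,
# so they are passed explicitly at call time.
input_ids = jnp.ones((1, 8), dtype="i4")
outputs = model(input_ids, params=params)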

cc @borisdayma @sanchit-gandhi for info.

Fixes #15766

@patil-suraj patil-suraj changed the title Flax-do-init [WiP] [Flax] improve large model init and loading Mar 14, 2022
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Mar 14, 2022

The documentation is not available anymore as the PR was closed or merged.

@patil-suraj patil-suraj changed the title [WiP] [Flax] improve large model init and loading [Flax] improve large model init and loading Mar 16, 2022
Contributor

@borisdayma borisdayma left a comment

Looks great!

src/transformers/modeling_flax_utils.py (outdated)
"`params` cannot be set from model when the model is created with `_do_init=False`. "
"You store the params outside of the model."
)

if isinstance(params, FrozenDict):
params = unfreeze(params)
Contributor

@borisdayma borisdayma Mar 16, 2022

Wondering why we want to unfreeze the params here?
I personally always do model._params = freeze(model.params)
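
For readers following along, a minimal illustration of the freeze/unfreeze round-trip being discussed (using flax.core.frozen_dict; this is a sketch, not code from the PR):

from flax.core.frozen_dict import FrozenDict, freeze, unfreeze

# init_weights may return a FrozenDict or a plain dict, depending on the model
mutable_params = unfreeze(params) if isinstance(params, FrozenDict) else dict(params)
frozen_params = freeze(mutable_params)  # immutable view, as in `model._params = freeze(model.params)`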

@agemagician
Contributor

Hi @patil-suraj,

Thanks a lot for fixing this issue.
Could you please give us an estimate of when this pull request will be ready?

@patil-suraj
Contributor Author

Hey @agemagician !
Thanks. I just need to run and fix a few tests and then it should be good to merge by tomorrow.

@agemagician
Contributor

agemagician commented Mar 23, 2022

> Hey @agemagician ! Thanks. I just need to run and fix a few tests and then it should be good to merge by tomorrow.

Perfect, thanks a lot @patil-suraj for your effort.
It would be awesome if you could update one of the Flax language-modeling examples to show the changes that are needed after this merge:
https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling

Contributor

@patrickvonplaten patrickvonplaten left a comment

Nice! Just two small nits to give the test a better name

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@agemagician
Contributor

Hi @patil-suraj,

Any updates regarding this merge?

@patrickvonplaten
Contributor

@patil-suraj - think we can merge this no?

@patil-suraj
Contributor Author

Just need to fix the template tests and then we can merge it.

@patil-suraj patil-suraj merged commit d3bd9ac into huggingface:main Apr 19, 2022
@patil-suraj patil-suraj deleted the flax-do-init branch April 19, 2022 12:20
stancld added a commit to stancld/transformers that referenced this pull request Apr 19, 2022
patil-suraj pushed a commit to stancld/transformers that referenced this pull request May 26, 2022
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
* begin do_init

* add params_shape_tree

* raise error if params are accessed when do_init is False

* don't allow do_init=False when keys are missing

* make shape tree a property

* assign self._params at the end

* add test for do_init

* add do_init arg to all flax models

* fix param setting

* disable do_init for composite models

* update test

* add do_init in FlaxBigBirdForMultipleChoice

* better names and errors

* improve test

* style

* add a warning when do_init=False

* remove extra if

* set params after _required_params

* add test for from_pretrained

* do_init => _do_init

* change warning to info

* fix typo

* add params in init_weights

* add params to gpt neo init

* add params to init_weights

* update do_init test

* Trigger CI

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* update template

* trigger CI

* style

* style

* fix template

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
patrickvonplaten added a commit that referenced this pull request Jun 13, 2022
* Initial commit

* Make some fixes

* Make PT model full forward pass

* Drop TF & Flax implementation, fix copies etc

* Add Flax model and update some corresponding stuff

* Drop some TF things

* Update config and flax local attn

* Add encoder_attention_type to config

* .

* Update docs

* Do some cleansing

* Fix some issues -> make style; add some docs

* Fix position_bias + mask addition + Update tests

* Fix repo consistency

* Fix model consistency by removing flax operation over attn_mask

* [WIP] Add PT TGlobal LongT5

* .

* [WIP] Add flax tglobal model

* [WIP] Update flax model to use the right attention type in the encoder

* Fix flax tglobal model forward pass

* Make the use of global_relative_attention_bias

* Add test suites for TGlobal model

* Fix minor bugs, clean code

* Fix pt-flax equivalence though not convinced with correctness

* Fix LocalAttn implementation to match the original impl. + update READMEs

* Few updates

* Update: [Flax] improve large model init and loading #16148

* Add ckpt conversion script according to #16853 + handle torch device placement

* Minor updates to conversion script.

* Typo: AutoModelForSeq2SeqLM -> FlaxAutoModelForSeq2SeqLM

* gpu support + dtype fix

* Apply some suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* * Remove (de)parallelize stuff
* Edit shape comments
* Update README.md
* make fix-copies

* Remove caching logic for local & tglobal attention

* Apply another batch of suggestions from code review

* Add missing checkpoints
* Format converting scripts
* Drop (de)parallelize links from longT5 mdx

* Fix converting script + revert config file change

* Revert "Remove caching logic for local & tglobal attention"

This reverts commit 2a61982.

* Stash caching logic in Flax model

* Make side relative bias used always

* Drop caching logic in PT model

* Return side bias as it was

* Drop all remaining model parallel logic

* Remove clamp statements

* Move test files to the proper place

* Update docs with new version of hf-doc-builder

* Fix test imports

* Make some minor improvements

* Add missing checkpoints to docs
* Make TGlobal model compatible with torch.onnx.export
* Replace some np.ndarray with jnp.ndarray

* Fix TGlobal for ONNX conversion + update docs

* fix _make_global_fixed_block_ids and masked neg  value

* update flax model

* style and quality

* fix imports

* remove load_tf_weights_in_longt5 from init and fix copies

* add slow test for TGlobal model

* typo fix

* Drop obsolete is_parallelizable and one warning

* Update __init__ files to fix repo-consistency

* fix pipeline test

* Fix some device placements

* [wip]: Update tests -- need to generate summaries to update expected_summary

* Fix quality

* Update LongT5 model card

* Update (slow) summarization tests

* make style

* rename checkpoints

* finish

* fix flax tests

Co-authored-by: phungvanduy <pvduy23@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: patil-suraj <surajp815@gmail.com>
amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request Jun 16, 2022
Development

Successfully merging this pull request may close these issues.

[Discussion] Loading and initialising large models with Flax
5 participants