Add auto variable sharding for all backbones/tasks #1689
Labels: Gemma (Gemma model specific issues), team-created (Issues created by the Keras Hub team as part of the development roadmap), type:feature (New feature or request)
We want model parallelism to be easy to use across the library. At a high level, a user should express their hardware and (possibly) a desired model-parallel vs. data-parallel split for the device grid.
Currently, we have an auto layout helper for Gemma here, but it is not a scalable design. The correct layout map will depend on the config of the model. E.g. you need to shard a Gemma model with multi-head attention differently than one with multi-query attention.
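To make that concrete, here is a purely illustrative, hypothetical helper (`kv_kernel_layout` is not a real function in the library) showing why one hard-coded layout cannot cover both cases. The key/value projection kernel has shape `(hidden_dim, num_key_value_heads, head_dim)`, so one possible convention of splitting attention heads across the "model" mesh axis only works when the config has enough KV heads:

```python
def kv_kernel_layout(num_key_value_heads, model_parallel_size):
    # Hypothetical helper, illustrative only.
    # KV kernel shape: (hidden_dim, num_key_value_heads, head_dim).
    if num_key_value_heads < model_parallel_size:
        # Multi-query / grouped-query attention: not enough KV heads to
        # split across the "model" mesh axis, so replicate the KV kernels.
        return (None, None, None)
    # Multi-head attention: shard across the attention-head axis.
    return (None, "model", None)
```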
I think there are two main directions we can go with the API: 1) keep writing layout maps ourselves, but generate them from the model config, or 2) rely on an autosharding API.
One potential high-level API would be to directly take in a device mesh when constructing the model. For both 1) and 2), we could support an API something like this...
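A minimal sketch of that surface, assuming the existing Keras 3 `keras.distribution` mesh utilities; the `device_mesh` argument to `from_preset` is the proposal here, not an existing parameter:

```python
import keras
import keras_nlp

# Describe the hardware as a 2x4 grid: 2-way data parallel ("batch"),
# 4-way model parallel ("model").
devices = keras.distribution.list_devices()
device_mesh = keras.distribution.DeviceMesh(
    shape=(2, 4),
    axis_names=["batch", "model"],
    devices=devices,
)

# Proposed API: the preset loader derives the correct layout map for this
# model's config and shards variables as the weights are loaded.
model = keras_nlp.models.GemmaCausalLM.from_preset(
    "gemma_7b_en",
    device_mesh=device_mesh,
)
```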
For 1) we would need to enter into a LayoutMap scope after loading the config for a model but before loading the weights. For 2) it would depend on the details of the autosharding API we use.
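For 1), a rough sketch of what the loader could do internally, assuming a recent Keras 3 where `keras.distribution.ModelParallel` picks up the device mesh from the `LayoutMap`. The helper names (`build_gemma_layout_map`, `load_sharded`) are hypothetical and the regex sharding rules are illustrative only:

```python
import keras

def build_gemma_layout_map(config, device_mesh):
    # Hypothetical helper: derive sharding rules from the model config
    # rather than hard-coding one map for all Gemma variants.
    layout_map = keras.distribution.LayoutMap(device_mesh)
    layout_map["token_embedding/embeddings"] = ("model", None)
    layout_map["decoder_block.*attention.*(query|key|value).kernel"] = (
        "model", None, None
    )
    layout_map["decoder_block.*attention_output.kernel"] = ("model", None, None)
    layout_map["decoder_block.*ffw_gating.*kernel"] = (None, "model")
    layout_map["decoder_block.*ffw_linear.kernel"] = ("model", None)
    return layout_map

def load_sharded(config, device_mesh, build_and_load_fn):
    # Read the config first, build a config-aware layout map, then create
    # variables and load weights inside the distribution scope.
    layout_map = build_gemma_layout_map(config, device_mesh)
    distribution = keras.distribution.ModelParallel(
        layout_map=layout_map, batch_dim_name="batch"
    )
    with distribution.scope():
        return build_and_load_fn(config)
```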