
Is llavallama moe supported? #9

Open
DietDietDiet opened this issue Feb 1, 2024 · 14 comments

Comments

@DietDietDiet

Hi, have you tested the results for the llava_llama version? Would an extra MoE stage improve the original LLaVA results?

@LinB203
Member

LinB203 commented Feb 1, 2024

Great choice. Work in progress!

@DietDietDiet
Author

DietDietDiet commented Feb 1, 2024 via email

@LinB203
Member

LinB203 commented Feb 1, 2024

I think this should work.

@DietDietDiet
Author

Is there any way to insert MoE layers into only part of the LLM layers? I found that converting all layers of the 13B model does not fit on a 40GB A100.

@LinB203
Member

LinB203 commented Feb 2, 2024

Is there any way to insert MoE layers into only part of the LLM layers? I found that converting all layers of the 13B model does not fit on a 40GB A100.

For example, if you want to insert MoE layers in the first and third layers, you can pass --moe_layers_idx 0 2 in your command.
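As a rough, hedged illustration of what selective conversion implies (the MoEMLP class, attribute paths, and defaults below are placeholders, not MoE-LLaVA's actual code): only the decoder layers whose indices appear in --moe_layers_idx get their feed-forward block swapped for an MoE block, while the rest stay dense.

```python
# Hedged sketch: convert only the decoder layers listed in moe_layers_idx to MoE.
# MoEMLP is a placeholder for an experts+router block; this only illustrates the
# effect of a flag like --moe_layers_idx 0 2.
import copy
import torch.nn as nn

class MoEMLP(nn.Module):
    """Placeholder MoE feed-forward block: a router ("wg") plus expert copies."""
    def __init__(self, dense_mlp: nn.Module, num_experts: int, hidden_size: int):
        super().__init__()
        self.wg = nn.Linear(hidden_size, num_experts, bias=False)  # gating network
        self.experts = nn.ModuleList(
            copy.deepcopy(dense_mlp) for _ in range(num_experts)
        )

def insert_moe_layers(model, moe_layers_idx=(0, 2), num_experts=4, hidden_size=4096):
    # LLaMA-style decoders expose their blocks as model.model.layers[i].mlp.
    for idx in moe_layers_idx:
        layer = model.model.layers[idx]
        layer.mlp = MoEMLP(layer.mlp, num_experts, hidden_size)
    return model  # all other layers keep their dense MLPs
```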

LinB203 closed this as completed Feb 3, 2024
@DietDietDiet
Author

DietDietDiet commented Feb 4, 2024 via email

@LinB203
Member

LinB203 commented Feb 4, 2024

I used the pretrained LLaVA weights to initialize MoE-LLaVA and passed in the moe_layers_idx params, but encountered the following error: AssertionError: The model has moe layers, but None of the param groups are marked as MoE. Create a param group with 'moe' key set to True before creating optimizer. Are any additional modifications needed to solve this?


Here is the solution.
#17
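For context, that DeepSpeed assertion means the optimizer was built without any parameter group tagged as MoE. A minimal sketch of the idea (the name-based split below is an illustrative assumption, not the exact fix applied in #17):

```python
# Hedged sketch: make sure the optimizer sees at least one param group with
# 'moe': True before DeepSpeed initializes. The ".experts." name check is a
# heuristic for illustration; adjust it to your model's parameter names.
import torch

def build_param_groups(model, lr=2e-5):
    moe_params, dense_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if ".experts." in name:          # treat expert weights as MoE params
            moe_params.append(param)
        else:
            dense_params.append(param)
    return [
        {"params": dense_params, "lr": lr},
        {"params": moe_params, "lr": lr, "moe": True, "name": "moe_params"},
    ]

# optimizer = torch.optim.AdamW(build_param_groups(model))
```

If your DeepSpeed version includes it, deepspeed.moe.utils.split_params_into_different_moe_groups_for_optimizer can perform a similar split automatically based on DeepSpeed's own MoE parameter markers.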

@DietDietDiet
Author

I find it really strange that even with a minimal number of experts and MoE layers, MoE-LLaMA still cannot fit on a 40GB A100. Here are the trainable modules I set, following the LLaMA layer names: --train_modules mlp.gate_proj mlp.up_proj mlp.down_proj wg
Could you provide a sample script for the final MoE stage of LLaVA-1.5?

@LinB203
Member

LinB203 commented Feb 5, 2024

I find it really strange that even with a minimal number of experts and MoE layers, MoE-LLaMA still cannot fit on a 40GB A100. Here are the trainable modules I set, following the LLaMA layer names: --train_modules mlp.gate_proj mlp.up_proj mlp.down_proj wg. Could you provide a sample script for the final MoE stage of LLaVA-1.5?

You can enable FlashAttention-2 (flash_attn2) and try again. Refer to this issue:
#25 (comment)
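For reference, a minimal sketch of loading with FlashAttention-2 via Hugging Face transformers (the checkpoint path is a placeholder; it needs the flash-attn package and a half-precision dtype):

```python
# Hedged sketch: enable FlashAttention-2 at load time to cut attention memory.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "path/to/pretrained-llava",              # placeholder checkpoint path
    torch_dtype=torch.bfloat16,              # FlashAttention-2 requires fp16/bf16
    attn_implementation="flash_attention_2",
    low_cpu_mem_usage=True,
)
```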

By the way, how many GPUs are you using?

LinB203 reopened this Feb 5, 2024
@DietDietDiet
Author

I modified builder.py to load the model with FlashAttention-2:
model = LlavaLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, attn_implementation="flash_attention_2", **kwargs)
It still OOMs. I'm using 8×40GB A100s.

@LinB203
Member

LinB203 commented Feb 6, 2024

Could you post your command?

I modified builder.py to load the model with FlashAttention-2: model = LlavaLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, attn_implementation="flash_attention_2", **kwargs). It still OOMs. I'm using 8×40GB A100s.

@DietDietDiet
Author

```
moe_mode="sparse"
num_experts=1
top_k_experts=1
use_residual=False
router_aux_loss_coef=0.01
JSON_FOLDER="ft_json"
IMAGE_FOLDER="train_image_video"

HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 deepspeed moellava/train/train_mem.py \
    --moe_enable False --num_experts ${num_experts} --top_k_experts ${top_k_experts} --capacity_factor 1.5 \
    --moe_layers_idx 0 5 10 \
    --moe_mode ${moe_mode} --use_residual ${use_residual} --router_aux_loss_coef ${router_aux_loss_coef} \
    --train_modules mlp.gate_proj mlp.up_proj mlp.down_proj wg \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path $(pretrained llava weight) \
    --version v1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 16 \
    --gradient_accumulation_steps 16
```

The rest remains consistent with LLaVA.

@LinB203
Member

LinB203 commented Feb 6, 2024

`moe_mode="sparse" num_experts=1 top_k_experts=1 use_residual=False router_aux_loss_coef=0.01 JSON_FOLDER="ft_json" IMAGE_FOLDER="train_image_video"

HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 deepspeed moellava/train/train_mem.py --moe_enable False --num_experts ${num_experts} --top_k_experts ${top_k_experts} --capacity_factor 1.5 --moe_layers_idx 0 5 10 --moe_mode ${moe_mode} --use_residual ${use_residual} --router_aux_loss_coef r o u t e r a u x l o s s c o e f t r a i n m o d u l e s m l p . g a t e p r o j m l p . u p p r o j m l p . d o w n p r o j w g d e e p s p e e d . / s c r i p t s / z e r o 2. j s o n m o d e l n a m e o r p a t h (pretrained llava weight) --version v1 --per_device_train_batch_size 1 --per_device_eval_batch_size 16 --gradient_accumulation_steps 16 `

The rest remains consistent with llava

We will check it later. Could you try another model, such as Phi or StableLM?

@DietDietDiet
Author

DietDietDiet commented Feb 7, 2024 via email
