
Add default IA3 target modules for Mixtral #1376

Merged
4 commits merged into huggingface:main on Feb 15, 2024

Conversation

@arnavgarg1 commented on Jan 19, 2024

Here's the Mixtral model architecture with my proposed IA3 target module mapping:

PeftModelForCausalLM(                                                                                                                                       
  (base_model): IA3Model(                                                                                                                                   
    (model): MixtralForCausalLM(                                                                                                                            
      (model): MixtralModel(                                                                                                                                
        (embed_tokens): Embedding(32000, 4096)                                                                                                              
        (layers): ModuleList(                                                                                                                               
          (0-13): 14 x MixtralDecoderLayer(                                                                                                                 
            (self_attn): MixtralAttention(                                                                                                                  
              (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)                                                                         
              (k_proj): ia3.Linear4bit(                                                                                                                     
                (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)                                                                   
                (ia3_l): ParameterDict(  (default): Parameter containing: [torch.cuda.HalfTensor of size 1024x1 (cuda:0)])                                  
              )                                                                                                                                             
              (v_proj): ia3.Linear4bit(                                                                                                                     
                (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)                                                                   
                (ia3_l): ParameterDict(  (default): Parameter containing: [torch.cuda.HalfTensor of size 1024x1 (cuda:0)])                                  
              )                                                                                                                                             
              (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)                                                                         
              (rotary_emb): MixtralRotaryEmbedding()                                                                                                        
            )                                                                                                                                               
            (block_sparse_moe): MixtralSparseMoeBlock(                                                                                                      
              (gate): Linear4bit(in_features=4096, out_features=8, bias=False)                                                                              
              (experts): ModuleList(                                                                                                                        
                (0-7): 8 x MixtralBLockSparseTop2MLP(                                                                                                       
                  (w1): ia3.Linear4bit(                                                                                                                     
                    (base_layer): Linear4bit(in_features=4096, out_features=14336, bias=False)                                                              
                    (ia3_l): ParameterDict(  (default): Parameter containing: [torch.cuda.HalfTensor of size 1x4096 (cuda:0)])                              
                  )                                                                                                                                         
                  (w2): ia3.Linear4bit(                                                                                                                     
                    (base_layer): Linear4bit(in_features=14336, out_features=4096, bias=False)                                                              
                    (ia3_l): ParameterDict(  (default): Parameter containing: [torch.cuda.HalfTensor of size 1x14336 (cuda:0)])                             
                  )                                                                                                                                         
                  (w3): ia3.Linear4bit(                                                                                                                     
                    (base_layer): Linear4bit(in_features=4096, out_features=14336, bias=False)                                                              
                    (ia3_l): ParameterDict(  (default): Parameter containing: [torch.cuda.HalfTensor of size 1x4096 (cuda:0)])                              
                  )                                                                                                                                         
                  (act_fn): SiLU()                                                                                                                          
                )                                                                                                                                           
              )                                                                                                                                             
            )                                                                                                                                               
            (input_layernorm): MixtralRMSNorm()                                                                                                             
            (post_attention_layernorm): MixtralRMSNorm()
          )
          (14-31): 18 x MixtralDecoderLayer(
            (self_attn): MixtralAttention(
              (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
              (k_proj): ia3.Linear4bit( 
                (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)
                (ia3_l): ParameterDict(  (default): Parameter containing: [torch.cuda.HalfTensor of size 1024x1 (cuda:1)])
              )
              (v_proj): ia3.Linear4bit( 
                (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)
                (ia3_l): ParameterDict(  (default): Parameter containing: [torch.cuda.HalfTensor of size 1024x1 (cuda:1)])
              )
              (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
              (rotary_emb): MixtralRotaryEmbedding()
            )
            (block_sparse_moe): MixtralSparseMoeBlock(
              (gate): Linear4bit(in_features=4096, out_features=8, bias=False) 
              (experts): ModuleList(
                (0-7): 8 x MixtralBLockSparseTop2MLP(
                  (w1): ia3.Linear4bit( 
                    (base_layer): Linear4bit(in_features=4096, out_features=14336, bias=False)
                    (ia3_l): ParameterDict(  (default): Parameter containing: [torch.cuda.HalfTensor of size 1x4096 (cuda:1)])
                  )
                  (w2): ia3.Linear4bit( 
                    (base_layer): Linear4bit(in_features=14336, out_features=4096, bias=False)
                    (ia3_l): ParameterDict(  (default): Parameter containing: [torch.cuda.HalfTensor of size 1x14336 (cuda:1)])
                  )
                  (w3): ia3.Linear4bit( 
                    (base_layer): Linear4bit(in_features=4096, out_features=14336, bias=False)
                    (ia3_l): ParameterDict(  (default): Parameter containing: [torch.cuda.HalfTensor of size 1x4096 (cuda:1)])
                  )
                  (act_fn): SiLU()
                )
              )
            )
            (input_layernorm): MixtralRMSNorm()
            (post_attention_layernorm): MixtralRMSNorm()
          )
        )
        (norm): MixtralRMSNorm()
      )
      (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
    )
  )
)

Here is the resulting number of trainable parameters:

trainable params: 11,665,408 || all params: 46,714,458,112 || trainable%: 0.024971729249286513
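
For reference, a minimal sketch of how this setup could be reproduced with PEFT; the checkpoint name and quantization settings below are illustrative assumptions, while the module lists mirror the ones proposed in this PR:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import IA3Config, get_peft_model, prepare_model_for_kbit_training

# Load Mixtral in 4-bit, matching the Linear4bit base layers in the printout above.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",  # assumed checkpoint
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Target modules as proposed in this PR. feedforward_modules must be a subset of
# target_modules; for those layers the (IA)^3 vector scales the layer input,
# which is why w1/w2/w3 show ia3_l of shape (1, in_features) above.
peft_config = IA3Config(
    task_type="CAUSAL_LM",
    target_modules=["k_proj", "v_proj", "w1", "w2", "w3"],
    feedforward_modules=["w1", "w2", "w3"],
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # reported above: trainable% ≈ 0.025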

@younesbelkada left a comment

Hi @arnavgarg1, thanks for the contribution!
I didn't notice your PR until now. I made #1380 a few days ago, which adds Mixtral to the LoRA mapping. Would you be happy to convert this PR into one that adds Mixtral to the IA3 mapping instead?

@arnavgarg1

@younesbelkada Yes!

@pacman100 left a comment

Hello @arnavgarg1, as Younes mentioned, please update the PR to add target modules for Mixtral when using IA3.

@arnavgarg1 changed the title from "Add default LoRA target modules for Mixtral" to "Add default IA3 target modules for Mixtral" on Jan 29, 2024
@arnavgarg1

@pacman100 @younesbelkada Just updated with IA3 instead! I'm also going to add a separate PR for Phi with IA3 right now.

@younesbelkada left a comment

Thanks a lot!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@arnavgarg1

No problem!

@@ -93,6 +93,7 @@ def starcoder_model_postprocess_past_key_value(past_key_values):
"gpt_bigcode": ["c_attn", "mlp.c_proj"],
"llama": ["k_proj", "v_proj", "down_proj"],
"mistral": ["k_proj", "v_proj", "down_proj"],
"mixtral": ["k_proj", "v_proj", "w1", "w2", "w3"],

@arnavgarg1 (Contributor Author)

@pacman100 Would this be k_proj, v_proj and w2?

@pacman100 (Contributor)

Yes, the ffn layer should just be w2 as per the IA3 paper.
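
For context, a schematic sketch of where the (IA)^3 feedforward vector acts inside a Mixtral expert, based on the module printout above; l_ff and the helper function are illustrative names, not PEFT's actual implementation:

import torch
import torch.nn.functional as F

def expert_forward_with_ia3(x, w1, w2, w3, l_ff):
    # Mixtral expert: w2(silu(w1(x)) * w3(x)). The (IA)^3 feedforward vector l_ff
    # rescales the intermediate activation, i.e. the input to w2, which is why
    # only w2 needs to be listed as a feedforward target module.
    h = F.silu(w1(x)) * w3(x)
    return w2(l_ff * h)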

@arnavgarg1

Wanted to check what's left here as next steps, @pacman100 @younesbelkada

@younesbelkada left a comment

LGTM! wdyt @pacman100?

@pacman100 left a comment

Hello @arnavgarg1, thank you for adding IA3 target modules for Mixtral! Please see the comment below; once it's addressed, we can merge this.

@@ -115,6 +116,7 @@ def starcoder_model_postprocess_past_key_value(past_key_values):
"gpt_bigcode": ["mlp.c_proj"],
"llama": ["down_proj"],
"mistral": ["down_proj"],
"mixtral": ["w1", "w2", "w3"],

As discussed above, it should only have w2
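
Putting the review feedback together, a hypothetical sketch of how both Mixtral entries would read once the comments are addressed; the constant names are assumed to be PEFT's IA3 mapping dictionaries, and all other model entries are omitted:

# Hypothetical excerpt: Mixtral's IA3 defaults after the review, scaling k_proj,
# v_proj, and only the w2 feedforward projection.
TRANSFORMERS_MODELS_TO_IA3_TARGET_MODULES_MAPPING = {
    "mistral": ["k_proj", "v_proj", "down_proj"],
    "mixtral": ["k_proj", "v_proj", "w2"],
}
TRANSFORMERS_MODELS_TO_IA3_FEEDFORWARD_MODULES_MAPPING = {
    "mistral": ["down_proj"],
    "mixtral": ["w2"],  # must be a subset of the target modules
}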

@arnavgarg1 requested a review from pacman100 on February 7, 2024
@arnavgarg1

Thanks @pacman100! Just updated.

@arnavgarg1

Is it good to merge?

@pacman100

Thank you @arnavgarg1! ✨

@arnavgarg1

Thanks!

@younesbelkada merged commit 83de1af into huggingface:main on Feb 15, 2024
14 checks passed
BenjaminBossan pushed a commit to BenjaminBossan/peft that referenced this pull request on Mar 14, 2024
* Add default LoRA target modules for Mixtral

* Add IA3 modules for Mixtral

* Address comments