
Parallel linear Lora #1092

Merged
6 commits merged into huggingface:main on Nov 30, 2023

Conversation

@zhangsheng377 (Contributor) commented Nov 8, 2023

We implemented the LoRA algorithm for Megatron's distributed layers ColumnParallelLinear and RowParallelLinear.

Because of how Megatron creates its distributed layers, the required Megatron configuration has to be injected before applying LoRA:

from megatron.arguments import core_transformer_config_from_args
from megatron import get_args
from peft import LoraConfig, get_peft_model

# Build the Megatron transformer config from the current command-line arguments
config = core_transformer_config_from_args(get_args())

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=['query_key_value', 'dense', 'dense_h_to_4h', 'dense_4h_to_h'],
    lora_dropout=0.0,
    bias="none",
    megatron_config=config,  # Megatron config needed to construct the parallel LoRA layers
    megatron_core="megatron.core",  # module from which the parallel linear classes are imported
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

It has been verified on both the Megatron and Megatron-DeepSpeed frameworks.

@zhangsheng377 (Contributor, Author) commented Nov 8, 2023

@BenjaminBossan @pacman100

@BenjaminBossan (Member)

Great, thanks for the PR.

Before going through a full review, I have some points/questions:

Could you please provide a bit more context on what users can expect when using this functionality?

Because of how Megatron creates its distributed layers, the required Megatron configuration has to be injected before applying LoRA:

It would be great if we could find a way to avoid that, I can also check later if I have any ideas.

Finally, it would be great to have unit tests for the new feature, or at the very least an example to see it in action.

@zhangsheng377 (Contributor, Author) commented Nov 8, 2023

Great, thanks for the PR.

Before going through a full review, I have some points/questions:

Could you please provide a bit more context on what users can expect when using this functionality?

For LLaMA and other large models, the plain DeepSpeed framework is not easy to use once the model gets too large. We currently train on our cluster with our own modified Megatron-DeepSpeed framework, so the model structure uses Megatron's ParallelLinear layers. Since we want to fine-tune with LoRA in that setting, we extended LoRA to support ParallelLinear.

Because of how Megatron creates its distributed layers, the required Megatron configuration has to be injected before applying LoRA:

It would be great if we could find a way to avoid that, I can also check later if I have any ideas.

Finally, it would be great to have unit tests for the new feature, or at the very least an example to see it in action.

OK, I use this script to fine-tune LLaMA-7B with Alpaca on our own Megatron-DeepSpeed framework:

export CUDA_VISIBLE_DEVICES_=0,1,2,3
export ASCEND_RT_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES_}
ASCEND_RT_VISIBLE_DEVICES_ARRAY=(${CUDA_VISIBLE_DEVICES_//,/ })
echo "${ASCEND_RT_VISIBLE_DEVICES_ARRAY[@]}"

# the number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/home/anaconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
source /home/xxx/Ascend/set_env.sh

GPUS_PER_NODE=${#ASCEND_RT_VISIBLE_DEVICES_ARRAY[@]} 
echo $GPUS_PER_NODE
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

TP=1
PP=1

DATA_PATH=xxx
LOAD_CHECKPOINT_PATH=xxx
SAVE_CHECKPOINT_PATH=xxx
TOKENIZER_PATH=xxx

DS_CONFIG=deepspeed_config_13B_1.json
ZERO_STAGE=2

MICRO_BATCH=4
GRADIENT_ACCUMULATION_STEP=8
GLOBAL_BATCH=$(($MICRO_BATCH * $GRADIENT_ACCUMULATION_STEP * $WORLD_SIZE))
EPOCH=2
TRAIN_ITERS=$((52000 / $GLOBAL_BATCH * $EPOCH))
echo $TRAIN_ITERS
SAVE_INTERVAL=$(($TRAIN_ITERS / 2))
echo $SAVE_INTERVAL

export HCCL_OP_BASE_FFTS_MODE_ENABLE=TRUE

cat <<EOT > $DS_CONFIG
{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 8,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "Adam"
    },

    "zero_optimization": {
        "stage": $ZERO_STAGE,
        "allgather_partitions": true,
        "allgather_bucket_size": 1e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 1e8,
        "contiguous_gradients": true
    },

    "gradient_accumulation_steps": ${GRADIENT_ACCUMULATION_STEP},
    "train_batch_size": $GLOBAL_BATCH,
    "train_micro_batch_size_per_gpu":$MICRO_BATCH,
    "zero_allow_untested_optimizer": true
}
EOT

ds_args=""
ds_args=" --deepspeed ${ds_args}"
ds_args=" --no-pipeline-parallel ${ds_args}"
ds_args=" --deepspeed_config=$DS_CONFIG ${ds_args}"
ds_args=" --zero-stage=$ZERO_STAGE ${ds_args}"
ds_args=" --deepspeed-activation-checkpointing ${ds_args}"

#deepspeed --master_port ${MASTER_PORT} --include localhost:${CUDA_VISIBLE_DEVICES_} pretrain_llama.py \
deepspeed --master_port ${MASTER_PORT} pretrain_llama.py \
       --DDP-impl local \
       --no-contiguous-buffers-in-local-ddp \
       --tensor-model-parallel-size ${TP} \
       --pipeline-model-parallel-size ${PP} \
       --num-layers 32 \
       --hidden-size 4096 \
       --ffn-hidden-size 11008 \
       --num-attention-heads 32 \
       --micro-batch-size $MICRO_BATCH \
       --global-batch-size $GLOBAL_BATCH \
       --seq-length 1024 \
       --max-position-embeddings 2048 \
       --train-iters ${TRAIN_ITERS} \
       --lr-decay-iters ${TRAIN_ITERS} \
       --save $SAVE_CHECKPOINT_PATH \
       --load $LOAD_CHECKPOINT_PATH \
       --data-path $DATA_PATH \
       --tokenizer-name-or-path $TOKENIZER_PATH \
       --tokenizer-not-use-fast \
       --data-impl mmap \
       --split 949,50,1 \
       --distributed-backend nccl \
       --lr 2e-5 \
       --lr-decay-style cosine \
       --min-lr 0 \
       --weight-decay 0. \
       --clip-grad 1.0 \
       --lr-warmup-iters 100 \
       --checkpoint-activations \
       --log-interval 1 \
       --save-interval ${SAVE_INTERVAL} \
       --eval-interval 1000 \
       --eval-iters 10 \
       --use-cpu-initialization \
       --lora-target-modules query_key_value dense gate_proj up_proj down_proj \
       --lora-r 16 \
       --lora-alpha 32 \
       --is-instruction-dataset \
       --seed 42 \
       $ds_args \
       --optimizer fused_adam \
       --fp16 | tee logs/train_7B_deepspeed.log

We compared it with a model that uses torch.nn.Linear for LoRA fine-tuning, and the loss difference is below 0.001 in absolute value.
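As a rough illustration of the kind of tolerance check implied by this comparison (the function and variable names below are ours, not from the PR):

import torch

def losses_match(parallel_lora_losses, linear_lora_losses, atol=1e-3):
    # True if every step's loss from the ParallelLinear-LoRA run differs from the
    # torch.nn.Linear-LoRA run by less than atol in absolute value
    return torch.allclose(
        torch.as_tensor(parallel_lora_losses),
        torch.as_tensor(linear_lora_losses),
        rtol=0.0,
        atol=atol,
    )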

This is the inference result of our model:
[screenshot: inference output]

In fact, if you have the environment, you can run a small model with the Megatron or Megatron-DeepSpeed framework and apply LoRA to the model returned by the get_model() function. Note that both repositories currently need a change to the parent-class initialization in ParallelLinear; the corresponding PRs are:
microsoft/Megatron-DeepSpeed#284
NVIDIA/Megatron-LM#578

@zhangsheng377 (Contributor, Author) commented Nov 10, 2023

@BenjaminBossan
I added a unit test for LoRA on the parallel linear layers.

You can run it from the peft directory:

pytest tests/test_lora_megatron.py

After applying these PRs in my local environment, all the tests passed:
microsoft/Megatron-DeepSpeed#284
NVIDIA/Megatron-LM#578
(You will need to pip install -e . Megatron-LM from source.)

@zhangsheng377 (Contributor, Author)

@BenjaminBossan @pacman100

@zhangsheng377 (Contributor, Author)

@BenjaminBossan @pacman100 Sorry to bother you, but can you help me review it?

@BenjaminBossan (Member) left a comment

Thanks for making some changes to integrate this better with LoRA and adding tests. I still think we need to find a better way to configure how this feature can be used; I made a suggestion in the comments. What do you think?

Furthermore, could you please run make style?

@BenjaminBossan (Member)

Hi @zhangsheng377, I wanted to inform you that we merged #1106, which is a substantial refactor of PEFT and created some merge conflicts for your PR. The most notable change from that PR is that the adapter layers now take a base_layer argument, which is a reference to the original module. This means that you have to slightly rewrite LoraParallelLayer and LoraParallelLinear. Take a look at LoraLayer and the Linear LoRA layer; I hope the change is straightforward. If you have questions, please let me know and I'll help you with the change.
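For readers unfamiliar with the refactor, a schematic, torch-only sketch of the base_layer pattern (this is not PEFT's actual class; it assumes the wrapped module exposes in_features/out_features, and it ignores Megatron-specific details such as the (output, bias) return convention of the parallel layers):

import torch.nn as nn

class SketchLoraLinear(nn.Module):
    # The adapter keeps a reference to the original module and adds a low-rank update on top.
    def __init__(self, base_layer: nn.Module, r: int = 16, lora_alpha: int = 32):
        super().__init__()
        self.base_layer = base_layer  # reference to the original (frozen) layer
        self.lora_A = nn.Linear(base_layer.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base_layer.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # adapter starts as a no-op
        self.scaling = lora_alpha / r

    def forward(self, x):
        # output of the original layer plus the scaled LoRA update
        return self.base_layer(x) + self.lora_B(self.lora_A(x)) * self.scaling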

@zhangsheng377 (Contributor, Author)

Hi @zhangsheng377, I wanted to inform you that we merged #1106, which is a substantial refactor of PEFT and created some merge conflicts for your PR. The most notable change from that PR is that the adapter layers now take a base_layer argument, which is a reference to the original module. This means that you have to slightly rewrite LoraParallelLayer and LoraParallelLinear. Take a look at LoraLayer and the Linear LoRA layer; I hope the change is straightforward. If you have questions, please let me know and I'll help you with the change.

Great, I think I need this base_layer. I'll take a closer look when I have time. (After all, it is already 9 p.m. here.)

@BenjaminBossan (Member) commented Nov 21, 2023

@zhangsheng377 Thanks a lot for your continued work on this PR. The usability looks much nicer now! There is still a merge conflict left in lora/model.py. Don't hesitate to ask if you need help with resolving it.

@zhangsheng377 (Contributor, Author) commented Nov 22, 2023

@zhangsheng377 Thanks a lot for your continued work on this PR. The usability looks much nicer now! There is still a merge conflict left in lora/model.py. Don't hesitate to ask if you need help with resolving it.

@BenjaminBossan Ha, thank you for your concern. In fact, the conflict resolution was completed yesterday, but the changes on main were quite large, so I trained the LLaMA model again to compare the loss and make sure there were no problems before submitting the code.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@zhangsheng377 (Contributor, Author)

@BenjaminBossan Sorry, I forgot to run make style. It has been reformatted now; please trigger the workflows again.

PS: Happy Thanksgiving.

@zhangsheng377 (Contributor, Author)

[screenshot]

@BenjaminBossan Sorry, I neglected to install ruff this morning, so I mistakenly thought the style issue was already fixed.
It should be fine now. However, unlike the workflow, I could not reproduce the config.py problem locally. If it continues to report errors, please help me find where the problem is.

@BenjaminBossan (Member) left a comment

Thank you so much for updating the PR, it looks very good now. I'm not knowledgeable about Megatron, so my review focuses more on the integration with PEFT.

As you can see, I left a couple of comments, please take a look. Mostly, I think we need to make some changes to the config to make saving and loading possible. Also, I think we can simplify the class structure. Finally, I have some questions concerning the forward method.

I could not reproduce the config.py problem locally. If it continues to report errors, please help me find where the problem is

The issue is that the from types import should come before the from typing import.
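For illustration, this is the ordering that isort/ruff expects within the standard-library group (the imported names here are arbitrary examples):

from types import MappingProxyType  # "types" sorts alphabetically before "typing"
from typing import Optional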

PS: Happy Thanksgiving.

Thanks, but where I am, there is no Thanksgiving :) In case you celebrate it, happy Thanksgiving to you too.

Edit: I forgot to mention, it would be really great if we could have an example, or even better an entry in the docs, that shows how to use this feature, maybe highlighting why users should use it. That way, the feature is much easier to discover.

@pacman100 (Contributor) left a comment

Thank you @zhangsheng377 for working on adding support for PEFT in Megatron. Overall, the PR is in great shape to be merged! 🚀

@BenjaminBossan (Member) left a comment

Thanks so much, I think the implementation is much cleaner now and easier to understand. From my point of view, there are only a few details left, otherwise we should be good to merge.

@BenjaminBossan (Member) left a comment

Thanks a lot for this PR. It looks pretty good now, I have a nit but it's no big deal.

Before merging, I just want to discuss the newly added tests. As is, they would not be run by the Github CI at any point, so in theory bugs could be introduced without us noticing. I have no personal experience, but the requirements for running megatron seem to be quite high, so I'm not sure if we could make it run on our GPU runners. I wonder if there is a way we can still make it work.

@zhangsheng377 (Contributor, Author) commented Nov 29, 2023

Thanks a lot for this PR. It looks pretty good now, I have a nit but it's no big deal.

Before merging, I just want to discuss the newly added tests. As is, they would not be run by the Github CI at any point, so in theory bugs could be introduced without us noticing. I have no personal experience, but the requirements for running megatron seem to be quite high, so I'm not sure if we could make it run on our GPU runners. I wonder if there is a way we can still make it work.

Well, I only used one GPU to run the newly added unit test locally, and the resource requirements are not high.
I installed Megatron from source; it does have a lot of dependencies and the installation is a bit troublesome, but it is manageable. It took me less than an afternoon to configure the environment.

Alternatively, you should be able to install Megatron-DeepSpeed; after all, it is the actual backend of our modified framework.

#1092 (comment)

My local test results:
[screenshots: test output]

@BenjaminBossan (Member)

Unfortunately, I did not manage to successfully build APEX and it seems I'm not the only one, judging from all the open issues. Therefore, I couldn't test it locally. If you have a recipe to get this all to run, which we could use for a CI job, that would be great.

@BenjaminBossan (Member)

Heads up, there is a small merge conflict, should be easy to fix.

@zhangsheng377 (Contributor, Author)

Heads up, there is a small merge conflict, should be easy to fix.

Yes, it's done.

@zhangsheng377 (Contributor, Author)

Unfortunately, I did not manage to successfully build APEX and it seems I'm not the only one, judging from all the open issues. Therefore, I couldn't test it locally. If you have a recipe to get this all to run, which we could use for a CI job, that would be great.

My apex version is 0.1, and it seems it can be installed directly via pip.
Can you post the error output so I can take a look?

@BenjaminBossan (Member)

My apex version is 0.1, and it seems it can be installed directly via pip.

Could you tell me how to do that? The apex package on PyPI seems to be unrelated.

What I did is follow the instructions here. Several users reported issues with this. Some suggested solutions included checking out specific tags or commits, but none of those I tried worked for me.

@zhangsheng377 (Contributor, Author)

My apex version is 0.1, and it seems it can be installed directly via pip.

Could you tell me how to do that? The apex package on PyPI seems to be unrelated.

What I did is follow the instructions here. Several users reported issues with this. Some suggested solutions included checking out specific tags or commits, but none of those I tried worked for me.

I don't think I installed it from source. You can try pip install apex directly.

@BenjaminBossan (Member)

I don't think I installed it from source. You can try pip install apex directly.

I think this will install the wrong package. When checking on PyPI, the description says:

Authentication, Form Library, I18N/L10N, Flash Message Template (not associated with Pyramid, a Pylons project)

@zhangsheng377 (Contributor, Author)

Authentication, Form Library, I18N/L10N, Flash Message Template (not associated with Pyramid, a Pylons project)

(xx) [root@localhost peft]# pip install apex
Looking in indexes: http://mirrors.aliyun.com/pypi/simple/
Requirement already satisfied: apex in /root/miniconda3/envs/xx/lib/python3.9/site-packages/apex-0.1-py3.9.egg (0.1)
Requirement already satisfied: packaging>20.6 in /root/miniconda3/envs/xx/lib/python3.9/site-packages (from apex) (23.2)

Maybe you can change the index URL, or specify a version?

@BenjaminBossan (Member)

Looking in indexes: http://mirrors.aliyun.com/pypi/simple/

Interesting. The host is unfortunately insecure (http) and I don't know who this is. Therefore, we cannot use this index for the CI, as it could start hosting malicious code at any point in the future.

Anyway, even if we cannot find a way right now to make the tests work, we can still proceed. Hopefully, we can find a better way in the future, maybe Nvidia manages to provide a package that can be reliably installed soon.

@BenjaminBossan merged commit 2674f5e into huggingface:main on Nov 30, 2023
14 checks passed
@BenjaminBossan (Member)

@zhangsheng377 Thanks so much for this wonderful PR.

BenjaminBossan pushed a commit to BenjaminBossan/peft that referenced this pull request Nov 30, 2023
Adds option to use Megatron's ColumnParallelLinear and RowParallelLinear
for LoRA linear layers, leading to improved performance when using LoRA
with Megatron.
@thincal commented Feb 21, 2024

on our own Megatron-DeepSpeed framework

@zhangsheng377 So can we use the latest https://github.com/microsoft/Megatron-DeepSpeed with this PEFT now?

@zhangsheng377 (Contributor, Author) commented Feb 21, 2024

on our own Megatron-DeepSpeed framework

@zhangsheng377 So can we use the latest https://github.com/microsoft/Megatron-DeepSpeed with this PEFT now?

@thincal Yes, you can.

@thincal commented Feb 21, 2024

Thanks for the information. Another question: when Megatron-DeepSpeed is used, what is the format of the input model and the resulting model for this LoRA fine-tuning with PEFT? @zhangsheng377

@zhangsheng377 (Contributor, Author)

Thanks for the information. Another question: when Megatron-DeepSpeed is used, what is the format of the input model and the resulting model for this LoRA fine-tuning with PEFT? @zhangsheng377

@thincal In Megatron-DeepSpeed, the base model is Megatron's model; you can see my unit test in this PR.

@kota-iizuka commented Jul 4, 2024

@zhangsheng377 Thank you for your great work. I'd like to try your code, so could you show me the pretrain_llama.py that you mentioned in the comment #1092 (comment)? I think it's a script that allows us to specify --lora-target-modules from the command line, but I couldn't find anything after a quick search.

@zhangsheng377 (Contributor, Author) commented Jul 4, 2024

@zhangsheng377 Thank you for your great work. I'd like to try your code, so could you show me the pretrain_llama.py that you mentioned in the comment #1092 (comment)? I think it's a script that allows us to specify --lora-target-modules from the command line, but I couldn't find anything after a quick search.

Haha, you will need to adapt the Megatron code to PEFT yourself. For example, modify the model_provider function in the pretrain_gpt.py file in the Megatron root directory.

By the way, '--lora-target-modules' is a parameter I added to Megatron myself; you can use your own adaptation process.

Or you can have a look at: https://gitee.com/ascend/MindSpeed
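To make the adaptation step more concrete, here is a hypothetical sketch of a LoRA-enabled model_provider; build_base_gpt_model stands in for the upstream model construction in pretrain_gpt.py, the LoraConfig values are the ones from the PR description, and the exact model_provider signature may differ between Megatron versions:

from megatron import get_args
from megatron.arguments import core_transformer_config_from_args
from peft import LoraConfig, get_peft_model

def model_provider(pre_process=True, post_process=True):
    # Build the Megatron GPT model exactly as upstream pretrain_gpt.py does;
    # build_base_gpt_model is a placeholder for that upstream code.
    model = build_base_gpt_model(pre_process, post_process)

    megatron_config = core_transformer_config_from_args(get_args())
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"],
        lora_dropout=0.0,
        bias="none",
        megatron_config=megatron_config,
        megatron_core="megatron.core",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    return model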

@kota-iizuka

@zhangsheng377 Thank you. I added the conversion to the LoRA model at the end of model_provider() in the example https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/finetune_hf_llama, but I got an error when loading the model. This example includes a step that converts the Hugging Face model to the Megatron model, and I think the LoRA model conversion is also required there.

@zhangsheng377 (Contributor, Author)

Thank you. I added the conversion to the LoRA model at the end of model_provider() in the example https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/finetune_hf_llama, but I got an error when loading the model. This example includes a step that converts the Hugging Face model to the Megatron model, and I think the LoRA model conversion is also required there.

Things to note when loading the model: 1. The original HF model may not have LoRA parameters. 2. If you are converting the LoRA adapter, you need to write the conversion code yourself.

@kota-iizuka

If you are converting the LoRA adapter, you need to write the conversion code yourself.

I understand that there is no standard way to load the weights in LoRA training. I will try converting them now. Thank you.
