3outeille/transformers backend (Dense model only) #2048
base: main
Conversation
Hi @3outeille! Thank you for your pull request and welcome to our community.

Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged accordingly. If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
wwwjn left a comment:
Thanks for the great work again, left some comments.
    num_layers: int,
    input_weight: int = 1,
    output_weight: int = 1,
    include_rotary_emb: bool = False,
This change is not included in https://github.com/huggingface/torchtitan/pull/1/files. Can you quickly remind me why we need to include the rotary embedding when PP is applied?
And in torchtitan models we make rotary_emb a function, not a module, but it looks like for HF models rotary_emb is a module; is that why this module needs to be included in PP?
That was to address the issue you mentioned here: huggingface#1 (comment)
Can we modify the function signature and add a parameter in pytorch/torchtitan@main/torchtitan/distributed/pipeline_parallel.py#L41 instead of keeping 2 copies? I feel it's very easy for the copies to diverge in the future.
> And in torchtitan models we make rotary_emb a function, not a module, but it looks like for HF models rotary_emb is a module; is that why this module needs to be included in PP?

Yes, exactly!
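For context, a minimal sketch of why such a flag matters, assuming a hypothetical per-stage module-selection helper (the function name `module_names_for_stage` and the module names `tok_embeddings`, `norm`, `output` are illustrative, not torchtitan's or the PR's actual code): when rotary_emb is an nn.Module (HF style) instead of a precomputed-frequencies function (torchtitan style), every pipeline stage that runs attention has to keep it, not just the first stage.

```python
# Minimal sketch; names are illustrative, not the actual torchtitan API.
def module_names_for_stage(
    stage_idx: int,
    num_stages: int,
    num_layers: int,
    include_rotary_emb: bool = False,
) -> list[str]:
    """Top-level module names kept on one pipeline stage."""
    layers_per_stage = num_layers // num_stages
    start = stage_idx * layers_per_stage
    stop = num_layers if stage_idx == num_stages - 1 else start + layers_per_stage
    names = [f"layers.{i}" for i in range(start, stop)]
    if stage_idx == 0:
        names.append("tok_embeddings")   # embeddings only on the first stage
    if stage_idx == num_stages - 1:
        names += ["norm", "output"]      # final norm / lm head only on the last stage
    if include_rotary_emb:
        # HF models compute position embeddings via a rotary_emb nn.Module,
        # so it must live on every stage; torchtitan's functional rotary
        # embedding does not need this.
        names.append("rotary_emb")
    return names

# Example: 4 layers over 2 stages, both stages keep "rotary_emb".
# module_names_for_stage(1, 2, 4, include_rotary_emb=True)
# -> ['layers.2', 'layers.3', 'norm', 'output', 'rotary_emb']
```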
    setattr(model, module_name, None)
    # Replace with Identity or None based on configuration
    replacement = (
        nn.Identity() if use_identity_for_missing_modules else None
Could you quickly remind me why we need to use Identity() here?
I think it's because HF defines their models without guards like if tok_embeddings is None.
I still worry that such identities break DCP and could be the source of PP numerics issues. The concrete question is: when loading from a seed checkpoint, are all the PP ranks restored perfectly?
cc @fegin if you know this definitively.
> The concrete question is: when loading from a seed checkpoint, are all the PP ranks restored perfectly?

It seems like the PP ranks are restored perfectly, because we have a perfect match with Qwen but not with Llama, for example (cf. the screenshot at huggingface#4).
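To make the trade-off in this thread concrete, here is a minimal sketch of the replacement logic being discussed, assuming a hypothetical pruning helper (the name `prune_modules_for_stage`, the `owned` set, and the flag are illustrative, not the PR's exact code): HF forward() code calls submodules unconditionally, so a stage that does not own, say, the embedding cannot simply hold None for it.

```python
import torch.nn as nn

# Sketch only; helper name and flag are illustrative, not the PR's exact code.
def prune_modules_for_stage(
    model: nn.Module,
    owned: set[str],
    use_identity_for_missing_modules: bool = True,
) -> nn.Module:
    """Drop top-level modules that a pipeline stage does not own."""
    for module_name, _ in list(model.named_children()):
        if module_name in owned:
            continue
        # nn.Identity() keeps the attribute present and callable, so HF
        # forward() code does not break; None would be cleaner for DCP /
        # state dicts but needs `if module is None` guards that HF models
        # do not have.
        replacement = nn.Identity() if use_identity_for_missing_modules else None
        setattr(model, module_name, replacement)
    return model
```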
    )

def apply_fsdp(
Reading this function, it is the same as the apply_fsdp function in llama4/parallelize (I know we will keep MoE capability for the next PR). Can we reuse the apply_fsdp function from llama4 and avoid keeping multiple copies?
Oh, I see the difference. The only difference is moe_block = transformer_block.mlp at line 337: in transformers models, the MoE module is named mlp instead of moe. Can we use the same getter/setter approach to rename it in model.py, so we can reuse the apply_fsdp function from llama4?
I don't have a strong opinion on this, but I'm a little concerned that if we have several copies, they will diverge easily in the future.
Valid concern. I'll reuse apply_fsdp from llama3 for now, as this PR handles only dense models. It will make more sense to handle the getter/setter in the MoE PR.
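For the record, a toy sketch of the getter/setter idea for the future MoE PR (the TransformerBlock class below is illustrative, not HF's or the PR's code; only the mlp-vs-moe naming comes from the discussion above): a read-only property lets the same module answer to both names, so a llama4-style apply_fsdp that reads `transformer_block.moe` could be reused unchanged.

```python
import torch.nn as nn

# Illustrative sketch, not the PR's code.
class TransformerBlock(nn.Module):
    def __init__(self, dim: int = 16, hidden: int = 64):
        super().__init__()
        # HF naming: the (MoE) feed-forward module lives under `mlp`.
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    @property
    def moe(self) -> nn.Module:
        # Alias read by a llama4-style apply_fsdp (`transformer_block.moe`).
        # Kept read-only on purpose: a matching setter would be bypassed by
        # nn.Module.__setattr__ for Module values, and fully_shard wraps
        # modules in place, so reassignment should not be needed.
        return self.mlp

    def forward(self, x):
        return x + self.mlp(x)

block = TransformerBlock()
assert block.moe is block.mlp  # both names resolve to the same module
```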
torchtitan/experiments/transformers_backend/tests/integration_tests.py (outdated; resolved)
tianyu-l left a comment:
Please address final comments.
torchtitan/experiments/transformers_backend/tests/integration_tests.py (outdated; resolved)
    setattr(model, module_name, None)
    # Replace with Identity or None based on configuration
    replacement = (
        nn.Identity() if use_identity_for_missing_modules else None
I think it's because HF defines their models without guards like if tok_embeddings is None.
I still worry that such identities break DCP and could be the source of PP numerics issues. The concrete question is: when loading from a seed checkpoint, are all the PP ranks restored perfectly?
cc @fegin if you know this definitively.
It sounds like the changes are caused by the specific way transformers defines its models. Then let's fork the changed functions into experiments/transformers_backend/. I apologize for the back & forth.
But isn't the compromise good enough? Copy-pasting means not noticing changes in pipeline parallel later on.
Context
Reference PR: huggingface#1
This PR enables:
- meta-llama/Llama-3.2-1B
- microsoft/phi-2
- Qwen/Qwen2.5-7B
- mistralai/Mistral-7B-v0.1
- ByteDance-Seed/Seed-Coder-8B-Instruct
- Qwen/Qwen3-4B-Instruct-2507
- arcee-ai/AFM-4.5B
- ibm-granite/granite-3b-code-base-2k
- baidu/ERNIE-4.5-0.3B-Base-PT
- kyutai/helium-1-preview-2b
- allenai/OLMo-7B-hf
- mistralai/Ministral-8B-Instruct-2410 (loss and grad_norm start very high)
Usage
- transformers==4.57.1
- torchtitan/torchtitan/experiments/transformers_backend/configs/qwen3.toml
- LOG_RANK=7 CONFIG_FILE=<YOUR_PATH>/torchtitan/experiments/transformers_backend/configs/qwen3.toml ./run_train.sh --job.custom_config_module=torchtitan.experiments.transformers_backend.job_config --compile.enable
Testing methodology
- Baseline: FSDP=2 vs FSDP=2 & <other //-ism>
- test_hf_integration.py is going to do the following (each nd-parallelism run is diffed against the baseline, cf. diff_baseline_vs_nd_parallelism.log; a sketch of such a diff is below):

    results/
      |_ meta-llama
         |_ Llama-3.2-1B
            |_ debugmodel/
               |_ seed_checkpoint/
                  |_ config.toml
                  |_ seed.slurm
                  |_ step-0/
                     |_ ....
               |_ fsdp2_tp1_cp1_pp1/
                  |_ config.toml
                  |_ nd_parallelism.slurm
                  |_ nd_parallelism.log
               |_ fsdp2_tp2_cp1_pp1/
                  |_ config.toml
                  |_ nd_parallelism.slurm
                  |_ nd_parallelism.log
                  |_ diff_baseline_vs_nd_parallelism.log
               |_ fsdp2_tp1_cp1_pp2/
                  |_ config.toml
                  |_ nd_parallelism.slurm
                  |_ nd_parallelism.log
                  |_ diff_baseline_vs_nd_parallelism.log
               |_ fsdp2_tp1_cp2_pp1/
                  |_ config.toml
                  |_ nd_parallelism.slurm
                  |_ nd_parallelism.log
                  |_ diff_baseline_vs_nd_parallelism.log
               |_ fsdp2_tp1_cp2_pp2/
                  |_ config.toml
                  |_ nd_parallelism.slurm
                  |_ nd_parallelism.log
                  |_ diff_baseline_vs_nd_parallelism.log
            |_ full/
               ...
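For reference, a small sketch of how a diff_baseline_vs_nd_parallelism.log comparison could be computed (the metric-line regex and the log format are assumptions, not the actual test_hf_integration.py logic): pull loss/grad_norm per step from two training logs and report the steps whose values are not bitwise identical as strings.

```python
import re

# Assumed metric line format, e.g. "step: 10  loss: 7.1234  grad_norm: 1.2345";
# the real torchtitan log format may differ.
METRIC_RE = re.compile(r"step:\s*(\d+).*?loss:\s*([\d.eE+-]+).*?grad_norm:\s*([\d.eE+-]+)")

def parse_metrics(log_path: str) -> dict[int, tuple[str, str]]:
    """Map step -> (loss, grad_norm), kept as strings so equality is bitwise."""
    metrics = {}
    with open(log_path) as f:
        for line in f:
            m = METRIC_RE.search(line)
            if m:
                step, loss, grad_norm = m.groups()
                metrics[int(step)] = (loss, grad_norm)
    return metrics

def diff_logs(baseline_log: str, candidate_log: str) -> dict[int, tuple]:
    """Steps where the nd-parallelism run diverges from the FSDP-only baseline."""
    base, cand = parse_metrics(baseline_log), parse_metrics(candidate_log)
    return {s: (base[s], cand[s]) for s in base if s in cand and base[s] != cand[s]}
```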
Further tasks
- build_optimizers_with_moe_load_balancing support for MoE
- FSDP=2 vs FSDP=2 + PP=2: the loss and grad_norm are not bitwise matching (but converging), while they are with the Torchtitan modeling (issue tracked in "Fix pp convergence to be bitwise", huggingface/torchtitan#4)
- import torch._dynamo.config; torch._dynamo.config.cache_size_limit = 128 to avoid graph recomputation when using torch.compile and activation checkpointing (snippet below)
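The last item above can be applied as a two-line snippet early in the training entrypoint (where to place it is a suggestion; the setting itself is quoted from the task list):

```python
# Raise the dynamo graph-cache limit so torch.compile + activation
# checkpointing does not keep recompiling graphs (value from the task above).
import torch._dynamo.config

torch._dynamo.config.cache_size_limit = 128
```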