
[3D-parallelism] Hybrid Model Parallelism #32074

Merged

Conversation

Contributor

@JZ-LIANG JZ-LIANG commented Apr 5, 2021

PR types

New features

PR changes

APIs

Describe

  • new features

    • Hybrid Model Parallelism:
      • Combine three individual model parallelism strategies (megatron, sharding, pipeline) into one hybrid parallelism strategy
      • A uniform switch to turn each individual parallelism strategy on/off (temporarily using the Sharding & Pipeline configs as the uniform API)
  • performance optimization:

    • speed
      • remove potentially unnecessary sync_calc & sync_comm ops in hybrid model parallelism
    • Memory usage
      • optimizer offload
      • in-place reuse of optimizer temporary vars [commit withdrawn, to be updated in a later PR]
  • performance-related

    • the order of parallelism from inner to outer is: mp --> sharding --> pp
    • mp (megatron) and sharding parallelism introduce heavy communication and are recommended to be used within a node (mp_degree * sharding_degree = number of GPUs per node)
    • pp parallelism has a lighter communication load than the above two, which makes it more suitable for use across nodes (pp_degree = number of nodes); a small sanity-check sketch of this layout follows below
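
A quick sanity check of the recommended 3D layout (a hypothetical helper for illustration, not part of this PR):

        # Hypothetical helper: verify that the chosen degrees match the recommended
        # topology (mp & sharding inside a node, pp across nodes).
        def check_hybrid_degrees(num_nodes, gpus_per_node, mp_degree, sharding_degree, pp_degree):
            total_gpus = num_nodes * gpus_per_node
            # the three degrees together must exactly cover all GPUs
            assert mp_degree * sharding_degree * pp_degree == total_gpus
            # mp and sharding are communication heavy, so keep them within one node
            assert mp_degree * sharding_degree == gpus_per_node
            # pp is lighter on communication, so spread it across nodes
            assert pp_degree == num_nodes

        # 4 nodes x 8 GPUs: mp=2 and sharding=4 inside each node, pp=4 across nodes
        check_hybrid_degrees(num_nodes=4, gpus_per_node=8,
                             mp_degree=2, sharding_degree=4, pp_degree=4)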

Examples

  • assume we have 4 nodes with 8 GPUs per node, i.e. 32 GPUs in total (a full fleet setup sketch follows after these example configs):

  • mp-sharding-pp 3D parallelism

        dist_strategy.sharding = True
        dist_strategy.pipeline = True
        dist_strategy.sharding_configs = {"segment_broadcast_MB": 32,
                                            "pp_degree": 4,
                                            "sharding_degree":4,
                                            "mp_degree": 2,
                                            "optimize_offload": True,
                                            }
        dist_strategy.pipeline_configs = {"schedule_mode": "1F1B",
                                            "micro_batch_size": 1,
                                            "accumulate_steps": 4,
                                            }
  • mp-pp 2D parallelism
        dist_strategy.sharding = True
        dist_strategy.pipeline = True
        dist_strategy.sharding_configs = {"segment_broadcast_MB": 32,
                                            "pp_degree": 4,
                                            "sharding_degree":1,
                                            "mp_degree": 8,
                                            "optimize_offload": True,
                                            }
        dist_strategy.pipeline_configs = {"schedule_mode": "1F1B",
                                            "micro_batch_size": 1,
                                            "accumulate_steps": 4,
                                            }
  • sharding-pp 2D parallelism
        dist_strategy.sharding = True
        dist_strategy.pipeline = True
        dist_strategy.sharding_configs = {"segment_broadcast_MB": 32,
                                            "pp_degree": 4,
                                            "sharding_degree":8,
                                            "mp_degree": 1,
                                            "optimize_offload": True,
                                            }
        dist_strategy.pipeline_configs = {"schedule_mode": "1F1B",
                                            "micro_batch_size": 1,
                                            "accumulate_steps": 4,
                                            }
  • mp-sharding 2D parallelism
        dist_strategy.sharding = True
        dist_strategy.sharding_configs = {"segment_broadcast_MB": 32,
                                            "pp_degree": 1,
                                            "sharding_degree":4,
                                            "mp_degree": 8,
                                            "optimize_offload": False,
                                            "gradient_merge_acc_step": 4,
                                            }
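
For context, a minimal end-to-end sketch of how one of these strategy configs is assumed to plug into a fleet training program (the model construction is a placeholder and not part of this PR; the script is meant to be run under a distributed launcher):

        # Minimal sketch (assumption): static-graph fleet training with the 3D strategy above.
        # Intended to be launched with `python -m paddle.distributed.launch`.
        import paddle
        import paddle.distributed.fleet as fleet

        paddle.enable_static()
        fleet.init(is_collective=True)

        dist_strategy = fleet.DistributedStrategy()
        dist_strategy.sharding = True
        dist_strategy.pipeline = True
        dist_strategy.sharding_configs = {"segment_broadcast_MB": 32,
                                          "pp_degree": 4,
                                          "sharding_degree": 4,
                                          "mp_degree": 2,
                                          "optimize_offload": True}
        dist_strategy.pipeline_configs = {"schedule_mode": "1F1B",
                                          "micro_batch_size": 1,
                                          "accumulate_steps": 4}

        # build_model() is a placeholder for the user's network returning a loss variable.
        # loss = build_model()
        optimizer = paddle.optimizer.AdamW(learning_rate=1e-4)
        optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy)
        # optimizer.minimize(loss)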


paddle-bot-old bot commented Apr 5, 2021

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.


paddle-bot-old bot commented Apr 5, 2021

✅ This PR's description meets the template requirements!
Please wait for other CI results.

@JZ-LIANG JZ-LIANG changed the title [3D-parallelism] Parallelism Switch [3D-parallelism] Hybrid Model Parallelism Apr 6, 2021
optional int32 sharding_degree = 3 [ default = 8 ];
optional int32 mp_degree = 4 [ default = 1 ];
optional string sharding_segment_strategy = 5
optional string sharding_segment_strategy = 1
Contributor

These enum options need comments.

Contributor Author

Recorded; documentation will be added in fluiddoc and fleetx.

Contributor

Comments also need to be added to this code.

optional bool hybrid_dp = 7 [ default = false ];
optional int32 gradient_merge_acc_step = 8 [ default = 1 ];
optional bool optimize_offload = 9 [ default = false ];
optional bool pp_allreduce_in_optimize = 10 [ default = false ];
Contributor

Add some comments: in 3D or 4D parallelism, allreduce_in_optimize=True can reduce communication, while allreduce_in_optimize=False can reduce memory.
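
For illustration, a hedged sketch (not from this PR) of how this flag would be toggled through sharding_configs, reusing the dist_strategy pattern from the examples in the PR description:

        # Illustrative sketch (assumption): pp_allreduce_in_optimize is set through
        # sharding_configs like the other proto fields above.
        dist_strategy.sharding_configs = {"segment_broadcast_MB": 32,
                                          "pp_degree": 4,
                                          "sharding_degree": 8,
                                          "mp_degree": 1,
                                          # True: fold the pp gradient allreduce into the optimize
                                          # stage to reduce communication; False: allreduce right
                                          # after backward to reduce peak memory (per the comment above)
                                          "pp_allreduce_in_optimize": True}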

Contributor Author

Recorded; documentation will be added in fluiddoc, fleetx, and the .py file where the feature is called.

But I think this should remain an internal-project feature for now, so we should not expose it to users?

Contributor

@zhiqiu zhiqiu left a comment

LGTM for backward.py

Contributor

@wangxicoding wangxicoding left a comment


LGTM

@wangxicoding wangxicoding merged commit 1e60a0c into PaddlePaddle:develop Apr 7, 2021