For data parallel, no extra coding is needed. FastMoE works seamlessly with PyTorch's `DataParallel` or `DistributedDataParallel`.
The only drawback of data parallel is that the number of experts is constrained by each worker's memory.
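As a rough sketch (not taken from FastMoE's own examples), a model containing an FMoE layer can be wrapped in `DistributedDataParallel` like any other PyTorch module. The `FMoETransformerMLP` constructor arguments shown below are assumptions for illustration only:

```python
# Minimal data-parallel sketch: every worker holds a full replica of the experts.
# The fmoe.FMoETransformerMLP argument names used here are assumed, not verified.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
import fmoe

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# All experts live on every worker, so memory per worker bounds the expert count.
moe_layer = fmoe.FMoETransformerMLP(
    num_expert=4,   # experts replicated on each worker (assumed argument name)
    d_model=512,
    d_hidden=2048,
).cuda()

# Gradients of the gate and the experts are averaged across workers,
# exactly as for any ordinary PyTorch module.
model = DDP(moe_layer, device_ids=[local_rank])

x = torch.randn(8, 16, 512, device="cuda")
y = model(x)
```
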
#### Expert Parallel (also called Model Parallel in some previous versions)
In FastMoE's expert parallel mode, the gate network is still replicated on each worker but
experts are placed separately across workers.
Thus, by introducing additional communication cost, FastMoE enjoys a large expert pool whose size is proportional to the number of workers.
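A minimal sketch of the expert parallel setup is given below, assuming the layer constructor accepts a `world_size` argument that shards the experts across workers; the exact argument names are assumptions for illustration:

```python
# Minimal expert-parallel sketch: each worker constructs only its local experts.
# Argument names (num_expert, d_model, d_hidden, world_size) are assumed here.
import torch
import torch.distributed as dist
import fmoe

dist.init_process_group(backend="nccl")
world_size = dist.get_world_size()
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Each worker owns 3 local experts, so the global pool has 3 * world_size
# experts (6 experts for the 2-way case in the figure below).
moe_layer = fmoe.FMoETransformerMLP(
    num_expert=3,              # experts stored on this worker only
    d_model=512,
    d_hidden=2048,
    world_size=world_size,     # experts are sharded across this many workers
).cuda()

# The gate is replicated on every worker, so its gradients still need to be
# synchronized across workers during training, while expert gradients stay local.
x = torch.randn(8, 16, 512, device="cuda")
y = moe_layer(x)
```
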
The following figure shows the forward pass of a 6-expert MoE with 2-way expert parallel. Note that experts 1-3 are located on worker 1 while experts 4-6 are located on worker 2.