
Commit 0234127: update readme and inline docstring
1 parent a6d202a

7 files changed, +29 -16 lines

README.md (+11 -8)
@@ -75,35 +75,38 @@ the MLP layer by the `FMoE` layers.
 
 ### Using FastMoE in Parallel
 
-FastMoE supports both data parallel and model parallel.
+FastMoE supports multiple ways of parallel training. See [a comprehensive
+document for parallelism](doc/parallelism) for details. The two simplest
+ways of using FastMoE in parallel are shown below.
 
 #### Data Parallel
 
 In FastMoE's data parallel mode, both the gate and the experts are replicated on each worker.
 The following figure shows the forward pass of a 3-expert MoE with 2-way data parallel.
 
 <p align="center">
-<img src="doc/fastmoe_data_parallel.png" width="600">
+<img src="doc/parallelism/fastmoe_data_parallel.png" width="600">
 </p>
 
 For data parallel, no extra coding is needed. FastMoE works seamlessly with PyTorch's `DataParallel` or `DistributedDataParallel`.
 The only drawback of data parallel is that the number of experts is constrained by each worker's memory.
 
-#### Model Parallel
+#### Expert Parallel (also called Model Parallel in some previous versions)
 
-In FastMoE's model parallel mode, the gate network is still replicated on each worker but
+In FastMoE's expert parallel mode, the gate network is still replicated on each worker but
 experts are placed separately across workers.
 Thus, by introducing additional communication cost, FastMoE enjoys a large expert pool whose size is proportional to the number of workers.
 
 The following figure shows the forward pass of a 6-expert MoE with 2-way model parallel. Note that experts 1-3 are located in worker 1 while experts 4-6 are located in worker 2.
 
 <p align="center">
-<img src="doc/fastmoe_model_parallel.png" width="600">
+<img src="doc/parallelism/fastmoe_expert_parallel.png" width="600">
 </p>
 
-FastMoE's model parallel requires sophisticated parallel strategies that neither PyTorch nor
-Megatron-LM provides. The `fmoe.DistributedGroupedDataParallel` module is
-introduced to replace PyTorch's DDP module.
+FastMoE's expert parallel requires sophisticated parallel strategies that neither
+PyTorch nor Megatron-LM provided when FastMoE was created. The
+`fmoe.DistributedGroupedDataParallel` module is introduced to replace PyTorch's
+DDP module.
 
 #### Faster Performance Features
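The README text added above describes expert parallelism in prose; a minimal sketch of what that setup might look like in code is given below. Only `FMoE` and `fmoe.DistributedGroupedDataParallel` are named in this commit; the constructor arguments (`num_expert`, `d_model`, `world_size`, `top_k`) and the process-group initialization are assumptions and may not match the exact API.

```python
# Hedged sketch of FastMoE's expert parallel mode: each worker owns its own
# slice of the expert pool, and DistributedGroupedDataParallel replaces
# PyTorch's DDP wrapper as the README instructs. Constructor arguments below
# are assumed, not taken from this commit.
import torch
import torch.distributed as dist

from fmoe import FMoE, DistributedGroupedDataParallel

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# With world_size workers and 3 experts per worker, the global expert pool
# holds 3 * world_size experts, which is the "large expert pool" the README
# refers to.
moe_layer = FMoE(
    num_expert=3,                      # experts stored on this worker
    d_model=1024,                      # hidden size of the model
    world_size=dist.get_world_size(),  # experts are sharded across workers
    top_k=2,                           # each token is routed to 2 experts
).cuda()

# Replace torch.nn.parallel.DistributedDataParallel with FastMoE's wrapper.
model = DistributedGroupedDataParallel(moe_layer)
```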

doc/parallelism/README.md (+4)
@@ -0,0 +1,4 @@
+Multi-Dimensional Parallelism Supported by FastMoE
+===
+
+A Chinese version of this document has not been written yet. Until a community contribution provides one, please use Google Translate.
File renamed without changes.
File renamed without changes.

doc/parallelism/parallelism.png (68.7 KB)

doc/readme-cn.md (+10 -8)
@@ -64,7 +64,8 @@ train(model, ...)
 
 ### Using FastMoE in a Distributed Setting
 
-FastMoE supports data parallel and model parallel.
+FastMoE supports multiple ways of parallel training. See the [detailed document on parallelism](doc/parallelism).
+The two easiest ways to use FastMoE in parallel are briefly introduced below.
 
 #### Data Parallel
 
@@ -73,29 +74,30 @@ FastMoE supports data parallel and model parallel.
 The following figure shows the forward computation of an MoE model with three experts under 2-way data parallel.
 
 <p align="center">
-<img src="fastmoe_data_parallel.png" width="600">
+<img src="parallelism/fastmoe_data_parallel.png" width="600">
 </p>
 
 For data parallel, no extra code is needed. FastMoE works seamlessly with both PyTorch's `DataParallel`
 and `DistributedDataParallel` modules. The only drawback of this mode is that
 the number of experts is limited by the memory of a single compute unit (e.g. a GPU).
 
-#### Model Parallel
+#### Expert Parallel (previously also called Model Parallel)
 
-In FastMoE's model parallel mode, the gate network is still replicated on every compute unit,
+In FastMoE's expert parallel mode, the gate network is still replicated on every compute unit,
 but the experts are placed separately on the different compute units. Thus, by introducing
 additional communication operations, FastMoE allows more experts to be trained
 at the same time, with the limit on their number growing with the number of compute units.
 
-The following figure shows a model with six experts trained with 2-way model parallel.
+The following figure shows a model with six experts trained with 2-way expert parallel.
 Note that experts 1-3 are placed on the first compute unit, while experts 4-6 are placed on the second.
 
 <p align="center">
-<img src="fastmoe_model_parallel.png" width="600">
+<img src="parallelism/fastmoe_expert_parallel.png" width="600">
 </p>
 
-FastMoE's model parallel mode requires dedicated parallel strategies that neither PyTorch
-nor Megatron-LM supports. Therefore, the `fmoe.DistributedGroupedDataParallel`
+FastMoE's expert parallel mode requires dedicated parallel strategies that neither PyTorch
+nor Megatron-LM supported (at the time FastMoE was created). Therefore, the
+`fmoe.DistributedGroupedDataParallel`
 module must be used to replace PyTorch's DDP module.
 
 ### How to Train Faster
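For the data parallel mode described in both README versions above, no FastMoE-specific wrapper is needed; a hedged sketch is simply to wrap a model containing `FMoE` layers with PyTorch's `DistributedDataParallel`. The `FMoE` arguments and the device setup below are illustrative assumptions.

```python
# Hedged sketch of the data parallel mode: the gate and all experts are
# replicated on every worker, so plain DistributedDataParallel suffices.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

from fmoe import FMoE

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# world_size=1 keeps a full replica of all experts on each worker, which is
# why the number of experts is bounded by a single worker's memory.
model = FMoE(num_expert=3, d_model=1024, world_size=1, top_k=2).cuda()
model = DDP(model, device_ids=[local_rank])
```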

fmoe/layers.py (+4)
@@ -97,6 +97,10 @@ class FMoE(nn.Module):
 the output. For each worker, FMoE only computes the output of a certain
 slice of the input batch, and will all-gather the outputs after
 computation.
+* `mp_group` is a deprecated alias of `slice_group`.
+* `moe_group` stands for the group of processes that performs expert
+parallelism. The default value `None` means all processes. See the
+parallelism document for more details of the groups.
 * `top_k` stands for the number of experts each token is going to.
 * `gate` is a gate class which can be found in `fmoe.gates`.
 * `expert` can be specified as a module class, it is used to generate
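To make the docstring additions above concrete, here is an illustrative sketch of constructing an `FMoE` layer with the documented process-group arguments. Only `mp_group`, `slice_group`, `moe_group`, `top_k`, `gate`, and `expert` come from the docstring; the gate class, the expert factory, and the remaining arguments are assumptions.

```python
# Illustrative sketch only: shows how the arguments documented in the
# fmoe.layers.FMoE docstring fit together. The expert factory and several
# argument names/defaults are assumed rather than taken from this commit.
import torch.nn as nn
import torch.distributed as dist

from fmoe import FMoE
from fmoe.gates import NaiveGate  # "a gate class which can be found in fmoe.gates"


def make_expert(d_model):
    # Hypothetical expert: a small two-layer MLP per expert.
    return nn.Sequential(
        nn.Linear(d_model, 4 * d_model),
        nn.GELU(),
        nn.Linear(4 * d_model, d_model),
    )


dist.init_process_group(backend="nccl")

layer = FMoE(
    num_expert=4,                     # experts held by this worker
    d_model=512,
    world_size=dist.get_world_size(),
    moe_group=None,                   # None: all processes perform expert parallelism
    slice_group=None,                 # do not slice the input batch
    # mp_group is a deprecated alias of slice_group, so it is not passed here
    top_k=2,                          # each token goes to 2 experts
    gate=NaiveGate,                   # gate class from fmoe.gates
    expert=make_expert,               # used to generate each expert module
)
```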
