long context performance numbers in doc (#10784)
* long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* update the long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Akoumparouli/mcore microbatch calculator fix (#10780)

* move tests/lightning/{,_}io

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* use microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add on_load_checkpoint test to ValidateModelRestoration; use ctx manager to reconfigure microbatch calculator; update save/restore path; add cleanup step at the end

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove unused var

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* remove 8x3b recipes (#10764)

* remove 8x3b recipes

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove 8x3b from test_nemo_run

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* rm from __init__

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* change the figure file name

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Address the reviewer's comment

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* update the y-axis title

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* [🤠]: Howdy folks, let's bump `Dockerfile.ci` to 3f90b98! (#10789)

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Add ModelOpt transformer model pruning example for Llama models, default to llama3.1-8b-base (#10294)

* Add ModelOpt transformer model pruning example for Llama3 model

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* example code is in the wrong dir; move it

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* changes as suggested in comment

remove some logging and unused config code; update example model to llama3.1

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Add pruning of hidden_size into example

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Update examples/nlp/language_modeling/conf/megatron_gpt_prune.yaml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Add pruning test to cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

---------

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: shengliangxu <shengliangxu@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Update mamba.rst after dist ckpt addition (#10800)

Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix chunked infer (#10581)

Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix state transform (#10728)

Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* use ckpt_to_weights_subdir in restore (#10786)

* use ckpt_to_weights_subdir in restore

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* make ckpt_to_{weight,context}_subdir idempotent

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Mixtral set seq_length=4k (#10704)

* enable SP & set seq_lenght=4k

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update test expected values

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* 8x22b 4k

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Fix for crashes with tensorboard_logger=false and VP + LoRA (#10792)

* Fix for crashes with tensorboard_logger=false and virtual pipeline parallel + LoRA

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: vysarge <vysarge@users.noreply.github.com>

---------

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Signed-off-by: vysarge <vysarge@users.noreply.github.com>
Co-authored-by: vysarge <vysarge@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Disable checkpoint conversion inside AutoResume (#10645)

* Disable checkpoint conversion inside AutoResume

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

* Update resume docstrings

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* fix

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* add default finetuning recipe and refactor llama3 8b recipe

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

* address comment

Signed-off-by: Chen Cui <chcui@nvidia.com>

* refactor other recipes

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

* remove 8x3b finetuning recipe for now because HF version not available

Signed-off-by: Chen Cui <chcui@nvidia.com>

* add copyright header

Signed-off-by: Chen Cui <chcui@nvidia.com>

* adjust unit tests based on recipe fixes

Signed-off-by: Chen Cui <chcui@nvidia.com>

* fix failed unit test

Signed-off-by: Chen Cui <chcui@nvidia.com>

---------

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: cuichenx <cuichenx@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* replace png file to github assets

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* change image url to github release

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

---------

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Signed-off-by: vysarge <vysarge@users.noreply.github.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>
Co-authored-by: Shengliang Xu <106840466+shengliangxu@users.noreply.github.com>
Co-authored-by: shengliangxu <shengliangxu@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: Valerie Sarge <vsarge@nvidia.com>
Co-authored-by: vysarge <vysarge@users.noreply.github.com>
Co-authored-by: Hemil Desai <hemild@nvidia.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: cuichenx <cuichenx@users.noreply.github.com>
16 people authored Oct 18, 2024
1 parent ce3b28e commit 5b47a94
Showing 1 changed file with 155 additions and 0 deletions.
docs/source/performance/performance_long_sequence.md
@@ -0,0 +1,155 @@
# Long Sequence Performance

## LLAMA2-7B (FP8)

- The table below shows the pre-training performance of the LLAMA2-7B model with CP (context parallelism) and compares it against results without CP at various input sequence lengths. For the runs with CP, the table lists the detailed model-parallel configurations along with the achieved performance. For the runs without CP, we use the most performant model- and data-parallel configuration that fits within the memory capacity of the H100 GPU system. A configuration sanity-check sketch follows this list.

- Container: [NeMo24.03.01.framework](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo/tags)
- System: DGX-H100
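
As a quick sanity check, in every configuration below the product TP × PP × DP × CP equals the GPU count, and CP splits the input sequence across CP ranks. The snippet below is an illustrative sketch in plain Python (not part of NeMo; `check_config` is a made-up helper) that verifies this for two rows of the table:

```python
# Illustrative sanity check for the parallel configurations in the table.
# Not part of NeMo: it only verifies that TP x PP x DP x CP multiplies out
# to the GPU count and reports the sequence length handled by each CP rank.
def check_config(seq_len_k: int, gpus: int, tp: int, pp: int, dp: int, cp: int) -> None:
    assert tp * pp * dp * cp == gpus, "parallel sizes must multiply to the GPU count"
    tokens_per_cp_rank = seq_len_k * 1024 // cp
    print(f"{seq_len_k}K tokens on {gpus} GPUs: TP={tp} PP={pp} DP={dp} CP={cp}, "
          f"{tokens_per_cp_rank} tokens per CP rank")

check_config(32, 32, tp=2, pp=1, dp=8, cp=2)       # 16384 tokens per CP rank
check_config(1024, 1024, tp=8, pp=1, dp=4, cp=32)  # 32768 tokens per CP rank
```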

<style>
table {
border-collapse: collapse;
}
th {
border: 1px solid;
padding: 5px;
text-align: center; /* Center-align all header cells */
}
td {
border: 1px solid;
padding: 5px;
}
th.top-border {
border-top: 2px solid;
}
td.speedup {
font-weight: bold;
}
</style>


<table>
<thead>
<tr>
<th rowspan="2" class="top-border">SeqLen (K)</th>
<th rowspan="2" class="top-border"># of GPUs</th>
<th rowspan="1" class="top-border">Without CP</th>
<th colspan="5" class="top-border">With CP</th>
<th rowspan="2" class="top-border">Speedup with CP/without CP</th>
</tr>
<tr>
<th>TFLOPS / GPU</th>
<th>TP</th>
<th>PP</th>
<th>DP</th>
<th>CP</th>
<th>TFLOPS / GPU</th>
</tr>
</thead>
<tbody>
<tr>
<td>4</td>
<td>4</td>
<td>768</td>
<td>1</td>
<td>1</td>
<td>4</td>
<td>1</td>
<td>768</td>
<td class="speedup">1.00</td>
</tr>
<tr>
<td>8</td>
<td>8</td>
<td>730</td>
<td>1</td>
<td>2</td>
<td>4</td>
<td>1</td>
<td>730</td>
<td class="speedup">1.00</td>
</tr>
<tr>
<td>16</td>
<td>16</td>
<td>660</td>
<td>2</td>
<td>1</td>
<td>8</td>
<td>1</td>
<td>660</td>
<td class="speedup">1.00</td>
</tr>
<tr>
<td>32</td>
<td>32</td>
<td>595</td>
<td>2</td>
<td>1</td>
<td>8</td>
<td>2</td>
<td>610</td>
<td class="speedup">1.03</td>
</tr>
<tr>
<td>64</td>
<td>64</td>
<td>534</td>
<td>4</td>
<td>1</td>
<td>8</td>
<td>2</td>
<td>574</td>
<td class="speedup">1.07</td>
</tr>
<tr>
<td>128</td>
<td>128</td>
<td>424</td>
<td>4</td>
<td>1</td>
<td>8</td>
<td>4</td>
<td>555</td>
<td class="speedup">1.31</td>
</tr>
<tr>
<td>256</td>
<td>256</td>
<td>392</td>
<td>4</td>
<td>1</td>
<td>8</td>
<td>8</td>
<td>549</td>
<td class="speedup">1.40</td>
</tr>
<tr>
<td>512</td>
<td>512</td>
<td>104</td>
<td>8</td>
<td>1</td>
<td>4</td>
<td>16</td>
<td>549</td>
<td class="speedup">5.28</td>
</tr>
<tr>
<td>1024</td>
<td>1024</td>
<td>26.5</td>
<td>8</td>
<td>1</td>
<td>4</td>
<td>32</td>
<td>536</td>
<td class="speedup">20.23</td>
</tr>
</tbody>
</table>
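
The speedup column is the per-GPU throughput with CP divided by the per-GPU throughput without CP at the same sequence length. A minimal sketch reproducing the column from the table values above:

```python
# Recompute the speedup column: TFLOPS/GPU with CP divided by TFLOPS/GPU
# without CP, using the values copied from the table above.
no_cp   = {4: 768, 8: 730, 16: 660, 32: 595, 64: 534, 128: 424, 256: 392, 512: 104, 1024: 26.5}
with_cp = {4: 768, 8: 730, 16: 660, 32: 610, 64: 574, 128: 555, 256: 549, 512: 549, 1024: 536}

for seq_len_k, baseline in no_cp.items():
    print(f"{seq_len_k:>5}K: {with_cp[seq_len_k] / baseline:.2f}x")
```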


### Speedup of LLAMA2-7B training with CP over training without CP
![cp_speedup_figure](https://github.com/NVIDIA/NeMo/releases/download/r2.0.0rc1/tutorial_cp_speedup_figure.png)
