From 5b47a94c110e0ce35a652242e752d8977754c8f0 Mon Sep 17 00:00:00 2001 From: Youngeun Kwon Date: Fri, 18 Oct 2024 03:07:20 -0700 Subject: [PATCH] long context performance numbers in doc (#10784) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * long context perf Signed-off-by: Youngeun Kwon * update the long context perf Signed-off-by: Youngeun Kwon * Akoumparouli/mcore microbatch calculator fix (#10780) * move tests/lightning/{,_}io Signed-off-by: Alexandros Koumparoulis * add microbatch calculator context manager Signed-off-by: Alexandros Koumparoulis * use microbatch calculator context manager Signed-off-by: Alexandros Koumparoulis * add on_load_checkpoint test to ValidateModelRestoration; use ctx manager to reconfigure microbatch calculator; update save/restore path; add cleanup step at the end Signed-off-by: Alexandros Koumparoulis * remove unused var Signed-off-by: Alexandros Koumparoulis * fix Signed-off-by: Alexandros Koumparoulis * Apply isort and black reformatting Signed-off-by: akoumpa --------- Signed-off-by: Alexandros Koumparoulis Signed-off-by: akoumpa Co-authored-by: akoumpa Signed-off-by: Youngeun Kwon * remove 8x3b recipes (#10764) * remove 8x3b recipes Signed-off-by: Alexandros Koumparoulis * remove 8x3b from test_nemo_run Signed-off-by: Alexandros Koumparoulis * rm from __init__ Signed-off-by: Alexandros Koumparoulis --------- Signed-off-by: Alexandros Koumparoulis Signed-off-by: Youngeun Kwon * change the figure file name Signed-off-by: Youngeun Kwon * Accommodating the reviewer's comment Signed-off-by: Youngeun Kwon * update the y-axis title Signed-off-by: Youngeun Kwon * [🤠]: Howdy folks, let's bump `Dockerfile.ci` to 3f90b98 ! (#10789) Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com> Signed-off-by: Youngeun Kwon * Add ModelOpt transformer model pruning example for Llama models, default to llama3.1-8b-base (#10294) * Add ModelOpt transformer model pruning example for Llama3 model Signed-off-by: Shengliang Xu * Apply isort and black reformatting Signed-off-by: shengliangxu Signed-off-by: Shengliang Xu * examples code is at wrong dir, move them Signed-off-by: Shengliang Xu * changes as suggested in comment remove some logging and unused config code, update example model to llama3.1 Signed-off-by: Shengliang Xu * Add pruning of hidden_size into example Signed-off-by: Shengliang Xu * Apply isort and black reformatting Signed-off-by: shengliangxu Signed-off-by: Shengliang Xu * Update examples/nlp/language_modeling/conf/megatron_gpt_prune.yaml Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com> * Add pruning test to cicd-main.yml Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com> * Update cicd-main.yml Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com> * Update cicd-main.yml Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com> * Update cicd-main.yml Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com> * Update cicd-main.yml Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com> * Update cicd-main.yml Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com> --------- Signed-off-by: Shengliang Xu Signed-off-by: shengliangxu Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com> Co-authored-by: shengliangxu 
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com> Signed-off-by: Youngeun Kwon * Update mamba.rst after dist ckpt addition (#10800) Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com> Signed-off-by: Youngeun Kwon * fix chunked infer (#10581) Signed-off-by: stevehuang52 Signed-off-by: Youngeun Kwon * fix state transform (#10728) Signed-off-by: Chen Cui Signed-off-by: Youngeun Kwon * use ckpt_to_weights_subdir in restore (#10786) * use ckpt_to_weights_subdir in restore Signed-off-by: Alexandros Koumparoulis * make ckpt_to_{weight,context}_subdir idempotent Signed-off-by: Alexandros Koumparoulis * Apply isort and black reformatting Signed-off-by: akoumpa --------- Signed-off-by: Alexandros Koumparoulis Signed-off-by: akoumpa Co-authored-by: akoumpa Signed-off-by: Youngeun Kwon * Mixtral set seq_length=4k (#10704) * enable SP & set seq_lenght=4k Signed-off-by: Alexandros Koumparoulis * update test expected values Signed-off-by: Alexandros Koumparoulis * 8x22b 4k Signed-off-by: Alexandros Koumparoulis --------- Signed-off-by: Alexandros Koumparoulis Signed-off-by: Youngeun Kwon * Fix for crashes with tensorboard_logger=false and VP + LoRA (#10792) * Fix for crashes with tensorboard_logger=false and virtual pipeline parallel + LoRA Signed-off-by: Valerie Sarge * Apply isort and black reformatting Signed-off-by: vysarge --------- Signed-off-by: Valerie Sarge Signed-off-by: vysarge Co-authored-by: vysarge Signed-off-by: Youngeun Kwon * Disable checkpoint conversion inside AutoResume (#10645) * Disable checkpoint conversion inside AutoResume Signed-off-by: Hemil Desai * Apply isort and black reformatting Signed-off-by: hemildesai * Update resume docstrings Signed-off-by: Hemil Desai * fix Signed-off-by: Hemil Desai * add default finetuning recipe and refactor llama3 8b recipe Signed-off-by: Chen Cui * Apply isort and black reformatting Signed-off-by: cuichenx * address comment Signed-off-by: Chen Cui * refactor other recipes Signed-off-by: Chen Cui * Apply isort and black reformatting Signed-off-by: cuichenx * remove 8x3b finetuning recipe for now because HF version not available Signed-off-by: Chen Cui * add copyright header Signed-off-by: Chen Cui * adjust unit tests based on recipe fixes Signed-off-by: Chen Cui * fix failed unit test Signed-off-by: Chen Cui --------- Signed-off-by: Hemil Desai Signed-off-by: hemildesai Signed-off-by: Chen Cui Signed-off-by: cuichenx Co-authored-by: hemildesai Co-authored-by: Chen Cui Co-authored-by: cuichenx Signed-off-by: Youngeun Kwon * replace png file to github assets Signed-off-by: Youngeun Kwon * change image url to github release Signed-off-by: Youngeun Kwon --------- Signed-off-by: Youngeun Kwon Signed-off-by: Alexandros Koumparoulis Signed-off-by: akoumpa Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Signed-off-by: Shengliang Xu Signed-off-by: shengliangxu Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com> Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com> Signed-off-by: stevehuang52 Signed-off-by: Chen Cui Signed-off-by: Valerie Sarge Signed-off-by: vysarge Signed-off-by: Hemil Desai Signed-off-by: hemildesai Signed-off-by: cuichenx Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com> Co-authored-by: akoumpa Co-authored-by: oliver könig Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com> Co-authored-by: Shengliang Xu 
<106840466+shengliangxu@users.noreply.github.com> Co-authored-by: shengliangxu Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com> Co-authored-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com> Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> Co-authored-by: Chen Cui Co-authored-by: Valerie Sarge Co-authored-by: vysarge Co-authored-by: Hemil Desai Co-authored-by: hemildesai Co-authored-by: cuichenx
---
 .../performance/performance_long_sequence.md | 155 ++++++++++++++++++
 1 file changed, 155 insertions(+)
 create mode 100644 docs/source/performance/performance_long_sequence.md

diff --git a/docs/source/performance/performance_long_sequence.md b/docs/source/performance/performance_long_sequence.md
new file mode 100644
index 000000000000..9dc9c6c52be3
--- /dev/null
+++ b/docs/source/performance/performance_long_sequence.md
@@ -0,0 +1,155 @@

# Long Sequence Performance

## LLAMA2-7B (FP8)

- The table below shows the pre-training performance of LLAMA2-7B with context parallelism (CP) at various input sequence lengths and compares it against runs without CP. For each CP run, the table lists the model-parallel configuration (TP, PP, DP, CP) and the achieved throughput. Each non-CP run uses the most performant model- and data-parallel configuration that fits within the memory capacity of the H100 GPU system. A minimal sketch of how such a parallelism configuration can be expressed follows the results below.

  - Container: [NeMo24.03.01.framework](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo/tags)
  - System: DGX-H100
| SeqLen (K) | # of GPUs | TFLOPS / GPU (without CP) | TP | PP | DP | CP | TFLOPS / GPU (with CP) | Speedup (with CP / without CP) |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 4    | 4    | 768  | 1 | 1 | 4 | 1  | 768 | 1.00  |
| 8    | 8    | 730  | 1 | 2 | 4 | 1  | 730 | 1.00  |
| 16   | 16   | 660  | 2 | 1 | 8 | 1  | 660 | 1.00  |
| 32   | 32   | 595  | 2 | 1 | 8 | 2  | 610 | 1.03  |
| 64   | 64   | 534  | 4 | 1 | 8 | 2  | 574 | 1.07  |
| 128  | 128  | 424  | 4 | 1 | 8 | 4  | 555 | 1.31  |
| 256  | 256  | 392  | 4 | 1 | 8 | 8  | 549 | 1.40  |
| 512  | 512  | 104  | 8 | 1 | 4 | 16 | 549 | 5.28  |
| 1024 | 1024 | 26.5 | 8 | 1 | 4 | 32 | 536 | 20.23 |
### Speedup of LLAMA2 7B training with CP over without CP

![cp_speedup_figure](https://github.com/NVIDIA/NeMo/releases/download/r2.0.0rc1/tutorial_cp_speedup_figure.png)
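
As a rough illustration of how the parallelism sizes in the table can be expressed, the sketch below maps the 128K-sequence-length row (TP=4, PP=1, DP=8, CP=4 on 128 GPUs) onto NeMo 2.0 parallelism settings. This is a minimal, hypothetical sketch rather than the exact recipe used to produce the numbers above; the trainer layout, precision plugin, and the choice to enable sequence parallelism are assumptions, and the FP8 recipe details are omitted.

```python
# Illustrative sketch only -- not the exact benchmark recipe used for the
# numbers above. It maps the 128K-token row of the table (TP=4, PP=1,
# DP=8, CP=4 on 128 H100 GPUs) onto NeMo 2.0 parallelism settings.
from nemo import lightning as nl

strategy = nl.MegatronStrategy(
    tensor_model_parallel_size=4,    # TP
    pipeline_model_parallel_size=1,  # PP
    context_parallel_size=4,         # CP: shards the sequence dimension across GPUs
    sequence_parallel=True,          # assumption: commonly enabled when TP > 1
)

trainer = nl.Trainer(
    devices=8,                       # GPUs per node
    num_nodes=16,                    # 8 x 16 = 128 GPUs; DP = 128 / (TP * PP * CP) = 8
    strategy=strategy,
    plugins=nl.MegatronMixedPrecision(precision="bf16-mixed"),  # FP8 configuration omitted
)
```

The data-parallel size is not set directly: it is derived from the total number of GPUs divided by the product of the TP, PP, and CP sizes, which is how the DP column in the table relates to the other parallelism settings.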