diff --git a/docs/source/performance/performance_long_sequence.md b/docs/source/performance/performance_long_sequence.md
new file mode 100644
index 000000000000..9dc9c6c52be3
--- /dev/null
+++ b/docs/source/performance/performance_long_sequence.md
@@ -0,0 +1,155 @@

# Long Sequence Performance

## LLAMA2-7B (FP8)

- The table below shows the pre-training performance of LLAMA2-7B with CP (context parallelism) at various input sequence lengths and compares it against runs without CP. For the CP runs, the table lists the detailed model-parallel configuration (TP: tensor-, PP: pipeline-, DP: data-, CP: context-parallel size) alongside the achieved throughput; each layout uses all GPUs, i.e. TP × PP × DP × CP equals the GPU count (see the configuration sketch after the table). For the non-CP runs, we use the most performant model- and data-parallel configuration that fits within the memory capacity of the H100 GPU system.

  - Container: [NeMo24.03.01.framework](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo/tags)
  - System: DGX-H100
| SeqLen (K) | # of GPUs | TFLOPS / GPU (without CP) | TP | PP | DP | CP | TFLOPS / GPU (with CP) | Speedup (with CP / without CP) |
|-----------:|----------:|--------------------------:|---:|---:|---:|---:|-----------------------:|-------------------------------:|
| 4    | 4    | 768  | 1 | 1 | 4 | 1  | 768 | 1.00  |
| 8    | 8    | 730  | 1 | 2 | 4 | 1  | 730 | 1.00  |
| 16   | 16   | 660  | 2 | 1 | 8 | 1  | 660 | 1.00  |
| 32   | 32   | 595  | 2 | 1 | 8 | 2  | 610 | 1.03  |
| 64   | 64   | 534  | 4 | 1 | 8 | 2  | 574 | 1.07  |
| 128  | 128  | 424  | 4 | 1 | 8 | 4  | 555 | 1.31  |
| 256  | 256  | 392  | 4 | 1 | 8 | 8  | 549 | 1.40  |
| 512  | 512  | 104  | 8 | 1 | 4 | 16 | 549 | 5.28  |
| 1024 | 1024 | 26.5 | 8 | 1 | 4 | 32 | 536 | 20.23 |
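Context parallelism is enabled through the model-parallel sizes in the training configuration, alongside TP and PP. The snippet below is a minimal sketch of the overrides for the 128K-token row above, assuming the `model.tensor_model_parallel_size`, `model.pipeline_model_parallel_size`, and `model.context_parallel_size` fields of the Megatron GPT config shipped with recent NeMo containers; adjust the values to match your own GPU count and sequence length.

```python
from omegaconf import OmegaConf

# Sketch of the parallelism overrides for the 128K-token row of the table above
# (128 GPUs: TP=4, PP=1, CP=4, so DP = 128 / (4 * 1 * 4) = 8).
# Field names assume the Megatron GPT config schema in recent NeMo containers.
overrides = OmegaConf.create(
    {
        "model": {
            "encoder_seq_length": 128 * 1024,       # 128K-token input sequences
            "tensor_model_parallel_size": 4,        # TP
            "pipeline_model_parallel_size": 1,      # PP
            "context_parallel_size": 4,             # CP: shards the sequence dimension
            # DP is not set explicitly; it falls out of the remaining GPUs.
        }
    }
)

print(OmegaConf.to_yaml(overrides))
```

The same fields can typically also be set directly in the training YAML or passed as Hydra command-line overrides.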
### Speedup of LLAMA2-7B training with CP versus without CP

![cp_speedup_figure](https://github.com/NVIDIA/NeMo/releases/download/r2.0.0rc1/tutorial_cp_speedup_figure.png)
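The speedup plotted above is simply the ratio of with-CP to without-CP throughput at each sequence length. As a quick, self-contained check, the sketch below recomputes that column from the table and verifies that every listed layout uses all GPUs (TP × PP × DP × CP = # of GPUs).

```python
# Rows from the table above:
# (seqlen_k, gpus, tflops_without_cp, tp, pp, dp, cp, tflops_with_cp)
rows = [
    (4,    4,    768,  1, 1, 4, 1,  768),
    (8,    8,    730,  1, 2, 4, 1,  730),
    (16,   16,   660,  2, 1, 8, 1,  660),
    (32,   32,   595,  2, 1, 8, 2,  610),
    (64,   64,   534,  4, 1, 8, 2,  574),
    (128,  128,  424,  4, 1, 8, 4,  555),
    (256,  256,  392,  4, 1, 8, 8,  549),
    (512,  512,  104,  8, 1, 4, 16, 549),
    (1024, 1024, 26.5, 8, 1, 4, 32, 536),
]

for seqlen_k, gpus, no_cp, tp, pp, dp, cp, with_cp in rows:
    assert tp * pp * dp * cp == gpus       # every layout uses all GPUs
    speedup = with_cp / no_cp              # e.g. 536 / 26.5 ≈ 20.23 at 1M tokens
    print(f"{seqlen_k:>5}K tokens, {gpus:>4} GPUs: {speedup:.2f}x speedup with CP")
```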