diff --git a/docs/source/performance/performance_long_sequence.md b/docs/source/performance/performance_long_sequence.md
new file mode 100644
index 000000000000..9dc9c6c52be3
--- /dev/null
+++ b/docs/source/performance/performance_long_sequence.md
@@ -0,0 +1,155 @@
+# Long Sequence Performance
+
+## LLAMA2-7B (FP8)
+
+- The table below shows the pre-training performance of LLAMA2-7B with CP (context parallelism) and compares it against results without CP at various input sequence lengths. For the CP runs, the table lists the detailed model-parallel configuration (tensor, pipeline, data, and context parallel sizes: TP, PP, DP, CP) together with the achieved throughput; an illustrative sketch of such a configuration follows the table. For the non-CP runs, we use the most performant model- and data-parallel configuration that fits within the memory capacity of the H100 GPU system.
+
+  - Container: [NeMo24.03.01.framework](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo/tags)
+  - System: DGX-H100
+
+| SeqLen (K) | # of GPUs | TFLOPS / GPU (without CP) | TP (with CP) | PP (with CP) | DP (with CP) | CP | TFLOPS / GPU (with CP) | Speedup with CP / without CP |
+|---|---|---|---|---|---|---|---|---|
+| 4 | 4 | 768 | 1 | 1 | 4 | 1 | 768 | 1.00 |
+| 8 | 8 | 730 | 1 | 2 | 4 | 1 | 730 | 1.00 |
+| 16 | 16 | 660 | 2 | 1 | 8 | 1 | 660 | 1.00 |
+| 32 | 32 | 595 | 2 | 1 | 8 | 2 | 610 | 1.03 |
+| 64 | 64 | 534 | 4 | 1 | 8 | 2 | 574 | 1.07 |
+| 128 | 128 | 424 | 4 | 1 | 8 | 4 | 555 | 1.31 |
+| 256 | 256 | 392 | 4 | 1 | 8 | 8 | 549 | 1.40 |
+| 512 | 512 | 104 | 8 | 1 | 4 | 16 | 549 | 5.28 |
+| 1024 | 1024 | 26.5 | 8 | 1 | 4 | 32 | 536 | 20.23 |
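+
+To make the parallel sizes in the table concrete, below is a minimal, illustrative sketch of how the 512K-token CP configuration (TP=8, PP=1, CP=16 on 512 GPUs) might be expressed as Hydra-style overrides for a NeMo Megatron GPT pre-training run. The keys shown (`model.tensor_model_parallel_size`, `model.pipeline_model_parallel_size`, `model.context_parallel_size`, `model.encoder_seq_length`) follow NeMo's `megatron_gpt_config.yaml`, but the remaining benchmark hyperparameters (batch sizes, FP8 and recompute settings, etc.) are not listed in this document, so treat this as a sketch rather than the exact recipe behind the numbers above.
+
+```python
+# Illustrative only: parallel-size overrides for the 512K-token row of the table.
+# The data-parallel size (DP=4) is not set explicitly; Megatron-style frameworks
+# derive it as total GPUs / (TP * PP * CP) = 512 / (8 * 1 * 16) = 4.
+overrides = [
+    "trainer.num_nodes=64",                   # 64 DGX-H100 nodes x 8 GPUs = 512 GPUs
+    "trainer.devices=8",                      # GPUs per node
+    "model.tensor_model_parallel_size=8",     # TP
+    "model.pipeline_model_parallel_size=1",   # PP
+    "model.context_parallel_size=16",         # CP
+    "model.encoder_seq_length=524288",        # 512K-token sequences
+    "model.max_position_embeddings=524288",
+]
+print(" ".join(overrides))
+```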
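+
+The speedup column is simply the ratio of the two throughput columns (for example, 549 / 104 ≈ 5.28 at 512K), and in every row the GPU count factors as TP × PP × DP × CP. The short, self-contained check below reproduces both observations from the table data; it is a reading aid only, not part of the benchmark.
+
+```python
+# Reading aid: verify GPUs = TP * PP * DP * CP and recompute the speedup column.
+# Each tuple: (seqlen_k, gpus, tflops_without_cp, tp, pp, dp, cp, tflops_with_cp)
+rows = [
+    (4, 4, 768, 1, 1, 4, 1, 768),
+    (8, 8, 730, 1, 2, 4, 1, 730),
+    (16, 16, 660, 2, 1, 8, 1, 660),
+    (32, 32, 595, 2, 1, 8, 2, 610),
+    (64, 64, 534, 4, 1, 8, 2, 574),
+    (128, 128, 424, 4, 1, 8, 4, 555),
+    (256, 256, 392, 4, 1, 8, 8, 549),
+    (512, 512, 104, 8, 1, 4, 16, 549),
+    (1024, 1024, 26.5, 8, 1, 4, 32, 536),
+]
+for seqlen_k, gpus, without_cp, tp, pp, dp, cp, with_cp in rows:
+    assert gpus == tp * pp * dp * cp, f"parallel sizes do not multiply to {gpus} GPUs"
+    print(f"{seqlen_k:>5}K: {with_cp / without_cp:.2f}x speedup with CP")
+```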