Add revised benchmarking logic and results (#9)
* Revised estimation of batch count, retrieving it directly from len(train_dataloader). Deleted the unused timer_handle argument in Trainer. Revised handling of the "max_seq_len" override in benchmarking. Added support for automatic switching between the lora and full-rank sharding schemes in benchmarking (see the sketch after this list).
* Revised handling of unspecified max_seq_length. Added llama-3 to the benchmark model_list.
* Benchmarking: revised the benchmark script to ensure a consistent per-device train batch size.
* Benchmarking: replaced trainer.step with trainer.train_step to avoid eval overhead in benchmarking. Revised benchmark parsing logic; display the optimal batch size for each context width value.
* Benchmarking: updated reference throughput numbers based on the updated logic.
* Benchmarking: updated reference throughput descriptions.
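For illustration only, here is a minimal sketch of what the batch-count and sharding-scheme changes in the first item might look like. The helper names (batches_per_epoch, select_sharding_scheme) are hypothetical and not taken from the VectorLM codebase.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def batches_per_epoch(train_dataloader: DataLoader) -> int:
    """Hypothetical: read the batch count directly from the dataloader.

    len(dataloader) already accounts for batch size and drop_last, so no
    manual estimate from the dataset length is needed.
    """
    return len(train_dataloader)


def select_sharding_scheme(enable_lora: bool) -> str:
    """Hypothetical: pick a sharding scheme label based on whether LoRA is enabled."""
    return "lora" if enable_lora else "full_rank"


if __name__ == "__main__":
    dataset = TensorDataset(torch.zeros(100, 8))
    loader = DataLoader(dataset, batch_size=8, drop_last=False)
    print(batches_per_epoch(loader))      # 13 batches for 100 samples at batch size 8
    print(select_sharding_scheme(True))   # "lora"
```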
1 parent ce1eaa3 · commit 9045f08 · 9 changed files with 138 additions and 81 deletions.
# Reference Throughput

We've benchmarked VectorLM on the Vaughan cluster for a number of model architectures across a variety of node configurations.
In experiments labelled as LoRA, we set the hidden dimension to 8. During testing, the NVIDIA driver version was 525.105.17; the remaining version numbers of the testing environment are listed below:

```bash
$ pip3 freeze | grep -E "(torch|flash-attn|nvidia)"
flash-attn==2.5.8
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.550.52
nvidia-nccl-cu12==2.19.3
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
torch==2.2.1
```

For each context width and hardware configuration, we experiment with per-device batch sizes of 2, 4, and 8. In the table below, we report the batch size that maximizes training throughput. All values represent the median training throughput in tokens per second across all training steps, aggregated across all GPU devices.

|                                      | Meta-Llama-3-8B (2048) | Meta-Llama-3-8B (4096) | Meta-Llama-3-8B (8192) |
| :----------------------------------- | :--------------------- | :--------------------- | :--------------------- |
| (full_rank) NVIDIA A100-SXM4-80GB x1 | 3550.48 (batch: 8)     | 3461.64 (batch: 4)     | 3204.21 (batch: 2)     |
| (full_rank) NVIDIA A100-SXM4-80GB x2 | 6346.00 (batch: 8)     | 6182.59 (batch: 4)     | 5772.91 (batch: 2)     |
| (full_rank) NVIDIA A100-SXM4-80GB x4 | 12688.44 (batch: 8)    | 12249.74 (batch: 4)    | 11463.46 (batch: 2)    |
| (lora) NVIDIA A100-SXM4-80GB x1      | 4079.28 (batch: 8)     | 3682.15 (batch: 4)     | 3528.93 (batch: 2)     |
| (lora) NVIDIA A100-SXM4-80GB x2      | 7182.97 (batch: 8)     | 6955.58 (batch: 4)     | 6452.96 (batch: 2)     |
| (lora) NVIDIA A100-SXM4-80GB x4      | 14299.47 (batch: 8)    | 13834.43 (batch: 4)    | 12769.23 (batch: 2)    |
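
As a rough illustration of the aggregation described above (not the repository's actual parsing code), the sketch below takes hypothetical per-step throughput numbers, already summed across GPU devices, computes the median across steps, and keeps the best batch size for each context width. All numbers in it are made up.

```python
import statistics

# Hypothetical per-step throughput (tokens/s), already summed across GPU devices,
# keyed by (context_width, per_device_batch_size). Values are made up.
results = {
    (2048, 8): [3540.0, 3552.1, 3559.3],
    (2048, 4): [3100.5, 3088.2, 3121.9],
    (4096, 4): [3461.0, 3465.5, 3458.9],
    (4096, 2): [3200.2, 3190.7, 3210.4],
}

# Median throughput across training steps for each configuration.
medians = {key: statistics.median(steps) for key, steps in results.items()}

# For each context width, keep the batch size that maximizes median throughput.
best = {}
for (ctx, batch), throughput in medians.items():
    if ctx not in best or throughput > best[ctx][1]:
        best[ctx] = (batch, throughput)

for ctx in sorted(best):
    batch, throughput = best[ctx]
    print(f"context {ctx}: {throughput:.2f} tokens/s (batch size {batch})")
```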

The results below were collected under the earlier benchmarking setup: for consistency, we used a batch size of 8 and the maximum context length that the pre-trained LLM supports, capped at 65536. Note that, especially for smaller models, it might be possible to further increase throughput by switching to a larger batch size. Entries that read NaN represent combinations where the node configuration does not have enough GPU memory for the training run to complete. An exception is gemma-2b, which currently does not support full-rank FSDP fine-tuning.

|                                      | Llama-2-13b-hf | Llama-2-7b-hf | Mistral-7B-v0.1 | Mixtral-8x7B-Instruct-v0.1 | gemma-2b | opt-350m |
| :----------------------------------- | -------------: | ------------: | --------------: | -------------------------: | -------: | -------: |
| (full_rank) NVIDIA A100-SXM4-80GB x1 |        424.726 |       570.818 |         528.747 |                        nan |      nan |  780.045 |
| (full_rank) NVIDIA A100-SXM4-80GB x2 |        660.355 |        919.19 |         794.566 |                    275.459 |      nan |  1227.67 |
| (full_rank) NVIDIA A100-SXM4-80GB x4 |         1309.4 |       1744.39 |         1577.09 |                    817.162 |      nan |  2181.46 |
| (full_rank) NVIDIA A40 x1            |            nan |       47.6435 |         107.503 |                        nan |      nan |  666.881 |
| (full_rank) NVIDIA A40 x2            |            nan |       313.074 |         322.624 |                        nan |      nan |  854.672 |
| (full_rank) NVIDIA A40 x4            |         345.96 |       570.977 |         553.658 |                        nan |      nan |  1765.49 |
| (full_rank) Tesla T4 x1              |            nan |           nan |             nan |                        nan |      nan |   475.51 |
| (full_rank) Tesla T4 x2              |            nan |           nan |             nan |                        nan |      nan |  768.008 |
| (full_rank) Tesla T4 x4              |            nan |           nan |             nan |                        nan |      nan |   1383.6 |
| (full_rank) Tesla T4 x8              |            nan |           nan |             nan |                        nan |      nan |  2414.68 |
| (lora) NVIDIA A100-SXM4-80GB x1      |        560.167 |       646.801 |         525.802 |                        nan |  851.678 |  859.379 |
| (lora) NVIDIA A100-SXM4-80GB x2      |        871.993 |       1157.17 |         1105.68 |                    239.431 |  1724.57 |  1463.82 |
| (lora) NVIDIA A100-SXM4-80GB x4      |        1783.53 |       2091.03 |         2150.06 |                    1309.74 |  2719.24 |  2381.01 |
| (lora) NVIDIA A40 x1                 |        272.931 |       435.386 |         336.507 |                        nan |  983.256 |  652.611 |
| (lora) NVIDIA A40 x2                 |        105.442 |       457.183 |         356.263 |                        nan |  725.723 |  1136.17 |
| (lora) NVIDIA A40 x4                 |         543.22 |       715.416 |         642.642 |                        nan |  1302.62 |  1647.57 |
| (lora) Tesla T4 x1                   |            nan |           nan |             nan |                        nan |  148.272 |  571.471 |
| (lora) Tesla T4 x2                   |            nan |       101.126 |         102.859 |                        nan |  256.534 |  811.159 |
| (lora) Tesla T4 x4                   |            nan |       188.575 |         190.127 |                        nan |  495.755 |  1506.05 |
| (lora) Tesla T4 x8                   |        196.709 |       372.375 |         351.361 |                        nan |   897.81 |  2945.86 |

We provide tools for evaluating throughput across different context windows and hardware/model configurations. Refer to the profiling folder in this repository to get started.