Add revised benchmarking logic and results (#9)
* Revised estimation of batch count, retrieving it directly from len(train_dataloader).
Deleted unused timer_handle argument in Trainer.
Revised handling of "max_seq_len" override in benchmarking.
Added support for automatically switching between LoRA and full-rank sharding schemes in benchmarking.

* Revised handling of unspecified max_seq_length.
Added llama-3 to benchmark model_list.

* Benchmarking: Revised benchmark script to ensure consistent per-device train batch size.

* Benchmarking: replaced trainer.step with trainer.train_step to avoid eval overhead in benchmarking.
Revised benchmark parsing logic; display the optimal batch size for each context width.

* Benchmarking: Updated reference throughput based on updated logic.

* Benchmarking: Updated reference throughput descriptions.
jacobthebanana authored May 28, 2024
1 parent ce1eaa3 commit 9045f08
Showing 9 changed files with 138 additions and 81 deletions.
55 changes: 29 additions & 26 deletions docs/reference_throughput.md
@@ -1,33 +1,36 @@
# Reference Throughput

We've benchmarked VectorLM on the Vaughan cluster for a number of model architectures across a variety of node configurations.
In experiments labelled as LoRA, we set hidden dimension to 8. During the testing, the NVIDIA driver version was 525.105.17, CUDA Runtime 12.1.105, and torch 2.2.2.
In experiments labelled as LoRA, we set the hidden dimension to 8. Below are the version numbers of the testing environment:

For consistency, we use a batch size of 8 and the maximum context length that the pre-trained LLM supports, capped at 65536. Note that especially for smaller models, it might be possible to further increase throughput by switching to a larger batch size.
```bash
$ pip3 freeze|grep -E "(torch|flash-attn|nvidia)"
flash-attn==2.5.8
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.550.52
nvidia-nccl-cu12==2.19.3
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
torch==2.2.1
```

Entries that read NaN represent combinations where the node configuration does not have enough GPU memory for the training run to complete. An exception is gemma-2b, which currently does not support full-rank FSDP fine-tuning.
For each context width and hardware configuration, we experiment with per-device batch sizes of 2, 4, and 8. In the table below, we report the batch size that maximizes training throughput. All values in the table represent the median training throughput in tokens/second across all training steps, aggregated across all GPU devices.

All values in the table below represent the median training throughput in tokens per second across all training steps, aggregated across all GPU devices.
| | Meta-Llama-3-8B (context: 2048) | Meta-Llama-3-8B (context: 4096) | Meta-Llama-3-8B (context: 8192) |
| :----------------------------------- | :------------------------------ | :------------------------------ | :------------------------------ |
| (full_rank) NVIDIA A100-SXM4-80GB x1 | 3550.48 (batch: 8) | 3461.64 (batch: 4) | 3204.21 (batch: 2) |
| (full_rank) NVIDIA A100-SXM4-80GB x2 | 6346.00 (batch: 8) | 6182.59 (batch: 4) | 5772.91 (batch: 2) |
| (full_rank) NVIDIA A100-SXM4-80GB x4 | 12688.44 (batch: 8) | 12249.74 (batch: 4) | 11463.46 (batch: 2) |
| (lora) NVIDIA A100-SXM4-80GB x1 | 4079.28 (batch: 8) | 3682.15 (batch: 4) | 3528.93 (batch: 2) |
| (lora) NVIDIA A100-SXM4-80GB x2 | 7182.97 (batch: 8) | 6955.58 (batch: 4) | 6452.96 (batch: 2) |
| (lora) NVIDIA A100-SXM4-80GB x4 | 14299.47 (batch: 8) | 13834.43 (batch: 4) | 12769.23 (batch: 2) |

| | Llama-2-13b-hf | Llama-2-7b-hf | Mistral-7B-v0.1 | Mixtral-8x7B-Instruct-v0.1 | gemma-2b | opt-350m |
| :----------------------------------- | -------------: | ------------: | --------------: | -------------------------: | -------: | -------: |
| (full_rank) NVIDIA A100-SXM4-80GB x1 | 424.726 | 570.818 | 528.747 | nan | nan | 780.045 |
| (full_rank) NVIDIA A100-SXM4-80GB x2 | 660.355 | 919.19 | 794.566 | 275.459 | nan | 1227.67 |
| (full_rank) NVIDIA A100-SXM4-80GB x4 | 1309.4 | 1744.39 | 1577.09 | 817.162 | nan | 2181.46 |
| (full_rank) NVIDIA A40 x1 | nan | 47.6435 | 107.503 | nan | nan | 666.881 |
| (full_rank) NVIDIA A40 x2 | nan | 313.074 | 322.624 | nan | nan | 854.672 |
| (full_rank) NVIDIA A40 x4 | 345.96 | 570.977 | 553.658 | nan | nan | 1765.49 |
| (full_rank) Tesla T4 x1 | nan | nan | nan | nan | nan | 475.51 |
| (full_rank) Tesla T4 x2 | nan | nan | nan | nan | nan | 768.008 |
| (full_rank) Tesla T4 x4 | nan | nan | nan | nan | nan | 1383.6 |
| (full_rank) Tesla T4 x8 | nan | nan | nan | nan | nan | 2414.68 |
| (lora) NVIDIA A100-SXM4-80GB x1 | 560.167 | 646.801 | 525.802 | nan | 851.678 | 859.379 |
| (lora) NVIDIA A100-SXM4-80GB x2 | 871.993 | 1157.17 | 1105.68 | 239.431 | 1724.57 | 1463.82 |
| (lora) NVIDIA A100-SXM4-80GB x4 | 1783.53 | 2091.03 | 2150.06 | 1309.74 | 2719.24 | 2381.01 |
| (lora) NVIDIA A40 x1 | 272.931 | 435.386 | 336.507 | nan | 983.256 | 652.611 |
| (lora) NVIDIA A40 x2 | 105.442 | 457.183 | 356.263 | nan | 725.723 | 1136.17 |
| (lora) NVIDIA A40 x4 | 543.22 | 715.416 | 642.642 | nan | 1302.62 | 1647.57 |
| (lora) Tesla T4 x1 | nan | nan | nan | nan | 148.272 | 571.471 |
| (lora) Tesla T4 x2 | nan | 101.126 | 102.859 | nan | 256.534 | 811.159 |
| (lora) Tesla T4 x4 | nan | 188.575 | 190.127 | nan | 495.755 | 1506.05 |
| (lora) Tesla T4 x8 | 196.709 | 372.375 | 351.361 | nan | 897.81 | 2945.86 |
We provide tools for evaluating throughput across different context windows and hardware/model configurations. Refer to the profiling folder in this repository to get started.
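
The sketch below illustrates how per-step measurements can be reduced to the median tokens/second figures reported above. It assumes a simplified log format and is not the repository's actual parser: token counts are summed across ranks for each step, divided by the step duration, and the median is taken over steps.

```python
# Illustrative sketch only (assumed log format, not the repository's parser):
# reduce per-step logs to the median aggregate throughput in tokens/second.
from statistics import median


def median_throughput(step_logs: dict[int, list[tuple[int, float]]]) -> float:
    """step_logs maps each rank to its list of (num_tokens, seconds) entries."""
    per_step = []
    for entries in zip(*step_logs.values()):  # one tuple of per-rank entries per step
        total_tokens = sum(tokens for tokens, _ in entries)  # sum across ranks
        step_seconds = max(seconds for _, seconds in entries)  # slowest rank bounds the step
        per_step.append(total_tokens / step_seconds)
    return median(per_step)


# Two ranks, three steps each (illustrative numbers).
logs = {
    0: [(8192, 2.1), (8192, 2.0), (8192, 2.2)],
    1: [(8192, 2.0), (8192, 2.1), (8192, 2.2)],
}
print(f"{median_throughput(logs):.2f} tokens/s")
```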
2 changes: 1 addition & 1 deletion profiling/README.md
@@ -13,7 +13,7 @@ $ python3 launch_benchmark.py
# to accept and automatically invoke the commands.
```

After the SLURM jobs complete, profiler output can be found under `data/benchmark`. Invoke the following to generate a Markdown summary of the results:
After the SLURM jobs complete, profiler output can be found under `data/benchmark`. Invoke the following to generate a Markdown summary of the results. If the benchmark results include multiple batch sizes for each (model, context window, hardware) combination, the table lists the batch size that achieves the highest training throughput for that combination.

```bash
$ python3 profiling/parse_benchmark.py --folder data/benchmark
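
As a rough illustration of that selection step (not the actual parse_benchmark.py implementation; the record layout and numbers below are assumed), each (model, context window, hardware) group keeps the run whose throughput is highest:

```python
# Sketch with an assumed record layout; the real logic lives in profiling/parse_benchmark.py.
runs = [
    {"model": "Meta-Llama-3-8B", "context": 4096, "hardware": "A100 x1",
     "batch_size": 2, "throughput": 3100.0},  # illustrative numbers
    {"model": "Meta-Llama-3-8B", "context": 4096, "hardware": "A100 x1",
     "batch_size": 4, "throughput": 3460.0},
]

best: dict[tuple, dict] = {}
for run in runs:
    key = (run["model"], run["context"], run["hardware"])
    # Keep only the highest-throughput run for each combination.
    if key not in best or run["throughput"] > best[key]["throughput"]:
        best[key] = run

for (model, context, hardware), run in best.items():
    print(f"{model} ({context}) on {hardware}: "
          f"{run['throughput']:.2f} (batch: {run['batch_size']})")
```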
60 changes: 35 additions & 25 deletions profiling/benchmark.py
@@ -25,7 +25,6 @@
from vectorlm.utils.model_utils import (
get_lora_model_from_base_model,
get_submodule_by_pattern,
hook_activation_checkpointing,
load_model_and_tokenizer,
shard_model,
)
@@ -67,7 +66,7 @@ def parse_args() -> Namespace:
default=1000,
)
parser.add_argument("--max_length", type=int)
parser.add_argument("--training_batch_size", type=int)
parser.add_argument("--per_device_batch_size", type=int)
return parser.parse_args()


@@ -273,9 +272,26 @@ def load_datasets(self) -> None:

setup(config.train_parameters.output_dir)

if args.training_batch_size is not None:
config.dataset.train_bs = args.training_batch_size
write_metrics("training_batch_size", args.training_batch_size)
training_args = config.train_parameters

# set a seed
set_seed(training_args.seed)

# set CUDA related dependencies
local_rank = int(os.environ["LOCAL_RANK"])
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

if args.per_device_batch_size is not None:
config.dataset.train_bs = args.per_device_batch_size
config.dataset.eval_bs = args.per_device_batch_size

write_metrics("training_batch_size", config.dataset.train_bs)
write_metrics("eval_batch_size", config.dataset.eval_bs)
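# The global batch size is the per-device batch size multiplied by the world size.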
write_metrics(
"training_batch_size_global",
config.dataset.train_bs * world_size,
)

print(f"Writing metrics to {output_path}")
write_metrics("model_name", args.model_name)
@@ -291,16 +307,6 @@ def load_datasets(self) -> None:
repeat=2,
)

training_args = config.train_parameters

# set a seed
set_seed(training_args.seed)

# set CUDA related dependencies
local_rank = int(os.environ["LOCAL_RANK"])
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

with track_time("dist_init"):
print(f"Rank: {rank}, World size: {world_size}")
if dist.is_initialized():
@@ -314,17 +320,18 @@ def load_datasets(self) -> None:

# load model and tokenizer
lora_peft_config = config.train_parameters.get("lora_peft_config")
is_lora_enabled = lora_peft_config is not None

with track_time("model_load"):
model, tokenizer = load_model_and_tokenizer(
args.model_name,
training_args.use_mp,
get_is_flash_attention_supported(),
training_args.max_seq_len,
args.max_length,
local_rank,
training_args.low_cpu_mem_usage,
)
if lora_peft_config is not None:
if is_lora_enabled:
print("Enabling LoRA Wrapper.")
write_metrics("peft_method", "lora")
model = get_lora_model_from_base_model(model, lora_peft_config)
@@ -348,12 +355,9 @@ def load_datasets(self) -> None:
training_args.sharding_strategy,
local_rank,
training_args.low_cpu_mem_usage,
is_lora_enabled=is_lora_enabled,
)

with track_time("set_activation_checkpointing"):
if training_args.use_activation_checkpointing:
hook_activation_checkpointing(model, decoder_layer_module)

# load dataset
with track_time("dataset_load"):
dataset = BenchmarkingDataset(
@@ -364,14 +368,17 @@ def load_datasets(self) -> None:
max_length=args.max_length,
)

print(
f"Sequence length: {dataset.max_length}; "
f"Batch Size (per device): {config.dataset.train_bs}",
)
write_metrics("max_length", dataset.max_length)

# instantiate trainer
trainer = Trainer(
config=training_args,
enable_wandb_logging=config.enable_wandb_logging,
original_dataset_length=dataset.original_length,
timer_handle=track_time,
)

# load optimizer
@@ -412,15 +419,18 @@ def load_datasets(self) -> None:
trainer.model.train()
train_dl_iterator = iter(dataset.train_dataloader)
for _ in tqdm(
range(args.num_train_examples),
range(len(dataset.train_dataloader)),
disable=rank != 0,
file=sys.__stdout__,
):
batch = next(train_dl_iterator)
num_tokens = len(batch["input_ids"].flatten())

with track_time("train_step", {"num_tokens": num_tokens}):
trainer.step(batch, epoch)
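# num_tokens counts tokens on this device only; multiply by world_size to log the
# global token count. train_step skips the eval overhead that trainer.step incurs.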
with track_time(
"train_step",
{"num_tokens": num_tokens * world_size},
):
trainer.train_step(batch, epoch)

profile_handle.step()
write_metrics(
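
The track_time helper used throughout benchmark.py is not shown in this diff. As a rough mental model (an assumption, not the actual implementation), it behaves like a context manager that records the wall-clock duration of the wrapped block together with any extra metadata:

```python
# Assumed sketch of a track_time-style helper; not the repository's implementation.
import json
import time
from contextlib import contextmanager
from typing import Any, Iterator, Optional


@contextmanager
def track_time(name: str, extra: Optional[dict[str, Any]] = None) -> Iterator[None]:
    """Time the wrapped block and append the measurement to a JSONL file."""
    start = time.perf_counter()
    try:
        yield
    finally:
        record = {"name": name, "seconds": time.perf_counter() - start}
        record.update(extra or {})
        with open("benchmark_metrics.jsonl", "a") as f:
            f.write(json.dumps(record) + "\n")


# Usage mirrors benchmark.py above:
with track_time("train_step", {"num_tokens": 8192}):
    time.sleep(0.1)  # stand-in for one training step
```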
1 change: 0 additions & 1 deletion profiling/configs/benchmark.yaml
@@ -6,7 +6,6 @@ wandb_config:

train_parameters:
output_dir: /dev/shm/lora-benchmark
max_seq_len: 128
epochs: 1
seed: 11

1 change: 0 additions & 1 deletion profiling/configs/lora-benchmark.yaml
@@ -6,7 +6,6 @@ wandb_config:

train_parameters:
output_dir: /dev/shm/lora-benchmark
max_seq_len: 128
epochs: 1
seed: 11

24 changes: 13 additions & 11 deletions profiling/launch_benchmark.py
Expand Up @@ -22,12 +22,13 @@
model_list = [
"/model-weights/" + model_name
for model_name in [
"opt-350m",
"gemma-2b",
"Llama-2-7b-hf",
"Llama-2-13b-hf",
"Mistral-7B-v0.1",
"Mixtral-8x7B-Instruct-v0.1",
# "opt-350m",
# "gemma-2b",
# "Llama-2-7b-hf",
"Meta-Llama-3-8B",
# "Llama-2-13b-hf",
# "Mistral-7B-v0.1",
# "Mixtral-8x7B-Instruct-v0.1",
]
]

@@ -37,27 +38,28 @@
]

# Set to (-1) to fall back to the max context length of the pre-trained model.
max_length_list = [1024, 2048, 4096, -1]
batch_size = [8, 16, 32, 64, 128]
max_length_list = [8192, 4096, 2048]
# Per-device batch size for training
per_device_batch_size = [2, 4, 8]

slurm_flags_options = {
"nodes": [1],
"mem-per-gpu": ["16GB"],
"ntasks-per-node": [1],
"cpus-per-gpu": [3],
"gres": [f"gpu:{n}" for n in [1, 2, 4, 8]],
"gres": [f"gpu:{n}" for n in [4, 2, 1]],
"partition": partitions,
}

num_repeats = 2
num_repeats = 1
slurm_flags_extra = {"time": "01:00:00", "qos": qos_selected}

slurm_pos_args_options = [
["profiling/launch_benchmark.sh"],
config_list,
model_list,
max_length_list,
batch_size,
per_device_batch_size,
]
timestamp = int(time.time())

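
The portion of launch_benchmark.py that expands these option lists into job submissions is truncated in this diff. As a hedged sketch (assumed, not the file's actual logic), every combination of SLURM flags and positional arguments maps to one sbatch command; the names in the commented example refer to the lists defined above.

```python
# Assumed sketch: expand the option lists above into one sbatch command per combination.
import itertools


def build_commands(flag_options, extra_flags, pos_args_options):
    flag_keys = list(flag_options)
    for flag_values in itertools.product(*(flag_options[k] for k in flag_keys)):
        flags = [f"--{key}={value}" for key, value in zip(flag_keys, flag_values)]
        flags += [f"--{key}={value}" for key, value in extra_flags.items()]
        for pos_args in itertools.product(*pos_args_options):
            yield " ".join(["sbatch", *flags, *map(str, pos_args)])


# Example usage with the lists defined in launch_benchmark.py:
# for command in build_commands(slurm_flags_options, slurm_flags_extra, slurm_pos_args_options):
#     print(command)
```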
2 changes: 1 addition & 1 deletion profiling/launch_benchmark.sh
@@ -28,7 +28,7 @@ profiling/benchmark.py \
--yaml_path $1 \
--model_name $2 \
--max_length $3 \
--training_batch_size $4
--per_device_batch_size $4

# clean up benchmarking artifacts as ops have requested
rm -rf /dev/shm/lora-benchmark
