This utility is used to measure throughput and other improvements obtained when using `fms-acceleration` plugins.
- `benchmark.py`: main benchmark script.
- `scenarios.yaml`: `sft_trainer.py` arguments organized into different scenarios.
  - Each `scenario` may apply to one or more `AccelerationFramework` sample configurations. These are the critical arguments needed for correct operation.
  - See the section on benchmark scenarios for more details.
- `defaults.yaml`: `sft_trainer.py` arguments that may be used in addition to `scenarios.yaml`. These are the non-critical arguments that will not affect plugin operation.
- `accelerate.yaml`: configurations required by `accelerate launch` for multi-gpu benchmarks.
An example of a scenario for `accelerated-peft-gptq` is given as follows:

```yaml
scenarios:
  # benchmark scenario for accelerated peft using AutoGPTQ triton v2
  - name: accelerated-peft-gptq
    framework_config:
      # one or more framework configurations that fall within the scenario group.
      # - each entry points to a shortname in CONTENTS.yaml
      - accelerated-peft-autogptq

    # sft_trainer.py arguments critical for correct plugin operation
    arguments:
      fp16: True
      learning_rate: 2e-4
      torch_dtype: float16
      peft_method: lora
      r: 16
      lora_alpha: 16
      lora_dropout: 0.0
      target_modules: "q_proj k_proj v_proj o_proj"
      model_name_or_path:
        - 'mistralai/Mistral-7B-v0.1'
        - 'mistralai/Mixtral-8x7B-Instruct-v0.1'
        - 'NousResearch/Llama-2-70b-hf'
```
A scenario has the following key components:
- `framework_config`: points to one or more acceleration configurations.
  - This is a list of sample config `shortname`s; each entry points to a shortname in `CONTENTS.yaml`.
  - Each `shortname` is run as a different bench.
- `arguments`: the critical `sft_trainer.py` arguments that need to be passed in alongside `framework_config` to ensure correct operation.
  - `model_name_or_path` is a list, and the bench will enumerate all of them (see the sketch after this list).
  - NOTE: a `plugin` may not work with arbitrary models. This depends on the plugin's setting of `AccelerationPlugin.restricted_model_archs`.
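For illustration, here is a minimal sketch of how a scenario entry could be expanded into individual bench runs. The field names follow the YAML above, but the `enumerate_runs` helper is hypothetical and not the actual `benchmark.py` logic.

```python
from itertools import product

# Hypothetical helper (not the actual benchmark.py implementation): expand a
# scenario dict shaped like the YAML above into one bench run per
# (framework_config shortname, model) pair.
def enumerate_runs(scenario: dict):
    arguments = dict(scenario["arguments"])
    models = arguments.pop("model_name_or_path")
    for shortname, model in product(scenario["framework_config"], models):
        yield {
            "framework_config": shortname,  # shortname in CONTENTS.yaml
            "model_name_or_path": model,    # one bench per model in the list
            **arguments,                    # critical sft_trainer.py arguments
        }

# Example usage with a trimmed-down scenario:
scenario = {
    "name": "accelerated-peft-gptq",
    "framework_config": ["accelerated-peft-autogptq"],
    "arguments": {
        "fp16": True,
        "peft_method": "lora",
        "model_name_or_path": ["mistralai/Mistral-7B-v0.1"],
    },
}
for run in enumerate_runs(scenario):
    print(run)
```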
The best way to run the benchmarks is via `tox`, which manages the dependencies, including installing the correct version of `fms-hf-tuning`:
- install `setup_requirements.txt` to get `tox`: `pip install -r setup_requirements.txt`
- run a small representative set of benches: `tox -e run-benches`
- run the full set of benches for both the 1 and 2 GPU cases: `tox -e run-benches -- "1 2"`

Note: the `tox` command above accepts the environment variables `DRY_RUN`, `NO_DATA_PROCESSING` and `NO_OVERWRITE`. See `scripts/run_benchmarks.sh`.
The convenience script `run_benchmarks.sh` configures and runs `benchmark.py`; the command is:

```
bash run_benchmarks.sh NUM_GPUS_MATRIX EFFECTIVE_BS_MATRIX RESULT_DIR SCENARIOS_CONFIG SCENARIOS_FILTER
```
where:
- `NUM_GPUS_MATRIX`: list of `num_gpu` settings to bench for, e.g. `"1 2"` will bench for 1 and 2 gpus.
- `EFFECTIVE_BS_MATRIX`: list of effective batch sizes, e.g. `"4 8"` will bench for effective batch sizes 4 and 8.
- `RESULT_DIR`: where the benchmark results will be placed.
- `SCENARIOS_CONFIG`: the `scenarios.yaml` file.
- `SCENARIOS_FILTER`: specify to run only a specific `scenario` by providing its name.
The recommended way to run `run_benchmarks.sh` is via `tox`, which handles the dependencies:

```
tox -e run-benches -- NUM_GPUS_MATRIX EFFECTIVE_BS_MATRIX RESULT_DIR SCENARIOS_CONFIG SCENARIOS_FILTER
```

Alternatively, run `benchmark.py` directly. To see the help, run:

```
python benchmark.py --help
```
Note:
- in `run_benchmarks.sh` we will clear the `RESULT_DIR` if it exists, to avoid contamination with old results. To protect against overwriting, always run with `NO_OVERWRITE=true`.
There are 2 ways to benchmark memory in `run_benchmarks.sh`:
- Setting the environment variable `MEMORY_LOGGING=nvidia` will use Nvidia's `nvidia-smi` API.
- Setting the environment variable `MEMORY_LOGGING=huggingface` (default) will use HuggingFace's `HFTrainer` API.

Both approaches will print the memory values to the benchmark report:
- For Nvidia, the result column will be `nvidia_mem_reserved`.
- For Torch/HF, the result columns will be `peak_torch_mem_alloc_in_bytes` and `torch_mem_alloc_in_bytes`.
`nvidia-smi` is a command line utility (CLI) based on the Nvidia Management Library (NVML). A separate process call is used to start, log and finally terminate the CLI for every experiment.

The keyword `memory.used` is passed to the `--query-gpu` argument to log the memory usage at some interval. The list of keywords that can be logged can be referenced from `nvidia-smi --help-query-gpu`.

Since it runs in a separate process, it is less likely to affect the training. However, it is a coarser approach than HF, as NVML's definition of used memory takes the sum of (memory allocated + memory reserved). Refer to their documentation here.
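For illustration only (this is not the exact logger used by the bench), polling `nvidia-smi` from a separate process might look like the sketch below; the log file name and polling interval are arbitrary choices.

```python
import subprocess

# Minimal sketch: run nvidia-smi in its own process so polling does not touch
# the training process. The file name and interval are illustrative only.
def start_gpu_memory_logger(logfile: str = "gpu_mem.csv", interval_s: int = 1):
    outfile = open(logfile, "w")
    proc = subprocess.Popen(
        [
            "nvidia-smi",
            "--query-gpu=index,memory.used",  # keywords accepted by --query-gpu
            "--format=csv,noheader,nounits",  # plain MiB numbers, one row per GPU
            f"--loop={interval_s}",           # re-query every interval_s seconds
        ],
        stdout=outfile,
    )
    return proc, outfile

# Start before the experiment, terminate after it finishes:
proc, outfile = start_gpu_memory_logger()
# ... run the experiment ...
proc.terminate()
outfile.close()
```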
After every experiment (sketched below):
- the logged values are calibrated to remove any existing foreign memory values
- the peak values for each gpu device are taken
- the values are finally averaged across all devices.
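A minimal sketch of that reduction, assuming `readings` maps each GPU index to its list of logged `memory.used` values (MiB) and `baseline` holds the pre-experiment (foreign) memory per device:

```python
# Calibrate, take per-device peaks, then average across devices.
def reduce_gpu_memory(readings: dict, baseline: dict) -> float:
    peaks = []
    for gpu, values in readings.items():
        calibrated = [v - baseline.get(gpu, 0) for v in values]  # remove foreign memory
        peaks.append(max(calibrated))                            # peak for this device
    return sum(peaks) / len(peaks)                               # average across devices

# e.g. one GPU with 800 MiB already in use before the experiment:
print(reduce_gpu_memory({0: [1000, 5000, 4200]}, {0: 800}))  # -> 4200.0
```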
HFTrainer has a feature to log memory through the `skip_memory_metrics=False` training argument. Their documentation mentions that setting this argument to `False` will affect training speed. In our tests so far (below), we do not see a significant difference in throughput (tokens/sec) when using this argument.
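For reference, enabling this in a plain HuggingFace setup only requires passing the flag; everything else below (output directory, batch size, the commented-out `Trainer` wiring) is placeholder.

```python
from transformers import TrainingArguments, Trainer

# Enable HFTrainer's memory probes; the other values are placeholder settings.
args = TrainingArguments(
    output_dir="./results",         # placeholder output directory
    per_device_train_batch_size=4,
    num_train_epochs=1,
    skip_memory_metrics=False,      # turn on memory logging (default is True)
)

# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# metrics = trainer.train().metrics  # includes the *_mem_* keys shown below
```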
The HFTrainer API is more granular than `nvidia-smi`, as it uses `torch.cuda` to pinpoint memory usage inside the trainer:
- It reports the allocated memory by calling `torch.cuda.memory_allocated()` and `torch.cuda.max_memory_allocated()` inside its probes (see the simplified probe sketch after this list).
- It has memory logging probes at different stages of the Trainer: `init`, `train`, `evaluate`, `predict`.
- When in distributed mode, the Trainer will only log the rank 0 memory.
- For stability purposes, it only tracks the outer level of the `train`, `evaluate` and `predict` methods, i.e. if `evaluate` is called during `train`, there won't be a nested invocation of the memory probe.
- Any GPU memory incurred outside of the defined Trainer stages won't be tracked.
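A simplified sketch of what such a probe boils down to is shown below; the real implementation (`TrainerMemoryTracker` in `transformers`) is more involved, and the toy workload here is only for illustration.

```python
import torch

# Simplified probe around one stage: measure the net allocation delta and the
# transient peak above the end-of-stage allocation, using torch.cuda counters.
def probe_stage(run_stage):
    torch.cuda.reset_peak_memory_stats()
    mem_at_start = torch.cuda.memory_allocated()
    run_stage()                                        # e.g. the stage being tracked
    mem_at_end = torch.cuda.memory_allocated()
    alloc_delta = mem_at_end - mem_at_start            # net allocation for the stage
    peaked_delta = max(0, torch.cuda.max_memory_allocated() - mem_at_end)
    return alloc_delta, peaked_delta

if torch.cuda.is_available():
    print(probe_stage(lambda: torch.randn(1024, 1024, device="cuda")))
```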
This is an example of the memory values that HFTrainer will produce in the outputs of `train()`:

```python
output_metrics = {
    'train_runtime': 191.2491,
    'train_samples_per_second': 0.209,
    'train_steps_per_second': 0.052,
    'train_tokens_per_second': 428.342,
    'train_loss': 1.0627506256103516,
    'init_mem_cpu_alloc_delta': 4096,
    'init_mem_gpu_alloc_delta': 0,
    'init_mem_cpu_peaked_delta': 0,
    'init_mem_gpu_peaked_delta': 0,
    'train_mem_cpu_alloc_delta': 839086080,
    'train_mem_gpu_alloc_delta': -17491768832,
    'train_mem_cpu_peaked_delta': 0,
    'train_mem_gpu_peaked_delta': 26747825664,
    'before_init_mem_cpu': 5513297920,
    'before_init_mem_gpu': 36141687296,
    'epoch': 0.01
}
```
We refer to the keys of the memory metrics in this order:
- `before_init_mem_X` as stage0
- `init_mem_X` as stage1
- `train_mem_X` as stage2
- ...
We currently compute the memory values in the report by taking the largest of sums (a runnable sketch follows the formulas below). For example:

For the allocated memory value:
```
max([
    stage0_mem,
    stage0_mem + stage1_allocated_delta,
    stage0_mem + stage1_allocated_delta + stage2_allocated_delta,
    ...
])
```

For the peak memory value:
```
max([
    stage0_mem,
    stage0_mem + stage1_allocated_delta + stage1_peaked_delta,
    stage0_mem + stage1_allocated_delta + stage2_allocated_delta + stage2_peaked_delta,
    ...
])
```
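To make the rule concrete, here is an illustrative sketch over the GPU keys of the `output_metrics` example above; `hf_peak_and_alloc_mem` is a hypothetical helper, not the actual `benchmark.py` code.

```python
# Illustrative "largest of sums" over the HFTrainer GPU memory metrics.
# Stage order follows the document: before_init (stage0), init (stage1), train (stage2).
def hf_peak_and_alloc_mem(metrics: dict, stages=("init", "train")):
    stage0 = metrics["before_init_mem_gpu"]            # memory before Trainer init
    alloc_candidates, peak_candidates = [stage0], [stage0]
    running = stage0
    for stage in stages:
        running += metrics.get(f"{stage}_mem_gpu_alloc_delta", 0)   # cumulative allocation
        alloc_candidates.append(running)
        peak_candidates.append(running + metrics.get(f"{stage}_mem_gpu_peaked_delta", 0))
    return max(peak_candidates), max(alloc_candidates)

# Usage with the example metrics shown earlier:
# peak, alloc = hf_peak_and_alloc_mem(output_metrics)
```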
We compare memory values between Nvidia-SMI and Torch in this PR - Memory Benchmarking.