Throughput instrumentation per training step. #7

Open
wants to merge 3 commits into master
Conversation

@raviskolli commented Apr 21, 2021

Print throughput in samples/sec for each training step.
Print the average samples/sec at the end of training, excluding the initial setup time.

Sample prints for each train step:
{'train_step_runtime': 0.6309, 'train_step_samples_per_second': 57.063, 'epoch': 0.98}

98%|█████████▊| 2086/2120 [26:01<00:22, 1.49it/s]
98%|█████████▊| 2087/2120 [26:02<00:22, 1.48it/s]

{'train_step_runtime': 0.641, 'train_step_samples_per_second': 56.162, 'epoch': 0.98}

98%|█████████▊| 2087/2120 [26:02<00:22, 1.48it/s]
98%|█████████▊| 2088/2120 [26:03<00:21, 1.47it/s]

{'train_step_runtime': 0.6355, 'train_step_samples_per_second': 56.652, 'epoch': 0.98}

98%|█████████▊| 2088/2120 [26:03<00:21, 1.47it/s]
99%|█████████▊| 2089/2120 [26:03<00:21, 1.45it/s]

{'train_step_runtime': 0.6715, 'train_step_samples_per_second': 53.608, 'epoch': 0.99}

99%|█████████▊| 2089/2120 [26:03<00:21, 1.45it/s]
99%|█████████▊| 2090/2120 [26:04<00:20, 1.43it/s]

{'train_step_runtime': 0.6717, 'train_step_samples_per_second': 53.592, 'epoch': 0.99}

Sample prints at the end of training:
Training completed. Do not forget to share your model on huggingface.co/models =)

{'train_runtime': 1586.5542, 'train_samples_per_second': 1.336, 'epoch': 1.0}

100%|██████████| 2120/2120 [26:26<00:00, 1.22s/it]

{'train_runtime': 1426.6043, 'train_samples_per_second': 427.611, 'epoch': 1.0}

100%|██████████| 2120/2120 [26:26<00:00, 1.22s/it]
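The per-step numbers above come from timing each training step and dividing the batch size by the elapsed time. A minimal sketch of that measurement outside the Trainer (`train_step_metrics`, `step_fn`, and the batch size are illustrative stand-ins, not the PR's actual code):

```python
import time

def train_step_metrics(step_fn, batch_size):
    """Time a single training step and report samples/sec.

    step_fn and batch_size are hypothetical stand-ins; the PR wires
    this kind of timing into the Trainer's inner training loop.
    """
    start = time.perf_counter()
    step_fn()
    runtime = time.perf_counter() - start
    return {
        "train_step_runtime": round(runtime, 4),
        "train_step_samples_per_second": round(batch_size / runtime, 3),
    }

# Example: a dummy step that sleeps ~50 ms, with a batch of 32
metrics = train_step_metrics(lambda: time.sleep(0.05), batch_size=32)
```

This yields a dict in the same shape as the `train_step_runtime` / `train_step_samples_per_second` entries printed above.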

@raviskolli requested a review from ashbhandare on April 21, 2021 18:49
@ashbhandare

Can you put an example output in the description?

@raviskolli (Author)

> Can you put an example output in the description?

Good idea! I updated the description with a few sample prints.

@raviskolli closed this Apr 22, 2021
@raviskolli reopened this Apr 22, 2021
@@ -1302,10 +1310,14 @@ def train(
)

metrics = speed_metrics("train", start_time, self.state.max_steps)
ort_end_train_metrics = speed_metrics("train",


Looks like these will be calculated for non-ort runs too? Can we remove the ort_ prefix, or add these under an if?
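One way to read this suggestion, as a hypothetical sketch (this `speed_metrics` stand-in, the `use_ort` flag, and the numbers are modeled loosely on the diff context, not the actual Trainer API):

```python
import time

def speed_metrics(prefix, start_time, num_samples):
    # Stand-in for the Trainer helper seen in the diff: elapsed wall
    # time plus throughput, keyed by a metric-name prefix.
    runtime = time.time() - start_time
    return {
        f"{prefix}_runtime": round(runtime, 4),
        f"{prefix}_samples_per_second": round(num_samples / runtime, 3),
    }

use_ort = True                    # stand-in for an ORT-enabled flag
start_time = time.time() - 10.0   # pretend training ran for ~10 s

metrics = speed_metrics("train", start_time, num_samples=4096)
if use_ort:
    # Guard the extra measurement so non-ORT runs are unaffected, and
    # name the prefix after what it measures (time excluding setup)
    # rather than after the backend.
    metrics.update(
        speed_metrics("train_post_setup", start_time + 2.0, num_samples=4096)
    )
```

Guarding the second call keeps non-ORT logs unchanged, which addresses the reviewer's first concern; renaming the prefix addresses the second.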

@ashbhandare

ashbhandare commented Apr 22, 2021

Thank you for adding the sample output. From the logs:
`Training completed. Do not forget to share your model on huggingface.co/models =)

{'train_runtime': 1586.5542, 'train_samples_per_second': 1.336, 'epoch': 1.0}

100%|██████████| 2120/2120 [26:26<00:00, 1.22s/it]

{'train_runtime': 1426.6043, 'train_samples_per_second': 427.611, 'epoch': 1.0}

100%|██████████| 2120/2120 [26:26<00:00, 1.22s/it]`

It is confusing for the user to see the same train_runtime and train_samples_per_second being printed twice with different values. Can we make it clear that the second set of speed metrics excludes the first step, to avoid counting setup overhead?

Also, can we add the new metrics to the final logged set, since most users will expect to look there for the relevant info:
***** train metrics *****
epoch = 1.0
init_mem_cpu_alloc_delta = 1MB
init_mem_cpu_peaked_delta = 0MB
init_mem_gpu_alloc_delta = 0MB
init_mem_gpu_peaked_delta = 0MB
train_mem_cpu_alloc_delta = 46MB
train_mem_cpu_peaked_delta = 230MB
train_mem_gpu_alloc_delta = 808MB
train_mem_gpu_peaked_delta = 5793MB
train_runtime = 46.9366
train_samples = 4096
train_samples_per_second = 1.364
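One way to make the exclusion explicit, as a hypothetical sketch (the `_post_warmup` metric names, the helper, and the runtimes below are illustrative, not the PR's actual code):

```python
def throughput_excluding_warmup(step_runtimes, batch_size, warmup_steps=1):
    """Average throughput over steady-state steps only.

    Skips the first warmup_steps entries, which absorb one-time setup
    cost, and names the metrics so the exclusion is self-documenting.
    """
    steady = step_runtimes[warmup_steps:]
    total = sum(steady)
    return {
        "train_runtime_post_warmup": round(total, 4),
        "train_samples_per_second_post_warmup": round(
            len(steady) * batch_size / total, 3
        ),
    }

# First step includes one-time setup; steady steps take ~0.64 s each,
# roughly matching the per-step prints in the PR description.
runtimes = [160.0] + [0.64] * 99
m = throughput_excluding_warmup(runtimes, batch_size=36)
```

Distinct metric names like these would let both the raw and the setup-excluded numbers appear in the final `***** train metrics *****` block without looking like duplicates.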
