Throughput instrumentation per training step. #7
base: master
Conversation
…nsformers into raviskolli/ort_perf_instrumentation
Can you put an example output in the description?
Good idea! I updated the description with a few sample prints.
```diff
@@ -1302,10 +1310,14 @@ def train(
         )

-        metrics = speed_metrics("train", start_time, self.state.max_steps)
+        ort_end_train_metrics = speed_metrics("train",
```
Looks like these will be calculated for non-ORT runs too? Can we remove the ort_ prefix, or add these under an if?
Thank you for adding the sample output. From the logs:

{'train_runtime': 1586.5542, 'train_samples_per_second': 1.336, 'epoch': 1.0}
100%|██████████| 2120/2120 [26:26<00:00, 1.22s/it]
{'train_runtime': 1426.6043, 'train_samples_per_second': 427.611, 'epoch': 1.0}
100%|██████████| 2120/2120 [26:26<00:00, 1.22s/it]

It is confusing for the user to see the same train_runtime, train_samples_per_second, etc. printed twice with different values. Can we make it clear that the second set of speed metrics excludes the first step, to avoid counting setup overhead? Also, can we include the new metrics in the final logged metrics, since most users will expect to look there for the relevant info?
Print throughput in samples/sec for each training step.
Print the average samples/sec at the end of training, excluding the initial setup step.
Sample prints for each train step:
{'train_step_runtime': 0.6309, 'train_step_samples_per_second': 57.063, 'epoch': 0.98}
98%|█████████▊| 2086/2120 [26:01<00:22, 1.49it/s]
98%|█████████▊| 2087/2120 [26:02<00:22, 1.48it/s]
{'train_step_runtime': 0.641, 'train_step_samples_per_second': 56.162, 'epoch': 0.98}
98%|█████████▊| 2087/2120 [26:02<00:22, 1.48it/s]
98%|█████████▊| 2088/2120 [26:03<00:21, 1.47it/s]
{'train_step_runtime': 0.6355, 'train_step_samples_per_second': 56.652, 'epoch': 0.98}
98%|█████████▊| 2088/2120 [26:03<00:21, 1.47it/s]
99%|█████████▊| 2089/2120 [26:03<00:21, 1.45it/s]
{'train_step_runtime': 0.6715, 'train_step_samples_per_second': 53.608, 'epoch': 0.99}
99%|█████████▊| 2089/2120 [26:03<00:21, 1.45it/s]
99%|█████████▊| 2090/2120 [26:04<00:20, 1.43it/s]
{'train_step_runtime': 0.6717, 'train_step_samples_per_second': 53.592, 'epoch': 0.99}
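The per-step metric shown above could be computed along these lines; this is a minimal sketch, not the actual change in this PR, and names like `step_speed_metrics` and `train_batch_size` are illustrative assumptions rather than the Trainer API:

```python
import time

def step_speed_metrics(step_start_time, train_batch_size):
    """Return runtime and samples/sec for a single training step.

    `step_start_time` is the time.time() captured just before the step;
    `train_batch_size` is the number of samples processed in the step.
    """
    runtime = time.time() - step_start_time
    return {
        "train_step_runtime": round(runtime, 4),
        "train_step_samples_per_second": round(train_batch_size / runtime, 3),
    }

# Hypothetical usage inside a training loop:
# for batch in dataloader:
#     step_start = time.time()
#     ...forward / backward / optimizer step...
#     print(step_speed_metrics(step_start, train_batch_size=36))
```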
Sample prints at the end of training:
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 1586.5542, 'train_samples_per_second': 1.336, 'epoch': 1.0}
100%|██████████| 2120/2120 [26:26<00:00, 1.22s/it]
{'train_runtime': 1426.6043, 'train_samples_per_second': 427.611, 'epoch': 1.0}
100%|██████████| 2120/2120 [26:26<00:00, 1.22s/it]
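The two end-of-training lines above differ because the second average excludes the initial step, whose setup overhead dominates. A hedged sketch of that calculation, assuming per-step runtimes have been recorded (the function name and parameters are illustrative, not from the PR):

```python
def average_throughput(step_runtimes, batch_size, skip_first=1):
    """Average samples/sec over recorded step runtimes.

    With skip_first > 0, the initial step(s) are dropped so one-time setup
    cost (e.g. graph compilation on the first ORT step) does not skew the
    average.
    """
    runtimes = step_runtimes[skip_first:]
    total = sum(runtimes)
    return round(len(runtimes) * batch_size / total, 3)

# A slow first step of 10s drags the naive average down;
# skipping it reports the steady-state throughput instead.
print(average_throughput([10.0, 1.0, 1.0], batch_size=2, skip_first=0))
print(average_throughput([10.0, 1.0, 1.0], batch_size=2, skip_first=1))  # 2.0
```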