Throughput instrumentation per training step. #7

Open
wants to merge 3 commits into master
Conversation

@raviskolli commented Apr 21, 2021

Print throughput in samples/sec for each training step.
Print the average samples/sec at the end of training, excluding the initial setup time.

Sample prints for each train step:
{'train_step_runtime': 0.6309, 'train_step_samples_per_second': 57.063, 'epoch': 0.98}

98%|█████████▊| 2086/2120 [26:01<00:22, 1.49it/s]
98%|█████████▊| 2087/2120 [26:02<00:22, 1.48it/s]

{'train_step_runtime': 0.641, 'train_step_samples_per_second': 56.162, 'epoch': 0.98}

98%|█████████▊| 2087/2120 [26:02<00:22, 1.48it/s]
98%|█████████▊| 2088/2120 [26:03<00:21, 1.47it/s]

{'train_step_runtime': 0.6355, 'train_step_samples_per_second': 56.652, 'epoch': 0.98}

98%|█████████▊| 2088/2120 [26:03<00:21, 1.47it/s]
99%|█████████▊| 2089/2120 [26:03<00:21, 1.45it/s]

{'train_step_runtime': 0.6715, 'train_step_samples_per_second': 53.608, 'epoch': 0.99}

99%|█████████▊| 2089/2120 [26:03<00:21, 1.45it/s]
99%|█████████▊| 2090/2120 [26:04<00:20, 1.43it/s]

{'train_step_runtime': 0.6717, 'train_step_samples_per_second': 53.592, 'epoch': 0.99}

Sample prints at the end of training:
Training completed. Do not forget to share your model on huggingface.co/models =)

{'train_runtime': 1586.5542, 'train_samples_per_second': 1.336, 'epoch': 1.0}

100%|██████████| 2120/2120 [26:26<00:00, 1.22s/it]

{'train_runtime': 1426.6043, 'train_samples_per_second': 427.611, 'epoch': 1.0}

100%|██████████| 2120/2120 [26:26<00:00, 1.22s/it]
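The per-step numbers above come from timing each training step and dividing the batch size by the elapsed time. A minimal sketch of that measurement outside the Trainer (`train_step_metrics`, `step_fn`, and the batch size are illustrative stand-ins, not the PR's actual code):

```python
import time

def train_step_metrics(step_fn, batch_size):
    """Time a single training step and report samples/sec.

    step_fn and batch_size are hypothetical stand-ins; the PR wires
    this kind of timing into the Trainer's inner training loop.
    """
    start = time.perf_counter()
    step_fn()
    runtime = time.perf_counter() - start
    return {
        "train_step_runtime": round(runtime, 4),
        "train_step_samples_per_second": round(batch_size / runtime, 3),
    }

# Example: a dummy step that sleeps ~50 ms, with a batch of 32
metrics = train_step_metrics(lambda: time.sleep(0.05), batch_size=32)
```

This yields a dict in the same shape as the `train_step_runtime` / `train_step_samples_per_second` entries printed above.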

@raviskolli requested a review from ashbhandare on April 21, 2021 18:49
@ashbhandare

Can you put an example output in the description?

@raviskolli (Author)

> Can you put an example output in the description?

Good idea! I updated the description with a few sample prints.

@raviskolli closed this Apr 22, 2021
@raviskolli reopened this Apr 22, 2021
@@ -1302,10 +1310,14 @@ def train(
)

metrics = speed_metrics("train", start_time, self.state.max_steps)
ort_end_train_metrics = speed_metrics("train",


Looks like these will be calculated for non-ort runs too? Can we remove the ort_ prefix, or add these under an if?
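One way to read this suggestion, as a hypothetical sketch (this `speed_metrics` stand-in, the `use_ort` flag, and the numbers are modeled loosely on the diff context, not the actual Trainer API):

```python
import time

def speed_metrics(prefix, start_time, num_samples):
    # Stand-in for the Trainer helper seen in the diff: elapsed wall
    # time plus throughput, keyed by a metric-name prefix.
    runtime = time.time() - start_time
    return {
        f"{prefix}_runtime": round(runtime, 4),
        f"{prefix}_samples_per_second": round(num_samples / runtime, 3),
    }

use_ort = True                    # stand-in for an ORT-enabled flag
start_time = time.time() - 10.0   # pretend training ran for ~10 s

metrics = speed_metrics("train", start_time, num_samples=4096)
if use_ort:
    # Guard the extra measurement so non-ORT runs are unaffected, and
    # name the prefix after what it measures (time excluding setup)
    # rather than after the backend.
    metrics.update(
        speed_metrics("train_post_setup", start_time + 2.0, num_samples=4096)
    )
```

Guarding the second call keeps non-ORT logs unchanged, which addresses the reviewer's first concern; renaming the prefix addresses the second.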

@ashbhandare

ashbhandare commented Apr 22, 2021

Thank you for adding the sample output. From the logs:
`Training completed. Do not forget to share your model on huggingface.co/models =)

{'train_runtime': 1586.5542, 'train_samples_per_second': 1.336, 'epoch': 1.0}

100%|██████████| 2120/2120 [26:26<00:00, 1.22s/it]

{'train_runtime': 1426.6043, 'train_samples_per_second': 427.611, 'epoch': 1.0}

100%|██████████| 2120/2120 [26:26<00:00, 1.22s/it]`

It is confusing for the user to see the same train_runtime and train_samples_per_second being printed twice with different values. Can we make it clear that the second set of speed metrics excludes the first step, to avoid counting setup overhead?

Also, can we add the new metrics to the final logged set, since most users will expect to look there for the relevant info:
***** train metrics *****
epoch = 1.0
init_mem_cpu_alloc_delta = 1MB
init_mem_cpu_peaked_delta = 0MB
init_mem_gpu_alloc_delta = 0MB
init_mem_gpu_peaked_delta = 0MB
train_mem_cpu_alloc_delta = 46MB
train_mem_cpu_peaked_delta = 230MB
train_mem_gpu_alloc_delta = 808MB
train_mem_gpu_peaked_delta = 5793MB
train_runtime = 46.9366
train_samples = 4096
train_samples_per_second = 1.364
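One way to make the exclusion explicit, as a hypothetical sketch (the `_post_warmup` metric names, the helper, and the runtimes below are illustrative, not the PR's actual code):

```python
def throughput_excluding_warmup(step_runtimes, batch_size, warmup_steps=1):
    """Average throughput over steady-state steps only.

    Skips the first warmup_steps entries, which absorb one-time setup
    cost, and names the metrics so the exclusion is self-documenting.
    """
    steady = step_runtimes[warmup_steps:]
    total = sum(steady)
    return {
        "train_runtime_post_warmup": round(total, 4),
        "train_samples_per_second_post_warmup": round(
            len(steady) * batch_size / total, 3
        ),
    }

# First step includes one-time setup; steady steps take ~0.64 s each,
# roughly matching the per-step prints in the PR description.
runtimes = [160.0] + [0.64] * 99
m = throughput_excluding_warmup(runtimes, batch_size=36)
```

Distinct metric names like these would let both the raw and the setup-excluded numbers appear in the final `***** train metrics *****` block without looking like duplicates.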
