GPU memory usage logging #354

Merged: 2 commits merged on Aug 9, 2023

tmm1 (Collaborator) commented Aug 9, 2023

[2023-08-09 08:13:36,085] [INFO] [axolotl.scripts.train:254] [PID:741158] GPU memory baseline: 322 MB.
[2023-08-09 08:13:48,165] [INFO] [axolotl.load_model:329] [PID:741158] GPU memory after model load: 4837 MB.
[2023-08-09 08:14:22,195] [INFO] [axolotl.load_model:368] [PID:741158] GPU memory after adapters: 4917 MB.
[2023-08-09 08:14:41,233] [INFO] [axolotl.callbacks.on_step_end:95] [PID:741158] GPU memory while training: 21573 MB.

tmm1 force-pushed the gpu-util branch 2 times, most recently from eaa5f5c to 8787fec, August 9, 2023 08:31
winglian (Collaborator) commented Aug 9, 2023

nvidia-ml-py3 (https://pypi.org/project/nvidia-ml-py3/) seems to be a project that hasn't been touched in 6 years. Perhaps we could implement this with torch instead?

https://pytorch.org/docs/stable/generated/torch.cuda.memory_summary.html
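A minimal sketch of what a torch-based version of the logging might look like, assuming `torch.cuda.memory_allocated()` is an acceptable stand-in for the NVML reading. The helper names (`gpu_memory_mb`, `format_gpu_memory`, `log_gpu_memory`) and the logger name are hypothetical and not taken from this PR. Note that `memory_allocated()` only counts memory held by tensors in the current process, so it would probably read lower than the driver-level numbers in the log output above.

```python
import logging

try:
    import torch
except ImportError:  # assumption: let the sketch run even without torch installed
    torch = None

LOG = logging.getLogger("gpu_memory_sketch")  # hypothetical logger name


def gpu_memory_mb(device: int = 0) -> float:
    """Return memory held by torch tensors on `device` in MB (0.0 if CUDA is unavailable)."""
    if torch is None or not torch.cuda.is_available():
        return 0.0
    # memory_allocated() returns bytes occupied by tensors in this process.
    return torch.cuda.memory_allocated(device) / 1024**2


def format_gpu_memory(stage: str, mb: float) -> str:
    """Build a message in the same shape as the log lines in this PR."""
    return f"GPU memory {stage}: {mb:.0f} MB."


def log_gpu_memory(stage: str, device: int = 0) -> str:
    """Log the current reading for a named stage and return the message."""
    msg = format_gpu_memory(stage, gpu_memory_mb(device))
    LOG.info(msg)
    return msg
```

Calling `log_gpu_memory("baseline")` before loading the model and `log_gpu_memory("after model load")` afterwards would give lines comparable to the ones posted above. `torch.cuda.memory_summary()` from the linked docs returns a much more detailed multi-line report if that is preferred.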

utensil (Contributor) commented Aug 9, 2023

  1. It's nice to mark the VRAM usage at these stages. It would be even nicer to calculate the difference between baseline and loaded, and the difference between loaded and training, and divide by micro_batch_size etc.
  2. GB would be a more readable unit; just use :.3f to keep 3 digits after the decimal point so the MB value is easily recovered too.
  3. You can simply use torch.cuda.memory_usage() etc. instead of reimplementing this.
  4. Could also calculate the percentage alongside the absolute GB value.
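Points 1, 2 and 4 could be sketched roughly as follows, using the four readings from the log output posted earlier in this thread. The function names are hypothetical, the 1024 MB-per-GB conversion is an assumption (the comment does not say which convention is meant), and the 24576 MB card size and `micro_batch_size = 2` are made-up values for illustration only.

```python
MB_PER_GB = 1024  # assumption: binary units; use 1000 if decimal GB is intended

# Readings copied from the log lines in this PR, in MB.
STAGES = [
    ("baseline", 322),
    ("after model load", 4837),
    ("after adapters", 4917),
    ("while training", 21573),
]


def stage_deltas(stages):
    """Return (stage name, increase in MB over the previous stage) pairs."""
    return [
        (name, mb - prev_mb)
        for (_, prev_mb), (name, mb) in zip(stages, stages[1:])
    ]


def mb_to_gb_str(mb: float) -> str:
    """Format an MB value as GB with 3 digits after the decimal point."""
    return f"{mb / MB_PER_GB:.3f} GB"


def percent_of_total(mb: float, total_mb: float) -> str:
    """Format usage as a percentage of the card's total memory."""
    return f"{mb / total_mb:.1%}"


if __name__ == "__main__":
    total_mb = 24576        # hypothetical 24 GB card, not stated in the PR
    micro_batch_size = 2    # hypothetical value, not stated in the PR
    for name, delta in stage_deltas(STAGES):
        print(f"{name}: +{mb_to_gb_str(delta)}")
    training_delta = stage_deltas(STAGES)[-1][1]
    print(f"per sample: {mb_to_gb_str(training_delta / micro_batch_size)}")
    print(f"peak: {percent_of_total(STAGES[-1][1], total_mb)} of total")
```

On point 3, as far as I can tell from the PyTorch docs, `torch.cuda.memory_usage()` reports a utilization percentage as given by nvidia-smi rather than a byte count, so `torch.cuda.memory_allocated()` or `torch.cuda.mem_get_info()` may be closer to what is wanted for the absolute figures.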

tmm1 marked this pull request as ready for review August 9, 2023 21:47
tmm1 merged commit 9643121 into axolotl-ai-cloud:main Aug 9, 2023
3 checks passed
mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request Dec 15, 2023
3 participants