add script and doc for wandb sync #325

Open
wants to merge 1 commit into
base: main

Conversation

kriangkraitan
Contributor

Why this PR

add script and doc for wandb sync

Changes

  • Write some changes here

Related Issues

Close #

Checklist

  • PR should follow the naming convention
  • Assign yourself under Assignees
  • Tag related issues
  • Constant names should be ALL_CAPITAL, function names should be snake_case, and class names should be CamelCase
  • Complex functions/algorithms should have a docstring
  • A PR should not change more than 200 lines (test files excepted). If it does, please open multiple PRs
  • At least one PR reviewer must come from the task's team (model, eval, data)


linear bot commented Nov 2, 2023

LM-207 Set up wandb auto sync in Lanta

Rationale: wandb on Lanta is forced to run in offline mode. We need to manually run the wandb sync command on the wandb log folder to upload the latest training results to the wandb cloud. This makes monitoring the LLM training pipeline difficult.

We want to automate the syncing by having a Python script, run under tmux on the Lanta frontend node, sync the wandb-offline folder every 30 minutes.

Step by step:

  1. Install tmux on Lanta using an easybuild local module: https://thaisc.atlassian.net/wiki/spaces/UG/pages/159350813/local+module+TARA+Cluste
  2. Write a Python/bash script that executes the command wandb sync <foldername> every 30 minutes
  3. Run a simple training script, https://github.com/wandb/examples/blob/master/examples/pytorch-lightning/mnist.py, on Lanta with 1 GPU device and wandb in offline mode (export WANDB_MODE=offline)
  4. Run the wandb sync script in tmux
  5. Check the results at 30 minutes and 60 minutes
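
A minimal sketch of what the script in step 2 might look like, not the script added in this PR. It assumes offline runs are written to folders named offline-run-* inside a wandb log directory, and it calls the wandb sync <foldername> command from the card once per folder. The function names, the `runner` and `sleep` parameters, and the `max_cycles` argument are hypothetical and exist only to make the loop easy to test.

```python
import subprocess
import time
from pathlib import Path
from typing import Callable, List, Optional

# 30 minutes, as requested in the task description.
SYNC_INTERVAL_SECONDS = 30 * 60


def find_offline_runs(wandb_dir: Path) -> List[Path]:
    """Return offline run folders (assumed to be named offline-run-*)."""
    return sorted(p for p in wandb_dir.glob("offline-run-*") if p.is_dir())


def build_sync_command(run_dir: Path) -> List[str]:
    """Build the `wandb sync <foldername>` command for one run folder."""
    return ["wandb", "sync", str(run_dir)]


def sync_once(wandb_dir: Path, runner: Callable = subprocess.run) -> int:
    """Sync every offline run folder once; return how many were attempted."""
    attempted = 0
    for run_dir in find_offline_runs(wandb_dir):
        # check=False: a failed upload should not stop the remaining folders.
        runner(build_sync_command(run_dir), check=False)
        attempted += 1
    return attempted


def sync_forever(
    wandb_dir: Path,
    interval: float = SYNC_INTERVAL_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
    runner: Callable = subprocess.run,
    max_cycles: Optional[int] = None,
) -> None:
    """Sync, wait `interval` seconds, and repeat (forever if max_cycles is None)."""
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        sync_once(wandb_dir, runner)
        cycles += 1
        sleep(interval)
```

Inside a tmux session on the frontend node, this could be started with something like `sync_forever(Path("wandb"))`, where the `wandb` path is a placeholder for wherever the training job writes its offline logs.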

Definition of done:

Out of Scope:

  • Actually finishing training the model

Tips:

  • Installing tmux can take time if you select the gcc variant. ml load ncurses can be used instead of gcc to speed up installing the non-gcc variant of tmux
  • peerawat.roj may have an easier way to install tmux than the instructions on this card.

Requester: new17353


codecov bot commented Nov 2, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (5ff5762) 64.47% compared to head (f92f53b) 19.39%.
Report is 1 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main     #325       +/-   ##
===========================================
- Coverage   64.47%   19.39%   -45.08%     
===========================================
  Files          11       25       +14     
  Lines         425     1392      +967     
===========================================
- Hits          274      270        -4     
- Misses        151     1122      +971     
Flag Coverage Δ
unittests 19.39% <ø> (-45.08%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

see 36 files with indirect coverage changes


@kriangkraitan kriangkraitan self-assigned this Nov 6, 2023

3 participants