I am training the OpenChatKit-7B model with an increased number of iterations (among other changes) and want to monitor the quality of the training with TensorBoard, but I have not managed to get it to work. I have added a SummaryWriter to the test_loop function in dist_clm_train.py:
...
...
loss = torch.tensor(loss_list).mean()
ppls = torch.exp(loss)
metric = {"valid.perplexity": ppls.item(), "valid.loss": loss.item()}
# ADDED to calculate tensorboard scalars
metric = {"train.perplexity": ppls.item(), "train.loss": loss.item()}
train_log(metric, step=pipe.global_step)
tb.add_scalar('train/perplexity', ppls.item(), tmpStep)
tb.add_scalar('train/loss', loss.item(), tmpStep)
tmpStep += 1
# END of ADDED
...
...
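To make the intent concrete, here is a minimal sketch in plain Python of what the added lines are meant to do. `ScalarLog` is a hypothetical stand-in for the SummaryWriter instance `tb`, the loss values are made up, and I am assuming that one scalar per tag per step is what TensorBoard should receive:

```python
import math


def summarize_losses(loss_list):
    """Return (mean loss, perplexity), mirroring the torch code above."""
    mean_loss = sum(loss_list) / len(loss_list)
    return mean_loss, math.exp(mean_loss)


class ScalarLog:
    """Hypothetical stand-in for SummaryWriter: keeps (step, value) pairs per tag."""

    def __init__(self):
        self.scalars = {}

    def add_scalar(self, tag, value, step):
        self.scalars.setdefault(tag, []).append((step, value))


tb = ScalarLog()

# Made-up per-batch losses for two evaluation passes.
for step, losses in enumerate([[2.0, 2.2, 1.8], [1.5, 1.7, 1.6]]):
    loss, ppl = summarize_losses(losses)
    tb.add_scalar("train/loss", loss, step)
    tb.add_scalar("train/perplexity", ppl, step)
```

In this sketch the step counter lives outside the function that computes the metrics, so each pass is recorded at a new step; in my real code that role is played by tmpStep.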
Could you please advise whether this is possible and, if so, how it can be done? Any help or guidance would be much appreciated.
Many thanks,