How to save image artefacts in a multi GPU training #5729
laughingrice asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
-
I think I figured out the right way to do this; adding the solution in case others run into the same issue. The trick was to use a callback instead of doing it from the training function (that way there is also the external guarantee that an appropriate logger was attached), which is cleaner. The main drawback is that it requires another forward pass through the network, but doing that once per epoch, or every few epochs, is not much of an overhead.
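The original post does not include the callback itself, so here is a minimal sketch of that approach, assuming an image-to-image model (e.g. an autoencoder), a `pytorch_lightning.loggers.MLFlowLogger` attached to the trainer, and a fixed `sample_batch` supplied at construction; the class name, `every_n_epochs`, and the `"images"` artifact path are illustrative choices, and the hook signature matches recent Lightning versions:

```python
import os
import tempfile

import torch
import torchvision
import pytorch_lightning as pl


class ImageLoggingCallback(pl.Callback):
    """Log input / output / error images every few epochs, from rank zero only."""

    def __init__(self, sample_batch: torch.Tensor, every_n_epochs: int = 1):
        self.sample_batch = sample_batch      # fixed batch, so images stay comparable across epochs
        self.every_n_epochs = every_n_epochs

    def on_train_epoch_end(self, trainer, pl_module):
        # Only the global rank-zero process logs, so the artifacts are written
        # exactly once and all come from the same, complete batch.
        if not trainer.is_global_zero:
            return
        if trainer.current_epoch % self.every_n_epochs != 0:
            return

        x = self.sample_batch.to(pl_module.device)
        with torch.no_grad():
            y = pl_module(x)                  # the extra forward pass, once per logging epoch

        grids = {
            "input": torchvision.utils.make_grid(x),
            "output": torchvision.utils.make_grid(y),
            "error": torchvision.utils.make_grid((x - y).abs()),
        }

        logger = trainer.logger               # assumed to be an MLFlowLogger
        with tempfile.TemporaryDirectory() as tmp:
            for name, grid in grids.items():
                path = os.path.join(tmp, f"{name}_epoch{trainer.current_epoch:04d}.png")
                torchvision.utils.save_image(grid, path)
                # MlflowClient.log_artifact uploads a local file into the run's artifact store
                logger.experiment.log_artifact(logger.run_id, path, artifact_path="images")
```

The callback then just gets passed to the trainer, e.g. `Trainer(logger=MLFlowLogger(...), callbacks=[ImageLoggingCallback(fixed_batch)])`, so the training step itself stays free of logging code.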
-
I am using the mlflow logger and am looking for the best way to save image artifacts in a multi-GPU setting.
Ideally I would like to log on the first batch of the following epoch, but for a start I would be content with logging the first batch of every epoch.
Currently I have a call that saves on the first batch of the epoch.
My problem is that the work is split across (at least) 4 workers, and each one logs its results, with only one of them prevailing. Leaving aside the wasted work, I am actually saving several images (input, output, error), and each one ends up coming from a different part of the batch, so they cannot be compared.
Is there a way to check whether I am on the first sub-batch of the batch?
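For the literal question (how to tell whether the current process is the one that should log), Lightning exposes the process's global rank on the module itself, so a guard inside `training_step` can do it. A minimal sketch with a toy image-to-image model; `LitAutoencoder` and the stashed `first_batch_images` dict are illustrative, not from the original post, and the callback in the reply above is the cleaner way to actually write the artifacts:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl


class LitAutoencoder(pl.LightningModule):
    """Toy image-to-image module; only the logging guard matters here."""

    def __init__(self):
        super().__init__()
        self.net = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1)

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        x, _ = batch
        y = self(x)
        # Under DDP each process sees its own shard of the data, so
        # `batch_idx == 0` alone fires on every rank. Adding the global-rank
        # check means only one process keeps input / output / error, and all
        # three come from the same shard.
        if batch_idx == 0 and self.global_rank == 0:
            self.first_batch_images = {
                "input": x.detach().cpu(),
                "output": y.detach().cpu(),
                "error": (x - y).abs().detach().cpu(),
            }
        return F.mse_loss(y, x)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```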