This repository has been archived by the owner on Dec 1, 2021. It is now read-only.

Randomly resource not released after training sometimes #1227

Open
joelN123 opened this issue Oct 21, 2020 · 0 comments
Labels
bug Something isn't working

Comments

@joelN123
Contributor

The usual, expected behaviour is that when training has finished, all resources (GPU) are freed and the Docker container stops running. This does happen in most cases.

However, sometimes (randomly?) the resources are not freed. I'd estimate the unexpected behaviour occurs in roughly 1 in 5 runs, though I'm not sure. This is a potential problem for anyone using "pay-as-you-go" computing resources to train their model.

A recent example: I was running training with a modified lm_resnet_quantize_cifar10.py config file on the ilsvrc_2012 dataset, using the dataset_iterator with multigpu and prefetch (I'm not sure whether any of these is relevant to the issue).

An example of the output of a training run that failed to free resources:

[1,1]<stderr>:  warnings.warn(str(msg))
[1,0]<stdout>:1600000/1600000 [==============================] - 122832s 77ms/steput>::::>:
[1,3]<stdout>:break
[1,2]<stdout>:break
[1,1]<stdout>:break
[1,3]<stdout>:Done
[1,2]<stdout>:Done
[1,2]<stdout>:Next step: blueoil convert -e my_model -p save.ckpt-1600000
[1,3]<stdout>:Next step: blueoil convert -e my_model -p save.ckpt-1600000
[1,1]<stdout>:Done
[1,1]<stdout>:Next step: blueoil convert -e my_model -p save.ckpt-1600000

and an example of the output of a training run that freed resources successfully:

[1,3]<stderr>:  warnings.warn(str(msg))
[1,0]<stdout>:1599999/1600000 [============================>.] - ETA: 0s[1,2]<stdout>:break
[1,0]<stdout>:1600000/1600000 [==============================] - 131360s 82ms/step
[1,2]<stdout>:Done
[1,0]<stdout>:break
[1,2]<stdout>:Next step: blueoil convert -e another_model -p save.ckpt-1600000
[1,0]<stdout>:Done
[1,0]<stdout>:Next step: blueoil convert -e another_model -p save.ckpt-1600000
[1,3]<stdout>:break
[1,1]<stdout>:break
[1,3]<stdout>:Done
[1,1]<stdout>:Done
[1,3]<stdout>:Next step: blueoil convert -e another_model -p save.ckpt-1600000
[1,1]<stdout>:Next step: blueoil convert -e another_model -p save.ckpt-1600000

Comparing the two, the first run has only three Done printouts, while the second has four. The corresponding line of code is

print("Done")

which runs after the final progbar update. So it seems likely that one of the DatasetIterators did not close.
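If the iterators are only closed at the normal end of the training loop, any worker that exits abnormally (or blocks) would skip the close and keep holding the GPU. Wrapping the cleanup in a try/finally would guarantee it runs on every exit path. A minimal sketch, assuming a hypothetical DatasetIterator with a close() method (the real Blueoil class and training loop may differ):

```python
# Hypothetical stand-in for blueoil's dataset iterator; the real class
# likely owns prefetch threads / GPU buffers that close() must release.
class DatasetIterator:
    def __init__(self):
        self.closed = False

    def close(self):
        # Release worker threads / device resources here.
        self.closed = True


def train(iterator):
    try:
        # ... training loop would go here ...
        pass
    finally:
        # Runs even if the loop above raises, so the resource
        # is released on every exit path, not just the happy one.
        iterator.close()
        print("Done")


it = DatasetIterator()
train(it)
```

With this pattern each rank would print its own Done, so a missing printout would point directly at the rank whose iterator never closed.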

@joelN123 added the bug (Something isn't working) label on Oct 21, 2020