Possible memory leak during iteration for large number of users (10+ million)? #25
Hi @lightsailpro, thanks for the feedback. That is quite a large dataset; sorry, we did not focus on CPU-based training before. Based on your description, it is very likely that the Queue-based sampler in https://github.com/pmixer/SASRec.pytorch/blob/master/utils.py consumes RAM incrementally if we do not look closely into the RAM usage of its sub-processes. Personally, I'd recommend adjusting some of the sampler's parameters to try to finish the experiment. Also, if you are interested, please feel free to improve the sampler's RAM usage; it would benefit users of this repo and of the original official work https://github.com/kang205/SASRec
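For anyone trying the parameter-tweaking route, here is a minimal sketch, assuming the repo's `WarpSampler` keeps its current `utils.py` signature and buffers batches in a multiprocessing Queue whose size scales with `n_workers` (the concrete values below are only illustrative):

```python
# Hypothetical adjustment in main.py: shrink the sampler's host-RAM footprint.
# Assumes utils.WarpSampler(User, usernum, itemnum, batch_size, maxlen, n_workers)
# spawns n_workers sub-processes and queues pre-sampled batches.
from utils import WarpSampler

sampler = WarpSampler(
    user_train, usernum, itemnum,
    batch_size=64,  # smaller batches -> smaller numpy arrays sitting in the queue
    maxlen=50,      # shorter sequences -> less memory per queued sample
    n_workers=1,    # fewer worker processes -> fewer dataset copies, smaller queue
)
```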
@pmixer Thanks for your response. The training is still done on a V100 GPU with 16GB RAM, and GPU RAM is fine even with a larger batch size. The issue is that CPU RAM consumption keeps increasing as training progresses from epoch to epoch.
@lightsailpro Maybe try the original repo https://github.com/kang205/SASRec; TensorFlow may have better support than PyTorch when the CPU is used for training and inference.
@lightsailpro @pmixer I think the reason is that "[train, valid, test, usernum, itemnum] = copy.deepcopy(dataset)" causes more and more RAM use. I have checked closely and think we can remove copy.deepcopy. It is there to avoid modifying train, valid, and test, but I believe the existing processing, including reversed(train[u]), does not change them. I also tested removing it, and it works. If you have more findings, please comment. Thanks.
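For reference, a minimal sketch of the change being proposed, assuming the evaluation code in utils.py only reads the splits (e.g. via reversed(train[u])) and never mutates them:

```python
# Sketch: drop the per-call deep copy in evaluate() (and evaluate_valid()),
# under the assumption that evaluation only reads train/valid/test.
def evaluate(model, dataset, args):
    # before: [train, valid, test, usernum, itemnum] = copy.deepcopy(dataset)
    [train, valid, test, usernum, itemnum] = dataset  # plain unpacking, no copy
    # ... rest of the evaluation loop unchanged ...
```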
@NicholasLea: Sorry for the delayed response, and thanks for the help! I assume the copy.deepcopy you mentioned is in the evaluate code. To clarify, the host RAM leak / consumption I observed happens during the epoch-1 iterations, before the evaluate code is even called (it only runs every 20 epochs, via epoch % 20). In my case, at epoch 1 iteration 0, when training started, main.py consumed about 36GB of host RAM, but by iteration 7750 the host RAM consumption had already jumped to 50GB, so I was not even able to finish epoch 1 before the host ran out of RAM. The V100 GPU consumption is very stable though (around 4GB out of 16GB). Any further help will be appreciated. average sequence length: 34.88
@lightsailpro Sorry about that; it can be frustrating to try training on a larger dataset and hit OOM. Normally this kind of issue requires digging into the details with the help of profilers: https://pytorch.org/blog/introducing-pytorch-profiler-the-new-and-improved-performance-tool/ . I suggested trying the original TF version of SASRec before; if that is not preferred, please try deleting some of the variables I created after each iteration, like in Line 96 in 4297d09 - del those you think are no longer needed after each training iteration https://stackoverflow.com/questions/26545051/is-there-a-way-to-delete-created-variables-functions-etc-from-the-memory-of-th
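For concreteness, a rough sketch of that suggestion, assuming the variable names (u, seq, pos, neg, the logits, and the loss) match main.py's training loop; compute_loss below is a hypothetical stand-in for the real loss code:

```python
# Free per-iteration objects explicitly so nothing accumulates across batches.
import gc

for step in range(num_batch):
    u, seq, pos, neg = sampler.next_batch()
    pos_logits, neg_logits = model(u, seq, pos, neg)
    loss = compute_loss(pos_logits, neg_logits)  # hypothetical placeholder for the real loss
    adam_optimizer.zero_grad()
    loss.backward()
    adam_optimizer.step()

    # drop the references so CPython / PyTorch can reclaim the memory before the next batch
    del u, seq, pos, neg, pos_logits, neg_logits, loss
    gc.collect()
```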
Any luck here, guys?
I'm afraid not. All the sampling etc. code should be the same whether using CPU or GPU, so the major difference causing a memory leak on CPU but not on GPU might be rooted in PyTorch's own implementation differences between CPU and GPU. For that, I recommend trying the PyTorch profiler, and switching to the TF version of SASRec if you keep using the CPU for training. If it were a memory leak on GPU, I might have more experience or expertise for debugging it.
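If someone wants to try the profiler route, here is a minimal sketch using torch.profiler with host-memory tracking; the loop body is assumed to mirror main.py's sampler/model usage:

```python
# Profile a handful of training iterations and rank ops by host memory use.
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU],
             profile_memory=True, record_shapes=True) as prof:
    for step in range(10):                       # a few iterations are enough to spot growth
        u, seq, pos, neg = sampler.next_batch()  # assumes the same sampler/model as main.py
        pos_logits, neg_logits = model(u, seq, pos, neg)

print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))
```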
I am testing with a large dataset with 10+ million users. I have 64GB of RAM, and the dataset fits in RAM initially, but as training progresses, e.g. during the epoch-1 iterations, RAM consumption keeps increasing until eventually all of it is consumed. Is this expected behavior for a large dataset, or is there a possible memory leak somewhere in the iteration steps? I tried a smaller max sequence length (20) and a smaller batch size (64), with the same observation: RAM consumption keeps increasing during training. Thanks in advance.
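As a quick way to quantify the growth, and to see whether it is the main process or the sampler worker processes that grow, here is a small logging helper; it assumes psutil is installed, and the helper name and call interval are only illustrative:

```python
# Log resident host RAM of main.py plus its child (sampler worker) processes
# every N training iterations.
import os
import psutil

_proc = psutil.Process(os.getpid())

def log_host_ram(step, every=500):
    if step % every != 0:
        return
    total = _proc.memory_info().rss
    for child in _proc.children(recursive=True):  # e.g. WarpSampler worker processes
        total += child.memory_info().rss
    print(f"iteration {step}: resident host RAM {total / 1024 ** 3:.2f} GB")
```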