
Possible memory leak during iteration with a large number of users (10+ million)? #25

Open
lightsailpro opened this issue Jun 28, 2022 · 8 comments

Comments

@lightsailpro

lightsailpro commented Jun 28, 2022

I am testing with a large dataset of 10+ million users on a machine with 64GB of RAM. The dataset fits in RAM initially, but as training progresses, e.g. during the iterations of epoch 1, RAM consumption keeps increasing until all of it is consumed. Is this expected behavior for a large dataset, or is there possibly a memory leak somewhere in the iteration steps? I tried a smaller max sequence length (20) and a smaller batch size (64) and saw the same thing: RAM consumption keeps increasing during training. Thanks in advance.

@pmixer
Owner

pmixer commented Jun 29, 2022

Hi @lightsailpro, thanks for the feedback. That is quite a large dataset, and sorry, we did not focus on CPU-based training before. From your description, the Queue-based sampler in https://github.com/pmixer/SASRec.pytorch/blob/master/utils.py very likely consumes RAM incrementally, since we never looked closely at the RAM usage of its subprocesses. Personally, I would recommend adjusting some of the sampler's parameters to try to finish the experiment (see the sketch below). Also, if you are interested, please feel free to enhance the sampler with respect to its RAM usage; it would benefit users of this repo and of the original official work https://github.com/kang205/SASRec.
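A rough sketch of what I mean, not the repo's actual sampler: keeping the multiprocessing queue's maxsize and the worker count small caps how many prefetched batches sit in host RAM at once. The names here (BoundedSampler, sample_batches) are hypothetical, and random item IDs stand in for the real user histories:

```python
import numpy as np
from multiprocessing import Process, Queue


def sample_batches(usernum, itemnum, batch_size, maxlen, result_queue):
    """Worker: endlessly push random (u, seq, pos, neg) batches into a bounded queue."""
    rng = np.random.default_rng()
    while True:
        batch = []
        for _ in range(batch_size):
            u = int(rng.integers(1, usernum + 1))
            seq = rng.integers(0, itemnum + 1, size=maxlen, dtype=np.int64)
            pos = rng.integers(0, itemnum + 1, size=maxlen, dtype=np.int64)
            neg = rng.integers(0, itemnum + 1, size=maxlen, dtype=np.int64)
            batch.append((u, seq, pos, neg))
        # put() blocks once the queue is full, which caps the number of prefetched batches
        result_queue.put(list(zip(*batch)))


class BoundedSampler:
    """Queue-based sampler whose host RAM footprint is bounded by queue size and worker count."""

    def __init__(self, usernum, itemnum, batch_size=64, maxlen=20, n_workers=1):
        # at most `maxsize` prefetched batches live in host RAM at any time
        self.queue = Queue(maxsize=n_workers * 2)
        self.workers = []
        for _ in range(n_workers):
            p = Process(
                target=sample_batches,
                args=(usernum, itemnum, batch_size, maxlen, self.queue),
                daemon=True,
            )
            p.start()
            self.workers.append(p)

    def next_batch(self):
        return self.queue.get()

    def close(self):
        for p in self.workers:
            p.terminate()
            p.join()


if __name__ == "__main__":
    sampler = BoundedSampler(usernum=10_000_000, itemnum=500_000,
                             batch_size=64, maxlen=20, n_workers=1)
    u, seq, pos, neg = sampler.next_batch()
    sampler.close()
```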

@lightsailpro
Author

lightsailpro commented Jun 29, 2022

@pmixer Thanks for your response. The training is actually done on a V100 GPU with 16GB of memory, and GPU memory is fine even with a larger batch size. The issue is that CPU (host) RAM consumption keeps increasing as training progresses from epoch to epoch.

@pmixer
Owner

pmixer commented Jun 30, 2022

@lightsailpro Maybe try the original repo https://github.com/kang205/SASRec; TensorFlow may have better support than PyTorch when the CPU is used for training and inference.

@NicholasLea

@lightsailpro @pmixer I think the cause is that `[train, valid, test, usernum, itemnum] = copy.deepcopy(dataset)` uses more and more RAM. I have checked closely and think we can remove the copy.deepcopy. It is there to avoid mutating train, valid, and test, but I believe the existing processing, including reversed(train[u]), does not change them. I also tested removing it and it works; a sketch of the change is below. If you have more findings, please comment. Thanks.
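A minimal sketch of the change, assuming evaluate() in utils.py only reads these structures and never mutates them (the helper name evaluate_header is just for illustration):

```python
import copy


def evaluate_header(dataset, use_deepcopy=False):
    """Unpack (train, valid, test, usernum, itemnum) the way evaluate() does.

    With use_deepcopy=True, every evaluation call duplicates all user
    sequences, which is expensive for 10M+ users; with False the structures
    are shared by reference, which is safe as long as evaluation only reads them.
    """
    if use_deepcopy:
        train, valid, test, usernum, itemnum = copy.deepcopy(dataset)
    else:
        train, valid, test, usernum, itemnum = dataset
    return train, valid, test, usernum, itemnum
```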

@lightsailpro
Author

@NicholasLea: Sorry for the delayed response, and thanks for the help! I assume the copy.deepcopy you mentioned is in the evaluate code. To clarify, the host RAM leak/consumption I observe happens during the epoch 1 iterations, before the evaluate code is even called (it only runs every 20 epochs, via epoch % 20). In my case, at epoch 1 iteration 0 when training started, main.py consumed about 36GB of host RAM, but by iteration 7750 host RAM consumption had already jumped to 50GB. So I was not even able to finish epoch 1 before the host ran out of RAM. The V100 GPU memory consumption is very stable though (around 4GB out of 16GB). Any further help will be appreciated.

average sequence length: 34.88
loss in epoch 1 iteration 0 / 47471: 1.38626229763031 (host RAM consumption, 36GB of 64GB)
.....
loss in epoch 1 iteration 7750 / 47471: 0.3159805238246918 (host RAM consumption, 50GB of 64GB, GPU RAM 4G of 16GB)
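A hypothetical helper (not part of the repo, requires `pip install psutil`) for logging host RSS next to the loss line, to pinpoint which iterations grow RAM:

```python
import os

import psutil  # pip install psutil


def host_ram_gb():
    """Resident set size of the current process, in GiB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3


# placed next to the existing loss print inside the training loop, e.g.:
# print("loss in epoch {} iteration {}: {} (host RAM {:.1f}GB)".format(
#     epoch, step, loss.item(), host_ram_gb()))
```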

@pmixer
Owner

pmixer commented Jul 14, 2022

> To clarify, the host RAM leak/consumption I observe happens during the epoch 1 iterations, before the evaluate code is even called. [...] at epoch 1 iteration 0 main.py consumed about 36GB of host RAM, but by iteration 7750 host RAM consumption had already jumped to 50GB of 64GB (GPU RAM 4GB of 16GB).

@lightsailpro Sorry about that; it can be frustrating to try to train on a larger dataset and hit OOM. Formally, this kind of issue requires digging into the details with the help of profilers (https://pytorch.org/blog/introducing-pytorch-profiler-the-new-and-improved-performance-tool/). I suggested trying the original TF version of SASRec before; if that is not preferred, please try deleting some of the variables I create after each iteration, such as in

pos_labels, neg_labels = torch.ones(pos_logits.shape, device=args.device), torch.zeros(neg_logits.shape, device=args.device)
Just `del` the ones you think are no longer needed after each training iteration (see https://stackoverflow.com/questions/26545051/is-there-a-way-to-delete-created-variables-functions-etc-from-the-memory-of-th); a sketch is below.
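A rough sketch of that per-iteration cleanup, with dummy tensors standing in for the real batch; the gc.collect() and empty_cache() calls are extra suggestions, not something main.py already does:

```python
import gc

import torch

num_batches = 3  # stand-in for the ~47k iterations per epoch

for step in range(num_batches):
    # stand-in for the real forward/backward pass in main.py
    pos_labels = torch.ones(64, 200)
    neg_labels = torch.zeros(64, 200)
    # ... compute loss, loss.backward(), adam_optimizer.step() ...

    # drop references to per-batch tensors so they can be freed right away
    del pos_labels, neg_labels
    gc.collect()                      # reclaim cyclic host-side garbage
    if torch.cuda.is_available():
        torch.cuda.empty_cache()      # return cached GPU blocks to the driver
```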

@alan-ai-learner

Any luck here guys?

@pmixer
Owner

pmixer commented Jan 16, 2023

> Any luck here guys?

I'm afraid not. All of the sampling etc. code should be the same whether the CPU or the GPU is used, so the main reason a memory leak shows up on the CPU but not the GPU might lie in differences between PyTorch's CPU and GPU implementations themselves. For that I recommend trying the PyTorch profiler, and switching to the TF version of SASRec if you keep using the CPU for training. If it were a memory leak on the GPU, I might have more experience or expertise for debugging it.
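For example, a minimal sketch of such a profiler run, wrapping just a few iterations; the tensors below are placeholders for the model's real forward/backward pass:

```python
import torch
from torch.profiler import ProfilerActivity, profile

with profile(
    activities=[ProfilerActivity.CPU],   # add ProfilerActivity.CUDA when profiling GPU runs
    profile_memory=True,                 # track tensor allocations
    record_shapes=True,
) as prof:
    for _ in range(10):                  # a handful of iterations, not a whole epoch
        x = torch.randn(64, 200, 50, requires_grad=True)  # placeholder training step
        loss = (x @ x.transpose(1, 2)).sum()
        loss.backward()

# operators sorted by how much host memory they allocate
print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))
```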
