
Possible memory leak during iteration with a large number of users (10+ million)? #25

Open
lightsailpro opened this issue Jun 28, 2022 · 8 comments

Comments

@lightsailpro

lightsailpro commented Jun 28, 2022

I am testing with a large dataset of 10+ million users on a machine with 64GB of RAM. The dataset fits in RAM initially, but as training progresses, e.g. during the iterations of epoch 1, RAM consumption keeps increasing until all of it is consumed. Is this expected behavior for a large dataset, or is there possibly a memory leak somewhere in the iteration steps? I tried a smaller max sequence length (20) and a smaller batch size (64) and saw the same thing: RAM consumption keeps increasing during training. Thanks in advance.

@pmixer
Owner

pmixer commented Jun 29, 2022

Hi @lightsailpro, thanks for the feedback. That is quite a large dataset, and sorry, we did not focus on CPU-based training before. From your description, the Queue-based sampler in https://github.com/pmixer/SASRec.pytorch/blob/master/utils.py very likely consumes RAM incrementally, since we never looked closely at the RAM usage of its subprocesses. Personally, I would recommend adjusting some of the sampler's parameters to try to finish the experiment (see the sketch below). Also, if you are interested, please feel free to enhance the sampler with respect to its RAM usage; it would benefit users of this repo and of the original official work https://github.com/kang205/SASRec.
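A rough sketch of what I mean, not the repo's actual sampler: keeping the multiprocessing queue's maxsize and the worker count small caps how many prefetched batches sit in host RAM at once. The names here (BoundedSampler, sample_batches) are hypothetical, and random item IDs stand in for the real user histories:

```python
import numpy as np
from multiprocessing import Process, Queue


def sample_batches(usernum, itemnum, batch_size, maxlen, result_queue):
    """Worker: endlessly push random (u, seq, pos, neg) batches into a bounded queue."""
    rng = np.random.default_rng()
    while True:
        batch = []
        for _ in range(batch_size):
            u = int(rng.integers(1, usernum + 1))
            seq = rng.integers(0, itemnum + 1, size=maxlen, dtype=np.int64)
            pos = rng.integers(0, itemnum + 1, size=maxlen, dtype=np.int64)
            neg = rng.integers(0, itemnum + 1, size=maxlen, dtype=np.int64)
            batch.append((u, seq, pos, neg))
        # put() blocks once the queue is full, which caps the number of prefetched batches
        result_queue.put(list(zip(*batch)))


class BoundedSampler:
    """Queue-based sampler whose host RAM footprint is bounded by queue size and worker count."""

    def __init__(self, usernum, itemnum, batch_size=64, maxlen=20, n_workers=1):
        # at most `maxsize` prefetched batches live in host RAM at any time
        self.queue = Queue(maxsize=n_workers * 2)
        self.workers = []
        for _ in range(n_workers):
            p = Process(
                target=sample_batches,
                args=(usernum, itemnum, batch_size, maxlen, self.queue),
                daemon=True,
            )
            p.start()
            self.workers.append(p)

    def next_batch(self):
        return self.queue.get()

    def close(self):
        for p in self.workers:
            p.terminate()
            p.join()


if __name__ == "__main__":
    sampler = BoundedSampler(usernum=10_000_000, itemnum=500_000,
                             batch_size=64, maxlen=20, n_workers=1)
    u, seq, pos, neg = sampler.next_batch()
    sampler.close()
```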

@lightsailpro
Author

lightsailpro commented Jun 29, 2022

@pmixer Thanks for your response. The training is actually done on a V100 GPU with 16GB of memory, and GPU memory is fine even with a larger batch size. The issue is that CPU (host) RAM consumption keeps increasing as training progresses from epoch to epoch.

@pmixer
Owner

pmixer commented Jun 30, 2022

@lightsailpro Maybe try the original repo https://github.com/kang205/SASRec; TensorFlow may have better support than PyTorch when the CPU is used for training and inference.

@NicholasLea

@lightsailpro @pmixer I think the cause is that `[train, valid, test, usernum, itemnum] = copy.deepcopy(dataset)` uses more and more RAM. I have checked closely and think we can remove the copy.deepcopy. It is there to avoid mutating train, valid, and test, but I believe the existing processing, including reversed(train[u]), does not change them. I also tested removing it and it works; a sketch of the change is below. If you have more findings, please comment. Thanks.
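A minimal sketch of the change, assuming evaluate() in utils.py only reads these structures and never mutates them (the helper name evaluate_header is just for illustration):

```python
import copy


def evaluate_header(dataset, use_deepcopy=False):
    """Unpack (train, valid, test, usernum, itemnum) the way evaluate() does.

    With use_deepcopy=True, every evaluation call duplicates all user
    sequences, which is expensive for 10M+ users; with False the structures
    are shared by reference, which is safe as long as evaluation only reads them.
    """
    if use_deepcopy:
        train, valid, test, usernum, itemnum = copy.deepcopy(dataset)
    else:
        train, valid, test, usernum, itemnum = dataset
    return train, valid, test, usernum, itemnum
```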

@lightsailpro
Author

@NicholasLea: Sorry for the delayed response, and thanks for the help! I assume the copy.deepcopy you mentioned is in the evaluate code. To clarify, the host RAM leak/consumption I observe happens during the epoch 1 iterations, before the evaluate code is even called (it only runs every 20 epochs, via epoch % 20). In my case, at epoch 1 iteration 0 when training started, main.py consumed about 36GB of host RAM, but by iteration 7750 host RAM consumption had already jumped to 50GB. So I was not even able to finish epoch 1 before the host ran out of RAM. The V100 GPU memory consumption is very stable though (around 4GB out of 16GB). Any further help will be appreciated.

average sequence length: 34.88
loss in epoch 1 iteration 0 / 47471: 1.38626229763031 (host RAM consumption, 36GB of 64GB)
.....
loss in epoch 1 iteration 7750 / 47471: 0.3159805238246918 (host RAM consumption, 50GB of 64GB, GPU RAM 4G of 16GB)
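A hypothetical helper (not part of the repo, requires `pip install psutil`) for logging host RSS next to the loss line, to pinpoint which iterations grow RAM:

```python
import os

import psutil  # pip install psutil


def host_ram_gb():
    """Resident set size of the current process, in GiB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3


# placed next to the existing loss print inside the training loop, e.g.:
# print("loss in epoch {} iteration {}: {} (host RAM {:.1f}GB)".format(
#     epoch, step, loss.item(), host_ram_gb()))
```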

@pmixer
Owner

pmixer commented Jul 14, 2022

> To clarify, the host RAM leak/consumption I observe happens during the epoch 1 iterations, before the evaluate code is even called. [...] at epoch 1 iteration 0 main.py consumed about 36GB of host RAM, but by iteration 7750 host RAM consumption had already jumped to 50GB of 64GB (GPU RAM 4GB of 16GB).

@lightsailpro Sorry about that; it can be frustrating to try to train on a larger dataset and hit OOM. Formally, this kind of issue requires digging into the details with the help of profilers (https://pytorch.org/blog/introducing-pytorch-profiler-the-new-and-improved-performance-tool/). I suggested trying the original TF version of SASRec before; if that is not preferred, please try deleting some of the variables I create after each iteration, such as in

pos_labels, neg_labels = torch.ones(pos_logits.shape, device=args.device), torch.zeros(neg_logits.shape, device=args.device)
Just `del` the ones you think are no longer needed after each training iteration (see https://stackoverflow.com/questions/26545051/is-there-a-way-to-delete-created-variables-functions-etc-from-the-memory-of-th); a sketch is below.
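A rough sketch of that per-iteration cleanup, with dummy tensors standing in for the real batch; the gc.collect() and empty_cache() calls are extra suggestions, not something main.py already does:

```python
import gc

import torch

num_batches = 3  # stand-in for the ~47k iterations per epoch

for step in range(num_batches):
    # stand-in for the real forward/backward pass in main.py
    pos_labels = torch.ones(64, 200)
    neg_labels = torch.zeros(64, 200)
    # ... compute loss, loss.backward(), adam_optimizer.step() ...

    # drop references to per-batch tensors so they can be freed right away
    del pos_labels, neg_labels
    gc.collect()                      # reclaim cyclic host-side garbage
    if torch.cuda.is_available():
        torch.cuda.empty_cache()      # return cached GPU blocks to the driver
```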

@alan-ai-learner

Any luck here guys?

@pmixer
Owner

pmixer commented Jan 16, 2023

> Any luck here guys?

I'm afraid not. All of the sampling etc. code should be the same whether the CPU or the GPU is used, so the main reason a memory leak shows up on the CPU but not the GPU might lie in differences between PyTorch's CPU and GPU implementations themselves. For that I recommend trying the PyTorch profiler, and switching to the TF version of SASRec if you keep using the CPU for training. If it were a memory leak on the GPU, I might have more experience or expertise for debugging it.
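For example, a minimal sketch of such a profiler run, wrapping just a few iterations; the tensors below are placeholders for the model's real forward/backward pass:

```python
import torch
from torch.profiler import ProfilerActivity, profile

with profile(
    activities=[ProfilerActivity.CPU],   # add ProfilerActivity.CUDA when profiling GPU runs
    profile_memory=True,                 # track tensor allocations
    record_shapes=True,
) as prof:
    for _ in range(10):                  # a handful of iterations, not a whole epoch
        x = torch.randn(64, 200, 50, requires_grad=True)  # placeholder training step
        loss = (x @ x.transpose(1, 2)).sum()
        loss.backward()

# operators sorted by how much host memory they allocate
print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))
```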
