How to train on a huge text dataset #4139
Unanswered
Limtle
asked this question in DDP / multi-GPU / multi-node
What is your question?
I have a DDP program running on 8 nodes, and I need to load a very large text dataset (>30 GB) for this task. However, when I load it into the DataLoader, the program gets stuck. My intuition is to split the dataset into smaller fractions so that the DataLoader only needs to load a small piece at a time. Does pytorch-lightning have any support for this, or any other suggestions for solving this kind of problem?
What's your environment?
OS: [Linux]
Packaging: [pip]
Version: [0.10.0]
Replies: 1 comment
Can you tell which part of your script is failing? PyTorch datasets do not need to load all of the data at once if implemented properly. Perhaps you can share your Dataset implementation?
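A minimal sketch of the kind of lazily-loading dataset the reply is pointing at, assuming the corpus is a single text file with one example per line (the class name is hypothetical, and tokenization is left out):

```python
import numpy as np
from torch.utils.data import Dataset

class LazyTextDataset(Dataset):
    """Map-style dataset over a large text file with one example per line.

    Only the byte offsets of line starts are held in memory; each item is
    read from disk on demand, so the 30 GB file is never fully loaded.
    """

    def __init__(self, path):
        self.path = path
        offsets = [0]
        with open(path, "rb") as f:
            for line in f:
                offsets.append(offsets[-1] + len(line))
        # Drop the final offset (it points at EOF); store compactly as int64.
        self.offsets = np.asarray(offsets[:-1], dtype=np.int64)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        with open(self.path, "rb") as f:
            f.seek(self.offsets[idx])
            line = f.readline().decode("utf-8").rstrip("\n")
        # Tokenize/encode `line` here; returned as a raw string for brevity.
        return line
```

Because this is a map-style dataset, Lightning should be able to wrap it in a DistributedSampler automatically in DDP mode, so each of the 8 nodes only pulls its own subset of indices rather than the whole file.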
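If random access isn't required, an IterableDataset can instead stream the file and shard it manually across ranks and workers. A sketch, under the assumption that the process group is already initialized when iteration starts (class name hypothetical):

```python
import torch.distributed as dist
from torch.utils.data import IterableDataset, get_worker_info

class StreamingTextDataset(IterableDataset):
    """Streams lines from disk; each (rank, worker) pair reads every
    num_shards-th line, so no example is read twice per epoch."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Figure out this process's global shard among all ranks and workers.
        rank = dist.get_rank() if dist.is_initialized() else 0
        world = dist.get_world_size() if dist.is_initialized() else 1
        info = get_worker_info()
        num_workers = info.num_workers if info is not None else 1
        worker_id = info.id if info is not None else 0
        shard_id = rank * num_workers + worker_id
        num_shards = world * num_workers
        with open(self.path, encoding="utf-8") as f:
            for i, line in enumerate(f):
                if i % num_shards == shard_id:
                    yield line.rstrip("\n")
```

Note that with an IterableDataset the trainer cannot inject a DistributedSampler, which is why the modulo sharding has to live in the dataset itself; shuffling would likewise have to be handled manually, for example with an in-memory buffer.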