Fix hf dataset hang on small dataset #1370

dakinggg · 2024-07-18T02:59:42Z

When a small dataset is processed with multiprocessing, Hugging Face datasets sometimes hangs. This PR works around the hang by skipping multiprocessing when the dataset is small. This may not cover every case, but it fixes the hang we are currently seeing.

Before this PR, processing a dataset with 3 samples hung after tokenization in roughly 20% of runs. After this PR, 20 out of 20 runs completed successfully. A dataset with 512 samples still uses multiprocessing as expected, and it also completed successfully in 20 out of 20 runs.
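
For readers who want the idea in code, here is a minimal sketch of the workaround, not the PR's actual diff. The helper name `choose_num_proc` and the cutoff `MIN_SAMPLES_FOR_MULTIPROCESSING` are hypothetical; the real threshold is not stated in this description, only that 3 samples should run without multiprocessing and 512 samples should still use it. The sketch relies on `Dataset.map` treating `num_proc=None` as "no multiprocessing".

```python
from typing import Optional

# Hypothetical cutoff chosen for illustration only; it is NOT the value used
# in the PR. It just needs to sit between the two cases described above
# (3 samples: no multiprocessing, 512 samples: multiprocessing).
MIN_SAMPLES_FOR_MULTIPROCESSING = 64


def choose_num_proc(
    num_samples: int,
    requested_num_proc: int,
    min_samples: int = MIN_SAMPLES_FOR_MULTIPROCESSING,
) -> Optional[int]:
    """Return the num_proc value to pass to Dataset.map.

    Returns None (process in the main process) when the dataset is small
    or when only one worker was requested; otherwise returns the
    requested worker count unchanged.
    """
    if requested_num_proc <= 1 or num_samples < min_samples:
        return None
    return requested_num_proc


# Assumed usage with Hugging Face datasets (tokenize_fn is a placeholder):
#   dataset = dataset.map(
#       tokenize_fn,
#       num_proc=choose_num_proc(len(dataset), requested_num_proc=8),
#   )
```

Keeping the decision in one small function makes the cutoff easy to tune if the hang shows up at other dataset sizes.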

dakinggg added 5 commits July 17, 2024 18:13

debug

b02f4f6

rm lock files

af0318e

safer

d3a147c

no proc

7401ce4

pc

1fc338c

dakinggg requested a review from a team as a code owner July 18, 2024 02:59

dakinggg requested a review from irenedea July 18, 2024 02:59

dakinggg enabled auto-merge (squash) July 18, 2024 03:01

irenedea approved these changes Jul 18, 2024

dakinggg merged commit 006f251 into mosaicml:main Jul 18, 2024
9 checks passed
