[Examples] Add automatic dataset splitting in language-modeling examples #9133

TevenLeScao · 2020-12-15T17:17:20Z

What does this PR do?

Currently, the language-modeling examples accept an HF-datasets dataset as training data. However, the dataset must provide both a train and a validation split, which many language-modeling datasets lack because they are just unstructured text. The updated scripts automatically partition the train split to create a validation split when one doesn't already exist, and add a validation_split_percentage argument to control the split ratio, set to 5% by default.
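As a rough illustration of the behavior described above, here is a minimal sketch that assumes the split is carved out with the percent-slicing syntax of the `datasets` library (`train[:5%]` and `train[5%:]`). The helper names `validation_split_specs` and `load_with_validation` are hypothetical and are not taken from the PR diff; the actual scripts may organize this differently.

```python
from typing import Optional, Tuple


def validation_split_specs(validation_split_percentage: int = 5) -> Tuple[str, str]:
    """Build (train, validation) split expressions from the train split.

    Hypothetical helper: the first N% of "train" becomes the validation
    set and the remaining (100 - N)% stays as training data.
    """
    if not 0 < validation_split_percentage < 100:
        raise ValueError("validation_split_percentage must be between 1 and 99")
    train_spec = f"train[{validation_split_percentage}%:]"
    validation_spec = f"train[:{validation_split_percentage}%]"
    return train_spec, validation_spec


def load_with_validation(
    dataset_name: str,
    dataset_config_name: Optional[str] = None,
    validation_split_percentage: int = 5,
):
    """Load a dataset and derive a validation split only if it has none.

    Hypothetical wrapper; requires the `datasets` package and network access.
    """
    from datasets import load_dataset  # imported lazily so the helper above has no dependency

    raw = load_dataset(dataset_name, dataset_config_name)
    if "validation" not in raw:
        train_spec, validation_spec = validation_split_specs(validation_split_percentage)
        # Reload the train split twice with complementary slices.
        raw["validation"] = load_dataset(dataset_name, dataset_config_name, split=validation_spec)
        raw["train"] = load_dataset(dataset_name, dataset_config_name, split=train_spec)
    return raw
```

Datasets that already ship a validation split are left untouched, so the argument only matters for datasets that provide a train split alone.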


TevenLeScao · 2020-12-15T17:18:39Z

Ah, the commit from #9127 seems to have snuck its way in there. Should I remove it?

sgugger · 2020-12-15T17:44:49Z

If you can do it easily, that would be best!

sgugger

Looking good to me, thanks!

TevenLeScao · 2020-12-15T18:09:13Z

> If you can do it easily, that would be best!

I've tried for a bit, but I think I just made things worse! If that's OK, I'll leave it there and fix things at merge time.

LysandreJik

Great, LGTM!

TevenLeScao added 2 commits December 15, 2020 16:58

replaced jnp.split + removing textual model inputs + ensuring warmup_…

0dda342

…steps > 0

Add automatic dataset splitting in language-modeling examples

02adf9d

TevenLeScao requested review from LysandreJik and sgugger December 15, 2020 17:17

sgugger approved these changes Dec 15, 2020


LysandreJik approved these changes Dec 15, 2020


LysandreJik merged commit 2a7e8e1 into huggingface:master Dec 15, 2020

merrymercy mentioned this pull request Mar 18, 2021

[Example] Fix a NaN bug in the flax mlm example #10796

Closed
