fix DatasetConstants.splints default value to protect dictionary overwriting #1144

Merged: 5 commits merged on Apr 29, 2024

Conversation

ivan-kud
Contributor

The values of the DatasetConstants.splints dictionary for the "c4" dataset overwrite the values in the splits dictionary for "The Pile". This happens because the default value of DatasetConstants.splints is a dict, which is mutable and therefore shared between instances, so default_factory should be used instead.
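For anyone reading along, here is a minimal sketch of the pitfall and the fix. It does not reproduce the actual code in scripts/data_prep/convert_dataset_hf.py: the class names `SplitInfo`, `SharedConstants`, and `SafeConstants`, the field name `splits`, and all numbers are hypothetical, and the "shared" variant assumes the dictionary was declared as a class-level default.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SplitInfo:
    # Hypothetical stand-in for a per-split record.
    raw_samples: Optional[int]


@dataclass
class SharedConstants:
    chars_per_sample: int
    # No annotation, so dataclass does not treat this as a field.
    # It is a plain class attribute: one dict shared by every instance.
    splits = {}


@dataclass
class SafeConstants:
    chars_per_sample: int
    # default_factory builds a fresh dict for each instance.
    splits: Dict[str, SplitInfo] = field(default_factory=dict)


# Made-up numbers, for illustration only.
pile_shared = SharedConstants(chars_per_sample=10)
c4_shared = SharedConstants(chars_per_sample=20)
pile_shared.splits["train"] = SplitInfo(raw_samples=100)
c4_shared.splits["train"] = SplitInfo(raw_samples=200)  # clobbers the entry above

pile_safe = SafeConstants(chars_per_sample=10)
c4_safe = SafeConstants(chars_per_sample=20)
pile_safe.splits["train"] = SplitInfo(raw_samples=100)
c4_safe.splits["train"] = SplitInfo(raw_samples=200)  # independent dict

print(pile_shared.splits is c4_shared.splits)  # same object
print(pile_safe.splits is c4_safe.splits)      # different objects
```

In the shared variant the second assignment silently replaces the first dataset's "train" entry, which matches the symptom described above. With `field(default_factory=dict)` each instance keeps its own entries.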

This PR also fixes typos in the raw sample counts for train_small.

I also corrected the type hint for DataSplitConstants.raw_samples. The None value is checked further down in the code.
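As a rough illustration of why the hint matters, the sketch below annotates a field that may be None as `Optional[int]` and checks for None before using it. The `SplitInfo` class and the `describe_size` helper are hypothetical and are not taken from the script.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SplitInfo:
    # Hypothetical stand-in for a per-split record.
    hf_split: str
    # Optional[int] tells type checkers that None is a legal value here.
    raw_samples: Optional[int]


def describe_size(split: SplitInfo) -> str:
    # The None check narrows the type before the value is used as an int.
    if split.raw_samples is None:
        return f"{split.hf_split}: size unknown"
    return f"{split.hf_split}: {split.raw_samples} samples"
```

With a plain `int` hint, a type checker would flag any caller that passes None even though the code handles it at runtime.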

…ples amount; correct type hinting for raw_samples
@ivan-kud ivan-kud changed the title from "fix DatasetConstants.splints default value to protecе dictionary overwriting" to "fix DatasetConstants.splints default value to protect dictionary overwriting" Apr 26, 2024
Collaborator

@dakinggg dakinggg left a comment

Thanks!

Review comment on scripts/data_prep/convert_dataset_hf.py (outdated, resolved)
@dakinggg dakinggg merged commit 738956e into mosaicml:main Apr 29, 2024
9 checks passed
@ivan-kud ivan-kud deleted the hf_dataset_convert_fix branch April 30, 2024 07:24
2 participants