Support custom split name by renaming files #6
Overview
This PR adds support for custom split names by renaming files to follow the file naming convention of HuggingFace datasets.
What's changed
In append mode, we now make an additional commit that renames the existing files of the split to their correct names, e.g. with the updated total number of files.
We then commit the newly uploaded files directly under their correct names and, in overwrite mode, delete old files that are not overwritten.
Append mode requires the additional commit because a single commit cannot rename a file A->B while also creating a new file named A.
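For illustration, one way to express the rename commit with huggingface_hub is a copy to the new path plus a delete of the old path for each file. This is a minimal sketch, not necessarily how this PR implements it: the helper names rename_operations and commit_renames are hypothetical, and the commit message is made up.

```python
from huggingface_hub import CommitOperationCopy, CommitOperationDelete, HfApi


def rename_operations(renames):
    """Build commit operations for a list of (old_path, new_path) pairs.

    Hypothetical helper: a rename is modeled as copy-then-delete.
    """
    operations = []
    for old_path, new_path in renames:
        if old_path == new_path:
            continue  # nothing to do for files that already have the right name
        operations.append(
            CommitOperationCopy(src_path_in_repo=old_path, path_in_repo=new_path)
        )
        operations.append(CommitOperationDelete(path_in_repo=old_path))
    return operations


def commit_renames(api: HfApi, repo_id: str, renames) -> None:
    """Apply all renames in one commit, separate from the commit that adds new files."""
    operations = rename_operations(renames)
    if operations:
        api.create_commit(
            repo_id=repo_id,
            repo_type="dataset",
            operations=operations,
            commit_message="Rename existing files of the split",  # assumed message
        )
```

Keeping the renames in their own commit means the follow-up commit only has to add files, so no path is both a rename source and a newly created file within the same commit.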
Walkthrough (append mode)
Suppose that the repo initially contains
We use the data source to append a dataframe with 1 partition to the split.
HuggingFaceDatasetsWriter.write first pre-uploads the new parquet file under a unique temporary name. This uploads the file to LFS but doesn't commit it to the repo.
After all partitions have been pre-uploaded, HuggingFaceDatasetsWriter.commit gets called. It first lists all existing files of the split and renames them to the correct format (changing the total count from 00002 to 00003).
Then it adds the new file under the correct name, in a separate commit.
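The naming step of the walkthrough can be sketched in plain Python. This assumes shard names of the form <prefix>-<index>-of-<total>.parquet with five-digit zero-padded numbers; the function plan_append and the data/custom prefix are hypothetical and only serve the example.

```python
import re

# Assumed shard pattern: "<prefix>-<index>-of-<total>.parquet", five-digit numbers.
SHARD_RE = re.compile(r"^(?P<prefix>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.parquet$")


def shard_name(prefix: str, index: int, total: int) -> str:
    """Format one shard file name under the assumed convention."""
    return f"{prefix}-{index:05d}-of-{total:05d}.parquet"


def plan_append(existing, prefix, num_new):
    """Return (renames, new_names) for appending num_new files to a split.

    renames is a list of (old_name, new_name) pairs for the existing shards;
    new_names are the final names for the newly uploaded files.
    """
    shards = []
    for name in existing:
        match = SHARD_RE.match(name)
        if match and match.group("prefix") == prefix:
            shards.append((int(match.group("index")), name))
    shards.sort()  # keep existing shards in their current order

    total = len(shards) + num_new
    renames = [
        (old_name, shard_name(prefix, i, total))
        for i, (_, old_name) in enumerate(shards)
    ]
    new_names = [shard_name(prefix, i, total) for i in range(len(shards), total)]
    return renames, new_names


renames, new_names = plan_append(
    ["data/custom-00000-of-00002.parquet", "data/custom-00001-of-00002.parquet"],
    prefix="data/custom",
    num_new=1,
)
```

Here the two existing files are renamed to end in -of-00003, and the appended file gets index 00002 of 00003.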
Testing
This PR updates the tests to cover custom split names. It also replaces the reader with direct calls to datasets in the assertions, for slightly less overhead.