Support custom split name by renaming files #6

Merged
merged 4 commits into from
Jan 31, 2025

Conversation

wengh
Collaborator

@wengh wengh commented Jan 30, 2025

Overview

This PR adds support for custom split names by renaming files to follow the file naming convention of HuggingFace datasets.

What's changed

In append mode, we now make an additional commit that renames the split's existing files so their names are correct, e.g. so they reflect the updated total number of files.

Then we commit the newly uploaded files directly with the correct name, and (in overwrite mode) delete old files that are not overwritten.

Append requires an additional commit because in a single commit we cannot rename a file A->B while creating a new file with name A.

Walkthrough (append mode)

Suppose that the repo initially contains

- data/
    - custom-00000-of-00002.parquet
    - custom-00001-of-00002.parquet

We use the data source to append a dataframe of 1 partition to the split:

>>> df.write.format("huggingfacesink").options(token=..., split="custom").save(...)

HuggingFaceDatasetsWriter.write first pre-uploads the new parquet file under a unique temporary name. This uploads the file to LFS but doesn't commit it to the repo.

After all partitions have been pre-uploaded, HuggingFaceDatasetsWriter.commit gets called. It first lists all existing files of the split and renames them to the correct format (changing the total count from 00002 to 00003):

- data/
    - custom-00000-of-00003.parquet (moved from custom-00000-of-00002.parquet)
    - custom-00001-of-00003.parquet (moved from custom-00001-of-00002.parquet)
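The rename step above can be sketched in plain Python. This is only an illustration, not the code in this PR: `plan_renames` and `SHARD_RE` are hypothetical names, and the file name pattern is inferred from the example names in this walkthrough.

```python
import re

# Pattern inferred from the example names above (an assumption, not taken
# from the PR's code): <split>-<index>-of-<total>.parquet, 5-digit numbers.
SHARD_RE = re.compile(
    r"^(?P<split>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.parquet$"
)


def plan_renames(existing, split, num_new):
    """Map each existing shard of `split` to its name under the new total.

    Hypothetical helper: `existing` is a list of file names in data/,
    `num_new` is the number of partitions being appended.
    """
    shards = []
    for name in existing:
        m = SHARD_RE.match(name)
        if m and m.group("split") == split:
            shards.append((int(m.group("index")), name))
    shards.sort()  # keep the existing shard order
    new_total = len(shards) + num_new
    return {
        old: f"{split}-{i:05d}-of-{new_total:05d}.parquet"
        for i, (_, old) in enumerate(shards)
    }


renames = plan_renames(
    ["custom-00000-of-00002.parquet", "custom-00001-of-00002.parquet"],
    split="custom",
    num_new=1,
)
```

Here `renames` maps each old name to its `-of-00003` counterpart, which matches the listing above. Files belonging to other splits are left out of the mapping.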

Then, it uploads the new file under the correct name, in a separate commit:

- data/
    - custom-00000-of-00003.parquet
    - custom-00001-of-00003.parquet
    - custom-00002-of-00003.parquet (new)
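As a rough sketch of how the final names for the appended files could be derived (assuming the same naming pattern as in the listings above; `new_shard_names` is a hypothetical helper, not a function from this PR):

```python
def new_shard_names(split, num_existing, num_new):
    """Return final names for the appended shards.

    Hypothetical helper: new shards take the indices that follow the
    existing ones, and every name carries the combined total.
    """
    total = num_existing + num_new
    return [
        f"{split}-{i:05d}-of-{total:05d}.parquet"
        for i in range(num_existing, total)
    ]


names = new_shard_names("custom", num_existing=2, num_new=1)
```

With two existing files and one new partition, `names` contains the single entry `custom-00002-of-00003.parquet`, as in the listing above.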

Testing

This PR updates the tests to cover custom split names. It also replaces the reader with a direct call to datasets, for slightly less overhead in assertions.

@wengh wengh marked this pull request as ready for review January 30, 2025 01:38
@wengh wengh changed the title Support custom split name by renaming files [SC-187699] Support custom split name by renaming files Jan 30, 2025
@wengh wengh changed the title [SC-187699] Support custom split name by renaming files Support custom split name by renaming files Jan 30, 2025
@wengh wengh requested a review from lhoestq January 31, 2025 00:58
Member

@lhoestq lhoestq left a comment

lgtm :)

@wengh wengh merged commit df37cbd into main Jan 31, 2025
2 participants