Support custom split name by renaming files #6

wengh · 2025-01-30T00:36:59Z

Overview

This PR adds support for custom split names by renaming files to follow the file naming convention of HuggingFace datasets.

What's changed

In append mode, we now make an additional commit that renames the split's existing files so that their names reflect the updated total number of files.

Then we commit the newly uploaded files directly with the correct name, and (in overwrite mode) delete old files that are not overwritten.

Append requires an additional commit because in a single commit we cannot rename a file A->B while creating a new file with name A.

Walkthrough (append mode)

Suppose that the repo initially contains

- data/
    - custom-00000-of-00002.parquet
    - custom-00001-of-00002.parquet

We use the data source to append a dataframe of 1 partition to the split:

>>> df.write.format("huggingfacesink").mode("append").options(token=..., split="custom").save(...)

HuggingFaceDatasetsWriter.write first pre-uploads the new parquet file under a unique temporary name. This uploads the file to LFS but doesn't commit it to the repo.

After all partitions have been pre-uploaded, HuggingFaceDatasetsWriter.commit gets called. It first lists all existing files of the split and renames them to the correct format (changing the total count from 00002 to 00003):

- data/
    - custom-00000-of-00003.parquet (moved from custom-00000-of-00002.parquet)
    - custom-00001-of-00003.parquet (moved from custom-00001-of-00002.parquet)
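The rename step above can be sketched as a pure function that computes the planned moves. This is a minimal illustration, not the code in this PR: the helper name plan_renames, the regular expression, and the five-digit zero padding are assumptions inferred from the file names in the walkthrough.

```python
import re

# Assumed shard name pattern: <split>-<index>-of-<total>.parquet (five digits each).
SHARD_RE = re.compile(r"^(?P<split>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.parquet$")


def plan_renames(existing, split, num_new):
    """Return (old_path, new_path) pairs for the existing shards of `split`.

    `existing` is a list of repo paths, `num_new` is the number of files
    about to be appended. Hypothetical helper for illustration only.
    """
    shards = []
    for path in existing:
        directory, _, name = path.rpartition("/")
        match = SHARD_RE.match(name)
        if match and match.group("split") == split:
            shards.append((int(match.group("index")), directory, path))
    shards.sort()  # keep the existing shard order

    total = len(shards) + num_new
    renames = []
    for new_index, (_, directory, path) in enumerate(shards):
        new_name = f"{split}-{new_index:05d}-of-{total:05d}.parquet"
        new_path = f"{directory}/{new_name}" if directory else new_name
        if new_path != path:  # skip files that already have the right name
            renames.append((path, new_path))
    return renames
```

Calling plan_renames with the two files from the walkthrough, split "custom", and one new file yields the two moves listed above.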

Then, it uploads the new file under the correct name, in a separate commit:

- data/
    - custom-00000-of-00003.parquet
    - custom-00001-of-00003.parquet
    - custom-00002-of-00003.parquet (new)
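The final names of the appended files follow from the number of existing shards and the number of new partitions. The sketch below shows that arithmetic; the helper name new_shard_paths and the default "data" directory are assumptions made for illustration and are not taken from the PR's code.

```python
def new_shard_paths(split, num_existing, num_new, directory="data"):
    """Return the repo paths for newly appended shards of `split`.

    New shards are numbered after the existing ones, and every name
    carries the updated total. Hypothetical helper for illustration only.
    """
    total = num_existing + num_new
    return [
        f"{directory}/{split}-{index:05d}-of-{total:05d}.parquet"
        for index in range(num_existing, total)
    ]
```

With two existing shards and one new partition, this produces the single path data/custom-00002-of-00003.parquet, matching the listing above.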

Testing

This PR updates the tests to cover custom split names. It also replaces the reader with a direct call to datasets, which slightly reduces overhead in the assertions.

pyspark_huggingface/huggingface_sink.py

lhoestq

lgtm :)

wengh added 3 commits January 29, 2025 16:35

rename files to follow split naming convention

2833b4d

remove unused import

c1134f7

add explanation

2e25a9e

wengh marked this pull request as ready for review January 30, 2025 01:38

wengh requested review from lhoestq and allisonwang-db January 30, 2025 01:38

wengh changed the title ~~Support custom split name by renaming files~~ [SC-187699] Support custom split name by renaming files Jan 30, 2025

lhoestq reviewed Jan 30, 2025


1 phase upload when overwriting

6bc7bb0

wengh changed the title ~~[SC-187699] Support custom split name by renaming files~~ Support custom split name by renaming files Jan 30, 2025

wengh requested a review from lhoestq January 31, 2025 00:58

lhoestq approved these changes Jan 31, 2025


wengh merged commit df37cbd into main Jan 31, 2025
