
Refactor parquet dataloader #867

Draft: zyaoj wants to merge 25 commits into base branch main

Conversation

@zyaoj (Contributor) commented Dec 3, 2024

What does this PR do? Please describe:
A first attempt to extract the generic parquet dataloader from MERES and migrate it to fairseq2.

Does your PR introduce any breaking changes? If yes, please list them:
N/A

Check list:

  • Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
  • Did you read the contributor guideline?
  • Did you make sure that your PR does only one thing instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)

@zyaoj zyaoj self-assigned this Dec 3, 2024
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 3, 2024
@zyaoj zyaoj marked this pull request as ready for review January 3, 2025 13:46
@zyaoj zyaoj removed request for artemru and cbalioglu January 3, 2025 13:47
@zyaoj zyaoj marked this pull request as draft January 3, 2025 13:47
requirements-devel.txt (outdated, resolved)
setup.py (outdated, resolved)
Contains options that allow loading only a part of the provided dataset.
"""

columns: Optional[List[str]] = None
Contributor:
I'm not yet sure whether this belongs here.

"""If ``True``, uses Parquet row groups, which are generally smaller, instead
of simple partitions. Highly recommended for non-partitioned Parquet files."""

nb_parallel_fragments: Optional[int] = 5
Contributor:

Maybe we could keep the default as ``None`` and add an extra config arg (``max_tokens``) to be used with dynamic bucketing.
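For illustration, a ``max_tokens`` budget for dynamic bucketing could work along these lines (the function name and greedy strategy are hypothetical, not fairseq2's implementation):

```python
from typing import Iterable, Iterator, List


def bucket_by_tokens(lengths: Iterable[int], max_tokens: int) -> Iterator[List[int]]:
    """Greedily group sample lengths into buckets whose total stays within max_tokens."""
    bucket: List[int] = []
    total = 0
    for n in lengths:
        # Start a new bucket once adding this sample would exceed the budget.
        if bucket and total + n > max_tokens:
            yield bucket
            bucket, total = [], 0
        bucket.append(n)
        total += n
    if bucket:
        yield bucket


buckets = list(bucket_by_tokens([30, 70, 50, 20, 90], max_tokens=100))
print(buckets)  # [[30, 70], [50, 20], [90]]
```

With such a budget, batch size varies per bucket while the token count per batch stays bounded, which is the usual motivation for dynamic bucketing.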

Comment on lines +226 to +240
shuffle: bool = True
"""If ``True``, shuffles the dataset samples during the iteration. If ``False``
and ``order_by_length`` is ``None``, the batch samples will be produced in
natural Parquet dataset reading order."""

drop_null: bool = True
"""If ``True``, drops rows containing any null value."""

seed: int = 123
"""The RNG seed value for deterministic behavior."""

nb_epochs: int = 100
"""The number of passes over the data before iteration stops."""
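Taken together, the fields in this snippet could be grouped roughly as follows (a sketch mirroring the diff, not the actual fairseq2 class):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ParquetLoadingConfig:
    """Options controlling how a Parquet dataset is read and iterated."""

    columns: Optional[List[str]] = None  # subset of columns to load; None loads all
    nb_parallel_fragments: Optional[int] = 5  # fragments read concurrently
    shuffle: bool = True  # shuffle samples during iteration
    drop_null: bool = True  # drop rows containing any null value
    seed: int = 123  # RNG seed for deterministic behavior
    nb_epochs: int = 100  # passes over the data before iteration stops


config = ParquetLoadingConfig(columns=["text"], shuffle=False)
print(config.seed)  # 123
```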
Contributor:
This should probably go into the basic dataset config (frontend pipeline).

return table


def load_one_fragment(
Contributor:
We have the ``SafeFragment`` interface for this now.
