
Refactor parquet dataloader #867

Draft: zyaoj wants to merge 25 commits into base branch main

Conversation

@zyaoj (Contributor) commented Dec 3, 2024

What does this PR do? Please describe:
A first attempt to extract the generic parquet dataloader from MERES and migrate it to fairseq2.

Does your PR introduce any breaking changes? If yes, please list them:
N/A

Check list:

  • Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
  • Did you read the contributor guideline?
  • Did you make sure that your PR does only one thing instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)

@zyaoj zyaoj self-assigned this Dec 3, 2024
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 3, 2024
@zyaoj zyaoj marked this pull request as ready for review January 3, 2025 13:46
@zyaoj zyaoj removed request for artemru and cbalioglu January 3, 2025 13:47
@zyaoj zyaoj marked this pull request as draft January 3, 2025 13:47
requirements-devel.txt (outdated, resolved)
setup.py (outdated, resolved)
Contains options that allow loading only a part of the provided dataset.
"""

columns: Optional[List[str]] = None
Contributor:
I'm not yet sure whether this belongs here.

"""If ``True``, uses Parquet row groups, which are generally smaller, instead
of simple partitions. Highly recommended for non-partitioned Parquet files."""

nb_parallel_fragments: Optional[int] = 5
Contributor:

Maybe we could keep the default as ``None`` and add an extra config arg (``max_tokens``) to be used with dynamic bucketing.
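For illustration, a ``max_tokens`` budget for dynamic bucketing could work along these lines (the function name and greedy strategy are hypothetical, not fairseq2's implementation):

```python
from typing import Iterable, Iterator, List


def bucket_by_tokens(lengths: Iterable[int], max_tokens: int) -> Iterator[List[int]]:
    """Greedily group sample lengths into buckets whose total stays within max_tokens."""
    bucket: List[int] = []
    total = 0
    for n in lengths:
        # Start a new bucket once adding this sample would exceed the budget.
        if bucket and total + n > max_tokens:
            yield bucket
            bucket, total = [], 0
        bucket.append(n)
        total += n
    if bucket:
        yield bucket


buckets = list(bucket_by_tokens([30, 70, 50, 20, 90], max_tokens=100))
print(buckets)  # [[30, 70], [50, 20], [90]]
```

With such a budget, batch size varies per bucket while the token count per batch stays bounded, which is the usual motivation for dynamic bucketing.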

Comment on lines +226 to +240
shuffle: bool = True
"""If ``True``, shuffles the dataset samples during the iteration. If ``False``
and ``order_by_length`` is ``None``, the batch samples will be produced in
natural Parquet dataset reading order."""

drop_null: bool = True
"""If ``True``, drops rows containing any null value."""

seed: int = 123
"""The RNG seed value for deterministic behavior."""

nb_epochs: int = 100
"""The number of passes over the data before iteration stops."""
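Taken together, the fields in this snippet could be grouped roughly as follows (a sketch mirroring the diff, not the actual fairseq2 class):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ParquetLoadingConfig:
    """Options controlling how a Parquet dataset is read and iterated."""

    columns: Optional[List[str]] = None  # subset of columns to load; None loads all
    nb_parallel_fragments: Optional[int] = 5  # fragments read concurrently
    shuffle: bool = True  # shuffle samples during iteration
    drop_null: bool = True  # drop rows containing any null value
    seed: int = 123  # RNG seed for deterministic behavior
    nb_epochs: int = 100  # passes over the data before iteration stops


config = ParquetLoadingConfig(columns=["text"], shuffle=False)
print(config.seed)  # 123
```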
Contributor:
This should probably go into the basic dataset config (frontend pipeline).

return table


def load_one_fragment(
Contributor:
We have the ``SafeFragment`` interface for this now.
