Refactor parquet dataloader #867
base: main
Conversation
Contains options that allow loading only a part of the provided dataset.
"""

columns: Optional[List[str]] = None
I'm not yet sure this belongs here.
src/fairseq2/data/parquet/configs.py (outdated)
"""If ``True``, uses Parquet row groups instead of simple partitions, which
are generally smaller. Highly recommended for non-partitioned Parquet files."""

nb_parallel_fragments: Optional[int] = 5
maybe we could keep it with default=None and add an extra config arg (max_tokens) to be used with dynamic bucketing.
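A rough sketch of that suggestion; class and field docstring wording are assumptions, not the actual fairseq2 config:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FragmentLoadingConfig:
    # Hypothetical config illustrating the suggestion above.
    nb_parallel_fragments: Optional[int] = None
    # None lets the loader decide; an int pins the number of fragments
    # that are read and dispatched concurrently.

    max_tokens: Optional[int] = None
    # Per-batch token budget, consulted only when dynamic bucketing
    # is enabled.
```

With both defaulting to ``None``, a user opts into exactly one of the two behaviors rather than inheriting a magic ``5``.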
shuffle: bool = True
"""If ``True``, shuffles the dataset samples during the iteration. If ``False``
and ``order_by_length`` is ``None``, the batch samples will be produced in
natural Parquet dataset reading order."""

drop_null: bool = True
"""If ``True``, drops rows containing any null value."""

seed: int = 123
"""The RNG seed value for deterministic behavior."""

nb_epochs: int = 100
"""The number of passes over the data before iteration stops."""
This should probably go in the basic dataset config (frontend pipeline).
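One possible split along those lines, with hypothetical class names just to illustrate where each field would live:

```python
from dataclasses import dataclass

@dataclass
class ParquetReadConfig:
    # Backend reading options (what this file currently holds).
    shuffle: bool = True
    drop_null: bool = True
    seed: int = 123

@dataclass
class DatasetConfig:
    # Frontend pipeline options; nb_epochs would move here, since the
    # number of passes is a training-loop concern, not a reader concern.
    nb_epochs: int = 100
```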
return table


def load_one_fragment(
We have the SafeFragment interface for this now.
What does this PR do? Please describe:
A first attempt to extract and migrate the generic parquet dataloader from MERES to fairseq2.
Does your PR introduce any breaking changes? If yes, please list them:
N/A
Check list: