Add HF Dataset components #8
# pylint: disable=too-few-public-methods
class HFDatasetsDataset(ExpressDataset[List[str], Union[datasets.Dataset, datasets.DatasetDict]]):
Also was wondering whether we could have a better name (DatasetsDataset might not be the best name)
Any reason not to just go for `HFDataset`?
tmp_dir
)

data_source_hf_datasets = load_dataset("parquet", data_dir=local_parquet_path, split="train")
`load_dataset` also supports things like multiprocessing (if there are multiple CPU cores) and streaming (loading data on the fly). Not sure if we want to leverage those too.
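To make the two options mentioned above concrete, here is a minimal sketch of how they could be exposed. The helper name `load_parquet_split` and its parameters are hypothetical and not part of this PR; it assumes a reasonably recent `datasets` release in which `load_dataset` accepts `streaming` and `num_proc`. The two options are kept on separate code paths because, as far as I know, they cannot be combined in a single call.

```python
from typing import Any, Optional


def load_parquet_split(
    parquet_dir: str,
    split: str = "train",
    streaming: bool = False,
    num_proc: Optional[int] = None,
) -> Any:
    """Hypothetical helper: load one split from a directory of parquet files."""
    # Imported lazily so that defining this helper does not require `datasets`.
    from datasets import load_dataset

    if streaming:
        # Yields examples on the fly instead of materialising the whole split.
        return load_dataset("parquet", data_dir=parquet_dir, split=split, streaming=True)

    # `num_proc` spreads dataset preparation over several processes;
    # None keeps the library's default single-process behaviour.
    return load_dataset("parquet", data_dir=parquet_dir, split=split, num_proc=num_proc)
```

Whether either option is worth exposing depends on how large the data sources are expected to be; for small local parquet files the defaults are probably sufficient.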
e2c1bae to e3f4aff
@RobbeSneyders so the pipeline fails because |
I think handling soft dependencies would be good indeed. I would add it as a dependency in this PR, and then open a new PR where we implement the soft dependency checks for both |
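For the follow-up PR, a soft dependency check could look roughly like the sketch below. The function names `is_installed` and `require_dependency` are hypothetical, and this is only one possible approach using the standard library, not something already agreed on in this thread.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    # Only top-level names are handled here; dotted names would import parents.
    return importlib.util.find_spec(module_name) is not None


def require_dependency(module_name: str) -> None:
    """Raise a readable error when an optional dependency is missing."""
    if not is_installed(module_name):
        raise ImportError(
            f"Optional dependency '{module_name}' is not installed. "
            f"Install it to use this component."
        )
```

A component would call `require_dependency(...)` in its constructor or at first use, so the core package stays importable when the optional package is absent.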
Seems like I was a bit too strict with the allowed licenses. We can allow weak-copyleft licenses as well. You can add the following ones and it will fix the pipeline:
|
Thanks! LGTM.
Can you just remove the `pylint: disable`s and resolve the conflict?
This PR adds boilerplate `HFDatasetLoaderComponent` and `HFDatasetTransformComponent` classes.
Questions:
- [...] `Dataset` and a `DatasetDict`. The latter is just a dictionary of several `Dataset` instances, with keys like "train", "validation", "test". I assume the index could be stored as a `Dataset`, whereas data sources can be either a `Dataset` or a `DatasetDict`.
- [...] `load_dataset` without providing a split (like "train", "validation"), this method returns a `DatasetDict` instead of a `Dataset`. So was wondering how we could handle these splits a user might have.

To do:
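On the split question raised above, one option would be to normalise both shapes into a single mapping of split name to dataset, so downstream code only deals with one case. The sketch below is an assumption about how that might look, not a decided design; `as_split_dict` and the `default_split` parameter are hypothetical names. It relies on the fact that `DatasetDict` subclasses `dict`, so it needs no import of `datasets` itself.

```python
from typing import Any, Dict


def as_split_dict(data: Any, default_split: str = "train") -> Dict[str, Any]:
    """Return a {split_name: dataset} mapping for either input shape."""
    if isinstance(data, dict):
        # A DatasetDict (a dict subclass) already maps split names to datasets.
        return dict(data)
    # A single Dataset has no split key, so file it under an assumed default.
    return {default_split: data}
```

With this, a loader could always iterate over `as_split_dict(result).items()` regardless of whether the user passed a `split` argument to `load_dataset`.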