Add load dataset #253
Conversation
Force-pushed from 4be25d7 to 9b0a932
Only minor stuff. This is super cool. Will be my new favorite way to get datasets.
Left a few comments/questions, but LGTM.
# Check if table exists
result = plpy.execute(
    f"""
    SELECT pg_catalog.to_regclass('{qualified_table}')::text as friendly_table_name
does this strip the schema if the search_path allows for it?
yes it does. Is that a problem?
No. Just curious
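For context on the exchange above: `to_regclass` returns NULL for a relation that is not visible on the `search_path`, so the existence check reduces to a None test on the result row. A minimal sketch of that branch (`table_exists` is a hypothetical helper, not code from this PR; `rows` mimics the list-of-dicts that `plpy.execute()` returns):

```python
# Sketch only: pg_catalog.to_regclass yields NULL (None on the Python side)
# when the relation does not exist or is not visible, so the check is a
# simple None test on the single result row.
def table_exists(rows):
    # `rows` stands in for plpy.execute()'s list-of-dict result
    return rows[0]["friendly_table_name"] is not None

print(table_exists([{"friendly_table_name": "public.my_table"}]))  # True
print(table_exists([{"friendly_table_name": None}]))               # False
```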
, field_types jsonb default null
, batch_size int default 5000
, max_batches int default null
, commit_every_n_batches int default null
are you sure you want commit_every_n_batches defaulted to null?
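To illustrate how these three parameters might interact, here is a sketch that only plans batch counts and commit points; the parameter names come from the diff above, but the logic is an assumption (in particular, that `commit_every_n_batches = null` means committing once at the very end), not the PR's actual implementation:

```python
import math

def plan_batches(total_rows, batch_size=5000, max_batches=None,
                 commit_every_n_batches=None):
    """Sketch: how many batches run, and after which batches a commit falls.
    Assumes commit_every_n_batches=None means a single transaction
    (one commit at the end)."""
    batches = math.ceil(total_rows / batch_size)
    if max_batches is not None:
        batches = min(batches, max_batches)
    if commit_every_n_batches is None:
        commits = [batches]  # commit only after the final batch
    else:
        commits = [b for b in range(1, batches + 1)
                   if b % commit_every_n_batches == 0 or b == batches]
    return batches, commits

print(plan_batches(12000))                            # (3, [3])
print(plan_batches(12000, commit_every_n_batches=1))  # (3, [1, 2, 3])
print(plan_batches(12000, max_batches=2))             # (2, [2])
```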
Force-pushed from 30957d4 to 3df651f
This function allows you to create a table from any dataset on the Hugging Face Hub. It's an easy way to load data for testing and experimentation; the hub contains 250k+ datasets. We use a streaming version of the API for memory-efficient processing and to optimize the case where you want only part of the dataset. With streaming, on-disk caching should be minimal. We also add a multi-txn version of load_dataset.
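The memory-efficient consumption pattern described here can be sketched with a plain Python generator; this is illustrative of the streaming idea (draining an iterator in fixed-size batches without materializing the whole dataset), not the PR's actual code:

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield lists of up to batch_size items, consuming the iterable lazily."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# A generator stands in for a streamed Hugging Face dataset:
rows = ({"id": i} for i in range(12))
print([len(b) for b in batched(rows, 5)])  # [5, 5, 2]
```

Because each batch is built from the iterator on demand, stopping early (e.g. via `max_batches`) never pulls the rest of the dataset.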
Force-pushed from 3df651f to c7f0b68