Add load dataset #253

Merged · 3 commits merged into main from add_load_dataset · Dec 10, 2024

Conversation

@cevian (Collaborator) commented Nov 25, 2024:

This function allows you to create a table from any dataset on
HuggingFace Hub. It's an easy way to load data for testing and
experimentation. The hub contains 250k+ datasets.

We use a streaming version of the API to allow for memory-efficient
processing and to optimize the case where you want only part of the
dataset. On-disk caching should be minimal when streaming.
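
For reference, here is a minimal sketch of the streaming pattern described above, using the Hugging Face datasets library directly. The dataset name, batch handling, and print statements are illustrative only, not the extension's actual implementation:

# Sketch: stream a Hugging Face dataset and process it in fixed-size batches.
from datasets import load_dataset

# streaming=True returns an IterableDataset: rows are fetched lazily, so the
# full dataset is never materialized in memory (and on-disk caching is minimal).
ds = load_dataset("rotten_tomatoes", split="train", streaming=True)

batch = []
batch_size = 5000  # similar to the batch_size default in the SQL signature shown later

for row in ds:
    batch.append(row)
    if len(batch) >= batch_size:
        # In the extension, this is roughly where a batch would be inserted into the table.
        print(f"processing batch of {len(batch)} rows")
        batch.clear()

if batch:
    print(f"processing final batch of {len(batch)} rows")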

@cevian requested a review from a team as a code owner on November 25, 2024.
@jgpruitt (Collaborator) left a comment:
only minor stuff. this is super cool. will be my new favorite way to get datasets

Resolved review threads on:
projects/extension/sql/idempotent/015-load_dataset.sql
projects/extension/ai/load_dataset.py
@cevian requested a review from jgpruitt on December 9, 2024.
@jgpruitt (Collaborator) left a comment:
left a few comments/questions, but LGTM

Resolved review threads on projects/extension/ai/load_dataset.py (2).

An inline review thread on this excerpt from projects/extension/ai/load_dataset.py:
# Check if table exists
result = plpy.execute(
f"""
SELECT pg_catalog.to_regclass('{qualified_table}')::text as friendly_table_name
Collaborator:
does this strip the schema if the search_path allows for it?

Collaborator (author):
yes it does. Is that a problem?

Collaborator:
No. Just curious
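
For illustration, a small sketch of the behavior discussed here, written as it would appear inside a PL/Python function body (the table name is hypothetical): to_regclass returns NULL when the relation does not exist, and the text cast drops the schema qualifier when that schema is on the search_path.

# Sketch only: assumes a PL/Python context where plpy is available, and that a
# hypothetical table public.my_table exists with 'public' on the search_path.
result = plpy.execute(
    "SELECT pg_catalog.to_regclass('public.my_table')::text AS friendly_table_name"
)
# friendly_table_name comes back as 'my_table' (schema stripped because public is
# on the search_path); it would be None if the table did not exist.
plpy.notice(result[0]["friendly_table_name"])

The next inline thread below is on an excerpt from the function's SQL signature.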

, field_types jsonb default null
, batch_size int default 5000
, max_batches int default null
, commit_every_n_batches int default null
Collaborator:
are you sure you want commit_every_n_batches defaulted to null?
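
For context, a hedged sketch of what a commit_every_n_batches parameter typically controls in a multi-transaction loader. This assumes a PL/Python context where plpy.commit() is permitted (e.g. a procedure on PostgreSQL 11+); the function and helper names here are hypothetical, not this PR's code:

# Sketch only: commit after every N batches; with None, no intermediate commits
# happen and the whole load stays in a single transaction.
def load_batches(batches, commit_every_n_batches=None):
    for i, batch in enumerate(batches, start=1):
        insert_batch(batch)  # hypothetical helper that INSERTs one batch
        if commit_every_n_batches and i % commit_every_n_batches == 0:
            plpy.commit()  # make progress durable and release the transaction

A null default would thus mean "single transaction unless the caller opts in to periodic commits," which is the trade-off the question above is raising.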

@cevian force-pushed the add_load_dataset branch 2 times, most recently from 30957d4 to 3df651f on December 9, 2024. The updated commit message repeats the PR description and adds: "We also add a multi-txn version of load_dataset."
@JamesGuthrie merged commit 29e740e into main on Dec 10, 2024.
5 checks passed
@JamesGuthrie deleted the add_load_dataset branch on December 10, 2024.
JamesGuthrie pushed a commit that referenced this pull request on Dec 10, 2024, with the same commit message.