Add load dataset #253

Merged · 3 commits merged into main from add_load_dataset · Dec 10, 2024

Conversation

@cevian (Collaborator) commented Nov 25, 2024:

This function allows you to create a table from any dataset on
HuggingFace Hub. It's an easy way to load data for testing and
experimentation. The hub contains 250k+ datasets.

We use a streaming version of the API to allow for memory-efficient
processing and to optimize the case where you want only part of the
dataset. On-disk caching should be minimal when streaming.
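
For reference, here is a minimal sketch of the streaming pattern described above, using the Hugging Face datasets library directly. The dataset name, batch handling, and print statements are illustrative only, not the extension's actual implementation:

# Sketch: stream a Hugging Face dataset and process it in fixed-size batches.
from datasets import load_dataset

# streaming=True returns an IterableDataset: rows are fetched lazily, so the
# full dataset is never materialized in memory (and on-disk caching is minimal).
ds = load_dataset("rotten_tomatoes", split="train", streaming=True)

batch = []
batch_size = 5000  # similar to the batch_size default in the SQL signature shown later

for row in ds:
    batch.append(row)
    if len(batch) >= batch_size:
        # In the extension, this is roughly where a batch would be inserted into the table.
        print(f"processing batch of {len(batch)} rows")
        batch.clear()

if batch:
    print(f"processing final batch of {len(batch)} rows")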

@cevian requested a review from a team as a code owner on November 25, 2024.
@jgpruitt (Collaborator) left a comment:
only minor stuff. this is super cool. will be my new favorite way to get datasets

Resolved review threads on:
projects/extension/sql/idempotent/015-load_dataset.sql
projects/extension/ai/load_dataset.py
@cevian requested a review from jgpruitt on December 9, 2024.
@jgpruitt (Collaborator) left a comment:
left a few comments/questions, but LGTM

Resolved review threads on projects/extension/ai/load_dataset.py (2).

An inline review thread on this excerpt from projects/extension/ai/load_dataset.py:
# Check if table exists
result = plpy.execute(
f"""
SELECT pg_catalog.to_regclass('{qualified_table}')::text as friendly_table_name
Collaborator:
does this strip the schema if the search_path allows for it?

Collaborator (author):
yes it does. Is that a problem?

Collaborator:
No. Just curious
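
For illustration, a small sketch of the behavior discussed here, written as it would appear inside a PL/Python function body (the table name is hypothetical): to_regclass returns NULL when the relation does not exist, and the text cast drops the schema qualifier when that schema is on the search_path.

# Sketch only: assumes a PL/Python context where plpy is available, and that a
# hypothetical table public.my_table exists with 'public' on the search_path.
result = plpy.execute(
    "SELECT pg_catalog.to_regclass('public.my_table')::text AS friendly_table_name"
)
# friendly_table_name comes back as 'my_table' (schema stripped because public is
# on the search_path); it would be None if the table did not exist.
plpy.notice(result[0]["friendly_table_name"])

The next inline thread below is on an excerpt from the function's SQL signature.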

, field_types jsonb default null
, batch_size int default 5000
, max_batches int default null
, commit_every_n_batches int default null
Collaborator:
are you sure you want commit_every_n_batches defaulted to null?
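
For context, a hedged sketch of what a commit_every_n_batches parameter typically controls in a multi-transaction loader. This assumes a PL/Python context where plpy.commit() is permitted (e.g. a procedure on PostgreSQL 11+); the function and helper names here are hypothetical, not this PR's code:

# Sketch only: commit after every N batches; with None, no intermediate commits
# happen and the whole load stays in a single transaction.
def load_batches(batches, commit_every_n_batches=None):
    for i, batch in enumerate(batches, start=1):
        insert_batch(batch)  # hypothetical helper that INSERTs one batch
        if commit_every_n_batches and i % commit_every_n_batches == 0:
            plpy.commit()  # make progress durable and release the transaction

A null default would thus mean "single transaction unless the caller opts in to periodic commits," which is the trade-off the question above is raising.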

@cevian force-pushed the add_load_dataset branch 2 times, most recently from 30957d4 to 3df651f on December 9, 2024. The updated commit message repeats the PR description and adds: "We also add a multi-txn version of load_dataset."
@JamesGuthrie merged commit 29e740e into main on Dec 10, 2024.
5 checks passed
@JamesGuthrie deleted the add_load_dataset branch on December 10, 2024.
JamesGuthrie pushed a commit that referenced this pull request on Dec 10, 2024, with the same commit message.