Load Datasets from Hugging Face #19

Open
lhoestq opened this issue Mar 3, 2025 · 2 comments

Comments

@lhoestq

lhoestq commented Mar 3, 2025

Hi team, and congrats on releasing smallpond! I'm a big DuckDB fan, and it's a pleasure to see it used to process data at scale :)

Anyway, I was wondering whether you plan to add support for loading datasets from Hugging Face?

There are 300k+ AI datasets available, mostly in Parquet. Since DuckDB supports reading from hf:// paths (docs), as does pyarrow via fsspec (docs), I figured it would be an easy integration.
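
For illustration, here is a minimal sketch of what reading an hf:// path with plain DuckDB might look like, outside smallpond. The helper names (`hf_parquet_glob`, `preview_query`, `preview`) are hypothetical and not part of smallpond or DuckDB; the sketch assumes the `duckdb` and `pandas` packages are installed and that network access is available when `preview` is called.

```python
from typing import Optional


def hf_parquet_glob(repo_id: str, revision: Optional[str] = None) -> str:
    """Build an hf:// glob matching every Parquet file in a dataset repo.

    Hypothetical helper: repo_id is "<user>/<dataset>", and an optional
    revision is appended after "@".
    """
    repo = f"{repo_id}@{revision}" if revision else repo_id
    return f"hf://datasets/{repo}/**/*.parquet"


def preview_query(repo_id: str, limit: int = 10) -> str:
    """Build a DuckDB SQL query that previews the first rows of a dataset."""
    return f"SELECT * FROM '{hf_parquet_glob(repo_id)}' LIMIT {int(limit)}"


def preview(repo_id: str, limit: int = 10):
    """Run the preview query with DuckDB and return a pandas DataFrame.

    Assumption: duckdb and pandas are installed and the Hugging Face Hub
    is reachable. The import is lazy so the helpers above work without it.
    """
    import duckdb  # third-party dependency

    return duckdb.sql(preview_query(repo_id, limit)).df()


if __name__ == "__main__":
    # Only builds and prints the query; call preview("openai/gsm8k") to run it.
    print(preview_query("openai/gsm8k"))
```

Building the path separately from running the query keeps the network-dependent part in one place, so the same glob could be handed to `sp.read_parquet` instead.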

In case it can help, I already made a fork with basic HF support here: https://github.com/lhoestq/smallpond, based on your fork @mike-luabase

```python
>>> import smallpond
>>> sp = smallpond.init()
>>> df = sp.read_parquet("hf://datasets/openai/gsm8k/**/*.parquet")
>>> df = sp.partial_sql("SELECT * FROM {0} LIMIT 10", df)
>>> print(df.to_pandas())
                                            question                                             answer
0  Natalia sold clips to 48 of her friends in Apr...  How many clips did Natalia sell in May? ** Nat...
1  Weng earns $12 an hour for babysitting. Yester...  How much does Weng earn per minute? ** Weng ea...
2  Betty is saving money for a new wallet which c...  How much money does Betty have in the beginnin...
3  Julie is reading a 120-page book. Yesterday, s...  How many pages did Maila read today? ** Maila ...
4  James writes a 3-page letter to 2 different fr...  How many pages does he write each week? ** He ...
5  Mark has a garden with flowers. He planted pla...  How many more purple flowers are there than ye...
6  Albert is wondering how much pizza he can eat ...  How many slices does the largest pizza have? *...
7  Ken created a care package to send to his brot...  How many pounds of brownies did Ken add? ** To...
8  Alexis is applying for a new job and bought a ...  Define a variable ** Let S be the amount Alexi...
9  Tina makes $18.00 an hour.  If she works more ...  How much does Tina make in an 8-hour shift? **..
```

@mahanteshimath

Nice one.

I am facing an error while reading a local .parquet file. I am using Python 3.11.11.
I tried reinstalling smallpond. Could you please share your thoughts on how to resolve the issue?

[3 attached images, not preserved]

@lhoestq
Author

lhoestq commented Mar 3, 2025

Your problem seems unrelated to this GitHub issue @mahanteshimath, can you open a separate issue instead? (and maybe provide the full stack trace?)
