Add support to create a Dataset from spark dataframe #5678

Closed
lu-wang-dl opened this issue Mar 29, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@lu-wang-dl

Feature request

Add a new API Dataset.from_spark to create a Dataset from Spark DataFrame.
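To make the request concrete, here is a minimal sketch of how the proposed API might be called. It assumes a `Dataset.from_spark(df)` classmethod that accepts a Spark DataFrame, and that `pyspark` and a version of `datasets` providing that method are installed. The names `EXAMPLE_ROWS`, `EXAMPLE_COLUMNS`, and `build_example_dataset` are hypothetical and exist only for illustration.

```python
# Hypothetical sample data used only for this sketch.
EXAMPLE_COLUMNS = ["id", "text"]
EXAMPLE_ROWS = [
    (1, "hello"),
    (2, "spark"),
    (3, "datasets"),
]


def build_example_dataset():
    """Create a tiny Spark DataFrame and convert it to a Hugging Face Dataset."""
    # Imports are local so this file can be loaded without Spark installed.
    from pyspark.sql import SparkSession
    from datasets import Dataset

    # A local two-thread session; a real job would connect to a cluster instead.
    spark = (
        SparkSession.builder
        .master("local[2]")
        .appName("from_spark_sketch")
        .getOrCreate()
    )
    df = spark.createDataFrame(EXAMPLE_ROWS, schema=EXAMPLE_COLUMNS)

    # Assumption: from_spark takes the DataFrame and returns a Dataset.
    # This call runs on the driver, not inside a Spark transformation.
    return Dataset.from_spark(df)
```

Calling `build_example_dataset()` on a machine with both libraries available should return a `Dataset` with the `id` and `text` columns. The exact keyword arguments of the final API are not specified in this request.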

Motivation

Spark is a distributed computing framework that can handle large datasets. By supporting loading Spark DataFrames directly into Hugging Face Datasets, we can take advantage of Spark to process the data in parallel.

By providing a seamless integration between these two frameworks, we make it easier for data scientists and developers to work with both Spark and Hugging Face in the same workflow.

Your contribution

We can discuss the ideas, and I can help prepare a PR for this feature.

@yanzia12138

yanzia12138 commented Jun 16, 2023

When I read a Spark DataFrame on a multi-node Spark cluster, I get an error.
Does the API (Dataset.from_spark) support running on a Spark cluster, reading a DataFrame, and then calling save_to_disk?

Error:
_pickle.PicklingError: Could not serialize object: RuntimeError: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
23/06/16 21:17:20 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)

@oakkas84

How can I run predictions on a Dataset object in Spark with multi-node cluster parallelism?

@mariosasko
Collaborator

Addressed in #5701

@lhoestq
Member

lhoestq commented Aug 27, 2024

Hi! For your information, we are working on more documentation on how to use Spark with HF Datasets repositories (without the need for the datasets library): #5678
Cc @lu-wang-dl @maddiedawson let me know what you think !

@lhoestq
Member

lhoestq commented Aug 27, 2024

sorry, wrong link: huggingface/hub-docs#1392
