Add support to create a Dataset from spark dataframe #5678

lu-wang-dl · 2023-03-29T04:36:28Z

Feature request

Add a new API, Dataset.from_spark, to create a Dataset from a Spark DataFrame.
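A minimal sketch of how the requested call might be used, assuming `pyspark` and `datasets` are installed and that `Dataset.from_spark` accepts a Spark DataFrame as its first argument. The helper name `spark_to_hf_dataset`, the local `SparkSession`, and the sample rows are illustrative assumptions and are not part of the request.

```python
def spark_to_hf_dataset(rows, columns):
    """Build a small Spark DataFrame locally and convert it to a Hugging Face Dataset.

    Hypothetical helper for illustration only; `rows` is a list of tuples and
    `columns` is a list of column names.
    """
    # Imports live inside the function so this module loads even where
    # Spark is not installed.
    from datasets import Dataset
    from pyspark.sql import SparkSession

    # Assumption: a single-threaded local session is enough for a demo.
    spark = (
        SparkSession.builder
        .master("local[1]")
        .appName("from_spark_sketch")
        .getOrCreate()
    )
    df = spark.createDataFrame(rows, schema=columns)

    # The call this issue asks for: materialize the DataFrame as a Dataset.
    return Dataset.from_spark(df)
```

It could then be called as `ds = spark_to_hf_dataset([(1, "hello"), (2, "world")], ["id", "text"])`. This sketch only covers a local session; it says nothing about behavior on a multi-node cluster.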

Motivation

Spark is a distributed computing framework that can handle large datasets. By supporting loading Spark DataFrames directly into Hugging Face Datasets, we can take advantage of Spark to process the data in parallel.

By providing a seamless integration between these two frameworks, we make it easier for data scientists and developers to work with both Spark and Hugging Face in the same workflow.

Your contribution

We can discuss the ideas, and I can help prepare a PR for this feature.

yanzia12138 · 2023-06-16T13:18:05Z

When I read a Spark DataFrame on a multi-node Spark cluster, I get an error.
Does the API (Dataset.from_spark) support reading a DataFrame on a Spark cluster and then calling save_to_disk?

Error:
_pickle.PicklingError: Could not serialize object: RuntimeError: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
23/06/16 21:17:20 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)

oakkas84 · 2023-06-16T15:16:01Z

How can I run predictions on a Dataset object in Spark with multi-node cluster parallelism?

mariosasko · 2023-07-21T14:15:38Z

Addressed in #5701

lhoestq · 2024-08-27T14:41:35Z

Hi! For your information, we are working on more documentation on how to use Spark with HF Datasets repositories (without needing the datasets library): ~~#5678~~
Cc @lu-wang-dl @maddiedawson, let me know what you think!

lhoestq · 2024-08-27T14:42:58Z

Sorry, wrong link: huggingface/hub-docs#1392

lu-wang-dl added the enhancement label Mar 29, 2023

maddiedawson mentioned this issue Apr 3, 2023

Add Dataset.from_spark #5701

Merged

maddiedawson mentioned this issue Apr 27, 2023

Add IterableDataset.from_spark #5770

Merged

mariosasko closed this as completed Jul 21, 2023
