Feature: PySpark integration #3774

Open
BjarkeTornager opened this issue Jul 7, 2024 · 5 comments
Labels
feature New features or missing components of existing features

Comments

@BjarkeTornager

API

Python

Description

Have you considered making an integration between Kùzu and PySpark?

Neo4j, as an example, has a Neo4j connector for Apache Spark.

Spark also has a community project called GraphFrames that can be used for basic graph algorithms.

Since Spark is widely used for analytical queries, machine learning, and streaming, it could be useful to move data between the two.

@BjarkeTornager BjarkeTornager added the feature New features or missing components of existing features label Jul 7, 2024
@prrao87
Member

prrao87 commented Jul 8, 2024

Hi @BjarkeTornager, this is something that could be on the roadmap, but it has not yet been prioritized: we typically wait for several upvotes from the community before deciding how much to prioritize a new integration. Numerous other integrations are already underway for our 0.5.0 release and beyond, so we hope you can understand. In the meantime, we are also releasing a basic graph algorithms package soon that can provide some of the functionality that GraphFrames does, so stay tuned!

@BjarkeTornager
Author

Thanks @prrao87, looking forward to the Kùzu basic graph algorithm package!

@abhiwattpad

It would be great to have Spark integration with Kùzu, especially for large-scale data ingestion!

@prrao87
Member

prrao87 commented Aug 12, 2024

Just adding some scope for initial functionality here: The proposed integration would behave just like the Pandas/Polars DataFrame integration does:

  • Scan data from a PySpark DataFrame into a Kùzu node/rel table
  • Export the results of a Cypher query to a Spark DataFrame

Unlike Pandas/Polars, the I/O and related tasks may not be fully in-memory - we'd need to see how the persistent formats under the hood of Spark work, and also how to design the API to expose the connector to the Python client of Kùzu.
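Until a native connector exists, the two directions listed above can be roughed out by routing through pandas. The sketch below is only an interim workaround under stated assumptions, not the proposed connector: the function names `spark_df_to_kuzu` and `kuzu_query_to_spark_df` are hypothetical, the target table is assumed to already exist with columns matching the DataFrame, and the Spark data is assumed to be small enough to collect onto the driver. It relies on PySpark's `DataFrame.toPandas()` and `SparkSession.createDataFrame()`, and on Kùzu's existing `COPY <table> FROM <dataframe variable>` scan and `QueryResult.get_as_df()`. Whether a function-local DataFrame variable is visible to Kùzu's scan may depend on the Kùzu version, so treat that part as an assumption too.

```python
def spark_df_to_kuzu(conn, spark_df, table_name):
    """Hypothetical helper: copy a small Spark DataFrame into an existing Kùzu table.

    `conn` is expected to be a kuzu.Connection and `spark_df` a
    pyspark.sql.DataFrame; both are passed in so this module has no
    hard dependency on either library.
    """
    if not table_name.isidentifier():
        # Guard against building a malformed Cypher statement.
        raise ValueError(f"unexpected table name: {table_name!r}")
    # toPandas() collects every row onto the Spark driver, so this only
    # works when the data fits in driver memory.
    df = spark_df.toPandas()
    # Kùzu's DataFrame scan refers to the pandas object by its Python
    # variable name ("df" here). Assumption: the local variable is visible.
    conn.execute(f"COPY {table_name} FROM df")
    return len(df)


def kuzu_query_to_spark_df(conn, spark, query):
    """Hypothetical helper: run a Cypher query and return a Spark DataFrame."""
    result = conn.execute(query)
    # get_as_df() materializes the full result as a pandas DataFrame.
    pandas_df = result.get_as_df()
    # createDataFrame() accepts a pandas DataFrame and distributes it.
    return spark.createDataFrame(pandas_df)
```

With the real libraries installed, `conn` would come from `kuzu.Connection(kuzu.Database(...))` and `spark` from `SparkSession.builder.getOrCreate()`. Because both helpers pull the full dataset into a single process, they do not address the large-scale ingestion case; that is exactly where a native connector would have to work with Spark's partitioned or persisted data instead.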

@lucifermorningstar1305

When dealing with large-scale data, it would be best to have a way to integrate Kùzu with Spark DataFrames, something like what Neo4j has. That way, anyone could upload batches of data to Kùzu without writing extensive code.

@prrao87 prrao87 pinned this issue Oct 15, 2024
4 participants