
Performance benchmark on RayDP v.s. Spark #340

Open
chenya-zhang opened this issue May 5, 2023 · 3 comments

Comments


chenya-zhang commented May 5, 2023

Hi there,

In the talk "RayDP: Build Large-scale End-to-end Data Analytics and AI Pipelines Using Spark and Ray" https://youtu.be/ELSrR1Geqg4?t=819, @carsonwang mentioned that RayDP would have better performance.

We are curious which types of queries and workflows you ran, and what your analysis of the performance differences showed.

Thanks a lot!

@carsonwang
Collaborator

Hi @chenya-zhang, there is a plan to integrate RayDP with Gluten, which offloads SQL operations to a native engine such as Velox. For TPC-H-like and TPC-DS-like benchmarks, we observed a speedup of more than 2x. You can find more details in the Gluten project: https://github.com/oap-project/gluten.

We are also running RayDP + XGBoost on Ray workflows and have observed a performance advantage over running XGBoost on Spark. We will share more once the data is ready to publish.

@rishabh-dream11

Hi @carsonwang, can you please share the performance benchmark numbers for Ray + XGBoost vs. XGBoost on Spark?

@rishabh-dream11

@carsonwang Did the plan to integrate RayDP with Gluten materialize?
