Better SparkSession settings for localhost #143
Comments
Good job @MrPowers for getting this conversation started. I think we have 3 ways to do this (going from least likely to most likely to have an impact): 1- recommend the different configuration in the Spark docs, getting-started guide, etc. 2- change Spark so it automatically uses the 'better' configuration above 3- change Spark so that it 'guesses' a configuration based on the available RAM, the number of CPU cores, etc. On top of improving Spark performance (which is important), I wonder what Spark's positioning should be: should Spark be the engine for big data or the engine for data of any size? Your thoughts? |
@lucazanna - I think we should first figure out the true capabilities of Spark locally and then figure out the best messaging. Here are the results for one of the h2o queries: I think the current benchmarks are really misleading... |
I'm in favor of the automatic configuration (leaning towards higher memory consumption) with configurable parameters that the user can change if needed. I think these are good configurable parameters:
For executor and driver memory we could do a percentage of available system memory. It doesn't look like there's a good way to do this with Python's standard library but |
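One possible sketch of this idea: compute driver/executor memory as a fraction of total system memory and format it as a Spark memory string. The `fraction_of_memory` helper and the 0.8 default are illustrative assumptions; the total-memory bytes would come from something like `psutil.virtual_memory().total` (a third-party package), since, as noted above, the standard library has no portable way to get it.

```python
# Sketch: derive a Spark memory setting (e.g. "51g") as a fraction of
# total system memory. Helper name and the 0.8 fraction are illustrative.

def fraction_of_memory(total_bytes: int, fraction: float = 0.8) -> str:
    """Return a Spark-style memory string for `fraction` of `total_bytes`."""
    gigabytes = int(total_bytes * fraction / (1024 ** 3))
    return f"{max(gigabytes, 1)}g"  # never go below 1g

# On a 64 GB machine, reserve 80% of RAM for the driver:
total = 64 * 1024 ** 3
print(fraction_of_memory(total))  # 51g
```

The resulting string could then be passed to `spark.driver.memory` when building the session.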
@jeffbrennan - figuring out how to programmatically set the best settings is a great goal. The first step is to get everyone working with the same datasets on their local machines so we can tinker and find what settings work best. There are so many Spark configuration options, and I'm not even sure which knobs need to be turned (let alone how to optimally turn them)! |
Here are some other suggestions that might be useful: https://luminousmen.com/post/how-to-speed-up-spark-jobs-on-small-test-datasets |
I do use this one. With it, users get the maximum memory and all CPU cores in local mode. |
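The snippet this comment referred to did not survive in this text. As a hedged sketch, the standard way to use all local cores is `master("local[*]")`; the memory value below is a placeholder, not a recommendation:

```python
# Sketch of a local-mode SparkSession configuration. "local[*]" runs
# Spark with as many worker threads as there are CPU cores; the driver
# memory value here is a placeholder assumption.

def local_spark_conf(driver_memory: str = "8g") -> dict:
    return {
        "spark.master": "local[*]",           # use all local CPU cores
        "spark.driver.memory": driver_memory,  # driver holds most RAM in local mode
    }

# Building the session (requires pyspark):
# from pyspark.sql import SparkSession
# builder = SparkSession.builder
# for key, value in local_spark_conf().items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```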
Here are some other settings that might be useful: https://www.linkedin.com/posts/dipanjan-s-5874b1a0_pyspark-unittesting-optimization-activity-7122896865762701312-oF-j?utm_source=share&utm_medium=member_desktop |
Do you absolutely need Spark? What about Polars or DuckDB in case you only target single-node deployments?
The topic is about running unit tests of Spark routines. These tests run on a single node (locally). Maybe I'm missing something, but how would Polars/DuckDB help here? |
No, for these purposes they will not help. |
Users need to configure their SparkSession for localhost development so that computations run fast and they don't run out of memory.
Here are some examples I ran on my local machine (64GB of RAM) with the 1e9 h2o groupby dataset (1 billion rows of data).
Here's the "better config":
Here's the default config:
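The exact configs from the original benchmark were not preserved in this text. As an illustrative sketch only: a "better config" for localhost typically raises driver memory and drops `spark.sql.shuffle.partitions` from its default of 200 down to roughly the number of local cores. The specific values below are assumptions, not the settings used in the benchmark:

```python
# Illustrative only: the original configs were lost. Defaults shown are
# Spark's documented out-of-the-box values; the "better" values are
# assumptions sized for a 64GB, ~8-core laptop.

default_conf = {
    "spark.sql.shuffle.partitions": "200",  # Spark's default
    "spark.driver.memory": "1g",            # Spark's default
}

better_conf = {
    "spark.sql.shuffle.partitions": "8",    # ~ number of local cores (assumption)
    "spark.driver.memory": "48g",           # most of a 64GB machine (assumption)
}
```

Fewer shuffle partitions avoid tiny-task overhead on a single machine, and a large driver memory allocation keeps big aggregations from spilling or failing.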
groupby query
This query takes 104 seconds with the "better config":
This same query errors out with the default config.
join query
This query takes 69 seconds with the "better config", but 111 seconds with the default config:
Conclusion
SparkSession configurations significantly impact the localhost Spark runtime experience. How can we make it easy for Spark users to get optimal configurations for localhost development?