Spark worker scaling parameters #260
... seeing how our Spark instance is hardcoded to run locally (renaissance/benchmarks/apache-spark/src/main/scala/org/renaissance/apache/spark/SparkUtil.scala, line 61 at 510b3b9) ...
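For context, a minimal sketch (not the actual SparkUtil.scala code) of what such a hardcoded local-mode setup typically looks like; `threadCount` here stands in for whatever value the harness computes:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch of a hardcoded local-mode setup; `threadCount`
// is an assumed stand-in for the value the harness derives.
val threadCount = 4
val conf = new SparkConf()
  .setAppName("renaissance-spark")
  .setMaster(s"local[$threadCount]") // local master: everything runs in one JVM
val sc = new SparkContext(conf)
```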
I was under the impression that the number of executors influenced the default number of partitions when creating RDDs directly from files on disk. It was also one of the reasons I avoided creating RDDs by just preparing data collections in memory and calling ...
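A sketch of the distinction being described (assuming an existing `SparkContext` named `sc` and a placeholder input path):

```scala
// Reading from disk: the partition count follows the input splits and the
// optional minPartitions hint, which is tied to spark.default.parallelism.
val fromFile = sc.textFile("data/input.txt")           // partitions from file splits
val fromFileHinted = sc.textFile("data/input.txt", 8)  // explicit minPartitions hint

// Building from an in-memory collection: the partition count defaults to
// sc.defaultParallelism unless a slice count is passed explicitly.
val fromMemory = sc.parallelize(1 to 1000000)          // defaultParallelism slices
val fromMemoryFixed = sc.parallelize(1 to 1000000, 8)  // explicit slice count
```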
This was in the original code from the very beginning. I probably did a cursory check of the documentation at some point to see what that setting means, but at this point I only recall that the whole idea was to control the number of executors (thread pools) and the number of threads per executor. If this does not work as expected, then we should revisit this completely.
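For reference, the executors-times-threads model would normally be expressed through Spark configuration along these lines; note that these settings take effect under a cluster manager such as YARN, not in local mode, which is the crux of this issue:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.instances", "4") // executor JVMs (thread pools); cluster-manager only
  .set("spark.executor.cores", "2")     // concurrent tasks (threads) per executor
```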
I think in local mode there is always only one (in-process) executor. Not sure if some other functions (like partitioning) react to the setting of ... If we aim for single-JVM execution, then I think we can drop the executor count thing, as well as a few other config bits, and just set the master to ...
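A quick way to check this claim (a sketch, assuming Spark 2.x+ with `SparkSession`): in local mode the default parallelism should track the `local[N]` thread count and ignore the executor-instance setting:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[4]")                      // one in-process executor, 4 task threads
  .config("spark.executor.instances", "8") // expected to be ignored in local mode
  .appName("local-mode-check")
  .getOrCreate()

// Expected to print 4 (the local[N] thread count), not anything
// derived from spark.executor.instances.
println(spark.sparkContext.defaultParallelism)
```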
The thing with controlling the number of executor instances appears to originate from #145 and #147, and at that time it seemed to work for @farquet. Looking at the allowed master URLs, the recommended setting for ... I wonder if we should perhaps set the master to ...
#274 removes the configuration of executor instances (along with the benchmark parameter) as well as explicit input data partitioning (for now). I was wondering whether it would make sense to have a parameter, e.g., ..., along the lines of the sketch below.
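Something like the following, where both the parameter name and the helper are hypothetical, not what #274 actually implements:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical `inputPartitions` benchmark parameter: when set, repartition
// the input explicitly; when absent, let Spark's defaults apply.
def loadInput(sc: SparkContext, path: String, inputPartitions: Option[Int]): RDD[String] = {
  val rdd = sc.textFile(path)
  inputPartitions.map(n => rdd.repartition(n)).getOrElse(rdd)
}
```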
I have updated the PR and the measurement bundle (plugins work now). For testing, I added ...
Assuming other machines behave similarly, I think we should cap the number of threads used as follows (with a warning if more cores are available): ...

That is, until we tackle the scaling issue more systematically.
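A sketch of such a capping policy (the concrete limits proposed above are elided in this thread, so `threadCap` is a stand-in):

```scala
// Cap the Spark thread count and warn when the machine has more cores.
val threadCap = 12 // stand-in for the proposed limit
val available = Runtime.getRuntime.availableProcessors()
val threads = math.min(available, threadCap)
if (available > threadCap) {
  Console.err.println(
    s"WARNING: $available cores available, capping Spark threads at $threadCap")
}
val master = s"local[$threads]"
```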
The limits reflect the recommendations in #260 (comment)
Currently, the performance of the Spark benchmarks does not change with the configured number of executors, except for ALS, which partitions the input data based on the configuration. This note in the Spark documentation may be relevant: ...
This is quite vague, but may explain why our code (renaissance/benchmarks/apache-spark/src/main/scala/org/renaissance/apache/spark/SparkUtil.scala, line 66 at 510b3b9) ...
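To illustrate why ALS is the exception (a sketch, assuming an existing `SparkContext` named `sc`; the actual benchmark's input pipeline differs): partitioning the input based on the configured executor count makes that setting visible to ALS:

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// `executorCount` stands in for the benchmark's configured executor number.
val executorCount = 4
val ratings = sc.parallelize(
  Seq(Rating(1, 10, 3.0), Rating(1, 20, 4.0), Rating(2, 10, 5.0)),
  executorCount // input explicitly partitioned based on the configuration
)
// Train a small model: rank = 10, iterations = 5.
val model = ALS.train(ratings, 10, 5)
```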