[SPARK-3488][MLLIB] Cache python RDDs after deserialization for relevant iterative learners. #2362
Conversation
Can one of the admins verify this patch?
Can you do some benchmarks to show the difference? I suspect that caching the serialized data will be better than caching the original objects, since the former relieves a lot of GC pressure. That is how we do it in Spark SQL: the columns are serialized (and possibly compressed) for caching. Also, there are some cases where the cache is "none" after this patch; what does that mean?
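To make the two strategies under discussion concrete, here is a minimal Scala sketch; the function names and the `deserialize` parameter are illustrative, not the actual PythonMLLibAPI code (in the current code the serialized RDD is actually cached on the Python side):

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Strategy A (current behavior, approximated in Scala): keep the compact
// serialized bytes cached and deserialize them again on every pass.
def cacheSerializedBytes(
    bytes: RDD[Array[Byte]],
    deserialize: Array[Byte] => LabeledPoint): RDD[LabeledPoint] = {
  bytes.persist(StorageLevel.MEMORY_ONLY) // small footprint, low GC pressure
  bytes.map(deserialize)                  // uncached: re-deserialized on each iteration
}

// Strategy B (this patch): deserialize once, then cache the resulting objects.
def cacheDeserializedObjects(
    bytes: RDD[Array[Byte]],
    deserialize: Array[Byte] => LabeledPoint): RDD[LabeledPoint] = {
  bytes.map(deserialize).persist(StorageLevel.MEMORY_ONLY) // objects on the JVM heap
}
```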
Hi, I implemented this per the discussion here: #2347 (comment), assuming I understood the comment correctly. The context is that we want to log a warning when running an iterative learning algorithm on an uncached RDD. What originally led me to identify SPARK-3488 is that if the deserialized Python RDDs are always uncached, the warning will always be logged. Obviously a meaningful performance difference would trump the implementation of this warning message, and I haven't measured performance - just discussed options in the pull request referenced above. But by way of comparison, is there any significant difference in memory pressure between caching a LabeledPoint RDD deserialized from Python and caching a LabeledPoint RDD created natively in Scala (which is the typical use case with a Scala rather than Python client)? If I should do some performance testing, are there any examples of tests and infrastructure you'd suggest as a starting point? 'none' means the RDD is not cached within the Python -> Scala MLlib interface, where previously it was cached. The learning algorithms for which RDDs are no longer cached implement their own caching internally (or are not iterative).
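For context, the warning discussed in #2347 amounts to checking the input's storage level before training. The sketch below uses a hypothetical helper name and is not the actual MLlib code:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper: warn when an iterative learner is handed an uncached RDD.
// With the pre-patch behavior, the deserialized RDD built inside the Python ->
// Scala bridge is always uncached, so a check like this would always fire for
// Python callers even though the serialized RDD was cached.
def warnIfUncached(input: RDD[_], algorithm: String): Unit = {
  if (input.getStorageLevel == StorageLevel.NONE) {
    Console.err.println(
      s"WARN: the input RDD for $algorithm is not cached; " +
        "iterative algorithms may re-evaluate it on every iteration.")
  }
}
```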
I think you could pick whichever algorithm you expect to show the biggest difference. As for the repeated warning, it's probably not hard to make it show only once.
I ran a simple logistic regression performance test on my local machine (Ubuntu desktop with 8 GB RAM and an SSD). I used two data sizes: 2m records, which was not memory constrained, and 10m records, which was memory constrained (generating log messages such as the 'CacheManager: Not enough space to cache partition' messages discussed below).

2m records: […]
10m records: […]

It looks like, running in memory, this patch provides a 33% speed improvement, while the memory-constrained 10m run was slower with this patch. I'm not that familiar with the typical MLlib memory profile. Do you think the in-memory result here would be similar to a real-world run?

Finally, here is the test script; let me know if it seems reasonable. The data generation was roughly inspired by your MLlib perf test in spark-perf.

Data generation: […]
Test: […]
The benchmark results sound reasonable, thanks for confirming. Caching the RDD after serialization reduces memory usage and GC pressure, but has some CPU overhead. Also, caching the serialized data from Python is better than serializing it again in the JVM (master is always better than with MEMORY_ONLY_SER). I think memory usage (sometimes running out of memory, i.e. stability) is more important than CPU right now, so I would like to hold off on this change and maybe revisit it in the future.
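For reference, the "w/ MEMORY_ONLY_SER" variant mentioned above corresponds to caching the already-deserialized objects in JVM-serialized form, roughly as sketched here (an illustration, not code from the patch):

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Sketch: the data has already been deserialized from the Python format and is
// then re-serialized with the JVM serializer for caching. This pays a JVM
// (de)serialization cost on every iteration on top of the one-time Python
// deserialization, which is why it loses to simply keeping the
// Python-serialized bytes cached (master).
def cacheWithJvmSerialization(points: RDD[LabeledPoint]): RDD[LabeledPoint] =
  points.persist(StorageLevel.MEMORY_ONLY_SER)
```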
@davies understood, thanks for your feedback. It sounds like for now the preference is to continue caching the Python-serialized version, because the reduced memory footprint is currently worth the CPU cost of repeated deserialization. Would it make sense to preserve the portions of this patch that drop caching for the NaiveBayes, ALS, and DecisionTree learners, which I do not believe require external caching to prevent repeated RDD re-evaluation during learning? NaiveBayes only evaluates its input RDD once, while ALS and DecisionTree internally persist transformations of their input RDDs.
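The internal-caching pattern being referred to looks roughly like the sketch below; the method and type parameters are hypothetical and not the actual ALS or DecisionTree code:

```scala
import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical sketch: a learner that transforms its input and persists the
// transformed RDD itself, so caching the raw input buys little extra.
def trainWithInternalCaching[T, U: ClassTag](
    input: RDD[T],
    transform: T => U)(runIterations: RDD[U] => Unit): Unit = {
  val prepared = input.map(transform).persist(StorageLevel.MEMORY_AND_DISK)
  try {
    runIterations(prepared) // iterates over the internally persisted RDD
  } finally {
    prepared.unpersist()
  }
}
```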
Those changes are helpful; you could do them in another PR.
@staple How many iterations did you run? Did you generate data or load from disk/hdfs? Did you cache the Python RDD? When the dataset is not fully cached, I still expect similar performance. But your result shows a big gap. Maybe it is rotating cached blocks.
@mengxr I ran for 100 iterations. I loaded data from disk using Python's SparkContext.pickleFile() (the disk is an SSD). I did not do any manual caching. For more details, you can also see the test script I included in my description above. I also saved the logs from my test runs if those would be helpful to see. During the 10m record run I saw many log messages about 'CacheManager: Not enough space to cache partition', which I interpreted as indicating a lack of caching due to memory exhaustion. But I haven't diagnosed the slowdown beyond that.
@mengxr If I understand correctly, I think you are saying that you don't expect performance degradation with this patch when the RDDs can't all be cached. I'll look at the cache manager log messages and see if there is any evidence of partitions cycling through there.
For the PR code, it looks like on each training iteration there are messages about not being able to cache partitions rdd_4_5 - rdd_4_27. For the master code, it looks like on each training iteration there are messages about not being able to cache partitions rdd_3_13 and rdd_3_15 - rdd_3_27. It looks to me like a greater proportion of the data can be cached in master; I would guess the remainder needs to be pulled from disk. The set of cached partitions seems consistent across all training iterations for a given performance test run. But caveat: this is my first exposure to the caching algorithm in Spark.
When running an iterative learning algorithm, it makes sense for the input RDD to be cached for improved performance. Previously, when learning was applied to a Python RDD, the Python RDD was always cached, then in Scala that cached RDD was mapped to an uncached deserialized RDD, and the uncached RDD was passed to the learning algorithm. Since the RDD with deserialized data was uncached, learning algorithms would implicitly deserialize the same data repeatedly, on every iteration.
This patch moves RDD caching to after deserialization for learning algorithms that should be called with a cached RDD. For algorithms that implement their own caching internally, the input RDD is no longer cached. Below I've listed the learning routines accessible from Python, the location where caching was previously enabled, and the location (if any) where caching is now enabled by this patch; a simplified sketch of the overall change follows the list.
LogisticRegressionWithSGD:
was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
now: jvm (trainRegressionModel)
SVMWithSGD:
was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
now: jvm (trainRegressionModel)
NaiveBayes:
was: python (in _get_unmangled_labeled_point_rdd)
now: none
KMeans:
was: python (in _get_unmangled_double_vector_rdd)
now: jvm (trainKMeansModel)
ALS:
was: python (in _get_unmangled_rdd)
now: none
LinearRegressionWithSGD:
was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
now: jvm (trainRegressionModel)
LassoWithSGD:
was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
now: jvm (trainRegressionModel)
RidgeRegressionWithSGD:
was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
now: jvm (trainRegressionModel)
DecisionTree:
was: python (in _get_unmangled_labeled_point_rdd)
now: none
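For the SGD-based learners above, the overall shape of the change is sketched below. The signature is simplified and does not match the real PythonMLLibAPI.trainRegressionModel; it only shows where persist() moves to:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Simplified sketch only. Previously, the serialized RDD was persisted on the
// Python side and the deserialized RDD arrived here uncached; with this patch,
// the deserialized RDD is persisted before the iterative learner runs.
def trainRegressionModelSketch(
    deserialized: RDD[LabeledPoint],
    runLearner: RDD[LabeledPoint] => Unit): Unit = {
  deserialized.persist(StorageLevel.MEMORY_ONLY) // cache after deserialization
  runLearner(deserialized)                       // each iteration reuses the cached objects
}
```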