-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache #31517
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…k into guava-cache-to-caffeine
core/src/main/scala/org/apache/spark/deploy/history/ApplicationCache.scala
Outdated
Show resolved
Hide resolved
core/src/test/scala/org/apache/spark/deploy/history/ApplicationCacheSuite.scala
Outdated
Show resolved
Hide resolved
|
Test build #135009 has finished for PR 31517 at commit
|
|
Test build #135006 has finished for PR 31517 at commit
|
|
Test build #135011 has finished for PR 31517 at commit
|
|
Just out of curiosity, how much performance improvement we can get by using Caffeine? Do we have a rough number? |
|
@Ngone51 I have tested some benchmark data in SPARK-34309, but I haven't further tested it with spark scenarios. Further progress may be delayed, the Spring Festival holiday is coming soon in China, haha :) |
|
Gentle ping, @LuciferYang . |
|
Kubernetes integration test status success |
|
Test build #141806 has finished for PR 31517 at commit
|
|
Kubernetes integration test unable to build dist. exiting with code: 1 |
|
Test build #141818 has finished for PR 31517 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #141935 has finished for PR 31517 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #142020 has finished for PR 31517 at commit
|
|
Merged to the current branch targeting 3.3.0. Thanks everyone for taking the time to review the PR & special thanks to @LuciferYang for sticking with this for several months. |
|
thanks all ~ |
|
Thank you, @LuciferYang and @holdenk and all! |
|
Sorry for my late review. @LuciferYang Since it adds the new dependences, could you show us the benefits besides the one in the micro-benchmark? Yes, Caffeine is faster than Guava Cache from benchmarks, but does it affect our performance if it is not being used in any hot path? Maybe I am wrong, could you show me an example? |
|
Thank you for your review @gatorsmile ~ On the one hand, I think the main purpose of using Guava cache in the code is to solve the problem of hot path access, so I want to recommend a faster library assuming I understand it correctly , there are some other benchmark results in SPARK-34309. On the other hand, Spark is still using guava 14.0.1, which is a 2013 version and the latest version is 30.1. There are some known issues related to Guava Cache(such as google/guava#1761, google/guava#1337 and so on), and I also found that some JIRA wanted to upgrade the Guava version(like SPARK-33090,SPARK-32502, SPARK-23897, SPARK-25762 and so on) but didn't completed (I'm very sorry I didn't study the reasons for why they weren't completed, maybe it's related to some cascading dependencies?), so I try to use a new Cache library to get rid of the dependence on Guava cache and it's easier to upgrade it to fix some possible issues Does the current use of Caffeine have any negative effects? If it exists, we can revert this pr. |
|
The benchmark, etc. here look promising. My only concern is that hidden behaviour changes by switching the library. Do they claim compatibility vs Guava at least?
So I felt like this had to be investigated first. |
@HyukjinKwon As far as I know, there are some differences in exception handling, for example: Guava Cache Caffeine But in fact, @ben-manes also provided CaffeinatedGuava to solve these differences, but we didn't use Maybe I don't know enough. @ben-manes Could you help me give more information? |
|
The choice on what is best for this project is yours, and Guava's is certainly a fine cache. If there is no performance, feature, or stability gain then leaving it as is would be a reasonable, conservative choice. I'll provide some background and project information, but I can't speak to what is best for your project. HistoryI coauthored Guava's cache, where Charles and I tried to port algorithmic ideas from my earlier CLHM project into Bob Lee's MapMaker class. This was challenging because Bob made design decisions that I disagreed with, as he was very enthusiastic about soft references as an ideal caching strategy and optimized for that (hence replacing the decorator-based ReferenceMap in Google Collections). While this appears optimal in a microbenchmark, the GC impact made it unusable in practice (e.g. AdWords' SRE team vocally complained). We introduced a concurrent LRU, rewrote expiration to O(1), and designed an API with feedback from Josh Bloch. There were major performance flaws from the original structure that I couldn't fix, but this was in the Java 5 time period (mostly Java 4 idiomatically) so it was still quite good. The API, feature set, and best-in-class performance on low core servers was a step forward. It replaced all of the ad hoc Java caches at Google, which was the main goal for a Guava as a standard internal library. By Java 8 everyone who worked on it had moved on, ConcurrentHashMap was completely rewritten, and the language had evolved. Caffeine is a complete rewrite with a design biased towards size-eviction (not reference), an updated Guava-like API, and no time pressures to allow for exploration of data structures & algorithms. CompatibilityThe Guava adapters are supported by a port of Guava's unit tests, minus assertions that inspect internal state. It passes the behavioral checks as implemented, though the tests are not exhaustive. Caffeine's test suite is parameterized with test methods declaring their specification constraints. This allows us to brute force runs of every acceptable configuration, which helps detect if features interact badly. This also lets us run the tests against Guava through reverse adapters, thereby providing a reference implementation. In sum the entire test suite runs ~6.8 million scenarios over 1.5 hours on CI. There are small differences. Some are bugs found in Guava, not all of which are fixed. Others are improvements, such as linearizability and a different eviction policy. The most important changes for migration is discussed on the wiki. Tests and static analyzers don't catch all mistakes. There has been enough usage to uncover and fix many issues. At this point the Guava team does not actively maintain its cache and recommends Caffeine, which they prefer to use internally. Since I haven't been there since Guava's heyday, the switch means that I had little influence on that. RecommendationsThat's not my place. I can only say that the library tries to be a modern version of Guava's cache, is actively maintained, and enough bugs have been fixed that users have ensured a mature codebase. The design differences overall means that Caffeine is usually much better, but tradeoffs means Guava might be slightly ahead in a few narrow cases. What makes sense for this project and its maintainability is your team's decision. |
|
Thanks @ben-manes. |
|
@ben-manes thanks for the context and building for the great project.
@LuciferYang Caffeine is definitely an amazing and popular project. But when we decide whether we should add a library , especially a popular library, to Spark, we also need to think about the impact to the downstream libraries and Spark applications. As Caffeine is pretty popular, many downstream libraries and Spark users probably are using it. If Spark adds it to its classpath, it will force them to use the same Caffeine version as Spark. If someone would like to use a different Caffeine version, the work would be non trivial since they need to shade their Caffeine version. Do you have any examples that we can get Caffeine's benefits from hot path, so that we can use that to make the tradeoff? |
Can we solve this problem by
For example, Overall, I think both are OK. If need to revert the code, please tell me.I am not entangled and opposed to this. I can also be responsible for completing this work :) I just want to mention it again. As I said above, we are still using Guava 14.0.1, which is a 2013 version. Some fixed about Guava Cache and other Guava components are still hidden in Spark, and I found many JIRA want to upgrade it which seems to have not been completed. And Caffeine is a component that only focuses on cache, so it is relatively easy to upgrade it :) |
### What changes were proposed in this pull request? This PR upgrades Caffeine to `2.9.2`. Caffeine was introduced in SPARK-34309 (#31517). At the time that PR was opened, the latest version of caffeine was `2.9.1` but now `2.9.2` is available. ### Why are the changes needed? `2.9.2` have the following improvements (https://github.com/ben-manes/caffeine/releases/tag/v2.9.2). * Fixed reading an intermittent null weak/soft value during a concurrent write * Fixed extraneous eviction when concurrently removing a collected entry after a writer resurrects it with a new mapping * Fixed excessive retries of discarding an expired entry when the fixed duration period is extended, thereby resurrecting it ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CIs. Closes #33772 from sarutak/upgrade-caffeine-2.9.2. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
…Guava Cache" ### What changes were proposed in this pull request? This pr revert the change of SPARK-34309, includes: - #31517 - #33772 ### Why are the changes needed? 1. No really performance improvement in Spark 2. Added an additional dependency ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #33784 from LuciferYang/revert-caffeine. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?
There are 3 ways to use Guava cache in spark code:
Loadingcacheis the main way to use Guava cache in spark code and the key usages are as follows:a.
LoadingCachewithmaximumsizedata eviction policy, such asappCacheinApplicationCache,cacheinCodegeneratorb.
LoadingCachewithmaximumWeightdata eviction policy, such asshuffleIndexCacheinExternalShuffleBlockResolverc.
LoadingCachewith 'expireAfterWrite' data eviction policy, such astableRelationCacheinSessionCatalogManualCacheis another way to use Guava cache in spark code and the key usage iscacheinSharedInMemoryCache, it use to caches partition file statuses in memoryThe last use way is
hadoopJobMetadatainSparkEnv, it uses Guava Cache to build asoft-reference map.The goal of this pr is use
Caffeineinstead ofGuava CachebecauseCaffeineis faster thanGuava Cachefrom benchmarks, the main changes as follows:Add
Caffeinedeps to mavenpom.xmlUse
Caffeineinstead of GuavaLoadingCache,ManualCacheand soft-reference map inSparkEnvAdd
LocalCacheBenchmarkto compare performance ofLoadingcachebetweenGuava CacheandCaffeineWhy are the changes needed?
Caffeineis faster thanGuava Cachefrom benchmarksDoes this PR introduce any user-facing change?
No
How was this patch tested?
LocalCacheBenchmarkto compare performance ofLoadingcachebetweenGuava CacheandCaffeine