Skip to content

Conversation

@LuciferYang
Copy link
Contributor

@LuciferYang LuciferYang commented Feb 8, 2021

What changes were proposed in this pull request?

There are 3 ways to use Guava cache in spark code:

  1. Loadingcache is the main way to use Guava cache in spark code and the key usages are as follows:
    a. LoadingCache with maximumsize data eviction policy, such as appCache in ApplicationCache, cache in Codegenerator
    b. LoadingCache with maximumWeight data eviction policy, such as shuffleIndexCache in ExternalShuffleBlockResolver
    c. LoadingCache with 'expireAfterWrite' data eviction policy, such as tableRelationCache in SessionCatalog

  2. ManualCache is another way to use Guava cache in spark code and the key usage is cache in SharedInMemoryCache, it use to caches partition file statuses in memory

  3. The last use way is hadoopJobMetadata in SparkEnv, it uses Guava Cache to build a soft-reference map.

The goal of this pr is use Caffeine instead of Guava Cache because Caffeine is faster than Guava Cache from benchmarks, the main changes as follows:

  1. Add Caffeine deps to maven pom.xml

  2. Use Caffeine instead of Guava LoadingCache, ManualCache and soft-reference map in SparkEnv

  3. Add LocalCacheBenchmark to compare performance of Loadingcache between Guava Cache and Caffeine

Why are the changes needed?

Caffeine is faster than Guava Cache from benchmarks

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Pass the Jenkins or GitHub Action
  • Add LocalCacheBenchmark to compare performance of Loadingcache between Guava Cache and Caffeine

@LuciferYang LuciferYang changed the title [SPARK-34309][CORE][SQL] Use Caffeine instead of Guava Cache [WIP][SPARK-34309][CORE][SQL] Use Caffeine instead of Guava Cache Feb 8, 2021
@LuciferYang LuciferYang marked this pull request as draft February 8, 2021 05:44
@SparkQA
Copy link

SparkQA commented Feb 8, 2021

Test build #135009 has finished for PR 31517 at commit 4761a5b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 8, 2021

Test build #135006 has finished for PR 31517 at commit 0c5382a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 8, 2021

Test build #135011 has finished for PR 31517 at commit 4b49b84.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51
Copy link
Member

Ngone51 commented Feb 8, 2021

Just out of curiosity, how much performance improvement we can get by using Caffeine? Do we have a rough number?

@LuciferYang
Copy link
Contributor Author

@Ngone51 I have tested some benchmark data in SPARK-34309, but I haven't further tested it with spark scenarios.

Further progress may be delayed, the Spring Festival holiday is coming soon in China, haha :)

@LuciferYang LuciferYang changed the title [WIP][SPARK-34309][CORE][SQL] Use Caffeine instead of Guava Cache [WIP][SPARK-34309][BUILD][CORE][SQL] [K8S]Use Caffeine instead of Guava Cache Feb 9, 2021
@dongjoon-hyun
Copy link
Member

Gentle ping, @LuciferYang .

@SparkQA
Copy link

SparkQA commented Jul 29, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46319/

@SparkQA
Copy link

SparkQA commented Jul 29, 2021

Test build #141806 has finished for PR 31517 at commit 6c74fc6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait AlterTableColumnCommand extends UnaryCommand
  • case class AlterTableAddColumns(
  • case class AlterTableReplaceColumns(

@SparkQA
Copy link

SparkQA commented Jul 29, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46331/

@SparkQA
Copy link

SparkQA commented Jul 29, 2021

Test build #141818 has finished for PR 31517 at commit 5a75b2c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46445/

@SparkQA
Copy link

SparkQA commented Aug 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46445/

@SparkQA
Copy link

SparkQA commented Aug 2, 2021

Test build #141935 has finished for PR 31517 at commit 706a68d.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46531/

@SparkQA
Copy link

SparkQA commented Aug 4, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46531/

@SparkQA
Copy link

SparkQA commented Aug 4, 2021

Test build #142020 has finished for PR 31517 at commit 81f863f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 01cf6f4 Aug 4, 2021
@holdenk
Copy link
Contributor

holdenk commented Aug 4, 2021

Merged to the current branch targeting 3.3.0. Thanks everyone for taking the time to review the PR & special thanks to @LuciferYang for sticking with this for several months.

@LuciferYang
Copy link
Contributor Author

LuciferYang commented Aug 5, 2021

thanks all ~

@dongjoon-hyun
Copy link
Member

Thank you, @LuciferYang and @holdenk and all!

@gatorsmile
Copy link
Member

Sorry for my late review.

@LuciferYang Since it adds the new dependences, could you show us the benefits besides the one in the micro-benchmark? Yes, Caffeine is faster than Guava Cache from benchmarks, but does it affect our performance if it is not being used in any hot path? Maybe I am wrong, could you show me an example?

@LuciferYang
Copy link
Contributor Author

LuciferYang commented Aug 11, 2021

Thank you for your review @gatorsmile ~

On the one hand, I think the main purpose of using Guava cache in the code is to solve the problem of hot path access, so I want to recommend a faster library assuming I understand it correctly , there are some other benchmark results in SPARK-34309.

On the other hand, Spark is still using guava 14.0.1, which is a 2013 version and the latest version is 30.1. There are some known issues related to Guava Cache(such as google/guava#1761, google/guava#1337 and so on), and I also found that some JIRA wanted to upgrade the Guava version(like SPARK-33090SPARK-32502, SPARK-23897, SPARK-25762 and so on) but didn't completed (I'm very sorry I didn't study the reasons for why they weren't completed, maybe it's related to some cascading dependencies?), so I try to use a new Cache library to get rid of the dependence on Guava cache and it's easier to upgrade it to fix some possible issues

Does the current use of Caffeine have any negative effects? If it exists, we can revert this pr.

@LuciferYang LuciferYang deleted the guava-cache-to-caffeine branch August 11, 2021 05:12
@HyukjinKwon
Copy link
Member

The benchmark, etc. here look promising. My only concern is that hidden behaviour changes by switching the library. Do they claim compatibility vs Guava at least?

Does the current use of Caffeine have any negative effects?

So I felt like this had to be investigated first.

@LuciferYang
Copy link
Contributor Author

The benchmark, etc. here look promising. My only concern is that hidden behaviour changes by switching the library. Do they claim compatibility vs Guava at least?

@HyukjinKwon As far as I know, there are some differences in exception handling, for example:

Guava Cache

   * @throws ExecutionException if a checked exception was thrown while loading the value
   * @throws UncheckedExecutionException if an unchecked exception was thrown while loading the
   *     value
   * @throws ExecutionError if an error was thrown while loading the value
   */
  V get(K key) throws ExecutionException;

Caffeine

   * @throws NullPointerException if the specified key is null
   * @throws IllegalStateException if the computation detectably attempts a recursive update to this
   *         cache that would otherwise never complete
   * @throws CompletionException if a checked exception was thrown while loading the value
   * @throws RuntimeException or Error if the {@link CacheLoader} does so, in which case the mapping
   *         is left unestablished
   */
  @Nullable
  V get(@NonNull K key);

But in fact, @ben-manes also provided CaffeinatedGuava to solve these differences, but we didn't use CaffeinatedGuava at @holdenk 's suggestion.

Maybe I don't know enough. @ben-manes Could you help me give more information?

@ben-manes
Copy link

ben-manes commented Aug 11, 2021

The choice on what is best for this project is yours, and Guava's is certainly a fine cache. If there is no performance, feature, or stability gain then leaving it as is would be a reasonable, conservative choice. I'll provide some background and project information, but I can't speak to what is best for your project.

History

I coauthored Guava's cache, where Charles and I tried to port algorithmic ideas from my earlier CLHM project into Bob Lee's MapMaker class. This was challenging because Bob made design decisions that I disagreed with, as he was very enthusiastic about soft references as an ideal caching strategy and optimized for that (hence replacing the decorator-based ReferenceMap in Google Collections). While this appears optimal in a microbenchmark, the GC impact made it unusable in practice (e.g. AdWords' SRE team vocally complained). We introduced a concurrent LRU, rewrote expiration to O(1), and designed an API with feedback from Josh Bloch. There were major performance flaws from the original structure that I couldn't fix, but this was in the Java 5 time period (mostly Java 4 idiomatically) so it was still quite good. The API, feature set, and best-in-class performance on low core servers was a step forward. It replaced all of the ad hoc Java caches at Google, which was the main goal for a Guava as a standard internal library. By Java 8 everyone who worked on it had moved on, ConcurrentHashMap was completely rewritten, and the language had evolved. Caffeine is a complete rewrite with a design biased towards size-eviction (not reference), an updated Guava-like API, and no time pressures to allow for exploration of data structures & algorithms.

Compatibility

The Guava adapters are supported by a port of Guava's unit tests, minus assertions that inspect internal state. It passes the behavioral checks as implemented, though the tests are not exhaustive.

Caffeine's test suite is parameterized with test methods declaring their specification constraints. This allows us to brute force runs of every acceptable configuration, which helps detect if features interact badly. This also lets us run the tests against Guava through reverse adapters, thereby providing a reference implementation. In sum the entire test suite runs ~6.8 million scenarios over 1.5 hours on CI.

There are small differences. Some are bugs found in Guava, not all of which are fixed. Others are improvements, such as linearizability and a different eviction policy. The most important changes for migration is discussed on the wiki.

Tests and static analyzers don't catch all mistakes. There has been enough usage to uncover and fix many issues. At this point the Guava team does not actively maintain its cache and recommends Caffeine, which they prefer to use internally. Since I haven't been there since Guava's heyday, the switch means that I had little influence on that.

Recommendations

That's not my place. I can only say that the library tries to be a modern version of Guava's cache, is actively maintained, and enough bugs have been fixed that users have ensured a mature codebase. The design differences overall means that Caffeine is usually much better, but tradeoffs means Guava might be slightly ahead in a few narrow cases. What makes sense for this project and its maintainability is your team's decision.

@HyukjinKwon
Copy link
Member

Thanks @ben-manes.

@zsxwing
Copy link
Member

zsxwing commented Aug 12, 2021

@ben-manes thanks for the context and building for the great project.

Does the current use of Caffeine have any negative effects? If it exists, we can revert this pr.

@LuciferYang Caffeine is definitely an amazing and popular project. But when we decide whether we should add a library , especially a popular library, to Spark, we also need to think about the impact to the downstream libraries and Spark applications. As Caffeine is pretty popular, many downstream libraries and Spark users probably are using it. If Spark adds it to its classpath, it will force them to use the same Caffeine version as Spark. If someone would like to use a different Caffeine version, the work would be non trivial since they need to shade their Caffeine version.

Do you have any examples that we can get Caffeine's benefits from hot path, so that we can use that to make the tradeoff?

@LuciferYang
Copy link
Contributor Author

LuciferYang commented Aug 12, 2021

@LuciferYang Caffeine is definitely an amazing and popular project. But when we decide whether we should add a library , especially a popular library, to Spark, we also need to think about the impact to the downstream libraries and Spark applications. As Caffeine is pretty popular, many downstream libraries and Spark users probably are using it. If Spark adds it to its classpath, it will force them to use the same Caffeine version as Spark. If someone would like to use a different Caffeine version, the work would be non trivial since they need to shade their Caffeine version.

Can we solve this problem by shade + relocation, just like now Spark uses Guava and Jetty now. If we accept it, I can do this work.

Do you have any examples that we can get Caffeine's benefits from hot path, so that we can use that to make the tradeoff?

For example, FileStatusCache on driver side of a resident process(like thrift-server). Of course, the improvement will not be very significant because Guava Cache is not bad.

Overall, I think both are OK. If need to revert the code, please tell me.I am not entangled and opposed to this. I can also be responsible for completing this work :)

I just want to mention it again. As I said above, we are still using Guava 14.0.1, which is a 2013 version. Some fixed about Guava Cache and other Guava components are still hidden in Spark, and I found many JIRA want to upgrade it which seems to have not been completed. And Caffeine is a component that only focuses on cache, so it is relatively easy to upgrade it :)

sarutak added a commit that referenced this pull request Aug 18, 2021
### What changes were proposed in this pull request?

This PR upgrades Caffeine to `2.9.2`.
Caffeine was introduced in SPARK-34309 (#31517). At the time that PR was opened, the latest version of caffeine was `2.9.1` but now `2.9.2` is available.

### Why are the changes needed?

`2.9.2` have the following improvements (https://github.com/ben-manes/caffeine/releases/tag/v2.9.2).

* Fixed reading an intermittent null weak/soft value during a concurrent write
* Fixed extraneous eviction when concurrently removing a collected entry after a writer resurrects it with a new mapping
* Fixed excessive retries of discarding an expired entry when the fixed duration period is extended, thereby resurrecting it

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CIs.

Closes #33772 from sarutak/upgrade-caffeine-2.9.2.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
HyukjinKwon pushed a commit that referenced this pull request Aug 22, 2021
…Guava Cache"

### What changes were proposed in this pull request?
This pr revert the change of SPARK-34309, includes:

- #31517
- #33772

### Why are the changes needed?

1. No really performance improvement in Spark
2. Added an additional dependency

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #33784 from LuciferYang/revert-caffeine.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.