-
Notifications
You must be signed in to change notification settings - Fork 29k
Revert "[SPARK-34309][BUILD][CORE][SQL][K8S] Use Caffeine instead of Guava Cache" #33784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @gatorsmile @HyukjinKwon @zsxwing @sarutak @sunchao @holdenk @mridulm @dongjoon-hyun and |
|
hm I was thinking that we'd keep that fix since that's already merged. |
|
That PR was merged for |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #142637 has finished for PR 33784 at commit
|
|
+1 on keeping this. I think the main pushback is that this will leak dependency to downstream apps and cause conflicts, but we can figure out a way, such as shading, during the Spark 3.3 timeline. The same issue also applies to Guava today and we should tackle them together. |
|
@LuciferYang Thank you for submitting the revert PR. @sarutak @dongjoon-hyun As I mentioned in the previous PR, adding a popular library to Spark's classpath directly is risky. @LuciferYang 's suggestion to shade Caffeine is a good solution. But it sounds not worth because:
Do I miss any benefits provided by Caffeine? |
|
Maybe I'm missing some context, what's the reason for a revert to a slower implementation? Is it that there might be a conflict? Have you encountered this conflict with any projects? If anything reducing our scope on Guava (the cause of many conflicts) seems to be useful? |
|
Also wait, I'm under the impression that shuffle block fetching is in the hot-path, we have to block on fetches. |
The improvement doesn't sound noticeable since this is a RPC operation.
It's hard to check this before the release goes out. But since Caffeine is pretty popular (10.3k stars), the overlap of Spark and Caffeine users should not be small, and the chance to hit the conflict is not low. SPARK-34309 also mentioned that
Not sure whether they use it in the client or the server (It would be great if someone can help confirm). If they use it in the client, it may be a problem when people use the connector for these libraries with Spark. |
|
Ok looking at the datastax cassandra connector I'm not seeing caffeine in the direct dependencies - https://mvnrepository.com/artifact/com.datastax.oss/java-driver-core/4.13.0 , same thing with neo4j client https://mvnrepository.com/artifact/org.neo4j.driver/neo4j-java-driver/4.3.4 While I won't -1 this, I am -0 given the lack of an known conflict. |
|
https://mvnrepository.com/artifact/com.github.ben-manes.caffeine/caffeine/usages says
https://mvnrepository.com/artifact/org.apache.spark/spark-sql/usages says
This suggests that Caffeine is as popular as Spark. I think the risk of conflict is pretty high. |
Then it should be easy to find a case that matters :) |
|
Because, for comparison, we use antlr4 which has ~955 usages & bouncycastle with > 2.5k usages without problem. |
|
Apache Solr is using Caffeine https://mvnrepository.com/artifact/org.apache.solr/solr-core/8.9.0 Sounds like a good example of potential conflict. Anyway, we still cannot predict the Caffeine versions used by Spark applications directly since most of users won't publish their applications to maven central. I'm definitely -1 for putting Caffeine on the Spark classpath directly. But +0 if other people think the extra work to shade the three libraries and the maintenance burden are the reasonable cost to pay. |
I think it's good to distinguish between the client, which doesn't depend on solr-core or caffeine https://mvnrepository.com/artifact/org.apache.solr/solr-solrj/8.9.0 , and the server process (which seems like a less likely source of conflicts because I'd expect most people to be using the client).
Agreed, predicting conflicts from maven central is probably not the best way, but since you brought up maven statistics in your discussion I figured I'd point out that we use other libraries with more public usage than caffeine already.
I'm not super sure how -1 on already committed but not released code works ( https://spark.apache.org/committers.html doesn't really call that out), but like I've said I'm only -0 on this revert. That said, there have been other reservations expressed by other committers you should check in with around consensus before merging. |
Thanks for your clarifying. Will wait for other committers' feedback. |
|
Caffeine is definitely fairly popular (At Databricks, I actually had a project that used it but eventually removed it due to a timing bug we found; the bug was ultimately fixed but we had to remove it due to project deadline; Ben knows what I'm talking about here). I'm confused here. Did we see perf improvement with Caffeine in any real scenarios? I saw some microbenchmarks in the original JIRA ticket saying there was improvements (and with some commentary from Ben saying the benchmark was not done correctly), but the pull request here claiming no improvement. |
|
Looks like performance benefit is something not everyone seems to be agreed with, then it cannot be the rationalization of introducing the new dependency. In other perspective, I'd be happy to see the movement of gradually removing Guava usage (I expect that is the main reason other projects start to introduce Caffeine), "after" we build the consensus. If there's no strong agreement on the movement, nothing would happen at least for now, which is not wrong. |
|
I am fine with reverting if somebody feels strongly on this. |
HyukjinKwon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with this revert - it might better be conservative here after rethinking. Let me get this in if that's fine to all here.
|
Merged to master. But please let me know if you guys think differently. I am open to revert the revert. |
What changes were proposed in this pull request?
This pr revert the change of SPARK-34309, includes:
Why are the changes needed?
Does this PR introduce any user-facing change?
No
How was this patch tested?
Pass the Jenkins or GitHub Action