Skip to content

Conversation

@LuciferYang
Copy link
Contributor

What changes were proposed in this pull request?

This pr revert the change of SPARK-34309, includes:

Why are the changes needed?

  1. No really performance improvement in Spark
  2. Added an additional dependency

Does this PR introduce any user-facing change?

No

How was this patch tested?

Pass the Jenkins or GitHub Action

@LuciferYang
Copy link
Contributor Author

cc @gatorsmile @HyukjinKwon @zsxwing @sarutak @sunchao @holdenk @mridulm @dongjoon-hyun and RemoteBlockPushResolver has some conflicts, I solved it manually. please help review ~

@HyukjinKwon
Copy link
Member

hm I was thinking that we'd keep that fix since that's already merged.

@sarutak
Copy link
Member

sarutak commented Aug 19, 2021

That PR was merged for 3.3.0 and we have time to release it. So, can we keep that change until we meet actual problem?

@SparkQA
Copy link

SparkQA commented Aug 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47138/

@SparkQA
Copy link

SparkQA commented Aug 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47138/

@SparkQA
Copy link

SparkQA commented Aug 19, 2021

Test build #142637 has finished for PR 33784 at commit 756c905.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Aug 19, 2021

I'm also negative about reverting this on master branch.

cc @sunchao , @viirya , @dbtsai

@sunchao
Copy link
Member

sunchao commented Aug 19, 2021

+1 on keeping this. I think the main pushback is that this will leak dependency to downstream apps and cause conflicts, but we can figure out a way, such as shading, during the Spark 3.3 timeline. The same issue also applies to Guava today and we should tackle them together.

@zsxwing
Copy link
Member

zsxwing commented Aug 19, 2021

@LuciferYang Thank you for submitting the revert PR.

@sarutak @dongjoon-hyun As I mentioned in the previous PR, adding a popular library to Spark's classpath directly is risky. @LuciferYang 's suggestion to shade Caffeine is a good solution. But it sounds not worth because:

  • There is no evidence to show performance improvement in practice. The code path using cache is not on any hot path, and the difference between Guava Cache and Caffeine is probably not noticeable in practice.
  • Extra work to shade the following three libraries (Need to verify this approach is working):
caffeine/2.9.2//caffeine-2.9.2.jar
checker-qual/3.10.0//checker-qual-3.10.0.jar
error_prone_annotations/2.5.1//error_prone_annotations-2.5.1.jar
  • Increase the maintenance burden when backporting a patch that touches the changed code to other branches.

Do I miss any benefits provided by Caffeine?

@holdenk
Copy link
Contributor

holdenk commented Aug 19, 2021

Maybe I'm missing some context, what's the reason for a revert to a slower implementation? Is it that there might be a conflict? Have you encountered this conflict with any projects? If anything reducing our scope on Guava (the cause of many conflicts) seems to be useful?

@holdenk
Copy link
Contributor

holdenk commented Aug 19, 2021

Also wait, I'm under the impression that shuffle block fetching is in the hot-path, we have to block on fetches.
I'm (currently) against this change, but I'm super open to additional context (like if this is causing conflicts in practice).

@zsxwing
Copy link
Member

zsxwing commented Aug 19, 2021

I'm under the impression that shuffle block fetching is in the hot-path, we have to block on fetches.

The improvement doesn't sound noticeable since this is a RPC operation.

like if this is causing conflicts in practice

It's hard to check this before the release goes out. But since Caffeine is pretty popular (10.3k stars), the overlap of Spark and Caffeine users should not be small, and the chance to hit the conflict is not low.

SPARK-34309 also mentioned that

caffeine has been used in some open source projects like Cassandra, Hbase, Neo4j, Druid, Spring and so on.

Not sure whether they use it in the client or the server (It would be great if someone can help confirm). If they use it in the client, it may be a problem when people use the connector for these libraries with Spark.

@holdenk
Copy link
Contributor

holdenk commented Aug 19, 2021

Ok looking at the datastax cassandra connector I'm not seeing caffeine in the direct dependencies - https://mvnrepository.com/artifact/com.datastax.oss/java-driver-core/4.13.0 , same thing with neo4j client https://mvnrepository.com/artifact/org.neo4j.driver/neo4j-java-driver/4.3.4

While I won't -1 this, I am -0 given the lack of an known conflict.

@zsxwing
Copy link
Member

zsxwing commented Aug 19, 2021

https://mvnrepository.com/artifact/com.github.ben-manes.caffeine/caffeine/usages says

Artifacts using Caffeine Cache (1,429)

https://mvnrepository.com/artifact/org.apache.spark/spark-sql/usages says

Artifacts using Spark Project SQL (1,436)

This suggests that Caffeine is as popular as Spark. I think the risk of conflict is pretty high.

@holdenk
Copy link
Contributor

holdenk commented Aug 19, 2021

This suggests that Caffeine is as popular as Spark. I think the risk of conflict is pretty high.

Then it should be easy to find a case that matters :)

@holdenk
Copy link
Contributor

holdenk commented Aug 19, 2021

Because, for comparison, we use antlr4 which has ~955 usages & bouncycastle with > 2.5k usages without problem.

@zsxwing
Copy link
Member

zsxwing commented Aug 19, 2021

Apache Solr is using Caffeine https://mvnrepository.com/artifact/org.apache.solr/solr-core/8.9.0 Sounds like a good example of potential conflict.

Anyway, we still cannot predict the Caffeine versions used by Spark applications directly since most of users won't publish their applications to maven central.

I'm definitely -1 for putting Caffeine on the Spark classpath directly. But +0 if other people think the extra work to shade the three libraries and the maintenance burden are the reasonable cost to pay.

@holdenk
Copy link
Contributor

holdenk commented Aug 19, 2021

Apache Solr is using Caffeine https://mvnrepository.com/artifact/org.apache.solr/solr-core/8.9.0 Sounds like a good example of potential conflict.

I think it's good to distinguish between the client, which doesn't depend on solr-core or caffeine https://mvnrepository.com/artifact/org.apache.solr/solr-solrj/8.9.0 , and the server process (which seems like a less likely source of conflicts because I'd expect most people to be using the client).

Anyway, we still cannot predict the Caffeine versions used by Spark applications directly since most of users won't publish their applications to maven central.

Agreed, predicting conflicts from maven central is probably not the best way, but since you brought up maven statistics in your discussion I figured I'd point out that we use other libraries with more public usage than caffeine already.

I'm definitely -1 for putting Caffeine on the Spark classpath directly. But +0 if other people think the extra work to shade the three libraries and the maintenance burden are the reasonable cost to pay.

I'm not super sure how -1 on already committed but not released code works ( https://spark.apache.org/committers.html doesn't really call that out), but like I've said I'm only -0 on this revert.

That said, there have been other reservations expressed by other committers you should check in with around consensus before merging.

@zsxwing
Copy link
Member

zsxwing commented Aug 19, 2021

but like I've said I'm only -0 on this revert.

Thanks for your clarifying. Will wait for other committers' feedback.

@rxin
Copy link
Contributor

rxin commented Aug 19, 2021

Caffeine is definitely fairly popular (At Databricks, I actually had a project that used it but eventually removed it due to a timing bug we found; the bug was ultimately fixed but we had to remove it due to project deadline; Ben knows what I'm talking about here).

I'm confused here. Did we see perf improvement with Caffeine in any real scenarios? I saw some microbenchmarks in the original JIRA ticket saying there was improvements (and with some commentary from Ben saying the benchmark was not done correctly), but the pull request here claiming no improvement.

@HeartSaVioR
Copy link
Contributor

HeartSaVioR commented Aug 20, 2021

Looks like performance benefit is something not everyone seems to be agreed with, then it cannot be the rationalization of introducing the new dependency.

In other perspective, I'd be happy to see the movement of gradually removing Guava usage (I expect that is the main reason other projects start to introduce Caffeine), "after" we build the consensus. If there's no strong agreement on the movement, nothing would happen at least for now, which is not wrong.

@HyukjinKwon
Copy link
Member

I am fine with reverting if somebody feels strongly on this.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with this revert - it might better be conservative here after rethinking. Let me get this in if that's fine to all here.

@HyukjinKwon
Copy link
Member

Merged to master.

But please let me know if you guys think differently. I am open to revert the revert.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants