
Conversation

@zhenlineo (Contributor) commented Jan 24, 2023

What changes were proposed in this pull request?

The Spark Connect Scala Client should provide the same API as the existing SQL API. This PR adds tests that use MiMa to ensure the generated binaries of the two modules are compatible.
The covered APIs are:

  • Dataset,
  • SparkSession with all implemented methods,
  • Column with all implemented methods,
  • DataFrame

Why are the changes needed?

Ensures the binary compatibility of the two APIs.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Integration tests.

Note: This PR needs to be merged into 3.4 too.

@AmplabJenkins

Can one of the admins verify this patch?

@zhenlineo zhenlineo force-pushed the cp-test branch 2 times, most recently from 2ed597b to 8a79ffd Compare January 24, 2023 18:57
@zhenlineo zhenlineo changed the title [TODO][Connect] Scala Client Mima Compatibility Tests [SPARK-42172][Connect] Scala Client Mima Compatibility Tests Jan 24, 2023
@zhenlineo zhenlineo marked this pull request as ready for review January 24, 2023 19:06
@HyukjinKwon HyukjinKwon changed the title [SPARK-42172][Connect] Scala Client Mima Compatibility Tests [SPARK-42172][CONNECT] Scala Client Mima Compatibility Tests Jan 25, 2023
Member

Let's probably file a JIRA

Contributor Author

Filed https://issues.apache.org/jira/browse/SPARK-42175. This was skipped because I did not want to include too much API implementation in the compatibility test PR.

Contributor
@LuciferYang, Jan 26, 2023

Let's add this JIRA ID to the TODO, like:
TODO(SPARK-42175): Add the Dataset object definition

Member

Seems like we're not using this Logging

Contributor Author
@zhenlineo, Jan 25, 2023

Logging is needed for binary compatibility: the class type must be exactly the same as in the SQL API.

Contributor

class Column(val expr: Expression) extends Logging {

Should we delete private[sql] here?

Contributor Author

My limited Scala knowledge suggests this only marks one constructor as private. The intention is to mark the current constructor private; more constructors will be added in follow-up PRs.

Contributor

Hmm... why is it not consistent with spark.sql.Column?

Contributor Author

Our type is proto.Expression; it is not the same as Expression. I will leave it to later PRs to decide how to support Expression.

Contributor

I mean, why not

class Column(val expr: proto.Expression) extends Logging { ...

Contributor Author

Because I am not certain whether we should expose the constructor this(expr: proto.Expression) and the val expr: proto.Expression.
They are not the same types as this(expr: Expression) and val expr: Expression.

Our proto.Expression is a gRPC-generated class, while Expression lives in the sql package. From the binary point of view they are different types.

Contributor

Let's keep it private[sql] for now.
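
For context, a minimal Scala sketch (not Spark's actual code) of what the private[sql] primary constructor discussed above means; the nested proto object is a hypothetical stand-in for the generated protobuf class:

package org.apache.spark.sql

// Hypothetical stand-in for the generated protobuf Expression class.
object proto { class Expression }

// private[sql] restricts only the primary constructor (and the expr val) to the sql package.
// A public auxiliary constructor, e.g. from a column name, can still be added in a
// follow-up PR without touching this declaration.
class Column private[sql] (private[sql] val expr: proto.Expression) {
  // def this(name: String) = this(/* build a proto.Expression from the name */)
}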

Member

Can we use SBT to check this instead of Maven? So far we have one place for MiMa, in SBT (see also project/MimaBuild.scala and dev/mima).

Contributor Author

The SBT MiMa check has some limitations when run as an SBT rule:
It works best for a stable API, e.g. current vs. previous release. It is not easy to configure for checking e.g. scala-client vs. sql while we are actively working on the scala-client API.
To be more specific, the problems I hit were:

  1. I could not configure the MiMa rule to find the current SQL SNAPSHOT jar.
  2. I could not use the ClassLoader correctly in the SBT rule to load all methods in the client API.

As a result, I ended up with this test, where we have more freedom to grow the API test coverage alongside the client API.

Member

Got it. Let's add a couple of comments here and there to make it clear; I am sure this is confusing to other developers.

Contributor

cc @dongjoon-hyun, also cc @pan3793. Do you have any suggestions for this?

Contributor Author
@zhenlineo, Jan 25, 2023

You can check out the MiMa SBT implementation I did here: zhenlineo#6
I marked the two problems in the PR code. Unless we can fix those two problems, I do not feel it is a better solution than this PR, which calls MiMa directly in a test.
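
For readers unfamiliar with this approach, here is a rough sketch of what calling MiMa directly from a test can look like. This is an illustration only, not the suite added by this PR: the jar paths are made up, and it assumes mima-core 1.1.0's API (MiMaLib(classpath), collectProblems(oldJar, newJar, excludeAnnotations), Problem.matchName).

import java.io.File
import com.typesafe.tools.mima.lib.MiMaLib

// Hypothetical jar locations; the real suite discovers them under each module's target dir.
val sqlJar = new File("sql/core/target/scala-2.12/spark-sql_2.12-3.5.0-SNAPSHOT.jar")
val clientJar = new File(
  "connector/connect/client/jvm/target/scala-2.12/" +
    "spark-connect-client-jvm_2.12-3.5.0-SNAPSHOT-assembly.jar")

// Collect binary-compatibility problems of the client jar against the sql jar.
val mima = new MiMaLib(Seq(clientJar, sqlJar))
val problems = mima.collectProblems(sqlJar, clientJar, List.empty)

// While the client API is still growing, only check the classes implemented so far.
val checked = Set(
  "org.apache.spark.sql.Dataset",
  "org.apache.spark.sql.SparkSession",
  "org.apache.spark.sql.Column")
val relevant = problems.filter(_.matchName.exists(n => checked.exists(c => n.startsWith(c))))
assert(relevant.isEmpty, relevant.mkString("\n"))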

@zhenlineo zhenlineo force-pushed the cp-test branch 2 times, most recently from be1954c to 8f30d76 Compare January 25, 2023 23:40
Contributor
@LuciferYang, Jan 26, 2023

Both Column and Column$ have private[sql] access scope with this PR, so this is not an API for users?

It seems users cannot create a Column in their own package with this PR, for example:

package org.apache.spark.test

import org.scalatest.funsuite.AnyFunSuite // scalastyle:ignore funsuite
import org.apache.spark.sql.Column

class MyTestSuite
  extends AnyFunSuite // scalastyle:ignore funsuite
{
  test("new column") {
    val a = Column("a") // Symbol apply is inaccessible from this place
    val b = new Column(null) // No constructor accessible from here
  }
}

Contributor

I think org.apache.spark.sql.Column#apply was a public API before. If private[sql] is added to object Column, it may require more refactoring work.

Contributor Author

Thanks for your input.

Looking at the current Column class, the SQL API gives two public APIs to construct a Column:

class Column(val expr: Expression) extends Logging {

  def this(name: String) = this(name match {
    case "*" => UnresolvedStar(None)
    case _ if name.endsWith(".*") =>
      val parts = UnresolvedAttribute.parseAttributeName(name.substring(0, name.length - 2))
      UnresolvedStar(Some(parts))
    case _ => UnresolvedAttribute.quotedString(name)
  })
...

Right now the client API is far from complete. We will add new methods in coming PRs, and I am sure there will be a Column(name: String) for users to use, but it is out of the scope of this PR to include all the public constructors the client needs.

The compatibility check added in this PR will grow its coverage as more methods are added to the client. The current check ensures that when a new method is added, it is binary compatible with the existing SQL API. When the client API coverage is high enough (~80%) we can switch to a more aggressive check to ensure we did not miss any methods by mistake.

Contributor

OK, I see what you mean. It seems this is just an intermediate state, so we don't need to consider end-user usage for now.

Contributor

The latest version is 1.1.1

Contributor Author

Yes, there is a bug in 1.1.1 where MiMa cannot check the class methods if the object is marked private. So I used the same version our SBT build uses, which is 1.1.0.

Contributor

Like the TODO above, we need to create a JIRA and add the corresponding JIRA ID to this TODO.

@LuciferYang
Contributor

@zhenlineo I ran a manual test as follows:

gh pr checkout 39712
build/sbt "connect-client-jvm/testOnly CompatibilitySuite"

and the test failed:

[info] CompatibilitySuite:
[info] - compatibility MiMa tests *** FAILED *** (51 milliseconds)
[info]   java.lang.AssertionError: assertion failed: Failed to find the jar inside folder: /${basedir}/spark-mine/connector/connect/client/jvm/target
[info]   at scala.Predef$.assert(Predef.scala:223)
[info]   at org.apache.spark.sql.connect.client.util.IntegrationTestUtils$.findJar(IntegrationTestUtils.scala:66)
[info]   at org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar$lzycompute(CompatibilitySuite.scala:57)
[info]   at org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar(CompatibilitySuite.scala:53)
[info]   at org.apache.spark.sql.connect.client.CompatibilitySuite.$anonfun$new$1(CompatibilitySuite.scala:69)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info]   at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
[info]   at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
[info]   at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
[info]   at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
[info]   at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[info]   at scala.collection.immutable.List.foreach(List.scala:431)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
[info]   at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
[info]   at org.scalatest.Suite.run(Suite.scala:1114)
[info]   at org.scalatest.Suite.run$(Suite.scala:1096)
[info]   at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
[info]   at org.scalatest.funsuite.AnyFunSuite.run(AnyFunSuite.scala:1564)
[info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321)
[info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517)
[info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413)
[info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info]   at java.lang.Thread.run(Thread.java:750)
[info] - compatibility API tests: Dataset *** FAILED *** (22 milliseconds)
[info]   java.lang.AssertionError: assertion failed: Failed to find the jar inside folder: /Users/yangjie01/SourceCode/git/spark-mine-sbt/connector/connect/client/jvm/target
[info]   at scala.Predef$.assert(Predef.scala:223)
[info]   at org.apache.spark.sql.connect.client.util.IntegrationTestUtils$.findJar(IntegrationTestUtils.scala:66)
[info]   at org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar$lzycompute(CompatibilitySuite.scala:57)
[info]   at org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar(CompatibilitySuite.scala:53)
[info]   at org.apache.spark.sql.connect.client.CompatibilitySuite.$anonfun$new$7(CompatibilitySuite.scala:103)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info]   at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
[info]   at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
[info]   at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
[info]   at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
[info]   at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[info]   at scala.collection.immutable.List.foreach(List.scala:431)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
[info]   at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
[info]   at org.scalatest.Suite.run(Suite.scala:1114)
[info]   at org.scalatest.Suite.run$(Suite.scala:1096)
[info]   at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
[info]   at org.scalatest.funsuite.AnyFunSuite.run(AnyFunSuite.scala:1564)
[info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321)
[info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517)
[info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413)
[info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info]   at java.lang.Thread.run(Thread.java:750)
[info] Run completed in 1 second, 234 milliseconds.
[info] Total number of tests run: 2
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 0, failed 2, canceled 0, ignored 0, pending 0
[info] *** 2 TESTS FAILED ***
[error] Failed tests:
[error] 	org.apache.spark.sql.connect.client.CompatibilitySuite

@zhenlineo
Contributor Author

@LuciferYang Thanks so much for looking into this PR. I added instructions for running this test at the top of the test class.
See the code here.

In short, we need to run sbt package first and then run the test. This is an integration test; it needs all the artifacts to be built first.

@LuciferYang
Contributor

LuciferYang commented Jan 26, 2023

The Maven test has some problems:

run

gh pr checkout 39712
mvn clean install -DskipTests -pl connector/connect/client/jvm -am
mvn clean test -pl connector/connect/client/jvm -Dtest=none -DwildcardSuites=org.apache.spark.sql.connect.client.CompatibilitySuite
CompatibilitySuite:
- compatibility MiMa tests *** FAILED ***
  java.lang.AssertionError: assertion failed: Failed to find the jar inside folder: /${basedir}/spark-source/connector/connect/client/jvm/target
  at scala.Predef$.assert(Predef.scala:223)
  at org.apache.spark.sql.connect.client.util.IntegrationTestUtils$.findJar(IntegrationTestUtils.scala:66)
  at org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar$lzycompute(CompatibilitySuite.scala:57)
  at org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar(CompatibilitySuite.scala:53)
  at org.apache.spark.sql.connect.client.CompatibilitySuite.$anonfun$new$1(CompatibilitySuite.scala:69)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  ...
- compatibility API tests: Dataset *** FAILED ***
  java.lang.AssertionError: assertion failed: Failed to find the jar inside folder: /${basedir}/spark-source/connector/connect/client/jvm/target
  at scala.Predef$.assert(Predef.scala:223)
  at org.apache.spark.sql.connect.client.util.IntegrationTestUtils$.findJar(IntegrationTestUtils.scala:66)
  at org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar$lzycompute(CompatibilitySuite.scala:57)
  at org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar(CompatibilitySuite.scala:53)
  at org.apache.spark.sql.connect.client.CompatibilitySuite.$anonfun$new$7(CompatibilitySuite.scala:103)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  ...
Run completed in 209 milliseconds.
Total number of tests run: 2
Suites: completed 2, aborted 0
Tests: succeeded 0, failed 2, canceled 0, ignored 0, pending 0
*** 2 TESTS FAILED ***

GA didn't fail because GA doesn't test the connect-client-jvm module? @HyukjinKwon Should we enable the GA tests for connect-client-jvm before further development?

@LuciferYang
Contributor

@zhenlineo So we need to package the sql module and connect-client-jvm first, then run the test?

@zhenlineo
Contributor Author

@LuciferYang Yes, as the test compares the binary jars. Let me know if there is a good way to enforce building the jars first.

@LuciferYang
Contributor

local test

build/sbt "sql/package"  
build/sbt "connect-client-jvm/package" 
build/sbt "connect-client-jvm/testOnly *CompatibilitySuite" 

The test still failed; did I execute the commands incorrectly? Can you give me the correct commands?

@zhenlineo
Contributor Author

"It worked on my computer." 😂

Did you get the same error? Do you have any jars under the folder/${basedir}/spark-mine/connector/connect/client/jvm/target? Either in SBT target folder or maven target folder? If so what's your jars name and path to it?
IntegrationTestUtils#findJar method is the logic used to search for the target jars.
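
For reference, a sketch of the kind of lookup such a helper performs; it is not the actual IntegrationTestUtils code, and the directory and jar-name prefix are assumptions based on this thread:

import java.io.File

// Recursively look for a jar whose name matches the expected prefix under a module's
// target directory (covers both the SBT scala-2.12 subdirectory and the Maven layout).
def findClientJar(targetDir: File, prefix: String): Option[File] = {
  def walk(f: File): Seq[File] =
    if (f.isDirectory) Option(f.listFiles()).toSeq.flatten.flatMap(walk) else Seq(f)
  walk(targetDir).find(f => f.getName.startsWith(prefix) && f.getName.endsWith(".jar"))
}

// e.g. findClientJar(new File("connector/connect/client/jvm/target"), "spark-connect-client-jvm-assembly")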

@LuciferYang
Contributor

Yes, the same error, and the jar exists:

connector/connect/client/jvm/target/scala-2.12/spark-connect-client-jvm_2.12-3.5.0-SNAPSHOT.jar


@zhenlineo
Contributor Author

zhenlineo commented Jan 26, 2023

Oh, you have to do a global sbt package, as we need the shaded jar. The shaded jar name has assembly in it. The reason for using the shaded jar is that it is the real jar we are going to ship to users.

@zhenlineo
Contributor Author

@LuciferYang I verified the paths for SBT and Maven again. Both are able to run the test successfully.
The quickest is the following:

sbt package
sbt "connect-client-jvm/testOnly *CompatibilitySuite" 

For Maven, we can build the two packages separately and then run the Maven test.

The reason your sbt command did not work is that build/sbt "connect-client-jvm/package" does not shade the client jar. When running a global sbt package, a new shaded jar named spark-connect-client-jvm-assembly is added beside the non-shaded jar.

I am happy to improve the test flow if you can advise.

@LuciferYang
Contributor

LuciferYang commented Jan 26, 2023

> local test
>
> build/sbt "sql/package"
> build/sbt "connect-client-jvm/package"
> build/sbt "connect-client-jvm/testOnly *CompatibilitySuite"
>
> The test still failed; did I execute the commands incorrectly? Can you give me the correct commands?

Thanks @zhenlineo

When I change build/sbt "connect-client-jvm/package" to build/sbt "connect-client-jvm/assembly", the test passes.

But I think this is not friendly to developers. For Maven users, the full build and test used to work with a single mvn clean install, but now they may need to build with mvn clean install -DskipTests first and then test with mvn test without clean; otherwise CompatibilitySuite will fail (I verified this manually).

I don't have any good suggestions right now; I need more time to think about it.

cc @srowen @dongjoon-hyun @JoshRosen FYI

@zhenlineo
Contributor Author

@LuciferYang I have another suite, ClientE2ESuite, that also requires a build. How about this: let's merge this integration test in, and I will investigate SPARK-42215 for a better developer experience. WDYT?

Note to all reviewers: all Scala client changes need to be merged into branch-3.4 too.

@LuciferYang
Contributor

Let's see what others think :)

@HyukjinKwon
Member

I am fine with doing it separately. @LuciferYang's suggestion makes a lot of sense to me.

Member
@HyukjinKwon (Member) left a comment

Approving, but let's make sure to address @LuciferYang's suggestions. I have the same concern that a different way of testing makes it hard for other developers to test and validate changes.

@LuciferYang
Contributor

If we are sure that there will be further work to solve the problem, I have no objection to merging it now, but I must stress again that this merge will make mvn clean install without -DskipTests fail (and I think there is no way to skip only the connector/connect/client/jvm module tests, though maybe I am missing something).

@zhenlineo
Contributor Author

Hi @LuciferYang, I added the ability to skip the client integration tests with Maven:

mvn test -pl connector/connect/client/jvm -DskipJvmClientITs=true

@zhenlineo
Contributor Author

@LuciferYang Let me know if there are any other blockers to merging this PR. Thanks a lot for the review.

Contributor

No need to add this profile; we can use -Dtest.exclude.tags=org.apache.spark.sql.connect.client.util.ClientIntegrationTest directly.
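
For reference, a minimal ScalaTest sketch of how such a tag can be attached and then excluded from a run; Spark's own convention (annotation tags in the shared tags module) may differ, so treat the names here as placeholders:

import org.scalatest.Tag
import org.scalatest.funsuite.AnyFunSuite

// Placeholder tag; the real one would live in the shared tags module.
object ClientIntegrationTest
  extends Tag("org.apache.spark.sql.connect.client.util.ClientIntegrationTest")

class ExampleSuite extends AnyFunSuite {
  // The tagged test is skipped when the tag is excluded, e.g. via ScalaTest's -l option,
  // which is what -Dtest.exclude.tags feeds in the Spark build.
  test("compatibility MiMa tests", ClientIntegrationTest) {
    // ... jar discovery and MiMa checks ...
  }
}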

Contributor

I did similar work a few days ago:
#39768

The test case CompatibilitySuite had not been added at that time, so @dongjoon-hyun suggested not adding this tag.

We need @HyukjinKwon or @dongjoon-hyun to double-check this.

Contributor
@LuciferYang, Feb 1, 2023

The test tag should be added to the tags module to keep the naming uniform (the current naming rule is ExtendedXXX), and please update the PR description to explain why this tag is added @zhenlineo

Contributor Author

@LuciferYang I've reverted the last commit that skipped the e2e tests, as this PR does not make the build worse (it does the same as the E2E suite). Let's fix the build issue in another PR; we might have better solutions.

f.getName.startsWith(sbtName) && f.getName.endsWith(".jar")) ||
// Maven Jar
(f.getParent.endsWith("target") &&
f.getName.startsWith(mvnName) && f.getName.endsWith(".jar"))
Contributor
@LuciferYang, Feb 1, 2023

I fixed a bad case in the Maven jar lookup in #39810.

Contributor Author

Thanks for the fix. Added back.

Contributor
@LuciferYang (Contributor) left a comment

LGTM (pending CI)

Contributor
@hvanhovell (Contributor) left a comment

LGTM

@hvanhovell hvanhovell closed this in 15971a0 Feb 2, 2023
hvanhovell pushed a commit that referenced this pull request Feb 2, 2023
### What changes were proposed in this pull request?

The Spark Connect Scala Client should provide the same API as the existing SQL API. This PR adds the tests to ensure the generated binaries of two modules are compatible using MiMa.
The covered APIs are:
* `Dataset`,
* `SparkSession` with all implemented methods,
* `Column` with all implemented methods,
* `DataFrame`

### Why are the changes needed?
Ensures the binary compatibility of the two APIs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Integration tests.

Note: This PR need to be merged into 3.4 too.

Closes #39712 from zhenlineo/cp-test.

Authored-by: Zhen Li <zhenlineo@users.noreply.github.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
(cherry picked from commit 15971a0)
Signed-off-by: Herman van Hovell <herman@databricks.com>
val clientClassLoader: URLClassLoader = new URLClassLoader(Seq(clientJar.toURI.toURL).toArray)
val sqlClassLoader: URLClassLoader = new URLClassLoader(Seq(sqlJar.toURI.toURL).toArray)

val clientClass = clientClassLoader.loadClass("org.apache.spark.sql.Dataset")
Contributor
@LuciferYang, Feb 27, 2023

Hi @zhenlineo @HyukjinKwon, there may be some problems with this test case. I added some logs as follows:
https://github.com/apache/spark/compare/master...LuciferYang:spark:CompatibilitySuite?expand=1


From the log, I found that both clientClass and sqlClass are loaded from file:/home/runner/work/spark/spark/connector/connect/client/jvm/target/scala-2.12/spark-connect-client-jvm_2.12-3.5.0-SNAPSHOT.jar, and the contents of newMethods and oldMethods are the same:

https://pipelines.actions.githubusercontent.com/serviceHosts/c184045e-b556-4e78-b8ef-fb37b2eda9a3/_apis/pipelines/1/runs/62963/signedlogcontent/14?urlExpires=2023-02-27T08%3A53%3A13.6716136Z&urlSigningMethod=HMACV1&urlSignature=XkRKqix4ZapzEeczn7ZhWpAFhSnwWW74UX%2BKUhocftc%3D

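One likely explanation (an assumption, not verified against the suite): URLClassLoader's single-argument constructor delegates to the application class loader first, and the test classpath already contains these classes, so both loads can resolve to the same jar. Passing a null parent forces loading from the given jar only, for example:

import java.net.{URL, URLClassLoader}

// Sketch: with a null parent, the loader cannot delegate to the test classpath,
// so a loaded class really comes from the jar we point it at.
def isolatedLoader(jarUrl: URL): URLClassLoader =
  new URLClassLoader(Array(jarUrl), null)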

Contributor

At present, using this way to check, at least 4 APIs should be reported as incompatible:

private[sql] def withResult[E](f: SparkResult => E): E
def collectResult(): SparkResult
private[sql] def analyze: proto.AnalyzePlanResponse
private[sql] val plan: proto.Plan

Because when using Java reflection, the above 4 methods are identified as public APIs, even though three of them are private[sql], and these four APIs do not exist in the Dataset of the sql module:

public java.lang.Object org.apache.spark.sql.Dataset.withResult(scala.Function1)$
public org.apache.spark.sql.connect.client.SparkResult org.apache.spark.sql.Dataset.collectResult()$
public org.apache.spark.connect.proto.AnalyzePlanResponse org.apache.spark.sql.Dataset.analyze()$
public org.apache.spark.connect.proto.Plan org.apache.spark.sql.Dataset.plan()$
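
This matches how Scala compiles qualifier-private members: private[sql] exists only in the Scala signature, while the JVM method stays public, so plain Java reflection cannot tell them apart. A small illustration (simplified, not the suite code):

package org.apache.spark.sql

class Demo {
  private[sql] def analyze: String = "plan" // public in the emitted bytecode
}

object ReflectionCheck {
  def main(args: Array[String]): Unit = {
    // getMethods returns public members, and the private[sql] one shows up among them.
    classOf[Demo].getMethods.filter(_.getName == "analyze").foreach(println)
  }
}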

Contributor

also cc @hvanhovell

Contributor Author

Thanks so much for looking into this. The Dataset test is not as important as the MiMa test. I will check whether we can fix the issue you found; otherwise it should be safe to delete it, as the check is already covered by MiMa.

Contributor
@LuciferYang, Feb 27, 2023

Thanks @zhenlineo. If it is already covered, we can delete it :)

Member

Thank you for the investigation, @LuciferYang.

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023