Skip to content

Conversation

@nkronenfeld
Copy link
Contributor

What changes were proposed in this pull request?

Support unit tests of external code (i.e., applications that use spark) using scalatest that don't want to use FunSuite. SharedSparkContext already supports this, but SharedSQLContext does not.

I've introduced SharedSparkSession as a parent to SharedSQLContext, written in a way that it does support all scalatest styles.

How was this patch tested?

There are three new unit test suites added that just test using FunSpec, FlatSpec, and WordSpec.

@nkronenfeld
Copy link
Contributor Author

I made my original changes here by using

git mv PlanTest.scala PlanTestBase.scala
git mv SQLTestUnit.scala SQLTestUnitBase.scala
git mv SharedSQLContext.scala SharedSparkSession.scala

and then created new PlanTest, SQLTestUnit, and SharedSQLContext files, under the assumption that most of the code would go in the base, and this would help git provide better history and continuity. I'm not sure if that was the right decision or not - probably it is with PlanTest and SQLTestUnit, possibly not with SharedSQLContext, but since the diff in the PR doesn't reflect the git mv properly, I'm not sure if it will make a difference either way.

If reviewers wish me to redo this PR without the initial git mv, I'll be happy to.

@nkronenfeld
Copy link
Contributor Author

nkronenfeld commented Oct 19, 2017

There is one small hack in the way this was done, which is documented - see the comments and documentation on SharedSparkSession.initializeSession and SharedSparkContext.initializeContext. I would rather just have the session and context be lazy transient val's, which would work fine without this initialize call, but I didn't want to change the way tests currently ran without input.

I'll test to see if that works for all internal unit tests (I haven't even checked), but would welcome other ideas for how to deal with this issue.

@nkronenfeld nkronenfeld changed the title Support alternative unit testing styles in external applications [SPARK-22308] Support alternative unit testing styles in external applications Oct 19, 2017
@SparkQA
Copy link

SparkQA commented Oct 19, 2017

Test build #82895 has finished for PR 19529 at commit 802a958.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nkronenfeld
Copy link
Contributor Author

nkronenfeld commented Oct 19, 2017

nope, using lazy val initialization won't work - at the very least, UnsafeKryoSerializerSuite modifies conf before context construction. Perhaps with transient... but I'm a bit more nervous about that, possibly having different contexts on different threads

@nkronenfeld
Copy link
Contributor Author

Documentation removed as per @srowen 's request in the associated JIRA issue [SPARK-22308]

@SparkQA
Copy link

SparkQA commented Oct 23, 2017

Test build #82969 has finished for PR 19529 at commit 4218b86.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}
}
trait PlanTest extends SparkFunSuite with PlanTestBase {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, it looks like the content of this file was copied to a superclass, and now this is trait that extends it without adding anything. Is the point that it "extends SparkFunSuite"?

I don't know if this is possible in git, but renaming the class to PlanTestBase and then re-creating the current trait might have led to a diff where it's possible to see what did and didn't change in the move. Was there any substantive change? that's what I'm having trouble evaluating.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually did start with 'git mv PlanTest.scala PlanTestBase.scala' - sadly, the diffs don't catch that. :-(
You understand the intent exactly - to pull the FunSuite part out of PlanTestBase and include it in PlanTest.

To confirm that nothing else changed, I ran:

git checkout 4a779bdac3e75c17b7d36c5a009ba6c948fa9fb6 PlanTest.scala
diff PlanTest.scala PlanTestBase.scala

and got the following:

20c20,21
< import org.apache.spark.SparkFunSuite
---
> import org.scalatest.Suite
> 
30c31,32
<  * Provides helper methods for comparing plans.
---
>  * Provides helper methods for comparing plans, but without the overhead of
>  * mandating a FunSuite.
32c34
< trait PlanTest extends SparkFunSuite with PredicateHelper {
---
> trait PlanTestBase extends PredicateHelper { self: Suite =>

* The purpose of this suite is to make sure that generic FlatSpec-based scala
* tests work with a shared spark session
*/
class GenericFlatSpecSuite extends FlatSpec with SharedSparkSession {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are testing tests? I get it, but I don't know if it's necessary. If FlatSpec didn't work, scalatest would hopefully catch it. If initializeSession / SharedSparkSession didn't, presumably tons of tests wouldn't work.

Copy link
Contributor Author

@nkronenfeld nkronenfeld Oct 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I wasn't totally sure about that either. I started off with the tests in another branch I wasn't going to include in the PR, just to make sure I'd done everything correctly.

I eventually came to the conclusion that they are useful. I think of it more in terms of preventing regressions; they should prevent a FunSuite dependecy creeping back in to SharedSparkSession.

For similar reasons, I should probably put similar tests in for SharedSparkContext.


protected def sparkContext = spark.sparkContext

private[sql] trait SQLTestUtils extends SparkFunSuite with SQLTestUtilsBase with PlanTest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same general comment as PlanTestBase vs PlanTest above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has more changes - I left the data initialization stuff in SQLTestUtils, as it wouldn't be needed by external tests. All the <'s below are things that I pulled out of SQLTestUtilsBase and put back into SQLTestUtils

Running similarly:

git checkout 4a779bdac3e75c17b7d36c5a009ba6c948fa9fb6 SQLTestUtils.scala
diff SQLTestUtils.scala SQLTestUtilsBase.scala

I get

27d26
< import scala.util.control.NonFatal
30c29
< import org.scalatest.BeforeAndAfterAll
---
> import org.scalatest.{BeforeAndAfterAll, Suite}
33d31
< import org.apache.spark.SparkFunSuite
38c36
< import org.apache.spark.sql.catalyst.plans.PlanTest
---
> import org.apache.spark.sql.catalyst.plans.PlanTestBase
40d37
< import org.apache.spark.sql.catalyst.util._
43c40
< import org.apache.spark.util.{UninterruptibleThread, Utils}
---
> import org.apache.spark.util.Utils
46c43
<  * Helper trait that should be extended by all SQL test suites.
---
>  * Helper trait that can be extended by all external SQL test suites.
48,49c45
<  * This allows subclasses to plugin a custom `SQLContext`. It comes with test data
<  * prepared in advance as well as all implicit conversions used extensively by dataframes.
---
>  * This allows subclasses to plugin a custom `SQLContext`.
55,56c51,52
< private[sql] trait SQLTestUtils
<   extends SparkFunSuite with Eventually
---
> private[sql] trait SQLTestUtilsBase
>   extends Eventually
59c55
<   with PlanTest { self =>
---
>   with PlanTestBase { self: Suite =>
63,65d58
<   // Whether to materialize all test data before the first test is run
<   private var loadTestDataBeforeTests = false
< 
80,94d72
<   /**
<    * Materialize the test data immediately after the `SQLContext` is set up.
<    * This is necessary if the data is accessed by name but not through direct reference.
<    */
<   protected def setupTestData(): Unit = {
<     loadTestDataBeforeTests = true
<   }
< 
<   protected override def beforeAll(): Unit = {
<     super.beforeAll()
<     if (loadTestDataBeforeTests) {
<       loadTestData()
<     }
<   }
< 
300,354d277
<   /**
<    * Disable stdout and stderr when running the test. To not output the logs to the console,
<    * ConsoleAppender's `follow` should be set to `true` so that it will honors reassignments of
<    * System.out or System.err. Otherwise, ConsoleAppender will still output to the console even if
<    * we change System.out and System.err.
<    */
<   protected def testQuietly(name: String)(f: => Unit): Unit = {
<     test(name) {
<       quietly {
<         f
<       }
<     }
<   }
< 
<   /**
<    * Run a test on a separate `UninterruptibleThread`.
<    */
<   protected def testWithUninterruptibleThread(name: String, quietly: Boolean = false)
<     (body: => Unit): Unit = {
<     val timeoutMillis = 10000
<     @transient var ex: Throwable = null
< 
<     def runOnThread(): Unit = {
<       val thread = new UninterruptibleThread(s"Testing thread for test $name") {
<         override def run(): Unit = {
<           try {
<             body
<           } catch {
<             case NonFatal(e) =>
<               ex = e
<           }
<         }
<       }
<       thread.setDaemon(true)
<       thread.start()
<       thread.join(timeoutMillis)
<       if (thread.isAlive) {
<         thread.interrupt()
<         // If this interrupt does not work, then this thread is most likely running something that
<         // is not interruptible. There is not much point to wait for the thread to termniate, and
<         // we rather let the JVM terminate the thread on exit.
<         fail(
<           s"Test '$name' running on o.a.s.util.UninterruptibleThread timed out after" +
<             s" $timeoutMillis ms")
<       } else if (ex != null) {
<         throw ex
<       }
<     }
< 
<     if (quietly) {
<       testQuietly(name) { runOnThread() }
<     } else {
<       test(name) { runOnThread() }
<     }
<   }
365,407d287
<   }
< }
< 
< private[sql] object SQLTestUtils {
< 
<   def compareAnswers(
<       sparkAnswer: Seq[Row],
<       expectedAnswer: Seq[Row],
<       sort: Boolean): Option[String] = {
<     def prepareAnswer(answer: Seq[Row]): Seq[Row] = {
<       // Converts data to types that we can do equality comparison using Scala collections.
<       // For BigDecimal type, the Scala type has a better definition of equality test (similar to
<       // Java's java.math.BigDecimal.compareTo).
<       // For binary arrays, we convert it to Seq to avoid of calling java.util.Arrays.equals for
<       // equality test.
<       // This function is copied from Catalyst's QueryTest
<       val converted: Seq[Row] = answer.map { s =>
<         Row.fromSeq(s.toSeq.map {
<           case d: java.math.BigDecimal => BigDecimal(d)
<           case b: Array[Byte] => b.toSeq
<           case o => o
<         })
<       }
<       if (sort) {
<         converted.sortBy(_.toString())
<       } else {
<         converted
<       }
<     }
<     if (prepareAnswer(expectedAnswer) != prepareAnswer(sparkAnswer)) {
<       val errorMessage =
<         s"""
<            | == Results ==
<            | ${sideBySide(
<           s"== Expected Answer - ${expectedAnswer.size} ==" +:
<             prepareAnswer(expectedAnswer).map(_.toString()),
<           s"== Actual Answer - ${sparkAnswer.size} ==" +:
<             prepareAnswer(sparkAnswer).map(_.toString())).mkString("\n")}
<       """.stripMargin
<       Some(errorMessage)
<     } else {
<       None
<     }

/**
* Helper trait for SQL test suites where all tests share a single [[TestSparkSession]].
*/
trait SharedSparkSession
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this actually specific to SQL?
And is it really for symmetry with SharedSparkContext?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you're asking.

It's specific to the Spark SQL package - in that SparkSession resides there - but not specific to using the language SQL. When I say, 'SQL test suites', I'm referring to the former.

@gatorsmile
Copy link
Member

gatorsmile commented Oct 23, 2017

@nkronenfeld Generally, it makes sense to me. Since the code changes are pretty large here, it is not very straightforward for us to review it. Do you mind if I taking over some of code movement and then you can continue the work on it? Or could you split the PR to multiple smaller ones?

@nkronenfeld
Copy link
Contributor Author

nkronenfeld commented Oct 23, 2017

@gatorsmile the code changes aren't huge - there's almost no new code here, it's all just moving code around from one file to another in order to expose a SharedSparkSession with no dependence on FunSuite.

I don't think that could be broken up into multiple PRs - half-done, it won't compile.

As for taking over the code movement, I'm not sure what you mean; please explain further? If you mean you create a branch, do the git mv, check it in, and I then proceed to do the rest of the changes on top of that branch - sure, if that makes it easier to review, that would be fine.

Or I can redo it myself, checking in each file move as a separate commit, to make it easier to review that way, if that would be easier for you.

@gatorsmile
Copy link
Member

I found a very simple way to reduce the line of changes.

Could you put the PlanTest and PlanTestBase in the same file? We can refactor it later, if necessary. For example, in PlanTest.scala

/**
 * Provides helper methods for comparing plans.
 */
trait PlanTest extends SparkFunSuite with PlanTestBase

/**
 * Provides helper methods for comparing plans, but without the overhead of
 * mandating a FunSuite.
 */
trait PlanTestBase extends PredicateHelper { self: Suite =>

You also can do the same thing for the other files.

@nkronenfeld
Copy link
Contributor Author

@gatorsmile sounds good, giving that a try now... assuming tests pass, I'll check it in and see if it's any better.

I've so far done this for PlanTest and SQLTestUtils
PlanTest I suspect it will make much cleaner.
In SQLTestUtils I suspect it won't help as much, as it was more a pick-and-choose (this function goes in base, this doesn't)

I haven't done it at all for SharedSQLContext/SharedSparkSession... that one seems more deserving of a first-level place to me, so I'm more hesitant to, but if you want, I'll do that one too.

I suspect the correct answer is going to be redoing the PR, with careful commits that are clearer about what each does... but I'll try this first anyway :-)

… where they came from, in an attempt to make diffs simpler
@nkronenfeld
Copy link
Contributor Author

Yeah, as predicted, that made PlanTest very easy to review, but didn't do as well with SQLTestUtils. I suspect I reordered functions and what-not when I was moving stuff around.

If this is still too confusing to deal with, just let me know. Even if I can't make the end diffs of the entire PR non-trivial, I could certainly re-implement this in a way that the individual commits would each be trivial; then it'd just be a question of following along commit-by-commit, and shouldn't be too bad.

@SparkQA
Copy link

SparkQA commented Oct 24, 2017

Test build #83006 has finished for PR 19529 at commit 2d927e9.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait PlanTestBase extends PredicateHelper

@nkronenfeld
Copy link
Contributor Author

I assume the "unknown error code, -9" is:

[error] running /home/jenkins/workspace/SparkPullRequestBuilder/build/sbt -Phadoop-2.6 -Pflume -Phive-thriftserver -Pyarn -Pkafka-0-8 -Phive -Pkinesis-asl -Pmesos -Dtest.exclude.tags=org.apache.spark.tags.ExtendedHiveTest,org.apache.spark.tags.ExtendedYarnTest test ; process was terminated by signal 9

It started popping up between build 82895 and 82969 - between which all I did was eliminate the documentation on unit testing. I wouldn't think this should affect the build.

Signal 9 is a kill signal - which means something external killed the build, I think. Any idea why this is happening for the last several builds?

@srowen
Copy link
Member

srowen commented Oct 24, 2017

The Jenkins jobs get killed not-infrequently for various reasons. You can ignore that and retest.

var conf = new SparkConf(false)

/**
* Initialize the [[SparkContext]]. Generally, this is just called from
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The max length of each line in the comment should also be 100, seems you are limiting that to 80?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed here, not elsewhere yet

}
}

private[sql] object SQLTestUtils {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move this after SQLTestUtilsBase ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

* Initialize the [[TestSparkSession]].
*/
trait SharedSQLContext extends SQLTestUtils with SharedSparkSession {
protected override def beforeAll(): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we still need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still used throughout the unit tests.
I had noticed another issue related to that ([SPARK-15037]), but that is marked resolved without having gotten rid of this.
Note that SharedSQLContext is to SharedSessionContext as PlanTest is to PlanTestBase - it extends FunSuite.

@SparkQA
Copy link

SparkQA commented Oct 24, 2017

Test build #3960 has finished for PR 19529 at commit 2d927e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait PlanTestBase extends PredicateHelper

@SparkQA
Copy link

SparkQA commented Oct 25, 2017

Test build #83046 has finished for PR 19529 at commit 24fc4a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@gatorsmile gatorsmile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except three comments.

*/
trait PlanTest extends SparkFunSuite with PredicateHelper {
trait PlanTest extends SparkFunSuite with PlanTestBase {
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the useless { and }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

_spark = createSparkSession
}
// Ensure we have initialized the context before calling parent code
super.beforeAll()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this beforeAll is just calling super.beforeAll, why do we still need to override it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't. Removed.

/**
* Initialize the [[TestSparkSession]].
*/
trait SharedSQLContext extends SQLTestUtils with SharedSparkSession {
Copy link
Member

@gatorsmile gatorsmile Oct 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work? Basically, we are just creating a trait.

trait SharedSQLContext extends SQLTestUtils with SharedSparkSession

Could we just move this line to SharedSparkSession.scala?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could... that would more fit the pattern of what we've done now for PlanTest/PlanTestBase and SQLTestUtils/SQLTestUtilsBase.

I hesitated in this case just because the two are such conceptually different concepts, and the idea is that both would actually get used - SharedSQLContext in internal tests, SharedSparkSession in external tests.

@gatorsmile
Copy link
Member

LGTM pending Jenkins

@SparkQA
Copy link

SparkQA commented Oct 26, 2017

Test build #83074 has finished for PR 19529 at commit 6c0b0d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

Thanks! Merged to master.

@asfgit asfgit closed this in 592cfea Oct 26, 2017
@srowen
Copy link
Member

srowen commented Oct 29, 2017

@nkronenfeld @gatorsmile I think this has been failing the master build (Maven only) for a few days:

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.6/3981/consoleFull

SQLQuerySuite:
*** RUN ABORTED ***
  org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:
org.apache.spark.sql.test.GenericFunSpecSuite.<init>(GenericFunSpecSuite.scala:28)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:422)
java.lang.Class.newInstance(Class.java:442)
org.scalatest.tools.DiscoverySuite$.getSuiteInstance(DiscoverySuite.scala:66)
org.scalatest.tools.DiscoverySuite$$anonfun$1.apply(DiscoverySuite.scala:38)
org.scalatest.tools.DiscoverySuite$$anonfun$1.apply(DiscoverySuite.scala:37)
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
scala.collection.Iterator$class.foreach(Iterator.scala:893)
scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
scala.collection.AbstractIterable.foreach(Iterable.scala:54)
scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
scala.collection.AbstractTraversable.map(Traversable.scala:104)
org.scalatest.tools.DiscoverySuite.<init>(DiscoverySuite.scala:37)
org.scalatest.tools.Runner$.genDiscoSuites$1(Runner.scala:1165)
org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1250)

I suspect that there's some slightly different way that the tests execute in Maven that may highlight a problem with how the SQL tests use and reuse SparkContexts. It's likely this change did cause it.

@gatorsmile
Copy link
Member

I just reverted this PR. @nkronenfeld Could you submit another PR and update the title to
[SPARK-22308][test-maven] Support alternative unit testing styles in external applications? Thanks!

@nkronenfeld
Copy link
Contributor Author

hm... I was always testing with sbt, because maven was so slow to do anything.

Will do

@srowen
Copy link
Member

srowen commented Oct 30, 2017

Maven is the build of reference... YMMV but I also have a bunch of trouble with SBT whereas not with Maven. I'd honestly prefer we not support an SBT build explicitly for reasons like this, though make sure it's not hard to let it work if people want to. at least the PR builder should be Maven.

asfgit pushed a commit that referenced this pull request Nov 10, 2017
…external applications

Continuation of PR#19528 (#19529 (comment))

The problem with the maven build in the previous PR was the new tests.... the creation of a spark session outside the tests meant there was more than one spark session around at a time.
I was using the spark session outside the tests so that the tests could share data; I've changed it so that each test creates the data anew.

Author: Nathan Kronenfeld <nicole.oresme@gmail.com>
Author: Nathan Kronenfeld <nkronenfeld@uncharted.software>

Closes #19705 from nkronenfeld/alternative-style-tests-2.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants