[SPARK-11884] Drop multiple columns in the DataFrame API #9862

ted-yu · 2015-11-20T18:27:44Z

See the thread Ben started:
http://search-hadoop.com/m/q3RTtveEuhjsr7g/

This PR adds a drop() method to DataFrame that accepts multiple column names.

marmbrus · 2015-11-20T19:45:21Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala

This would need a varargs annotation, and I don't think we should duplicate the column resolution logic; otherwise the two copies might fall out of sync.

ted-yu · 2015-11-20T20:06:56Z

I am open to rewriting the column resolution logic in the new method, but I may need some pointers since I am not familiar with this area of the codebase.

marmbrus · 2015-11-20T20:25:15Z

Why not just have the single-column version delegate to this one instead of copying the code?
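
To make the suggestion above concrete, here is a minimal sketch of the delegation pattern in plain Scala. It does not use Spark: `Frame` is a hypothetical stand-in for DataFrame that holds only column names, and the name matching is simple string equality, which is an assumption of this sketch rather than what the PR does.

```scala
import scala.annotation.varargs

// Hypothetical stand-in for DataFrame: only an ordered list of column names.
final case class Frame(columns: Seq[String]) {

  // The multi-column version holds the only copy of the logic.
  // @varargs additionally emits a Java-friendly drop(String...) forwarder.
  @varargs
  def drop(colNames: String*): Frame = {
    val remaining = columns.filterNot(c => colNames.contains(c))
    // Dropping nothing is a no-op, so hand back the same instance.
    if (remaining.size == columns.size) this else Frame(remaining)
  }

  // The single-column version delegates instead of duplicating the logic.
  def drop(colName: String): Frame = drop(Seq(colName): _*)
}
```

With this shape, `Frame(Seq("a", "b", "c")).drop("a", "b")` keeps only column `c`, and a call with a single name resolves to the one-argument overload, which forwards to the varargs version.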

SparkQA · 2015-11-20T20:29:18Z

Test build #46426 has finished for PR 9862 at commit f2ca6d0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BenFradet · 2015-11-20T20:40:19Z

Yeah, I had this idea in mind.

SparkQA · 2015-11-20T21:09:30Z

Test build #46439 has finished for PR 9862 at commit e1612ca.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
 * public class JavaLDAExample
 * class ChiSqSelectorModelWriter(instance: ChiSqSelectorModel) extends MLWriter
 * class PCA (override val uid: String) extends Estimator[PCAModel] with PCAParams
 * class VectorIndexerModelWriter(instance: VectorIndexerModel) extends MLWriter
 * final class Word2Vec(override val uid: String) extends Estimator[Word2VecModel] with Word2VecBase
 * class Word2VecModelWriter(instance: Word2VecModel) extends MLWriter

ted-yu · 2015-11-20T21:24:29Z

Jenkins, test this please

SparkQA · 2015-11-20T22:08:25Z

Test build #46434 has finished for PR 9862 at commit 01686ad.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
 * public class JavaLDAExample

BenFradet · 2015-11-20T22:12:18Z

Maybe define a unit test, just in case?

ted-yu · 2015-11-20T22:34:49Z

I looked at sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala, but testData has only one column.
Suggestions on where the test should be added are welcome.

BenFradet · 2015-11-20T22:38:02Z

I suggest having a look at SQLTestData.

marmbrus · 2015-11-20T22:41:01Z

We are trying to delete that class. Just define a dataframe in the test.

val df = Seq((1,2,3)).toDF("a", "b", "c")

ted-yu · 2015-11-20T22:53:19Z

Thanks for the prompt hint, Michael.
New test coming shortly.

ted-yu · 2015-11-20T22:58:38Z

With

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
index dd6d065..fedb0df 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
@@ -378,6 +378,15 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
     assert(df.schema.map(_.name) === Seq("value"))
   }

+  test("drop columns using drop") {
+    val src = Seq((1,2,3)).toDF("a", "b", "c")
+    val df = src.drop("a", "b")
+    checkAnswer(
+      df,
+      src.collect().map(x => Row(x.getInt(1))).toSeq)
+    assert(df.schema.map(_.name) === Seq("c"))
+  }
+
   test("drop unknown column (no-op)") {
     val df = testData.drop("random")
     checkAnswer(

I got:

- drop columns using drop *** FAILED ***
  Results do not match for query:
  == Parsed Logical Plan ==
  'Project [unresolvedalias('c)]
   Project [b#134,c#135]
    Project [_1#130 AS a#133,_2#131 AS b#134,_3#132 AS c#135]
     LocalRelation [_1#130,_2#131,_3#132], [[1,2,3]]

  == Analyzed Logical Plan ==
  c: int
  Project [c#135]
   Project [b#134,c#135]
    Project [_1#130 AS a#133,_2#131 AS b#134,_3#132 AS c#135]
     LocalRelation [_1#130,_2#131,_3#132], [[1,2,3]]

  == Optimized Logical Plan ==
  LocalRelation [c#135], [[3]]

  == Physical Plan ==
  LocalTableScan [c#135], [[3]]
  == Results ==
  !== Correct Answer - 1 ==   == Spark Answer - 1 ==
  ![2]                        [3] (QueryTest.scala:126)

Any hints?

BenFradet · 2015-11-20T23:12:04Z

The answer is in the test output:

checkAnswer(df, src.collect().map(x => Row(x.getInt(2))).toSeq)

ted-yu · 2015-11-20T23:28:32Z

Thanks, Ben

SparkQA · 2015-11-20T23:35:13Z

Test build #46442 has finished for PR 9862 at commit 64a959b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-11-20T23:43:24Z

Test build #46443 has finished for PR 9862 at commit 64a959b.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
 * public class JavaLDAExample

ted-yu · 2015-11-20T23:53:01Z

 > git fetch --tags --progress https://github.com/apache/spark.git +refs/pull/9862/*:refs/remotes/origin/pr/9862/* # timeout=15
ERROR: Timeout after 15 minutes
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from https://github.com/apache/spark.git
    at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:763)

ted-yu · 2015-11-20T23:53:11Z

Jenkins, retest this please

ted-yu · 2015-11-21T00:00:36Z

Jenkins, test this please

SparkQA · 2015-11-21T00:14:22Z

Test build #46459 has finished for PR 9862 at commit 6bbe12f.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

ted-yu · 2015-12-04T00:56:54Z

Jenkins, test this please.

SparkQA · 2015-12-04T01:12:42Z

Test build #47171 has finished for PR 9862 at commit 34ccee0.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

ted-yu · 2015-12-04T01:40:16Z

Jenkins, test this please

SparkQA · 2015-12-04T01:48:45Z

Test build #47172 has finished for PR 9862 at commit 1e4555a.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

ted-yu · 2015-12-04T02:01:02Z

Jenkins, test this please

SparkQA · 2015-12-04T02:20:55Z

Test build #47173 has finished for PR 9862 at commit 86c68bc.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

ted-yu · 2015-12-04T02:58:16Z

Jenkins, test this please

SparkQA · 2015-12-04T03:37:16Z

Test build #47182 has finished for PR 9862 at commit b18fc6e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2015-12-04T03:42:28Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala

Why use contains instead of sqlContext.analyzer.resolver?

How about:

val resolver = sqlContext.analyzer.resolver
val remainingCols = schema.filter(f => colNames.forall(n => !resolver(f.name, n))).map(f => Column(f.name))
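
The difference between `contains` and a resolver matters when name matching is not plain string equality, for example when analysis is case-insensitive. The following is a sketch in plain Scala, not Spark code: `ResolverSketch`, its `Resolver` alias, and the two resolver values are hypothetical names that only assume a resolver is a function deciding whether two names refer to the same column.

```scala
object ResolverSketch {
  // Assumed shape: a resolver decides whether two names denote the same column.
  type Resolver = (String, String) => Boolean

  val caseSensitive: Resolver = (a, b) => a == b
  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // Keep every schema field that matches none of the names to drop.
  def remaining(schema: Seq[String], colNames: Seq[String], resolver: Resolver): Seq[String] =
    schema.filter(f => colNames.forall(n => !resolver(f, n)))
}
```

Under the case-insensitive resolver, dropping `b` from columns `a`, `B`, `c` removes `B`; a `contains` check behaves like the case-sensitive resolver and would leave all three columns in place.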

SparkQA · 2015-12-04T05:37:18Z

Test build #47186 has finished for PR 9862 at commit 331b892.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-12-04T07:37:56Z

Test build #47188 has finished for PR 9862 at commit 04adf76.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ted-yu · 2015-12-04T18:05:53Z

@cloud-fan @marmbrus
Kindly let me know what else needs to be done

cloud-fan · 2015-12-05T03:11:25Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

How about just checkAnswer(df, Row(3))?

SparkQA · 2015-12-05T05:27:45Z

Test build #47214 has finished for PR 9862 at commit a28313b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2015-12-05T05:56:26Z

LGTM

ted-yu · 2015-12-07T16:58:01Z

@marmbrus
What do you think?

marmbrus · 2015-12-07T22:47:44Z

Thanks, merging to master.

ted-yu · 2015-12-07T23:01:01Z

Thanks for the reviews, Michael and Wenchen

sun-rui · 2015-12-08T06:14:00Z

There are two drop variants for a single column:

def drop(colName: String)
def drop(col: Column)

But there is only one drop that accepts multiple column names. Why is there no version that accepts multiple Columns?

tedyu · 2015-12-08T06:19:23Z

I can send out another PR if other people think that variant is needed.

This PR has been closed.

Drop multiple columns in the DataFrame API

f2ca6d0

ted-yu changed the title ~~Drop multiple columns in the DataFrame API~~ [SPARK-11884] Drop multiple columns in the DataFrame API Nov 20, 2015

marmbrus reviewed Nov 20, 2015

Add varargs annotation

01686ad

tedyu added 2 commits November 20, 2015 12:56

have the single column version delegate to the new method

e1612ca

have the single column version delegate to the new method

2c23f90

Correct syntax for passing varargs

64a959b

Add test for dropping multiple columns

4541231

Formatting

6bbe12f

Address Michael's review comments

34ccee0

Address Scalastyle check warnings

1e4555a

tedyu added 2 commits December 3, 2015 17:59

Address Scalastyle check warnings

394632f

Address Scalastyle check warnings

86c68bc

Address compilation error

b18fc6e

cloud-fan reviewed Dec 4, 2015

Address Wenchen's comment

331b892

Add missing parameter for resolver

04adf76

cloud-fan reviewed Dec 5, 2015

Address Wenchen's comment

a28313b

asfgit closed this in 84b8094 Dec 7, 2015

aray mentioned this pull request Dec 10, 2015

[SPARK-12227][SQL] Support drop multiple columns specified by Column class in DataFrame API #10218

Closed
