Skip to content

Conversation

@ted-yu
Copy link

@ted-yu ted-yu commented Nov 20, 2015

See the thread Ben started:
http://search-hadoop.com/m/q3RTtveEuhjsr7g/

This PR adds drop() method to DataFrame which accepts multiple column names

@ted-yu ted-yu changed the title Drop multiple columns in the DataFrame API [SPARK-11884] Drop multiple columns in the DataFrame API Nov 20, 2015
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would need a varargs annotation and I don't think that we should duplicate the column resolution logic. Otherwise it might fall out of sync.

@ted-yu
Copy link
Author

ted-yu commented Nov 20, 2015

I am open to rewriting column resolution logic in the new method but may need some pointer since I am not familiar with this area of the codebase

@marmbrus
Copy link
Contributor

Why not just have the single column version delegate to this one instead of copying the code.

@SparkQA
Copy link

SparkQA commented Nov 20, 2015

Test build #46426 has finished for PR 9862 at commit f2ca6d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BenFradet
Copy link
Contributor

Yeah, I had this idea in mind.

@SparkQA
Copy link

SparkQA commented Nov 20, 2015

Test build #46439 has finished for PR 9862 at commit e1612ca.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * public class JavaLDAExample\n * class ChiSqSelectorModelWriter(instance: ChiSqSelectorModel) extends MLWriter\n * class PCA (override val uid: String) extends Estimator[PCAModel] with PCAParams\n * class VectorIndexerModelWriter(instance: VectorIndexerModel) extends MLWriter\n * final class Word2Vec(override val uid: String) extends Estimator[Word2VecModel] with Word2VecBase\n * class Word2VecModelWriter(instance: Word2VecModel) extends MLWriter\n

@ted-yu
Copy link
Author

ted-yu commented Nov 20, 2015

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Nov 20, 2015

Test build #46434 has finished for PR 9862 at commit 01686ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * public class JavaLDAExample\n

@BenFradet
Copy link
Contributor

Maybe define a unit test, just in case?

@ted-yu
Copy link
Author

ted-yu commented Nov 20, 2015

I looked at sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala but testData has only one column
Suggestion on where the test should be added is welcome

@BenFradet
Copy link
Contributor

I suggest having a look at SQLTestData.

@marmbrus
Copy link
Contributor

We are trying to delete that class. Just define a dataframe in the test.

val df = Seq((1,2,3)).toDF("a", "b", "c")

@ted-yu
Copy link
Author

ted-yu commented Nov 20, 2015

Thanks for the prompt hint, Michael.
New test coming shortly.

@ted-yu
Copy link
Author

ted-yu commented Nov 20, 2015

With

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
index dd6d065..fedb0df 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
@@ -378,6 +378,15 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
     assert(df.schema.map(_.name) === Seq("value"))
   }

+  test("drop columns using drop") {
+    val src = Seq((1,2,3)).toDF("a", "b", "c")
+    val df = src.drop("a", "b")
+    checkAnswer(
+      df,
+      src.collect().map(x => Row(x.getInt(1))).toSeq)
+    assert(df.schema.map(_.name) === Seq("c"))
+  }
+
   test("drop unknown column (no-op)") {
     val df = testData.drop("random")
     checkAnswer(

I got:

- drop columns using drop *** FAILED ***
  Results do not match for query:
  == Parsed Logical Plan ==
  'Project [unresolvedalias('c)]
   Project [b#134,c#135]
    Project [_1#130 AS a#133,_2#131 AS b#134,_3#132 AS c#135]
     LocalRelation [_1#130,_2#131,_3#132], [[1,2,3]]

  == Analyzed Logical Plan ==
  c: int
  Project [c#135]
   Project [b#134,c#135]
    Project [_1#130 AS a#133,_2#131 AS b#134,_3#132 AS c#135]
     LocalRelation [_1#130,_2#131,_3#132], [[1,2,3]]

  == Optimized Logical Plan ==
  LocalRelation [c#135], [[3]]

  == Physical Plan ==
  LocalTableScan [c#135], [[3]]
  == Results ==
  !== Correct Answer - 1 ==   == Spark Answer - 1 ==
  ![2]                        [3] (QueryTest.scala:126)

Some hint ?

@BenFradet
Copy link
Contributor

The answer is in the test output:

checkAnswer(df, src.collect().map(x => Row(x.getInt(2))).toSeq)

@ted-yu
Copy link
Author

ted-yu commented Nov 20, 2015

Thanks, Ben

@SparkQA
Copy link

SparkQA commented Nov 20, 2015

Test build #46442 has finished for PR 9862 at commit 64a959b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 20, 2015

Test build #46443 has finished for PR 9862 at commit 64a959b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * public class JavaLDAExample\n

@ted-yu
Copy link
Author

ted-yu commented Nov 20, 2015

 > git fetch --tags --progress https://github.com/apache/spark.git +refs/pull/9862/*:refs/remotes/origin/pr/9862/* # timeout=15
ERROR: Timeout after 15 minutes
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from https://github.com/apache/spark.git
    at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:763)

@ted-yu
Copy link
Author

ted-yu commented Nov 20, 2015

Jenkins, retest this please

@ted-yu
Copy link
Author

ted-yu commented Nov 21, 2015

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Nov 21, 2015

Test build #46459 has finished for PR 9862 at commit 6bbe12f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ted-yu
Copy link
Author

ted-yu commented Dec 4, 2015

Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Dec 4, 2015

Test build #47171 has finished for PR 9862 at commit 34ccee0.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ted-yu
Copy link
Author

ted-yu commented Dec 4, 2015

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Dec 4, 2015

Test build #47172 has finished for PR 9862 at commit 1e4555a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ted-yu
Copy link
Author

ted-yu commented Dec 4, 2015

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Dec 4, 2015

Test build #47173 has finished for PR 9862 at commit 86c68bc.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ted-yu
Copy link
Author

ted-yu commented Dec 4, 2015

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Dec 4, 2015

Test build #47182 has finished for PR 9862 at commit b18fc6e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why using contains instead of using sqlContext.analyzer.resolver?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about:

val resolver = sqlContext.analyzer.resolver
val remainingCols = schema.filter(f => colNames.forall(n => !resolver(f.name, n))).map(f => Column(f.name))

@SparkQA
Copy link

SparkQA commented Dec 4, 2015

Test build #47186 has finished for PR 9862 at commit 331b892.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 4, 2015

Test build #47188 has finished for PR 9862 at commit 04adf76.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ted-yu
Copy link
Author

ted-yu commented Dec 4, 2015

@cloud-fan @marmbrus
Kindly let me know what else needs to be done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about just checkAnswer(df, Row(3))

@SparkQA
Copy link

SparkQA commented Dec 5, 2015

Test build #47214 has finished for PR 9862 at commit a28313b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

LGTM

@ted-yu
Copy link
Author

ted-yu commented Dec 7, 2015

@marmbrus
What do you think ?

@marmbrus
Copy link
Contributor

marmbrus commented Dec 7, 2015

Thanks, merging to master.

@ted-yu
Copy link
Author

ted-yu commented Dec 7, 2015

Thanks for the reviews, Michael and Wenchen

@asfgit asfgit closed this in 84b8094 Dec 7, 2015
@sun-rui
Copy link
Contributor

sun-rui commented Dec 8, 2015

There are two drop variants for single column:

def drop(colName: String)
def drop(col: Column)

But there is only one drop accepting multiple column names, why there is no version accepting multiple Columns?

@tedyu
Copy link
Contributor

tedyu commented Dec 8, 2015

I can send out another PR if other people think that variant is needed.

This PR has been closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants