[SPARK-17805][PYSPARK] Fix in sqlContext.read.text when pass in list of paths #15379

BryanCutler · 2016-10-06T18:38:58Z

What changes were proposed in this pull request?

If given a list of paths, pyspark.sql.readwriter.text will attempt to use an undefined variable paths. This change checks whether the param paths is a basestring and, if so, converts it to a list, so that the same variable paths can be used in both cases.
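The failure mode and the fix can be sketched in plain Python, without Spark. This is a minimal illustration, not the actual pyspark.sql.readwriter code: the function names `text_before` and `text_after` are hypothetical, `str` stands in for the Python 2 `basestring` check so the sketch runs on Python 3, and the pre-fix shape is assumed from the diff hunk quoted in the review (a name assigned only in the string branch).

```python
def text_before(paths):
    """Hypothetical pre-fix shape: `path` is only bound in the string branch."""
    if isinstance(paths, str):
        path = [paths]
    # When `paths` is a list, `path` was never assigned, so this line fails.
    return path


def text_after(paths):
    """Hypothetical post-fix shape: one name, `paths`, serves both cases."""
    if isinstance(paths, str):
        paths = [paths]
    return list(paths)


if __name__ == "__main__":
    print(text_after("a.txt"))
    print(text_after(["a.txt", "b.txt"]))
    try:
        text_before(["a.txt", "b.txt"])
    except UnboundLocalError as exc:
        print("pre-fix shape fails on a list:", exc)
```

Rebinding the parameter itself keeps a single name in scope, so the call that follows cannot reach a branch where the name is unbound.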

How was this patch tested?

Added unit test for reading list of files

SparkQA · 2016-10-06T19:16:08Z

Test build #66455 has finished for PR 15379 at commit ade9823.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler · 2016-10-06T19:17:45Z

ping @davies @yanboliang

HyukjinKwon · 2016-10-07T01:15:17Z

python/pyspark/sql/readwriter.py

-            path = [paths]
-        return self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(path)))
+            paths = [paths]
+        return self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(paths)))


This is super minor, but I think it would be nicer to match the variable name to path, if that makes sense. parquet takes non-keyword arguments, so paths seems right there, but the other readers seem to take a single argument, path.

HyukjinKwon · 2016-10-07T01:16:35Z

+1 for this PR and please allow me to cc @holdenk here.

holdenk

Thanks for working on this @BryanCutler, and thanks for pointing me to it @HyukjinKwon. Definitely a good thing to fix, and it does helpfully point out some of our API inconsistency (although I'm not 100% sure right now is the best time to fix the named parameter difference; if it isn't, we should make a follow-up task to clean it up the next time we are more OK with making breaking changes).

holdenk · 2016-10-07T02:42:00Z

So I agree keeping path here kind of makes sense.

It's unfortunate we didn't catch the named-parameter difference between these reader functions back during 2.0. At this point we need to be a bit careful about changing the named parameter from paths to path, in case people are using named params (if we did that, we would need to add a version-changed note). We could also have it transitionally take kwargs and work with either name for one version (while updating the pydoc, of course).
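The transitional kwargs idea could look roughly like the sketch below. It is only an illustration of the suggestion, not something this PR implements: the standalone function `text`, its error messages, and the choice to emit a `DeprecationWarning` are all assumptions, and `str` again stands in for `basestring`.

```python
import warnings


def text(path=None, **kwargs):
    """Hypothetical transitional signature: accept `path`, tolerate legacy `paths`."""
    if "paths" in kwargs:
        if path is not None:
            raise TypeError("pass either 'path' or 'paths', not both")
        # Assumed policy: keep the old keyword working for one release, with a warning.
        warnings.warn("'paths' is deprecated; use 'path'",
                      DeprecationWarning, stacklevel=2)
        path = kwargs.pop("paths")
    if kwargs:
        raise TypeError("unexpected keyword arguments: %s" % sorted(kwargs))
    if path is None:
        raise TypeError("missing required argument 'path'")
    # Same normalization as the fix: a single string becomes a one-element list.
    if isinstance(path, str):
        path = [path]
    return list(path)
```

Callers using `text(paths=[...])` would keep working for the transitional release while positional callers are unaffected, which is the compatibility concern raised above.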

rxin · 2016-10-07T07:27:11Z

Thanks - the argument renaming issue is largely orthogonal and I don't think we can break it now. I'm going to merge this in master/2.0.

…of paths

## What changes were proposed in this pull request?

If given a list of paths, `pyspark.sql.readwriter.text` will attempt to use an undefined variable `paths`. This change checks if the param `paths` is a basestring and then converts it to a list, so that the same variable `paths` can be used for both cases

## How was this patch tested?

Added unit test for reading list of files

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #15379 from BryanCutler/sql-readtext-paths-SPARK-17805.

(cherry picked from commit bcaa799)
Signed-off-by: Reynold Xin <rxin@databricks.com>

BryanCutler · 2016-10-07T16:51:43Z

Thanks @rxin, @HyukjinKwon and @holdenk for reviewing!

fix in pyspark sql read.text to accept list of paths

ade9823

HyukjinKwon reviewed Oct 7, 2016

holdenk reviewed Oct 7, 2016

asfgit closed this in bcaa799 Oct 7, 2016

BryanCutler deleted the sql-readtext-paths-SPARK-17805 branch December 2, 2016 01:01
