Skip to content

Conversation

@BryanCutler
Copy link
Member

What changes were proposed in this pull request?

If given a list of paths, pyspark.sql.readwriter.text will attempt to use an undefined variable paths. This change checks if the param paths is a basestring and then converts it to a list, so that the same variable paths can be used for both cases

How was this patch tested?

Added unit test for reading list of files

@SparkQA
Copy link

SparkQA commented Oct 6, 2016

Test build #66455 has finished for PR 15379 at commit ade9823.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Copy link
Member Author

ping @davies @yanboliang

path = [paths]
return self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(path)))
paths = [paths]
return self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(paths)))
Copy link
Member

@HyukjinKwon HyukjinKwon Oct 7, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a super minor but I think it'd be nicer to match up the variable name to path if this makes sense. For parquet, it takes non-keyword arguments so it seems paths but for others, it seems take a single argument, path.

Copy link
Contributor

@holdenk holdenk Oct 7, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I agree keeping path here kind of makes sense.

Its unfortunate we didn't catch the difference in the named parameter difference between these reader functions back during 2.0. At this point changing the named parameter from paths to path we need to be a bit careful with incase people are using named params (if we did that we would need to add a version changed note and be careful). We could also have it (transitionally) take a kwargs work with either for a version (while updating the pydoc of course).

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Oct 7, 2016

+1 for this PR and please allow me to cc @holdenk here.

Copy link
Contributor

@holdenk holdenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @BryanCutler and thanks for pointing it to me @HyukjinKwon. Definitetly a good thing to fix and it does helpfully point out some of our API inconsistency (although I'm not 100% sure if fright now is the best time to fix the named parameter difference - but if it isn't we should make a follow up task to clean it up the next time we are more ok with making breaking changes).

path = [paths]
return self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(path)))
paths = [paths]
return self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(paths)))
Copy link
Contributor

@holdenk holdenk Oct 7, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I agree keeping path here kind of makes sense.

Its unfortunate we didn't catch the difference in the named parameter difference between these reader functions back during 2.0. At this point changing the named parameter from paths to path we need to be a bit careful with incase people are using named params (if we did that we would need to add a version changed note and be careful). We could also have it (transitionally) take a kwargs work with either for a version (while updating the pydoc of course).

@rxin
Copy link
Contributor

rxin commented Oct 7, 2016

Thanks - the argument renaming issue is largely orthogonal and I don't think we can break it now. I'm going to merge this in master/2.0.

@asfgit asfgit closed this in bcaa799 Oct 7, 2016
asfgit pushed a commit that referenced this pull request Oct 7, 2016
…of paths

## What changes were proposed in this pull request?
If given a list of paths, `pyspark.sql.readwriter.text` will attempt to use an undefined variable `paths`.  This change checks if the param `paths` is a basestring and then converts it to a list, so that the same variable `paths` can be used for both cases

## How was this patch tested?
Added unit test for reading list of files

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #15379 from BryanCutler/sql-readtext-paths-SPARK-17805.

(cherry picked from commit bcaa799)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@BryanCutler
Copy link
Member Author

Thanks @rxin, @HyukjinKwon and @holdenk for reviewing!

@BryanCutler BryanCutler deleted the sql-readtext-paths-SPARK-17805 branch December 2, 2016 01:01
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…of paths

## What changes were proposed in this pull request?
If given a list of paths, `pyspark.sql.readwriter.text` will attempt to use an undefined variable `paths`.  This change checks if the param `paths` is a basestring and then converts it to a list, so that the same variable `paths` can be used for both cases

## How was this patch tested?
Added unit test for reading list of files

Author: Bryan Cutler <cutlerb@gmail.com>

Closes apache#15379 from BryanCutler/sql-readtext-paths-SPARK-17805.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants