-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-17805][PYSPARK] Fix in sqlContext.read.text when pass in list of paths #15379
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-17805][PYSPARK] Fix in sqlContext.read.text when pass in list of paths #15379
Conversation
|
Test build #66455 has finished for PR 15379 at commit
|
|
ping @davies @yanboliang |
| path = [paths] | ||
| return self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(path))) | ||
| paths = [paths] | ||
| return self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(paths))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a super minor but I think it'd be nicer to match up the variable name to path if this makes sense. For parquet, it takes non-keyword arguments so it seems paths but for others, it seems take a single argument, path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I agree keeping path here kind of makes sense.
Its unfortunate we didn't catch the difference in the named parameter difference between these reader functions back during 2.0. At this point changing the named parameter from paths to path we need to be a bit careful with incase people are using named params (if we did that we would need to add a version changed note and be careful). We could also have it (transitionally) take a kwargs work with either for a version (while updating the pydoc of course).
|
+1 for this PR and please allow me to cc @holdenk here. |
holdenk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @BryanCutler and thanks for pointing it to me @HyukjinKwon. Definitetly a good thing to fix and it does helpfully point out some of our API inconsistency (although I'm not 100% sure if fright now is the best time to fix the named parameter difference - but if it isn't we should make a follow up task to clean it up the next time we are more ok with making breaking changes).
| path = [paths] | ||
| return self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(path))) | ||
| paths = [paths] | ||
| return self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(paths))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I agree keeping path here kind of makes sense.
Its unfortunate we didn't catch the difference in the named parameter difference between these reader functions back during 2.0. At this point changing the named parameter from paths to path we need to be a bit careful with incase people are using named params (if we did that we would need to add a version changed note and be careful). We could also have it (transitionally) take a kwargs work with either for a version (while updating the pydoc of course).
|
Thanks - the argument renaming issue is largely orthogonal and I don't think we can break it now. I'm going to merge this in master/2.0. |
…of paths ## What changes were proposed in this pull request? If given a list of paths, `pyspark.sql.readwriter.text` will attempt to use an undefined variable `paths`. This change checks if the param `paths` is a basestring and then converts it to a list, so that the same variable `paths` can be used for both cases ## How was this patch tested? Added unit test for reading list of files Author: Bryan Cutler <cutlerb@gmail.com> Closes #15379 from BryanCutler/sql-readtext-paths-SPARK-17805. (cherry picked from commit bcaa799) Signed-off-by: Reynold Xin <rxin@databricks.com>
|
Thanks @rxin, @HyukjinKwon and @holdenk for reviewing! |
…of paths ## What changes were proposed in this pull request? If given a list of paths, `pyspark.sql.readwriter.text` will attempt to use an undefined variable `paths`. This change checks if the param `paths` is a basestring and then converts it to a list, so that the same variable `paths` can be used for both cases ## How was this patch tested? Added unit test for reading list of files Author: Bryan Cutler <cutlerb@gmail.com> Closes apache#15379 from BryanCutler/sql-readtext-paths-SPARK-17805.
What changes were proposed in this pull request?
If given a list of paths,
pyspark.sql.readwriter.textwill attempt to use an undefined variablepaths. This change checks if the parampathsis a basestring and then converts it to a list, so that the same variablepathscan be used for both casesHow was this patch tested?
Added unit test for reading list of files