[SPARK-16380][EXAMPLES] Update SQL examples and programming guide for Python language binding #14317
Conversation
Test build #62724 has finished for PR 14317 at commit.

Test build #62725 has finished for PR 14317 at commit.
@JoshRosen Would you mind having a look at this? Thanks!
Merging in master/2.0.
[SPARK-16380][EXAMPLES] Update SQL examples and programming guide for Python language binding

This PR is based on PR #14098 authored by wangmiao1981.

## What changes were proposed in this pull request?

This PR replaces the original Python Spark SQL example file with the following three files:

- `sql/basic.py`: demonstrates basic Spark SQL features.
- `sql/datasource.py`: demonstrates various Spark SQL data sources.
- `sql/hive.py`: demonstrates Spark SQL Hive interaction.

This PR also removes hard-coded Python example snippets in the SQL programming guide by extracting snippets from the above files using the `include_example` Liquid template tag.

## How was this patch tested?

Manually tested.

Author: wm624@hotmail.com <wm624@hotmail.com>
Author: Cheng Lian <lian@databricks.com>

Closes #14317 from liancheng/py-examples-update.

(cherry picked from commit 53b2456)
Signed-off-by: Reynold Xin <rxin@databricks.com>
The entry point into all functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder`:

```diff
-{% include_example init_session python/sql.py %}
+{% include_example init_session python/sql/basic.py %}
```
The file name is not consistent with the Scala and Java versions, which are SparkSQLExample.scala and SparkSQLExample.java. The Hive and data source example file names are not consistent either.
For Scala and Java, the convention is that the file name matches the (major) class defined in the file, while a camel-case file name doesn't conform to Python code conventions. You may check other PySpark file names in the repo as a reference.
```python
# +-------+

# Select everybody, but increment the age by 1
df.select(df['name'], df['age'] + 1).show()
```
Do you want to use `col('...')`? I have tested it and it works.
Yeah, I know I brought up this issue, but it is still in question... Although `df['...']` has a potential issue with self-joins, it is the way pandas DataFrames work. Considering we've tried to work around various self-join corner cases within Catalyst, I now tend to preserve it as is. Maybe we'll deprecate this syntax later.
This PR is based on PR #14098 authored by @wangmiao1981.

## What changes were proposed in this pull request?

This PR replaces the original Python Spark SQL example file with the following three files:

- `sql/basic.py`: demonstrates basic Spark SQL features.
- `sql/datasource.py`: demonstrates various Spark SQL data sources.
- `sql/hive.py`: demonstrates Spark SQL Hive interaction.

This PR also removes hard-coded Python example snippets in the SQL programming guide by extracting snippets from the above files using the `include_example` Liquid template tag.

## How was this patch tested?

Manually tested.
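The `include_example` Liquid tag mentioned above extracts labeled snippets delimited by marker comments in the example files (Spark's convention is `$example on:<label>$` / `$example off:<label>$`). A rough, illustrative pure-Python sketch of that extraction step — the helper below is not Spark's actual implementation:

```python
import re

def extract_example(source, label):
    """Return the text between the on/off markers for `label` (illustrative only)."""
    pattern = re.compile(
        r"#\s*\$example on:{0}\$\n(.*?)#\s*\$example off:{0}\$".format(re.escape(label)),
        re.S,
    )
    match = pattern.search(source)
    return match.group(1) if match else ""

sample = """\
# $example on:init_session$
spark = SparkSession.builder.getOrCreate()
# $example off:init_session$
"""

print(extract_example(sample, "init_session"))
# → spark = SparkSession.builder.getOrCreate()
```

Labeling snippets this way keeps the programming guide and the runnable example files from drifting apart, which is the main motivation for this PR.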