[SPARK-43009][SQL] Parameterized sql() with Any constants #40623
Conversation
```scala
 * "DATE'2023-03-21'". The fragments of string values belonged to SQL comments are skipped
 * while parsing.
 * A map of parameter names to Java/Scala objects that can be converted to SQL literal
 * expressions. See <a href="https://spark.apache.org/docs/latest/sql-ref-datatypes.html">
```
Looking at the doc, I think we should update it to include the new Java datetime API like `LocalDate`, and put them at the beginning to promote them.
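For illustration, with the `Any`-typed `args` from this PR a `java.time.LocalDate` can be passed directly; a minimal sketch (the `events` table and `event_date` column are made up):

```scala
import java.time.LocalDate

// Instead of a pre-formatted SQL string such as "DATE'2023-03-21'",
// pass the java.time value and let lit() construct the literal.
spark.sql(
  "SELECT * FROM events WHERE event_date = :d",
  args = Map("d" -> LocalDate.of(2023, 3, 21)))
```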
Here is the PR #40644
```diff
-  // belonged to SQL comments are skipped while parsing.
-  map<string, string> args = 2;
+  // (Optional) A map of parameter names to literal expressions.
+  map<string, Expression.Literal> args = 2;
```
cc @grundprinzip are these protocol changes ok?
No, this is an incompatible change.
This would be valid:

```diff
-map<string, Expression.Literal> args = 2;
+map<string, string> args = 2;
+map<string, Expression.Literal> expr_args = 3;
```
so it might be:

```diff
-map<string, Expression.Literal> args = 2;
+oneof args {
+  map<string, string> args = 2;
+  map<string, Expression.Literal> expr_args = 3;
+}
```
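Note that proto3 does not allow `map` fields directly inside a `oneof`, so a variant along these lines would need the maps wrapped in messages; a hypothetical sketch (wrapper names invented here, and not wire-compatible with the existing bare `map` field 2):

```proto
syntax = "proto3";

// Hypothetical wrappers, since proto3 forbids map fields in a oneof.
// Expression.Literal is assumed to come from the Connect expressions proto.
message StringArgs {
  map<string, string> args = 1;
}
message ExprArgs {
  map<string, Expression.Literal> args = 1;
}

message SqlCommand {
  string sql = 1;
  oneof args {
    StringArgs string_args = 2;
    ExprArgs expr_args = 3;
  }
}
```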
Please run:

```
buf breaking --against "https://github.com/apache/spark/archive/master.zip#strip_components=1,subdir=connector/connect/common/src/main"
```
```diff
-  // belonged to SQL comments are skipped while parsing.
-  map<string, string> args = 2;
+  // (Optional) A map of parameter names to literal expressions.
+  map<string, Expression.Literal> args = 2;
```
Same
The Spark Connect changes are breaking.
@grundprinzip Spark Connect is not released yet, I think we can still change it? This PR should go to Spark 3.4. cc @xinrong-meng
If it's guaranteed to be included in 3.4, it's not a breaking change.
Thanks, merging to master/3.4!
It has conflicts with 3.4, @MaxGekk can you create a backport PR? Thanks!
@cloud-fan I am working on the backport ...
Here is the backport to branch-3.4: #40666
### What changes were proposed in this pull request?

In this PR, I propose to change the API of parameterized SQL and replace the type of argument values from `string` to `Any` in Scala/Java/Python and to `Expression.Literal` in the protobuf API. The language APIs can accept `Any` objects from which it is possible to construct literal expressions.

#### Scala/Java:
```scala
def sql(sqlText: String, args: Map[String, Any]): DataFrame
```
Values of the `args` map are wrapped by the `lit()` function, which leaves `Column` values as is and creates a literal from other Java/Scala objects (for more details see the `Scala` tab at https://spark.apache.org/docs/latest/sql-ref-datatypes.html).

#### Python:
```python
def sql(self, sqlQuery: str, args: Optional[Dict[str, Any]] = None, **kwargs: Any) -> DataFrame:
```
Similarly to Scala/Java `sql`, Python's `sql()` accepts Python objects as values of the `args` dictionary (see more details about acceptable Python objects at https://spark.apache.org/docs/latest/sql-ref-datatypes.html). `sql()` converts dictionary values to `Column` literal expressions by `lit()`.
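As a usage sketch under this API (the inline `VALUES` query and its names are made up):

```python
# Assumes an active SparkSession `spark`; plain Python values are
# converted to Column literals via lit() before substitution.
df = spark.sql(
    "SELECT * FROM VALUES (1, 'a'), (2, 'b') AS t(id, name) WHERE id = :id",
    args={"id": 2},
)
df.show()
```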
#### Protobuf:
```proto
message SqlCommand {
  // (Required) SQL Query.
  string sql = 1;

  // (Optional) A map of parameter names to literal expressions.
  map<string, Expression.Literal> args = 2;
}
```
For example:
```scala
scala> val sqlText = """SELECT s FROM VALUES ('Jeff /*__*/ Green'), ('E\'Twaun Moore') AS t(s) WHERE s = :player_name"""
sqlText: String = SELECT s FROM VALUES ('Jeff /*__*/ Green'), ('E\'Twaun Moore') AS t(s) WHERE s = :player_name

scala> sql(sqlText, args = Map("player_name" -> lit("E'Twaun Moore"))).show(false)
+-------------+
|s            |
+-------------+
|E'Twaun Moore|
+-------------+
```
### Why are the changes needed?

The current implementation of parameterized `sql()` requires arguments as string values that are parsed to SQL literal expressions, which causes the following issues:
1. SQL comments are skipped while parsing, so some fragments of the input might be dropped. For example, in `'Europe -- Amsterdam'` the `-- Amsterdam` part is excluded from the input.
2. Special characters in string values must be escaped, for instance `'E\'Twaun Moore'`.
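To make the contrast concrete, a hedged before/after sketch (the old string-based call shape is reconstructed from the issues above, not copied from the codebase):

```scala
// Before (Map[String, String] args): the value had to be a parsable
// SQL literal, so embedded quotes required escaping.
spark.sql("SELECT :name", args = Map("name" -> "'E\\'Twaun Moore'"))

// After (Map[String, Any] args): pass the raw value; no SQL escaping,
// and "--" or "/*" inside the value cannot be dropped as a comment.
spark.sql("SELECT :name", args = Map("name" -> "E'Twaun Moore"))
```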
### Does this PR introduce _any_ user-facing change?

No, since the parameterized SQL feature #38864 hasn't been released yet.
### How was this patch tested?

By running the affected tests:
```
$ build/sbt "test:testOnly *ParametersSuite"
$ python/run-tests --parallelism=1 --testnames 'pyspark.sql.tests.connect.test_connect_basic SparkConnectBasicTests.test_sql_with_args'
$ python/run-tests --parallelism=1 --testnames 'pyspark.sql.session SparkSession.sql'
```