
[SPARK-43009][SQL] Parameterized sql() with Any constants #40623

Closed
wants to merge 15 commits into from

Conversation

MaxGekk
Member

@MaxGekk MaxGekk commented Mar 31, 2023

What changes were proposed in this pull request?

In this PR, I propose to change the API of parameterized SQL, replacing the type of argument values from `string` to `Any` in Scala/Java/Python and to `Expression.Literal` in the protobuf API. The language APIs can accept `Any` objects from which literal expressions can be constructed.

Scala/Java:

```scala
def sql(sqlText: String, args: Map[String, Any]): DataFrame
```

Values of the `args` map are wrapped by the `lit()` function, which leaves a `Column` as is and creates a literal from any other Java/Scala object (for details, see the `Scala` tab at https://spark.apache.org/docs/latest/sql-ref-datatypes.html).

Python:

```python
def sql(self, sqlQuery: str, args: Optional[Dict[str, Any]] = None, **kwargs: Any) -> DataFrame:
```

Similar to the Scala/Java `sql`, Python's `sql()` accepts Python objects as values of the `args` dictionary (see https://spark.apache.org/docs/latest/sql-ref-datatypes.html for the acceptable Python objects). `sql()` converts dictionary values to `Column` literal expressions via `lit()`.

Protobuf:

```proto
message SqlCommand {
  // (Required) SQL Query.
  string sql = 1;

  // (Optional) A map of parameter names to literal expressions.
  map<string, Expression.Literal> args = 2;
}
```

For example:

```scala
scala> val sqlText = """SELECT s FROM VALUES ('Jeff /*__*/ Green'), ('E\'Twaun Moore') AS t(s) WHERE s = :player_name"""
sqlText: String = SELECT s FROM VALUES ('Jeff /*__*/ Green'), ('E\'Twaun Moore') AS t(s) WHERE s = :player_name

scala> sql(sqlText, args = Map("player_name" -> lit("E'Twaun Moore"))).show(false)
+-------------+
|s            |
+-------------+
|E'Twaun Moore|
+-------------+
```
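As an aside (not part of this PR), the `:player_name` named-parameter style shown above is analogous to DB-API parameter binding, where the bound value likewise never enters the SQL text. A minimal sketch using Python's stdlib `sqlite3` with an in-memory database, mirroring the query above:

```python
import sqlite3

# In-memory database; the inline rows mirror the VALUES clause from the
# Scala example above.
con = sqlite3.connect(":memory:")
cur = con.execute(
    "SELECT s FROM (SELECT 'Jeff /*__*/ Green' AS s"
    " UNION ALL SELECT 'E''Twaun Moore')"
    " WHERE s = :player_name",
    {"player_name": "E'Twaun Moore"},  # bound as a value, no escaping needed
)
print(cur.fetchall())  # [("E'Twaun Moore",)]
```

Because the apostrophe travels as data rather than as SQL text, no manual escaping of `E'Twaun Moore` is required, which is the same property the `Any`-typed `args` give Spark's `sql()`.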

Why are the changes needed?

The current implementation of the parameterized `sql()` requires arguments as string values that are parsed into SQL literal expressions, which causes the following issues:

  1. SQL comments are stripped while parsing, so fragments of input values can be lost. For example, in `'Europe -- Amsterdam'`, the `-- Amsterdam` part is excluded from the input.
  2. Special characters in string values must be escaped, for instance `'E\'Twaun Moore'`.
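To make the first hazard concrete without Spark, here is a small self-contained sketch (plain Python; `strip_line_comments` is a hypothetical stand-in for a parser's comment handling) contrasting string-spliced arguments with values kept outside the SQL text:

```python
import re

def strip_line_comments(sql: str) -> str:
    # Simplified comment stripper, standing in for a SQL parser that
    # discards `-- ...` line comments.
    return re.sub(r"--.*$", "", sql, flags=re.MULTILINE)

# Naive approach: the value is spliced into the SQL text before parsing,
# so the comment stripper eats part of the *data*.
spliced = "SELECT * FROM t WHERE city = 'Europe -- Amsterdam'"
print(strip_line_comments(spliced))  # value truncated to 'Europe

# Typed binding: the value never enters the SQL text, so nothing is lost.
query = "SELECT * FROM t WHERE city = :city"
args = {"city": "Europe -- Amsterdam"}
print(strip_line_comments(query), args)  # query and value both intact
```

With typed `args`, the parser only ever sees the placeholder `:city`, so comment handling and quoting rules cannot corrupt the bound value.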

Does this PR introduce any user-facing change?

No, since the parameterized SQL feature #38864 has not been released yet.

How was this patch tested?

By running the affected tests:

```
$ build/sbt "test:testOnly *ParametersSuite"
$ python/run-tests --parallelism=1 --testnames 'pyspark.sql.tests.connect.test_connect_basic SparkConnectBasicTests.test_sql_with_args'
$ python/run-tests --parallelism=1 --testnames 'pyspark.sql.session SparkSession.sql'
```

@MaxGekk MaxGekk changed the title [WIP][SQL] Parameterized sql() with literal args [WIP][SQL] Parameterized sql() with constants Apr 3, 2023
@MaxGekk MaxGekk changed the title [WIP][SQL] Parameterized sql() with constants [SPARK-43009][SQL] Parameterized sql() with constants Apr 3, 2023
@MaxGekk MaxGekk marked this pull request as ready for review April 3, 2023 08:54
@MaxGekk MaxGekk changed the title [SPARK-43009][SQL] Parameterized sql() with constants [SPARK-43009][SQL] Parameterized sql() with Any constants Apr 3, 2023
* "DATE'2023-03-21'". The fragments of string values belonged to SQL comments are skipped
* while parsing.
* A map of parameter names to Java/Scala objects that can be converted to SQL literal
* expressions. See <a href="https://spark.apache.org/docs/latest/sql-ref-datatypes.html">
Contributor

@cloud-fan cloud-fan Apr 3, 2023


Looking at the doc, I think we should update it to include the new Java datetime API like `LocalDate`, and put them at the beginning to promote them.

Member Author


Here is the PR #40644

// belonged to SQL comments are skipped while parsing.
map<string, string> args = 2;
// (Optional) A map of parameter names to literal expressions.
map<string, Expression.Literal> args = 2;
Contributor


cc @grundprinzip are these protocol changes ok?

Contributor


No, this is an incompatible change.

Contributor


This would be valid:

Suggested change
map<string, Expression.Literal> args = 2;
map<string, string> args = 2;
map<string, Expression.Literal> expr_args = 3;

so might be

Suggested change
map<string, Expression.Literal> args = 2;
oneof args {
map<string, string> args = 2;
map<string, Expression.Literal> expr_args = 3;
}

Please run

```
buf breaking --against "https://github.com/apache/spark/archive/master.zip#strip_components=1,subdir=connector/connect/common/src/main"
```

// belonged to SQL comments are skipped while parsing.
map<string, string> args = 2;
// (Optional) A map of parameter names to literal expressions.
map<string, Expression.Literal> args = 2;
Contributor


Same

Contributor

@grundprinzip grundprinzip left a comment


The Spark Connect changes are breaking.

@cloud-fan
Contributor

@grundprinzip Spark Connect is not released yet, I think we can still change it? This PR should go to Spark 3.4. cc @xinrong-meng

@grundprinzip
Contributor

If it's guaranteed to be included in 3.4 it's not a breaking change.

@cloud-fan
Contributor

thanks, merging to master/3.4!

@cloud-fan cloud-fan closed this in 156a12e Apr 4, 2023
@cloud-fan
Contributor

It has conflicts with 3.4, @MaxGekk can you create a backport PR? Thanks!

@MaxGekk
Member Author

MaxGekk commented Apr 4, 2023

@cloud-fan I am working on the backport ...

MaxGekk added a commit to MaxGekk/spark that referenced this pull request Apr 4, 2023
Closes apache#40623 from MaxGekk/parameterized-sql-any.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 156a12e)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
@MaxGekk
Member Author

MaxGekk commented Apr 4, 2023

Here is the backport to branch-3.4: #40666

HyukjinKwon pushed a commit that referenced this pull request Apr 5, 2023
Closes #40666 from MaxGekk/parameterized-sql-any-3.4-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
Closes apache#40666 from MaxGekk/parameterized-sql-any-3.4-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
6 participants