
[SPARK-43009][SQL] Parameterized sql() with Any constants #40623

Closed
wants to merge 15 commits into from

Conversation

MaxGekk
Member

@MaxGekk MaxGekk commented Mar 31, 2023

What changes were proposed in this pull request?

In this PR, I propose to change the API of parameterized SQL, replacing the type of argument values from `string` to `Any` in Scala/Java/Python and to `Expression.Literal` in the protobuf API. The language APIs can accept `Any` objects from which literal expressions can be constructed.

Scala/Java:

```scala
def sql(sqlText: String, args: Map[String, Any]): DataFrame
```

Values of the `args` map are wrapped by the `lit()` function, which leaves a `Column` as is and creates a literal from any other Java/Scala object (for details, see the `Scala` tab at https://spark.apache.org/docs/latest/sql-ref-datatypes.html).

Python:

```python
def sql(self, sqlQuery: str, args: Optional[Dict[str, Any]] = None, **kwargs: Any) -> DataFrame:
```

Similar to the Scala/Java `sql`, Python's `sql()` accepts Python objects as values of the `args` dictionary (see https://spark.apache.org/docs/latest/sql-ref-datatypes.html for the acceptable Python objects). `sql()` converts dictionary values to `Column` literal expressions via `lit()`.

Protobuf:

```proto
message SqlCommand {
  // (Required) SQL Query.
  string sql = 1;

  // (Optional) A map of parameter names to literal expressions.
  map<string, Expression.Literal> args = 2;
}
```

For example:

```scala
scala> val sqlText = """SELECT s FROM VALUES ('Jeff /*__*/ Green'), ('E\'Twaun Moore') AS t(s) WHERE s = :player_name"""
sqlText: String = SELECT s FROM VALUES ('Jeff /*__*/ Green'), ('E\'Twaun Moore') AS t(s) WHERE s = :player_name

scala> sql(sqlText, args = Map("player_name" -> lit("E'Twaun Moore"))).show(false)
+-------------+
|s            |
+-------------+
|E'Twaun Moore|
+-------------+
```
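As an aside (not part of this PR), the `:player_name` named-parameter style shown above is analogous to DB-API parameter binding, where the bound value likewise never enters the SQL text. A minimal sketch using Python's stdlib `sqlite3` with an in-memory database, mirroring the query above:

```python
import sqlite3

# In-memory database; the inline rows mirror the VALUES clause from the
# Scala example above.
con = sqlite3.connect(":memory:")
cur = con.execute(
    "SELECT s FROM (SELECT 'Jeff /*__*/ Green' AS s"
    " UNION ALL SELECT 'E''Twaun Moore')"
    " WHERE s = :player_name",
    {"player_name": "E'Twaun Moore"},  # bound as a value, no escaping needed
)
print(cur.fetchall())  # [("E'Twaun Moore",)]
```

Because the apostrophe travels as data rather than as SQL text, no manual escaping of `E'Twaun Moore` is required, which is the same property the `Any`-typed `args` give Spark's `sql()`.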

Why are the changes needed?

The current implementation of the parameterized `sql()` requires arguments as string values that are parsed into SQL literal expressions, which causes the following issues:

  1. SQL comments are stripped while parsing, so fragments of input values can be lost. For example, in `'Europe -- Amsterdam'`, the `-- Amsterdam` part is excluded from the input.
  2. Special characters in string values must be escaped, for instance `'E\'Twaun Moore'`.
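To make the first hazard concrete without Spark, here is a small self-contained sketch (plain Python; `strip_line_comments` is a hypothetical stand-in for a parser's comment handling) contrasting string-spliced arguments with values kept outside the SQL text:

```python
import re

def strip_line_comments(sql: str) -> str:
    # Simplified comment stripper, standing in for a SQL parser that
    # discards `-- ...` line comments.
    return re.sub(r"--.*$", "", sql, flags=re.MULTILINE)

# Naive approach: the value is spliced into the SQL text before parsing,
# so the comment stripper eats part of the *data*.
spliced = "SELECT * FROM t WHERE city = 'Europe -- Amsterdam'"
print(strip_line_comments(spliced))  # value truncated to 'Europe

# Typed binding: the value never enters the SQL text, so nothing is lost.
query = "SELECT * FROM t WHERE city = :city"
args = {"city": "Europe -- Amsterdam"}
print(strip_line_comments(query), args)  # query and value both intact
```

With typed `args`, the parser only ever sees the placeholder `:city`, so comment handling and quoting rules cannot corrupt the bound value.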

Does this PR introduce any user-facing change?

No, since the parameterized SQL feature #38864 has not been released yet.

How was this patch tested?

By running the affected tests:

```
$ build/sbt "test:testOnly *ParametersSuite"
$ python/run-tests --parallelism=1 --testnames 'pyspark.sql.tests.connect.test_connect_basic SparkConnectBasicTests.test_sql_with_args'
$ python/run-tests --parallelism=1 --testnames 'pyspark.sql.session SparkSession.sql'
```

@MaxGekk MaxGekk changed the title [WIP][SQL] Parameterized sql() with literal args [WIP][SQL] Parameterized sql() with constants Apr 3, 2023
@MaxGekk MaxGekk changed the title [WIP][SQL] Parameterized sql() with constants [SPARK-43009][SQL] Parameterized sql() with constants Apr 3, 2023
@MaxGekk MaxGekk marked this pull request as ready for review April 3, 2023 08:54
@MaxGekk MaxGekk changed the title [SPARK-43009][SQL] Parameterized sql() with constants [SPARK-43009][SQL] Parameterized sql() with Any constants Apr 3, 2023
* "DATE'2023-03-21'". The fragments of string values belonged to SQL comments are skipped
* while parsing.
* A map of parameter names to Java/Scala objects that can be converted to SQL literal
* expressions. See <a href="https://spark.apache.org/docs/latest/sql-ref-datatypes.html">
Contributor

@cloud-fan cloud-fan Apr 3, 2023


Looking at the doc, I think we should update it to include the new Java datetime API like `LocalDate`, and put them at the beginning to promote them.

Member Author


Here is the PR #40644

// belonged to SQL comments are skipped while parsing.
map<string, string> args = 2;
// (Optional) A map of parameter names to literal expressions.
map<string, Expression.Literal> args = 2;
Contributor


cc @grundprinzip are these protocol changes ok?

Contributor


No, this is an incompatible change.

Contributor


This would be valid:

Suggested change
map<string, Expression.Literal> args = 2;
map<string, string> args = 2;
map<string, Expression.Literal> expr_args = 3;

so might be

Suggested change
map<string, Expression.Literal> args = 2;
oneof args {
map<string, string> args = 2;
map<string, Expression.Literal> expr_args = 3;
}

Please run

```
buf breaking --against "https://github.com/apache/spark/archive/master.zip#strip_components=1,subdir=connector/connect/common/src/main"
```

// belonged to SQL comments are skipped while parsing.
map<string, string> args = 2;
// (Optional) A map of parameter names to literal expressions.
map<string, Expression.Literal> args = 2;
Contributor


Same

Contributor

@grundprinzip grundprinzip left a comment


The Spark Connect changes are breaking.

@cloud-fan
Contributor

@grundprinzip Spark Connect is not released yet, I think we can still change it? This PR should go to Spark 3.4. cc @xinrong-meng

@grundprinzip
Contributor

If it's guaranteed to be included in 3.4 it's not a breaking change.

@cloud-fan
Contributor

thanks, merging to master/3.4!

@cloud-fan cloud-fan closed this in 156a12e Apr 4, 2023
@cloud-fan
Contributor

It has conflicts with 3.4, @MaxGekk can you create a backport PR? Thanks!

@MaxGekk
Member Author

MaxGekk commented Apr 4, 2023

@cloud-fan I am working on the backport ...

MaxGekk added a commit to MaxGekk/spark that referenced this pull request Apr 4, 2023
Closes apache#40623 from MaxGekk/parameterized-sql-any.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 156a12e)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
@MaxGekk
Member Author

MaxGekk commented Apr 4, 2023

Here is the backport to branch-3.4: #40666

HyukjinKwon pushed a commit that referenced this pull request Apr 5, 2023
Closes #40666 from MaxGekk/parameterized-sql-any-3.4-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
Closes apache#40666 from MaxGekk/parameterized-sql-any-3.4-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
6 participants