[SPARK-42128][SQL] Support TOP (N) for MS SQL Server dialect as an alternative to Limit pushdown #39660

sadikovi · 2023-01-20T04:33:21Z

What changes were proposed in this pull request?

This PR adds Limit pushdown support for MS SQL Server. Although SQL Server does not support LIMIT clause, it supports TOP (N) expression (https://learn.microsoft.com/en-us/sql/t-sql/queries/top-transact-sql?view=sql-server-ver16) which works in a similar way.

It is a simple change that adds another method to the dialect API to handle TOP expression specifically for SQL Server (or other database that might support it in the future). However, TOP and LIMIT are mutually exclusive in this PR.

TOP (N) also requires that a subquery is named, e.g. select TOP (5) * from (select * from test) XYZ, otherwise the query fails. Currently Spark injects SPARK_GEN_SUBQ_ name, so it works fine.

Why are the changes needed?

Enables Limit pushdown for MS SQL Server.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

I added a unit test and verified manually against a SQL Server instance.

sadikovi · 2023-01-20T04:35:18Z

@sunchao @beliefer @dongjoon-hyun Could you review this PR? Thank you.

It would be great if you could also suggest on how I can test the query string generated by Spark before it is sent to the JDBC driver. Are there such existing tests in the codebase?

dongjoon-hyun

Thank you for pinging me, @sadikovi .

@huaxingao is also an expert in JDBC pushdown domain.

dongjoon-hyun · 2023-01-20T04:38:05Z

sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala

  }

+  override def getTopExpression(limit: Integer): String = {
+    if (limit > 0) s"TOP ($limit)" else ""


Just curious, MsSQL syntax doesn't allow TOP 0?

TOP 0 is supported but 0 is used as a sentinel value to disable the pushdown, it is similar to how LIMIT works currently. I tested df.select("a").limit(0) query and it is optimised to ignore the query (?).

dongjoon-hyun · 2023-01-20T04:38:37Z

sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala

    }
  }

+  test("Dialect Limit and Top implementation") {


nit. Shall we use the test prefix, SPARK-42128: ?

dongjoon-hyun

I'm fine with this approach.

beliefer · 2023-01-20T08:26:42Z

I have a question if Top(n) without order by is Top N ? or it is just limit N ?

sadikovi · 2023-01-20T09:06:31Z

Thanks @dongjoon-hyun. I will address your comments soon-ish 🙂.

@beliefer, Yes, you are right. The documentation describes TOP (N) returning the N top rows when used together with ORDER BY but when it is by itself, it behaves just like LIMIT - no particular order is guaranteed.

beliefer · 2023-01-20T09:19:12Z

sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala

+   * MS SQL Server version of `getLimitClause`.
+   * This is only supported by SQL Server as it uses TOP (N) instead.
+   */
+  def getTopExpression(limit: Integer): String = {


This API looks a little hack.

#39667 refactor the API and this PR could override getSQLText.
cc @cloud-fan @huaxingao

What exactly is hacky in this API? As far as I can see, it just follows the existing JDBC code. If you are referring to an alternative implementation, then I am happy to consider it, otherwise please elaborate.

I also don't see how the PR you linked solves this problem. Yes, there is no top reference in the general JdbcDialects but now MSSQL dialect needs to reimplement getSQLText method which is suboptimal IMHO.

I think we should define the different syntax by dialect themself. I exact the API getSQLText so as some dialect could implement it in special way.

I agree, it would be good to have something more flexible than the current API; however, I also don't think that moving getSQLText into JdbcDialects is good approach either: difficult to evolve API, too much boilerplate to copy-paste if you need to change a small detail in the query.

I think it would be good to split the query into fragments and have methods for each. However, there has already been a precedent with getTableSample which is similar to TOP N but is only supported by Postgres.

I can work on a refined API for JdbcDialects as a follow-up for this work, would that work? I may need to prepare a small SPIP/design doc on it.

@beliefer @srowen @dongjoon-hyun What do you think? Or should I just work on a new API right away without this change? I would appreciate your suggestions.

I agree, it would be good to have something more flexible than the current API; however, I also don't think that moving getSQLText into JdbcDialects is good approach either: difficult to evolve API, too much boilerplate to copy-paste if you need to change a small detail in the query.

@sadikovi You said is right. I simplified the #39667. getSQLText only used to organize the SQL syntax.

srowen · 2023-01-20T15:23:37Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala

      ""
    }

+    val myTopExpression: String = dialect.getTopExpression(limit) // SQL Server Limit alternative


Yeah, feels like this belongs all in the MySQL dialect, one way or the other. Is it not just an alternative LIMIT clause?

As I mentioned in the PR description, SQL Server does not support LIMIT clause.

Of course, what i mean is, is "TOP n" just the 'limit' clause you need to generate for MySQL? then it doesn't need different handling, just needs to generate a different clause. But looks like it goes into the SELECT part, not at the end, OK

MySQL supports LIMIT clause, I assume you meant MSSQL/SQL Server.

Yes, you are right. Although TOP N has similar semantics as LIMIT, it is more of an expression (?) rather than a clause, sort of like DISTINCT. We need to insert TOP N before the list of columns in select, unfortunately, this is the syntax for it.

I came up with this approach of introducing another getTopExpression because I thought it would be the least intrusive one but I am open to alternative approaches, please let me know if you have any suggestions as to how I can refactor the code to support this.

Oops yes I'm talking about what you're talking about, typo - MS SQL Server
I'm OK with it; the alternative is to somehow edit the SQL query down in the MS SQL dialect maybe, I haven't thought about it. It'd be nicer but not sure it's cleaner.

Actually, there is already a precedent for adding something like this: getTableSample method.

AmplabJenkins · 2023-01-23T01:08:12Z

Can one of the admins verify this patch?

sadikovi · 2023-01-25T04:06:23Z

@beliefer @srowen @dongjoon-hyun Could you please check the following comments:

[SPARK-42128][SQL] Support TOP (N) for MS SQL Server dialect as an alternative to Limit pushdown #39660 (comment)
[SPARK-42128][SQL] Support TOP (N) for MS SQL Server dialect as an alternative to Limit pushdown #39660 (comment)

Do you prefer to redesign JdbcDialects API for this work as a follow-up? Or should I close this PR and just work on redesign directly? Thanks.

cloud-fan · 2023-01-30T02:46:25Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala


    val sqlText = options.prepareQuery +
-      s"SELECT $columnList FROM ${options.tableOrQuery} $myTableSampleClause" +
+      s"SELECT $myTopExpression $columnList FROM ${options.tableOrQuery} $myTableSampleClause" +


It's hard to define a SQL template that works for all dialects. The approach we take is to try to make it a superset of all dialects.

A better design is to let the dialect build the SQL. For example, we can add a new API in JdbcDialect

def getSQLBuilder: SQLQueryBuilder = ... the default implementation builds the SQL in a standard way.

Then Spark gets the builder and do

val builder = ... builder.withColumns(...).withFromRelation(...).withLimit(...)...build()

SQLServerDialect can put the limit before the column list.

I will do the work.

cloud-fan · 2023-01-30T02:51:58Z

sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala


+  /**
+   * MS SQL Server version of `getLimitClause`.
+   * This is only supported by SQL Server as it uses TOP (N) instead.


The problem is TOP (N) is just the SQLServer syntax of LIMIT. It's really weird to have a new dialect API for it, as it's different from table sample which is a new feature.

sadikovi · 2023-01-30T22:13:42Z

What is the resolution? Is it okay to merge the PR or do you think someone should work on refactoring the API first?

cloud-fan · 2023-01-31T02:07:09Z

This is a new feature and we don't need to rush (can't catch 3.4 anyway). I'd like to have an API refactor first.

add top implementation

eb11a68

sadikovi changed the title ~~[SPARK-42128] Support TOP (N) for MS SQL Server dialect as an alternative to Limit pushdown~~ [SPARK-42128][SQL] Support TOP (N) for MS SQL Server dialect as an alternative to Limit pushdown Jan 20, 2023

github-actions bot added the SQL label Jan 20, 2023

dongjoon-hyun reviewed Jan 20, 2023

View reviewed changes

dongjoon-hyun approved these changes Jan 20, 2023

View reviewed changes

beliefer reviewed Jan 20, 2023

View reviewed changes

srowen reviewed Jan 20, 2023

View reviewed changes

update test name

c451c7b

sadikovi mentioned this pull request Jan 29, 2023

[SPARK-42131][SQL] Extract the function that construct the select statement for JDBC dialect. #39667

Closed

cloud-fan reviewed Jan 30, 2023

View reviewed changes

sadikovi closed this Feb 1, 2023

[SPARK-42128][SQL] Support TOP (N) for MS SQL Server dialect as an alternative to Limit pushdown #39660

[SPARK-42128][SQL] Support TOP (N) for MS SQL Server dialect as an alternative to Limit pushdown #39660

Uh oh!

Conversation

sadikovi commented Jan 20, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

sadikovi commented Jan 20, 2023

Uh oh!

dongjoon-hyun left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dongjoon-hyun left a comment

Choose a reason for hiding this comment

Uh oh!

beliefer commented Jan 20, 2023

Uh oh!

sadikovi commented Jan 20, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sadikovi Jan 22, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

beliefer Jan 23, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

beliefer Jan 28, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

AmplabJenkins commented Jan 23, 2023

Uh oh!

sadikovi commented Jan 25, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sadikovi commented Jan 30, 2023

Uh oh!

cloud-fan commented Jan 31, 2023

Uh oh!

Reviewers

Assignees

Labels

sadikovi commented Jan 20, 2023 •

edited

Loading

sadikovi Jan 22, 2023 •

edited

Loading

beliefer Jan 23, 2023 •

edited

Loading

beliefer Jan 28, 2023 •

edited

Loading