-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-42128][SQL] Support TOP (N) for MS SQL Server dialect as an alternative to Limit pushdown #39660
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@sunchao @beliefer @dongjoon-hyun Could you review this PR? Thank you. It would be great if you could also suggest on how I can test the query string generated by Spark before it is sent to the JDBC driver. Are there such existing tests in the codebase? |
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for pinging me, @sadikovi .
@huaxingao is also an expert in JDBC pushdown domain.
| } | ||
|
|
||
| override def getTopExpression(limit: Integer): String = { | ||
| if (limit > 0) s"TOP ($limit)" else "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, MsSQL syntax doesn't allow TOP 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TOP 0 is supported but 0 is used as a sentinel value to disable the pushdown, it is similar to how LIMIT works currently. I tested df.select("a").limit(0) query and it is optimised to ignore the query (?).
| } | ||
| } | ||
|
|
||
| test("Dialect Limit and Top implementation") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit. Shall we use the test prefix, SPARK-42128: ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with this approach.
|
I have a question if Top(n) without order by is Top N ? or it is just limit N ? |
|
Thanks @dongjoon-hyun. I will address your comments soon-ish 🙂. @beliefer, Yes, you are right. The documentation describes TOP (N) returning the N top rows when used together with ORDER BY but when it is by itself, it behaves just like LIMIT - no particular order is guaranteed. |
| * MS SQL Server version of `getLimitClause`. | ||
| * This is only supported by SQL Server as it uses TOP (N) instead. | ||
| */ | ||
| def getTopExpression(limit: Integer): String = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This API looks a little hack.
#39667 refactor the API and this PR could override getSQLText.
cc @cloud-fan @huaxingao
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What exactly is hacky in this API? As far as I can see, it just follows the existing JDBC code. If you are referring to an alternative implementation, then I am happy to consider it, otherwise please elaborate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also don't see how the PR you linked solves this problem. Yes, there is no top reference in the general JdbcDialects but now MSSQL dialect needs to reimplement getSQLText method which is suboptimal IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should define the different syntax by dialect themself. I exact the API getSQLText so as some dialect could implement it in special way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, it would be good to have something more flexible than the current API; however, I also don't think that moving getSQLText into JdbcDialects is good approach either: difficult to evolve API, too much boilerplate to copy-paste if you need to change a small detail in the query.
I think it would be good to split the query into fragments and have methods for each. However, there has already been a precedent with getTableSample which is similar to TOP N but is only supported by Postgres.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can work on a refined API for JdbcDialects as a follow-up for this work, would that work? I may need to prepare a small SPIP/design doc on it.
@beliefer @srowen @dongjoon-hyun What do you think? Or should I just work on a new API right away without this change? I would appreciate your suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, it would be good to have something more flexible than the current API; however, I also don't think that moving
getSQLTextintoJdbcDialectsis good approach either: difficult to evolve API, too much boilerplate to copy-paste if you need to change a small detail in the query.
@sadikovi You said is right. I simplified the #39667. getSQLText only used to organize the SQL syntax.
| "" | ||
| } | ||
|
|
||
| val myTopExpression: String = dialect.getTopExpression(limit) // SQL Server Limit alternative |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, feels like this belongs all in the MySQL dialect, one way or the other. Is it not just an alternative LIMIT clause?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned in the PR description, SQL Server does not support LIMIT clause.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course, what i mean is, is "TOP n" just the 'limit' clause you need to generate for MySQL? then it doesn't need different handling, just needs to generate a different clause. But looks like it goes into the SELECT part, not at the end, OK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MySQL supports LIMIT clause, I assume you meant MSSQL/SQL Server.
Yes, you are right. Although TOP N has similar semantics as LIMIT, it is more of an expression (?) rather than a clause, sort of like DISTINCT. We need to insert TOP N before the list of columns in select, unfortunately, this is the syntax for it.
I came up with this approach of introducing another getTopExpression because I thought it would be the least intrusive one but I am open to alternative approaches, please let me know if you have any suggestions as to how I can refactor the code to support this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops yes I'm talking about what you're talking about, typo - MS SQL Server
I'm OK with it; the alternative is to somehow edit the SQL query down in the MS SQL dialect maybe, I haven't thought about it. It'd be nicer but not sure it's cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, there is already a precedent for adding something like this: getTableSample method.
|
Can one of the admins verify this patch? |
|
@beliefer @srowen @dongjoon-hyun Could you please check the following comments:
Do you prefer to redesign JdbcDialects API for this work as a follow-up? Or should I close this PR and just work on redesign directly? Thanks. |
|
|
||
| val sqlText = options.prepareQuery + | ||
| s"SELECT $columnList FROM ${options.tableOrQuery} $myTableSampleClause" + | ||
| s"SELECT $myTopExpression $columnList FROM ${options.tableOrQuery} $myTableSampleClause" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's hard to define a SQL template that works for all dialects. The approach we take is to try to make it a superset of all dialects.
A better design is to let the dialect build the SQL. For example, we can add a new API in JdbcDialect
def getSQLBuilder: SQLQueryBuilder = ... the default implementation builds the SQL in a standard way.
Then Spark gets the builder and do
val builder = ...
builder.withColumns(...).withFromRelation(...).withLimit(...)...build()
SQLServerDialect can put the limit before the column list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will do the work.
|
|
||
| /** | ||
| * MS SQL Server version of `getLimitClause`. | ||
| * This is only supported by SQL Server as it uses TOP (N) instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is TOP (N) is just the SQLServer syntax of LIMIT. It's really weird to have a new dialect API for it, as it's different from table sample which is a new feature.
|
What is the resolution? Is it okay to merge the PR or do you think someone should work on refactoring the API first? |
|
This is a new feature and we don't need to rush (can't catch 3.4 anyway). I'd like to have an API refactor first. |
What changes were proposed in this pull request?
This PR adds Limit pushdown support for MS SQL Server. Although SQL Server does not support LIMIT clause, it supports TOP (N) expression (https://learn.microsoft.com/en-us/sql/t-sql/queries/top-transact-sql?view=sql-server-ver16) which works in a similar way.
It is a simple change that adds another method to the dialect API to handle TOP expression specifically for SQL Server (or other database that might support it in the future). However, TOP and LIMIT are mutually exclusive in this PR.
TOP (N) also requires that a subquery is named, e.g.
select TOP (5) * from (select * from test) XYZ, otherwise the query fails. Currently Spark injectsSPARK_GEN_SUBQ_name, so it works fine.Why are the changes needed?
Enables Limit pushdown for MS SQL Server.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
I added a unit test and verified manually against a SQL Server instance.