Skip to content

Conversation

@sadikovi
Copy link
Contributor

@sadikovi sadikovi commented Jan 20, 2023

What changes were proposed in this pull request?

This PR adds Limit pushdown support for MS SQL Server. Although SQL Server does not support LIMIT clause, it supports TOP (N) expression (https://learn.microsoft.com/en-us/sql/t-sql/queries/top-transact-sql?view=sql-server-ver16) which works in a similar way.

It is a simple change that adds another method to the dialect API to handle TOP expression specifically for SQL Server (or other database that might support it in the future). However, TOP and LIMIT are mutually exclusive in this PR.

TOP (N) also requires that a subquery is named, e.g. select TOP (5) * from (select * from test) XYZ, otherwise the query fails. Currently Spark injects SPARK_GEN_SUBQ_ name, so it works fine.

Why are the changes needed?

Enables Limit pushdown for MS SQL Server.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

I added a unit test and verified manually against a SQL Server instance.

@sadikovi sadikovi changed the title [SPARK-42128] Support TOP (N) for MS SQL Server dialect as an alternative to Limit pushdown [SPARK-42128][SQL] Support TOP (N) for MS SQL Server dialect as an alternative to Limit pushdown Jan 20, 2023
@github-actions github-actions bot added the SQL label Jan 20, 2023
@sadikovi
Copy link
Contributor Author

@sunchao @beliefer @dongjoon-hyun Could you review this PR? Thank you.

It would be great if you could also suggest on how I can test the query string generated by Spark before it is sent to the JDBC driver. Are there such existing tests in the codebase?

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pinging me, @sadikovi .

@huaxingao is also an expert in JDBC pushdown domain.

}

override def getTopExpression(limit: Integer): String = {
if (limit > 0) s"TOP ($limit)" else ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, MsSQL syntax doesn't allow TOP 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TOP 0 is supported but 0 is used as a sentinel value to disable the pushdown, it is similar to how LIMIT works currently. I tested df.select("a").limit(0) query and it is optimised to ignore the query (?).

}
}

test("Dialect Limit and Top implementation") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. Shall we use the test prefix, SPARK-42128: ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with this approach.

@beliefer
Copy link
Contributor

I have a question if Top(n) without order by is Top N ? or it is just limit N ?

@sadikovi
Copy link
Contributor Author

Thanks @dongjoon-hyun. I will address your comments soon-ish 🙂.

@beliefer, Yes, you are right. The documentation describes TOP (N) returning the N top rows when used together with ORDER BY but when it is by itself, it behaves just like LIMIT - no particular order is guaranteed.

* MS SQL Server version of `getLimitClause`.
* This is only supported by SQL Server as it uses TOP (N) instead.
*/
def getTopExpression(limit: Integer): String = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API looks a little hack.

#39667 refactor the API and this PR could override getSQLText.
cc @cloud-fan @huaxingao

Copy link
Contributor Author

@sadikovi sadikovi Jan 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly is hacky in this API? As far as I can see, it just follows the existing JDBC code. If you are referring to an alternative implementation, then I am happy to consider it, otherwise please elaborate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't see how the PR you linked solves this problem. Yes, there is no top reference in the general JdbcDialects but now MSSQL dialect needs to reimplement getSQLText method which is suboptimal IMHO.

Copy link
Contributor

@beliefer beliefer Jan 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should define the different syntax by dialect themself. I exact the API getSQLText so as some dialect could implement it in special way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it would be good to have something more flexible than the current API; however, I also don't think that moving getSQLText into JdbcDialects is good approach either: difficult to evolve API, too much boilerplate to copy-paste if you need to change a small detail in the query.

I think it would be good to split the query into fragments and have methods for each. However, there has already been a precedent with getTableSample which is similar to TOP N but is only supported by Postgres.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can work on a refined API for JdbcDialects as a follow-up for this work, would that work? I may need to prepare a small SPIP/design doc on it.

@beliefer @srowen @dongjoon-hyun What do you think? Or should I just work on a new API right away without this change? I would appreciate your suggestions.

Copy link
Contributor

@beliefer beliefer Jan 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it would be good to have something more flexible than the current API; however, I also don't think that moving getSQLText into JdbcDialects is good approach either: difficult to evolve API, too much boilerplate to copy-paste if you need to change a small detail in the query.

@sadikovi You said is right. I simplified the #39667. getSQLText only used to organize the SQL syntax.

""
}

val myTopExpression: String = dialect.getTopExpression(limit) // SQL Server Limit alternative
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, feels like this belongs all in the MySQL dialect, one way or the other. Is it not just an alternative LIMIT clause?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned in the PR description, SQL Server does not support LIMIT clause.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, what i mean is, is "TOP n" just the 'limit' clause you need to generate for MySQL? then it doesn't need different handling, just needs to generate a different clause. But looks like it goes into the SELECT part, not at the end, OK

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MySQL supports LIMIT clause, I assume you meant MSSQL/SQL Server.

Yes, you are right. Although TOP N has similar semantics as LIMIT, it is more of an expression (?) rather than a clause, sort of like DISTINCT. We need to insert TOP N before the list of columns in select, unfortunately, this is the syntax for it.

I came up with this approach of introducing another getTopExpression because I thought it would be the least intrusive one but I am open to alternative approaches, please let me know if you have any suggestions as to how I can refactor the code to support this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops yes I'm talking about what you're talking about, typo - MS SQL Server
I'm OK with it; the alternative is to somehow edit the SQL query down in the MS SQL dialect maybe, I haven't thought about it. It'd be nicer but not sure it's cleaner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, there is already a precedent for adding something like this: getTableSample method.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@sadikovi
Copy link
Contributor Author

@beliefer @srowen @dongjoon-hyun Could you please check the following comments:

Do you prefer to redesign JdbcDialects API for this work as a follow-up? Or should I close this PR and just work on redesign directly? Thanks.


val sqlText = options.prepareQuery +
s"SELECT $columnList FROM ${options.tableOrQuery} $myTableSampleClause" +
s"SELECT $myTopExpression $columnList FROM ${options.tableOrQuery} $myTableSampleClause" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to define a SQL template that works for all dialects. The approach we take is to try to make it a superset of all dialects.

A better design is to let the dialect build the SQL. For example, we can add a new API in JdbcDialect

def getSQLBuilder: SQLQueryBuilder = ... the default implementation builds the SQL in a standard way.

Then Spark gets the builder and do

val builder = ...
builder.withColumns(...).withFromRelation(...).withLimit(...)...build()

SQLServerDialect can put the limit before the column list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will do the work.


/**
* MS SQL Server version of `getLimitClause`.
* This is only supported by SQL Server as it uses TOP (N) instead.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is TOP (N) is just the SQLServer syntax of LIMIT. It's really weird to have a new dialect API for it, as it's different from table sample which is a new feature.

@sadikovi
Copy link
Contributor Author

What is the resolution? Is it okay to merge the PR or do you think someone should work on refactoring the API first?

@cloud-fan
Copy link
Contributor

This is a new feature and we don't need to rush (can't catch 3.4 anyway). I'd like to have an API refactor first.

@sadikovi sadikovi closed this Feb 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants