Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -307,11 +307,12 @@ private[jdbc] class JDBCRDD(
""
}

val myTopExpression: String = dialect.getTopExpression(limit) // SQL Server Limit alternative
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, feels like this belongs all in the MySQL dialect, one way or the other. Is it not just an alternative LIMIT clause?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned in the PR description, SQL Server does not support LIMIT clause.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, what i mean is, is "TOP n" just the 'limit' clause you need to generate for MySQL? then it doesn't need different handling, just needs to generate a different clause. But looks like it goes into the SELECT part, not at the end, OK

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MySQL supports LIMIT clause, I assume you meant MSSQL/SQL Server.

Yes, you are right. Although TOP N has similar semantics as LIMIT, it is more of an expression (?) rather than a clause, sort of like DISTINCT. We need to insert TOP N before the list of columns in select, unfortunately, this is the syntax for it.

I came up with this approach of introducing another getTopExpression because I thought it would be the least intrusive one but I am open to alternative approaches, please let me know if you have any suggestions as to how I can refactor the code to support this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops yes I'm talking about what you're talking about, typo - MS SQL Server
I'm OK with it; the alternative is to somehow edit the SQL query down in the MS SQL dialect maybe, I haven't thought about it. It'd be nicer but not sure it's cleaner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, there is already a precedent for adding something like this: getTableSample method.

val myLimitClause: String = dialect.getLimitClause(limit)
val myOffsetClause: String = dialect.getOffsetClause(offset)

val sqlText = options.prepareQuery +
s"SELECT $columnList FROM ${options.tableOrQuery} $myTableSampleClause" +
s"SELECT $myTopExpression $columnList FROM ${options.tableOrQuery} $myTableSampleClause" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to define a SQL template that works for all dialects. The approach we take is to try to make it a superset of all dialects.

A better design is to let the dialect build the SQL. For example, we can add a new API in JdbcDialect

def getSQLBuilder: SQLQueryBuilder = ... the default implementation builds the SQL in a standard way.

Then Spark gets the builder and do

val builder = ...
builder.withColumns(...).withFromRelation(...).withLimit(...)...build()

SQLServerDialect can put the limit before the column list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will do the work.

s" $myWhereClause $getGroupByClause $getOrderByClause $myLimitClause $myOffsetClause"
stmt = conn.prepareStatement(sqlText,
ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -544,6 +544,14 @@ abstract class JdbcDialect extends Serializable with Logging {
if (limit > 0 ) s"LIMIT $limit" else ""
}

/**
* MS SQL Server version of `getLimitClause`.
* This is only supported by SQL Server as it uses TOP (N) instead.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is TOP (N) is just the SQLServer syntax of LIMIT. It's really weird to have a new dialect API for it, as it's different from table sample which is a new feature.

*/
def getTopExpression(limit: Integer): String = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API looks a little hack.

#39667 refactor the API and this PR could override getSQLText.
cc @cloud-fan @huaxingao

Copy link
Contributor Author

@sadikovi sadikovi Jan 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly is hacky in this API? As far as I can see, it just follows the existing JDBC code. If you are referring to an alternative implementation, then I am happy to consider it, otherwise please elaborate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't see how the PR you linked solves this problem. Yes, there is no top reference in the general JdbcDialects but now MSSQL dialect needs to reimplement getSQLText method which is suboptimal IMHO.

Copy link
Contributor

@beliefer beliefer Jan 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should define the different syntax by dialect themself. I exact the API getSQLText so as some dialect could implement it in special way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it would be good to have something more flexible than the current API; however, I also don't think that moving getSQLText into JdbcDialects is good approach either: difficult to evolve API, too much boilerplate to copy-paste if you need to change a small detail in the query.

I think it would be good to split the query into fragments and have methods for each. However, there has already been a precedent with getTableSample which is similar to TOP N but is only supported by Postgres.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can work on a refined API for JdbcDialects as a follow-up for this work, would that work? I may need to prepare a small SPIP/design doc on it.

@beliefer @srowen @dongjoon-hyun What do you think? Or should I just work on a new API right away without this change? I would appreciate your suggestions.

Copy link
Contributor

@beliefer beliefer Jan 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it would be good to have something more flexible than the current API; however, I also don't think that moving getSQLText into JdbcDialects is good approach either: difficult to evolve API, too much boilerplate to copy-paste if you need to change a small detail in the query.

@sadikovi You said is right. I simplified the #39667. getSQLText only used to organize the SQL syntax.

""
}

/**
* returns the OFFSET clause for the SELECT statement
*/
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -167,10 +167,15 @@ private object MsSqlServerDialect extends JdbcDialect {
throw QueryExecutionErrors.commentOnTableUnsupportedError()
}

// SQL Server does not support, it uses `getTopExpression` instead.
override def getLimitClause(limit: Integer): String = {
""
}

override def getTopExpression(limit: Integer): String = {
if (limit > 0) s"TOP ($limit)" else ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, MsSQL syntax doesn't allow TOP 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TOP 0 is supported but 0 is used as a sentinel value to disable the pushdown, it is similar to how LIMIT works currently. I tested df.select("a").limit(0) query and it is optimised to ignore the query (?).

}

override def classifyException(message: String, e: Throwable): AnalysisException = {
e match {
case sqlException: SQLException =>
Expand Down
29 changes: 29 additions & 0 deletions sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
Original file line number Diff line number Diff line change
Expand Up @@ -1001,6 +1001,35 @@ class JDBCSuite extends QueryTest with SharedSparkSession {
}
}

test("SPARK-42128: Dialect Limit and Top implementation") {
// Dialects that support LIMIT N.
val limitDialects = Seq(
JdbcDialects.get("jdbc:mysql"),
JdbcDialects.get("jdbc:postgresql"),
JdbcDialects.get("jdbc:db2"),
JdbcDialects.get("jdbc:h2")
)

for (dialect <- limitDialects) {
assert(dialect.getLimitClause(0) == "")
assert(dialect.getTopExpression(0) == "")
assert(dialect.getLimitClause(100) == "LIMIT 100")
assert(dialect.getTopExpression(100) == "")
}

// Dialects that support TOP (N)
val topDialects = Seq(
JdbcDialects.get("jdbc:sqlserver")
)

for (dialect <- topDialects) {
assert(dialect.getLimitClause(0) == "")
assert(dialect.getTopExpression(0) == "")
assert(dialect.getLimitClause(100) == "")
assert(dialect.getTopExpression(100) == "TOP (100)")
}
}

test("table exists query by jdbc dialect") {
val MySQL = JdbcDialects.get("jdbc:mysql://127.0.0.1/db")
val Postgres = JdbcDialects.get("jdbc:postgresql://127.0.0.1/db")
Expand Down