-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-24423][SQL] Add a new option for JDBC sources #21590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #92073 has finished for PR 21590 at commit
|
| require( | ||
| !(tableName.isDefined && query.isDefined), | ||
| s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified." | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put these require in the head of this section (line 68)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu These two requires are using tableName and query which is computed in lines before. Thats why i have placed these two requires after.
| | .option("partitionColumn", "subq.c1" | ||
| | .load() | ||
| """.stripMargin | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we define these two parameters simultaneously? How about the case where columns in query output have a partition column?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu So since the we auto generate a subquery alias here for easy of use, we r disallowing the query option together with partition columns. As users wouldn't know how to qualify the partition columns given the suquery alias is generated implicitly. In this case, we ask them to use the existing dbtable to specify the query where they are in control to specify the alias themselves. Another option i considered is to introduce "queryAlias" as another option. But thought to avoid it for simplicity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu The new option query is just a syntactic sugar for simplifying the work from many basic JDBC users. We can improve it in the future. For example, parsing the user-specified query and make all the other options work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking the case below (this can be accepted in dbtable and query is not?):
postgres=# select * from t;
c0 | c1 | c2
----+----+----
1 | 1 | 1
2 | 2 | 2
3 | 3 | 3
(3 rows)
scala> :paste
val df = spark.read
.format("jdbc")
.option("driver", "org.postgresql.Driver")
.option("url", "jdbc:postgresql://localhost:5432/postgres?user=maropu")
.option("dbtable", "(select c0 p0, c1 p1, c2 p2 from t where c0 > 1) t")
.option("partitionColumn", "p2")
.option("lowerBound", "1")
.option("upperBound", "3")
.option("numPartitions", "2")
.load()
scala> df.show
+---+---+---+
| p0| p1| p2|
+---+---+---+
| 2| 2| 2|
| 3| 3| 3|
+---+---+---+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dilipbiswal Could you reply @maropu with an example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu Currently we disallow it to be on the safe side. Lets take your example. When using the query option to pass on the query, we basically expect the users to supply
select c0 p0, c1 p1, c2 p2 from t where c0 > 1In spark , we will parentesize the query and add in an alias to confirm to the table subquery syntax. Given the user input the above query, he could decide to qualify the partition column names with the table name. So he could do the following :
al df = spark.read
.format("jdbc")
.option("driver", "org.postgresql.Driver")
.option("url", "jdbc:postgresql://localhost:5432/postgres?user=maropu")
.option("query", "select c0 p0, c1 p1, c2 p2 from t where c0 > 1")
.option("partitionColumn", "t.p2") ==> User qualifies the column names.
.option("lowerBound", "1")
.option("upperBound", "3")
.option("numPartitions", "2")
.load()In this case we will end up generating the query of the following form -
select * from (select c0 p0, c1 p1, c2 p2 from t where c0 > 1) __SPARK_GEN_ALIAS where t.p2 >= 1 and t.p2 <=3However this would be an invalid query. In the query option, its possible to specify a complex query involving joins.
Thats the reason, we disallow it to be in safe side. In the dbtable option, users are responsible to explicitly specify the alias and would now how to qualify the partition columns.
Lets see if we can improve this in future. If you have some ideas, please let us know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aha, ok. Thanks for the kind explanation.
| val tableExpression = tableName.map(_.trim).getOrElse { | ||
| // We have ensured in the code above that either dbtable or query is specified. | ||
| query.get match { | ||
| case subq if subq.nonEmpty => s"(${subq}) spark_gen_${curId.getAndIncrement()}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need curId? A constant name is bad?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu Don't mind using a constant name . "spark_gen_alias" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, simpler, better. Btw, we essentially need the alias name for a query? When we currently describe a query in dbtable, it seems we don't need the name?
| val longRunningQuery = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu Yeah. we need an alias. Systems like postgress require a mandatory table subquery alias.
|
@gatorsmile Sorry to be late on this. Please look at this when you have time. |
| val e1 = intercept[RuntimeException] { | ||
| val df = spark.read.format("jdbc") | ||
| .option("Url", urlWithUserAndPass) | ||
| .option("query", query) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HyukjinKwon Thanks.. I will update the doc.
| } | ||
| } | ||
|
|
||
| require(tableExpression.nonEmpty, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error check and error message here are confusing. It seems telling user that the two options can be both specified.
Maybe we should just check the defined one and improve the error message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gengliangwang I see your point. Does this read better to you ?
require(tableOrQuery.nonEmpty,
s"Empty string is not allowed in either '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' options"
)
| val url = parameters(JDBC_URL) | ||
| // name of table | ||
| val table = parameters(JDBC_TABLE_NAME) | ||
| val tableName = parameters.get(JDBC_TABLE_NAME) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I prefer:
val tableExpression = if (parameters.isDefinedAt(JDBC_TABLE_NAME)) {
require(!parameters.isDefinedAt(JDBC_QUERY_STRING), "...")
parameters.get(JDBC_TABLE_NAME).get.trim
} else {
require(parameters.isDefinedAt(JDBC_QUERY_STRING), "...")
s"(${parameters.get(JDBC_QUERY_STRING)}) ${curId.getAndIncrement()}"
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gengliangwang Thanks.. Actually i had tried a couple of different ways. Some how i found this a little hard to follow when i embed the error message. I like to check things upfront along with comments on top easy to follow. But if others find this easy to follow as well, then i will change.
val tableExpression = if (parameters.isDefinedAt(JDBC_TABLE_NAME)) {
require(!parameters.isDefinedAt(JDBC_QUERY_STRING),
s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified."
)
parameters.get(JDBC_TABLE_NAME).get.trim
} else {
require(parameters.isDefinedAt(JDBC_QUERY_STRING),
s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required."
)
s"(${parameters.get(JDBC_QUERY_STRING)}) ${curId.getAndIncrement()}"
}|
Test build #92118 has finished for PR 21590 at commit
|
|
I think the format in example code in the PR description is not well done, e.g. |
| // We have ensured in the code above that either dbtable or query is specified. | ||
| query.get match { | ||
| case subq if subq.nonEmpty => s"(${subq}) spark_gen_${curId.getAndIncrement()}" | ||
| case subq => subq |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, an empty subquery?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I saw you forbidding it later.
docs/sql-programming-guide.md
Outdated
| A query that will be used to read data into Spark. The specified query will be parenthesized and used | ||
| as a subquery in the <code>FROM</code> clause. Spark will also assign a alias to the subquery clause. | ||
| As an example, spark will issue a query of the following form to the datasource.<br> | ||
| <code> SELECT <columns> FROM (<user_specified_query>) spark_generated_alias |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can mention dbtable and query can't be specified at the same time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya OK, i will add this.
docs/sql-programming-guide.md
Outdated
| <td><code>query</code></td> | ||
| <td> | ||
| A query that will be used to read data into Spark. The specified query will be parenthesized and used | ||
| as a subquery in the <code>FROM</code> clause. Spark will also assign a alias to the subquery clause. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to mention this query alias? Meaningful to end users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya I think its better to let users know how we generate the from clause. That way they can choose to qualify the partition columns if needed. However, if you strongly feel otherwise, i will remove from doc.
| ) | ||
|
|
||
| // table name or a table expression. | ||
| val tableExpression = tableName.map(_.trim).getOrElse { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expression sounds a bit confusing. tableOrQuery? tableSource?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya ok. i will change.
| val tableExpression = tableName.map(_.trim).getOrElse { | ||
| // We have ensured in the code above that either dbtable or query is specified. | ||
| query.get match { | ||
| case subq if subq.nonEmpty => s"(${subq}) spark_gen_${curId.getAndIncrement()}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: subq -> subQuery.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya Will change.
|
Test build #92151 has finished for PR 21590 at commit
|
| ) | ||
|
|
||
|
|
||
| // ------------------------------------------------------------ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: revert this line
| // name of table | ||
| val table = parameters(JDBC_TABLE_NAME) | ||
| val tableName = parameters.get(JDBC_TABLE_NAME) | ||
| val query = parameters.get(JDBC_QUERY_STRING) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, since the tableName and query variables don't need to be exposed to other classes, can we remove them?
Btw, I feel sharing the tableName variable int both write/read paths makes code some complicated, so how about splitting the variable into two part: tableOrQuery for reading and outputName for writing?
e.g., d62372a
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu Thank you for taking the time to think about this throughly. A couple of questions/comments.
- Looks like for read path we give precedence to dbtable over query. I feel its good to explicitly disallow this with a clear message in case of an ambiguity.
- Usage of lazy here (especially to trigger errors) makes me a little nervous. Like if we want to introduce a debug statement to print the variables in side the QueryOptions class, things will not work any more, right ? Thats the reason, i had opted to check for the "invalid query option in write path" in the write function itself (i.e when i am sure of the calling context). Perhaps that how its used every where in which case it may be okay to follow the same approach here.
I am okay with this. Lets get some opinion from @gatorsmile. Once i have the final set of comments, i will make the changes. Thanks again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, it's okay to depend on xiao's opnion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option is to follow what we are doing in another PR: #21247 ? We are facing the same issue there. The options are shared by both read and write paths. However, the limitations are different.
gatorsmile
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work!
docs/sql-programming-guide.md
Outdated
| subquery in parentheses. | ||
| The JDBC table that should be read from or written into. Note that when using it in the read | ||
| path anything that is valid in a <code>FROM</code> clause of a SQL query can be used. | ||
| For example, instead of a full table you could also use a subquery in parentheses. Its not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: It is
docs/sql-programming-guide.md
Outdated
| <td><code>query</code></td> | ||
| <td> | ||
| A query that will be used to read data into Spark. The specified query will be parenthesized and used | ||
| as a subquery in the <code>FROM</code> clause. Spark will also assign a alias to the subquery clause. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a alias -> an alias
docs/sql-programming-guide.md
Outdated
| <td> | ||
| A query that will be used to read data into Spark. The specified query will be parenthesized and used | ||
| as a subquery in the <code>FROM</code> clause. Spark will also assign a alias to the subquery clause. | ||
| As an example, spark will issue a query of the following form to the datasource.<br><br> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
datasource -> JDBC source
docs/sql-programming-guide.md
Outdated
| as a subquery in the <code>FROM</code> clause. Spark will also assign a alias to the subquery clause. | ||
| As an example, spark will issue a query of the following form to the datasource.<br><br> | ||
| <code> SELECT <columns> FROM (<user_specified_query>) spark_gen_alias</code><br><br> | ||
| Its not allowed to specify `dbtable` and `query` options at the same time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its -> it is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also document the limitation of query when having partitionColumn and the workaround?
| // name of table | ||
| val table = parameters(JDBC_TABLE_NAME) | ||
| val tableName = parameters.get(JDBC_TABLE_NAME) | ||
| val query = parameters.get(JDBC_QUERY_STRING) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option is to follow what we are doing in another PR: #21247 ? We are facing the same issue there. The options are shared by both read and write paths. However, the limitations are different.
| val tableOrQuery = tableName.map(_.trim).getOrElse { | ||
| // We have ensured in the code above that either dbtable or query is specified. | ||
| query.get match { | ||
| case subQuery if subQuery.nonEmpty => s"(${subQuery}) spark_gen_${curId.getAndIncrement()}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__SPARK_GEN_JDBC_SUBQUERY_NAME_${curId.getAndIncrement()}__?
| ) | ||
|
|
||
| // table name or a table expression. | ||
| val tableOrQuery = tableName.map(_.trim).getOrElse { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a tuple match here?
(tableName, query) match {
case ...
}
| |spark.read.format("jdbc") | ||
| | .option("dbtable", "(select c1, c2 from t1) as subq") | ||
| | .option("partitionColumn", "subq.c1" | ||
| | .load() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! Please add them to the doc as I mentioned above?
| | .option("partitionColumn", "subq.c1" | ||
| | .load() | ||
| """.stripMargin | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu The new option query is just a syntactic sugar for simplifying the work from many basic JDBC users. We can improve it in the future. For example, parsing the user-specified query and make all the other options work.
| require( | ||
| options.tableName.isDefined, | ||
| s"Option '${JDBCOptions.JDBC_TABLE_NAME}' is required. " + | ||
| s"Option '${JDBCOptions.JDBC_QUERY_STRING}' is not applicable while writing.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let us create a JDBCOptionsInWrite?
|
@gatorsmile Thanks a lot. I will process your comments and get back. |
|
Test build #92319 has finished for PR 21590 at commit
|
|
Test build #92318 has finished for PR 21590 at commit
|
|
@gatorsmile @maropu I have hopefully addressed the comments. Please take a look when you get a chance. |
| val JDBC_SESSION_INIT_STATEMENT = newOption("sessionInitStatement") | ||
| } | ||
|
|
||
| class JdbcOptionsInWrite( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about adding JdbcOptionsInRead for read-only options, then making the base JdbcOptions trait?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu Can i take this on as a follow-up ? The reason is i am not fully familiar with all the options. I need to study those a bit more before i refactor them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That depends on comitter's decisions: @gatorsmile
| } | ||
| } | ||
|
|
||
| // ------------------------------------------------------------ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need to remove this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu OK.
| val sessionInitStatement = parameters.get(JDBC_SESSION_INIT_STATEMENT) | ||
| } | ||
|
|
||
| object JDBCOptions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move JDBCOptions in the end of this file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu Sure.
| s"Option '${JDBCOptions.JDBC_QUERY_STRING}' is not applicable while writing.") | ||
|
|
||
| val destinationTable = parameters(JDBC_TABLE_NAME) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Is it bad to change destinationTable to table for simplicity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu I had it as table and refactored it just before i pushed :-). I will change it back.
| ) | ||
| case (None, None) => | ||
| throw new IllegalArgumentException( | ||
| s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove braces: '{' and '}'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu Sure.
| ) | ||
| case (Some(name), None) => | ||
| if (name.isEmpty) { | ||
| throw new IllegalArgumentException(s"Option '${JDBC_TABLE_NAME}' can not be empty.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu OK.
| } | ||
| case (None, Some(subquery)) => | ||
| if (subquery.isEmpty) { | ||
| throw new IllegalArgumentException(s"Option `${JDBC_QUERY_STRING}` can not be empty.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu OK.
|
Test build #92325 has finished for PR 21590 at commit
|
|
retest this please |
|
Test build #92328 has finished for PR 21590 at commit
|
gatorsmile
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except a minor comment.
| throw new AnalysisException( | ||
| s"Table or view '${options.table}' already exists. SaveMode: ErrorIfExists.") | ||
| s"Table or view '${options.table}' already exists. " + | ||
| s"SaveMode: ErrorIfExists.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No change, right?
|
Thanks! Merged to master. |
|
Thank you very much @gatorsmile @maropu @viirya @HyukjinKwon @gengliangwang |
What changes were proposed in this pull request?
Here is the description in the JIRA -
Currently, our JDBC connector provides the option
dbtablefor users to specify the to-be-loaded JDBC source table.Normally, users do not fetch the whole JDBC table due to the poor performance/throughput of JDBC. Thus, they normally just fetch a small set of tables. For advanced users, they can pass a subquery as the option.
However, this is straightforward to end users. We should simply allow users to specify the query by a new option
query. We will handle the complexity for them.How was this patch tested?
Added tests in JDBCSuite and JDBCWriterSuite.
Also tested against MySQL, Postgress, Oracle, DB2 (using docker infrastructure) to make sure there are no syntax issues.