Conversation

@dilipbiswal (Contributor) commented Jun 19, 2018

What changes were proposed in this pull request?

Here is the description from the JIRA:

Currently, our JDBC connector provides the option dbtable for users to specify the to-be-loaded JDBC source table.

val jdbcDf = spark.read
  .format("jdbc")
  .option("dbtable", "dbName.tableName")
  .options(jdbcCredentials) // jdbcCredentials: Map[String, String]
  .load()

Normally, users do not fetch the whole JDBC table, due to the poor performance/throughput of JDBC; they just fetch a small subset of it. Advanced users can pass a subquery as the option:

val query = """ (select * from tableName limit 10) as tmp """
val jdbcDf = spark.read
  .format("jdbc")
  .option("dbtable", query)
  .options(jdbcCredentials) // jdbcCredentials: Map[String, String]
  .load()

However, this is not straightforward for end users. We should simply allow users to specify the query via a new option, query. We will handle the complexity for them.

val query = """select * from tableName limit 10"""
val jdbcDf = spark.read
  .format("jdbc")
  .option("query", query)
  .options(jdbcCredentials) // jdbcCredentials: Map[String, String]
  .load()

How was this patch tested?

Added tests in JDBCSuite and JDBCWriterSuite.
Also tested against MySQL, PostgreSQL, Oracle, and DB2 (using the Docker integration-test infrastructure) to make sure there are no syntax issues.

@SparkQA commented Jun 19, 2018

Test build #92073 has finished for PR 21590 at commit c8ed9b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

require(
!(tableName.isDefined && query.isDefined),
s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified."
)
Member

Can you put these requires at the head of this section (line 68)?

Contributor Author

@maropu These two requires use tableName and query, which are computed in the lines before. That's why I have placed these two requires after.

| .option("partitionColumn", "subq.c1"
| .load()
""".stripMargin
)
Member

Can't we define these two parameters simultaneously? How about the case where the columns in the query output include a partition column?

Contributor Author

@maropu Since we auto-generate a subquery alias here for ease of use, we are disallowing the query option together with partition columns, as users wouldn't know how to qualify the partition columns given that the subquery alias is generated implicitly. In this case, we ask them to use the existing dbtable option to specify the query, where they are in control of the alias themselves. Another option I considered was to introduce "queryAlias" as another option, but I decided against it for simplicity.
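
For illustration, a minimal sketch of that dbtable workaround, mirroring the hint in the error message quoted later in the thread (URL and bound values here are placeholders):

// Sketch of the dbtable workaround: the user supplies the subquery alias
// ("subq") explicitly, so partitionColumn can be qualified against it.
val df = spark.read
  .format("jdbc")
  .option("url", jdbcUrl) // placeholder, assumed defined elsewhere
  .option("dbtable", "(select c1, c2 from t1) as subq")
  .option("partitionColumn", "subq.c1")
  .option("lowerBound", "1")
  .option("upperBound", "100")
  .option("numPartitions", "2")
  .load()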

Member

@maropu The new option query is just syntactic sugar to simplify the work of many basic JDBC users. We can improve it in the future, for example by parsing the user-specified query and making all the other options work.

@maropu (Member) commented Jun 24, 2018

I was thinking of the case below (this can be accepted with dbtable but not with query?):

postgres=# select * from t;
 c0 | c1 | c2
----+----+----
  1 |  1 |  1
  2 |  2 |  2
  3 |  3 |  3
(3 rows)

scala> :paste
val df = spark.read
  .format("jdbc")
  .option("driver", "org.postgresql.Driver")
  .option("url", "jdbc:postgresql://localhost:5432/postgres?user=maropu")
  .option("dbtable", "(select c0 p0, c1 p1, c2 p2 from t where c0 > 1) t")
  .option("partitionColumn", "p2")
  .option("lowerBound", "1")
  .option("upperBound", "3")
  .option("numPartitions", "2")
  .load()

scala> df.show
+---+---+---+                                                                   
| p0| p1| p2|
+---+---+---+
|  2|  2|  2|
|  3|  3|  3|
+---+---+---+

Member

@dilipbiswal Could you reply to @maropu with an example?

Contributor Author

@maropu Currently we disallow it, to be on the safe side. Let's take your example. When using the query option to pass on the query, we basically expect the user to supply

select c0 p0, c1 p1, c2 p2 from t where c0 > 1

In Spark, we will parenthesize the query and add an alias to conform to the table subquery syntax. Given that the user input the above query, they could decide to qualify the partition column names with the table name. So they could do the following:

val df = spark.read
  .format("jdbc")
  .option("driver", "org.postgresql.Driver")
  .option("url", "jdbc:postgresql://localhost:5432/postgres?user=maropu")
  .option("query", "select c0 p0, c1 p1, c2 p2 from t where c0 > 1")
  .option("partitionColumn", "t.p2")  ==> User qualifies the column names.
  .option("lowerBound", "1")
  .option("upperBound", "3")
  .option("numPartitions", "2")
  .load()

In this case we will end up generating a query of the following form:

select * from (select c0 p0, c1 p1, c2 p2 from t where c0 > 1) __SPARK_GEN_ALIAS where t.p2 >= 1 and t.p2 <=3

However, this would be an invalid query. Note also that with the query option it is possible to specify a complex query involving joins.

That's the reason we disallow it, to be on the safe side. With the dbtable option, users are responsible for explicitly specifying the alias, and they know how to qualify the partition columns.

Let's see if we can improve this in the future. If you have some ideas, please let us know.

Member

aha, ok. Thanks for the kind explanation.

val tableExpression = tableName.map(_.trim).getOrElse {
// We have ensured in the code above that either dbtable or query is specified.
query.get match {
case subq if subq.nonEmpty => s"(${subq}) spark_gen_${curId.getAndIncrement()}"
Member

Do we need curId? Is a constant name bad?

@dilipbiswal (Contributor Author) commented Jun 19, 2018

@maropu I don't mind using a constant name. "spark_gen_alias"?

Member

Yea, simpler is better. Btw, do we essentially need the alias name for a query? When we currently describe a query in dbtable, it seems we don't need the name?

Contributor Author

@maropu Yeah, we need an alias. Systems like PostgreSQL require a mandatory table subquery alias.

@dilipbiswal (Contributor Author)

@gatorsmile Sorry to be late on this. Please look at this when you have time.

val e1 = intercept[RuntimeException] {
val df = spark.read.format("jdbc")
.option("Url", urlWithUserAndPass)
.option("query", query)
Contributor Author

@HyukjinKwon Thanks. I will update the doc.

}
}

require(tableExpression.nonEmpty,
Member

The error check and error message here are confusing; they seem to tell the user that the two options can both be specified.
Maybe we should just check the defined one and improve the error message.

Contributor Author

@gengliangwang I see your point. Does this read better to you?

require(tableOrQuery.nonEmpty,
    s"Empty string is not allowed in either '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' options"
  )

val url = parameters(JDBC_URL)
// name of table
val table = parameters(JDBC_TABLE_NAME)
val tableName = parameters.get(JDBC_TABLE_NAME)
Member

Personally I prefer:

  val tableExpression = if (parameters.isDefinedAt(JDBC_TABLE_NAME)) {
    require(!parameters.isDefinedAt(JDBC_QUERY_STRING), "...")
    parameters(JDBC_TABLE_NAME).trim
  } else {
    require(parameters.isDefinedAt(JDBC_QUERY_STRING), "...")
    s"(${parameters(JDBC_QUERY_STRING)}) spark_gen_${curId.getAndIncrement()}"
  }

Contributor Author

@gengliangwang Thanks. Actually, I had tried a couple of different ways; somehow I found this a little hard to follow when I embed the error messages. I like to check things upfront, with comments on top, which is easy to follow. But if others find this version easy to follow as well, then I will change it.

val tableExpression = if (parameters.isDefinedAt(JDBC_TABLE_NAME)) {
    require(!parameters.isDefinedAt(JDBC_QUERY_STRING),
      s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified."
    )
    parameters(JDBC_TABLE_NAME).trim
  } else {
    require(parameters.isDefinedAt(JDBC_QUERY_STRING),
      s"Option '$JDBC_TABLE_NAME' or '$JDBC_QUERY_STRING' is required."
    )
    s"(${parameters(JDBC_QUERY_STRING)}) spark_gen_${curId.getAndIncrement()}"
  }

@SparkQA commented Jun 20, 2018

Test build #92118 has finished for PR 21590 at commit 8920793.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member) commented Jun 20, 2018

I think the formatting of the example code in the PR description is not well done, e.g. *{color:#ff0000}.

// We have ensured in the code above that either dbtable or query is specified.
query.get match {
case subq if subq.nonEmpty => s"(${subq}) spark_gen_${curId.getAndIncrement()}"
case subq => subq
Member

hmm, an empty subquery?

Member

OK, I saw you forbid it later.

A query that will be used to read data into Spark. The specified query will be parenthesized and used
as a subquery in the <code>FROM</code> clause. Spark will also assign a alias to the subquery clause.
As an example, spark will issue a query of the following form to the datasource.<br>
<code> SELECT &lt;columns&gt; FROM (&lt;user_specified_query&gt;) spark_generated_alias
Member

We can mention that dbtable and query can't be specified at the same time.

Contributor Author

@viirya OK, I will add this.

<td><code>query</code></td>
<td>
A query that will be used to read data into Spark. The specified query will be parenthesized and used
as a subquery in the <code>FROM</code> clause. Spark will also assign a alias to the subquery clause.
Member

Is it necessary to mention this query alias? Meaningful to end users?

Contributor Author

@viirya I think it's better to let users know how we generate the FROM clause; that way they can choose to qualify the partition columns if needed. However, if you strongly feel otherwise, I will remove it from the doc.

)

// table name or a table expression.
val tableExpression = tableName.map(_.trim).getOrElse {
Member

expression sounds a bit confusing. tableOrQuery? tableSource?

Contributor Author

@viirya OK, I will change it.

val tableExpression = tableName.map(_.trim).getOrElse {
// We have ensured in the code above that either dbtable or query is specified.
query.get match {
case subq if subq.nonEmpty => s"(${subq}) spark_gen_${curId.getAndIncrement()}"
Member

nit: subq -> subQuery.

Contributor Author

@viirya Will change.

@SparkQA commented Jun 21, 2018

Test build #92151 has finished for PR 21590 at commit 12c0523.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

)


// ------------------------------------------------------------
Member

nit: revert this line

// name of table
val table = parameters(JDBC_TABLE_NAME)
val tableName = parameters.get(JDBC_TABLE_NAME)
val query = parameters.get(JDBC_QUERY_STRING)
Member

I think, since the tableName and query variables don't need to be exposed to other classes, we can remove them?

Btw, I feel sharing the tableName variable in both write/read paths makes the code somewhat complicated, so how about splitting the variable into two parts: tableOrQuery for reading and outputName for writing?

e.g., d62372a

Contributor Author

@maropu Thank you for taking the time to think about this thoroughly. A couple of questions/comments.

  1. Looks like for the read path we give precedence to dbtable over query. I feel it's good to explicitly disallow this with a clear message in case of ambiguity.
  2. Usage of lazy here (especially to trigger errors) makes me a little nervous; see the sketch after this comment. For example, if we want to introduce a debug statement to print the variables inside the QueryOptions class, things will not work any more, right? That's the reason I had opted to check for the "invalid query option in write path" in the write function itself (i.e. when I am sure of the calling context). Perhaps that's how it's used everywhere, in which case it may be okay to follow the same approach here.

I am okay with this. Let's get an opinion from @gatorsmile. Once I have the final set of comments, I will make the changes. Thanks again.
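
To illustrate the concern in point 2: a require inside a lazy val fires on first access, not at construction time, so when (or whether) validation runs depends on who touches the field first. A minimal sketch, with class and option names invented for illustration:

// Hypothetical sketch: validation deferred by laziness.
class QueryOptionsSketch(parameters: Map[String, String]) {
  // This require does NOT run when the object is constructed;
  // it runs only when tableOrQuery is first accessed.
  lazy val tableOrQuery: String = {
    require(parameters.contains("dbtable") || parameters.contains("query"),
      "Option 'dbtable' or 'query' is required.")
    parameters.getOrElse("dbtable", s"(${parameters("query")}) subquery_alias")
  }
}

val opts = new QueryOptionsSketch(Map.empty) // constructs fine; no error yet
// opts.tableOrQuery                         // the require fires only here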

Member

Yea, it's okay to depend on Xiao's opinion.

@gatorsmile (Member) left a comment

Great work!

subquery in parentheses.
The JDBC table that should be read from or written into. Note that when using it in the read
path anything that is valid in a <code>FROM</code> clause of a SQL query can be used.
For example, instead of a full table you could also use a subquery in parentheses. Its not
Member

Nit: It is

<td><code>query</code></td>
<td>
A query that will be used to read data into Spark. The specified query will be parenthesized and used
as a subquery in the <code>FROM</code> clause. Spark will also assign a alias to the subquery clause.
Member

a alias -> an alias

<td>
A query that will be used to read data into Spark. The specified query will be parenthesized and used
as a subquery in the <code>FROM</code> clause. Spark will also assign a alias to the subquery clause.
As an example, spark will issue a query of the following form to the datasource.<br><br>
Member

datasource -> JDBC source

as a subquery in the <code>FROM</code> clause. Spark will also assign a alias to the subquery clause.
As an example, spark will issue a query of the following form to the datasource.<br><br>
<code> SELECT &lt;columns&gt; FROM (&lt;user_specified_query&gt;) spark_gen_alias</code><br><br>
Its not allowed to specify `dbtable` and `query` options at the same time.
Member

Its -> it is

Member

Also document the limitation of query when used with partitionColumn, and the workaround?

// name of table
val table = parameters(JDBC_TABLE_NAME)
val tableName = parameters.get(JDBC_TABLE_NAME)
val query = parameters.get(JDBC_QUERY_STRING)
Member

Another option is to follow what we are doing in another PR: #21247? We are facing the same issue there: the options are shared by both read and write paths, but the limitations are different.

val tableOrQuery = tableName.map(_.trim).getOrElse {
// We have ensured in the code above that either dbtable or query is specified.
query.get match {
case subQuery if subQuery.nonEmpty => s"(${subQuery}) spark_gen_${curId.getAndIncrement()}"
Member

__SPARK_GEN_JDBC_SUBQUERY_NAME_${curId.getAndIncrement()}__?
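
For clarity, the suggestion amounts to something like the following (alias generation only), yielding queries of the form SELECT <columns> FROM (<user_specified_query>) __SPARK_GEN_JDBC_SUBQUERY_NAME_0__:

// Sketch: a uniquely numbered, clearly Spark-internal subquery alias.
case subQuery if subQuery.nonEmpty =>
  s"(${subQuery}) __SPARK_GEN_JDBC_SUBQUERY_NAME_${curId.getAndIncrement()}__"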

)

// table name or a table expression.
val tableOrQuery = tableName.map(_.trim).getOrElse {
Member

Using a tuple match here?

(tableName, query) match {
  case ...
}
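
A sketch of how the tuple match could look, assembled from the case arms quoted in the later hunks (error wording follows those hunks and may not match the final merged code exactly):

// Sketch: resolve dbtable/query into a single table-or-subquery expression.
val tableOrQuery = (tableName, query) match {
  case (Some(_), Some(_)) =>
    throw new IllegalArgumentException(
      s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified.")
  case (None, None) =>
    throw new IllegalArgumentException(
      s"Option '$JDBC_TABLE_NAME' or '$JDBC_QUERY_STRING' is required.")
  case (Some(name), None) =>
    if (name.isEmpty) {
      throw new IllegalArgumentException(s"Option '$JDBC_TABLE_NAME' can not be empty.")
    }
    name.trim
  case (None, Some(subQuery)) =>
    if (subQuery.isEmpty) {
      throw new IllegalArgumentException(s"Option '$JDBC_QUERY_STRING' can not be empty.")
    }
    s"(${subQuery}) __SPARK_GEN_JDBC_SUBQUERY_NAME_${curId.getAndIncrement()}__"
}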

|spark.read.format("jdbc")
| .option("dbtable", "(select c1, c2 from t1) as subq")
| .option("partitionColumn", "subq.c1"
| .load()
Member

Great! Please add them to the doc as I mentioned above?

| .option("partitionColumn", "subq.c1"
| .load()
""".stripMargin
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maropu The new option query is just a syntactic sugar for simplifying the work from many basic JDBC users. We can improve it in the future. For example, parsing the user-specified query and make all the other options work.

require(
options.tableName.isDefined,
s"Option '${JDBCOptions.JDBC_TABLE_NAME}' is required. " +
s"Option '${JDBCOptions.JDBC_QUERY_STRING}' is not applicable while writing.")
Member

Let us create a JDBCOptionsInWrite?
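
A rough sketch of what such a class might look like, assuming the base JDBCOptions exposes tableName: Option[String], the JDBC_TABLE_NAME / JDBC_QUERY_STRING constants, and a CaseInsensitiveMap-based constructor (auxiliary constructor plumbing elided):

// Sketch: write-path options that demand dbtable and reject query.
class JdbcOptionsInWrite(parameters: CaseInsensitiveMap[String])
  extends JDBCOptions(parameters) {

  import JDBCOptions._

  require(
    tableName.isDefined,
    s"Option '$JDBC_TABLE_NAME' is required. " +
      s"Option '$JDBC_QUERY_STRING' is not applicable while writing.")

  // In the write path the table name is always present after the check above.
  val table = tableName.get
}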

@dilipbiswal (Contributor Author)

@gatorsmile Thanks a lot. I will process your comments and get back.

@SparkQA commented Jun 26, 2018

Test build #92319 has finished for PR 21590 at commit 765de0b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 26, 2018

Test build #92318 has finished for PR 21590 at commit adf5f43.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
  • class JdbcOptionsInWrite(

@dilipbiswal (Contributor Author)

@gatorsmile @maropu I have hopefully addressed the comments. Please take a look when you get a chance.

val JDBC_SESSION_INIT_STATEMENT = newOption("sessionInitStatement")
}

class JdbcOptionsInWrite(
@maropu (Member) commented Jun 26, 2018

How about adding JdbcOptionsInRead for read-only options, then making the base JDBCOptions a trait?
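
Roughly, the suggested hierarchy could take this shape (hypothetical shape only; the real option fields are omitted):

// Hypothetical sketch of the suggested refactoring.
trait JDBCOptionsBase {
  def parameters: CaseInsensitiveMap[String]
  // options shared by both paths (url, driver, ...) would live here
}

class JdbcOptionsInRead(val parameters: CaseInsensitiveMap[String])
  extends JDBCOptionsBase {
  // read-only options (query, partitionColumn, fetchsize, ...)
}

class JdbcOptionsInWrite(val parameters: CaseInsensitiveMap[String])
  extends JDBCOptionsBase {
  // write-only options (createTableOptions, isolationLevel, ...)
}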

Contributor Author

@maropu Can I take this on as a follow-up? The reason is I am not fully familiar with all the options; I need to study them a bit more before I refactor them.

Member

That depends on the committer's decision: @gatorsmile

}
}

// ------------------------------------------------------------
Member

Don't need to remove this

Contributor Author

@maropu OK.

val sessionInitStatement = parameters.get(JDBC_SESSION_INIT_STATEMENT)
}

object JDBCOptions {
Member

Move JDBCOptions to the end of this file

Contributor Author

@maropu Sure.

s"Option '${JDBCOptions.JDBC_QUERY_STRING}' is not applicable while writing.")

val destinationTable = parameters(JDBC_TABLE_NAME)
}
@maropu (Member) commented Jun 26, 2018

nit: Is it bad to change destinationTable to table for simplicity?

Contributor Author

@maropu I had it as table and refactored it just before I pushed :-). I will change it back.

)
case (None, None) =>
throw new IllegalArgumentException(
s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required."
Member

nit: remove braces: '{' and '}'

Contributor Author

@maropu Sure.

)
case (Some(name), None) =>
if (name.isEmpty) {
throw new IllegalArgumentException(s"Option '${JDBC_TABLE_NAME}' can not be empty.")
Member

ditto

Contributor Author

@maropu OK.

}
case (None, Some(subquery)) =>
if (subquery.isEmpty) {
throw new IllegalArgumentException(s"Option `${JDBC_QUERY_STRING}` can not be empty.")
Member

ditto

Contributor Author

@maropu OK.

@SparkQA commented Jun 26, 2018

Test build #92325 has finished for PR 21590 at commit c083e13.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented Jun 26, 2018

retest this please

@SparkQA commented Jun 26, 2018

Test build #92328 has finished for PR 21590 at commit c083e13.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member) left a comment

LGTM except a minor comment.

throw new AnalysisException(
s"Table or view '${options.table}' already exists. SaveMode: ErrorIfExists.")
s"Table or view '${options.table}' already exists. " +
s"SaveMode: ErrorIfExists.")
Member

No change, right?

@gatorsmile (Member)

Thanks! Merged to master.

@asfgit closed this in 02f8781 on Jun 26, 2018
@dilipbiswal (Contributor Author)

Thank you very much @gatorsmile @maropu @viirya @HyukjinKwon @gengliangwang
