[SPARK-29644][SQL][2.4] Corrected ShortType and ByteType mapping to SmallInt and TinyInt in JDBCUtils #26549
What changes were proposed in this pull request?
Corrected the mapping of ShortType and ByteType to SmallInt and TinyInt, and corrected the setter methods so that ShortType and ByteType values are set with setShort() and setByte(). The changes are in JDBCUtils.scala.
Fixed unit test cases where applicable and added new E2E test cases to test table read/write using ShortType and ByteType.
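To make the setter change concrete, here is a minimal sketch that runs without Spark or a database. `SetterSketch`, `stub`, `setShortValue` and `setByteValue` are hypothetical names used only for illustration; the stub is a `java.lang.reflect.Proxy` that records which `PreparedStatement` method was invoked. It is not the code in JDBCUtils.scala, only an assumption about the shape of the fix described above.

```scala
import java.lang.reflect.{InvocationHandler, Method, Proxy}
import java.sql.PreparedStatement
import scala.collection.mutable.ArrayBuffer

object SetterSketch {
  // Names of the methods invoked on the stub statement, in call order.
  val calls: ArrayBuffer[String] = ArrayBuffer.empty[String]

  // Hypothetical stub: a dynamic proxy that only records method names.
  val stub: PreparedStatement = Proxy.newProxyInstance(
    classOf[PreparedStatement].getClassLoader,
    Array[Class[_]](classOf[PreparedStatement]),
    new InvocationHandler {
      override def invoke(proxy: AnyRef, method: Method, args: Array[AnyRef]): AnyRef = {
        calls += method.getName
        null // the setters used below return void, so null is acceptable
      }
    }
  ).asInstanceOf[PreparedStatement]

  // JDBC parameters are 1-based, hence pos + 1 (as in the snippets quoted in this PR).
  def setShortValue(stmt: PreparedStatement, pos: Int, v: Short): Unit =
    stmt.setShort(pos + 1, v)

  def setByteValue(stmt: PreparedStatement, pos: Int, v: Byte): Unit =
    stmt.setByte(pos + 1, v)
}
```

Calling `SetterSketch.setShortValue(SetterSketch.stub, 0, 7.toShort)` records `setShort` rather than `setInt`, which is the behavior the description asks for.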
Problems
- In master, lines 547 and 551 of JDBCUtils.scala set ShortType and ByteType values as Integers rather than as Short and Byte respectively. The issue was pointed out by @maropu.
```
case ShortType =>
  (stmt: PreparedStatement, row: Row, pos: Int) =>
    stmt.setInt(pos + 1, row.getShort(pos))
case ByteType =>
  (stmt: PreparedStatement, row: Row, pos: Int) =>
    stmt.setInt(pos + 1, row.getByte(pos))
```
- Also, at line 247 of JDBCUtils.scala, TinyInt is wrongly interpreted as IntegerType in getCatalystType()
``` case java.sql.Types.TINYINT => IntegerType ```
- At line 172, ShortType was wrongly mapped to the INTEGER database type
``` case ShortType => Option(JdbcType("INTEGER", java.sql.Types.SMALLINT)) ```
- Throughout the tests, ShortType and ByteType were being interpreted as IntegerType.
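The mapping this PR aims for can be sketched as follows. This is a simplified illustration, not the code in JDBCUtils.scala: `SimpleType`, `ByteT`, `ShortT`, `IntT`, `SimpleJdbcType` and `MappingSketch` are hypothetical stand-ins for Spark's Catalyst types and `JdbcType`, introduced so the example runs without Spark. Only the `java.sql.Types` constants are real.

```scala
import java.sql.Types

// Hypothetical stand-ins for Spark's ByteType, ShortType and IntegerType.
sealed trait SimpleType
case object ByteT extends SimpleType
case object ShortT extends SimpleType
case object IntT extends SimpleType

// Hypothetical simplified counterpart of JdbcType(databaseTypeDefinition, jdbcNullType).
final case class SimpleJdbcType(definition: String, jdbcCode: Int)

object MappingSketch {
  // Write path: each Spark-side type maps to the matching SQL column type.
  def toJdbcType(t: SimpleType): SimpleJdbcType = t match {
    case ByteT  => SimpleJdbcType("TINYINT", Types.TINYINT)
    case ShortT => SimpleJdbcType("SMALLINT", Types.SMALLINT)
    case IntT   => SimpleJdbcType("INTEGER", Types.INTEGER)
  }

  // Read path: JDBC type codes map back to the narrow types, not to IntT.
  def toCatalystType(jdbcCode: Int): Option[SimpleType] = jdbcCode match {
    case Types.TINYINT  => Some(ByteT)
    case Types.SMALLINT => Some(ShortT)
    case Types.INTEGER  => Some(IntT)
    case _              => None // other types are out of scope for this sketch
  }
}
```

With this shape, a value written as `ShortT` round-trips through `SMALLINT` back to `ShortT` instead of widening to an integer.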
Why are the changes needed?
Given type should be set using the right type.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Corrected unit test cases where applicable. Validated in CI/CD.
Added/fixed test cases in MsSqlServerIntegrationSuite.scala, PostgresIntegrationSuite.scala, and MySQLIntegrationSuite.scala to write/read tables from a dataframe with columns of ShortType and ByteType. Validated manually as follows:
./build/mvn install -DskipTests
./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12
ok to test

@dongjoon-hyun @maropu @srowen here's the PR that ports #26301 to Spark 2.4. Thanks

Yup. I've been waiting. (BTW, actually, I expected your follow-up PR on test case refactoring on master branch first.) 😄

I think this could be OK; it's a behavior change though? Is it OK for a 2.4.x maintenance release? I guess it depends on how much we think this is more bug fix than improvement.

Yes. I agree with you, @srowen. Due to this patch, the user-facing behavior will be fixed. cc @gatorsmile and @cloud-fan

@dongjoon-hyun That's on my mind. It will take some time, as I have to better understand the current suite and my availability over the next few weeks. But it's definitely on me to fix.

Test build #113894 has finished for PR 26549 at commit

Can you update the description above?

dongjoon-hyun left a comment
+1, LGTM.
…mallInt and TinyInt in JDBCUtils

This is a port of SPARK-29644 to 2.4

### What changes were proposed in this pull request?
Corrected the mapping of ShortType and ByteType to SmallInt and TinyInt, and corrected the setter methods so that ShortType and ByteType values are set with setShort() and setByte(). The changes are in JDBCUtils.scala.
Fixed unit test cases where applicable and added new E2E test cases to test table read/write using ShortType and ByteType.

Problems
- In master, lines 547 and 551 of JDBCUtils.scala set ShortType and ByteType values as Integers rather than as Short and Byte respectively. The issue was pointed out by maropu.
```
case ShortType =>
  (stmt: PreparedStatement, row: Row, pos: Int) =>
    stmt.setInt(pos + 1, row.getShort(pos))
case ByteType =>
  (stmt: PreparedStatement, row: Row, pos: Int) =>
    stmt.setInt(pos + 1, row.getByte(pos))
```
- Also, at line 247 of JDBCUtils.scala, TinyInt is wrongly interpreted as IntegerType in getCatalystType()
``` case java.sql.Types.TINYINT => IntegerType ```
- At line 172, ShortType was wrongly mapped to the INTEGER database type
``` case ShortType => Option(JdbcType("INTEGER", java.sql.Types.SMALLINT)) ```
- Throughout the tests, ShortType and ByteType were being interpreted as IntegerType.

### Why are the changes needed?
Given type should be set using the right type.

### Does this PR introduce any user-facing change?
Yes.
- Users will now be able to create tables where the dataframe contains ByteType when using the JDBC connector in overwrite mode.
- Users will see that SQL-side tables are created with the right data types. ShortType in Spark will translate to smallint and ByteType to TinyInt on the SQL side. This will result in smaller db tables where applicable.

### How was this patch tested?
Corrected unit test cases where applicable. Validated in CI/CD.
Added/fixed test cases in MsSqlServerIntegrationSuite.scala, PostgresIntegrationSuite.scala, and MySQLIntegrationSuite.scala to write/read tables from a dataframe with columns of ShortType and ByteType. Validated manually as follows:
```
./build/mvn install -DskipTests
./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12
```

Closes #26549 from shivsood/port_29644_2.4.

Authored-by: shivsood <shivsood@microsoft.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

Merged to 2.4

How serious is it to the end users? I tend to agree with @maropu that this is not a big issue. You know, fixing this in the maintenance releases should be avoided. This will change the data type of the external schema. Normally, that will easily break external applications. Can we revert this?

  |INSERT INTO numbers VALUES (
  |0,
- |255, 32767, 2147483647, 9223372036854775807,
+ |127, 32767, 2147483647, 9223372036854775807,
Wait, how was this possible before? Do we cover unsigned cases too?
Yes, all test cases, including signed and unsigned, are now covered for ByteType and ShortType. Refer to test("SPARK-29644: Write tables with ShortType") and test("SPARK-29644: Write tables with ByteType") in JDBCWriteSuite.scala and MsSQLServerIntegrationSuite.scala.
Earlier it all worked because everything was treated as an integer. Every test case treated ByteType and ShortType as integers. These tests are corrected now.
I meant
- This TINYINT seems able to contain unsigned values (https://dev.mysql.com/doc/refman/8.0/en/integer-types.html) up to 255. How do we handle that?
- Previously the value given here was 255 for TINYINT, which is performed via MySQL if I am not mistaken. How was this possible without the UNSIGNED keyword?
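The range concern raised here can be illustrated with a small sketch. `ByteRangeSketch` and `toSignedByte` are hypothetical names, not part of Spark; the only facts relied on are that a JVM `Byte` is signed 8-bit (-128 to 127) and that `Int.toByte` truncates.

```scala
object ByteRangeSketch {
  // Returns Some(byte) only when the value fits in a signed 8-bit Byte.
  def toSignedByte(v: Int): Option[Byte] =
    if (v >= Byte.MinValue && v <= Byte.MaxValue) Some(v.toByte) else None

  def main(args: Array[String]): Unit = {
    println(toSignedByte(127)) // fits in a signed byte
    println(toSignedByte(255)) // valid for an unsigned TINYINT, but does not fit
    println(255.toByte)        // a plain truncating conversion wraps around
  }
}
```

Under these assumptions, 255 is representable in an unsigned TINYINT column but not in a value read back as a signed byte, which is why the test value in the diff above was lowered to 127.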
@HyukjinKwon.
It seems that we need to revert #23400 for the same reason. What do you think about that?

Fixing bugs always changes the behavior. All existing applications which depend on the bug are naturally affected. I think this is a matter for the release notes or migration guide (not a revert).

These are good questions. If this is not just a bug fix, letting things work that didn't before, it maybe shouldn't be in 2.4. That's the question at #26301 - we can discuss there. If there isn't a quick resolution, we can revert this pending further investigation.

Hi, All.

@shivsood. Please make a followup PR against

Agree, these are significant enough questions that we should reevaluate in master first. Good catch @gatorsmile! Thank you.

Thanks everyone for the comments. I'll send a PR to revert the changes in master.

Along with this, I suggest reverting the following, which mapped SQL

@dongjoon-hyun, it seems #23400 itself is fine, although the test alone might have a potential issue. If the inferred type is ByteType, it should get a byte. If there's an issue in type inference in JDBC, we should fix that code path instead.

+1 for the partial revert. Only the following part is valid although

@HyukjinKwon and I double-checked that. Luckily,

Thanks for the investigation.