-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-17477][SQL] SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as its type while hive metastore has Long as its type #15264
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…> Long when parquet files have Int as its type while hive metastore has Long as its type
|
Can one of the admins verify this patch? |
|
@HyukjinKwon Yup. I made a mistake in managing my branches so that I decided to create the PR again. Sorry for this confusion... |
|
@HyukjinKwon thanks for taking a look at the previous patch. I suggested this fix to @wgtmac offline but perhaps I didn't quite understand the "old"/"new" file issue that you mentioned. Can you please elaborate it here? |
|
Hi @sameeragarwal , I just meant if we want to read new and old Parquet files (one having int and one having long for the same column) described in the JIRA and PR description, we could do one of the followings.
If this PR is dealing with only the second case, we should take care of more types. I mean, I guess we don't want multiple PRs and JIRAs for each type, each datasources and vecterized/non-vecterized readers. I might be wrong but this was what I thought. BTW, if this change looks okay, I'd try to reopen my previous one rathet than making duplicated efforts. |
|
I wouldn't stay against but rather neutral if you strongly feel this one is a right fix. We could go merging this and then introduce a general upcasting logic later. |
|
That's a great explanation. Yes, this patch is really targeted at fixing the second issue and it does seem like a subset of SPARK-16544. If you have cycles towards working on a broader fix, it'd be great to have your patch instead; otherwise we can just merge this smaller fix for now (along with a small test case). |
|
@sameeragarwal Thanks. One thing I'd like to sure is, though, do you think this is the right place to fix? I just want to be very sure on this before I reopen my old pr and proceed further. (BTW, I don't mind merging it first as I think it'd take a bit of time to make the general approach merged anyway.) |
|
^ @davies what do you think? Would |
|
@HyukjinKwon @wgtmac I confirmed with Davies that |
|
@sameeragarwal I agree. I'm looking forward to a comprehensive patch for this. Thanks! |
|
@wgtmac @sameeragarwal Yup. For me I dont't mind keeping open or closed. Please feel free to review #14215 then. I will rebase it and then proceed further. Thank you both. |
|
BTW, I just wonder if this PR is closable if we want to do this with more types :). |
|
@HyukjinKwon I agree. Would you have cycles to re-open #14215 by any chance? This is something that'd be great to have that in 2.2. |
|
Yeap, I will try to get that back after finishing up few issues I am currently working on. I just realised that it'd take a bit of time for me to proceed (as I noticed we need a more careful touch for it). Please feel free to take over it if anyone is interested in it. Otherwise, let me try to proceed even if it takes a while. |
## What changes were proposed in this pull request? This PR proposes to close stale PRs. What I mean by "stale" here includes that there are some review comments by reviewers but the author looks inactive without any answer to them more than a month. I left some comments roughly a week ago to ping and the author looks still inactive in these PR below These below includes some PR suggested to be closed and a PR against another branch which seems obviously inappropriate. Given the comments in the last three PRs below, they are probably worth being taken over by anyone who is interested in it. Closes apache#7963 Closes apache#8374 Closes apache#11192 Closes apache#11374 Closes apache#11692 Closes apache#12243 Closes apache#12583 Closes apache#12620 Closes apache#12675 Closes apache#12697 Closes apache#12800 Closes apache#13715 Closes apache#14266 Closes apache#15053 Closes apache#15159 Closes apache#15209 Closes apache#15264 Closes apache#15267 Closes apache#15871 Closes apache#15861 Closes apache#16319 Closes apache#16324 Closes apache#16890 Closes apache#12398 Closes apache#12933 Closes apache#14517 ## How was this patch tested? N/A Author: hyukjinkwon <gurwls223@gmail.com> Closes apache#16937 from HyukjinKwon/stale-prs-close.
What changes were proposed in this pull request?
Using SparkSession in Spark 2.0 to read a Hive table which is stored as parquet files and if there has been a schema evolution from int to long of a column, we will get java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt. To be specific, if there are some old parquet files using int for the column while some new parquet files use long and the Hive metastore uses Long as its type, the aforementioned exception will be thrown. Because Hive and Presto deem this kind of schema evolution is valid, this PR allows writing a int value when its table schema is long in hive metastore.
This is for non-vectorized parquet reader only.
How was this patch tested?
Manual test to create parquet files with int type in the schema and create hive table using long as its type. Then perform spark.sql("select * from table") to query all data from this table.