[Spark-4041][SQL]attributes names in table scan should converted to lowercase when compare with relation attributes #2884
|
Can one of the admins verify this patch? |
|
Can you add a unit test? |
Nit: it would be safer if you use _.name.toLowerCase == a.name.toLowerCase.
|
@yhuai, it's hard to make a unit test for this since |
|
I think this change is generally safe. LGTM, thanks. |
|
QA tests have started for PR 2884 at commit
|
|
QA tests have finished for PR 2884 at commit
|
|
test failed due to streaming compile error, can you retest this? |
|
QA tests have started for PR 2884 at commit
|
|
QA tests have finished for PR 2884 at commit
|
|
Hm, the failure was caused by a known Jenkins configuration issue. |
|
retest this please |
|
QA tests have started for PR 2884 at commit
|
|
QA tests have finished for PR 2884 at commit
|
|
Test PASSed. |
Wait, should this be done by name at all? Couldn't we be using an AttributeMap from Attribute->ordinal instead?
Yes, column names are case-insensitive in Hive, so we should use lowercase names in the hive module (changing only this spot is not enough; we also need to convert to lowercase here: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L273).
I don't think an AttributeMap can fix this problem. How about adding a lowerName to Attribute and using that method in the hive module instead?
|
Added a test case for the lowercase issue; the test throws an NPE if the names are not converted to lowercase |
|
Great, thanks for finding this and adding a test. Regarding the implementation, I'd like to try to avoid doing too much string munging as it's generally easy to forget to do (hence the issue). Also, in general we try to avoid looking at string names anywhere other than in analysis. This is the whole idea behind having expression ids in AttributeReferences (and the idea behind AttributeMaps). Since we can't completely get away from string names when working with Hive, what do you think about this approach: https://github.com/marmbrus/spark/compare/hiveTableScanCase I think this more cleanly isolates the need to reason about case sensitivity into the analysis phase. |
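For readers following along, here is a rough sketch of the general idea of resolving columns by expression id rather than by name. It is not the content of the linked branch, and ExprId, AttrRef, and OrdinalLookup are simplified, hypothetical stand-ins rather than Catalyst's actual classes.

```scala
// Hypothetical, simplified model of an attribute reference that carries an expression id.
final case class ExprId(id: Long)
final case class AttrRef(name: String, exprId: ExprId)

object OrdinalLookup {
  // Map each relation attribute's expression id to its column ordinal.
  def ordinals(relationOutput: Seq[AttrRef]): Map[ExprId, Int] =
    relationOutput.zipWithIndex.map { case (a, i) => a.exprId -> i }.toMap

  // Resolve requested attributes by id; names (and their case) are never compared.
  def neededColumnIds(relationOutput: Seq[AttrRef], requested: Seq[AttrRef]): Seq[Int] = {
    val byId = ordinals(relationOutput)
    requested.flatMap(a => byId.get(a.exprId))
  }
}
```

In this model a requested attribute spelled "VALUE" still resolves to the ordinal of the relation's "value" column, provided analysis gave both the same expression id.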
|
Cool, I think this is better |
|
retest this please |
|
Test build #473 has started for PR 2884 at commit
|
|
Test build #473 has finished for PR 2884 at commit
|
|
retest this again, it seems Jenkins got something wrong and failed in |
|
Test build #475 has started for PR 2884 at commit
|
|
QA tests have started for PR 2884 at commit
|
|
|
|
QA tests have finished for PR 2884 at commit
|
|
Test build #475 has finished for PR 2884 at commit
|
|
Minor comment: In the future please put SPARK-XXXX in all capitals in the title so that our merge scripts recognize it. Thanks! Thanks for working on this! Merged to master. |
In MetastoreRelation the attribute names are lowercase because Hive uses lowercase for field names, so we should convert the attribute names in the table scan to lowercase in indexWhere(_.name == a.name). neededColumnIDs may be incorrect if the names are not converted to lowercase.
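
The mismatch described above can be reproduced outside Spark with a minimal sketch. Attr and NeededColumns are hypothetical stand-ins, not Spark's classes; only the indexWhere comparison mirrors the expression quoted in the description, and the lowercased variant follows the review suggestion.

```scala
// Simplified stand-in for a Catalyst attribute; not Spark's actual class.
final case class Attr(name: String)

object NeededColumns {
  // Relation attributes as Hive reports them: always lowercase.
  val relationOutput: Seq[Attr] = Seq(Attr("key"), Attr("value"))

  // Exact string match: a requested "KEY" is not found, so indexWhere yields -1.
  def idsCaseSensitive(requested: Seq[Attr]): Seq[Int] =
    requested.map(a => relationOutput.indexWhere(_.name == a.name))

  // Both sides lowercased before comparing, as suggested in review.
  def idsCaseInsensitive(requested: Seq[Attr]): Seq[Int] =
    requested.map(a => relationOutput.indexWhere(_.name.toLowerCase == a.name.toLowerCase))
}
```

With the exact match, a query that spells a column in a different case gets an index of -1 for it, which is how the needed column ids end up wrong.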