[SPARK-6607][SQL] Check invalid characters for Parquet schema and show error messages #5263
Test build #29395 has finished for PR 5263 at commit
This is a good point. Actually all these characters However, personally I think simply replacing these characters with legitimate ones like brackets might be confusing. On the other hand, similar problems can be worked around easily by assigning an alias. So how about this:
@liancheng Your suggestion is fine with me. If no one else objects, I will update the PR accordingly later.
…suggest using Alias.
Test build #29475 has finished for PR 5263 at commit
@liancheng is it ready to go now?
Let's include the attribute name in the error message to make it more clear, and reformat the code a bit:
    private def checkSpecialCharacter(schema: Seq[Attribute]) = {
      schema.map(_.name).foreach { name =>
        if (name.matches(".*[ ,;{}()\n\t=].*")) {
          sys.error(
            s"""Attribute name "$name" contains invalid character(s) among " ,;{}()\n\t=".
               |Please use alias to rename it.
             """.stripMargin.split("\n").mkString)
        }
      }
    }
Now LGTM except for two minor styling issues.
Thanks for working on this! Merging to master.
Hey @viirya, sorry that I forgot to ask you to update the PR description. Although it is already in the Git history, would you mind updating the description for future reference on GitHub?
@liancheng OK, I updated the description.
@viirya Thanks!
'(' and ')' are special characters used in the Parquet schema for type annotation. When we run an aggregation query, we obtain attribute names such as "MAX(a)".
If we directly store the generated DataFrame as a Parquet file, reading it back fails when the stored schema string is parsed.
This PR adds a function that detects these invalid characters in the field names of a Parquet schema. If any invalid characters are found, we show an error message to the user and suggest adding an alias to the field.
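For reference, the check described above can be sketched without any Spark dependency. This is only an illustration under stated assumptions: the object name `ParquetFieldNameCheck` and the helpers `findInvalidNames` and `checkFieldNames` are hypothetical and are not the names used in the merged code; the character set is the one quoted in the review comment. It operates on plain field-name strings instead of `Seq[Attribute]`.

```scala
object ParquetFieldNameCheck {
  // Characters listed in the review comment as invalid in Parquet field names.
  private val invalidChars: Set[Char] = " ,;{}()\n\t=".toSet

  // Hypothetical helper: returns every name that contains at least one invalid character.
  def findInvalidNames(names: Seq[String]): Seq[String] =
    names.filter(name => name.exists(invalidChars.contains))

  // Hypothetical helper: fails on the first offending name and suggests using an alias.
  def checkFieldNames(names: Seq[String]): Unit =
    findInvalidNames(names).headOption.foreach { name =>
      throw new IllegalArgumentException(
        "Attribute name \"" + name + "\" contains invalid character(s). " +
          "Please use an alias to rename it.")
    }
}
```

With this sketch, a name such as "MAX(a)" would be rejected, while an aliased name such as "max_a" would pass. Checking characters one by one, instead of using a single regular expression, is a design choice made here for readability; it also avoids having to reason about how `.` treats line terminators.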