Skip to content

Conversation

@viirya
Copy link
Member

@viirya viirya commented Mar 30, 2015

'(' and ')' are special characters used in Parquet schema for type annotation. When we run an aggregation query, we will obtain attribute name such as "MAX(a)".

If we directly store the generated DataFrame as Parquet file, it causes failure when reading and parsing the stored schema string.

This pr adds the function to detect these invalid characters in field names of a Parquet schema. If any invalid characters are found, we show an error message to the user and suggest the user to add an alias to the field.

@SparkQA
Copy link

SparkQA commented Mar 30, 2015

Test build #29395 has finished for PR 5263 at commit 1de001d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@liancheng
Copy link
Contributor

This is a good point. Actually all these characters ,;{}()\n\t= (note there is a space character at the beginning) can be problematic if they appear in field names, according to MessageTypeParser.

However, personally I think simply replacing these characters with legitimate ones like brackets might be confusing. On the other hand, similar problems can be worked around easily by assigning an alias. So how about this:

  1. Check all field names for invalid characters in convertFromAttributes
  2. Throw an error message when any invalid character is found
  3. In the error message, suggest the user to add an alias to the field explicitly

@viirya
Copy link
Member Author

viirya commented Mar 30, 2015

@liancheng your suggestion is okay for me. If others have no opinion for that, I will send updates later for your suggestion.

@SparkQA
Copy link

SparkQA commented Mar 31, 2015

Test build #29475 has finished for PR 5263 at commit 463dff4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@viirya
Copy link
Member Author

viirya commented Apr 3, 2015

@liancheng is it ready to go now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's include the attribute name in the error message to make it more clear, and reformat the code a bit:

  private def checkSpecialCharacter(schema: Seq[Attribute]) = {
    schema.map(_.name).foreach { name =>
      if (name.matches(".*[ ,;{}()\n\t=].*")) {
        sys.error(
          s"""Attribute name "$name" contains invalid character(s) among " ,;{}()\n\t=".
             |Please use alias to rename it.
           """.stripMargin.split("\n").mkString)
      }
    }
  }

@liancheng
Copy link
Contributor

Now LGTM except for two minor styling issue.

@viirya viirya changed the title [SPARK-6607][SQL] Aggregation attribute name including special chars '(' and ')' should be replaced before generating Parquet schema [SPARK-6607][SQL] Check invalid characters for Parquet schema and show error messages Apr 4, 2015
@liancheng
Copy link
Contributor

Thanks for working on this! Merging to master.

@asfgit asfgit closed this in 7bca62f Apr 4, 2015
@liancheng
Copy link
Contributor

Hey @viirya, sorry that I forgot to ask you to update the PR description. Although it has already been in the Git history, would you mind to update the description for future reference on GitHub?

@viirya
Copy link
Member Author

viirya commented Apr 6, 2015

@liancheng ok, I updated the description.

@liancheng
Copy link
Contributor

@viirya Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants