@HyukjinKwon
Member

Parquet supports JSON and BSON as embedded (logical) types. Internally, BSON is represented as binary and JSON as a UTF-8 string.

I searched a bit and found that Apache Drill also supports both in this way: https://drill.apache.org/docs/parquet-format/
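
A minimal sketch of the mapping described above, assuming a reader that only needs to pick a SQL-side type name for each embedded type. The table and function names are hypothetical and are not taken from the patch; only the JSON-to-string and BSON-to-binary pairing comes from the description.

```python
# Hypothetical lookup table (not Spark's code): annotation on a Parquet
# BINARY column -> SQL-side type name, as described in the PR summary.
EMBEDDED_TYPE_TO_SQL = {
    "JSON": "string",  # JSON documents are UTF-8 text
    "BSON": "binary",  # BSON documents are opaque bytes
}


def sql_type_for_embedded(annotation: str) -> str:
    """Return the SQL-side type name for an embedded-type annotation."""
    try:
        return EMBEDDED_TYPE_TO_SQL[annotation]
    except KeyError:
        # Anything else is outside the scope of this sketch.
        raise ValueError(f"not an embedded type annotation: {annotation!r}") from None
```

Keeping the mapping in a table makes the scope explicit: this sketch handles only the two embedded types and rejects everything else.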

@HyukjinKwon
Member Author

retest this please

@HyukjinKwon
Member Author

cc @liancheng

@SparkQA

SparkQA commented Nov 12, 2015

Test build #45724 has finished for PR 9658 at commit d5a9629.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 12, 2015

Test build #45726 has finished for PR 9658 at commit d5a9629.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Hm, again, what's the +?

@SparkQA

SparkQA commented Nov 12, 2015

Test build #45732 has finished for PR 9658 at commit 9f22651.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 12, 2015

Test build #45731 has finished for PR 9658 at commit 66088e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 13, 2015

Test build #45804 has finished for PR 9658 at commit d17a2db.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Nov 13, 2015

Test build #45827 has finished for PR 9658 at commit d17a2db.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Nov 13, 2015

Test build #45835 has finished for PR 9658 at commit d17a2db.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

All the builds pass every test in ParquetIOSuite, and I do not think this change affects other modules such as ML.
I will retest this.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Nov 13, 2015

Test build #45843 has finished for PR 9658 at commit d17a2db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Sorry that I missed this part during the last review. Please always use === instead of == for better assertion error messages.

@HyukjinKwon
Member Author

Thanks! I changed this.

@SparkQA

SparkQA commented Nov 16, 2015

Test build #45963 has finished for PR 9658 at commit 1152636.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

Thanks! Merging to master.

@asfgit closed this in e388b39 on Nov 16, 2015

@marmbrus
Contributor

We've added support for reading these types as strings, but we can't round-trip data without losing the annotation, which might be confusing for users. Perhaps we should also read and write this information from and to the metadata.
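
The round-trip concern can be illustrated with a small model. This is a sketch under stated assumptions, not Spark's reader or writer: the class and function names are hypothetical, and the writer is assumed to annotate string columns as UTF8 and to leave binary columns unannotated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParquetColumn:
    """Hypothetical model of a Parquet BINARY column and its annotation."""
    name: str
    physical_type: str
    annotation: Optional[str]  # "JSON", "BSON", "UTF8", or None


@dataclass(frozen=True)
class SqlField:
    """Hypothetical model of the SQL-side field: only a name and a type."""
    name: str
    data_type: str  # "string" or "binary"


def read_column(col: ParquetColumn) -> SqlField:
    # JSON and UTF8 are read as strings; BSON and unannotated data as binary.
    is_text = col.annotation in ("JSON", "UTF8")
    return SqlField(col.name, "string" if is_text else "binary")


def write_field(field: SqlField) -> ParquetColumn:
    # Assumption: the writer only knows the SQL type, so it can emit
    # UTF8 for strings and no annotation for binary.
    annotation = "UTF8" if field.data_type == "string" else None
    return ParquetColumn(field.name, "BINARY", annotation)


def round_trip(col: ParquetColumn) -> ParquetColumn:
    return write_field(read_column(col))
```

In this model, a column annotated JSON comes back annotated UTF8, and a column annotated BSON comes back with no annotation, because the SQL-side field has nowhere to carry the original annotation.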

@HyukjinKwon
Member Author

Please note that #9754, a cleanup on the master branch, unintentionally updated this change. However, #9754 is supposed to be merged into branch 1.6, whereas this PR targets version 1.7.
I will alert you or open a PR to backport this as soon as branch 1.7 is available.

@liancheng
Contributor

Hm, I don't quite get it... So this PR is only for master (targeting 1.7). I don't think we need to backport this one to anywhere else.

@HyukjinKwon
Member Author

Ah, I got confused for a moment. It doesn't need to be backported.

@liancheng
Contributor

@marmbrus Did you mean the metadata stored in Parquet's key-value user-defined metadata, or the schema metadata in StructField? For primitive JSON/BSON fields, it's a good idea to annotate them in the schema metadata. But that doesn't cover nested fields, e.g. an array field of JSON/BSON elements.

And for "support for reading these types as strings", are you referring to spark.sql.parquet.binaryAsString? Currently this option only converts Parquet binaries without any logical type annotations, thus it doesn't cover binary (JSON) or binary (BSON).
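
A minimal sketch of the distinction drawn here, assuming a single function that decides the SQL-side type of a Parquet binary column. The function name and the `binary_as_string` parameter are hypothetical stand-ins for the behaviour of spark.sql.parquet.binaryAsString as described above; this is not Spark's implementation.

```python
from typing import Optional


def sql_type_for_binary(annotation: Optional[str], binary_as_string: bool) -> str:
    """Model how a Parquet BINARY column could be typed on the SQL side."""
    if annotation is None:
        # Only unannotated binaries are affected by the option.
        return "string" if binary_as_string else "binary"
    if annotation in ("UTF8", "JSON"):
        return "string"
    if annotation == "BSON":
        # The option is ignored here: BSON stays binary either way.
        return "binary"
    raise ValueError(f"annotation not covered by this sketch: {annotation!r}")
```

For example, `sql_type_for_binary(None, True)` yields a string type, while `sql_type_for_binary("BSON", True)` still yields binary, which matches the point that the option does not cover annotated columns.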

@marmbrus
Contributor

> But it doesn't cover nested fields, e.g. an array field of JSON/BSON elements.

That is an unfortunate limitation of our metadata, but it seems like it could be worked around. That said, this is a minor concern.

> "support for reading these types as strings"

I'm just saying that's what this patch does. It just reads them in as text or binary strings of opaque bytes.

asfgit pushed a commit that referenced this pull request Nov 18, 2015
…embedded types)

Parquet supports some JSON and BSON datatypes. They are represented as binary for BSON and string (UTF-8) for JSON internally.

I searched a bit and found Apache drill also supports both in this way, [link](https://drill.apache.org/docs/parquet-format/).

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>

Closes #9658 from HyukjinKwon/SPARK-11692.

(cherry picked from commit e388b39)
Signed-off-by: Michael Armbrust <michael@databricks.com>

@HyukjinKwon deleted the SPARK-11692 branch on September 23, 2016 18:28