Conversation

@space-d-n

What changes were proposed in this pull request?

Overrode the getName method from org.apache.parquet.hadoop.api.WriteSupport in org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport so that it returns the Spark version (to be written into the Parquet file footer later).
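
For illustration, a minimal sketch of the override. The class name, stubbed methods, and version string below are placeholders rather than the actual Spark code; parquet-mr stores the returned name in the footer's key-value metadata under the writer.model.name key.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.hadoop.api.WriteSupport
import org.apache.parquet.io.api.RecordConsumer
import org.apache.spark.sql.catalyst.InternalRow

// Hypothetical, simplified stand-in for Spark's ParquetWriteSupport;
// the real class also performs schema conversion and row writing.
class SparkNameWriteSupport extends WriteSupport[InternalRow] {
  override def init(conf: Configuration): WriteSupport.WriteContext = ???
  override def prepareForWrite(consumer: RecordConsumer): Unit = ???
  override def write(row: InternalRow): Unit = ???

  // The proposed override: parquet-mr records this value in the
  // footer's key-value metadata as "writer.model.name".
  override def getName(): String = "spark " + org.apache.spark.SPARK_VERSION
}
```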

@space-d-n
Author

@dbtsai Hello, I'm sorry for asking you directly, but for some reason Jenkins did not generate the message "Can one of the admins verify this patch?". I just saw that you've reviewed some other PRs. This is my first PR, so maybe I did something wrong while creating it. I would be grateful for a review or any other advice.

@dbtsai
Member

dbtsai commented Aug 29, 2018

Is there any other project writing this into the footer? Tests on reading this back?

@HyukjinKwon
Member

I would also rather have the justification for this change written up, for instance, linking the usage of this name on the Parquet side, potential usage, etc.

…file footer (Added test on reading writer.model.name from footer metadata)
@space-d-n
Author

space-d-n commented Aug 30, 2018

Hello, @dbtsai, @HyukjinKwon. I added a test on reading writer.model.name to the PR. The justification for this change is below.
This is the original issue in the Apache JIRA:
https://issues.apache.org/jira/browse/SPARK-25102
and it refers to this one:
https://issues.apache.org/jira/browse/PARQUET-352
where the justification was given (it makes it possible to identify files written incorrectly by particular object models). Here is also the link to the Parquet repository commit with the corresponding code changes (justification is provided there as well):
apache/parquet-java@dcd1c33
I also found another case in which this change could be useful:
dask/fastparquet#352

@HyukjinKwon
Member

Hi @rdblue, is this roughly the right thing to do here in Spark?

@rdblue
Contributor

rdblue commented Aug 30, 2018

I don't think this fits the intent of the model name. The model name is intended to encode what the data model was that was written to Parquet. I can write Avro records to a Parquet file, for example, and we identify that using "avro" (and this could be done in Spark). That could be used if we need to interpret the data differently from a model, but it probably shouldn't include a version of that data model. The data model doesn't change with a version bump, so I think these should be logically separate.

It would be reasonable to add a "spark.version" property with this information. Other data models add properties to the file's key-value metadata for their own use. Avro adds its schema, for example.
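
For illustration, a sketch of how a data model can attach such a property through WriteSupport.init; parquet-mr copies the WriteContext map into the footer verbatim. The spark.version key, the class, and the placeholder schema are assumptions, not Spark's actual code.

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.hadoop.api.WriteSupport
import org.apache.parquet.io.api.RecordConsumer
import org.apache.parquet.schema.MessageTypeParser

// Hypothetical write support that carries a version property in its
// WriteContext so it ends up in the file's key-value metadata.
class VersionTaggingWriteSupport extends WriteSupport[AnyRef] {
  override def init(conf: Configuration): WriteSupport.WriteContext = {
    val schema = MessageTypeParser.parseMessageType(
      "message example { required int64 id; }") // placeholder schema
    val metadata = Map("spark.version" -> org.apache.spark.SPARK_VERSION)
    new WriteSupport.WriteContext(schema, metadata.asJava)
  }
  override def prepareForWrite(consumer: RecordConsumer): Unit = ???
  override def write(record: AnyRef): Unit = ???
}
```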

@space-d-n
Author

I see your point now. Apparently I was a little confused by the descriptions of the tickets.
I can try to implement that (writing writer.model info like "avro", etc., in Spark) if you give me some direction on how to do it and where to make the changes.
I can also add a "spark.version" property, but if I understand correctly, we'll need to open a new issue in Parquet to do this, am I right?

@rdblue
Contributor

rdblue commented Sep 3, 2018

@npoberezkin, Parquet already supports custom key-value metadata in the file footer. The Spark version would go there.

@dongjoon-hyun
Member

Hi, @npoberezkin. Thank you for your first contribution. Could you update your PR to use custom key-value metadata, per @rdblue's advice above? Also, please use the [SQL] tag instead of [Spark Core] in the PR title.

@gatorsmile
Member

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L902

@rdblue Can we use created_by?

  /** String for application that wrote this file.  This should be in the format
   * <Application> version <App Version> (build <App Build Hash>).
   * e.g. impala version 1.0 (build 6cf94d29b2b7115df4de2c06e2ab4326d721eb55)
   **/
  6: optional string created_by

@gatorsmile
Member

@dongjoon-hyun Do you want to take this over?

@gatorsmile
Member

Also cc @hvanhovell

@dongjoon-hyun
Member

Sure, @gatorsmile .

@dongjoon-hyun
Member

BTW, @rdblue recommended key_value_metadata. Are we going to use created_by instead of key_value_metadata? Could you give me some advice, @gatorsmile and @rdblue?

  /** Optional key/value metadata **/
  5: optional list<KeyValue> key_value_metadata

  /** String for application that wrote this file.  This should be in the format
   * <Application> version <App Version> (build <App Build Hash>).
   * e.g. impala version 1.0 (build 6cf94d29b2b7115df4de2c06e2ab4326d721eb55)
   **/
  6: optional string created_by

@dongjoon-hyun
Member

dongjoon-hyun commented Nov 2, 2018

Currently, we write the metadata like the following.

file:        file:/tmp/p/part-00005-dbb9a9ab-0d6a-49df-9f39-397c8505f22b-c000.snappy.parquet
creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
extra:       org.apache.spark.sql.parquet.row.metadata = {
  "type":"struct",
  "fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]
}

For a Hive table, it looks like the following. So, I prefer to add spark.sql.create.version=2.4.0 to key_value_metadata. I'll make a PR this way.

parameters:{
  spark.sql.sources.schema.part.0={
    "type":"struct",
    "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
  },
  transient_lastDdlTime=1541142761, 
  spark.sql.sources.schema.numParts=1,
  spark.sql.create.version=2.4.0
}

@dongjoon-hyun
Member

That will look like the following.

file:        file:/tmp/p/part-00007-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet
creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
extra:       org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
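
For reference, a sketch of reading those values back through the parquet-mr footer API (the file path is a placeholder):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Open only the footer; no row groups are read.
val input = HadoopInputFile.fromPath(
  new Path("/tmp/p/part-00007.snappy.parquet"), new Configuration())
val reader = ParquetFileReader.open(input)
try {
  val meta = reader.getFooter.getFileMetaData
  println(meta.getCreatedBy) // e.g. parquet-mr version 1.10.0 (build ...)
  println(meta.getKeyValueMetaData.get("org.apache.spark.sql.create.version"))
} finally {
  reader.close()
}
```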

@dongjoon-hyun
Member

It seems that either choice of key, org.apache.spark.sql.create.version or spark.sql.create.version, causes some inconsistency:

  1. If we choose spark.sql.create.version as a key, the Parquet footer will mix prefixes with the existing key, like the following.
extra:       spark.sql.create.version = 3.0.0-SNAPSHOT
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
  2. If we choose org.apache.spark.sql.create.version, it's different from the Hive table property key.

I'll accept the inconsistency of (2) for backward compatibility.

@gatorsmile
Member

Just to confirm: created_by is set to parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)?

@dongjoon-hyun
Member

That is the value used by the Parquet-MR library. We had better not touch it. The Parquet-MR reader can behave differently based on that version in order to handle some older Parquet writer bugs.

@dongjoon-hyun
Member

Hi, All.

A new PR has been made. Please move to #22932 for further discussion.

asfgit closed this in d66a4e8 on Nov 10, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

Currently, Spark writes the Spark version number into Hive table properties with `spark.sql.create.version`.
```
parameters:{
  spark.sql.sources.schema.part.0={
    "type":"struct",
    "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
  },
  transient_lastDdlTime=1541142761,
  spark.sql.sources.schema.numParts=1,
  spark.sql.create.version=2.4.0
}
```

This PR aims to write the Spark version to ORC/Parquet file metadata with `org.apache.spark.sql.create.version` because we already use the `org.apache.` prefix in Parquet metadata. It's different from the Hive table property key `spark.sql.create.version`, but it seems that we cannot change the Hive table property for backward compatibility reasons.

After this PR, ORC and Parquet file generated by Spark will have the following metadata.

**ORC (`native` and `hive` implementations)**
```
$ orc-tools meta /tmp/o
File Version: 0.12 with ...
...
User Metadata:
  org.apache.spark.sql.create.version=3.0.0
```

**PARQUET**
```
$ parquet-tools meta /tmp/p
...
creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
extra:       org.apache.spark.sql.create.version = 3.0.0
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
```

## How was this patch tested?

Pass the Jenkins with newly added test cases.

This closes apache#22255.

Closes apache#22932 from dongjoon-hyun/SPARK-25102.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>