Conversation

@space-d-n

What changes were proposed in this pull request?

Overrode the getName method from org.apache.parquet.hadoop.api.WriteSupport in org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport so that it returns the Spark version (to be written into the Parquet file footer later).
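
For illustration, a minimal sketch of the override. The class name, stubbed methods, and version string below are placeholders rather than the actual Spark code; parquet-mr stores the returned name in the footer's key-value metadata under the writer.model.name key.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.hadoop.api.WriteSupport
import org.apache.parquet.io.api.RecordConsumer
import org.apache.spark.sql.catalyst.InternalRow

// Hypothetical, simplified stand-in for Spark's ParquetWriteSupport;
// the real class also performs schema conversion and row writing.
class SparkNameWriteSupport extends WriteSupport[InternalRow] {
  override def init(conf: Configuration): WriteSupport.WriteContext = ???
  override def prepareForWrite(consumer: RecordConsumer): Unit = ???
  override def write(row: InternalRow): Unit = ???

  // The proposed override: parquet-mr records this value in the
  // footer's key-value metadata as "writer.model.name".
  override def getName(): String = "spark " + org.apache.spark.SPARK_VERSION
}
```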

@space-d-n
Author

@dbtsai Hello, I'm sorry for asking you directly, but for some reason Jenkins did not generate the message "Can one of the admins verify this patch?". I just saw that you've reviewed some other PRs. This is my first PR, so maybe I did something wrong while creating it. I would be grateful for a review or any other advice.

@dbtsai
Member

dbtsai commented Aug 29, 2018

Is there any other project writing this into the footer? Tests on reading this back?

@HyukjinKwon
Member

I would also rather have the justification for this change written up, for instance, linking the usage of this name on the Parquet side, potential usage, etc.

…file footer (Added test on reading writer.model.name from footer metadata)
@space-d-n
Author

space-d-n commented Aug 30, 2018

Hello, @dbtsai, @HyukjinKwon. I added a test on reading writer.model.name to the PR. The justification for this change is below.
This is the original issue in the Apache JIRA:
https://issues.apache.org/jira/browse/SPARK-25102
and it refers to this one:
https://issues.apache.org/jira/browse/PARQUET-352
where the justification was given (it makes it possible to identify files written incorrectly by particular object models). Here is also the link to the Parquet repository commit with the corresponding code changes (justification is provided there as well):
apache/parquet-java@dcd1c33
I also found another case in which this change could be useful:
dask/fastparquet#352

@HyukjinKwon
Member

Hi @rdblue, is this roughly the right thing to do here in Spark?

@rdblue
Contributor

rdblue commented Aug 30, 2018

I don't think this fits the intent of the model name. The model name is intended to encode what the data model was that was written to Parquet. I can write Avro records to a Parquet file, for example, and we identify that using "avro" (and this could be done in Spark). That could be used if we need to interpret the data differently from a model, but it probably shouldn't include a version of that data model. The data model doesn't change with a version bump, so I think these should be logically separate.

It would be reasonable to add a "spark.version" property with this information. Other data models add properties to the file's key-value metadata for their own use. Avro adds its schema, for example.
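
For illustration, a sketch of how a data model can attach such a property through WriteSupport.init; parquet-mr copies the WriteContext map into the footer verbatim. The spark.version key, the class, and the placeholder schema are assumptions, not Spark's actual code.

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.hadoop.api.WriteSupport
import org.apache.parquet.io.api.RecordConsumer
import org.apache.parquet.schema.MessageTypeParser

// Hypothetical write support that carries a version property in its
// WriteContext so it ends up in the file's key-value metadata.
class VersionTaggingWriteSupport extends WriteSupport[AnyRef] {
  override def init(conf: Configuration): WriteSupport.WriteContext = {
    val schema = MessageTypeParser.parseMessageType(
      "message example { required int64 id; }") // placeholder schema
    val metadata = Map("spark.version" -> org.apache.spark.SPARK_VERSION)
    new WriteSupport.WriteContext(schema, metadata.asJava)
  }
  override def prepareForWrite(consumer: RecordConsumer): Unit = ???
  override def write(record: AnyRef): Unit = ???
}
```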

@space-d-n
Author

I see your point now. Apparently I was a little confused by the descriptions of the tickets.
I can try to implement that (writing writer.model info like "avro", etc., in Spark) if you give me some direction on how to do it and where to make the changes.
I can also add a "spark.version" property, but if I understand correctly, we'll need to open a new issue in Parquet to do this, am I right?

@rdblue
Contributor

rdblue commented Sep 3, 2018

@npoberezkin, Parquet already supports custom key-value metadata in the file footer. The Spark version would go there.

@dongjoon-hyun
Member

Hi, @npoberezkin. Thank you for your first contribution. Could you update your PR to use custom key-value metadata, per @rdblue's advice above? Also, please use the [SQL] tag instead of [Spark Core] in the PR title.

@gatorsmile
Member

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L902

@rdblue Can we use created_by?

  /** String for application that wrote this file.  This should be in the format
   * <Application> version <App Version> (build <App Build Hash>).
   * e.g. impala version 1.0 (build 6cf94d29b2b7115df4de2c06e2ab4326d721eb55)
   **/
  6: optional string created_by

@gatorsmile
Member

@dongjoon-hyun Do you want to take this over?

@gatorsmile
Member

Also cc @hvanhovell

@dongjoon-hyun
Member

Sure, @gatorsmile .

@dongjoon-hyun
Member

BTW, @rdblue recommended key_value_metadata. Are we going to use created_by instead of key_value_metadata? Could you give me some advice, @gatorsmile and @rdblue?

  /** Optional key/value metadata **/
  5: optional list<KeyValue> key_value_metadata

  /** String for application that wrote this file.  This should be in the format
   * <Application> version <App Version> (build <App Build Hash>).
   * e.g. impala version 1.0 (build 6cf94d29b2b7115df4de2c06e2ab4326d721eb55)
   **/
  6: optional string created_by

@dongjoon-hyun
Member

dongjoon-hyun commented Nov 2, 2018

Currently, we write the metadata like the following.

file:        file:/tmp/p/part-00005-dbb9a9ab-0d6a-49df-9f39-397c8505f22b-c000.snappy.parquet
creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
extra:       org.apache.spark.sql.parquet.row.metadata = {
  "type":"struct",
  "fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]
}

For a Hive table, it looks like the following. So, I prefer to add spark.sql.create.version=2.4.0 to key_value_metadata. I'll make a PR this way.

parameters:{
  spark.sql.sources.schema.part.0={
    "type":"struct",
    "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
  },
  transient_lastDdlTime=1541142761, 
  spark.sql.sources.schema.numParts=1,
  spark.sql.create.version=2.4.0
}

@dongjoon-hyun
Member

That will look like the following.

file:        file:/tmp/p/part-00007-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet
creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
extra:       org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
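
For reference, a sketch of reading those values back through the parquet-mr footer API (the file path is a placeholder):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Open only the footer; no row groups are read.
val input = HadoopInputFile.fromPath(
  new Path("/tmp/p/part-00007.snappy.parquet"), new Configuration())
val reader = ParquetFileReader.open(input)
try {
  val meta = reader.getFooter.getFileMetaData
  println(meta.getCreatedBy) // e.g. parquet-mr version 1.10.0 (build ...)
  println(meta.getKeyValueMetaData.get("org.apache.spark.sql.create.version"))
} finally {
  reader.close()
}
```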

@dongjoon-hyun
Member

It seems that either choice of key, org.apache.spark.sql.create.version or spark.sql.create.version, causes some inconsistency:

  1. If we choose spark.sql.create.version as a key, the Parquet footer will mix prefixes with the existing key, like the following.
extra:       spark.sql.create.version = 3.0.0-SNAPSHOT
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
  2. If we choose org.apache.spark.sql.create.version, it's different from the Hive table property key.

I'll accept the inconsistency of (2) for backward compatibility.

@gatorsmile
Member

Just to confirm: created_by is set to parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)?

@dongjoon-hyun
Member

That is the value used by the Parquet-MR library. We had better not touch it. The Parquet-MR reader can behave differently based on that version in order to handle some older Parquet writer bugs.

@dongjoon-hyun
Member

Hi, All.

A new PR has been made. Please move to #22932 for further discussion.

asfgit closed this in d66a4e8 on Nov 10, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

Currently, Spark writes the Spark version number into Hive table properties with `spark.sql.create.version`.
```
parameters:{
  spark.sql.sources.schema.part.0={
    "type":"struct",
    "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
  },
  transient_lastDdlTime=1541142761,
  spark.sql.sources.schema.numParts=1,
  spark.sql.create.version=2.4.0
}
```

This PR aims to write the Spark version to ORC/Parquet file metadata with `org.apache.spark.sql.create.version` because we already use the `org.apache.` prefix in Parquet metadata. It's different from the Hive table property key `spark.sql.create.version`, but it seems that we cannot change the Hive table property for backward compatibility reasons.

After this PR, ORC and Parquet file generated by Spark will have the following metadata.

**ORC (`native` and `hive` implementations)**
```
$ orc-tools meta /tmp/o
File Version: 0.12 with ...
...
User Metadata:
  org.apache.spark.sql.create.version=3.0.0
```

**PARQUET**
```
$ parquet-tools meta /tmp/p
...
creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
extra:       org.apache.spark.sql.create.version = 3.0.0
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
```

## How was this patch tested?

Pass the Jenkins with newly added test cases.

This closes apache#22255.

Closes apache#22932 from dongjoon-hyun/SPARK-25102.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>