[SPARK-25102][Spark Core] Write Spark version information to Parquet file footer #22255

Conversation
@dbtsai Hello, I'm sorry for asking you directly, but for some reason Jenkins did not generate the message "Can one of the admins verify this patch?". I just saw that you've reviewed some other PRs. This is my first PR, so maybe I did something wrong while creating it. I would be grateful for a review or any other advice.
Is there any other project writing this into the footer? Are there tests on reading this back?
I would also rather see the justification for this change written down, for instance linking to the usage of this name on the Parquet side, potential usage, etc.
Commit: [SPARK-25102][Spark Core] Write Spark version information to Parquet file footer (Added test on reading writer.model.name from footer metadata)
Hello @dbtsai, @HyukjinKwon. I added a test on reading writer.model.name to the PR. The justification for this change is below.
Hi @rdblue, is this roughly good to do here in Spark?
I don't think this fits the intent of the model name. The model name is intended to encode what the data model was that was written to Parquet. I can write Avro records to a Parquet file, for example, and we identify that using "avro" (and this could be done in Spark). That could be used if we need to interpret the data differently from a model, but it probably shouldn't include a version of that data model. The data model doesn't change with a version bump, so I think these should be logically separate.

It would be reasonable to add a "spark.version" property with this information. Other data models add properties to the file's key-value metadata for their own use. Avro adds its schema, for example.
I get your idea now. Apparently I was a little confused by the description of the tickets.
@npoberezkin, Parquet already supports custom key-value metadata in the file footer. The Spark version would go there.
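For illustration, a minimal sketch of the approach @rdblue describes, using parquet-mr's `WriteSupport.init` to attach a custom property; the class name, key name, and schema here are illustrative placeholders, not the final implementation:

```
import java.util.Collections

import org.apache.hadoop.conf.Configuration
import org.apache.parquet.hadoop.api.WriteSupport
import org.apache.parquet.hadoop.api.WriteSupport.WriteContext
import org.apache.parquet.io.api.RecordConsumer
import org.apache.parquet.schema.MessageTypeParser

// Hypothetical WriteSupport sketch: custom properties go into the extra
// metadata map of the WriteContext and end up in the footer's
// key_value_metadata. Real Spark code would extend ParquetWriteSupport.
class VersionedWriteSupport extends WriteSupport[AnyRef] {
  override def init(conf: Configuration): WriteContext = {
    val schema = MessageTypeParser.parseMessageType(
      "message spark_schema { required int64 id; }")
    new WriteContext(schema, Collections.singletonMap("spark.version", "2.4.0"))
  }
  override def prepareForWrite(consumer: RecordConsumer): Unit = ()
  override def write(record: AnyRef): Unit = ()
}
```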
Hi @npoberezkin, thank you for your first contribution. Could you update your PR to use custom key-value metadata, according to @rdblue's advice above? Also, please use tag …
@rdblue Can we use `created_by`?
@dongjoon-hyun Do you want to take this over?
Also cc @hvanhovell
Sure, @gatorsmile.
BTW, @rdblue recommended `key_value_metadata`. Are we going to …
Currently, we put the metadata like the following. For the Hive table, it looks like the following. So, I prefer to add …
That will go like the following.
It seems to cause some inconsistency if we choose one of … I'll ignore the inconsistency of (2) for backward compatibility.
Just to confirm it.
That is the value used by the Parquet-MR library. We had better not touch it. The Parquet-MR reader can behave differently based on that version, to work around some older Parquet writer bugs.
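For reference, a minimal sketch (assuming parquet-mr's `ParquetFileReader`; the path is a placeholder, and `readFooter` is the long-standing call, deprecated in newer parquet-mr releases) showing where the two kinds of footer metadata appear when read back:

```
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

val footer = ParquetFileReader.readFooter(new Configuration(), new Path("/tmp/p"))

// created_by is written by parquet-mr itself and used by readers to work
// around old writer bugs, so Spark should leave it alone.
println(footer.getFileMetaData.getCreatedBy)

// Application-level properties belong in the key-value metadata instead.
println(footer.getFileMetaData.getKeyValueMetaData)
```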
Hi, all. A new PR has been made. Please move to #22932 for further discussion.
## What changes were proposed in this pull request?
Currently, Spark writes the Spark version number into Hive table properties as `spark.sql.create.version`:
```
parameters:{
spark.sql.sources.schema.part.0={
"type":"struct",
"fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
},
transient_lastDdlTime=1541142761,
spark.sql.sources.schema.numParts=1,
spark.sql.create.version=2.4.0
}
```
This PR aims to write the Spark version to ORC/Parquet file metadata as `org.apache.spark.sql.create.version`, because we already use the `org.apache.` prefix in Parquet metadata. It's different from the Hive table property key `spark.sql.create.version`, but it seems we cannot change the Hive table property for backward compatibility.
After this PR, ORC and Parquet file generated by Spark will have the following metadata.
**ORC (`native` and `hive` implementations)**
```
$ orc-tools meta /tmp/o
File Version: 0.12 with ...
...
User Metadata:
org.apache.spark.sql.create.version=3.0.0
```
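To sanity-check the ORC side programmatically, a small sketch (assuming the key is stored as ORC user metadata and using the `org.apache.orc` reader API; the path is a placeholder) reading it back:

```
import java.nio.charset.StandardCharsets

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile

val reader = OrcFile.createReader(
  new Path("/tmp/o"), OrcFile.readerOptions(new Configuration()))
val key = "org.apache.spark.sql.create.version"
if (reader.hasMetadataValue(key)) {
  // ORC user metadata values are raw bytes; decode as UTF-8.
  val value = StandardCharsets.UTF_8.decode(reader.getMetadataValue(key)).toString
  println(s"$key=$value")
}
```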
**PARQUET**
```
$ parquet-tools meta /tmp/p
...
creator: parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
extra: org.apache.spark.sql.create.version = 3.0.0
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
```
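Similarly, a sketch of reading the new Parquet property back from the footer (the path is a placeholder, and the same `readFooter` deprecation caveat applies):

```
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

val meta = ParquetFileReader
  .readFooter(new Configuration(), new Path("/tmp/p"))
  .getFileMetaData
  .getKeyValueMetaData

// Both extras from the example output above should be present.
println(meta.get("org.apache.spark.sql.create.version"))
println(meta.get("org.apache.spark.sql.parquet.row.metadata"))
```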
## How was this patch tested?
Passes Jenkins with newly added test cases.
This closes apache#22255.
Closes apache#22932 from dongjoon-hyun/SPARK-25102.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
Overrode the `getName` method from `org.apache.parquet.hadoop.api.WriteSupport` in `org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport` so that it returns the Spark version (to be written to the Parquet file footer later).
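Roughly, the original change looked like this sketch (simplified; the real patch modified Spark's existing `ParquetWriteSupport`, and the returned string is illustrative):

```
import org.apache.parquet.hadoop.api.WriteSupport

// Simplified sketch: a non-null getName is recorded by parquet-mr as
// writer.model.name in the footer metadata. As discussed above, @rdblue
// objected that this field identifies the data model, not a version.
abstract class SparkVersionWriteSupport[T] extends WriteSupport[T] {
  override def getName: String = "Spark " + org.apache.spark.SPARK_VERSION
}
```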