Skip to content

Conversation

@cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

Write Spark version into Avro file metadata

Why are the changes needed?

The version info is very useful for backward compatibility. This is also done in parquet/orc.

Does this PR introduce any user-facing change?

no

How was this patch tested?

new test

@cloud-fan
Copy link
Contributor Author

cc @MaxGekk @dongjoon-hyun

@SparkQA
Copy link

SparkQA commented Apr 2, 2020

Test build #120725 has finished for PR 28102 at commit 5aaf2db.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class SparkAvroKeyOutputFormat extends AvroKeyOutputFormat<GenericRecord>
  • static class SparkRecordWriterFactory extends RecordWriterFactory<GenericRecord>

@dongjoon-hyun
Copy link
Member

Thank you for pinging me, @cloud-fan .

this.mAvroFileWriter.create(writerSchema, outputStream);
}

public void write(AvroKey<T> record, NullWritable ignore) throws IOException {
Copy link
Member

@dongjoon-hyun dongjoon-hyun Apr 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan . Do we need to check the effect in terms of performance because this is an additional wrapper technically? Oh, never mind.

@SparkQA
Copy link

SparkQA commented Apr 3, 2020

Test build #120749 has finished for PR 28102 at commit 0fe7bfd.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

class SparkAvroKeyRecordWriter<T> extends RecordWriter<AvroKey<T>, NullWritable>
implements Syncable {
private final DataFileWriter<T> mAvroFileWriter;
public SparkAvroKeyRecordWriter(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you fix this?

/home/jenkins/workspace/SparkPullRequestBuilder@3/external/avro/src/main/java/org/apache/spark/sql/avro/SparkAvroKeyOutputFormat.java:65: Redundant 'public' modifier.

Map<String, String> metadata) throws IOException {
this.mAvroFileWriter = new DataFileWriter(dataModel.createDatumWriter(writerSchema));
for (Map.Entry<String, String> entry : metadata.entrySet()) {
this.mAvroFileWriter.setMeta(entry.getKey(), entry.getValue());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems it is better to check if conflicting with reserved meta by isReservedMeta?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's already checked in the method setMeta.

override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
new Path(path)
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we don't change this? Keeping a blank line between methods is actually a valid style (https://github.com/databricks/scala-style-guide#blank-lines-vertical-whitespace)

@SparkQA
Copy link

SparkQA commented Apr 3, 2020

Test build #120758 has finished for PR 28102 at commit dbe2e87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

The last commit just adds blank lines. Thanks for the review, merging to master/3.0!

cloud-fan added a commit that referenced this pull request Apr 3, 2020
### What changes were proposed in this pull request?

Write Spark version into Avro file metadata

### Why are the changes needed?

The version info is very useful for backward compatibility. This is also done in parquet/orc.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

new test

Closes #28102 from cloud-fan/avro.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6b1ca88)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan cloud-fan closed this in 6b1ca88 Apr 3, 2020
@SparkQA
Copy link

SparkQA commented Apr 3, 2020

Test build #120769 has finished for PR 28102 at commit 3ee3d82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan added a commit to cloud-fan/spark that referenced this pull request Apr 8, 2020
Write Spark version into Avro file metadata

The version info is very useful for backward compatibility. This is also done in parquet/orc.

no

new test

Closes apache#28102 from cloud-fan/avro.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit to cloud-fan/spark that referenced this pull request Apr 8, 2020
Write Spark version into Avro file metadata

The version info is very useful for backward compatibility. This is also done in parquet/orc.

no

new test

Closes apache#28102 from cloud-fan/avro.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
### What changes were proposed in this pull request?

Write Spark version into Avro file metadata

### Why are the changes needed?

The version info is very useful for backward compatibility. This is also done in parquet/orc.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

new test

Closes apache#28102 from cloud-fan/avro.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants