Invalid Avro file produced using SequenceWriter #339

Open
willsoto opened this issue Sep 8, 2022 · 7 comments

@willsoto
Contributor

willsoto commented Sep 8, 2022

While documentation on writing Avro to a file is sparse, I have managed to piece some stuff together but I am still getting an error.

Here is some sample code:

final var avroFactory = AvroFactory.builderWithApacheDecoder().enable(AvroGenerator.Feature.AVRO_FILE_OUTPUT).build();

final var generator = new AvroSchemaGenerator().enableLogicalTypes();

final var mapper = AvroMapper.builder(avroFactory).addModule(new AvroJavaTimeModule()).build();
mapper.acceptJsonFormatVisitor(Thing.class, generator);

final var avroSchema = generator.getGeneratedSchema();

final var file = Files.createTempFile("something", ".avro").toFile();

final var out = new ByteArrayOutputStream();
final var writer = mapper.writer(avroSchema).writeValues(out);

// in a loop
writer.write(thing);

// after loop
writer.close();

try (FileOutputStream outputStream = new FileOutputStream(file)) {
  out.writeTo(outputStream);
}

When checking the resultant file using avro-tools, I get the following error:

avro-tools tojson something.avro

22/09/08 18:36:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
	at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:224)
	at org.apache.avro.tool.DataFileReadTool.run(DataFileReadTool.java:97)
	at org.apache.avro.tool.Main.run(Main.java:67)
	at org.apache.avro.tool.Main.main(Main.java:56)
Caused by: java.io.IOException: Invalid sync!
	at org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:319)
	at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:213)
	... 3 more

According to some searching, the Invalid sync! error occurs when the file hasn't been stitched together properly, but it's unclear to me what I need to do in code to get that to happen. I've looked through most of the Avro tests in this repo and I cannot find one that actually writes to a file and then de-serializes from that file.

I am not sure if I have stumbled into an actual bug here or not, but I am happy to try and write a test case if this code does seem correct since that would imply it's a bug?

Thanks in advance.

Edit:

I've also tried the following:

final var file = Files.createTempFile("something", ".avro").toFile();
final SequenceWriter writer = mapper.writer(avroSchema).writeValues(file);

In which case I get the following error at that line:

java.lang.UnsupportedOperationException: Generator of type com.fasterxml.jackson.core.json.UTF8JsonGenerator does not support schema of type 'avro'

	at com.fasterxml.jackson.core.JsonGenerator.setSchema(JsonGenerator.java:592)
	at com.fasterxml.jackson.databind.ObjectWriter$GeneratorSettings.initialize(ObjectWriter.java:1393)
	at com.fasterxml.jackson.databind.ObjectWriter._configureGenerator(ObjectWriter.java:1258)
	at com.fasterxml.jackson.databind.ObjectWriter.createGenerator(ObjectWriter.java:717)
	at com.fasterxml.jackson.databind.ObjectWriter.writeValues(ObjectWriter.java:753)
@cowtowncoder
Member

I think the problem may be an Avro oddity: encoding data as a "File" requires a header that is otherwise not used (or allowed) at all.
It would be good to support the "File" variant, and there may already be an issue filed for it, but no work has been done.
It's a bit tricky wrt the API, since Jackson has no concept of separating by input/output source type (the idea of a different encoding for File seems specifically peculiar and ... well, a bad idea, IMO).

@willsoto
Contributor Author

willsoto commented Sep 9, 2022

Ah okay...given everything I found I thought this was well supported - especially because of this particular bit AvroGenerator.Feature.AVRO_FILE_OUTPUT.

That particular feature is documented in JavaDoc and I found this as well:

// 21-Feb-2017, tatu: As per [dataformats-binary#15], need to ensure schema gets
//   written, if using "File" format (not raw "rpc" one)
if (_generator.isEnabled(Feature.AVRO_FILE_OUTPUT)) {
    OutputStream outputStream = (OutputStream) _generator.getOutputTarget();
    DatumWriter<Object> datumWriter = new NonBSGenericDatumWriter<>(_schema);
    DataFileWriter<Object> dataFileWriter = new DataFileWriter<>(datumWriter);
    dataFileWriter.create(_schema, outputStream);
    dataFileWriter.append(rootValue);
    dataFileWriter.close();
    return;
}
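
Reading the snippet above, each root value appears to get its own DataFileWriter, and therefore its own file header. If that reading is right (it is an assumption, not something confirmed in this thread), a SequenceWriter would emit one complete container file per value, back to back, which could plausibly explain the Invalid sync! error. A rough JDK-only sketch to check the output for that: it counts occurrences of the 4-byte magic that opens an Avro object container file. The class and method names are hypothetical, and a match inside record data would be a false positive, so treat the count as a hint only.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

class AvroMagicCounter {
    // Avro object container files start with the bytes 'O', 'b', 'j', 1.
    static final byte[] MAGIC = {'O', 'b', 'j', 1};

    /** Counts how many times MAGIC occurs anywhere in data. */
    static int countMagic(byte[] data) {
        int count = 0;
        for (int i = 0; i + MAGIC.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < MAGIC.length; j++) {
                if (data[i + j] != MAGIC[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        byte[] data;
        if (args.length > 0) {
            // Path to the file under inspection, e.g. something.avro
            data = Files.readAllBytes(Paths.get(args[0]));
        } else {
            // Stand-in bytes: two fake "headers" with filler in between.
            data = new byte[] {'O', 'b', 'j', 1, 0, 5, 'O', 'b', 'j', 1, 7};
        }
        System.out.println("container headers found: " + countMagic(data));
    }
}
```

A count above 1 on a file that should hold a single container would be consistent with several headers having been concatenated.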

@cowtowncoder
Member

@willsoto Hmmh. I had actually forgotten about this being implemented. Had I read your example in detail, I would have seen it there.

I assume you have also tried disabling that to see what difference it makes? Is there a matching reader (deserialization side) setting to go with it?
Apologies for asking questions I should know the answer to, but I figured you have been investigating this and have good context.

@willsoto
Contributor Author

willsoto commented Sep 9, 2022

No worries! Appreciate you taking the time to help me out 😄

> I assume you have also tried disabling that to see what difference it makes?

If I understand the question, I initially just tried the examples pretty much copy+pasted from the documentation so I didn't even know there was this AvroGenerator.Feature.AVRO_FILE_OUTPUT setting. It took quite a bit of searching to stumble upon it. In terms of example code, if you just remove the AvroFactory stuff, that is what I was trying initially.

> Is there matching reader (de-serialization side) setting to go with it?

Not sure honestly. The way I've been testing is writing the file and then attempting to open it with avro-tools to prove it's valid and de-serializable.

@cowtowncoder
Member

Ok that makes sense.

Adding example files to a (new) unit test would be nice too. One challenge wrt Avro, though, is that without the file header it has zero metadata with which to detect valid data. This is unlike almost every other format; even protobuf has type tags etc. for some level of self-descriptiveness.
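
To make the "zero metadata" point concrete, here is a small JDK-only sketch of Avro's raw binary encoding for a hypothetical record with a long field `id = 1` and a string field `name = "a"`. It hand-rolls the encoding rules from the Avro specification (zig-zag varint for longs, length-prefixed UTF-8 for strings, record fields concatenated in schema order) rather than calling any Avro or Jackson API. The class name and the example record are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class RawAvroBytes {
    /** Avro long: zig-zag encode, then write 7 bits at a time, low group first. */
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // continuation bit set
            z >>>= 7;
        }
        out.write((int) z);
    }

    /** Avro string: byte length as a long, followed by the UTF-8 bytes. */
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    /** Hypothetical record {id: long = 1, name: string = "a"}, fields in schema order. */
    static byte[] encodeExample() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLong(out, 1);
        writeString(out, "a");
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : encodeExample()) {
            hex.append(String.format("%02x ", b));
        }
        System.out.println(hex.toString().trim()); // prints "02 02 61"
    }
}
```

The three bytes carry no field names, type tags, or record boundaries, so a reader without the schema cannot tell this record apart from, say, the two longs 1 and 1 followed by a stray byte.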

@willsoto
Contributor Author

willsoto commented Sep 9, 2022

I'll try and add a test case this weekend.

Does the code I provided at least seem like it should work? I am curious if we can minimize the reproduction even further.

@cowtowncoder
Member

Oh. The part that possibly (likely?) will not work is the use of writeValues() (and the SequenceWriter it creates): I suspect you cannot simply append root-level values in Avro, unlike in some other formats. So you may instead need to create a container (List) with a matching root-level Avro type to describe the full type. But then again... Avro is designed for data streams, so I am not 100% sure (it has been a while since I worked actively on this format module).
