
Avro deserialization improvements #26

Open
basvdl opened this issue Jan 24, 2019 · 0 comments
Labels
de-assessment DataEngineer assessment

Comments

basvdl commented Jan 24, 2019

Originally commented by @barend on #23. New issue created so this valuable information isn't lost.

I'm not a huge fan of the class design with a `var` Schema. The underlying Java serialization kind of forces your hand here, and it's not easy to change, but it leaves some unclear/undefined semantics. Consider, for example, the `reader` field during deserialization:

1. the JVM creates a blank instance of this class without running any constructor visible in this file (Java deserialization bypasses the class's own constructors and invokes the no-args constructor of its first non-serializable superclass).
2. the field `reader` gets initialised to a new `GenericDatumReader` with a `null` schema. How does `GenericDatumReader` behave with a `null` schema? Can we be sure this behaviour doesn't change when the Avro library is updated?
3. the `readObject()` method runs and replaces the schema and reader variables.

Particularly bad client code could even find a way to invoke `writeObject` while `readObject` isn't done and attempt to toString a null value, but I suppose that's mostly hypothetical. You'd have to bend over backwards to force that error.
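
For illustration, here is a minimal sketch of the kind of class being described; the names are hypothetical, reconstructed from the points above rather than taken from the actual project:

```scala
import java.io.{ObjectInputStream, ObjectOutputStream}

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

// Hypothetical reconstruction of the pattern under discussion: the schema
// travels as its JSON string, and readObject mutates both vars afterwards.
class AvroDeserializer(s: Schema) extends Serializable {

  @transient var schema: Schema = s
  @transient var reader: GenericDatumReader[GenericRecord] =
    new GenericDatumReader[GenericRecord](schema)

  private def writeObject(out: ObjectOutputStream): Unit = {
    out.writeUTF(schema.toString) // the schema is written as its JSON form
  }

  private def readObject(in: ObjectInputStream): Unit = {
    // Until these two assignments have run, the blank instance created by
    // the JVM has no usable schema or reader.
    schema = new Schema.Parser().parse(in.readUTF())
    reader = new GenericDatumReader[GenericRecord](schema)
  }
}
```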

### Advice

- Java serialization is a lot trickier than it looks or was ever intended to be. I advise **always** adding a unit test for every serializable class that pushes an instance through an `ObjectOutputStream` into an `ObjectInputStream` and checks that it gets reconstructed correctly (a round-trip helper is sketched after this list).
- As an alternative to `writeObject`/`readObject` you can use `writeReplace`/`readResolve` to implement a [Memento Pattern](https://www.oodesign.com/memento-pattern.html), giving you more control over the serialised form. It's still a bit intricate, but it lets you define the class using `val`s for the schema and the reader (see the second sketch below).
- The book ["Effective Java (third edition)"](https://www.goodreads.com/book/show/34927404-effective-java) has detailed advice on how to handle this. The book is wonderful and highly recommended; it's so good, I lent my copy to someone at Intergamma and never got it back 😞.
- Basically, there's a reason Java serialization isn't very popular these days; avoid it when you can (use JSON, Protobuf, or anything else). Spark forces your hand here: it requires serialization in order to distribute executable program code to the worker nodes. Thankfully, most of the time the serializable classes define only a stateless function (such as the udf below), so all these weird state and order-of-initialization issues don't apply. This class happens to be an exception to that rule.
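
To illustrate the first bullet, a round-trip test helper could look roughly like this (the `roundTrip` helper and the commented-out usage are illustrative, not code from this project):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object SerializationTestSupport {

  // Push a value through Java serialization and read it straight back,
  // so a test can assert on the reconstructed copy.
  def roundTrip[T <: java.io.Serializable](value: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(value) finally out.close()

    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}

// Example use in a test (schemaJson is a placeholder):
// val original = new AvroDeserializer(new Schema.Parser().parse(schemaJson))
// val copy     = SerializationTestSupport.roundTrip(original)
// assert(copy.schema == original.schema)
```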

Originally posted by @barend in #23
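
To make the `writeReplace`/`readResolve` suggestion concrete, a sketch along these lines (again with hypothetical names) keeps the real class fully immutable and routes serialization through a small memento that carries only the schema JSON:

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

// The real class keeps vals and is only ever built through its constructor.
class AvroDeserializer(val schema: Schema) extends Serializable {

  val reader: GenericDatumReader[GenericRecord] =
    new GenericDatumReader[GenericRecord](schema)

  // Swap this instance for its memento just before it is written out.
  private def writeReplace(): AnyRef =
    new AvroDeserializerMemento(schema.toString)
}

// Only the memento's schema JSON goes through Java serialization;
// readResolve rebuilds a fully initialised AvroDeserializer on the way in.
class AvroDeserializerMemento(schemaJson: String) extends Serializable {
  private def readResolve(): AnyRef =
    new AvroDeserializer(new Schema.Parser().parse(schemaJson))
}
```

Because deserialization now always goes through the real constructor, the half-initialised `null`-schema window described in the numbered steps above can no longer occur.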

@krisgeus added the de-assessment DataEngineer assessment label Jan 24, 2019