Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bypass generic data step and serialize directly to binary #160

Closed
Chuckame opened this issue Aug 29, 2023 · 10 comments · Fixed by #215
Closed

Bypass generic data step and serialize directly to binary #160

Chuckame opened this issue Aug 29, 2023 · 10 comments · Fixed by #215
Labels
enhancement New feature or request

Comments

@Chuckame
Copy link
Contributor

Chuckame commented Aug 29, 2023

Currently, the library is just interfacing as it can to the official apache avro library using GenericRecord and other GenericData stuff.

On avro4k codebase simplicity, it seems easy while it's not really since the codebase is doing a lot of adaptation to let the GenericData happy with the generic stuff we generated.

On performance side, we are checking, converting and adapting data to fit GenericData stuff, and then this adapted stuff is also tried to be re-adapted and checked, to be then serialized to binary. Also, all is runtime specific while kotlinx serialization is mainly prepared at build time (except contextual).

On user side, each user that just want to do avro format (and not directly use the generated GenericRecords) will have to call the apache avro library to serialize it to binary or json, or to use the libraries helping this stuff.

On compatibility and spec-fitting, we are like covering all the avro spec without knowing how it works. We just put the record from Avro.toRecord() into the apache avro lib, and 🎉 it works.
On tests, we are currently not in testing the avro format compliance, but the generated GenericRecords with another generic record.

And the last axis, kotlin multi platform. Since a big part of the codebase and the mechanisms is highly linked to the apache avro Java lib, the multiplaform dream seems complicated to reach.

Now, how to improve that? Let kotlinx serialization do his best and why it has been created : easily encode whatever the kotlin object.
Have a look to the avro Encoder methods : https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/io/Encoder.java

  • wtiteInt
  • writeBoolean
  • writeString
  • writeArray[Start/End]
  • writeMap[Start/End]
  • etc...

All the methods fits exactly to kotlinx encoders, and the way of calling them fits perfectly with kotlinx workflow :

  • beginCollection -> writeArrayStart/writeMapStart + setItemCount
  • endStructure -> writeArrayEnd/writeMapEnd
  • encodeElement -> startItem
  • all the encodeXX for primitives -> same in avro encoder

The other advantage is that there is absolutely no primitive autoboxing, also no more callbacks hell thanks to direct use of avro Encoder as output.

@thake, any comment on this? You are in this project since a long time. This will be a big refactoring, while it could be game changer.

Plan for making the changes:

  • finish the kotlin native feature PR that is simplifying the codebase and using the latest features of kotlinx serialization
  • adapt all tests to not compare generic records, but binary data
  • modify encoders to use avro Encoder from configuration (changeable with Binary, json, and maybe others)
  • modify decoders to use avro Decoder from configuration also
@Chuckame Chuckame added the enhancement New feature or request label Aug 29, 2023
@thake
Copy link
Member

thake commented Aug 30, 2023

@Chuckame, thanks for the comprehensive issue for this really important topic! I also had this in mind and think this is the way to go.

I think it would be cool to get rid of the avro dependency and create an extension library that maps our types to avro types. This would open the doors for multiplatform of the core library while still supporting users that rely on avro types.

@thake
Copy link
Member

thake commented Sep 8, 2023

Just as a little heads up, I've created a new branch https://github.com/avro-kotlin/avro4k/tree/way-to-multiplatform, which implements a PoC that directly encodes to avro binary without using the avro library. So far, everything looks very promising.

I've created a small benchmark that compares the speed of Avro4k (old), Avro4kDirect (new), and Jackson. Here are the results:

Benchmark                       Mode  Cnt       Score       Error  Units
Avro4kBenchmark.clients        thrpt    3   29557,032 ±  3002,497  ops/s
Avro4kBenchmark.users          thrpt    3  290324,243 ± 23988,233  ops/s
Avro4kDirectBenchmark.clients  thrpt    3  583662,180 ± 68148,244  ops/s
Avro4kDirectBenchmark.users    thrpt    3  425400,322 ± 29352,531  ops/s
JacksonBenchmark.clients       thrpt    3  239120,408 ± 22238,596  ops/s
JacksonBenchmark.users         thrpt    3  304450,006 ± 63307,328  ops/s

I will publish the code for the benchmark soon.

Open stuff for the branch:

  • deserialization
  • Write tests against binary data / other serialization library
  • Adapt Avro4k-API to the new way of serialization
  • Preparing API to become multiplatform (i.e. no Java dependencies in API)

@Chuckame
Copy link
Contributor Author

Chuckame commented Sep 8, 2023

So cool! And wait for my refacto to be merged, where it should improve a lot of little things 👼 And mostly the encoder/decoder architecture that was simplified

@Chuckame
Copy link
Contributor Author

Chuckame commented Sep 8, 2023

Can you create the PR as Draft to easily open discussions ? Done !

@thake
Copy link
Member

thake commented Sep 8, 2023

The sources of the benchmark used for the table above can now be found at https://github.com/avro-kotlin/kotlin-avro-benchmark.

@Chuckame
Copy link
Contributor Author

Chuckame commented Sep 8, 2023

Can you add a little of polymorphism? With using the sealed root type as the serializer descriptor for avro4k? E.g Avro.decode(RootType.serializer(), childInstance) that way it have to lookup the good descriptor for the implementation during decoding. I think Jackson do the same, where you can force the root type to be used as the root serializer

@thake
Copy link
Member

thake commented Sep 8, 2023

I don't actually get what you mean, sry. Can you provide a PR?

@Chuckame
Copy link
Contributor Author

@thake Just a global comment on the recent work on avro4k, but we should talk about the changes we want to do inside the codebase and synchronise. Because as I can see, I reworked all the codebase as already said, but you are also moving a lot of stuff (tests, classes). I'm just afraid about rebases and duplicate work. Are you available in next few days to meet (discord, zoom, google meet, ..) to clarify and maybe make some roadmap ?

@Chuckame
Copy link
Contributor Author

#186 introduced encodeToByteArray and decodeFromByteArray so we are now able of implementing direct binary implementation without changing the API

@Chuckame Chuckame linked a pull request May 23, 2024 that will close this issue
@Chuckame
Copy link
Contributor Author

Released in v2.0.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants