feat(spec): standardizing fury cross-language serialization specification #1413

chaokunyang · 2024-03-18T13:16:45Z

What does this PR do?

This PR standardizes fury cross-language serialization specification. It comes with following changes:

Remove type tag from the protocol since it introduce space and performance overhead to the implementation. The type tag version can be seen in https://github.com/apache/incubator-fury/blob/6ea2e0b83d5449d63ca62296ff0dfd67b96c5bc5/docs/protocols/xlang_object_graph_spec.md .
Fury preserves 0~63 for internal types, but let users register type by id from 0(added by 64 automatically） to setup type mapping between languages.
Streamline the type systems, only bool/byte/i16/i32/i64/half-float/float/double/string/enum/list/set/map/Duration/Timestamp/decimal/binary/array/tensor/sparse/tensor/arrow/record/batch/arrow/table are allowed.
Formulized the binary format for above types.
Add type disambiguation: the deserialization are determined by data type in serialized binary and target type jointly.
Introduce meta string encoding algorithm for field name to reduce space cost by 3/8.
Introduce schema consist mode format for struct.
Introduce schema envolution mode for struct:
- this mode can embeed meta in the data or share across multiple messages,
- it can avoid the cost of type tag comparison in frameworks like protobuf

This protocol also supports object inheriance for xlang serializaiton. This is a feature request that users has been discussed for a long time in protobuf/flatbuffer:

Although there are some languages such as rust/golang doesn't support inheriance, there are many cases only langauges like java/c#/python/javascript are involved, and the support for inheriance is not complexed in the protocol level, so we added the inheriance support in the protocol. And in languages such as rust/golang, we can use some annotation to mark composition field as parent class for serialization layout, or we can disable inheriance foor such languages at the protocol level.

The protocol support polymorphic natively by type id, so I don't include types such as OneOf/Union. With this protocol, you can even serialize multiple rust dyn trait object which implement same trait., and get exactly the same objects when deserialization.

Related issue

This PR Closes #1418

chaokunyang · 2024-03-22T12:36:02Z

Hi @theweipeng @PragmaTwice @bytemain @LiangliangSui , I finished the fury cross-language serialization specification in this PR. It would be great it you can help review this. This is lots of work. I take a deep look at all kinds serialization framework and made lots of compromises. Any suggestions would be really appreciative.

chaokunyang · 2024-03-22T13:08:40Z

Rust and C++ need more consideration, since they don't support dynamic codegen. We need to generate all code at compile-time using meta programming. The generated code may not be the best code for execution.

For type evolution, the serializer will encode the type meta into the serialized data. The deserializer can compare this meta with the class in the current process, and use the diff to generate the serializer for deserialization.

For java/javascript/python, we can use the diff to generate serializer code at runtime and load it as class/function for deserialization. In this way, the type evolution will be as fast as type consist mode.
For C++/Rust, we can't generate the serializer code at runtime. So we need to generate the code at compile-time. But at that time, we don't know the type schema in other processes. So we can't generate the serializer code for such inconsistent types. We may need to generate the code which has a loop and compare field name one by one to decide whether deserialize and assign the field or deserialize and skip the field. One lucky thing is that we can cache the type meta, and generate a 64-bit id for every field name, and generate the id for field names for the current type at build time, so we can convert the field name comparison into a long comparison. And it will be fast too. @PragmaTwice

chaokunyang · 2024-03-22T13:12:06Z

The spec must allow the languages like rust/c++ which don't have runtime reflection work and runfast, and let languages like javasccript which has a weak type systems to work well too.

chaokunyang · 2024-03-23T04:25:55Z

I updated the implementation guide, the implementation for static languages can use switch/jump for type evolution deserialization:

Assume current type has n fields, peer type has n1 fields.
Generate an auto growing field id from 0 for every sorted field in current type at the compile time.
Compare the received type meta with current type, generate same id if the field name is same, otherwise generate an
auto growing id starting from n, cache this meta at runtime.
Iterate the fields of received type meta, use a switch to compare the field id to deserialize data
and assign/skip field value. Continuous field id will be optimized into jump in switch block, so it will
very fast.

Here is an example, suppose process A has a class Foo with version 1 defined as Foo1, process B has a class Foo
with version 2 defined as Foo2:

// class Foo with version 1
class Foo1 {
  int32_t v1; // id 0
  std::string v2; // id 1
}
// class Foo with version 2
class Foo2 {
  // id 0, but will have id 2 in process A
  bool v0;
  // id 1, but will have id 0 in process A
  int32_t v1;
  // id 2, but will have id 3 in process A
  int64_t long_value;
  // id 3, but will have id 1 in process A
  std::string v2;
  // id 4, but will have id 4 in process A
  std::vector<std::string> list;
}

When process A received serialized Foo2 from process B, here is how it deserialize the data:

Foo1 &foo1 = xxx;
std::vector<fury::FieldInfo> &field_infos = type_meta.field_infos;
for (auto &field_info : field_infos) {
  switch (field_info.field_id) {
    case 0:
      foo1.v1 = buffer.read_varint32();
      break;
    case 1:
      foo1.v2 = fury.read_string();
      break;
    default:
      fury.skip_data(field_info);
  }
}

docs/protocols/xlang_object_graph_spec.md

PragmaTwice · 2024-03-24T12:43:16Z

docs/protocols/xlang_object_graph_spec.md

+ fractions of seconds at nanosecond resolution.
+ - Timestamp: a point in time independent of any calendar/timezone, as a count of seconds and fractions of
+ seconds at nanosecond resolution. The count is relative to an epoch at UTC midnight on January 1, 1970.
+- decimal: exact decimal value represented as an integer value in two's complement.


What's kind of decimal? decimal32/decimal64 in IEEE754? or an unlimited-precison decimal?

An unlimited-precison decimal, the precison will be written in the data.

Arrow support decimal128/decimal256 only. I'm not sure whether fury should add such limit.

If we allow then it's same as float64/float32, which is very efficient and take fixed storage.

If we have unlimited-precision decimal, should we also have bigint and unlimited-precision float?

The decimal will be serialized like:

buffer.writeVarUint32(value.scale()); buffer.writeVarUint32(value.precision()); buffer.writeVarUint32(bytes.length); buffer.writeBytes(value.unscaledValue().toByteArray());

Could you elaborate how decimal128/decimal256 save space or provide better performance?
One thing I can see is that we can skip write buffer.writeVarUint32(bytes.length), we can take binary took up 128/256 bytes. but other data are still unavoidable.

And if the unscaledValue is small, take new BigDecimal(BigInteger.valueOf(100), 200, MathContext.DECIMAL128) for an example, it took only 6 bytes, but when using decimal128 format, it will take 131 bytes?

docs/protocols/xlang_object_graph_spec.md

theweipeng

Great!

LiangliangSui

Great!, I don’t have any comments anymore, thank you for your hard work.

## What does this PR do? This PR implements meta string encoding described in [fury java serialization spec](https://fury.apache.org/docs/specification/fury_java_serialization_spec#meta-string) and [xlang serialization spec](https://fury.apache.org/docs/specification/fury_xlang_serialization_spec#meta-string) We have `3/8` space saveing for most string: ```java // utf8 use 30 bytes, we use only 19 bytes assertEquals(encoder.encode("org.apache.fury.benchmark.data").getBytes().length, 19); // utf8 use 12 bytes, we use only 9 bytes. assertEquals(encoder.encode("MediaContent").getBytes().length, 9); ``` The integration with ClassResolver is left in another PR. ## Related issues #1240 #1413 ## Does this PR introduce any user-facing change?  - [ ] Does this PR introduce any public API change? - [ ] Does this PR introduce any binary protocol compatibility change? ## Benchmark

## What does this PR do? This PR implements meta string encoding described in [xlang serialization spec](https://fury.apache.org/docs/specification/fury_xlang_serialization_spec#meta-string) ## Related issues - #1240 - #1413 - #1514 ## Does this PR introduce any user-facing change? - [No] Does this PR introduce any public API change? - [No] Does this PR introduce any binary protocol compatibility change? --------- Co-authored-by: Shawn Yang <chaokunyang@apache.org>

## What does this PR do? This PR removes list/map header from type meta spec. Such header may be computed ahead sometimes. But it may need to compute based the data at runtime. If we write header into type meta, we may still need to compute and write header when serialization. This will introduce extra space cost. Instead, we can write all such header at runtime, and when creating/generating serializer, we can compute part of such header ahead, and let remaing parts of header computed at serialization. When deserialization, the deserializer can generate different deserializer based on read header, which will be more efficient. ## Related issues #1413 ## Does this PR introduce any user-facing change?  - [ ] Does this PR introduce any public API change? - [ ] Does this PR introduce any binary protocol compatibility change? ## Benchmark

## What does this PR do? This PR implements type meta encoding for java proposed in #1240 . The type meta encoding in xlang spec proposed in #1413 will be finished in another PR based on this PR. The spec has been updated too: type meta header ``` | 8 bytes meta header | meta size | variable bytes | variable bytes | variable bytes | +-------------------------------+-----------|--------------------+-------------------+----------------+ | 7 bytes hash + 1 bytes header | 1~2 bytes | current class meta | parent class meta | ... | ``` And the encoding for packge/class/field name has been updated to: ``` - Package name encoding(omitted when class is registered): - encoding algorithm: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL` - Header: `6 bits size | 2 bits encoding flags`. The `6 bits size: 0~63` will be used to indicate size `0~62`, the value `63` the size need more byte to read, the encoding will encode `size - 62` as a varint next. - Class name encoding(omitted when class is registered): - encoding algorithm: `UTF8/LOWER_UPPER_DIGIT_SPECIAL/FIRST_TO_LOWER_SPECIAL/ALL_TO_LOWER_SPECIAL` - header: `6 bits size | 2 bits encoding flags`. The `6 bits size: 0~63` will be used to indicate size `1~64`, the value `63` the size need more byte to read, the encoding will encode `size - 63` as a varint next. - Field info: - header(8 bits): `3 bits size + 2 bits field name encoding + polymorphism flag + nullability flag + ref tracking flag`. Users can use annotation to provide those info. - 2 bits field name encoding: - encoding: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL/TAG_ID` - If tag id is used, i.e. field name is written by an unsigned varint tag id. 2 bits encoding will be `11`. - size of field name: - The `3 bits size: 0~7` will be used to indicate length `1~7`, the value `6` the size read more bytes, the encoding will encode `size - 7` as a varint next. - If encoding is `TAG_ID`, then num_bytes of field name will be used to store tag id. - Field name: If type id is set, type id will be used instead. Otherwise meta string encoding length and data will be written instead. ``` ## Meta size Before this PR: ```java class org.apache.fury.benchmark.data.MediaContent 78 class org.apache.fury.benchmark.data.Media 208 class org.apache.fury.benchmark.data.Image 114 ``` With this PR: ```java class org.apache.fury.benchmark.data.MediaContent 53 class org.apache.fury.benchmark.data.Media 114 class org.apache.fury.benchmark.data.Image 68 ``` The size of class meta reduced by half, which is a great gain. The size can be reduded more if we introduce field name hash, but it's not related to this PR. We can discuss it in another PR. ## Related issues #1240 #203 #202 ## Does this PR introduce any user-facing change?  - [ ] Does this PR introduce any public API change? - [ ] Does this PR introduce any binary protocol compatibility change? ## Benchmark

chaokunyang requested review from theweipeng and PragmaTwice as code owners March 18, 2024 13:16

chaokunyang requested review from theweipeng and PragmaTwice and removed request for theweipeng and PragmaTwice March 18, 2024 13:16

chaokunyang marked this pull request as draft March 18, 2024 13:17

chaokunyang marked this pull request as ready for review March 22, 2024 11:24

chaokunyang requested a review from pjfanning March 22, 2024 13:12

chaokunyang mentioned this pull request Mar 22, 2024

RoadMap #1017

Open

chaokunyang mentioned this pull request Mar 23, 2024

refactor(java): Streamline ClassResolver responsibilities #1417

Open

PragmaTwice reviewed Mar 24, 2024

View reviewed changes