Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(spec): standardizing fury cross-language serialization specification #1413

Merged
merged 41 commits into from
Mar 30, 2024

Conversation

chaokunyang
Copy link
Collaborator

@chaokunyang chaokunyang commented Mar 18, 2024

What does this PR do?

This PR standardizes fury cross-language serialization specification. It comes with following changes:

  • Remove type tag from the protocol since it introduce space and performance overhead to the implementation. The type tag version can be seen in https://github.com/apache/incubator-fury/blob/6ea2e0b83d5449d63ca62296ff0dfd67b96c5bc5/docs/protocols/xlang_object_graph_spec.md .
  • Fury preserves 0~63 for internal types, but let users register type by id from 0(added by 64 automatically) to setup type mapping between languages.
  • Streamline the type systems, only bool/byte/i16/i32/i64/half-float/float/double/string/enum/list/set/map/Duration/Timestamp/decimal/binary/array/tensor/sparse/tensor/arrow/record/batch/arrow/table are allowed.
  • Formulized the binary format for above types.
  • Add type disambiguation: the deserialization are determined by data type in serialized binary and target type jointly.
  • Introduce meta string encoding algorithm for field name to reduce space cost by 3/8.
  • Introduce schema consist mode format for struct.
  • Introduce schema envolution mode for struct:
    • this mode can embeed meta in the data or share across multiple messages,
    • it can avoid the cost of type tag comparison in frameworks like protobuf

This protocol also supports object inheriance for xlang serializaiton. This is a feature request that users has been discussed for a long time in protobuf/flatbuffer:

Although there are some languages such as rust/golang doesn't support inheriance, there are many cases only langauges like java/c#/python/javascript are involved, and the support for inheriance is not complexed in the protocol level, so we added the inheriance support in the protocol. And in languages such as rust/golang, we can use some annotation to mark composition field as parent class for serialization layout, or we can disable inheriance foor such languages at the protocol level.

The protocol support polymorphic natively by type id, so I don't include types such as OneOf/Union. With this protocol, you can even serialize multiple rust dyn trait object which implement same trait., and get exactly the same objects when deserialization.

Related issue

This PR Closes #1418

@chaokunyang chaokunyang requested review from theweipeng and PragmaTwice and removed request for theweipeng and PragmaTwice March 18, 2024 13:16
@chaokunyang chaokunyang marked this pull request as draft March 18, 2024 13:17
@chaokunyang chaokunyang marked this pull request as ready for review March 22, 2024 11:24
@chaokunyang
Copy link
Collaborator Author

Hi @theweipeng @PragmaTwice @bytemain @LiangliangSui , I finished the fury cross-language serialization specification in this PR. It would be great it you can help review this. This is lots of work. I take a deep look at all kinds serialization framework and made lots of compromises. Any suggestions would be really appreciative.

@chaokunyang
Copy link
Collaborator Author

chaokunyang commented Mar 22, 2024

Rust and C++ need more consideration, since they don't support dynamic codegen. We need to generate all code at compile-time using meta programming. The generated code may not be the best code for execution.

For type evolution, the serializer will encode the type meta into the serialized data. The deserializer can compare this meta with the class in the current process, and use the diff to generate the serializer for deserialization.

  • For java/javascript/python, we can use the diff to generate serializer code at runtime and load it as class/function for deserialization. In this way, the type evolution will be as fast as type consist mode.
  • For C++/Rust, we can't generate the serializer code at runtime. So we need to generate the code at compile-time. But at that time, we don't know the type schema in other processes. So we can't generate the serializer code for such inconsistent types. We may need to generate the code which has a loop and compare field name one by one to decide whether deserialize and assign the field or deserialize and skip the field. One lucky thing is that we can cache the type meta, and generate a 64-bit id for every field name, and generate the id for field names for the current type at build time, so we can convert the field name comparison into a long comparison. And it will be fast too. @PragmaTwice

@chaokunyang
Copy link
Collaborator Author

chaokunyang commented Mar 22, 2024

The spec must allow the languages like rust/c++ which don't have runtime reflection work and runfast, and let languages like javasccript which has a weak type systems to work well too.

@chaokunyang chaokunyang mentioned this pull request Mar 22, 2024
@chaokunyang
Copy link
Collaborator Author

I updated the implementation guide, the implementation for static languages can use switch/jump for type evolution deserialization:

  • Assume current type has n fields, peer type has n1 fields.
  • Generate an auto growing field id from 0 for every sorted field in current type at the compile time.
  • Compare the received type meta with current type, generate same id if the field name is same, otherwise generate an
    auto growing id starting from n, cache this meta at runtime.
  • Iterate the fields of received type meta, use a switch to compare the field id to deserialize data
    and assign/skip field value. Continuous field id will be optimized into jump in switch block, so it will
    very fast.

Here is an example, suppose process A has a class Foo with version 1 defined as Foo1, process B has a class Foo
with version 2 defined as Foo2:

// class Foo with version 1
class Foo1 {
  int32_t v1; // id 0
  std::string v2; // id 1
}
// class Foo with version 2
class Foo2 {
  // id 0, but will have id 2 in process A
  bool v0;
  // id 1, but will have id 0 in process A
  int32_t v1;
  // id 2, but will have id 3 in process A
  int64_t long_value;
  // id 3, but will have id 1 in process A
  std::string v2;
  // id 4, but will have id 4 in process A
  std::vector<std::string> list;
}

When process A received serialized Foo2 from process B, here is how it deserialize the data:

Foo1 &foo1 = xxx;
std::vector<fury::FieldInfo> &field_infos = type_meta.field_infos;
for (auto &field_info : field_infos) {
  switch (field_info.field_id) {
    case 0:
      foo1.v1 = buffer.read_varint32();
      break;
    case 1:
      foo1.v2 = fury.read_string();
      break;
    default:
      fury.skip_data(field_info);
  }
}

fractions of seconds at nanosecond resolution.
- Timestamp: a point in time independent of any calendar/timezone, as a count of seconds and fractions of
seconds at nanosecond resolution. The count is relative to an epoch at UTC midnight on January 1, 1970.
- decimal: exact decimal value represented as an integer value in two's complement.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's kind of decimal? decimal32/decimal64 in IEEE754? or an unlimited-precison decimal?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An unlimited-precison decimal, the precison will be written in the data.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arrow support decimal128/decimal256 only. I'm not sure whether fury should add such limit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we allow then it's same as float64/float32, which is very efficient and take fixed storage.

If we have unlimited-precision decimal, should we also have bigint and unlimited-precision float?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The decimal will be serialized like:

buffer.writeVarUint32(value.scale());
buffer.writeVarUint32(value.precision());
buffer.writeVarUint32(bytes.length);
buffer.writeBytes(value.unscaledValue().toByteArray());

Could you elaborate how decimal128/decimal256 save space or provide better performance?
One thing I can see is that we can skip write buffer.writeVarUint32(bytes.length), we can take binary took up 128/256 bytes. but other data are still unavoidable.

And if the unscaledValue is small, take new BigDecimal(BigInteger.valueOf(100), 200, MathContext.DECIMAL128) for an example, it took only 6 bytes, but when using decimal128 format, it will take 131 bytes?

docs/protocols/xlang_object_graph_spec.md Outdated Show resolved Hide resolved
docs/protocols/xlang_object_graph_spec.md Outdated Show resolved Hide resolved
docs/protocols/xlang_object_graph_spec.md Outdated Show resolved Hide resolved
Copy link
Member

@theweipeng theweipeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

Copy link
Contributor

@LiangliangSui LiangliangSui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!, I don’t have any comments anymore, thank you for your hard work.

@chaokunyang chaokunyang merged commit c4b4f38 into apache:main Mar 30, 2024
1 check passed
@chaokunyang chaokunyang deleted the fury_xlang_serialization_spec branch March 31, 2024 17:50
chaokunyang added a commit that referenced this pull request Apr 15, 2024
<!--
**Thanks for contributing to Fury.**

**If this is your first time opening a PR on fury, you can refer to
[CONTRIBUTING.md](https://github.com/apache/incubator-fury/blob/main/CONTRIBUTING.md).**

Contribution Checklist

- The **Apache Fury (incubating)** community has restrictions on the
naming of pr titles. You can also find instructions in
[CONTRIBUTING.md](https://github.com/apache/incubator-fury/blob/main/CONTRIBUTING.md).

- Fury has a strong focus on performance. If the PR you submit will have
an impact on performance, please benchmark it first and provide the
benchmark result here.
-->

## What does this PR do?
This PR implements meta string encoding described in [fury java
serialization
spec](https://fury.apache.org/docs/specification/fury_java_serialization_spec#meta-string)
and [xlang serialization
spec](https://fury.apache.org/docs/specification/fury_xlang_serialization_spec#meta-string)

We have `3/8` space saveing for most string:
```java
    // utf8 use 30 bytes, we use only 19 bytes
    assertEquals(encoder.encode("org.apache.fury.benchmark.data").getBytes().length, 19);
    // utf8 use 12 bytes, we use only 9 bytes.
    assertEquals(encoder.encode("MediaContent").getBytes().length, 9);
```


The integration with ClassResolver is left in another PR.

## Related issues
#1240 
#1413 


## Does this PR introduce any user-facing change?

<!--
If any user-facing interface changes, please [open an
issue](https://github.com/apache/incubator-fury/issues/new/choose)
describing the need to do so and update the document if necessary.
-->

- [ ] Does this PR introduce any public API change?
- [ ] Does this PR introduce any binary protocol compatibility change?


## Benchmark

<!--
When the PR has an impact on performance (if you don't know whether the
PR will have an impact on performance, you can submit the PR first, and
if it will have impact on performance, the code reviewer will explain
it), be sure to attach a benchmark data here.
-->
chaokunyang added a commit that referenced this pull request Apr 27, 2024
## What does this PR do?

This PR implements meta string encoding described in [xlang
serialization
spec](https://fury.apache.org/docs/specification/fury_xlang_serialization_spec#meta-string)


## Related issues
- #1240 
- #1413
- #1514



## Does this PR introduce any user-facing change?

- [No] Does this PR introduce any public API change?
- [No] Does this PR introduce any binary protocol compatibility change?

---------

Co-authored-by: Shawn Yang <chaokunyang@apache.org>
theweipeng pushed a commit that referenced this pull request Apr 28, 2024
## What does this PR do?

This PR removes list/map header from type meta spec. 

Such header may be computed ahead sometimes. But it may need to compute
based the data at runtime. If we write header into type meta, we may
still need to compute and write header when serialization. This will
introduce extra space cost.

Instead, we can write all such header at runtime, and when
creating/generating serializer, we can compute part of such header
ahead, and let remaing parts of header computed at serialization.

When deserialization, the deserializer can generate different
deserializer based on read header, which will be more efficient.

## Related issues
#1413


## Does this PR introduce any user-facing change?

<!--
If any user-facing interface changes, please [open an
issue](https://github.com/apache/incubator-fury/issues/new/choose)
describing the need to do so and update the document if necessary.
-->

- [ ] Does this PR introduce any public API change?
- [ ] Does this PR introduce any binary protocol compatibility change?


## Benchmark

<!--
When the PR has an impact on performance (if you don't know whether the
PR will have an impact on performance, you can submit the PR first, and
if it will have impact on performance, the code reviewer will explain
it), be sure to attach a benchmark data here.
-->
chaokunyang added a commit that referenced this pull request May 2, 2024
## What does this PR do?

This PR implements type meta encoding for java proposed in #1240 .

The type meta encoding in xlang spec proposed in #1413 will be finished
in another PR based on this PR.

The spec has been updated too:

type meta header
```
|      8 bytes meta header      | meta size |   variable bytes   |  variable bytes   | variable bytes |
+-------------------------------+-----------|--------------------+-------------------+----------------+
| 7 bytes hash + 1 bytes header | 1~2 bytes | current class meta | parent class meta |      ...       |
```

And the encoding for packge/class/field name has been updated to:
```
- Package name encoding(omitted when class is registered):
    - encoding algorithm: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL`
    - Header: `6 bits size | 2 bits encoding flags`. The `6 bits size: 0~63`  will be used to indicate size `0~62`,
      the value `63` the size need more byte to read, the encoding will encode `size - 62` as a varint next.
- Class name encoding(omitted when class is registered):
    - encoding algorithm: `UTF8/LOWER_UPPER_DIGIT_SPECIAL/FIRST_TO_LOWER_SPECIAL/ALL_TO_LOWER_SPECIAL`
    - header: `6 bits size | 2 bits encoding flags`. The `6 bits size: 0~63`  will be used to indicate size `1~64`,
      the value `63` the size need more byte to read, the encoding will encode `size - 63` as a varint next.
- Field info:
    - header(8
      bits): `3 bits size + 2 bits field name encoding + polymorphism flag + nullability flag + ref tracking flag`.
      Users can use annotation to provide those info.
        - 2 bits field name encoding:
            - encoding: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL/TAG_ID`
            - If tag id is used, i.e. field name is written by an unsigned varint tag id. 2 bits encoding will be `11`.
        - size of field name:
            - The `3 bits size: 0~7`  will be used to indicate length `1~7`, the value `6` the size read more bytes,
              the encoding will encode `size - 7` as a varint next.
            - If encoding is `TAG_ID`, then num_bytes of field name will be used to store tag id.
    - Field name: If type id is set, type id will be used instead. Otherwise meta string encoding length and data will
      be written instead.
```

## Meta size
Before this PR:
```java
class org.apache.fury.benchmark.data.MediaContent 78
class org.apache.fury.benchmark.data.Media 208
class org.apache.fury.benchmark.data.Image 114
```

With this PR:
```java
class org.apache.fury.benchmark.data.MediaContent 53
class org.apache.fury.benchmark.data.Media 114
class org.apache.fury.benchmark.data.Image 68
```

The size of class meta reduced by half, which is a great gain.

The size can be reduded more if we introduce field name hash, but it's
not related to this PR. We can discuss it in another PR.

## Related issues

#1240 
#203 
#202 


## Does this PR introduce any user-facing change?

<!--
If any user-facing interface changes, please [open an
issue](https://github.com/apache/incubator-fury/issues/new/choose)
describing the need to do so and update the document if necessary.
-->

- [ ] Does this PR introduce any public API change?
- [ ] Does this PR introduce any binary protocol compatibility change?


## Benchmark

<!--
When the PR has an impact on performance (if you don't know whether the
PR will have an impact on performance, you can submit the PR first, and
if it will have impact on performance, the code reviewer will explain
it), be sure to attach a benchmark data here.
-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Spec][Doc] fury cross-language serialization specification proposal
5 participants