-
Notifications
You must be signed in to change notification settings - Fork 239
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(spec): standardizing fury cross-language serialization specification #1413
feat(spec): standardizing fury cross-language serialization specification #1413
Conversation
Hi @theweipeng @PragmaTwice @bytemain @LiangliangSui , I finished the fury cross-language serialization specification in this PR. It would be great it you can help review this. This is lots of work. I take a deep look at all kinds serialization framework and made lots of compromises. Any suggestions would be really appreciative. |
Rust and C++ need more consideration, since they don't support dynamic codegen. We need to generate all code at compile-time using meta programming. The generated code may not be the best code for execution. For type evolution, the serializer will encode the type meta into the serialized data. The deserializer can compare this meta with the class in the current process, and use the diff to generate the serializer for deserialization.
|
The spec must allow the languages like rust/c++ which don't have runtime reflection work and runfast, and let languages like javasccript which has a weak type systems to work well too. |
I updated the implementation guide, the implementation for static languages can use
Here is an example, suppose process A has a class // class Foo with version 1
class Foo1 {
int32_t v1; // id 0
std::string v2; // id 1
}
// class Foo with version 2
class Foo2 {
// id 0, but will have id 2 in process A
bool v0;
// id 1, but will have id 0 in process A
int32_t v1;
// id 2, but will have id 3 in process A
int64_t long_value;
// id 3, but will have id 1 in process A
std::string v2;
// id 4, but will have id 4 in process A
std::vector<std::string> list;
} When process A received serialized Foo1 &foo1 = xxx;
std::vector<fury::FieldInfo> &field_infos = type_meta.field_infos;
for (auto &field_info : field_infos) {
switch (field_info.field_id) {
case 0:
foo1.v1 = buffer.read_varint32();
break;
case 1:
foo1.v2 = fury.read_string();
break;
default:
fury.skip_data(field_info);
}
} |
fractions of seconds at nanosecond resolution. | ||
- Timestamp: a point in time independent of any calendar/timezone, as a count of seconds and fractions of | ||
seconds at nanosecond resolution. The count is relative to an epoch at UTC midnight on January 1, 1970. | ||
- decimal: exact decimal value represented as an integer value in two's complement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's kind of decimal? decimal32/decimal64 in IEEE754? or an unlimited-precison decimal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An unlimited-precison decimal, the precison will be written in the data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arrow support decimal128/decimal256 only. I'm not sure whether fury should add such limit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we allow then it's same as float64/float32, which is very efficient and take fixed storage.
If we have unlimited-precision decimal, should we also have bigint and unlimited-precision float?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The decimal will be serialized like:
buffer.writeVarUint32(value.scale());
buffer.writeVarUint32(value.precision());
buffer.writeVarUint32(bytes.length);
buffer.writeBytes(value.unscaledValue().toByteArray());
Could you elaborate how decimal128/decimal256 save space or provide better performance?
One thing I can see is that we can skip write buffer.writeVarUint32(bytes.length)
, we can take binary took up 128/256 bytes. but other data are still unavoidable.
And if the unscaledValue is small, take new BigDecimal(BigInteger.valueOf(100), 200, MathContext.DECIMAL128)
for an example, it took only 6 bytes, but when using decimal128 format, it will take 131 bytes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great!, I don’t have any comments anymore, thank you for your hard work.
<!-- **Thanks for contributing to Fury.** **If this is your first time opening a PR on fury, you can refer to [CONTRIBUTING.md](https://github.com/apache/incubator-fury/blob/main/CONTRIBUTING.md).** Contribution Checklist - The **Apache Fury (incubating)** community has restrictions on the naming of pr titles. You can also find instructions in [CONTRIBUTING.md](https://github.com/apache/incubator-fury/blob/main/CONTRIBUTING.md). - Fury has a strong focus on performance. If the PR you submit will have an impact on performance, please benchmark it first and provide the benchmark result here. --> ## What does this PR do? This PR implements meta string encoding described in [fury java serialization spec](https://fury.apache.org/docs/specification/fury_java_serialization_spec#meta-string) and [xlang serialization spec](https://fury.apache.org/docs/specification/fury_xlang_serialization_spec#meta-string) We have `3/8` space saveing for most string: ```java // utf8 use 30 bytes, we use only 19 bytes assertEquals(encoder.encode("org.apache.fury.benchmark.data").getBytes().length, 19); // utf8 use 12 bytes, we use only 9 bytes. assertEquals(encoder.encode("MediaContent").getBytes().length, 9); ``` The integration with ClassResolver is left in another PR. ## Related issues #1240 #1413 ## Does this PR introduce any user-facing change? <!-- If any user-facing interface changes, please [open an issue](https://github.com/apache/incubator-fury/issues/new/choose) describing the need to do so and update the document if necessary. --> - [ ] Does this PR introduce any public API change? - [ ] Does this PR introduce any binary protocol compatibility change? ## Benchmark <!-- When the PR has an impact on performance (if you don't know whether the PR will have an impact on performance, you can submit the PR first, and if it will have impact on performance, the code reviewer will explain it), be sure to attach a benchmark data here. -->
## What does this PR do? This PR implements meta string encoding described in [xlang serialization spec](https://fury.apache.org/docs/specification/fury_xlang_serialization_spec#meta-string) ## Related issues - #1240 - #1413 - #1514 ## Does this PR introduce any user-facing change? - [No] Does this PR introduce any public API change? - [No] Does this PR introduce any binary protocol compatibility change? --------- Co-authored-by: Shawn Yang <chaokunyang@apache.org>
## What does this PR do? This PR removes list/map header from type meta spec. Such header may be computed ahead sometimes. But it may need to compute based the data at runtime. If we write header into type meta, we may still need to compute and write header when serialization. This will introduce extra space cost. Instead, we can write all such header at runtime, and when creating/generating serializer, we can compute part of such header ahead, and let remaing parts of header computed at serialization. When deserialization, the deserializer can generate different deserializer based on read header, which will be more efficient. ## Related issues #1413 ## Does this PR introduce any user-facing change? <!-- If any user-facing interface changes, please [open an issue](https://github.com/apache/incubator-fury/issues/new/choose) describing the need to do so and update the document if necessary. --> - [ ] Does this PR introduce any public API change? - [ ] Does this PR introduce any binary protocol compatibility change? ## Benchmark <!-- When the PR has an impact on performance (if you don't know whether the PR will have an impact on performance, you can submit the PR first, and if it will have impact on performance, the code reviewer will explain it), be sure to attach a benchmark data here. -->
## What does this PR do? This PR implements type meta encoding for java proposed in #1240 . The type meta encoding in xlang spec proposed in #1413 will be finished in another PR based on this PR. The spec has been updated too: type meta header ``` | 8 bytes meta header | meta size | variable bytes | variable bytes | variable bytes | +-------------------------------+-----------|--------------------+-------------------+----------------+ | 7 bytes hash + 1 bytes header | 1~2 bytes | current class meta | parent class meta | ... | ``` And the encoding for packge/class/field name has been updated to: ``` - Package name encoding(omitted when class is registered): - encoding algorithm: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL` - Header: `6 bits size | 2 bits encoding flags`. The `6 bits size: 0~63` will be used to indicate size `0~62`, the value `63` the size need more byte to read, the encoding will encode `size - 62` as a varint next. - Class name encoding(omitted when class is registered): - encoding algorithm: `UTF8/LOWER_UPPER_DIGIT_SPECIAL/FIRST_TO_LOWER_SPECIAL/ALL_TO_LOWER_SPECIAL` - header: `6 bits size | 2 bits encoding flags`. The `6 bits size: 0~63` will be used to indicate size `1~64`, the value `63` the size need more byte to read, the encoding will encode `size - 63` as a varint next. - Field info: - header(8 bits): `3 bits size + 2 bits field name encoding + polymorphism flag + nullability flag + ref tracking flag`. Users can use annotation to provide those info. - 2 bits field name encoding: - encoding: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL/TAG_ID` - If tag id is used, i.e. field name is written by an unsigned varint tag id. 2 bits encoding will be `11`. - size of field name: - The `3 bits size: 0~7` will be used to indicate length `1~7`, the value `6` the size read more bytes, the encoding will encode `size - 7` as a varint next. - If encoding is `TAG_ID`, then num_bytes of field name will be used to store tag id. - Field name: If type id is set, type id will be used instead. Otherwise meta string encoding length and data will be written instead. ``` ## Meta size Before this PR: ```java class org.apache.fury.benchmark.data.MediaContent 78 class org.apache.fury.benchmark.data.Media 208 class org.apache.fury.benchmark.data.Image 114 ``` With this PR: ```java class org.apache.fury.benchmark.data.MediaContent 53 class org.apache.fury.benchmark.data.Media 114 class org.apache.fury.benchmark.data.Image 68 ``` The size of class meta reduced by half, which is a great gain. The size can be reduded more if we introduce field name hash, but it's not related to this PR. We can discuss it in another PR. ## Related issues #1240 #203 #202 ## Does this PR introduce any user-facing change? <!-- If any user-facing interface changes, please [open an issue](https://github.com/apache/incubator-fury/issues/new/choose) describing the need to do so and update the document if necessary. --> - [ ] Does this PR introduce any public API change? - [ ] Does this PR introduce any binary protocol compatibility change? ## Benchmark <!-- When the PR has an impact on performance (if you don't know whether the PR will have an impact on performance, you can submit the PR first, and if it will have impact on performance, the code reviewer will explain it), be sure to attach a benchmark data here. -->
What does this PR do?
This PR standardizes fury cross-language serialization specification. It comes with following changes:
type tag
version can be seen in https://github.com/apache/incubator-fury/blob/6ea2e0b83d5449d63ca62296ff0dfd67b96c5bc5/docs/protocols/xlang_object_graph_spec.md .0~63
for internal types, but let users register type by id from0
(added by 64 automatically) to setup type mapping between languages.bool/byte/i16/i32/i64/half-float/float/double/string/enum/list/set/map/Duration/Timestamp/decimal/binary/array/tensor/sparse/tensor/arrow/record/batch/arrow/table
are allowed.This protocol also supports object inheriance for xlang serializaiton. This is a feature request that users has been discussed for a long time in protobuf/flatbuffer:
Although there are some languages such as
rust/golang
doesn't support inheriance, there are many cases only langauges likejava/c#/python/javascript
are involved, and the support for inheriance is not complexed in the protocol level, so we added the inheriance support in the protocol. And in languages such asrust/golang
, we can use some annotation to mark composition field as parent class for serialization layout, or we can disable inheriance foor such languages at the protocol level.The protocol support polymorphic natively by type id, so I don't include types such as
OneOf/Union
. With this protocol, you can even serialize multiple rustdyn trait
object which implement same trait., and get exactly the same objects when deserialization.Related issue
This PR Closes #1418