docs(java): standardizing fury java spec #1240

Merged
merged 35 commits into from
Feb 28, 2024
Changes from 9 commits
Commits
47e0866
fury java spec
chaokunyang Dec 18, 2023
ba25980
refine fury java spec
chaokunyang Dec 20, 2023
be7c25a
Merge remote-tracking branch 'ant/main' into refine_fury_java_spec
chaokunyang Dec 20, 2023
a8fce2f
add map format spec
chaokunyang Dec 21, 2023
857a4c1
add object meta
chaokunyang Dec 21, 2023
1eebe57
remove .DS_Store
chaokunyang Dec 21, 2023
132ffc9
refine notion for object field meta
chaokunyang Dec 21, 2023
a5587d4
add spec for format specification
chaokunyang Dec 21, 2023
cbec1c2
refine class meta doc
chaokunyang Dec 22, 2023
bb79de3
refine meta shar doc
chaokunyang Dec 22, 2023
a4d6b9a
refine byte order
chaokunyang Dec 23, 2023
7d5b7d3
refine Meta share syntax
chaokunyang Dec 23, 2023
e8b1c6b
fix syntax
chaokunyang Dec 25, 2023
3da3279
Merge remote-tracking branch 'ant/main' into refine_fury_java_spec
chaokunyang Dec 27, 2023
a7c6f37
Merge remote-tracking branch 'ant/main' into refine_fury_java_spec
chaokunyang Jan 15, 2024
82af166
refine spec
chaokunyang Jan 15, 2024
31dc85b
refine object spec
chaokunyang Jan 15, 2024
0b70c9c
add small string size opt
chaokunyang Jan 21, 2024
ba55685
refactor class meta
chaokunyang Feb 26, 2024
2dd9bf8
update others layers class meta
chaokunyang Feb 26, 2024
cc7f8e8
update others layers class meta
chaokunyang Feb 26, 2024
be15667
update flags
chaokunyang Feb 26, 2024
a9d0c27
refine doc
chaokunyang Feb 27, 2024
6b41026
refine map spec
chaokunyang Feb 27, 2024
fd67ca4
Merge remote-tracking branch 'ant/main' into refine_fury_java_spec
chaokunyang Feb 27, 2024
bb98c4f
Update docs/protocols/java_object_graph_spec.md
chaokunyang Feb 27, 2024
5219aec
Update docs/protocols/java_object_graph_spec.md
chaokunyang Feb 27, 2024
894f002
Update docs/protocols/java_object_graph_spec.md
chaokunyang Feb 27, 2024
bef7b2f
update doce
chaokunyang Feb 27, 2024
2564342
Update docs/protocols/java_object_graph_spec.md
chaokunyang Feb 28, 2024
747980f
update doce
chaokunyang Feb 28, 2024
e192432
Update docs/protocols/java_object_graph_spec.md
chaokunyang Feb 28, 2024
717acb9
Update docs/protocols/java_object_graph_spec.md
chaokunyang Feb 28, 2024
80a652a
Update docs/protocols/java_object_graph_spec.md
chaokunyang Feb 28, 2024
2752ab6
Update docs/protocols/java_object_graph_spec.md
chaokunyang Feb 28, 2024
6 changes: 3 additions & 3 deletions docs/protocols/README.md
@@ -1,4 +1,4 @@
# Serialization Protocols
- For Java Object Graph Protocol, see [java_object_graph_format](java_object_graph.md) doc.
- For Cross Language Object Graph Protocol, see [xlang_object_graph_format](./xlang_object_graph.md) doc.
- For Row Format Protocol, see [row format](./row_format.md) doc.
- For Java Object Graph Protocol, see [java_object_graph_format_spec](java_object_graph_spec.md) doc.
- For Cross Language Object Graph Protocol, see [xlang_object_graph_format_spec](./xlang_object_graph.md) doc.
- For Row Format Protocol, see [row format_spec](./row_format.md) doc.
@@ -2,15 +2,26 @@

## Spec overview

The data are serialized using little endian order overall. If bytes swap is costly, the byte order will be encoded as a
flag in data.
Fury Java Serialization is an automatic object serialization framework that supports references and polymorphism. Fury
converts objects to and from the Fury Java serialization binary format. Fury has two core concepts for Java serialization:

- **Fury Java Binary format**
- Framework to convert object to/from Fury Java Binary format

The overall format are:
The serialization format is a dynamic binary format. The dynamics and reference/polymorphism support make Fury flexible
and much easier to use, but they also introduce more complexity than static serialization frameworks, so the format is
more complex.

Here is the overall format:

```
| fury header | object ref meta | object class meta | object value data |
```

The data is serialized in little endian order overall. If byte swapping is costly, the byte order will be encoded as a
flag in the data.
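
As a minimal sketch of what little endian order means for a 32-bit value, the following uses the standard `java.nio.ByteBuffer` API. The class and method names (`LittleEndianSketch`, `writeIntLE`, `readIntLE`) are hypothetical and are not part of Fury.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of Fury: shows little-endian layout of an int.
final class LittleEndianSketch {
  // Writes a 32-bit int with the least significant byte first.
  static byte[] writeIntLE(int value) {
    return ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(value).array();
  }

  // Reads back a 32-bit int that was written least significant byte first.
  static int readIntLE(byte[] bytes) {
    return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
  }
}
```

For example, `writeIntLE(0x01020304)` places `0x04` in the first byte and `0x01` in the last.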

## Fury header

Fury header starts with one byte:
@@ -44,17 +55,23 @@ When reference tracking is disabled globally or only for some type, or for some
field of a class, only `NULL FLAG` and `NOT_NULL VALUE FLAG` will be used.

## Class Meta
Fury supports registering a class with an optional id. The registration can be used for security checks and to identify the class.
If the class is registered, it will have a user-provided or auto-incremented unsigned int id, `class_id`.

Depending on whether meta share mode is enabled, Fury will write class meta differently.
Depending on whether meta share mode or registration is enabled for the current class, Fury will write class meta differently.

### Schema consistent

If schema consistent mode is enabled globally or enabled for current class, class meta will be written as follows:

- If the class is registered, it will be written as a little-endian unsigned int: `class_id << 1`, using the Fury unsigned int
format.
- If class is not registered, fury will write one byte `0b1` first, the little bit is different first bit of encoded
class id, which is `0`. Fury can use this information to determine whether read class by class id.
- If the class is not registered, Fury will write one byte `0b01/0b11` first, then write the class name.
- The higher bit will be 1 if the class is an
array, and the written class will be the component class. This can reduce the array class name cost if the component class is
serialized before.
- The lowest bit is `1`, which differs from the lowest bit of an
encoded class id, which is always `0`. Fury can use this bit to determine whether to read the class by class id.
- If meta share mode is enabled, the class will be written as an unsigned int.
- If meta share mode is not enabled, the class will be written as two enumerated strings:
- package name.
@@ -64,24 +81,54 @@ If schema consistent mode is enabled globally or enabled for current class, clas

If schema evolution mode is enabled globally or enabled for current class, class meta will be written as follows:

- If meta share mode is not enabled, class meta will be written as scheme consistent mode, field meta such as field type
- If meta share mode is not enabled, class meta will be written as in schema consistent mode. Additionally, field meta such as field type
and name will be written when the object value is being serialized using a key-value like layout.
- If meta share mode is enabled, the class will be written as an unsigned int.
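
The class header rules from the schema consistent section can be sketched as plain bit operations. This is only an illustration: `ClassHeaderSketch` and its methods are hypothetical names, and the variable-length "Fury unsigned int" encoding of the resulting value is deliberately left out.

```java
// Hypothetical sketch, not Fury's implementation: the lowest bit tells
// a registered class id (0) apart from an unregistered class name (1).
final class ClassHeaderSketch {
  // Registered class: class_id << 1, so the lowest bit is always 0.
  static int encodeRegistered(int classId) {
    return classId << 1;
  }

  // Unregistered class: 0b01 for a plain class, 0b11 when the class is an array
  // and the component class is written instead.
  static int encodeUnregistered(boolean isArray) {
    return isArray ? 0b11 : 0b01;
  }

  static boolean isRegistered(int header) {
    return (header & 1) == 0;
  }

  // Only meaningful for unregistered headers: the higher bit marks an array.
  static boolean isArray(int header) {
    return !isRegistered(header) && (header & 0b10) != 0;
  }

  static int classId(int header) {
    return header >>> 1;
  }
}
```

A reader would check `isRegistered` first and then either recover the id with `classId` or go on to read the class name.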

## Meta share

> This mode forbids streaming writes, since it needs to look back to update the offset after the whole object graph
> has been written and meta collection is finished.
> TODO: We have plan to streamline meta writing but not started yet.
> We plan to streamline meta writing but haven't started yet.

### Schema consistent

Class will be encoded as an enumerated string by full class name.

### Schema evolution

Class meta format:

```
| meta header: hash + num classes | current class meta | parent class meta | ... |
```

#### Meta header

Meta header is a 64 bits number value encoded in little endian order.

- The lowest 4 bits `0b0000~0b1111` are used to record num classes. `0b1111` is reserved to indicate that Fury needs to
read more bytes for the length using Fury unsigned int encoding. If the current class doesn't have a parent class, or the parent
class doesn't have fields to serialize, or we're in a context which serializes fields of the current class
only (`ObjectStreamSerializer#SlotInfo` is an example), num classes will be 0.
- The other 60 bits are used to store the murmur hash of `flags + all layers class meta`.
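
A minimal sketch of packing and unpacking such a 64-bit meta header follows. It assumes the hash occupies the upper 60 bits and is simply shifted left by 4, which the text does not spell out; `MetaHeaderSketch` is a hypothetical name, and the extended-length case signalled by `0b1111` is not handled.

```java
// Hypothetical sketch, not Fury's implementation.
final class MetaHeaderSketch {
  static final int NUM_CLASSES_BITS = 4;
  static final long NUM_CLASSES_MASK = 0b1111L;

  // Packs the low 60 bits of the hash above a 4-bit num-classes field.
  // 0b1111 is reserved for "read more bytes", so this sketch rejects it.
  static long pack(long hash, int numClasses) {
    if (numClasses < 0 || numClasses >= 0b1111) {
      throw new IllegalArgumentException("extended length is not covered by this sketch");
    }
    return (hash << NUM_CLASSES_BITS) | numClasses;
  }

  static int numClasses(long header) {
    return (int) (header & NUM_CLASSES_MASK);
  }

  // Returns the 60 hash bits stored in the header.
  static long hashBits(long header) {
    return header >>> NUM_CLASSES_BITS;
  }
}
```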

#### Single layer class meta

```
| enumerated class name string | unsigned int: num fields | field info: type info + field name | next field info | ... |
```

Type info of a custom type field will be written as a one-byte flag instead of inlining its meta, because the field value may be null, and Fury can skip writing this field type meta if an object of this type is serialized in the current object graph.

Field order is left as an implementation detail and is not exposed in the specification; deserialization needs to
re-sort fields based on the Fury field comparator. In this way, Fury can compute statistics for field names or types and
use a more compact encoding.

## Enumerated String

Enumerated string are mainly used to encode class name and field names. The format consists of header and binary.
Enumerated strings are mainly used to encode meta strings such as class names and field names. The format consists of a header
and binary.

The header is written in little endian order; Fury can read this flag first to determine how to deserialize the data.

@@ -95,6 +142,16 @@ If string hasn't been written before, the data will be written as follows:
| unsigned int: string binary size + 1bit: not written before | 61bits: murmur hash + 3 bits encoding flags | string binary |
```

Murmur hash can be omitted if the caller passes a flag. In such cases, the format will be:

```
| unsigned int: string binary size + 1bit: not written before | 8 bits encoding flags | string binary |
```

5 bits in `8 bits encoding flags` will be left empty.

Encoding flags:

| Encoding Flag | Pattern | Encoding Action |
|---------------|-----------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| 0 | every char is in `a-z._$\|` | `LOWER_SPECIAL` |
@@ -104,7 +161,9 @@ If string hasn't been written before, the data will be written as follows:
| 4 | any utf-8 char | use `UTF-8` encoding |

#### Write by ref

If string has been written before, the data will be written as follows:

```
| unsigned int: written string id + 1bit: written before |
```
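
The two enumerated string headers (new string versus reference to an earlier one) can be sketched as below. The concrete bit values are assumptions: the text only says one bit distinguishes the two cases, so `WRITTEN_BEFORE = 1` and the placement of the 3 encoding bits in the lowest bits are illustrative choices, and `EnumStringHeaderSketch` is a hypothetical name.

```java
// Hypothetical sketch, not Fury's implementation.
final class EnumStringHeaderSketch {
  static final int NOT_WRITTEN_BEFORE = 0; // assumed bit value
  static final int WRITTEN_BEFORE = 1;     // assumed bit value

  // Header for a string written for the first time: binary size plus one flag bit.
  static int newStringHeader(int binarySize) {
    return (binarySize << 1) | NOT_WRITTEN_BEFORE;
  }

  // Header for a string written earlier: its id plus one flag bit.
  static int refHeader(int stringId) {
    return (stringId << 1) | WRITTEN_BEFORE;
  }

  static boolean isRef(int header) {
    return (header & 1) == WRITTEN_BEFORE;
  }

  // Binary size for a new string, or string id for a reference.
  static int payload(int header) {
    return header >>> 1;
  }

  // 61 hash bits combined with 3 encoding flag bits (assumed to be the lowest bits).
  static long hashAndFlags(long hash, int encodingFlag) {
    return (hash << 3) | (encodingFlag & 0b111);
  }

  static int encodingFlag(long hashAndFlags) {
    return (int) (hashAndFlags & 0b111);
  }
}
```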
@@ -186,8 +245,12 @@ String binary encoding:

Format:

- one byte for encoding: 0 for `latin`, 1 for `utf-16`, 2 for `utf-8`.
- positive varint for encoded string binary length.
```
| header: size + encoding | binary data |
```

- `size + encoding` will be concatenated as a long and encoded as an unsigned var long. The lowest 2 bits are used for the encoding:
0 for `latin`, 1 for `utf-16`, 2 for `utf-8`.
- encoded string binary data based on encoding: `latin/utf-16/utf-8`.
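
A small sketch of the string header described above, assuming the size is shifted left by 2 and the encoding sits in the lowest 2 bits. `StringHeaderSketch` is a hypothetical name, and the unsigned var long encoding of the packed value is not shown.

```java
// Hypothetical sketch, not Fury's implementation.
final class StringHeaderSketch {
  static final int LATIN = 0;
  static final int UTF16 = 1;
  static final int UTF8 = 2;

  // Concatenates the binary size and the 2-bit encoding into one long.
  static long pack(long binarySize, int encoding) {
    return (binarySize << 2) | (encoding & 0b11);
  }

  static int encoding(long header) {
    return (int) (header & 0b11);
  }

  static long binarySize(long header) {
    return header >>> 2;
  }
}
```

Packing both values into one number means short strings still need only a single header byte once the value is varint-encoded.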

Which encoding to choose:
@@ -200,7 +263,7 @@ Which encoding to choose:

### Collection

> All collection serializer must extends `io.fury.serializer.collection.CollectionSerializer`.
> All collection serializers must extend `AbstractCollectionSerializer`.

Format:

@@ -220,11 +283,11 @@ In most cases, all collection elements are same type and not null, elements head
information to avoid the cost of writing it for every element. Specifically, there are four kinds of information
which will be encoded by the elements header, each using one bit:

- Whether track elements ref, use first bit `0b1` of header to flag it.
- Whether collection has null, use second bit `0b10` of header to flag it. If ref tracking is enabled for this
- If element refs are tracked, the first bit `0b1` of the header is set.
- If the collection has null elements, the second bit `0b10` of the header is set. If ref tracking is enabled for this
element type, this flag is invalid.
- Whether collection elements type is not declare type, use 3rd bit `0b100` of header to flag it.
- Whether collection elements type different, use 4rd bit `0b1000` of header to flag it.
- If the collection element type is not the declared type, the 3rd bit `0b100` of the header is set.
- If the collection elements have different types, the 4th bit `0b1000` of the header is set.

By default, all bits are unset, which means elements won't track refs, all elements are the same type and not null, and the
actual element type is the declared type in the custom class field.
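
The four header bits can be computed in one pass over the collection. The sketch below is an assumption about how such a pass might look, not Fury's code; `ElementsHeaderSketch`, its constant names, and the `compute` signature are hypothetical.

```java
import java.util.Collection;

// Hypothetical sketch, not Fury's implementation.
final class ElementsHeaderSketch {
  static final int TRACKING_REF = 0b1;
  static final int HAS_NULL = 0b10;
  static final int NOT_DECL_ELEMENT_TYPE = 0b100;
  static final int NOT_SAME_TYPE = 0b1000;

  static int compute(Collection<?> elements, Class<?> declaredType, boolean trackRef) {
    int header = trackRef ? TRACKING_REF : 0;
    Class<?> first = null;
    for (Object e : elements) {
      if (e == null) {
        // With ref tracking the null case is covered by the ref flag, so HAS_NULL stays unset.
        if (!trackRef) {
          header |= HAS_NULL;
        }
        continue;
      }
      if (first == null) {
        first = e.getClass();
      } else if (first != e.getClass()) {
        header |= NOT_SAME_TYPE;
      }
      if (e.getClass() != declaredType) {
        header |= NOT_DECL_ELEMENT_TYPE;
      }
    }
    return header;
  }
}
```

A list of non-null `String` values declared as `List<String>` yields `0`, so the element data can skip the ref flag, null flag and class info entirely.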
@@ -233,24 +296,144 @@ actual element is the declare type in custom class field.

Based on the elements header, the serialization of elements data may skip `ref flag`/`null flag`/`element class info`.

`io.fury.serializer.collection.CollectionSerializer#write/read` can be taken as an example.
`CollectionSerializer#write/read` can be taken as an example.

### Array

#### Primitive array

Primitive arrays are treated as binary buffers: serialization just writes the array length as an unsigned int,
then copies the whole buffer into the stream.

Such serialization won't compress the array. If users want to compress primitive arrays, they need to register custom
serializers for such types.
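
As a rough sketch of "length, then raw buffer", the following writes an `int[]` with `java.nio.ByteBuffer`. It makes two simplifying assumptions: the length is the element count, and it is written as a fixed 4-byte int rather than with Fury's unsigned int encoding. `PrimitiveArraySketch` is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch, not Fury's implementation.
final class PrimitiveArraySketch {
  static byte[] writeIntArray(int[] values) {
    ByteBuffer buf = ByteBuffer.allocate(4 + values.length * 4).order(ByteOrder.LITTLE_ENDIAN);
    buf.putInt(values.length);     // length prefix (fixed width in this sketch)
    buf.asIntBuffer().put(values); // bulk copy of the elements, no per-element framing
    return buf.array();
  }

  static int[] readIntArray(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    int[] values = new int[buf.getInt()];
    buf.asIntBuffer().get(values);
    return values;
  }
}
```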

#### Object array

Object array is serialized using collection format. Object component type will be taken as collection element generic
type.

### Map

> All map serializers must extend `AbstractMapSerializer`.

Format:

```
| length(unsigned varint) | map header | key value pairs data |
```

#### Map header

- For `HashMap/LinkedHashMap`, this will be empty.
- For `TreeMap`, this will be `Comparator`
- For other `Map`, this may be extra object field info.

#### Map Key-Value data

Map iteration is expensive, so Fury can't compute the header in advance as it does for collections, since that introduces
[considerable overhead](https://github.com/alipay/fury/issues/925).
Users can use the `MapFieldInfo` annotation to provide the header in advance. Otherwise Fury will use the first key-value pair to
predict the header optimistically, and update the chunk header if the prediction fails at some pair.

Fury will serialize the map chunk by chunk; every chunk
has 127 pairs at most.

```
+----------------+----------------+~~~~~~~~~~~~~~~~~+
| chunk size: N | KV header | N*2 objects |
+----------------+----------------+~~~~~~~~~~~~~~~~~+
```
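
The size limit on chunks can be illustrated by splitting a map's entries into groups of at most 127 pairs. This sketch covers only the size rule; it does not model ending a chunk early when the predicted header stops matching. `MapChunkSketch` and `chunks` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Fury's implementation.
final class MapChunkSketch {
  static final int MAX_CHUNK_SIZE = 127;

  // Splits the entries into consecutive chunks of at most MAX_CHUNK_SIZE pairs.
  static <K, V> List<List<Map.Entry<K, V>>> chunks(Map<K, V> map) {
    List<List<Map.Entry<K, V>>> result = new ArrayList<>();
    List<Map.Entry<K, V>> current = new ArrayList<>();
    for (Map.Entry<K, V> entry : map.entrySet()) {
      if (current.size() == MAX_CHUNK_SIZE) {
        result.add(current);
        current = new ArrayList<>();
      }
      current.add(entry);
    }
    if (!current.isEmpty()) {
      result.add(current);
    }
    return result;
  }
}
```

With a limit of 127, the chunk size always fits in 7 bits.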

KV header:

- If key refs are tracked, the first bit `0b1` of the header is set.
- If a key is null, the second bit `0b10` of the header is set. If ref tracking is enabled for this
key type, this flag is invalid.
- If the map key type is not the declared type, the 3rd bit `0b100` of the header is set.
- If map keys have different types, the 4th bit `0b1000` of the header is set.
- If value refs are tracked, the 5th bit `0b10000` of the header is set.
- If a value is null, the 6th bit `0b100000` of the header is set. If ref tracking is enabled for this
value type, this flag is invalid.
- If the map value type is not the declared type, the 7th bit `0b1000000` of the header is set.
- If map values have different types, the 8th bit `0b10000000` of the header is set.
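
Predicting a KV header from a single key-value pair can be sketched as follows: the key flags occupy the low 4 bits and the value flags use the same layout shifted left by 4. The "types differ" bits cannot be known from one pair, so they stay unset here. `KvHeaderSketch` and its methods are hypothetical names, not Fury's API.

```java
// Hypothetical sketch, not Fury's implementation.
final class KvHeaderSketch {
  static final int TRACK_REF = 0b1;
  static final int HAS_NULL = 0b10;
  static final int NOT_DECL_TYPE = 0b100;
  static final int VALUE_SHIFT = 4;

  // Flags for one side (key or value) of the pair.
  static int sideFlags(Object o, Class<?> declaredType, boolean trackRef) {
    int flags = trackRef ? TRACK_REF : 0;
    if (o == null) {
      if (!trackRef) {
        flags |= HAS_NULL;
      }
    } else if (o.getClass() != declaredType) {
      flags |= NOT_DECL_TYPE;
    }
    return flags;
  }

  // Optimistic header predicted from the first pair of a chunk.
  static int predict(Object key, Object value, Class<?> keyType, Class<?> valueType,
      boolean trackKeyRef, boolean trackValueRef) {
    return sideFlags(key, keyType, trackKeyRef)
        | (sideFlags(value, valueType, trackValueRef) << VALUE_SHIFT);
  }
}
```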

If streaming write is enabled, Fury can't update the written `chunk size`. In such cases, the map key-value data
format will be:

```
+----------------+~~~~~~~~~~~~~~~~~+
| KV header | N*2 objects |
+----------------+~~~~~~~~~~~~~~~~~+
```

`KV header` will be the header marked by `MapFieldInfo` in Java. For languages such as golang, this can mostly be computed in
advance for non-interface types.

### Enum

Enum are serialized as an
Enums are serialized as an unsigned var int. If the order of enum values changes, the deserialized enum value may not be
the value users expect. In such cases, users must register an enum serializer that writes the enum value as an enumerated
string with unique hash disabled.
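
The sketch below illustrates writing an enum by ordinal. It assumes the unsigned var int is a common base-128 varint (7 data bits per byte, high bit as continuation), which this section does not define; `EnumOrdinalSketch`, `Color`, and the method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch, not Fury's implementation.
final class EnumOrdinalSketch {
  enum Color { RED, GREEN, BLUE }

  // Assumed base-128 varint: low 7 bits first, high bit set while more bytes follow.
  static byte[] writeVarUint(int value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.write(value);
    return out.toByteArray();
  }

  static int readVarUint(byte[] bytes) {
    int result = 0;
    int shift = 0;
    for (byte b : bytes) {
      result |= (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        break;
      }
      shift += 7;
    }
    return result;
  }

  static byte[] write(Color color) {
    return writeVarUint(color.ordinal());
  }

  // Decoding by position is what breaks if enum constants are reordered.
  static Color read(byte[] bytes) {
    return Color.values()[readVarUint(bytes)];
  }
}
```

If `GREEN` and `BLUE` swapped places between writer and reader, `read` would silently return the wrong constant, which is why name-based serialization is recommended in that case.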

### Object

Object means object of `pojo/struct/bean` type.
Object will be serialized by writing its fields data in Fury order.

Depending on schema compatibility, objects will have different formats.

#### Field order

Fields will be ordered as follows; every group of fields will have its own order:

- primitive fields: larger size type first, smaller later, variable size type last.
- boxed primitive fields: same sort as primitive fields
- final fields: same type together, then sort by field name lexicographically.
- collection fields: same sort as final fields
- map fields: same sort as final fields
- other fields: same sort as final fields
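
The primitive-field rule from the list above can be sketched with a comparator over reflected fields. The tie-break by field name is an assumption (the list does not state one for primitives), variable-size types are ignored, and `FieldOrderSketch` and `Sample` are hypothetical names.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not Fury's field comparator.
final class FieldOrderSketch {
  static final class Sample {
    byte b;
    long l;
    int i;
    short s;
    double d;
  }

  // Size in bytes of a primitive type.
  static int size(Class<?> type) {
    if (type == long.class || type == double.class) return 8;
    if (type == int.class || type == float.class) return 4;
    if (type == short.class || type == char.class) return 2;
    return 1; // byte, boolean
  }

  // Names of the primitive instance fields: larger types first, then by name (assumed tie-break).
  static List<String> orderPrimitives(Class<?> cls) {
    List<Field> fields = new ArrayList<>();
    for (Field f : cls.getDeclaredFields()) {
      if (f.getType().isPrimitive() && !f.isSynthetic() && !Modifier.isStatic(f.getModifiers())) {
        fields.add(f);
      }
    }
    fields.sort(Comparator.comparingInt((Field f) -> -size(f.getType()))
        .thenComparing(Field::getName));
    List<String> names = new ArrayList<>();
    for (Field f : fields) {
      names.add(f.getName());
    }
    return names;
  }
}
```

Sorting explicitly matters because `getDeclaredFields` makes no ordering guarantee, while writer and reader must agree on the order.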

#### Schema consistent

Object fields will be serialized one by one using following format:

```
Primitive field value:
+~~~~~~~~~~~~+
| value data |
+~~~~~~~~~~~~+
Boxed field value:
+-----------+~~~~~~~~~~~~~+
| null flag | field value |
+-----------+~~~~~~~~~~~~~+
field value of final type with ref tracking:
+===========+~~~~~~~~~~~~+
| ref meta | value data |
+===========+~~~~~~~~~~~~+
field value of final type without ref tracking:
+-----------+~~~~~~~~~~~~~+
| null flag | field value |
+-----------+~~~~~~~~~~~~~+
field value of non-final type with ref tracking:
+===========+~~~~~~~~~~~~+~~~~~~~~~~~~+
| ref meta | class meta | value data |
+===========+~~~~~~~~~~~~+~~~~~~~~~~~~+
field value of non-final type without ref tracking:
+-----------+~~~~~~~~~~~~+~~~~~~~~~~~~+
| null flag | class meta | value data |
+-----------+~~~~~~~~~~~~+~~~~~~~~~~~~+
```
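
To make the first two layouts concrete, the sketch below writes a primitive `int` field and a boxed `Integer` field. The flag byte values are assumptions chosen for illustration, since the flag table is not shown in this part of the spec, and `FieldLayoutSketch` is a hypothetical name.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch, not Fury's implementation.
final class FieldLayoutSketch {
  static final byte NULL_FLAG = -3;           // assumed value
  static final byte NOT_NULL_VALUE_FLAG = -1; // assumed value

  // Primitive field: value data only, no flag byte.
  static void writePrimitiveInt(ByteBuffer buf, int value) {
    buf.putInt(value);
  }

  // Boxed field: a null flag, followed by the value only when it is present.
  static void writeBoxedInt(ByteBuffer buf, Integer value) {
    if (value == null) {
      buf.put(NULL_FLAG);
    } else {
      buf.put(NOT_NULL_VALUE_FLAG);
      buf.putInt(value);
    }
  }

  static Integer readBoxedInt(ByteBuffer buf) {
    return buf.get() == NULL_FLAG ? null : Integer.valueOf(buf.getInt());
  }
}
```

In this sketch a null boxed field costs one byte and a non-null one costs five, compared with four bytes for the primitive field.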

#### Schema evolution
Schema evolution mode has a similar format to schema consistent mode for objects, except:
- For the object type itself, `schema consistent` mode will write the class by id/name, but `schema evolution` mode will also write class field names, types and other meta, see [Class meta](#class-meta).
- Class meta of `final custom type` needs to be written too, because the peer may not have this class defined.

### Class

Class will be serialized using class meta format.

## Implementation guidelines

- Try to merge multiple bytes into an int/long write before writing to reduce memory IO and bound check cost.