diff --git a/docs/protocols/README.md b/docs/protocols/README.md index 9fd99accca..041a12f653 100644 --- a/docs/protocols/README.md +++ b/docs/protocols/README.md @@ -1,4 +1,4 @@ # Serialization Protocols -- For Java Object Graph Protocol, see [java_object_graph_format](java_object_graph.md) doc. -- For Cross Language Object Graph Protocol, see [xlang_object_graph_format](./xlang_object_graph.md) doc. -- For Row Format Protocol, see [row format](./row_format.md) doc. +- For Java Object Graph Protocol, see [java_object_graph_format_spec](java_object_graph_spec.md) doc. +- For Cross Language Object Graph Protocol, see [xlang_object_graph_format_spec](./xlang_object_graph.md) doc. +- For Row Format Protocol, see [row format_spec](./row_format.md) doc. diff --git a/docs/protocols/java_object_graph.md b/docs/protocols/java_object_graph.md deleted file mode 100644 index a2627444b0..0000000000 --- a/docs/protocols/java_object_graph.md +++ /dev/null @@ -1,259 +0,0 @@ -# Fury Java Serialization Specification - -## Spec overview - -The data are serialized using little endian order overall. If bytes swap is costly, the byte order will be encoded as a -flag in data. - -The overall format are: - -``` -| fury header | object ref meta | object class meta | object value data | -``` - -## Fury header - -Fury header consists starts one byte: - -``` -| resvered 4 bits | oob | xlang | endian | null | -``` - -- null flag: set when object is null, unset otherwise. If object is null, other bits won't be set. -- endian flag: set when system use little endian, unset otherwise. -- xlang flag: set when serialization uses xlang format, unset when serialization use Fury java format. -- oob flag: set when passed `BufferCallback` is not null, unset otherwise. - -If meta share mode is enabled, uncompressed little-endian 4 bytes is appended to indicate the start offset of meta data. - -## Reference Meta - -Reference tracking handles whether the object is null, and whether to track reference for the object by writing -corresponding flags and maintain internal state. - -Reference flags: - -| Flag | Byte Value | Description | -|---------------------|------------|-------------------------------------------------------------------------------------------------------------------------------| -| NULL FLAG | `-3` | This flag indicates that object is a not-null value. We don't use another byte to indicate REF, so that we can save one byte. | -| REF FLAG | `-2` | this flag indicates the object is written before, and fury will write a unsigned ref id instead of serialize it again | -| NOT_NULL VALUE FLAG | `-1` | this flag indicates that the object is a non-null value and fury doesn't track ref for this type of object. | -| REF VALUE FLAG | `0` | this flag indicates that the object is a referencable and first read. | - -When reference tracking is disabled globally or only for some type, or for some type under some context such as some -field of a class, only `NULL FLAG` and ` NOT_NULL VALUE FLAG` will be used. - -## Class Meta - -Depending on whether meta share mode is enabled, Fury will write class meta differently. - -### Schema consistent - -If schema consistent mode is enabled globally or enabled for current class, class meta will be written as follows: - -- If class is registered, it will be written as a little-endian unsigned int: `class_id << 1` using fury unsigned int - format. -- If class is not registered, fury will write one byte `0b1` first, the little bit is different first bit of encoded - class id, which is `0`. Fury can use this information to determine whether read class by class id. - - If meta share mode is enabled, class will be written as a unsigned int. - - If meta share mode is not enabled, class will be written as two enumerated string: - - package name. - - class name. - -### Schema evolution - -If schema evolution mode is enabled globally or enabled for current class, class meta will be written as follows: - -- If meta share mode is not enabled, class meta will be written as scheme consistent mode, field meta such as field type - and name will be written when the object value is being serialized using a key-value like layout. -- If meta share mode is enabled, class will be written as a unsigned int. - -## Meta share -> This mode will forbid streaming writing since it needs to look back for update the offset after the whole object graph -> writing and mete collecting is finished. -> TODO: We have plan to streamline meta writing but not started yet. - -### Schema consistent - - -### Schema evolution - - -## Enumerated String - -Enumerated string are mainly used to encode class name and field names. The format consists of header and binary. - -Header are written using little endian order, Fury can read this flag first to determine how to deserialize the data. - -### Header - -#### Write by data - -If string hasn't been written before, the data will be written as follows: - -``` -| unsigned int: string binary size + 1bit: not written before | 61bits: murmur hash + 3 bits encoding flags | string binary | -``` - -| Encoding Flag | Pattern | Encoding Action | -|---------------|-----------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------| -| 0 | every char is in `a-z._$\|` | `LOWER_SPECIAL` | -| 1 | every char is in `a-z._$` except first char is upper case | replace first upper case char to lower case, then use `LOWER_SPECIAL` | -| 2 | every char is in `a-zA-Z._$` | replace every upper case char by `\|` + `lower case`, then use `LOWER_SPECIAL`, use this encoding if it's smaller than Encoding `3` | -| 3 | every char is in `a-zA-Z._$` | use `LOWER_UPPER_DIGIT_SPECIAL` encoding if it's smaller than Encoding `2` | -| 4 | any utf-8 char | use `UTF-8` encoding | - -#### Write by ref -If string has been written before, the data will be written as follows: -``` -| unsigned int: written string id + 1bit: written before | -``` - -### String binary - -String binary encoding: - -| Algorithm | Pattern | Description | -|---------------------------|----------------|-------------------------------------------------------------------------------------------------------------------------------------------------| -| LOWER_SPECIAL | `a-z._$\|` | every char is writen using 5 bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101` | -| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9._$` | every char is writen using 6 bits, `a-z`: `0b00000~0b11110`, `A-Z`: `0b11010~0b110011`, `0~9`: `0b110100~0b111101`, `._$`: `0b111110~0b1000000` | -| UTF-8 | any chars | UTF-8 encoding | - -## Value Format - -### Basic types - -#### Bool - -- size: 1 byte -- format: 0 for `false`, 1 for `true` - -#### Byte - -- size: 1 byte -- format: write as pure byte. - -#### Short - -- size: 2 byte -- byte order: little endian order - -#### Char - -- size: 2 byte -- byte order: little endian order - -#### Unsigned int - -- size: 1~5 byte -- Format: first bit in every byte indicate whether to has next byte. if first bit is set i.e. `b & 0x80 == 0x80`, then - next byte should be read util first bit of next byte is unset. - -#### Signed int - -- size: 1~5 byte -- Format: First convert the number into positive unsigned int by `(v << 1) ^ (v >> 31)` ZigZag algorithm, then encoding - it as an unsigned int. - -#### Unsigned long - -- size: 1~9 byte -- Fury PVL(Progressive Variable-length Long) Encoding: - - positive long format: first bit in every byte indicate whether to has next byte. if first bit is set - i.e. `b & 0x80 == 0x80`, then next byte should be read util first bit is unset. - -#### Signed long - -- size: 1~9 byte -- Fury SLI(Small long as int) Encoding: - - If long is in [-1073741824, 1073741823], encode as 4 bytes int: `| little-endian: ((int) value) << 1 |` - - Otherwise write as 9 bytes: `| 0b1 | little-endian 8bytes long |` -- Fury PVL(Progressive Variable-length Long) Encoding: - - First convert the number into positive unsigned long by ` (v << 1) ^ (v >> 63)` ZigZag algorithm to reduce cost of - small negative numbers, then encoding it as an unsigned long. - -#### Float - -- size: 4 byte -- format: convert float to 4 bytes int by `Float.floatToRawIntBits`, then write as binary by little endian order. - -#### Double - -- size: 8 byte -- format: convert double to 8 bytes int by `Double.doubleToRawLongBits`, then write as binary by little endian order. - -### String - -Format: - -- one byte for encoding: 0 for `latin`, 1 for `utf-16`, 2 for `utf-8`. -- positive varint for encoded string binary length. -- encoded string binary data based on encoding: `latin/utf-16/utf-8`. - -Which encoding to choose: - -- For JDK8: fury detect `latin` at runtime, if string is `latin` string, then use `latin` encoding, otherwise - use `utf-16`. -- For JDK9+: fury use `coder` in `String` object for encoding, `latin`/`utf-16` will be used for encoding. -- If the string is encoded by `utf-8`, then fury will use `utf-8` to decode the data. But currently fury doesn't enable - utf-8 encoding by default for java. Cross-language string serialization of fury use `utf-8` by default. - -### Collection - -> All collection serializer must extends `org.apache.fury.serializer.collection.CollectionSerializer`. - -Format: - -``` -length(unsigned varint) | collection header | elements header | elements data -``` - -#### Collection header - -- For `ArrayList/LinkedArrayList/HashSet/LinkedHashSet`, this will be empty. -- For `TreeSet`, this will be `Comparator` -- For subclass of `ArrayList`, this may be extra object field info. - -#### Elements header - -In most cases, all collection elements are same type and not null, elements header will encode those homogeneous -information to avoid the cost of writing it for every elements. Specifically, there are four kinds of information -which will be encoded by elements header, each use one bit: - -- Whether track elements ref, use first bit `0b1` of header to flag it. -- Whether collection has null, use second bit `0b10` of header to flag it. If ref tracking is enabled for this - element type, this flag is invalid. -- Whether collection elements type is not declare type, use 3rd bit `0b100` of header to flag it. -- Whether collection elements type different, use 4rd bit `0b1000` of header to flag it. - -By default, all bits are unset, which means all elements won't track ref, all elements are same type,, not null and the -actual element is the declare type in custom class field. - -#### Elements data - -Based on the elements header, the serialization of elements data may skip `ref flag`/`null flag`/`element class info`. - -`org.apache.fury.serializer.collection.CollectionSerializer#write/read` can be taken as an example. - -### Array - -#### Primitive array - -#### Object array - -### Map - -### Enum - -Enum are serialized as an - -### Object - -### Class - -## Implementation guidelines - -- Try to merge multiple bytes into an int/long write before writing to reduce memory IO and bound check cost. -- Read multiple bytes as an int/long, then spilt into multiple bytes to reduce memory IO and bound check cost. -- Try to use one varint/long to write flags and length together to save one byte cost and reduce memory io. -- Condition branch is less expensive compared to memory IO cost unless there are too much branches. \ No newline at end of file diff --git a/docs/protocols/java_object_graph_spec.md b/docs/protocols/java_object_graph_spec.md new file mode 100644 index 0000000000..b602f1b1a2 --- /dev/null +++ b/docs/protocols/java_object_graph_spec.md @@ -0,0 +1,543 @@ +# Fury Java Serialization Specification + +## Spec overview + +Fury Java Serialization is an automatic object serialization framework that supports reference and polymorphism. Fury +will +convert an object from/to fury java serialization binary format. Fury has two core concepts for java serialization: + +- **Fury Java Binary format** +- **Framework to convert object to/from Fury Java Binary format** + +The serialization format is a dynamic binary format. The dynamics and reference/polymorphism support make Fury flexible, +much more easy to use, but +also introduce more complexities compared to static serialization frameworks. So the format will be more complex. + +Here is the overall format: + +``` +| fury header | object ref meta | object class meta | object value data | +``` + +The data are serialized using little endian byte order overall. If bytes swap is costly for some object, +Fury will write the byte order for that object into the data instead of converting it to little endian. + +## Fury header + +Fury header consists starts one byte: + +``` +| 4 bits | 1 bit | 1 bit | 1 bit | 1 bit | optional 4 bytes | ++---------------+-------+-------+--------+-------+------------------------------------+ +| reserved bits | oob | xlang | endian | null | unsigned int for meta start offset | +``` + +- null flag: 1 when object is null, 0 otherwise. If an object is null, other bits won't be set. +- endian flag: 1 when data is encoded by little endian, 0 for big endian. +- xlang flag: 1 when serialization uses xlang format, 0 when serialization uses Fury java format. +- oob flag: 1 when passed `BufferCallback` is not null, 0 otherwise. + +If meta share mode is enabled, an uncompressed unsigned int is appended to indicate the start offset of metadata. + +## Reference Meta + +Reference tracking handles whether the object is null, and whether to track reference for the object by writing +corresponding flags and maintaining internal state. + +Reference flags: + +| Flag | Byte Value | Description | +|---------------------|------------|---------------------------------------------------------------------------------------------------------------------------------------------------------| +| NULL FLAG | `-3` | This flag indicates the object is a null value. We don't use another byte to indicate REF, so that we can save one byte. | +| REF FLAG | `-2` | This flag indicates the object is already serialized previously, and fury will write a ref id with unsigned varint format instead of serialize it again | +| NOT_NULL VALUE FLAG | `-1` | This flag indicates the object is a non-null value and fury doesn't track ref for this type of object. | +| REF VALUE FLAG | `0` | This flag indicates the object is referencable and the first time to serialize. | + +When reference tracking is disabled globally or for specific types, or for certain types within a particular +context(e.g., a field of a class), only the `NULL` and `NOT_NULL VALUE` flags will be used for reference meta. + +## Class Meta + +Fury supports to register class by an optional id, the registration can be used for security check and class +identification. +If a class is registered, it will have a user-provided or an auto-growing unsigned int i.e. `class_id`. + +Depending on whether meta share mode and registration is enabled for current class, Fury will write class meta +differently. + +### Schema consistent + +If schema consistent mode is enabled globally or enabled for current class, class meta will be written as follows: + +- If class is registered, it will be written as a fury unsigned varint: `class_id << 1`. +- If class is not registered: + - If class is not an array, fury will write one byte `0bxxxxxxx1` first, then write class name. + - The first little bit is `1`, which is different from first bit `0` of + encoded class id. Fury can use this information to determine whether to read class by class id for + deserialization. + - If class is not registered and class is an array, fury will write one byte `dimensions << 1 | 1` first, then write + component + class subsequently. This can reduce array class name cost if component class is or will be serialized. + - Class will be written as two enumerated fury unsigned by default: `package name` and `class name`. If meta share + mode is + enabled, + class will be written as an unsigned varint which points to index in `MetaContext`. + +### Schema evolution + +If schema evolution mode is enabled globally or enabled for current class, class meta will be written as follows: + +- If meta share mode is not enabled, class meta will be written as schema consistent mode. Additionally, field meta such + as field type + and name will be written with the field value using a key-value like layout. +- If meta share mode is enabled, class meta will be written as a meta-share encoded binary if class hasn't been written + before, otherwise an unsigned varint id which references to previous written class meta will be written. + +## Meta share + +> This mode will forbid streaming writing since it needs to look back for update the start offset after the whole object +> graph +> writing and meta collecting is finished. Only in this way we can ensure deserialization failure doesn't lost shared +> meta. +> Meta streamline will be supported in the future for enclosed meta sharing which doesn't cross multiple serializations +> of different objects. + +For Schema consistent mode, class will be encoded as an enumerated string by full class name. Here we mainly describe +the meta layout for schema evolution mode: + +``` +| 8 bytes meta header | variable bytes | variable bytes | variable bytes | ++-------------------------------+--------------------+-------------------+----------------+ +| 7 bytes hash + 1 bytes header | current class meta | parent class meta | ... | +``` + +Class meta are encoded from parent class to leaf class, only class with serializable fields will be encoded. + +### Meta header + +Meta header is a 64 bits number value encoded in little endian order. + +- Lowest 4 digits `0b0000~0b1110` are used to record num classes. `0b1111` is preserved to indicate that Fury need to + read more bytes for length using Fury unsigned int encoding. If current class doesn't has parent class, or parent + class doesn't have fields to serialize, or we're in a context which serialize fields of current class + only( `ObjectStreamSerializer#SlotInfo` is an example), num classes will be 1. +- 5rd bit is used to indicate whether this class needs schema evolution. +- Other 56 bits is used to store the unique hash of `flags + all layers class meta`. + +### Single layer class meta + +``` +| unsigned varint | meta string | meta string | field info: variable bytes | variable bytes | ... | ++----------------------------+-----------------------+---------------------+-------------------------------+-----------------+-----+ +| num fields + register flag | header + package name | header + class name | header + type id + field name | next field info | ... | +``` + +- num fields: encode `num fields << 1 | register flag(1 when class registered)` as unsigned varint. + - If class is registered, then an unsigned varint class id will be written next, package and class name will be + omitted. + - If current class is schema consistent, then num field will be `0` to flag it. + - If current class isn't schema consistent, then num field will be the number of compatible fields. For example, + users + can use tag id to mark some field as compatible field in schema consistent context. In such cases, schema + consistent + fields will be serialized first, then compatible fields will be serialized next. At deserialization, Fury will use + fields info of those fields which aren't annotated by tag id for deserializing schema consistent fields, then use + fields info in meta for deserializing compatible fields. +- Package name encoding(omitted when class is registered): + - Header: + - If meta string encoding is `LOWER_SPECIAL` and the length of encoded string `<=` 128, then header will be + `7 bits size + flag(set)`. + Otherwise, header will be `4 bits unset + 3 bits encoding flags + flag(unset)` + - Package name: + - If bit flag is set, then package name will be encoded meta string binary. + - Otherwise, it will be `| unsigned varint length | encoded meta string binary |` +- Class name encoding(omitted when class is registered):: + - header: + - If meta string encoding is in `LOWER_SPECIAL~LOWER_UPPER_DIGIT_SPECIAL (0~3)`, and the length of encoded + string `<=` 32, then the header will be `5 bits size + 2 bits encoding flags + flag(set)`. + - Otherwise, header will be `| unsigned varint length | encoded meta string binary |` +- Field info: + - header(8 + bits): `reserved 1 bit + 3 bits field name encoding + polymorphism flag + nullability flag + ref tracking flag + tag id flag`. + Users can use annotation to provide those info. + - tag id: when set to 1, field name will be written by an unsigned varint tag id. + - ref tracking: when set to 0, ref tracking will be disabled for this field. + - nullability: when set to 0, this field won't be null. + - polymorphism: when set to 1, the actual type of field will be the declared field type even the type if + not `final`. + - 3 bits field name encoding will be set to meta string encoding flags when tag id is not set. + - type id: + - For registered type-consistent classes, it will be the registered class id. + - Otherwise it will be encoded as `OBJECT_ID` if it isn't `final` and `FINAL_OBJECT_ID` if it's `final`. The + meta + for such types is written separately instead of inlining here is to reduce meta space cost if object of this + type is serialized in current object graph multiple times, and the field value may be null too. + - List Type Info: list type will have an extra byte for elements info. + Users can use annotation to provide those info. + - elements type same + - elements tracking ref + - elements nullability + - elements declared type + - Map Type Info: map type will have an extra byte for kv items info. + Users can use annotation to provide those info. + - keys type same + - keys tracking ref + - keys nullability + - keys declared type + - values type same + - values tracking ref + - values nullability + - values declared type + - Field name: If type id is set, type id will be used instead. Otherwise meta string encoding length and data will + be + written instead. + +Field order are left as implementation details, which is not exposed to specification, the deserialization need to +resort fields based on Fury field comparator. In this way, fury can compute statistics for field names or types and +using a more compact encoding. + +### Other layers class meta + +Same encoding algorithm as the previous layer except: + +- header + package name: + - Header: + - If package name has been written before: `varint index + sharing flag(set)` will be written + - If package name hasn't been written before: + - If meta string encoding is `LOWER_SPECIAL` and the length of encoded string `<=` 64, then header will be + `6 bits size + encoding flag(set) + sharing flag(unset)`. + - Otherwise, header will + be `3 bits unset + 3 bits encoding flags + encoding flag(unset) + sharing flag(unset)` + +## Meta String + +Meta string is mainly used to encode meta strings such class name and field names. + +### Encoding Algorithms + +String binary encoding algorithm: + +| Algorithm | Pattern | Description | +|---------------------------|----------------|--------------------------------------------------------------------------------------------------------------------------------------------------| +| LOWER_SPECIAL | `a-z._$\|` | every char is written using 5 bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101` | +| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9._$` | every char is written using 6 bits, `a-z`: `0b00000~0b11110`, `A-Z`: `0b11010~0b110011`, `0~9`: `0b110100~0b111101`, `._$`: `0b111110~0b1000000` | +| UTF-8 | any chars | UTF-8 encoding | + +Encoding flags: + +| Encoding Flag | Pattern | Encoding Algorithm | +|---------------------------|-----------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------| +| LOWER_SPECIAL | every char is in `a-z._$\|` | `LOWER_SPECIAL` | +| REP_FIRST_LOWER_SPECIAL | every char is in `a-z._$` except first char is upper case | replace first upper case char to lower case, then use `LOWER_SPECIAL` | +| REP_MUL_LOWER_SPECIAL | every char is in `a-zA-Z._$` | replace every upper case char by `\|` + `lower case`, then use `LOWER_SPECIAL`, use this encoding if it's smaller than Encoding `3` | +| LOWER_UPPER_DIGIT_SPECIAL | every char is in `a-zA-Z._$` | use `LOWER_UPPER_DIGIT_SPECIAL` encoding if it's smaller than Encoding `2` | +| UTF8 | any utf-8 char | use `UTF-8` encoding | +| Compression | any utf-8 char | lossless compression | + +Depending on cases, one can choose encoding `flags + data` jointly, uses 3 bits of first byte for flags and other bytes +for data. + +### Shared meta string + +The shared meta string format consists of header and encoded string binary. Header of encoded string binary will be +inlined +in shared meta header. + +Header is written using little endian order, Fury can read this flag first to determine how to deserialize the data. + +#### Write by data + +If string hasn't been written before, the data will be written as follows: + +``` +| unsigned varint: string binary size + 1 bit: not written before | 56 bits: unique hash | 3 bits encoding flags + string binary | +``` + +If string binary size is less than `16` bytes, the hash will be omitted to save spaces. Unique hash can be omitted too +if caller pass a flag to disable it. In such cases, the format will be: + +``` +| unsigned varint: string binary size + 1 bit: not written before | 3 bits encoding flags + string binary | +``` + +#### Write by ref + +If string has been written before, the data will be written as follows: + +``` +| unsigned varint: written string id + 1 bit: written before | +``` + +## Value Format + +### Basic types + +#### Bool + +- size: 1 byte +- format: 0 for `false`, 1 for `true` + +#### Byte + +- size: 1 byte +- format: write as pure byte. + +#### Short + +- size: 2 byte +- byte order: little endian order + +#### Char + +- size: 2 byte +- byte order: little endian order + +#### Unsigned int + +- size: 1~5 byte +- Format: The most significant bit (MSB) in every byte indicates whether to have the next byte. If first bit is set + i.e. `b & 0x80 == 0x80`, then + the next byte should be read until the first bit of the next byte is unset. + +#### Signed int + +- size: 1~5 byte +- Format: First convert the number into positive unsigned int by `(v << 1) ^ (v >> 31)` ZigZag algorithm, then encoding + it as an unsigned int. + +#### Unsigned long + +- size: 1~9 byte +- Fury PVL(Progressive Variable-length Long) Encoding: + - positive long format: first bit in every byte indicates whether to have the next byte. If first bit is set + i.e. `b & 0x80 == 0x80`, then the next byte should be read until the first bit is unset. + +#### Signed long + +- size: 1~9 byte +- Fury SLI(Small long as int) Encoding: + - If long is in [-1073741824, 1073741823], encode as 4 bytes int: `| little-endian: ((int) value) << 1 |` + - Otherwise write as 9 bytes: `| 0b1 | little-endian 8 bytes long |` +- Fury PVL(Progressive Variable-length Long) Encoding: + - First convert the number into positive unsigned long by ` (v << 1) ^ (v >> 63)` ZigZag algorithm to reduce cost of + small negative numbers, then encoding it as an unsigned long. + +#### Float + +- size: 4 byte +- format: convert float to 4 bytes int by `Float.floatToRawIntBits`, then write as binary by little endian order. + +#### Double + +- size: 8 byte +- format: convert double to 8 bytes int by `Double.doubleToRawLongBits`, then write as binary by little endian order. + +### String + +Format: + +``` +| header: size << 2 | 2 bits encoding flags | binary data | +``` + +- `size + encoding` will be concat as a long and encoded as an unsigned var long. The little 2 bits is used for + encoding: + 0 for `latin`, 1 for `utf-16`, 2 for `utf-8`. +- encoded string binary data based on encoding: `latin/utf-16/utf-8`. + +Which encoding to choose: + +- For JDK8: fury detect `latin` at runtime, if string is `latin` string, then use `latin` encoding, otherwise + use `utf-16`. +- For JDK9+: fury use `coder` in `String` object for encoding, `latin`/`utf-16` will be used for encoding. +- If the string is encoded by `utf-8`, then fury will use `utf-8` to decode the data. But currently fury doesn't enable + utf-8 encoding by default for java. Cross-language string serialization of fury uses `utf-8` by default. + +### Collection + +> All collection serializers must extend `AbstractCollectionSerializer`. + +Format: + +``` +length(unsigned varint) | collection header | elements header | elements data +``` + +#### Collection header + +- For `ArrayList/LinkedArrayList/HashSet/LinkedHashSet`, this will be empty. +- For `TreeSet`, this will be `Comparator` +- For subclass of `ArrayList`, this may be extra object field info. + +#### Elements header + +In most cases, all collection elements are same type and not null, elements header will encode those homogeneous +information to avoid the cost of writing it for every element. Specifically, there are four kinds of information +which will be encoded by elements header, each use one bit: + +- If track elements ref, use the first bit `0b1` of the header to flag it. +- If the collection has null, use the second bit `0b10` of the header to flag it. If ref tracking is enabled for this + element type, this flag is invalid. +- If the collection element types are not declared type, use the 3rd bit `0b100` of the header to flag it. +- If the collection element types are different, use the 4rd bit `0b1000` header to flag it. + +By default, all bits are unset, which means all elements won't track ref, all elements are same type, not null and +the actual element is the declared type in the custom class field. + +#### Elements data + +Based on the elements header, the serialization of elements data may skip `ref flag`/`null flag`/`element class info`. + +`CollectionSerializer#write/read` can be taken as an example. + +### Array + +#### Primitive array + +Primitive array are taken as a binary buffer, serialization will just write the length of array size as an unsigned int, +then copy the whole buffer into the stream. + +Such serialization won't compress the array. If users want to compress primitive array, users need to register custom +serializers for such types. + +#### Object array + +Object array is serialized using the collection format. Object component type will be taken as collection element +generic +type. + +### Map + +> All Map serializers must extend `AbstractMapSerializer`. + +Format: + +``` +| length(unsigned varint) | map header | key value pairs data | +``` + +#### Map header + +- For `HashMap/LinkedHashMap`, this will be empty. +- For `TreeMap`, this will be `Comparator` +- For other `Map`, this may be extra object field info. + +#### Map Key-Value data + +Map iteration is too expensive, Fury won't compute the header like for collection before since it introduce +[considerable overhead](https://github.com/apache/incubator-fury/issues/925). +Users can use `MapFieldInfo` annotation to provide header in advance. Otherwise Fury will use first key-value pair to +predict header optimistically, and update the chunk header if the prediction failed at some pair. + +Fury will serialize map chunk by chunk, every chunk has 127 pairs at most. + +``` +| 1 byte | 1 byte | variable bytes | ++----------------+----------------+-----------------+ +| chunk size: N | KV header | N*2 objects | +``` + +KV header: + +- If track key ref, use the first bit `0b1` of the header to flag it. +- If the key has null, use the second bit `0b10` of the header to flag it. If ref tracking is enabled for this + key type, this flag is invalid. +- If the key types of map are different, use the 3rd bit `0b100` of the header to flag it. +- If the actual key type of map is not the declared key type, use the 4rd bit `0b1000` of the header to flag it. +- If track value ref, use the 5th bit `0b10000` of the header to flag it. +- If the value has null, use the 6th bit `0b100000` of the header to flag it. If ref tracking is enabled for this + value type, this flag is invalid. +- If the value types of map are different, use the 7rd bit `0b1000000` header to flag it. +- If the value type of map is not the declared value type, use the 8rd bit `0b10000000` of the header to flag it. + +If streaming write is enabled, which means Fury can't update written `chunk size`. In such cases, map key-value data +format will be: + +``` +| 1 byte | variable bytes | ++----------------+-----------------+ +| KV header | N*2 objects | +``` + +`KV header` will be a header marked by `MapFieldInfo` in java. For languages such as golang, this can be computed in +advance for non-interface types in most times. + +### Enum + +Enums are serialized as an unsigned var int. If the order of enum values change, the deserialized enum value may not be +the value users expect. In such cases, users must register enum serializer by make it write enum value as an enumerated +string with unique hash disabled. + +### Object + +Object means object of `pojo/struct/bean/record` type. +Object will be serialized by writing its fields data in fury order. + +Depending on schema compatibility, objects will have different formats. + +#### Field order + +Field will be ordered as following, every group of fields will have its own order: + +- primitive fields: larger size type first, smaller later, variable size type last. +- boxed primitive fields: same order as primitive fields +- final fields: same type together, then sorted by field name lexicographically. +- collection fields: same order as final fields +- map fields: same order as final fields +- other fields: same order as final fields + +#### Schema consistent + +Object fields will be serialized one by one using following format: + +``` +Primitive field value: +| var bytes | ++----------------+ +| value data | ++----------------+ +Boxed field value: +| one byte | var bytes | ++-----------+---------------+ +| null flag | field value | ++-----------+---------------+ +field value of final type with ref tracking: +| var bytes | var objects | ++-----------+-------------+ +| ref meta | value data | ++-----------+-------------+ +field value of final type without ref tracking: +| one byte | var objects | ++-----------+-------------+ +| null flag | field value | ++-----------+-------------+ +field value of non-final type with ref tracking: +| one byte | var bytes | var objects | ++-----------+-------------+-------------+ +| ref meta | class meta | value data | ++-----------+-------------+-------------+ +field value of non-final type without ref tracking: +| one byte | var bytes | var objects | ++-----------+------------+------------+ +| null flag | class meta | value data | ++-----------+------------+------------+ +``` + +#### Schema evolution + +Schema evolution have similar format as schema consistent mode for object except: + +- For this object type itself, `schema consistent` mode will write class by id/name, but `schema evolution` mode will + write class field names, types and other meta too, see [Class meta](#class-meta). +- Class meta of `final custom type` needs to be written too, because peers may not have this class defined. + +### Class + +Class will be serialized using class meta format. + +## Implementation guidelines + +- Try to merge multiple bytes into an int/long write before writing to reduce memory IO and bound check cost. +- Read multiple bytes as an int/long, then split into multiple bytes to reduce memory IO and bound check cost. +- Try to use one varint/long to write flags and length together to save one byte cost and reduce memory io. +- Condition branches are less expensive compared to memory IO cost unless there are too many branches. diff --git a/java/fury-core/src/main/java/org/apache/fury/Fury.java b/java/fury-core/src/main/java/org/apache/fury/Fury.java index 1320c23bea..436b1b38b4 100644 --- a/java/fury-core/src/main/java/org/apache/fury/Fury.java +++ b/java/fury-core/src/main/java/org/apache/fury/Fury.java @@ -82,7 +82,7 @@ public final class Fury implements BaseFury { public static final byte REF_FLAG = -2; // this flag indicates that the object is a non-null value. public static final byte NOT_NULL_VALUE_FLAG = -1; - // this flag indicates that the object is a referencable and first read. + // this flag indicates that the object is a referencable and first write. public static final byte REF_VALUE_FLAG = 0; public static final byte NOT_SUPPORT_CROSS_LANGUAGE = 0; public static final short FURY_TYPE_TAG_ID = Type.FURY_TYPE_TAG.getId();