Bois Schema Specs
TO BE COMPLETED.
The Bois binary format is fairly straightforward. BOIS stands for Binary Object Indexed Serializer. Even though the overall structure doesn't follow any specific rule, it can still be categorized as an indexed sequential data format, hence the word "indexed" in the name. Being indexed means that there is an index byte before every object. This index byte contains information about the data that comes after it. It can even contain the data itself, as described below.
There are several types of index byte. Which one is used depends on the type of data that is going to be stored.
IB1 - Nullable: Generally used if the object/number is nullable.
index byte: [0_{null-flag}_0_0_0_0_0_0]
embeddable integer: none
IB2 - Embed-able Nullable: Generally used if the object/number is nullable and is small enough to be embedded.
index byte: [{embedded-flag}_{null-flag}_0_0_0_0_0_0]
followed by optional data: [0_0_0_0_0_0_0_0]
embeddable integer: 0..63
IB3 - Embed-able Nullable Signed Number: Used for signed numbers that are nullable and small enough to be embedded.
index byte: [{embedded-flag}_{null-flag}_{negative-flag}_0_0_0_0_0]
followed by optional data: [0_0_0_0_0_0_0_0]
embeddable integer: 0..31
IB4 - Embed-able Not-Null Signed Number: Used for signed numbers that cannot be null and are small enough to be embedded.
index byte: [{embedded-flag}_{negative-flag}_0_0_0_0_0_0]
followed by optional data: [0_0_0_0_0_0_0_0]
embeddable integer: 0..63
IB5 - Embed-able Nullable Unsigned Number: Used for unsigned numbers that can be null and are small enough to be embedded.
index byte: [{embedded-flag}_{null-flag}_0_0_0_0_0_0]
followed by optional data: [0_0_0_0_0_0_0_0]
embeddable integer: 0..63
IB6 - Embed-able Not-Null Unsigned Number: Used for unsigned numbers that cannot be null and are small enough to be embedded.
index byte: [{embedded-flag}_0_0_0_0_0_0_0]
followed by optional data: [0_0_0_0_0_0_0_0]
embeddable integer: 0..127
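The embeddable ranges listed above follow directly from how many bits each category reserves for flags. The following is a minimal sketch of that arithmetic; the names `IndexByteKind` and `IndexByteLayout` are hypothetical and are not taken from the Bois source.

```csharp
using System;

// Hypothetical names; the spec defines only the bit layouts.
public enum IndexByteKind { IB1, IB2, IB3, IB4, IB5, IB6 }

public static class IndexByteLayout
{
    // Number of low-order bits left for an embedded value in each category.
    public static int ValueBitCount(IndexByteKind kind)
    {
        switch (kind)
        {
            case IndexByteKind.IB1: return 0; // nothing can be embedded
            case IndexByteKind.IB3: return 5; // three flag bits: embedded, null, negative
            case IndexByteKind.IB2:           // two flag bits: embedded, null
            case IndexByteKind.IB4:           // two flag bits: embedded, negative
            case IndexByteKind.IB5: return 6; // two flag bits: embedded, null
            case IndexByteKind.IB6: return 7; // one flag bit: embedded
            default: throw new ArgumentOutOfRangeException(nameof(kind));
        }
    }

    // Largest embeddable integer, or null when the category cannot embed data.
    public static int? MaxEmbeddable(IndexByteKind kind)
    {
        int bits = ValueBitCount(kind);
        return bits == 0 ? (int?)null : (1 << bits) - 1;
    }
}
```

For example, `IndexByteLayout.MaxEmbeddable(IndexByteKind.IB3)` yields 31 and `IndexByteLayout.MaxEmbeddable(IndexByteKind.IB6)` yields 127, matching the ranges above.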
As you may have noticed, some of these index bytes have the same structure. This was done to simplify the process of writing the program. One more piece of information is still needed for each of them: the amount of data that can be embedded. Before that, let's see how to embed data in the index byte.
If the number that is going to be stored is small enough it can be stored in the index byte by merging the number and the flags.
The flags should be preserved at all times. Any misuse of the embedded flag may lead to invalid data.
First we have to know how much data can be stored. For example, Int32 falls into the IB4 category, which can embed any number in the 0..63 range.
As an example of an unsigned integer, imagine we want to store the number 50. Since the data type is uint and is not nullable, it falls into the IB6 category. Because 50 is within the IB6 embeddable range (0..127), it can be stored in the index byte. Finally, because the number is embedded, we have to set the embedded flag.
50 decimal = [00110010] byte
IB6
Embedded flag = [10000000]
Final byte = [10110010]
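The merge shown above is a bitwise OR of the embedded flag and the value. A minimal sketch for the IB6 case follows; `Ib6Sketch` and `TryEmbed` are hypothetical names used only for illustration.

```csharp
using System;

public static class Ib6Sketch
{
    private const byte EmbeddedFlag = 0x80;  // [10000000]
    private const uint MaxEmbeddable = 127;  // seven value bits

    // Returns true and the merged index byte when the value fits next to the flag.
    public static bool TryEmbed(uint value, out byte indexByte)
    {
        if (value > MaxEmbeddable)
        {
            indexByte = 0;
            return false; // too large: the value has to follow the index byte instead
        }
        indexByte = (byte)(EmbeddedFlag | value);
        return true;
    }
}
```

Calling `Ib6Sketch.TryEmbed(50, out var b)` returns true with `b == 0xB2`, which is `[10110010]`.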
Now imagine that we want to save the same number 50, but this time the data type is a nullable signed integer, int?. This type falls into the IB3 category, in which the largest embeddable number is 31, so 50 cannot be embedded into the index byte. This is how it is stored:
50 decimal = [00110010] byte
IB3
Not-null, not-embedded signed number flags = [00000000]
Final bytes = [00000000][00110010]
Here the first byte is the index byte, with none of its flags set, and the second byte is the number itself.
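The IB3 case can be sketched in the same way. This sketch makes two assumptions that the text does not confirm: a negative embedded value is stored as its magnitude plus the negative flag, and a null value is written as the null flag alone. It only covers the non-embedded case shown in the example (a single following byte) and rejects everything else. `Ib3Sketch` is a hypothetical name.

```csharp
using System;

public static class Ib3Sketch
{
    private const byte EmbeddedFlag = 0x80; // [10000000]
    private const byte NullFlag = 0x40;     // [01000000]
    private const byte NegativeFlag = 0x20; // [00100000]
    private const int MaxEmbeddable = 31;   // five value bits

    public static byte[] Encode(int? value)
    {
        if (value == null)
            return new[] { NullFlag }; // assumption: null is the null flag alone

        int v = value.Value;
        long magnitude = Math.Abs((long)v);
        if (magnitude <= MaxEmbeddable)
        {
            // Assumption: sign goes into the flag, magnitude into the value bits.
            byte flags = EmbeddedFlag;
            if (v < 0) flags |= NegativeFlag;
            return new[] { (byte)(flags | (byte)magnitude) };
        }

        // Same shape as the worked example: cleared index byte, then the number.
        if (v >= 0 && v <= byte.MaxValue)
            return new byte[] { 0x00, (byte)v };

        throw new NotSupportedException("Layout for larger values is not covered by this sketch.");
    }
}
```

`Ib3Sketch.Encode(50)` produces `[0x00, 0x32]`, matching the two bytes in the example, while `Ib3Sketch.Encode(5)` produces the single byte `0x85`.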
The same process is followed when reading data. First determine the data type from the schema, then decide which index byte category it belongs to, and finally check the flags, read the data, and separate it from the flags.
This section describes the category and the structure of the simple data types supported by the serializer.
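The reading steps can be sketched for a value that the schema has already identified as IB3 (nullable signed number). The sketch assumes that a negative embedded value is stored as magnitude plus negative flag, and that a non-embedded value occupies exactly one following byte, as in the example with 50; the real format for larger values is not described here. `Ib3ReaderSketch` is a hypothetical name.

```csharp
using System;

public static class Ib3ReaderSketch
{
    private const byte EmbeddedFlag = 0x80;
    private const byte NullFlag = 0x40;
    private const byte NegativeFlag = 0x20;
    private const byte ValueMask = 0x1F; // the five bits not used by flags

    // Reads one IB3 value starting at offset and advances offset past it.
    public static int? Decode(byte[] data, ref int offset)
    {
        byte index = data[offset++];

        if ((index & NullFlag) != 0)
            return null;

        if ((index & EmbeddedFlag) != 0)
        {
            // Separate the embedded value from the flags.
            int magnitude = index & ValueMask;
            return (index & NegativeFlag) != 0 ? -magnitude : magnitude;
        }

        // Not embedded: assume a single data byte follows, as in the example.
        return (int)data[offset++];
    }
}
```

Decoding `[0x00, 0x32]` from offset 0 yields 50 and leaves the offset at 2.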
Category: None
Structure: None.
Category: IB5
Structure: None.
Category: None
Structure: None.
Category: IB3
Structure: None.
Category: IB4
Structure: None.
Category: None
Structure: byte.
Category: IB2
Structure: byte?.
Category: IB6
Structure: ushort.
Category: IB2
Structure: ushort?.
This section describes the types that require a simple structure in addition to the category.
Structure: [data-length : uint?][string-data-encoded]
Structure: [data-length : uint][double-variable-data]
Data Format: The double value is converted to 8 bytes and only the low bytes with actual data are stored.
TODO: explain.
Structure: [data-length : uint?][]
Every piece of data has an index byte that describes its type or length.
There are three major types of index byte.
[NullIndicator-0-0-0-0-0-0-0]
[EmbedIndicator-0-0-0-0-0-0-0]
[NullIndicator-EmbedIndicator-0-0-0-0-0-0]
Their usage is described in the following sections.
Objects: [NullIndicator-0-0-0-0-0-0-0] Indicator byte of any object. An object is either null or not null.
Non-Nullable Primitive Types:
Unsigned numbers: [EmbedIndicator-0-0-0-0-0-0-0] [optional data] 0..127 can be embedded
Signed numbers: [EmbedIndicator-SignIndicator-0-0-0-0-0-0] [optional data] 0..63 can be embedded
Nullable Primitive Types:
Unsigned numbers: [NullIndicator-EmbedIndicator-0-0-0-0-0-0] [optional data] 0..63 can be embedded
Signed numbers: [NullIndicator-EmbedIndicator-SignIndicator-0-0-0-0-0] [optional data] 0..31 can be embedded
The ZigZag algorithm is not used, because it was making the numbers bigger.
Non-Nullable Primitive Types: [EmbedIndicator-0-0-0-0-0-0-0] [optional data]
Most primitive types can embed their data in the index byte, but there is a limit on what can be stored there. Data in the index byte can only be an integer between 0…127, the largest being [EmbedIndicator-1-1-1-1-1-1-1]. If the data is larger, it is stored after the index byte using its own format, and the data in the index byte then indicates the size of the data that follows.
Byte array: byte[]
Byte arrays are used in many places as the standard way of storing data in the storing/reading process. Byte arrays are stored in the following formats:
Standard Byte Array: In this format the array is stored without any alteration. LossyArrayIndicator is always zero. [NullIndicator-EmbedIndicator-{LossyArrayIndicator=0}-0-0-0-0-0] [StoredBytesCount]
Lossless Compact Byte Array: This means any array provided by models is stored in compact format. Before storing, the array is scanned for zero bytes from the end until a non-zero byte is found. The compact method is used only if the number of trailing zeros exceeds the 4 bytes needed to store the original size of the array. If this condition fails, the standard method of storing arrays is used. [NullIndicator-EmbedIndicator-LossyArrayIndicator-0-0-0-0-0] [ArrayOriginalSize] [StoredBytesCount]
Lossy Compact Byte Array: This type of array is used internally only. This means any array provided by models is stored in compact format. Before storing, the array is scanned for zero bytes from the end until a non-zero byte is found. These zeros are not stored, and neither is the original size of the array. This method is mostly used for storing integer and floating point numbers. [NullIndicator-EmbedIndicator-0-0-0-0-0-0] [StoredBytesCount]
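The trailing-zero scan and the 4-byte threshold described for the compact formats can be sketched as follows. This covers only the trimming and restoring logic, not the index byte or the on-disk layout, and `TrailingZeroSketch` and its members are hypothetical names. In the lossy case the original size is not stored, so the reader is assumed to know it from the type (for example 4 bytes for an Int32).

```csharp
using System;

public static class TrailingZeroSketch
{
    // Bytes needed to record the original array size, per the description.
    private const int OriginalSizeBytes = 4;

    // Counts zero bytes from the end of the array until a non-zero byte is found.
    public static int CountTrailingZeros(byte[] data)
    {
        int count = 0;
        for (int i = data.Length - 1; i >= 0 && data[i] == 0; i--) count++;
        return count;
    }

    // Lossless compact storage only pays off when more zeros are dropped
    // than bytes are spent on remembering the original size.
    public static bool ShouldUseLosslessCompact(byte[] data)
        => CountTrailingZeros(data) > OriginalSizeBytes;

    // Returns a copy of the array without its trailing zeros.
    public static byte[] TrimTrailingZeros(byte[] data)
    {
        var stored = new byte[data.Length - CountTrailingZeros(data)];
        Array.Copy(data, stored, stored.Length);
        return stored;
    }

    // Pads the stored bytes back to the original size with zeros.
    public static byte[] Restore(byte[] stored, int originalSize)
    {
        var restored = new byte[originalSize];
        Array.Copy(stored, restored, stored.Length);
        return restored;
    }
}
```

For the array `{1, 2, 0, 0, 0, 0, 0, 0}` the scan finds six trailing zeros, so the compact form stores only `{1, 2}`; `Restore(stored, 8)` gives back the original eight bytes.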
short, ushort, int, uint, long, ulong, byte, sbyte and …
If the number is between 0 and 127: [EmbedIndicator=1 - the number]
If the number is larger than 127 or smaller than 0: [EmbedIndicator=0 - the number of bytes holding data] [byte 0 of the zig-zag-number] [byte 1 of the zig-zag-number] … [byte n of the zig-zag-number]
Examples:
0 [10000000]
100 [11100100]
127 [11111111]
128 [00000001][10000000]
-512 [00000000]
1000000 [00000000]
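The scheme above can be sketched for unsigned values, which avoids the question of how negative numbers are transformed. The sketch assumes the data bytes are written lowest byte first and without trailing zero bytes, which matches the example for 128 but is not stated explicitly. `VarNumberSketch` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;

public static class VarNumberSketch
{
    private const byte EmbedIndicator = 0x80;

    public static byte[] Encode(uint value)
    {
        // Small values are merged with the embed indicator in one byte.
        if (value <= 127)
            return new[] { (byte)(EmbedIndicator | value) };

        // Otherwise collect the significant bytes, lowest byte first (assumed order).
        var bytes = new List<byte>();
        for (uint rest = value; rest != 0; rest >>= 8)
            bytes.Add((byte)(rest & 0xFF));

        // The index byte (embed indicator cleared) holds the number of data bytes.
        var result = new byte[bytes.Count + 1];
        result[0] = (byte)bytes.Count;
        bytes.CopyTo(result, 1);
        return result;
    }
}
```

With this sketch, 0, 100 and 127 encode to the single bytes `0x80`, `0xE4` and `0xFF`, and 128 encodes to `[0x01, 0x80]`, in line with the first four examples.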
float, double and decimal: The general concept is much the same as for integer numbers. If the fractional part is zero and the number is between 0 and 127, the number can be stored in the index byte; otherwise it is stored in the following format.
0.0 [10000000]
100.0 [11100100]
The number of bytes differs for each type:
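The embedding test for floating point values can be sketched as a check for a whole number in the 0 to 127 range. This reads "the floating point is zero" as "the fractional part is zero", which is an interpretation rather than something the text confirms. `FloatEmbedSketch` is a hypothetical name.

```csharp
using System;

public static class FloatEmbedSketch
{
    private const byte EmbedIndicator = 0x80;

    // Returns true and the index byte when the value is a whole number in 0..127.
    // NaN and infinities fail the checks; negative zero is embedded as plain 0.
    public static bool TryEmbed(double value, out byte indexByte)
    {
        bool isWhole = Math.Floor(value) == value;
        if (isWhole && value >= 0 && value <= 127)
        {
            indexByte = (byte)(EmbedIndicator | (byte)value);
            return true;
        }
        indexByte = 0;
        return false;
    }
}
```

`TryEmbed(100.0, out var b)` succeeds with `b == 0xE4`, while `TryEmbed(50.123, out _)` returns false, so that value has to be written in the longer format.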
Single 4 bytes
50.123 [10000100] [] [] [] []
Double 8 bytes array
5000.123 [10001000] [] [] [] []
Decimal 16 bytes array
C# Type Observation (Remove This)
Properties required while determining the object type:
TypeBasicInfoCache ->
- ShoulBeComputed -> Only used when deciding to serialize the root instance, not when properties/fields of the object are being serialized
- UnderlyingType -> Holds the member type, the array item type, or the nullable item type
- KnownType -> EnKnownType
So two hashtables are required:
TypeBasicInfoCache -> Contains only a few pieces of information about the type, as mentioned above
ComputedTypeInfoCache -> Contains the emitted type and the delegates
C# Implementation: There are internal emitters that generate dynamic Emit code for complex types: CollectionGenerator, ArrayGenerator, EnumGenerator, ColorGenerator and .. Every complex type will have its Emit generated. The generated Emit is basically code that uses Bois internal classes and methods to read or write the binary format.
This code is to be generated by Emit.
public static void Serialize(SampleClass obj, BinaryWriter writer)
{
    // Checking if object itself is null or not
    if (obj == null)
    {
        writer.Write((byte)BoisInternalWriter.NullIndicator);
        return;
    }
    else
    {
        writer.Write((byte)BoisInternalWriter.NotNullIndicator);
    }

    // writing List1 as List<string>
    if (obj.List1 == null)
    {
        writer.Write((byte)BoisInternalWriter.NullIndicator);
    }
    else
    {
        // List<string> exposes Count; Length only exists on arrays
        BoisInternalWriter.WriteVarInt(writer, obj.List1.Count);
        for (int i = 0; i < obj.List1.Count; i++)
        {
            // writer is passed explicitly here, mirroring the WriteVarInt calls (assumed signature)
            BoisInternalWriter.WriteString(writer, obj.List1[i]);
        }
    }

    // writing Name as string
    if (obj.Name == null)
    {
        BoisInternalWriter.WriteVarInt(writer, (int?)null);
    }
    else if (obj.Name.Length == 0)
    {
        BoisInternalWriter.WriteVarInt(writer, (int?)0);
    }
    else
    {
        // Encoding.GetBytes is an instance method; UTF-8 is an assumption here
        var strBytes = Encoding.UTF8.GetBytes(obj.Name);
        // Int32
        BoisInternalWriter.WriteVarInt(writer, (int?)strBytes.Length);
        writer.Write(strBytes);
    }
}