diff --git a/format/Message.fbs b/format/Message.fbs index b02f3fa3869..71428b58103 100644 --- a/format/Message.fbs +++ b/format/Message.fbs @@ -28,12 +28,13 @@ table Int { is_signed: bool; } -enum Precision:short {SINGLE, DOUBLE} +enum Precision:short {HALF, SINGLE, DOUBLE} table FloatingPoint { precision: Precision; } +/// Unicode with UTF-8 encoding table Utf8 { } diff --git a/format/Metadata.md b/format/Metadata.md new file mode 100644 index 00000000000..e227b8d4afd --- /dev/null +++ b/format/Metadata.md @@ -0,0 +1,258 @@ +# Metadata: Logical types, schemas, data headers + +This is documentation for the Arrow metadata specification, which enables +systems to communicate the + +* Logical array types (which are implemented using the physical memory layouts + specified in [Layout.md][1]) + +* Schemas for table-like collections of Arrow data structures + +* "Data headers" indicating the physical locations of memory buffers sufficient + to reconstruct a Arrow data structures without copying memory. + +## Canonical implementation + +We are using [Flatbuffers][2] for low-overhead reading and writing of the Arrow +metadata. See [Message.fbs][3]. + +## Schemas + +The `Schema` type describes a table-like structure consisting of any number of +Arrow arrays, each of which can be interpreted as a column in the table. A +schema by itself does not describe the physical structure of any particular set +of data. + +A schema consists of a sequence of **fields**, which are metadata describing +the columns. The Flatbuffers IDL for a field is: + +``` +table Field { + // Name is not required, in i.e. a List + name: string; + nullable: bool; + type: Type; + children: [Field]; +} +``` + +The `type` is the logical type of the field. Nested types, such as List, +Struct, and Union, have a sequence of child fields. + +## Record data headers + +A record batch is a collection of top-level named, equal length Arrow arrays +(or vectors). If one of the arrays contains nested data, its child arrays are +not required to be the same length as the top-level arrays. + +One can be thought of as a realization of a particular schema. The metadata +describing a particular record batch is called a "data header". Here is the +Flatbuffers IDL for a record batch data header + +``` +table RecordBatch { + length: int; + nodes: [FieldNode]; + buffers: [Buffer]; +} +``` + +The `nodes` and `buffers` fields are produced by a depth-first traversal / +flattening of a schema (possibly containing nested types) for a given in-memory +data set. + +### Buffers + +A buffer is metadata describing a contiguous memory region relative to some +virtual address space. This may include: + +* Shared memory, e.g. a memory-mapped file +* An RPC message received in-memory +* Data in a file + +The key form of the Buffer type is: + +``` +struct Buffer { + offset: long; + length: long; +} +``` + +In the context of a record batch, each field has some number of buffers +associated with it, which are derived from their physical memory layout. + +Each logical type (separate from its children, if it is a nested type) has a +deterministic number of buffers associated with it. These will be specified in +the logical types section. + +### Field metadata + +The `FieldNode` values contain metadata about each level in a nested type +hierarchy. + +``` +struct FieldNode { + /// The number of value slots in the Arrow array at this level of a nested + /// tree + length: int; + + /// The number of observed nulls. + null_count: int; +} +``` + +## Flattening of nested data + +Nested types are flattened in the record batch in depth-first order. When +visiting each field in the nested type tree, the metadata is appended to the +top-level `fields` array and the buffers associated with that field (but not +its children) are appended to the `buffers` array. + +For example, let's consider the schema + +``` +col1: Struct, c: Float64> +col2: Utf8 +``` + +The flattened version of this is: + +``` +FieldNode 0: Struct name='col1' +FieldNode 1: Int32 name=a' +FieldNode 2: List name='b' +FieldNode 3: Int64 name='item' # arbitrary +FieldNode 4: Float64 name='c' +FieldNode 5: Utf8 name='col2' +``` + +For the buffers produced, we would have the following (as described in more +detail for each type below): + +``` +buffer 0: field 0 validity bitmap + +buffer 1: field 1 validity bitmap +buffer 2: field 1 values + +buffer 3: field 2 validity bitmap +buffer 4: field 2 list offsets + +buffer 5: field 3 validity bitmap +buffer 6: field 3 values + +buffer 7: field 4 validity bitmap +buffer 8: field 4 values + +buffer 9: field 5 validity bitmap +buffer 10: field 5 offsets +buffer 11: field 5 data +``` + +## Logical types + +A logical type consists of a type name and metadata along with an explicit +mapping to a physical memory representation. These may fall into some different +categories: + +* Types represented as fixed-width primitive arrays (for example: C-style + integers and floating point numbers) +* Types having equivalent memory layout to a physical nested type (e.g. strings + use the list representation, but logically are not nested types) + +### Integers + +In the first version of Arrow we provide the standard 8-bit through 64-bit size +standard C integer types, both signed and unsigned: + +* Signed types: Int8, Int16, Int32, Int64 +* Unsigned types: UInt8, UInt16, UInt32, UInt64 + +The IDL looks like: + +``` +table Int { + bitWidth: int; + is_signed: bool; +} +``` + +The integer endianness is currently set globally at the schema level. If a +schema is set to be little-endian, then all integer types occurring within must +be little-endian. Integers that are part of other data representations, such as +list offsets and union types, must have the same endianness as the entire +record batch. + +### Floating point numbers + +We provide 3 types of floating point numbers as fixed bit-width primitive array + +- Half precision, 16-bit width +- Single precision, 32-bit width +- Double precision, 64-bit width + +The IDL looks like: + +``` +enum Precision:int {HALF, SINGLE, DOUBLE} + +table FloatingPoint { + precision: Precision; +} +``` + +### Boolean + +The Boolean logical type is represented as a 1-bit wide primitive physical +type. The bits are numbered using least-significant bit (LSB) ordering. + +Like other fixed bit-width primitive types, boolean data appears as 2 buffers +in the data header (one bitmap for the validity vector and one for the values). + +### List + +The `List` logical type is the logical (and identically-named) counterpart to +the List physical type. + +In data header form, the list field node contains 2 buffers: + +* Validity bitmap +* List offsets + +The buffers associated with a list's child field are handled recursively +according to the child logical type (e.g. `List` vs. `List`). + +### Utf8 and Binary + +We specify two logical types for variable length bytes: + +* `Utf8` data is unicode values with UTF-8 encoding +* `Binary` is any other variable length bytes + +These types both have the same memory layout as the nested type `List`, +with the constraint that the inner bytes can contain no null values. From a +logical type perspective they are primitive, not nested types. + +In data header form, while `List` would appear as 2 field nodes (`List` +and `UInt8`) and 4 buffers (2 for each of the nodes, as per above), these types +have a simplified representation single field node (of `Utf8` or `Binary` +logical type, which have no children) and 3 buffers: + +* Validity bitmap +* List offsets +* Byte data + +### Decimal + +TBD + +### Timestamp + +TBD + +## Dictionary encoding + +[1]: https://github.com/apache/arrow/blob/master/format/Layout.md +[2]: http://github.com/google/flatbuffers +[3]: https://github.com/apache/arrow/blob/master/format/Message.fbs diff --git a/format/README.md b/format/README.md index c84e00772c3..3b0e50364d8 100644 --- a/format/README.md +++ b/format/README.md @@ -6,6 +6,7 @@ Currently, the Arrow specification consists of these pieces: +- Metadata specification (see Metadata.md) - Physical memory layout specification (see Layout.md) - Metadata serialized representation (see Message.fbs)