Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-94: [Format] Expand list example to clarify null vs empty list #58

Closed
wants to merge 8 commits into from
213 changes: 192 additions & 21 deletions format/Layout.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ concepts, here is a small glossary to help disambiguate.
* Slot or array slot: a single logical value in an array of some particular data type
* Contiguous memory region: a sequential virtual address space with a given
length. Any byte can be reached via a single pointer offset less than the
regions length.
region's length.
* Primitive type: a data type that occupies a fixed-size memory slot specified
in bit width or byte width
* Nested or parametric type: a data type whose full structure depends on one or
Expand Down Expand Up @@ -42,7 +42,7 @@ Base requirements
* Capable of representing fully-materialized and decoded / decompressed Parquet
data
* All leaf nodes (primitive value arrays) use contiguous memory regions
* Any relative type can be have null slots
* Any relative type can have null slots
* Arrays are immutable once created. Implementations can provide APIs to mutate
an array, but applying mutations will require a new array data structure to
be built.
Expand All @@ -69,11 +69,15 @@ Base requirements
* To define a selection or masking vector construct
* Implementation-specific details
* Details of a user or developer C/C++/Java API.
* Any table structure composed of named arrays each having their own type or
* Any "table" structure composed of named arrays each having their own type or
any other structure that composes arrays.
* Any memory management or reference counting subsystem
* To enumerate or specify types of encodings or compression support

## Byte Order (Endianess)

The Arrow format is little endian.

## Array lengths

Any array has a known and fixed length, stored as a 32-bit signed integer, so a
Expand Down Expand Up @@ -142,10 +146,61 @@ the size is rounded up to the nearest byte.
The associated null bitmap is contiguously allocated (as described above) but
does not need to be adjacent in memory to the values buffer.

(diagram not to scale)

<img src="diagrams/layout-primitive-array.png" width="400"/>
### Example Layout: Int32 Array
For example a primitive array of int32s:

[1, 2, null, 4, 8]

Would look like:

```
* Length: 5, Null count: 1
* Null bitmap buffer:

|Byte 0 (validity bitmap) | Bytes 1-7 |
|-------------------------|-----------------------|
|00011011 | 0 (padding) |

* Value Buffer:

|Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 |
|------------|-------------|-------------|-------------|-------------|
| 1 | 2 | unspecified | 4 | 8 |

```

### Example Layout: Non-null int32 Array

[1, 2, 3, 4, 8] has two possible layouts:

```
* Length: 5, Null count: 0
* Null bitmap buffer:

| Byte 0 (validity bitmap) | Bytes 1-7 (padding) |
|--------------------------|-----------------------|
| 00011111 | 0 (padding) |

* Value Buffer:

|Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | bytes 12-15 | bytes 16-19 |
|------------|-------------|-------------|-------------|-------------|
| 1 | 2 | 3 | 4 | 8 |
```

or with the bitmap elided:

```
* Length 5, Null count: 0
* Null bitmap buffer: Not required
* Value Buffer:

|Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | bytes 12-15 | bytes 16-19 |
|------------|-------------|-------------|-------------|-------------|
| 1 | 2 | 3 | 4 | 8 |

```
## List type

List is a nested type in which each array slot contains a variable-size
Expand Down Expand Up @@ -175,20 +230,84 @@ slot_length = offsets[j + 1] - offsets[j] // (for 0 <= j < length)
The first value in the offsets array is 0, and the last element is the length
of the values array.

Let’s consider an example, the type `List<Char>`, where Char is a 1-byte
### Example Layout: `List<Char>` Array
Let's consider an example, the type `List<Char>`, where Char is a 1-byte
logical type.

For an array of length 3 with respective values:
For an array of length 4 with respective values:

[[‘j’, ‘o’, ‘e’], null, [‘m’, ‘a’, ‘r’, ‘k’]]
[['j', 'o', 'e'], null, ['m', 'a', 'r', 'k'], []]

We have the following offsets and values arrays
will have the following representation:

<img src="diagrams/layout-list.png" width="400"/>
```
* Length: 4, Null count: 1
* Null bitmap buffer:

Let’s consider an array of a nested type, `List<List<byte>>`
| Byte 0 (validity bitmap) | Bytes 0-7 |
|--------------------------|-----------------------|
| 00001101 | 0 (padding) |

<img src="diagrams/layout-list-of-list.png" width="400"/>
* Offsets array (int32 array)
* Length: 5, Null count: 0
* Null bitmap buffer: Not required
* Value Buffer (offsets into the Values array):

| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 |
|------------|-------------|-------------|-------------|-------------|
| 0 | 3 | 3 | 7 | 7 |

* Values array (char array):
* Length: 7, Null count: 0
* Null bitmap buffer: Not required

| Bytes 0-7 |
|------------|
| joemark |
```

### Example Layout: `List<List<byte>>`
[[[1, 2], [3, 4]], [[5, 6, 7], null, [8]], [[9, 10]]]

will be be represented as follows:

```
* Length 3
* Nulls count: 0
* Null bitmap buffer: Not required
* Offsets array (int32 array)
* Length: 4, Null count: 0
* Null bitmap buffer: Not required
* Value Buffer (offsets into the Values array):

| Bytes 0-3 | Bytes 3-6 | Bytes 7-10 | Bytes 10-13 |
|------------|------------|------------|-------------|
| 0 | 2 | 6 | 7 |

* Values array (`List<byte>`)
* Length: 6, Null count: 1
* Null bitmap buffer:

| Byte 0 (validity bitmap) | Bytes 1-7 |
|--------------------------|-------------|
| 00110111 | 0 (padding) |

* Offsets array (int32 array)
* Length 7, Null count: 0
* Null bitmap buffer: Not required

| Bytes 0-28 |
|----------------------|
| 0, 2, 4, 7, 7, 8, 10 |

* Values array (bytes):
* Length: 10, Null count: 0
* Null bitmap buffer: Not required

| Bytes 0-9 |
|-------------------------------|
| 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 |
```

## Struct type

Expand All @@ -198,7 +317,8 @@ types (which can all be distinct), called its fields.
Typically the fields have names, but the names and their types are part of the
type metadata, not the physical memory layout.

A struct does not have any additional allocated physical storage.
A struct array does not have any additional allocated physical storage for its values.
A struct array must still have an allocated null bitmap, if it has one or more null values.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought a struct array will always have the null bitmap, regardless being of any null entry or not?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't required to have the memory allocated if there are no nulls, like the other array types.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thx. I'm not sure about how these types are used. When creating the object of such type, it must be known before if any null exists by scanning the data to put into it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider the following dataset:

data = [
  {foo: 1, bar: null},
  null,
  null,
  {foo: 0, bar: 5}
]       

Physically, a struct type has one child array for each field.

Expand All @@ -213,15 +333,63 @@ Struct <
```

has two child arrays, one List<char> array (layout as above) and one 4-byte
physical value array having Int32 logical type. Here is a diagram showing the
full physical layout of this struct:
primitive value array having Int32 logical type.

### Example Layout: `Struct<List<char>, Int32>`:
The layout for [{'joe', 1}, {null, 2}, null, {'mark', 4}] would be:

```
* Length: 4, Null count: 1
* Null bitmap buffer:

<img src="diagrams/layout-list-of-struct.png" width="400"/>
| Byte 0 (validity bitmap) | Bytes 1-7 |
|--------------------------|-------------|
| 00001011 | 0 (padding) |

* Children arrays:
* field-0 array (`List<char>`):
* Length: 4, Null count: 1
* Null bitmap buffer:

| Byte 0 (validity bitmap) | Bytes 1-7 |
|--------------------------|-----------------------|
| 00011101 | 0 (padding) |

* Offsets array:
* Length: 5, Null count: 0
* Null bitmap buffer: Not required

| byte 0-19 |
|----------------|
| 0, 3, 3, 6, 10 |

* Values array:
* Length: 10, Null count: 0
* Null bitmap buffer: Not required

* Value buffer:

| byte 0-9 |
|----------------|
| joebobmark |

* field-1 array (int32 array):
* Length: 4, Null count: 0
* Null bitmap buffer: Not required
* Value Buffer:

| byte 0-15 |
|----------------|
| 1, 2, 3, 4 |

```

While a struct does not have physical storage for each of its semantic slots
(i.e. each scalar C-like struct), an entire struct slot can be set to null via
the null bitmap. Any of the child field arrays can have null values according
to their respective independent null bitmaps.
In the example above, the child arrays have a valid entries for the null struct
but are 'hidden' from the consumer by the parent array's null bitmap.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I recall there have been some questions around whether the child arrays' bitmaps must necessarily be set to null if the parent struct slot is null. I think the answer is "no" (and, in fact, you could combine a set of immutable constructed-elsewhere arrays with a bitmap that you layer on top to "null out" those other values), so having to twiddle other bitmaps would be onerous and ultimately not that useful.

If you agree perhaps we can spell this out here also specifically.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do agree and I thought I read this elsewhere in the spec, but sounds like it is worth confirming the mailing list?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is this statement already: "Any of the child field arrays can have null values according to their respective independent null bitmaps." I don't think it was ever a controversial point but rather just unclear -- feel free to raise it on the mailing list if you like, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rereading the spec, I agree. I will update the main text accordingly.


## Dense union type

Expand All @@ -248,8 +416,8 @@ Alternate proposal (TBD): the types and offset values may be packed into an
int48 with 2 bytes for the type and 4 bytes for the offset.

Critically, the dense union allows for minimal overhead in the ubiquitous
union-of-structs with non-overlapping-fields use case (Union<s1: Struct1, s2:
Struct2, s3: Struct3, …>)
union-of-structs with non-overlapping-fields use case (`Union<s1: Struct1, s2:
Struct2, s3: Struct3, ...>`)

Here is a diagram of an example dense union:

Expand All @@ -266,15 +434,18 @@ union, it has some advantages that may be desirable in certain use cases:

<img src="diagrams/layout-sparse-union.png" width="400"/>

More amenable to vectorized expression evaluation in some use cases.
Equal-length arrays can be interpreted as a union by only defining the types array
* A sparse union is more amenable to vectorized expression evaluation in some use cases.
* Equal-length arrays can be interpreted as a union by only defining the types array.

Note that nested types in a sparse union must be internally consistent
(e.g. see the List in the diagram), i.e. random access at any index j yields
the correct value.
In other words, the array for the nested type must be valid if it is
reinterpreted as a non-nested array.


## References

Drill docs https://drill.apache.org/docs/value-vectors/

[1]: https://en.wikipedia.org/wiki/Bit_numbering
[1]: https://en.wikipedia.org/wiki/Bit_numbering