Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-37876: [Format] Add list-view specification to arrow format #37877

Merged
merged 13 commits into from
Oct 5, 2023
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
132 changes: 120 additions & 12 deletions docs/source/format/Columnar.rst
Original file line number Diff line number Diff line change
Expand Up @@ -100,15 +100,15 @@ Arrays are defined by a few pieces of metadata and data:
Nested arrays additionally have a sequence of one or more sets of
these items, called the **child arrays**.

Each logical data type has a well-defined physical layout. Here are
the different physical layouts defined by Arrow:
Each logical data type has one or more well-defined physical layouts. Here
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep the singular. There is no disjunction in Arrow (unlike Parquet) between "logical" data type and physical layout. ListView and StringView are simply distinct types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will change this back to singular and all the other places I've changed it. But in the future, the "logical data type" terminology should probably be removed altogether because it's very confusing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely agree with that. The spec was often confusing to me at the start.

are the different physical layouts defined by Arrow:

* **Primitive (fixed-size)**: a sequence of values each having the
same byte or bit width
* **Variable-size Binary**: a sequence of values each having a variable
byte length. Two variants of this layout are supported using 32-bit
and 64-bit length encoding.
* **Views of Variable-size Binary**: a sequence of values each having a
* **View of Variable-size Binary**: a sequence of values each having a
variable byte length. In contrast to Variable-size Binary, the values
of this layout are distributed across potentially multiple buffers
instead of densely and sequentially packed in a single buffer.
Expand All @@ -118,6 +118,11 @@ the different physical layouts defined by Arrow:
variable-length sequence of values taken from a child data type. Two
variants of this layout are supported using 32-bit and 64-bit length
encoding.
* **View of Variable-size List**: a nested layout where each value is a
variable-length sequence of values taken from a child data type. This
layout differs from **Variable-size List** by having an additional
buffer containing the sizes of each list value. This removes a constraint
on the offsets buffer — it does not need to be in order.
* **Struct**: a nested layout consisting of a collection of named
child **fields** each having the same length but possibly different
types.
Expand Down Expand Up @@ -382,7 +387,7 @@ In both the long and short string cases, the first four bytes encode the
length of the string and can be used to determine how the rest of the view
should be interpreted.

In the short string case the string's bytes are inlined- stored inside the
In the short string case the string's bytes are inlined stored inside the
view itself, in the twelve bytes which follow the length.

In the long string case, a buffer index indicates which data buffer
Expand All @@ -401,11 +406,17 @@ This layout is adapted from TU Munich's `UmbraDB`_.

.. _variable-size-list-layout:

Variable-size List Layout
-------------------------
Variable-size List Layouts
--------------------------

List is a nested type which is semantically similar to variable-size
binary. It is defined by two buffers, a validity bitmap and an offsets
binary. There are two list layout variations — "list" and "list-view" —
and each variation can use either 32-bit or 64-bit offsets.
felipecrv marked this conversation as resolved.
Show resolved Hide resolved

List Layout
~~~~~~~~~~~

The List layout is defined by two buffers, a validity bitmap and an offsets
buffer, and a child array. The offsets are the same as in the
variable-size binary case, and both 32-bit and 64-bit signed integer
offsets are supported options for the offsets. Rather than referencing
Expand Down Expand Up @@ -487,6 +498,103 @@ will be represented as follows: ::
|-------------------------------|-----------------------|
| 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 | unspecified (padding) |

ListView Layout
~~~~~~~~~~~~~~~

The ListView layout is defined by three buffers instead of just two:
felipecrv marked this conversation as resolved.
Show resolved Hide resolved
a validity bitmap, an offsets buffer, and an additional sizes buffer.
The sizes have the same bit width as the offsets and both 32-bit and 64-bit
signed integer options are supported. Like in the List layout, the offsets
reference the child array.

Rather then inferring list lengths from the offsets, the sizes buffer
stores the length of each list in the array. This in turn allows offsets to be
felipecrv marked this conversation as resolved.
Show resolved Hide resolved
out of order. Elements of the child array do not have to be stored in the
same order they logically appear in the list elements of the parent array.

When a value is null, the corresponding offset and size can have arbitrary
values. When size is 0, the corresponding offset can have an arbitrary value.
felipecrv marked this conversation as resolved.
Show resolved Hide resolved
If choosing a value is possible, we recommend setting offsets and sizes to 0 in
felipecrv marked this conversation as resolved.
Show resolved Hide resolved
these cases.

A list-view type is specified like ``ListView<T>``, where ``T`` is any type
(primitive or nested). In these examples we use 32-bit offsets where
the 64-bit offset version would be denoted by ``LargeListView<T>``.
felipecrv marked this conversation as resolved.
Show resolved Hide resolved

**Example Layout: ``List<Int8>`` Array**
felipecrv marked this conversation as resolved.
Show resolved Hide resolved

We illustrate an example of ``ListView<Int8>`` with length 4 having values::

[[12, -7, 25], null, [0, -127, 127, 50], []]

will have the following representation: ::
felipecrv marked this conversation as resolved.
Show resolved Hide resolved

* Length: 4, Null count: 1
* Validity bitmap buffer:

| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-----------------------|
| 00001101 | 0 (padding) |

* Offsets buffer (int32)

| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 |
|------------|-------------|-------------|-------------|-----------------------|
| 0 | unspecified | 3 | unspecified | unspecified (padding) |

* Sizes buffer (int32)

| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 |
|------------|-------------|-------------|-------------|-----------------------|
| 3 | unspecified | 4 | 0 | unspecified (padding) |

* Values array (Int8array):
felipecrv marked this conversation as resolved.
Show resolved Hide resolved
* Length: 7, Null count: 0
* Validity bitmap buffer: Not required
* Values buffer (int8)

| Bytes 0-6 | Bytes 7-63 |
|------------------------------|-----------------------|
| 12, -7, 25, 0, -127, 127, 50 | unspecified (padding) |

**Example Layout: ``ListView<Int8>`` Array**

We continue with the ``ListView<Int8>`` type, but this instance illustrates out
of order offsets and sharing of child array values. It is an array with length 5
having logical values::

[[12, -7, 25], null, [0, -127, 127, 50], [], [50, 12]]

It will have the following representation: ::

* Length: 4, Null count: 1
* Validity bitmap buffer:

| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-----------------------|
| 00011101 | 0 (padding) |

* Offsets buffer (int32)

| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 |
|------------|-------------|-------------|-------------|-------------|-----------------------|
| 4 | unspecified | 0 | unspecified | 3 | unspecified (padding) |

* Sizes buffer (int32)

| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 |
|------------|-------------|-------------|-------------|-------------|-----------------------|
| 3 | unspecified | 4 | 0 | 2 | unspecified (padding) |

* Values array (Int8array):
* Length: 7, Null count: 0
* Validity bitmap buffer: Not required
* Values buffer (int8)

| Bytes 0-6 | Bytes 7-63 |
|------------------------------|-----------------------|
| 0, -127, 127, 50, 12, -7, 25 | unspecified (padding) |

Fixed-Size List Layout
----------------------

Expand Down Expand Up @@ -618,8 +726,8 @@ for the null struct but they are "hidden" by the struct array's validity
bitmap. However, when treated independently, corresponding entries of the
children array will be non-null.

Union Layout
------------
Union Layouts
-------------
felipecrv marked this conversation as resolved.
Show resolved Hide resolved

A union is defined by an ordered sequence of types; each slot in the
union can have a value chosen from these types. The types are named
Expand Down Expand Up @@ -858,19 +966,19 @@ are held in the second child array.
For the purposes of determining field names and schemas, these child arrays
are prescribed the standard names of **run_ends** and **values** respectively.

The values in the first child array represent the accumulated length of all runs
The values in the first child array represent the accumulated length of all runs
from the first to the current one, i.e. the logical index where the
current run ends. This allows relatively efficient random access from a logical
index using binary search. The length of an individual run can be determined by
subtracting two adjacent values. (Contrast this with run-length encoding, in
which the lengths of the runs are represented directly, and in which random
access is less efficient.)
access is less efficient.)

.. note::
Because the ``run_ends`` child array cannot have nulls, it's reasonable
to consider why the ``run_ends`` are a child array instead of just a
buffer, like the offsets for a :ref:`variable-size-list-layout`. This
layout was considered, but it was decided to use the child arrays.
layout was considered, but it was decided to use the child arrays.

Child arrays allow us to keep the "logical length" (the decoded length)
associated with the parent array and the "physical length" (the number
Expand Down
16 changes: 15 additions & 1 deletion format/Schema.fbs
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,8 @@
/// Version 1.1 - Add Decimal256.
/// Version 1.2 - Add Interval MONTH_DAY_NANO.
/// Version 1.3 - Add Run-End Encoded.
/// Version 1.4 - Add BinaryView, Utf8View, and variadicBufferCounts.
/// Version 1.4 - Add BinaryView, Utf8View, variadicBufferCounts, ListView, and
/// LargeListView.

namespace org.apache.arrow.flatbuf;

Expand Down Expand Up @@ -97,6 +98,17 @@ table List {
table LargeList {
}

/// Represents the same logical types that List can, but contains offsets and
/// sizes allowing for writes in any order and sharing of child values among
/// list values.
table ListView {
}

/// Same as ListVIew, but with 64-bit offsets and sizes, allowing to represent
felipecrv marked this conversation as resolved.
Show resolved Hide resolved
/// extremely large data values.
table LargeListView {
}

table FixedSizeList {
/// Number of list items per value
listSize: int;
Expand Down Expand Up @@ -451,6 +463,8 @@ union Type {
RunEndEncoded,
BinaryView,
Utf8View,
ListView,
LargeListView,
}

/// ----------------------------------------------------------------------
Expand Down