feat: add data file format / version information to manifest (#2673)
- Add new "data storage format" property which allows a dataset to specify
  what file format (only lance) and the version to use when writing data
- Introduce a configurable version to the lance writer and change FSST and
  bitpacking to be guarded by a 2_1 version instead of environment variables
- Change compression to be based on field metadata instead of environment
  variables
- Migrate some tests to use v2
westonpace authored Aug 5, 2024
1 parent 7544611 commit 70a75f3
Showing 51 changed files with 1,360 additions and 552 deletions.
Binary file removed docs/file_struct.png
Binary file not shown.
199 changes: 139 additions & 60 deletions docs/format.rst
@@ -1,5 +1,9 @@
File Format
===========
Lance Formats
=============

The Lance project includes both a table format and a file format. Lance typically refers
to tables as "datasets". A Lance dataset is designed to efficiently handle secondary indices,
fast ingestion and modification of data, and a rich set of schema evolution features.

Dataset Directory
------------------
@@ -53,31 +57,147 @@ File Structure

Each ``.lance`` file is the container for the actual data.

.. image:: file_struct.png
.. image:: format_overview.png

At the tail of the file, a `Metadata` protobuf block is used to describe the structure of the data file.
At the tail of the file, `ColumnMetadata` protobuf blocks are used to describe the encoding of the columns
in the file.

.. literalinclude:: ../protos/file.proto
.. literalinclude:: ../protos/file2.proto
:language: protobuf
:linenos:
:start-at: message Metadata {
:end-at: } // Metadata
:start-at: // ## Metadata
:end-at: } // Metadata-End

Optionally, a ``Manifest`` block can be stored after the ``Metadata`` block, to make the lance file self-describable.
A ``Footer`` describes the overall layout of the file. The entire file layout is described here:

In the end of the file, a ``Footer`` is written to indicate the closure of a file:
.. literalinclude:: ../protos/file2.proto
:language: protobuf
:linenos:
:start-at: // ## File Layout
:end-at: // File Layout-End

File Version
------------

The Lance file format has gone through a number of changes including a breaking change
from version 1 to version 2. There are a number of APIs that allow the file version to
be specified. Using a newer version of the file format will lead to better compression
and/or performance. However, older software versions may not be able to read newer files.

In addition, the latest version of the file format (next) is unstable and should not be
used for production use cases. Breaking changes could be made to unstable encodings and
that would mean that files written with these encodings are no longer readable by any
newer versions of Lance. The ``next`` version should only be used for experimentation
and benchmarking upcoming features.

The following values are supported:

.. list-table:: File Versions
:widths: 20 20 20 40
:header-rows: 1
* - Version
- Minimal Lance Version
- Maximum Lance Version
- Description
* - 0.1
- Any
- Any
- This is the initial Lance format.
* - 2.0
- 0.15.0
- Any
- Rework of the Lance file format that removed row groups and introduced null
support for lists, fixed size lists, and primitives
* - 2.1 (unstable)
- None
- Any
- Adds FSST string compression and bit packing
* - legacy
- N/A
- N/A
- Alias for 0.1
* - stable
- N/A
- N/A
- Alias for the latest stable version (currently 2.0)
* - next
- N/A
- N/A
- Alias for the latest unstable version (currently 2.1)
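
The alias rows in the table above can be read as a small lookup. The following is a minimal sketch of that lookup, assuming the mapping shown in the table at the time of this commit; the helper names ``resolve_file_version`` and ``is_unstable`` are hypothetical and are not part of the Lance API.

```python
# Alias and version tables copied from the "File Versions" list above.
# The helper names are hypothetical and are not part of the Lance API.
ALIASES = {"legacy": "0.1", "stable": "2.0", "next": "2.1"}
KNOWN_VERSIONS = {"0.1", "2.0", "2.1"}
UNSTABLE_VERSIONS = {"2.1"}


def resolve_file_version(requested: str) -> str:
    """Map an alias such as "stable" to a concrete version string."""
    version = ALIASES.get(requested, requested)
    if version not in KNOWN_VERSIONS:
        raise ValueError(f"unknown file version: {requested!r}")
    return version


def is_unstable(requested: str) -> bool:
    """Return True if the requested version resolves to an unstable format."""
    return resolve_file_version(requested) in UNSTABLE_VERSIONS


if __name__ == "__main__":
    for name in ("legacy", "stable", "next", "2.0"):
        print(name, "->", resolve_file_version(name), "unstable:", is_unstable(name))
```

Because ``stable`` and ``next`` are moving aliases, a caller that needs reproducible files would likely pin a concrete version string instead.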
File Encodings
--------------

.. code-block::
Lance supports a variety of encodings for different data types. The encodings
are chosen to give both random access and scan performance. Encodings are added
over time and may be extended in the future. The manifest records a max format
version which controls which encodings will be used. This allows for a gradual
migration to a new data format so that old readers can still read new data while
a migration is in progress.

Encodings are divided into "field encodings" and "array encodings". Field encodings
are consistent across an entire field of data, while array encodings are used for
individual pages of data within a field. Array encodings can nest other array
encodings (e.g. a dictionary encoding can bitpack the indices) however array encodings
cannot nest field encodings. For this reason data types such as
``Dictionary<UInt8, List<String>>`` are not yet supported (since there is no dictionary
field encoding).

.. list-table:: Encodings Available
:widths: 15 15 40 15 15
:header-rows: 1

+---------------+----------------+
| 0 - 3 byte | 4 - 7 byte |
+===============+================+
| metadata position (uint64) |
+---------------+----------------+
| major version | minor version |
+---------------+----------------+
| Magic number "LANC" |
+--------------------------------+
* - Encoding Name
- Encoding Type
- What it does
- Supported Versions
- When it is applied
* - Basic struct
- Field encoding
- Encodes non-nullable struct data
- >= 2.0
- Default encoding for structs
* - List
- Field encoding
- Encodes lists (nullable or non-nullable)
- >= 2.0
- Default encoding for lists
* - Basic Primitive
- Field encoding
- Encodes primitive data types using separate validity array
- >= 2.0
- Default encoding for primitive data types
* - Value
- Array encoding
- Encodes a single vector of fixed-width values
- >= 2.0
- Fallback encoding for fixed-width types
* - Binary
- Array encoding
- Encodes a single vector of variable-width data
- >= 2.0
- Fallback encoding for variable-width types
* - Dictionary
- Array encoding
- Encodes data using a dictionary array and an indices array which is useful for large data types with few unique values
- >= 2.0
- Used on string pages with fewer than 100 unique elements
* - Packed struct
- Array encoding
- Encodes a struct with fixed-width fields in a row-major format making random access more efficient
- >= 2.0
- Only used on struct types if the field metadata attribute ``"packed"`` is set to ``"true"``
* - Fsst
- Array encoding
- Compresses binary data by identifying common substrings (of 8 bytes or less) and encoding them as symbols
- >= 2.1
- Used on string pages that are not dictionary encoded
* - Bitpacking
- Array encoding
- Encodes a single vector of fixed-width values using bitpacking which is useful for integral types that do not span the full range of values
- >= 2.1
- Used on integral types
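
To illustrate how array encodings can nest (a dictionary encoding whose indices are bitpacked), here is a simplified sketch. It shows the general technique only and is not Lance's on-disk layout; the function names and the little-endian packing order are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def dictionary_encode(values: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Split values into a list of unique entries and per-row indices."""
    dictionary: List[str] = []
    positions = {}
    indices: List[int] = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def bitpack(values: Sequence[int]) -> Tuple[int, bytes]:
    """Pack non-negative integers using the smallest common bit width."""
    bit_width = max(1, max(values, default=0).bit_length())
    packed = 0
    for i, value in enumerate(values):
        packed |= value << (i * bit_width)
    n_bytes = (len(values) * bit_width + 7) // 8
    return bit_width, packed.to_bytes(n_bytes, "little")


def bitunpack(data: bytes, bit_width: int, count: int) -> List[int]:
    """Inverse of bitpack: recover `count` integers of `bit_width` bits."""
    packed = int.from_bytes(data, "little")
    mask = (1 << bit_width) - 1
    return [(packed >> (i * bit_width)) & mask for i in range(count)]


if __name__ == "__main__":
    rows = ["red", "green", "red", "blue", "red"]
    dictionary, indices = dictionary_encode(rows)
    width, payload = bitpack(indices)
    decoded = [dictionary[i] for i in bitunpack(payload, width, len(rows))]
    print(width, len(payload), decoded == rows)
```

With three unique values the indices need only two bits each, so five rows fit in two bytes instead of five or more.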

Feature Flags
-------------
@@ -143,47 +263,6 @@ Would be represented as the following field list:
- 2
- ``"int32"``

Encodings
---------

`Lance` uses encodings that can render good both point query and scan performance.
Generally, it requires:

1. It takes no more than 2 disk reads to access any data points.
2. It takes sub-linear computation (``O(n)``) to locate one piece of data.

Plain Encoding
~~~~~~~~~~~~~~

Plain encoding stores Arrow array with **fixed size** values, such as primitive values, in contiguous space on disk.
Because the size of each value is fixed, the offset of a particular value can be computed directly.

Null: TBD

Variable-Length Binary Encoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For variable-length data types, i.e., ``(Large)Binary / (Large)String / (Large)List`` in Arrow, Lance uses variable-length
encoding. Similar to Arrow in-memory layout, the on-disk layout include an offset array, and the actual data array.
The offset array contains the **absolute offset** of each value appears in the file.

.. code-block::
+---------------+----------------+
| offset array | data array |
+---------------+----------------+
If ``offsets[i] == offsets[i + 1]``, we treat the ``i-th`` value as ``Null``.

Dictionary Encoding
~~~~~~~~~~~~~~~~~~~

Dictionary encoding is a composite encoding for an
`Arrow Dictionary Type <https://arrow.apache.org/docs/python/generated/pyarrow.DictionaryType.html#pyarrow.DictionaryType>`_,
where Lance encodes the `key` and `value` separately using primitive encoding types,
i.e., `key` are usually encoded with `Plain Encoding`_.


Dataset Update and Schema Evolution
-----------------------------------
Binary file added docs/format_overview.png
2 changes: 1 addition & 1 deletion docs/index.rst
@@ -43,7 +43,7 @@ Preview releases receive the same level of testing as regular releases.

Quickstart <./notebooks/quickstart>
./read_and_write
File Format <./format>
Lance Formats <./format>
Arrays <./arrays>
Integrations <./integrations/integrations>
API References <./api/api>
4 changes: 3 additions & 1 deletion protos/file2.proto
@@ -92,6 +92,8 @@ import "google/protobuf/empty.proto";
// | "LANC" |
// ├──────────────────────────────────┤
//
// File Layout-End
//
// ## Data Pages
//
// A lot of flexibility is provided in how data is stored. Note that the file
@@ -196,7 +198,7 @@ message ColumnMetadata {
// This field will have the same length as `buffer_offsets` and
// may be empty.
repeated uint64 buffer_sizes = 4;
}
} // Metadata-End

// ## Where is the rest?
//
16 changes: 16 additions & 0 deletions protos/table.proto
@@ -121,6 +121,22 @@ message Manifest {
//
// This is only used if the "move_stable_row_ids" feature flag is set.
uint64 next_row_id = 14;

message DataStorageFormat {
// The format of the data files (e.g. "lance")
string file_format = 1;
// The max format version of the data files.
//
// This is the maximum version of the file format that the dataset will create.
// This may be lower than the maximum version that can be written in order to allow
// older readers to read the dataset.
string version = 2;
}

// The data storage format
//
// This specifies what format is used to store the data files.
DataStorageFormat data_format = 15;
} // Manifest

// Auxiliary Data attached to a version.
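
The ``version`` field of ``DataStorageFormat`` caps the file format version a dataset will write, and the encodings table in docs/format.rst gives a minimum version for each encoding. A minimal sketch of how such a cap could gate encodings follows. The dictionary keys and function names are hypothetical, and the actual selection logic in Lance may differ.

```python
from typing import Dict, List, Tuple

# Minimum file version per encoding, taken from the "Encodings Available" table.
# The snake_case keys are illustrative names, not Lance identifiers.
ENCODING_MIN_VERSION: Dict[str, Tuple[int, int]] = {
    "basic_struct": (2, 0),
    "list": (2, 0),
    "basic_primitive": (2, 0),
    "value": (2, 0),
    "binary": (2, 0),
    "dictionary": (2, 0),
    "packed_struct": (2, 0),
    "fsst": (2, 1),
    "bitpacking": (2, 1),
}


def parse_version(version: str) -> Tuple[int, int]:
    """Turn a "major.minor" string into a tuple that compares numerically."""
    major, minor = version.split(".")
    return int(major), int(minor)


def available_encodings(max_version: str) -> List[str]:
    """List encodings whose minimum version does not exceed the dataset cap."""
    limit = parse_version(max_version)
    return sorted(
        name for name, minimum in ENCODING_MIN_VERSION.items() if minimum <= limit
    )


if __name__ == "__main__":
    print(available_encodings("2.0"))
    print(available_encodings("2.1"))
```

Comparing parsed tuples rather than raw strings avoids ordering mistakes such as ``"2.10" < "2.9"``.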
33 changes: 28 additions & 5 deletions python/python/lance/dataset.py
@@ -374,6 +374,13 @@ def lance_schema(self) -> "LanceSchema":
"""
return self._ds.lance_schema

@property
def data_storage_version(self) -> str:
"""
The version of the data storage format this dataset is using
"""
return self._ds.data_storage_version

def to_table(
self,
columns: Optional[Union[List[str], Dict[str, str]]] = None,
@@ -2641,7 +2648,8 @@ def write_dataset(
commit_lock: Optional[CommitLock] = None,
progress: Optional[FragmentWriteProgress] = None,
storage_options: Optional[Dict[str, str]] = None,
use_legacy_format: bool = True,
data_storage_version: str = "legacy",
use_legacy_format: Optional[bool] = None,
) -> LanceDataset:
"""Write a given data_obj to the given uri
@@ -2681,10 +2689,25 @@ def write_dataset(
storage_options : optional, dict
Extra options that make sense for a particular storage connection. This is
used to store connection parameters like credentials, endpoint, etc.
use_legacy_format : optional, bool, default True
Use the Lance v1 writer to write Lance v1 files. The default is currently
True but will change as we roll out the v2 format.
data_storage_version: optional, str, default "legacy"
The version of the data storage format to use. Newer versions are more
efficient but require newer versions of lance to read. The default is
"legacy" which will use the legacy v1 version. See the user guide
for more details.
use_legacy_format : optional, bool, default None
Deprecated method for setting the data storage version. Use the
`data_storage_version` parameter instead.
"""
if use_legacy_format is not None:
warnings.warn(
"use_legacy_format is deprecated, use data_storage_version instead",
DeprecationWarning,
)
if use_legacy_format:
data_storage_version = "legacy"
else:
data_storage_version = "stable"

if _check_for_hugging_face(data_obj):
# Huggingface datasets
from .dependencies import datasets
@@ -2705,7 +2728,7 @@ def write_dataset(
"max_bytes_per_file": max_bytes_per_file,
"progress": progress,
"storage_options": storage_options,
"use_legacy_format": use_legacy_format,
"data_storage_version": data_storage_version,
}

if commit_lock:
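
The deprecation shim added to ``write_dataset`` can be studied in isolation. The sketch below restates that logic as a standalone function; the name ``resolve_data_storage_version`` and the ``stacklevel`` argument are additions for illustration and do not appear in the diff.

```python
import warnings
from typing import Optional


def resolve_data_storage_version(
    data_storage_version: str = "legacy",
    use_legacy_format: Optional[bool] = None,
) -> str:
    """Return the storage version, honoring the deprecated boolean flag."""
    if use_legacy_format is not None:
        # The deprecated flag, when given, overrides data_storage_version.
        warnings.warn(
            "use_legacy_format is deprecated, use data_storage_version instead",
            DeprecationWarning,
            stacklevel=2,  # assumption: point the warning at the caller
        )
        return "legacy" if use_legacy_format else "stable"
    return data_storage_version


if __name__ == "__main__":
    print(resolve_data_storage_version())
    print(resolve_data_storage_version("2.0"))
    print(resolve_data_storage_version(use_legacy_format=False))
```

Note that, as in the diff, passing ``use_legacy_format`` wins over an explicit ``data_storage_version``, so callers migrating to the new parameter should drop the old flag entirely.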
7 changes: 6 additions & 1 deletion python/python/lance/file.py
@@ -149,6 +149,7 @@ def __init__(
schema: Optional[pa.Schema] = None,
*,
data_cache_bytes: Optional[int] = None,
version: Optional[str] = None,
**kwargs,
):
"""
@@ -166,9 +167,13 @@ def __init__(
data_cache_bytes: int
How many bytes (per column) to cache before writing a page. The
default is an appropriate value based on the filesystem.
version: str
The version of the file format to write. If not specified then
the latest stable version will be used. Newer versions are more
efficient but may not be readable by older versions of the software.
"""
self._writer = _LanceFileWriter(
path, schema, data_cache_bytes=data_cache_bytes, **kwargs
path, schema, data_cache_bytes=data_cache_bytes, version=version, **kwargs
)
self.closed = False
