feat: add data file format / version information to manifest (#2673)

* Add new "data storage format" property which allows a dataset to specify what file format (only lance) and the version to use when writing data
* Introduce a configurable version to the lance writer and change FSST and bitpacking to be guarded by a 2_1 version instead of env. variables
* Change compression to be based on field metadata instead of environment variables
* Migrate some tests to use v2
lancedb · Aug 5, 2024 · 70a75f3
1 parent 7544611

Showing 51 changed files with 1,360 additions and 552 deletions.
diff --git a/docs/file_struct.png b/docs/file_struct.png
diff --git a/docs/format.rst b/docs/format.rst
@@ -1,5 +1,9 @@
-File Format
-===========
+Lance Formats
+=============
+
+The Lance project includes both a table format and a file format.  Lance typically refers
+to tables as "datasets".  A Lance dataset is designed to efficiently handle secondary indices,
+fast ingestion and modification of data, and a rich set of schema evolution features.
 
 Dataset Directory
 ------------------
@@ -53,31 +57,147 @@ File Structure
 
 Each ``.lance`` file is the container for the actual data.
 
-.. image:: file_struct.png
+.. image:: format_overview.png
 
-At the tail of the file, a `Metadata` protobuf block is used to describe the structure of the data file.
+At the tail of the file, `ColumnMetadata` protobuf blocks are used to describe the encoding of the columns
+in the file.
 
-.. literalinclude:: ../protos/file.proto
+.. literalinclude:: ../protos/file2.proto
    :language: protobuf
    :linenos:
-   :start-at: message Metadata {
-   :end-at: } // Metadata
+   :start-at: // ## Metadata
+   :end-at: } // Metadata-End
 
-Optionally, a ``Manifest`` block can be stored after the ``Metadata`` block, to make the lance file self-describable.
+A ``Footer`` describes the overall layout of the file.  The entire file layout is described here:
 
-In the end of the file, a ``Footer`` is written to indicate the closure of a file:
+.. literalinclude:: ../protos/file2.proto
+   :language: protobuf
+   :linenos:
+   :start-at: // ## File Layout
+   :end-at: // File Layout-End
+
+File Version
+------------
+
+The Lance file format has gone through a number of changes including a breaking change
+from version 1 to version 2.  There are a number of APIs that allow the file version to
+be specified.  Using a newer version of the file format will lead to better compression
+and/or performance.  However, older software versions may not be able to read newer files.
+
+In addition, the latest version of the file format (next) is unstable and should not be
+used for production use cases.  Breaking changes could be made to unstable encodings and
+that would mean that files written with these encodings are no longer readable by any 
+newer versions of Lance.  The ``next`` version should only be used for experimentation
+and benchmarking upcoming features.
+
+The following values are supported:
+
+.. list-table:: File Versions
+    :widths: 20 20 20 40
+    :header-rows: 1
+  
+    * - Version
+      - Minimal Lance Version
+      - Maximum Lance Version
+      - Description
+    * - 0.1
+      - Any
+      - Any
+      - This is the initial Lance format.
+    * - 2.0
+      - 0.15.0
+      - Any
+      - Rework of the Lance file format that removed row groups and introduced null
+        support for lists, fixed size lists, and primitives
+    * - 2.1 (unstable)
+      - None
+      - Any
+      - Adds FSST string compression and bit packing
+    * - legacy
+      - N/A
+      - N/A
+      - Alias for 0.1
+    * - stable
+      - N/A
+      - N/A
+      - Alias for the latest stable version (currently 2.0)
+    * - next
+      - N/A
+      - N/A
+      - Alias for the latest unstable version (currently 2.1)
+
+File Encodings
+--------------
 
-.. code-block::
+Lance supports a variety of encodings for different data types. The encodings
+are chosen to give both random access and scan performance. Encodings are added
+over time and may be extended in the future.  The manifest records a max format
+version which controls which encodings will be used.  This allows for a gradual
+migration to a new data format so that old readers can still read new data while
+a migration is in progress.
+
+Encodings are divided into "field encodings" and "array encodings".  Field encodings
+are consistent across an entire field of data, while array encodings are used for
+individual pages of data within a field.  Array encodings can nest other array
+encodings (e.g. a dictionary encoding can bitpack the indices); however, array encodings
+cannot nest field encodings.  For this reason, data types such as
+``Dictionary<UInt8, List<String>>`` are not yet supported (since there is no dictionary
+field encoding).
+
+.. list-table:: Encodings Available
+   :widths: 15 15 40 15 15
+   :header-rows: 1
 
-    +---------------+----------------+
-    | 0 - 3 byte    | 4 - 7 byte     |
-    +===============+================+
-    | metadata position (uint64)     |
-    +---------------+----------------+
-    | major version | minor version  |
-    +---------------+----------------+
-    |   Magic number "LANC"          |
-    +--------------------------------+
+   * - Encoding Name
+     - Encoding Type
+     - What it does
+     - Supported Versions
+     - When it is applied
+   * - Basic struct
+     - Field encoding
+     - Encodes non-nullable struct data
+     - >= 2.0
+     - Default encoding for structs
+   * - List
+     - Field encoding
+     - Encodes lists (nullable or non-nullable)
+     - >= 2.0
+     - Default encoding for lists
+   * - Basic Primitive
+     - Field encoding
+     - Encodes primitive data types using separate validity array
+     - >= 2.0
+     - Default encoding for primitive data types
+   * - Value
+     - Array encoding
+     - Encodes a single vector of fixed-width values
+     - >= 2.0
+     - Fallback encoding for fixed-width types
+   * - Binary
+     - Array encoding
+     - Encodes a single vector of variable-width data
+     - >= 2.0
+     - Fallback encoding for variable-width types
+   * - Dictionary
+     - Array encoding
+     - Encodes data using a dictionary array and an indices array which is useful for large data types with few unique values
+     - >= 2.0
+     - Used on string pages with fewer than 100 unique elements
+   * - Packed struct
+     - Array encoding
+     - Encodes a struct with fixed-width fields in a row-major format making random access more efficient
+     - >= 2.0
+     - Only used on struct types if the field metadata attribute ``"packed"`` is set to ``"true"``
+   * - Fsst
+     - Array encoding
+     - Compresses binary data by identifying common substrings (of 8 bytes or less) and encoding them as symbols
+     - >= 2.1
+     - Used on string pages that are not dictionary encoded
+   * - Bitpacking
+     - Array encoding
+     - Encodes a single vector of fixed-width values using bitpacking which is useful for integral types that do not span the full range of values
+     - >= 2.1
+     - Used on integral types
 
 Feature Flags
 -------------
@@ -143,47 +263,6 @@ Would be represented as the following field list:
      - 2
      - ``"int32"``
 
-Encodings
----------
-
-`Lance` uses encodings that can render good both point query and scan performance.
-Generally, it requires:
-
-1. It takes no more than 2 disk reads to access any data points.
-2. It takes sub-linear computation (``O(n)``) to locate one piece of data.
-
-Plain Encoding
-~~~~~~~~~~~~~~
-
-Plain encoding stores Arrow array with **fixed size** values, such as primitive values, in contiguous space on disk.
-Because the size of each value is fixed, the offset of a particular value can be computed directly.
-
-Null: TBD
-
-Variable-Length Binary Encoding
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-For variable-length data types, i.e., ``(Large)Binary / (Large)String / (Large)List`` in Arrow, Lance uses variable-length
-encoding. Similar to Arrow in-memory layout, the on-disk layout include an offset array, and the actual data array.
-The offset array contains the **absolute offset** of each value appears in the file.
-
-.. code-block::
-
-    +---------------+----------------+
-    | offset array  | data array     |
-    +---------------+----------------+
-
-
-If ``offsets[i] == offsets[i + 1]``, we treat the ``i-th`` value as ``Null``.
-
-Dictionary Encoding
-~~~~~~~~~~~~~~~~~~~
-
-Directory encoding is a composite encoding for a
-`Arrow Dictionary Type <https://arrow.apache.org/docs/python/generated/pyarrow.DictionaryType.html#pyarrow.DictionaryType>`_,
-where Lance encodes the `key` and `value` separately using primitive encoding types,
-i.e., `key` are usually encoded with `Plain Encoding`_.
-
 
 Dataset Update and Schema Evolution
 -----------------------------------
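
The version aliases and the "When it is applied" column in the docs above can be read as a small decision procedure. The following is a minimal sketch of that reading for string pages only. The names `ALIASES`, `resolve_version`, and `choose_string_page_encoding` are hypothetical and are not part of the Lance API, and the real writer may weigh other factors.

```python
from typing import Tuple

# Alias table taken from the "File Versions" list in docs/format.rst.
# Hypothetical helper names; this is not how Lance itself is implemented.
ALIASES = {"legacy": "0.1", "stable": "2.0", "next": "2.1"}
KNOWN_VERSIONS = ("0.1", "2.0", "2.1")


def resolve_version(version: str) -> Tuple[int, int]:
    """Map an alias or a "major.minor" string to a (major, minor) tuple."""
    concrete = ALIASES.get(version, version)
    if concrete not in KNOWN_VERSIONS:
        raise ValueError(f"unsupported file version: {version!r}")
    major, minor = concrete.split(".")
    return int(major), int(minor)


def choose_string_page_encoding(version: str, unique_values: int) -> str:
    """Pick an array encoding for one string page, following the table's rules."""
    resolved = resolve_version(version)
    if resolved < (2, 0):
        raise ValueError("the encodings table only covers versions >= 2.0")
    # Dictionary: string pages with fewer than 100 unique elements.
    if unique_values < 100:
        return "dictionary"
    # FSST: string pages that are not dictionary encoded, 2.1 and later.
    if resolved >= (2, 1):
        return "fsst"
    # Binary: fallback for variable-width data.
    return "binary"
```

Under these assumptions, a page with many distinct strings is written as `binary` at `stable` and as `fsst` at `next`, which is the behavior the commit message describes as being guarded by the 2.1 version rather than by environment variables.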

diff --git a/docs/format_overview.png b/docs/format_overview.png
diff --git a/docs/index.rst b/docs/index.rst
@@ -43,7 +43,7 @@ Preview releases receive the same level of testing as regular releases.
 
    Quickstart <./notebooks/quickstart>
    ./read_and_write
-   File Format <./format>
+   Lance Formats <./format>
    Arrays <./arrays>
    Integrations <./integrations/integrations>
    API References <./api/api>

diff --git a/protos/file2.proto b/protos/file2.proto
@@ -92,6 +92,8 @@ import "google/protobuf/empty.proto";
 // |   "LANC"                         |
 // ├──────────────────────────────────┤
 //
+// File Layout-End
+//
 // ## Data Pages
 //
 // A lot of flexibility is provided in how data is stored.  Note that the file
@@ -196,7 +198,7 @@ message ColumnMetadata {
   // This field will have the same length as `buffer_offsets` and
   // may be empty.
   repeated uint64 buffer_sizes = 4;
-}
+} // Metadata-End
 
 // ## Where is the rest?
 //

diff --git a/protos/table.proto b/protos/table.proto
@@ -121,6 +121,22 @@ message Manifest {
   //
   // This is only used if the "move_stable_row_ids" feature flag is set.
   uint64 next_row_id = 14;
+
+  message DataStorageFormat {
+    // The format of the data files (e.g. "lance")
+    string file_format = 1;
+    // The max format version of the data files.
+    //
+    // This is the maximum version of the file format that the dataset will create.
+    // This may be lower than the maximum version that can be written in order to allow
+    // older readers to read the dataset.
+    string version = 2;
+  }
+
+  // The data storage format
+  //
+  // This specifies what format is used to store the data files.
+  DataStorageFormat data_format = 15;
 } // Manifest
 
 // Auxiliary Data attached to a version.

diff --git a/python/python/lance/dataset.py b/python/python/lance/dataset.py
@@ -374,6 +374,13 @@ def lance_schema(self) -> "LanceSchema":
         """
         return self._ds.lance_schema
 
+    @property
+    def data_storage_version(self) -> str:
+        """
+        The version of the data storage format this dataset is using
+        """
+        return self._ds.data_storage_version
+
     def to_table(
         self,
         columns: Optional[Union[List[str], Dict[str, str]]] = None,
@@ -2641,7 +2648,8 @@ def write_dataset(
     commit_lock: Optional[CommitLock] = None,
     progress: Optional[FragmentWriteProgress] = None,
     storage_options: Optional[Dict[str, str]] = None,
-    use_legacy_format: bool = True,
+    data_storage_version: str = "legacy",
+    use_legacy_format: Optional[bool] = None,
 ) -> LanceDataset:
     """Write a given data_obj to the given uri
 
@@ -2681,10 +2689,25 @@ def write_dataset(
     storage_options : optional, dict
         Extra options that make sense for a particular storage connection. This is
         used to store connection parameters like credentials, endpoint, etc.
-    use_legacy_format : optional, bool, default True
-        Use the Lance v1 writer to write Lance v1 files.  The default is currently
-        True but will change as we roll out the v2 format.
+    data_storage_version: optional, str, default "legacy"
+        The version of the data storage format to use. Newer versions are more
+        efficient but require newer versions of lance to read.  The default is
+        "legacy" which will use the legacy v1 version.  See the user guide
+        for more details.
+    use_legacy_format : optional, bool, default None
+        Deprecated method for setting the data storage version. Use the
+        `data_storage_version` parameter instead.
     """
+    if use_legacy_format is not None:
+        warnings.warn(
+            "use_legacy_format is deprecated, use data_storage_version instead",
+            DeprecationWarning,
+        )
+        if use_legacy_format:
+            data_storage_version = "legacy"
+        else:
+            data_storage_version = "stable"
+
     if _check_for_hugging_face(data_obj):
         # Huggingface datasets
         from .dependencies import datasets
@@ -2705,7 +2728,7 @@ def write_dataset(
         "max_bytes_per_file": max_bytes_per_file,
         "progress": progress,
         "storage_options": storage_options,
-        "use_legacy_format": use_legacy_format,
+        "data_storage_version": data_storage_version,
     }
 
     if commit_lock:
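
The deprecation handling added to `write_dataset` above can be exercised in isolation. The sketch below copies that branch into a standalone function so its behavior is easy to check without installing lance. The function name `resolve_storage_version` is hypothetical, and the `stacklevel` argument is an addition that is not in the diff.

```python
import warnings
from typing import Optional


def resolve_storage_version(
    data_storage_version: str = "legacy",
    use_legacy_format: Optional[bool] = None,
) -> str:
    """Mirror the use_legacy_format -> data_storage_version mapping from the diff."""
    if use_legacy_format is not None:
        warnings.warn(
            "use_legacy_format is deprecated, use data_storage_version instead",
            DeprecationWarning,
            stacklevel=2,  # assumption: point the warning at the caller
        )
        # The deprecated flag wins over any explicit data_storage_version.
        return "legacy" if use_legacy_format else "stable"
    return data_storage_version


# Record warnings so the deprecation path can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = resolve_storage_version("next", use_legacy_format=False)
```

One consequence of this logic is that passing `use_legacy_format=False` together with `data_storage_version="next"` yields `"stable"`, so callers who want a specific version should drop the deprecated argument entirely.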

diff --git a/python/python/lance/file.py b/python/python/lance/file.py
@@ -149,6 +149,7 @@ def __init__(
         schema: Optional[pa.Schema] = None,
         *,
         data_cache_bytes: Optional[int] = None,
+        version: Optional[str] = None,
         **kwargs,
     ):
         """
@@ -166,9 +167,13 @@ def __init__(
         data_cache_bytes: int
             How many bytes (per column) to cache before writing a page.  The
             default is an appropriate value based on the filesystem.
+        version: str
+            The version of the file format to write.  If not specified then
+            the latest stable version will be used.  Newer versions are more
+            efficient but may not be readable by older versions of the software.
         """
         self._writer = _LanceFileWriter(
-            path, schema, data_cache_bytes=data_cache_bytes, **kwargs
+            path, schema, data_cache_bytes=data_cache_bytes, version=version, **kwargs
         )
         self.closed = False