docs: reorg the scalar index python docstring to make the index type clear #2678

Merged
merged 1 commit into from
Aug 2, 2024
80 changes: 40 additions & 40 deletions python/python/lance/dataset.py
@@ -1196,10 +1196,9 @@ def create_scalar_index(
dataset = lance.dataset("/tmp/images.lance")
my_table = dataset.scanner(filter="my_col != 7").to_table()

Scalar indices can also speed up scans containing a vector search and a
prefilter:
Vector search with pre-filters can also benefit from scalar indices. For example,

.. code-block::python
.. code-block:: python

import lance

@@ -1214,44 +1213,25 @@ def create_scalar_index(
prefilter=True
)

Scalar indices can only speed up scans for basic filters using
equality, comparison, range (e.g. ``my_col BETWEEN 0 AND 100``), and set
membership (e.g. ``my_col IN (0, 1, 2)``).

Scalar indices can be used if the filter contains multiple indexed columns and
the filter criteria are AND'd or OR'd together
(e.g. ``my_col < 0 AND other_col > 100``).

Scalar indices may be used if the filter contains non-indexed columns but,
depending on the structure of the filter, they may not be usable. For example,
if the column ``not_indexed`` does not have a scalar index then the filter
``my_col = 0 OR not_indexed = 1`` will not be able to use any scalar index on
``my_col``.

To determine if a scan is making use of a scalar index you can use
``explain_plan`` to look at the query plan that lance has created. Queries
that use scalar indices will either have a ``ScalarIndexQuery`` relation or a
``MaterializeIndex`` operator.

There are 4 types of scalar indices available today. The most common
type is ``BTREE``. This index is inspired by the btree data structure
although only the first few layers of the btree are cached in memory. It will
perform well on columns with a large number of unique values and few rows per
value.

The other common index type is ``BITMAP``. This index stores a bitmap for each
unique value in the column. This index is useful for columns with a small
number of unique values and many rows per value.

The ``LABEL_LIST`` index type is a special index that is used to index list
columns whose values have small cardinality. For example, a column that
contains lists of tags (e.g. ``["tag1", "tag2", "tag3"]``) can be indexed
with a ``LABEL_LIST`` index. This index can only speed up queries with
``array_has_any`` or ``array_has_all`` filters.

The ``INVERTED`` index type is used to index document columns. This index
can conduct full-text searches, for example finding rows whose column contains
any word of the query string "hello world". The results will be ranked by BM25.
There are 4 types of scalar indices available today.

* ``BTREE``. The most common type. This index is inspired by the btree data
structure, although only the first few layers of the btree are cached in
memory. It performs well on columns with a large number of unique values and
few rows per value.
* ``BITMAP``. This index stores a bitmap for each unique value in the column.
It is useful for columns with a small number of unique values and
many rows per value.
* ``LABEL_LIST``. A special index used to index list columns whose values
have small cardinality. For example, a column that contains lists of tags
(e.g. ``["tag1", "tag2", "tag3"]``) can be indexed with a ``LABEL_LIST``
index. This index can only speed up queries with ``array_has_any`` or
``array_has_all`` filters.
* ``INVERTED``. Used to index document columns. This index can conduct
full-text searches, for example finding rows whose column contains any word
of the query string "hello world". The results are ranked by BM25.
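
To make the ``BITMAP`` and ``LABEL_LIST`` descriptions more concrete, here is a toy pure-Python sketch. It is not how Lance stores these indices; it only assumes the idea stated above (one bitmap per unique value or label) and uses Python integers as bitmaps. All function names are hypothetical helpers invented for this illustration.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each unique value to an int whose set bits mark matching row ids."""
    bitmaps = defaultdict(int)
    for row_id, value in enumerate(values):
        bitmaps[value] |= 1 << row_id
    return dict(bitmaps)


def build_label_index(label_lists):
    """Like a bitmap index, but each row contributes every label in its list."""
    bitmaps = defaultdict(int)
    for row_id, labels in enumerate(label_lists):
        for label in labels:
            bitmaps[label] |= 1 << row_id
    return dict(bitmaps)


def rows_from_bitmap(bitmap):
    """Decode a bitmap int back into a sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


def array_has_any(index, labels):
    """Rows whose list contains at least one label: OR the bitmaps."""
    result = 0
    for label in labels:
        result |= index.get(label, 0)
    return rows_from_bitmap(result)


def array_has_all(index, labels, num_rows):
    """Rows whose list contains every label: AND the bitmaps."""
    result = (1 << num_rows) - 1  # start with all rows selected
    for label in labels:
        result &= index.get(label, 0)
    return rows_from_bitmap(result)
```

With few unique values each bitmap covers many rows, which is the case the ``BITMAP`` bullet describes; with many unique values the number of bitmaps grows and a ``BTREE`` is the better fit.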

Note that the ``LANCE_BYPASS_SPILLING`` environment variable can be used to
bypass spilling to disk. Setting this to true can avoid memory exhaustion
@@ -1285,6 +1265,26 @@ def create_scalar_index(
"category",
"BTREE",
)

Scalar indices can only speed up scans for basic filters using
equality, comparison, range (e.g. ``my_col BETWEEN 0 AND 100``), and set
membership (e.g. ``my_col IN (0, 1, 2)``).

Scalar indices can be used if the filter contains multiple indexed columns and
the filter criteria are AND'd or OR'd together
(e.g. ``my_col < 0 AND other_col > 100``).

Scalar indices may be used if the filter contains non-indexed columns but,
depending on the structure of the filter, they may not be usable. For example,
if the column ``not_indexed`` does not have a scalar index then the filter
``my_col = 0 OR not_indexed = 1`` will not be able to use any scalar index on
``my_col``.
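
The AND/OR behaviour described above can be modelled with a small sketch. This is an assumed simplification, not Lance's planner: filters are nested tuples, an ``OR`` can use an index only if both sides can, and an ``AND`` can use one if either side can (the other side would then be applied as an ordinary filter). The tuple encoding and the function name are hypothetical.

```python
def can_use_scalar_index(expr, indexed_columns):
    """Return True if at least part of the filter could be served by an index.

    expr is ("col", name), ("and", left, right) or ("or", left, right).
    """
    kind = expr[0]
    if kind == "col":
        return expr[1] in indexed_columns
    left = can_use_scalar_index(expr[1], indexed_columns)
    right = can_use_scalar_index(expr[2], indexed_columns)
    if kind == "and":
        # Assumption: one indexed side is enough to narrow the candidate rows.
        return left or right
    if kind == "or":
        # A non-indexed branch could match any row, so the index cannot prune.
        return left and right
    raise ValueError(f"unknown expression kind: {kind!r}")


indexed = {"my_col", "other_col"}
# my_col = 0 OR not_indexed = 1 -> no index can be used
print(can_use_scalar_index(("or", ("col", "my_col"), ("col", "not_indexed")), indexed))
```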

To determine if a scan is making use of a scalar index you can use
``explain_plan`` to look at the query plan that lance has created. Queries
that use scalar indices will either have a ``ScalarIndexQuery`` relation or a
``MaterializeIndex`` operator.
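
As a rough sketch of that check, the helper below looks for the two node names mentioned above in a plan string. The helper name is hypothetical, and it assumes the plan text has already been obtained (for example from ``explain_plan``); the exact layout of a real plan is not shown here.

```python
INDEX_MARKERS = ("ScalarIndexQuery", "MaterializeIndex")


def plan_uses_scalar_index(plan: str) -> bool:
    """Return True if the plan text mentions a scalar-index node."""
    return any(marker in plan for marker in INDEX_MARKERS)
```

A plain substring test is deliberately simple; it reports only whether an index node appears somewhere in the plan, not which column or index it refers to.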

"""
if isinstance(column, str):
column = [column]