Add support for pgvector's hnsw (0.7.4) and generic support for Postgres (16) indexes #309

onurctirtir · 2024-09-09T13:24:45Z

Closes #293.

Add .vscode to .gitignore
Add support for pgvector's hnsw and generic support for Postgres indexes (see the explanation below; this is the meat of the PR)
Install the things required to collect flamegraphs when needed (can be reverted if requested)

This PR adds support for benchmarking pgvector's hnsw index-access-method with
the runbooks and the datasets supported by bigann benchmarks.

To do that, I added a base Docker image that will help us test other Postgres
index-access-methods in the future. To make use of that image, I had
to change install.py so that other Postgres based indexes can
depend on a common Docker image that already has Postgres installed. Note
that install.py builds that base Docker image only if the algorithm name
starts with "postgres-". If you see that this PR is no longer a draft,
then I should've already documented this in the docs.

This PR also adds BaseStreamingANNPostgres that can be used to easily add
support for other Postgres based indexes in the future. One would simply need
to define a new python wrapper which implements:

determine_index_op_class(self, metric)
determine_query_op(self, metric)

and that properly sets the following attributes in its __init__ method
before calling super().__init__:

self.name
self.pg_index_method
self.guc_prefix

Given that pgvector's hnsw is the first Postgres based index that benefits from
this infra (via this PR), neurips23/streaming/postgres-pgvector-hnsw/ can be seen
as an example of how to make use of Dockerfile.BasePostgres and
BaseStreamingANNPostgres in general to add support for more Postgres based indexes.

Unlike the other algorithms under streaming, completing a runbook with a
Postgres based index can take several times longer.
This is not because Postgres based indexes are bad, but because SQL is the only
interface to such indexes. All insert / delete / search operations
first have to build SQL queries and, specifically for inserts, transferring
the data to the Postgres server adds significant overhead. Unless we make some
huge changes in this repo to re-design "insert" so that it can benefit
from the server-side-copy functionality of Postgres, we cannot do much about it.
Other than that, please feel free to drop comments if you see any inefficiencies
that I can quickly fix in my code. Note that I'm not a Python expert, hence
sincerely requesting this :)

To explain the build & query time params that have to be provided in
the config.yaml file of such a Postgres based indexing algorithm, let's take a look
at the following snippet from the config.yaml file of pgvector's hnsw:

random-xs:
    postgres-pgvector-hnsw:
      docker-tag: neurips23-streaming-postgres-pgvector-hnsw
      module: neurips23.streaming.postgres-pgvector-hnsw.wrapper
      constructor: PostgresPgvectorHnsw
      base-args: ["@metric"]
      run-groups:
        base:
          args: |
            [{"m":16, "ef_construction":64, "insert_conns":16}]
          query-args: |
            [{"ef_search":50, "query_conns":8}]

The presence of insert_conns & query_conns is enforced by BaseStreamingANNPostgres,
and any Postgres based index implementation that we add to this repo in the future
must also provide values for them in its config.yaml file.

insert_conns
Similar to insert_threads in other algorithm implementations, this is used to
determine parallelism for inserts. In short, this determines the number of database
connections used during insert steps.
query_conns
Similar to T in other algorithm implementations, this is used to determine
parallelism for SELECT queries. In short, this determines the number of database
connections used during search steps.
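
To illustrate how a parameter like insert_conns can translate into parallelism, here is a minimal sketch that splits a batch of rows into one chunk per connection and processes the chunks concurrently. The helpers split_into_chunks, insert_chunk and parallel_insert are hypothetical names used only for this illustration, and insert_chunk is a stand-in that counts rows instead of talking to Postgres; the actual code in base_postgres.py may be organized differently.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous, nearly equal chunks."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def insert_chunk(chunk):
    # Stand-in (assumption): a real worker would use its own database
    # connection here and send the rows to Postgres.
    return len(chunk)


def parallel_insert(rows, insert_conns):
    """Process rows with one worker per chunk; return the number of rows handled."""
    chunks = split_into_chunks(rows, insert_conns)
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        return sum(pool.map(insert_chunk, chunks))


rows = list(range(10))
chunk_sizes = [len(c) for c in split_into_chunks(rows, 4)]
inserted = parallel_insert(rows, 4)
print(chunk_sizes, inserted)
```

The same splitting idea would apply to query_conns during search steps, with each worker running SELECT queries instead of inserts.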

Other than those two params, any other parameters that need to be specified when
building the index or when performing an index-scan (read as "search" step) can be
provided via config.yaml too.

The parameters provided in "args" (except insert_conns) are passed directly into the
CREATE INDEX statement that's used to create the index in the setup step. For example,
for pgvector's hnsw, the above config results in the following CREATE INDEX statement.
Especially note the "WITH" clause:
```
CREATE INDEX vec_col_idx ON test_tbl USING hnsw (vec_col vector_l2_ops) WITH (m = 16, ef_construction = 64);
```
The parameters provided in "query-args" (except query_conns) are used directly to
set the GUCs that determine the runtime parameters used by the algorithm during
"search" steps, via SET commands. Note that BaseStreamingANNPostgres qualifies all
those query-args with self.guc_prefix when creating the SET commands that need to be
run for all Postgres connections. For example, for pgvector's hnsw, the above config
results in executing the following SET statement for each Postgres connection. Note
that if pgvector's hnsw had more query-args, then we'd have multiple SET statements:
```
SET hnsw.ef_search TO 50;
```
We prefixed "ef_search" with "hnsw" since PostgresPgvectorHnsw sets self.guc_prefix
to "hnsw".
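
The translation from "args" and "query-args" to SQL described above can be sketched as follows. This is only an illustration of the idea, not the code in base_postgres.py: build_with_clause and build_set_statements are hypothetical helper names, and the real implementation may validate or quote the values differently.

```python
def build_with_clause(args, reserved=("insert_conns",)):
    """Turn build-time args into the WITH clause of CREATE INDEX.

    Keys listed in `reserved` are consumed by the wrapper itself and are
    not index storage parameters, so they are skipped.
    """
    params = {k: v for k, v in args.items() if k not in reserved}
    if not params:
        return ""
    return "WITH (" + ", ".join(f"{k} = {v}" for k, v in params.items()) + ")"


def build_set_statements(query_args, guc_prefix, reserved=("query_conns",)):
    """Turn query-time args into one SET statement per GUC."""
    return [
        f"SET {guc_prefix}.{k} TO {v};"
        for k, v in query_args.items()
        if k not in reserved
    ]


# Values taken from the config.yaml snippet shown earlier.
args = {"m": 16, "ef_construction": 64, "insert_conns": 16}
query_args = {"ef_search": 50, "query_conns": 8}

with_clause = build_with_clause(args)
set_statements = build_set_statements(query_args, guc_prefix="hnsw")
print(with_clause)
print(set_statements)
```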

And while we're at it, let's take a closer look at what the Python wrapper should
look like when adding support for a Postgres based index in the future. From the wrapper
added for pgvector's hnsw:

from neurips23.streaming.base_postgres import BaseStreamingANNPostgres

class PostgresPgvectorHnsw(BaseStreamingANNPostgres):
    def __init__(self, metric, index_params):
        self.name = "PostgresPgvectorHnsw"
        self.pg_index_method = "hnsw"
        self.guc_prefix = "hnsw"

        super().__init__(metric, index_params)

    # Can add support for other metrics here.
    def determine_index_op_class(self, metric):
        if metric == 'euclidean':
            return "vector_l2_ops"
        else:
            raise Exception('Invalid metric')

    # Can add support for other metrics here.
    def determine_query_op(self, metric):
        if metric == 'euclidean':
            return "<->"
        else:
            raise Exception('Invalid metric')

self.name probably sets the experiment name; it is mostly required by the
grandparent class, BaseStreamingANN.
self.pg_index_method is used to specify the index-access-method name in the
USING clause of the CREATE INDEX statement that's used when creating the index.
See the docs for CREATE INDEX mentioned earlier and the CREATE INDEX statement
shared above as an example of how we make use of it.
self.guc_prefix is used to qualify the GUCs that need to be set to enforce
query-args, as described above.
determine_index_op_class(self, metric) is used to map the given metric to the
relevant opclass that the index needs to use when it is built; the opclass is
also passed to the CREATE INDEX statement. See the docs for CREATE INDEX mentioned earlier.
For example, when the metric is "euclidean" for pgvector's hnsw, this function
returns "vector_l2_ops", which is why it was used in the above CREATE INDEX statement.
determine_query_op(self, metric) is used to map the given metric to the comparison
operator that's used to match the index during search. For example, when the
metric is "euclidean" for pgvector's hnsw, this function returns "<->", so
it will be used in the SELECT query that's executed in the search step, as in:
```
SELECT id FROM test_tbl ORDER BY vec_col <-> [input-query-vec] LIMIT [k]
```
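
As a sketch of how the two mapping functions could be extended beyond "euclidean", the table below pairs each metric with pgvector's opclass and distance operator. The opclass and operator names are pgvector's own; the metric names "ip" and "angular" are an assumption about what the benchmark passes in, and build_search_query is a hypothetical helper, so treat this as an illustration rather than the wrapper's actual code.

```python
# metric name -> (index opclass, ORDER BY operator)
# "ip" and "angular" are assumed metric names, not confirmed by this PR.
METRIC_TO_PGVECTOR = {
    "euclidean": ("vector_l2_ops", "<->"),
    "ip": ("vector_ip_ops", "<#>"),  # <#> yields the negative inner product
    "angular": ("vector_cosine_ops", "<=>"),
}


def determine_index_op_class(metric):
    try:
        return METRIC_TO_PGVECTOR[metric][0]
    except KeyError:
        raise ValueError(f"Invalid metric: {metric}") from None


def determine_query_op(metric):
    try:
        return METRIC_TO_PGVECTOR[metric][1]
    except KeyError:
        raise ValueError(f"Invalid metric: {metric}") from None


def build_search_query(metric, k):
    """Build the search statement; %s stands for the query vector that the
    database driver would bind at execution time."""
    op = determine_query_op(metric)
    return f"SELECT id FROM test_tbl ORDER BY vec_col {op} %s LIMIT {int(k)}"


query = build_search_query("euclidean", 10)
print(query)
```

Because `<#>` returns the negative inner product, an ascending ORDER BY still puts the best matches first for that metric.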

onurctirtir · 2024-09-09T14:24:01Z

neurips23/streaming/base_postgres.py

+                    if len(result_ids) < k:
+                        # Pad with -1 if we have less than k results. This is only needed if the
+                        # index-access method cannot guarantee returning k results.
+                        #
+                        # As of today, this is only possible with PostgresPgvectorHnsw when
+                        # ef_search < k.
+                        result_ids.extend([-1] * (k - len(result_ids)))


Is it ok to pad self.res like this?

Yes, should be fine. The recall computation code performs a set intersection between the ground-truth results and reported results, so any -1 results would be ignored as intended.
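
To make the reasoning above concrete, here is a small sketch of a set-intersection recall together with the -1 padding. recall_at_k is a simplified stand-in for the benchmark's recall computation, not its actual code, and pad_results is a hypothetical helper; the point is only that padded -1 entries never match a ground-truth id.

```python
def pad_results(result_ids, k, pad_value=-1):
    """Pad the result list to exactly k entries, as the wrapper does."""
    return result_ids + [pad_value] * max(0, k - len(result_ids))


def recall_at_k(ground_truth, reported, k):
    """Fraction of the k true neighbors that appear among the reported ids."""
    return len(set(ground_truth[:k]) & set(reported[:k])) / k


k = 5
ground_truth = [3, 7, 9, 12, 15]
found = [7, 3, 12]  # the index returned fewer than k results

padded = pad_results(found, k)
recall_padded = recall_at_k(ground_truth, padded, k)
recall_unpadded = recall_at_k(ground_truth, found, k)
print(padded, recall_padded, recall_unpadded)
```

This holds as long as -1 is never a valid id in the ground truth.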

onurctirtir · 2024-09-09T14:26:37Z

neurips23/streaming/base_postgres.py

+        with psycopg.connect(PG_CONN_STR, autocommit=True) as conn:
+            with conn.cursor() as cur:
+                cursor_print_and_execute(cur, "TRUNCATE test_tbl")


Although the other algo implementations don't seem to be doing this, I thought that we have to reset all the data in the main table and the index here, i.e., before switching to a different set of query-args. Does that make sense?

onurctirtir · 2024-09-12T16:00:39Z

benchmark/runner.py

@@ -289,6 +289,7 @@ def run_docker(definition, dataset, count, runs, timeout, rebuild,
            },
            cpuset_cpus=cpu_limit,
            mem_limit=mem_limit,
+            privileged=True, # so that we can run perf inside the container


can be reverted as mentioned in PR descr, only useful for FlameGraph'ing purposes

onurctirtir · 2024-09-12T16:00:45Z

neurips23/streaming/Dockerfile.BasePostgres

+# install linux-tools-generic into docker so that devs can use perf if they want
+RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt install -y linux-tools-generic
+
+# clone FlameGraph for the same purpose
+RUN git clone https://github.com/brendangregg/FlameGraph


can be reverted as mentioned in PR descr, only useful for FlameGraph'ing purposes

magdalendobson · 2024-09-12T17:32:03Z

neurips23/streaming/start_database.sh

What is the usage of this script? Not sure I see it referenced in the README.

Ah nevermind I see where it's used. Would it be possible to just create the file directly inside the Docker container instead of having it separately like this? Would be a bit more in line with the style of the other algorithms

Would it be possible to just create the file directly inside the Docker container instead of having it separately like this?

sure 👍

magdalendobson · 2024-09-12T17:32:41Z

requirements_py3.10.txt

Do these need to be top-level requirements or could they be installed in the Dockerfile?

they can be installed in the Dockerfile, let me do that

onurctirtir · 2024-09-30T09:06:27Z

neurips23/streaming/base_postgres.py

+        with psycopg.connect(PG_CONN_STR, autocommit=True) as conn:
+            with conn.cursor() as cur:
+                cursor_print_and_execute(cur, f"CREATE TABLE test_tbl (id bigint, vec_col vector({ndim}))")
+                cursor_print_and_execute(cur, f"CREATE INDEX vec_col_idx ON test_tbl USING {self.pg_index_method} (vec_col {self.ind_op_class}) {index_build_params_clause}")


To stabilize the measurements, should we disable autovacuum when creating the table since we explicitly vacuum the table at some point? @orhankislal?
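
For reference, one way this could look is to set Postgres's per-table autovacuum_enabled storage parameter when creating the table. The sketch below only builds the statement; create_table_sql is a hypothetical helper and not something this PR contains.

```python
def create_table_sql(ndim, disable_autovacuum=True):
    """Build the CREATE TABLE statement, optionally turning autovacuum off
    for this table via the autovacuum_enabled storage parameter."""
    sql = f"CREATE TABLE test_tbl (id bigint, vec_col vector({int(ndim)}))"
    if disable_autovacuum:
        sql += " WITH (autovacuum_enabled = off)"
    return sql


statement = create_table_sql(100)
print(statement)
```

With autovacuum off for the table, the explicit VACUUM would be the only vacuum activity on it during a run.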

Add .vscode to .gitignore

498e8b2

onurctirtir marked this pull request as ready for review September 10, 2024 09:10

onurctirtir added 3 commits September 10, 2024 15:57

Add support for pgvector's hnsw and generic support for Postgres indexes

9b68a2a

Install the things required to collect flamegraphs when needed

3f65080

Add docs

5147f07

onurctirtir force-pushed the upstream-pgvector-hnsw branch from 273c4dc to 5147f07 Compare September 10, 2024 12:57

onurctirtir added 2 commits September 10, 2024 16:06

clarify

0cc417c

use binary protocol to transfer vecs

381215d

onurctirtir added 2 commits September 13, 2024 13:36

address feedback

2c30c59

comment

ac3eb27

onurctirtir closed this Nov 12, 2024
