Optimize pgvector test for semi-recent enhancements #319

jkatz · 2024-05-08T18:00:43Z

This commit adds several changes to the pgvector test to create a more representative test environment based on recent and older changes to pgvector. Notable changes include support for testing parallel index building parameters, loading data with the recommended binary loading method, and other changes to better emulate what a typical pgvector user would do.

This commit also includes some general cleanups.

Co-authored-by: Mark Greenhalgh <greenhal@users.noreply.github.com>
Co-authored-by: Tyler House <tahouse@users.noreply.github.com>
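
For readers unfamiliar with the parallel index building parameters mentioned above, here is a minimal sketch of the session settings and `CREATE INDEX` statement such a test might issue. It is not the code from this PR. The function name, the worker count, and the memory value are hypothetical placeholders, and `vector_cosine_ops` is just one of pgvector's operator classes, chosen for illustration. `m = 16` and `ef_construction = 64` are pgvector's documented HNSW defaults. Identifiers are interpolated naively here; real code should quote them with the database driver's SQL composition helpers.

```python
def hnsw_build_statements(
    table: str,
    column: str,
    *,
    m: int = 16,
    ef_construction: int = 64,
    max_parallel_workers: int = 7,      # placeholder value, tune per host
    maintenance_work_mem: str = "8GB",  # placeholder value, tune per host
) -> list[str]:
    """Return the SQL statements for a parallel HNSW index build (sketch only)."""
    return [
        # PostgreSQL setting that caps the workers used by one maintenance command.
        f"SET max_parallel_maintenance_workers = {int(max_parallel_workers)}",
        # Memory available to the index build in this session.
        f"SET maintenance_work_mem = '{maintenance_work_mem}'",
        # The index build itself, with the HNSW storage parameters.
        f'CREATE INDEX ON "{table}" USING hnsw ("{column}" vector_cosine_ops) '
        f"WITH (m = {int(m)}, ef_construction = {int(ef_construction)})",
    ]


for statement in hnsw_build_statements("items", "embedding", max_parallel_workers=4):
    print(statement + ";")
```

Exposing the worker count and memory as parameters lets a benchmark run compare build times across settings without changing the index definition.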

XuanYang-cn · 2024-05-09T02:17:45Z

/assign @alwayslove2013
/assign

jkatz · 2024-05-10T01:20:46Z

@XuanYang-cn @alwayslove2013 Please let us know if this PR requires additional work. There are some other changes we'd like to include for testing other configurations of pgvector, but we'd like to baseline it against the flat implementation first. Thanks!

alwayslove2013 · 2024-05-10T02:28:37Z

vectordb_bench/backend/clients/pgvector/config.py

+from abc import abstractmethod
+from typing import Any, Mapping, Optional, Sequence, TypedDict
+
+from psycopg import sql


I recommend not adding client-specific dependencies in config.py. Users only install the corresponding toolkit when they are conducting tests.

However, in the scenario where the default results page is opened solely for result display, VDBBench will load the standardized result (json). Serializing this data requires the config.py file from all clients.

Currently, if a user hasn't installed psycopg, they won't be able to open the results page.

This is fixed in ea29f47
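
One common way to address the concern raised in this thread is to defer the driver import until a connection is actually opened. The sketch below assumes a hypothetical `ExampleDBConfig` class; it is not VectorDBBench's actual config class and not necessarily how ea29f47 resolved the issue.

```python
from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class ExampleDBConfig:
    """Hypothetical connection config, for illustration only."""

    host: str = "localhost"
    port: int = 5432
    db_name: str = "postgres"

    def to_dict(self) -> dict[str, Any]:
        # Plain-Python serialization: works even when no driver is installed,
        # so a results page can load stored configs.
        return asdict(self)

    def connect(self):
        # Deferred import: the optional driver is only needed at this point,
        # when a test actually opens a connection.
        import psycopg

        return psycopg.connect(host=self.host, port=self.port, dbname=self.db_name)


print(ExampleDBConfig().to_dict())
```

With this layout, importing the module and serializing a config never touches `psycopg`; only `connect()` does.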

jkatz · 2024-05-14T03:03:57Z

@alwayslove2013 Thanks for the feedback! This is resolved in the latest push.

Overall, I would suggest moving to psycopg3 (psycopg) as it's now the maintained version of psycopg; however, that change could be made in a separate pull request.

alwayslove2013 · 2024-05-14T06:21:47Z

@jkatz Thank you so much for your contribution! We greatly appreciate it and are thrilled to receive your pull request. We look forward to collaborating with you and driving the project forward together!

Overall, I would suggest moving to psycopg3 (psycopg) as it's now the maintained version of psycopg; however, that change could be made in a separate pull request.

jkatz · 2024-05-14T14:56:26Z

@alwayslove2013 Likewise. I personally appreciate the approach VectorDBBench takes around testing concurrency, which resembles how users interact with databases.

I've pushed a fix in the latest patch to resolve the merge conflict that remained (I'm still baffled how that got in, but I'll triple-check next time).

alwayslove2013

/approve

sre-ci-robot · 2024-05-15T02:59:26Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alwayslove2013, jkatz
To complete the pull request process, please assign xuanyang-cn after the PR has been reviewed.
You can assign the PR to them by writing /assign @xuanyang-cn in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

alwayslove2013 · 2024-05-15T03:27:06Z

@jkatz I would like to express my sincere gratitude for your support. Our primary goal has been to ensure that the test data reflects the performance characteristics of real-world usage scenarios as accurately as possible.

... testing concurrency, which resembles how users interact with databases.

If you have any suggestions or innovative ideas for VDBBench, we would be more than happy to discuss them with you. Your valuable input is crucial for us to enhance the functionality and user experience of the tool.

wahajali · 2024-08-09T15:22:33Z

@jkatz I'm trying to see how we can optimize the pgvector and pgvecto.rs clients further, and one thought I had was to use functions for making vector queries. Do you think it's something to explore further?
I've previously seen something similar for HammerDB, where they use stored procedures (although it's not for vector benchmarking): https://www.hammerdb.com/blog/uncategorized/why-you-should-benchmark-your-database-using-stored-procedures/

greenhal · 2024-08-09T16:24:08Z

This benchmark is pretty simple: run the query, get the result, compare the result.

To improve performance with stored procedures, you would have to either reduce network round trips or reduce the data transmitted, and neither is much overhead in this benchmark. HammerDB does both with stored procedures because a TPC-C benchmark involves a lot of potential network traffic. Take new order transactions, the main TPC-C performance measurement: with all of the logic on the client, creating one order could take up to 6 round trips. With stored procedures it's 1 or even less, because HammerDB can send a single request that says "create 100 orders".

With that in mind, there are 2 ways that could possibly improve pgvector's performance, or any engine's performance (both could be implemented with stored procedures):

1. Reduce the network trips.
   This would be accomplished by batching the requests/results: instead of sending 1 vector to query, send 100 or even all 1k, then return all of the results. But since the same amount of data would be sent and received, the only savings would be the reduced number of round trips and the overhead associated with them, and the sending/receiving would still be done over multiple trips. Overall, the benefit would be minimal, if any at all.
2. Reduce the data transferred.
   There are a few ways of doing this; the most extreme would be:
   - to load the data, ground truth, and test vectors onto the target
   - a "run benchmark" procedure on the target would run all the queries and just return the recall and QPS.

Both would require a pretty significant change to VectorDBBench, would be engine specific, and would not represent a real-world use case.
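
To make the round-trip argument concrete, here is a rough back-of-the-envelope model for a single client issuing queries serially. The function and all numbers are hypothetical and were not measured in this thread; it assumes a fixed network round-trip time per request and a fixed server-side execution time per query.

```python
def total_seconds(num_queries: int, batch_size: int, rtt_s: float, exec_s: float) -> float:
    """Estimated wall time for one client sending queries serially in batches."""
    round_trips = -(-num_queries // batch_size)  # ceiling division
    # Batching only shrinks the round-trip term; execution time is unchanged.
    return round_trips * rtt_s + num_queries * exec_s


# Hypothetical inputs: 1,000 queries, 0.2 ms round trip, 5 ms per vector query.
unbatched = total_seconds(1000, 1, 0.0002, 0.005)
batched = total_seconds(1000, 100, 0.0002, 0.005)
print(f"unbatched: {unbatched:.3f} s, batched by 100: {batched:.3f} s")
print(f"saving: {(unbatched - batched) / unbatched:.1%}")
```

Under these assumed numbers the query execution time dominates, so batching trims only a few percent of the total, which matches the point that the benefit would be minimal.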

alwayslove2013 reviewed May 10, 2024

jkatz force-pushed the pgvector-updates branch from 3e5d0c3 to ea29f47 May 14, 2024 03:03

jkatz requested a review from alwayslove2013 May 14, 2024 03:04

alwayslove2013 requested changes May 14, 2024

jkatz force-pushed the pgvector-updates branch from ea29f47 to d188ad5 May 14, 2024 14:55

jkatz requested a review from alwayslove2013 May 14, 2024 14:56

alwayslove2013 approved these changes May 15, 2024

alwayslove2013 merged commit c4fc7c1 into zilliztech:main May 15, 2024
4 checks passed

jkatz deleted the pgvector-updates branch May 15, 2024 03:29
