
Optimize pgvector test for semi-recent enhancements #319

Merged: 1 commit merged into zilliztech:main on May 15, 2024

Conversation

@jkatz (Contributor) commented May 8, 2024

This commit adds several changes to the pgvector test to create a more representative test environment based on both recent and older changes to pgvector. Notable changes include allowing parallel index-building parameters to be tested, loading data with the recommended binary method, and other changes to better emulate what a typical user of pgvector would do.

This commit also includes some general cleanups.

Co-authored-by: Mark Greenhalgh greenhal@users.noreply.github.com
Co-authored-by: Tyler House tahouse@users.noreply.github.com
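
For context, a minimal sketch of the two techniques the description mentions, binary COPY loading and parallel index builds, is shown below. It is illustrative only, not the PR's actual code; it assumes psycopg 3 with pgvector-python, and the table name, dimensions, and index parameters are made up.

```python
# Illustrative sketch (not the PR's code): bulk-load vectors with binary COPY
# and build an HNSW index with parallel maintenance workers enabled.
# Assumes psycopg 3 and pgvector-python; names and parameters are examples.
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

embeddings = np.random.rand(10_000, 128).astype(np.float32)

with psycopg.connect("dbname=example") as conn:
    register_vector(conn)  # teach psycopg about the pgvector 'vector' type
    conn.execute("CREATE TABLE IF NOT EXISTS items (id int8, embedding vector(128))")

    # Binary COPY: the recommended bulk-loading path for pgvector.
    with conn.cursor() as cur:
        with cur.copy("COPY items (id, embedding) FROM STDIN WITH (FORMAT BINARY)") as copy:
            copy.set_types(["int8", "vector"])
            for i, emb in enumerate(embeddings):
                copy.write_row((i, emb))

    # Allow parallel workers during index construction, then build the index.
    conn.execute("SET max_parallel_maintenance_workers = 7")
    conn.execute(
        "CREATE INDEX ON items USING hnsw (embedding vector_l2_ops) "
        "WITH (m = 16, ef_construction = 64)"
    )
```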

@XuanYang-cn (Collaborator) commented:

/assign @alwayslove2013
/assign

@jkatz (Contributor, Author) commented May 10, 2024

@XuanYang-cn @alwayslove2013 Please let us know if this PR requires additional work. There are some other changes we'd like to include for testing other configurations of pgvector, but we'd like to baseline it against the flat implementation first. Thanks!

from abc import abstractmethod
from typing import Any, Mapping, Optional, Sequence, TypedDict

from psycopg import sql
A Collaborator commented on the config.py imports shown above:

Recommend avoiding client-specific dependencies in config.py; users only install the corresponding toolkit when they are actually running tests.

However, when the default results page is opened solely for result display, VDBBench loads the standardized results (JSON). Deserializing this data requires the config.py file from every client.

Currently, if a user hasn't installed psycopg, they won't be able to open the results page.
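
One possible shape for such a fix, sketched here only as an illustration (the class and method names are invented, and this is not necessarily what the follow-up commit does), is to defer the psycopg import to the point where SQL is actually composed:

```python
# Illustrative sketch: keep config.py importable without psycopg by deferring
# the client-specific import until a test actually composes SQL.
# The class and method names here are hypothetical.
class PgVectorConfigSketch:
    def __init__(self, table_name: str):
        self.table_name = table_name

    def session_setup_statements(self):
        # psycopg is only needed when a test connects to Postgres; importing it
        # here means opening the results page does not require it to be installed.
        from psycopg import sql

        return [
            sql.SQL("SET maintenance_work_mem TO {}").format(sql.Literal("8GB")),
            sql.SQL("SET max_parallel_maintenance_workers TO {}").format(sql.Literal(7)),
        ]
```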

@jkatz (Contributor, Author) replied:

This is fixed in ea29f47

@jkatz (Contributor, Author) commented May 14, 2024

@alwayslove2013 Thanks for the feedback! This is resolved in the latest push.

Overall, I would suggest moving to psycopg3 (psycopg) as it's now the maintained version of psycopg; however, that change could be made in a separate pull request.

@jkatz jkatz requested a review from alwayslove2013 May 14, 2024 03:04
@alwayslove2013 (Collaborator) commented:

@jkatz Thank you so much for your contribution! We greatly appreciate it and are thrilled to receive your pull request. We look forward to collaborating with you and driving the project forward together!

> Overall, I would suggest moving to psycopg3 (psycopg) as it's now the maintained version of psycopg; however, that change could be made in a separate pull request.

@jkatz (Contributor, Author) commented May 14, 2024

@alwayslove2013 Likewise. I personally appreciate the approach VectorDBBench takes around testing concurrency, which resembles how users interact with databases.

I've pushed up the latest patch with the fix to handle the merge conflict that remained (I'm still baffled as to how that got in, but I'll triple-check next time).

@jkatz jkatz requested a review from alwayslove2013 May 14, 2024 14:56
@alwayslove2013 (Collaborator) left a review:

/approve

@sre-ci-robot (bot) commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alwayslove2013, jkatz
To complete the pull request process, please assign xuanyang-cn after the PR has been reviewed.
You can assign the PR to them by writing /assign @xuanyang-cn in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@alwayslove2013 alwayslove2013 merged commit c4fc7c1 into zilliztech:main May 15, 2024
4 checks passed
@alwayslove2013 (Collaborator) commented May 15, 2024

@jkatz I would like to express my sincere gratitude for your support. Our primary goal has been to ensure that the test data reflects the performance characteristics of real-world usage scenarios as accurately as possible.

> ... testing concurrency, which resembles how users interact with databases.

If you have any suggestions or innovative ideas for VDBBench, we would be more than happy to discuss them with you. Your input is crucial for us to enhance the functionality and user experience of the tool.

@jkatz jkatz deleted the pgvector-updates branch May 15, 2024 03:29
@wahajali (Contributor) commented Aug 9, 2024

@jkatz I'm trying to see how we can further optimize the pgvector and pgvecto.rs clients, and one thought I had was to use server-side functions for making the vector queries. Do you think it's something to explore further?
I've previously seen something similar for HammerDB, where they use stored procedures (although it's not for vector benchmarking): https://www.hammerdb.com/blog/uncategorized/why-you-should-benchmark-your-database-using-stored-procedures/
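
For concreteness, the idea could look roughly like the sketch below: a server-side SQL function wrapping the ANN query, called from the client. This is only an illustration, not something the benchmark does today; the function, table, and column names are invented, and it assumes psycopg 3 with pgvector-python.

```python
# Rough sketch of the server-side-function idea (names are hypothetical).
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

CREATE_SEARCH_FN = """
CREATE OR REPLACE FUNCTION knn_search(query_vec vector, k int)
RETURNS TABLE (item_id bigint) AS $$
    SELECT id FROM items ORDER BY embedding <-> query_vec LIMIT k;
$$ LANGUAGE sql STABLE;
"""

with psycopg.connect("dbname=example") as conn:
    register_vector(conn)
    conn.execute(CREATE_SEARCH_FN)

    query_vector = np.zeros(128, dtype=np.float32)  # placeholder query embedding
    rows = conn.execute("SELECT item_id FROM knn_search(%s, %s)", (query_vector, 100)).fetchall()
```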

@greenhal (Contributor) commented Aug 9, 2024

This benchmark is pretty simple: run the query, get the result, compare the result.

To improve performance with stored procedures, you would have to either reduce network round trips or reduce the data transmitted, and neither of those is much overhead in this benchmark. HammerDB's stored procedures do both, because a TPC-C benchmark involves a lot of potential network traffic. Take new-order transactions, for example: creating a new order (the main TPC-C performance measurement) could take up to 6 round trips if all of the logic lived on the client. With stored procedures it takes 1 round trip or even less, because HammerDB can send a single request that says "create 100 orders".

With that in mind, there are 2 ways that could possibly improve pgvector's performance, or any engine's performance, and both could be implemented with stored procedures.

  1. Reduce the network round trips.
    This would be accomplished by batching the requests/results: instead of sending 1 query vector at a time, send 100 or even all 1k, then return all of the results (see the sketch after this comment). Since the same amount of data would still be sent and received, the only savings would be the reduced number of round trips and the overhead associated with them, and the sending/receiving would still happen over multiple trips. Overall, the benefit would be minimal, if any at all.

  2. Reduce the data transferred.
    There are a few ways of doing this; the most extreme would be:

    • load the data, ground truth, and test vectors onto the target
    • have a "run benchmark" procedure on the target run all the queries and return just the recall and QPS

Both would require a pretty significant change to VectorDBBench, they would be engine-specific, and they would not represent a real-world use case.
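
As a rough illustration of option 1 above (and only that; it is not part of VectorDBBench), batching round trips could be prototyped with psycopg 3's pipeline mode, which queues many queries before waiting for their results. Table, column, and connection details are invented.

```python
# Rough sketch of option 1: queue many ANN queries and sync once, rather than
# waiting for each result individually. Names are hypothetical.
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

test_vectors = np.random.rand(100, 128).astype(np.float32)

with psycopg.connect("dbname=example") as conn:
    register_vector(conn)
    with conn.pipeline():
        # Queries are sent without waiting for their results one by one.
        cursors = [
            conn.execute(
                "SELECT id FROM items ORDER BY embedding <-> %s LIMIT 100",
                (vec,),
            )
            for vec in test_vectors
        ]
    # After the pipeline block exits, all results are available on the cursors.
    results = [cur.fetchall() for cur in cursors]
```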
