-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
community[patch]: Add pgvector index using HNSW #5564
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
Adding this for reference https://python.langchain.com/v0.1/docs/integrations/vectorstores/pgembedding/#create-hnsw-index |
|
||
const createIndexQuery = `CREATE INDEX IF NOT EXISTS ${ | ||
this.computedTableName | ||
}_embedding_idx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this name be differentiated based on the number of dimensions and/or column?
Could maybe see a case where someone has multiple embedding fields in a given table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this name be differentiated based on the number of dimensions and/or column?
As far as I can see, there's only support to create one embedding field per table. I'll change the index name to use the vectorColumnName
that's computed in the class for consistency.
Could maybe see a case where someone has multiple embedding fields in a given table?
Haven't seen this pattern yet... I guess is possible, but IMO its easier to handle one vector per table + metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this works for now
Very cool! |
Hey apologies for the delay - @eyurtsev is on vacation but I'd really like him to take a look. Might be a few more days. |
* @returns Promise that resolves with the query response of creating the index. | ||
*/ | ||
async createHnswIndex(config?: { | ||
dims?: number; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we require the dimension instead of assigning a default? Users likely use lots of different embedding models?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored to be mandatory... I also think that assigning the value is less prone error.
* Method to create the HNSW index on the vector column. | ||
* | ||
* @param dims - Defines the number of dimensions in your vector data, max: 2000. For example, use 1536 for OpenAI's text-embedding-ada-002 model and 1024 for amazon.titan-embed-text-v2:0 | ||
* @param m - The max number of connections per layer (16 by default) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reference material we could link to that would help someone figure out how to choose values for m?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a reference to the paper in the docs (pgvector.mdx) and added a bit more context in the comments for m and efConstruction parameters
Looks like this:
More info at the Pgvector GitHub project
and the HNSW paper from Malkov Yu A. and Yashunin D. A.. 2020. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs
}_embedding_idx | ||
ON ${this.computedTableName} USING hnsw ((${ | ||
this.vectorColumnName | ||
}::vector(${config?.dims || 1536})) ${idxDistanceFunction}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have any utilities to sanitize interpolated values? This code should be only exposed to developers, but if we have existing sanitization code won't hurt to add some more defensive code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added the pg-format
lib https://www.npmjs.com/package/pg-format to format the sql literals.
It should provide some guards for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we actually remove this (for now)? It'll make existing users install a new package.
In the near-ish term, we should factor out into a @langchain/pgvector
package that can have this as a hard dep.
|
||
const createIndexQuery = `CREATE INDEX IF NOT EXISTS ${ | ||
this.computedTableName | ||
}_embedding_idx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this works for now
Hey @jl4nz, can we revert that new |
9adedd5
to
7bfec42
Compare
Got it, done. |
Hey @jacoblee93, just pinging to check if this would be good to go. |
Yes, thank you! |
Add support to optionally create a HNSW index on for pgvector