DOC: Index requires that your documents table uses a GUID as your primary ID. #7607

Open
2 tasks done
alph486 opened this issue Jan 27, 2025 · 2 comments
Labels
auto:documentation Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder

Comments

alph486 commented Jan 27, 2025

Checklist

  • I added a very descriptive title to this issue.
  • I included a link to the documentation page I am referring to (if applicable).

Issue with current documentation:

Re: https://js.langchain.com/docs/how_to/indexing

According to @langchain/core/dist/indexing/base.js, the index function generates a hash for each of your documents before indexing them. It then attempts to write the documents to the database, passing the resulting uids through the optional ids parameter.

        if (docsToIndex.length > 0) {
            await vectorStore.addDocuments(docsToIndex, { ids: uids });
            numAdded += docsToIndex.length - seenDocs.size;
            numUpdated += seenDocs.size;
        }

If you have set up your tables to use an integer or some other non-GUID type as the unique id, this function will not work properly, if I'm understanding correctly.
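
To make the concern concrete, here is a minimal sketch that is not taken from LangChain. `IntegerKeyedStore` and `Doc` are hypothetical stand-ins for a table with an integer primary key, and the UUID-style string is only an example value standing in for whatever uid index() ends up passing as `ids`.

```typescript
// Hypothetical stand-ins for illustration; these are NOT LangChain types.
interface Doc {
  pageContent: string;
}

// A toy store whose primary key column only accepts integers.
class IntegerKeyedStore {
  rows = new Map<number, Doc>();

  addDocuments(docs: Doc[], options: { ids: string[] }): void {
    docs.forEach((doc, i) => {
      const raw = options.ids[i];
      // Accept only strings made entirely of digits, like an INTEGER column would.
      if (raw === undefined || !/^\d+$/.test(raw)) {
        throw new Error(`id "${raw}" is not a valid integer primary key`);
      }
      this.rows.set(Number(raw), doc);
    });
  }
}

const store = new IntegerKeyedStore();

// A caller that controls its own ids can pass integers and everything works.
store.addDocuments([{ pageContent: "hello" }], { ids: ["42"] });

// A caller that passes a hash-derived string id (example value) is rejected.
let rejected = false;
try {
  store.addDocuments([{ pageContent: "world" }], {
    ids: ["6ba7b810-9dad-11d1-80b4-00c04fd430c8"],
  });
} catch (err) {
  rejected = true;
}
```

Whether a real vector store rejects, coerces, or silently mangles such an id depends on the store and the column type, so this only shows the shape of the mismatch.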

Idea or request for content:

Please correct the documentation to either a) add this as a notice, or b) add an example or configuration showing how to use it with a different type of id, as I expect this will be relatively common.
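
As a rough idea of what option b) could look like, here is a sketch of an adapter that accepts string ids and maps each one to a stable integer key. Everything in it (`IntegerStore`, `InMemoryIntegerStore`, `StringIdAdapter`, `upsert`) is a hypothetical name made up for this example, not an existing LangChain API, and the in-memory map stands in for what would be a lookup table or a unique text column in a real database.

```typescript
// Hypothetical types for illustration; not LangChain's actual interfaces.
interface Doc {
  pageContent: string;
}

interface IntegerStore {
  upsert(id: number, doc: Doc): void;
}

class InMemoryIntegerStore implements IntegerStore {
  rows = new Map<number, Doc>();

  upsert(id: number, doc: Doc): void {
    this.rows.set(id, doc);
  }
}

// Adapter: takes the string ids a caller supplies and translates each
// one to an integer key, reusing the same key when a uid is seen again.
class StringIdAdapter {
  private store: IntegerStore;
  private keyByUid = new Map<string, number>();
  private nextKey = 1;

  constructor(store: IntegerStore) {
    this.store = store;
  }

  addDocuments(docs: Doc[], options: { ids: string[] }): number[] {
    if (docs.length !== options.ids.length) {
      throw new Error("docs and ids must have the same length");
    }
    return docs.map((doc, i) => {
      const uid = options.ids[i];
      let key = this.keyByUid.get(uid);
      if (key === undefined) {
        key = this.nextKey++;
        this.keyByUid.set(uid, key);
      }
      this.store.upsert(key, doc);
      return key;
    });
  }
}

const backing = new InMemoryIntegerStore();
const adapter = new StringIdAdapter(backing);

// First write allocates integer keys; a later write with the same uid updates in place.
const firstKeys = adapter.addDocuments(
  [{ pageContent: "a" }, { pageContent: "b" }],
  { ids: ["uid-a", "uid-b"] },
);
const secondKeys = adapter.addDocuments([{ pageContent: "a v2" }], { ids: ["uid-a"] });
```

The alternative design is to leave the integer primary key alone and add a separate unique text column for the uid, which avoids keeping a mapping in application code.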

@dosubot dosubot bot added the auto:documentation Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder label Jan 27, 2025
@nick-w-nick
Contributor

@alph486 I am pretty sure that is just the name of the variable and doesn't reflect any actual format requirement. I personally have used multiple ID formats within my vector stores, many of them not UUIDs.

For example, here are the docs from Pinecone that mention how you can even use custom delimiters in your IDs to make it easier to assign indexes to individual document chunks, implying that you can basically use anything you want for your document IDs as long as they are unique.
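
A small sketch of the delimiter idea, assuming "#" as the delimiter and a `chunk<N>` suffix. The helper names `chunkId` and `parseChunkId` are made up for this example and are not part of Pinecone or LangChain.

```typescript
// Build an id such as "doc1#chunk3" from a parent document id and a chunk index.
// The "#" delimiter and "chunk" suffix are assumptions chosen for illustration.
function chunkId(documentId: string, chunkIndex: number, delimiter = "#"): string {
  if (documentId.includes(delimiter)) {
    throw new Error(`document id must not contain the delimiter "${delimiter}"`);
  }
  return `${documentId}${delimiter}chunk${chunkIndex}`;
}

// Recover the parent document id and chunk index from a delimited id.
function parseChunkId(
  id: string,
  delimiter = "#",
): { documentId: string; chunkIndex: number } {
  const pos = id.lastIndexOf(delimiter);
  if (pos < 0) {
    throw new Error(`id "${id}" does not contain the delimiter "${delimiter}"`);
  }
  const suffix = id.slice(pos + delimiter.length);
  const match = /^chunk(\d+)$/.exec(suffix);
  if (!match) {
    throw new Error(`id "${id}" does not end in a chunk suffix`);
  }
  return { documentId: id.slice(0, pos), chunkIndex: Number(match[1]) };
}

const exampleId = chunkId("doc1", 3);
const parsed = parseChunkId(exampleId);
```

Ids built this way are plain strings, so they only help when the underlying store accepts string ids in the first place.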

@alph486
Author

alph486 commented Feb 2, 2025

@nick-w-nick were you using the 'index' feature when you achieved that result? This issue is about the way the indexing feature handles the ids, not the way that vector stores themselves handle them.

In this case index() is handling the vector store for you and assuming uuids afaict.

2 participants