Add async search and find_similar APIs #90

Open · wants to merge 11 commits into main

Conversation

tomusher (Member)

This PR adds the async methods afind_similar and asearch to the public API exposed by a Vector Index.

This involved creating async variants of various methods further up the chain. I've also broken up the bulk_generate_documents method into smaller functions, both to reduce the size of that function body and so we can reuse as much as possible between the async and non-async versions.
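
As a rough sketch of that shape (the helper names here — make_chunks, _build_documents, aembed — are hypothetical, not the actual names in the PR): the pure, I/O-free steps live in shared helpers, and only the blocking edges differ between the sync and async entry points.

```python
def _chunk_objects(objects):
    # Pure, I/O-free step shared by both entry points.
    # make_chunks stands in for the real chunking logic.
    return [chunk for obj in objects for chunk in make_chunks(obj)]

def bulk_generate_documents(objects, embedding_backend):
    chunks = _chunk_objects(objects)
    vectors = list(embedding_backend.embed(chunks))  # blocking I/O
    return _build_documents(chunks, vectors)         # shared, pure

async def abulk_generate_documents(objects, embedding_backend):
    chunks = _chunk_objects(objects)                        # same shared helper
    vectors = list(await aembed(embedding_backend, chunks))  # async I/O
    return _build_documents(chunks, vectors)                # same shared helper
```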

@emilytoppm (Member) left a comment:

Generally looking good, but I have a few questions, particularly on transactions/atomicity


async def _acreate_new_documents(

Hmm - it feels like this and the sync version above should be atomic, given the deletion + replacement. Should this be wrapped in a transaction and sync_to_async? You could always work out the embeddings outside the transaction.
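
For what it's worth, a minimal sketch of that wrapping, assuming Document is the Django model used above and a hypothetical object_key field; sync_to_async comes from asgiref:

```python
from asgiref.sync import sync_to_async
from django.db import transaction

def _replace_documents(object_key, new_documents):
    # Delete + replace in one transaction, so a reader never sees the
    # index with the old documents gone and the new ones not yet saved.
    with transaction.atomic():
        Document.objects.filter(object_key=object_key).delete()  # field name assumed
        Document.objects.bulk_create(new_documents)

async def _areplace_documents(object_key, new_documents):
    # Embeddings are computed by the caller, outside the transaction,
    # so only the quick delete + insert runs on the sync thread.
    await sync_to_async(_replace_documents)(object_key, new_documents)
```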

all_chunks = list(
chain(*[obj["chunks"] for obj in objects_to_rebuild.values()])
)

embedding_vectors = list(embedding_backend.embed(all_chunks))

How long does this take typically? Would it make sense to move it outside the transaction, or do it asynchronously?
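
If it is slow (e.g. a network call out to an embedding service), one hedged option is to offload it to a worker thread so the event loop isn't blocked and no transaction is held open around it:

```python
import asyncio

async def aembed_chunks(embedding_backend, all_chunks):
    # Run the blocking embed() call on a worker thread; no database
    # access happens here, so no transaction needs to be open yet.
    return await asyncio.to_thread(lambda: list(embedding_backend.embed(all_chunks)))
```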

for idx, returned_embedding in documents:
all_keys = self._keys_for_instance(objects_by_key[object_key])
chunk = all_chunks[idx]
await Document.objects.acreate(

Do we want to be creating them one at a time rather than in bulk? This feels like it'll move a lot of work onto a sync thread per acreate call when we could just use bulk_create once.
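
For comparison, a sketch of the bulk variant (Django 4.1+ has abulk_create; the Document field names here are placeholders, since the real ones are truncated in the snippet above):

```python
async def _acreate_documents_bulk(rows):
    # Build unsaved instances in memory, then hop to the sync thread
    # once for a single bulk INSERT instead of once per acreate() call.
    instances = [
        Document(content=chunk, vector=embedding)  # placeholder field names
        for chunk, embedding in rows
    ]
    return await Document.objects.abulk_create(instances)
```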


yield from self._create_new_documents(object, chunks, embedding_backend)

async def ato_documents(

We're losing the transaction here - would it be possible to restructure this so we generate the document instances asynchronously, then save them later? It just feels like we're risking some inconsistent state.

@tomusher (Member, Author) commented Dec 6, 2024

Thanks for the feedback @emilytoppm, and sorry for the delay in updating this. I've just done a bit of a rework to divide the generation of documents and the saving of documents into two stages, so we can keep the transaction small and ensure it's usable in an async context.
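
In outline, the two-stage split might look like this (names and fields are illustrative, not the exact code in the PR):

```python
import asyncio

from asgiref.sync import sync_to_async
from django.db import transaction

async def agenerate_documents(objects, embedding_backend):
    # Stage 1: no database writes, so it's safe to await freely.
    chunks = [chunk for obj in objects for chunk in make_chunks(obj)]  # hypothetical helper
    vectors = await asyncio.to_thread(lambda: list(embedding_backend.embed(chunks)))
    return [Document(content=c, vector=v) for c, v in zip(chunks, vectors)]

def _save_documents(object_keys, documents):
    # Stage 2: a short transaction covering only the delete + insert.
    with transaction.atomic():
        Document.objects.filter(object_key__in=object_keys).delete()  # field name assumed
        Document.objects.bulk_create(documents)

async def asave_documents(object_keys, documents):
    await sync_to_async(_save_documents)(object_keys, documents)
```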
