[Feature Request]: Smoother API for .delete() #3207

Open
mr-infty opened this issue Nov 27, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@mr-infty

Describe the problem

At the moment, there is no convenient way to delete all entries in a collection (without deleting the collection itself). Even though .delete() accepts None as an argument to ids, there is no "wildcard filter" that could be given to where as an argument.

Describe the proposed solution

Either make .delete() with no arguments delete all entries in the collection, or make it possible to pass where={} as a wildcard filter matching all documents.

Alternatives considered

No response

Importance

would make my life easier

Additional Information

No response

@mr-infty mr-infty added the enhancement New feature or request label Nov 27, 2024
@tazarov
Contributor

tazarov commented Nov 27, 2024

@mr-infty, you can use this:

import uuid
import chromadb
import numpy as np

# 500 random 384-dimensional embeddings to populate the collection
data = np.random.uniform(-1, 1, (500, 384))

client = chromadb.PersistentClient("delete_all")
collection = client.get_or_create_collection("test_collection")
ids = [str(uuid.uuid4()) for _ in range(data.shape[0])]
documents = [f"document {i}" for i in range(data.shape[0])]
collection.add(ids=ids, embeddings=data, documents=documents)

print("Collection count", collection.count())

# No entry carries a "__bastion_key__" metadata field, so this "not equal" filter is meant to match every entry
collection.delete(where={"__bastion_key__": {"$ne": 1}})

print("Collection count after delete", collection.count())

Works like a charm. However, note that because of how the HNSW index works, it is recommended to delete and recreate the collection instead. The caveat is that the HNSW index grows without bound: deleted embeddings are only flagged as deleted, not removed.

@mr-infty
Author

Okay, I guess that collection.delete(where={"__bastion_key__": {"$ne":1}}) is a usable workaround, but surely something as simple as deleting all items in the collection should have a simple interface? Moreover, it appears that the unwillingness of the API to accept empty objects (metadata or where filters) has caused trouble elsewhere.

It seems to me that providing the ability to have empty metadata and empty filters would streamline the API a lot.

@HammadB
Collaborator

HammadB commented Dec 2, 2024

We are actually somewhat opposed to allowing people to easily delete everything in their collection; it's too easy a footgun to trigger accidentally.

Maybe we could do a safety override, i.e.

collection.delete(all=True) deletes everything, whereas collection.delete() is a no-op. But this creates other confusing states.
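
For illustration, the safety override could be prototyped today as a wrapper around an existing collection. This is only a sketch: guarded_delete and its delete_all parameter are hypothetical names (delete_all is used instead of all to avoid shadowing the Python builtin), and "delete everything" is implemented by fetching the ids first. It also makes one of the "confusing states" explicit by rejecting a selection combined with delete_all=True.

```python
from typing import Any, Optional, Sequence


def guarded_delete(
    collection: Any,
    ids: Optional[Sequence[str]] = None,
    where: Optional[dict] = None,
    delete_all: bool = False,
) -> bool:
    """Hypothetical wrapper, not part of Chroma. Returns True if a delete was attempted."""
    has_selection = ids is not None or where is not None

    # One of the confusing states: a selection AND "delete everything".
    if delete_all and has_selection:
        raise ValueError("pass either a selection or delete_all=True, not both")

    if delete_all:
        # include=[] asks get() for ids only (no documents or embeddings).
        all_ids = collection.get(include=[])["ids"]
        if all_ids:  # skip the call entirely for an empty collection
            collection.delete(ids=all_ids)
        return True

    if not has_selection:
        # No selection and no explicit override: do nothing.
        return False

    collection.delete(ids=list(ids) if ids is not None else None, where=where)
    return True
```

With this shape, guarded_delete(collection) does nothing, while guarded_delete(collection, delete_all=True) empties the collection.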

@tazarov
Contributor

tazarov commented Dec 3, 2024

@mr-infty, we have a similar mechanism for deleting everything with reset(). However, reset(), much like delete() with no params, throws an error unless a flag is explicitly configured (it is off by default, of course).

MySQL has something similar with SET SQL_SAFE_UPDATES = 1. So perhaps a similar flag would make sense here.

Regarding empty params, that does not feel very ergonomic to me. Wouldn't it make more sense for the absence of parameters to be treated as empty params, rather than forcing the caller to pass empty params? It also introduces confusion: is deleting nothing that matches the same as deleting everything? The example I showed you above is a case in point: it is ugly and confusing as hell (it does the job, though).

Going down that 🐰 hole, you might as well make the argument for a completely separate method that conveys in unambiguous terms what it does, e.g. collection.truncate(). I think it is no coincidence that the SQL standard defines it. Furthermore, we can look for opportunities to make truncate more efficient: not just deleting everything in the collection, but also making sure you start with a fresh, empty collection. Today, if you apply the workaround above, or if we implement delete() with no params the same way we implement deletions with params, you end up with an HNSW index that is full of "dead" labels and tons of data you don't need or want. Instead, truncate could recreate the index and make sure the metadata is properly scrubbed, as if you were calling delete + create (but without changing collection characteristics like ID, HNSW config, etc.).
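
Until something like truncate() exists, the closest client-side approximation is probably the delete + create sequence described above. The sketch below is an assumption-laden stand-in: truncate_collection is a hypothetical helper, it only carries over the collection's metadata, and unlike the proposed truncate() it yields a collection with a new ID. It does not carry over an embedding function or any other configuration, which the caller would have to pass again.

```python
from typing import Any


def truncate_collection(client: Any, name: str) -> Any:
    """Hypothetical helper: drop `name` and recreate it with the same metadata.

    Unlike the truncate() idea in the thread, the new collection gets a new ID.
    """
    old = client.get_collection(name)

    # Some Chroma versions reject an empty metadata dict, so normalise {} to None.
    metadata = dict(old.metadata) if old.metadata else None

    client.delete_collection(name)
    return client.create_collection(name, metadata=metadata)
```

Because the old index is dropped rather than tombstoned, the recreated collection starts without any "dead" labels.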

@mr-infty
Author

mr-infty commented Dec 7, 2024

@tazarov No, there is no confusion: the most obvious semantics of delete(args) is that its arguments specify a selection, and that delete() simply deletes that selection. The syntax for deleting everything is then simply dictated by the syntax for selections, and therefore the syntax for deletion is only as confusing as the syntax for selection.

@tazarov
Contributor

tazarov commented Dec 12, 2024

@mr-infty, that’s an interesting point. Following that logic, wouldn’t it make sense to approach deletion like this?

collection.delete(ids=collection.get(include=[])["ids"])

This way, deletion is strictly tied to explicit selections (via get()), avoiding any ambiguity about whether “no selection” should be interpreted as a valid selection.
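
For larger collections, the one-liner above may be easier to live with when the ids are deleted in chunks. The following is a minimal sketch under that assumption: delete_all_by_ids and its batch_size default are hypothetical, and the right chunk size depends on the backend's limits. It also avoids calling delete() at all when the collection is already empty.

```python
from typing import Any


def delete_all_by_ids(collection: Any, batch_size: int = 1000) -> int:
    """Hypothetical helper: delete every entry via an explicit id selection.

    Returns the number of ids that were selected for deletion.
    """
    # include=[] asks get() for ids only (no documents or embeddings).
    all_ids = collection.get(include=[])["ids"]

    # Delete in chunks; the loop body never runs for an empty collection.
    for start in range(0, len(all_ids), batch_size):
        collection.delete(ids=all_ids[start:start + batch_size])

    return len(all_ids)
```

Note that this still goes through the regular deletion path, so the HNSW caveat mentioned earlier in the thread applies here as well.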

@mr-infty
Author

@tazarov Yes, that would be one possible way of doing it.

However, what I had in mind was more like reifying the selection itself as some data structure, so that you could say collection.get(selection) and collection.delete(selection), which might be more efficient.
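
One way to picture a reified selection is a small value object that can be handed to both get() and delete(). The sketch below is purely illustrative: the Selection class and as_kwargs() are hypothetical names, and it only wraps the ids, where, and where_document arguments that both methods already accept, so it would not bring the efficiency gain by itself.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Selection:
    """Hypothetical value object describing which entries an operation targets."""

    ids: Optional[Tuple[str, ...]] = None
    where: Optional[dict] = None
    where_document: Optional[dict] = None

    def as_kwargs(self) -> dict:
        """Build keyword arguments, leaving out every part that is unset."""
        kwargs: dict = {}
        if self.ids is not None:
            kwargs["ids"] = list(self.ids)
        if self.where is not None:
            kwargs["where"] = self.where
        if self.where_document is not None:
            kwargs["where_document"] = self.where_document
        return kwargs


# Usage with an existing collection object (not executed here):
#   sel = Selection(where={"topic": "drafts"})
#   preview = collection.get(**sel.as_kwargs())
#   collection.delete(**sel.as_kwargs())
```

An empty Selection() produces no arguments at all, which is exactly the case the thread disagrees about: get() would return everything, while delete() with no params currently throws.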
