-
Notifications
You must be signed in to change notification settings - Fork 194
feat: adding the operation delete_all_documents
to the OpenSearchDocumentStore
#2321
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: adding the operation delete_all_documents
to the OpenSearchDocumentStore
#2321
Conversation
delete_all_documents
to the OpenSearchDocumentStore
Just a quick thought... I am not sure deleting and recreating the index is a good idea. As a user, I'd expect A bulk delete might be a safer option. Maybe it's also worth checking how this was done in Haystack 1.x. |
I was investigating into it and for both Elastic and Open, specially for large volumes of data, this is the most efficient way to do it. In both cases the indexes are recreated with the same mappings/settings. The only issue is if a user changes the index after creation/population. |
What about something like: def delete_all_documents(self, recreate_index=False) By default we apply a usual delete operation, and if |
I would prefer this idea! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks quite good to me already. My main suggestion is that if we have a try except in delete_all_documents_async, it makes sense to add the same to the sync implementation of delete_all_documents to ensures consistent behavior.
Other than that, I have a suggestion for the tests to make sure they only test one thing. You could create a separate test, for example test_index_functionality_after_delete_all_documents
for testing writing and retrieving of documents after deleting all documents. Something like:
def test_index_functionality_after_delete_all_documents(self, document_store):
"""Test that documents can be written and retrieved after delete_all_documents"""
document_store.write_documents([Document(id="1", content="Test")]) # or add more documents
document_store.delete_all_documents()
new_doc = Document(id="2", content="New test")
document_store.write_documents([new_doc]) # or add more documents
assert document_store.count_documents() == 1
results = document_store.filter_documents()
assert len(results) == 1
assert results[0].content == "New test"
A fast way to clear all documents from the document store while preserving any index settings and mappings. | ||
""" | ||
self._ensure_initialized() | ||
assert self._client is not None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's consistent with the other methods to have this assert here. Therefore, I agree it's good to have it here.
However, I believe the only case it covers is if self._client was set at some point and then for some reason is set to None again. We have the same assert in _ensure_index_exists
, which get's called as part of _ensure_initialized
when it is run for the first time.
Related Issues
Proposed Changes:
delete_all_documents()
method to theOpenSearchDocumentStore
class to both Synchronous and Asynchronous versionsrecreate_index=False
(default): Usesdelete_by_query
API for faster deletion of large datasetsrecreate_index=True
: Recreates the entire index, preserving original mappings and settings_deserialize_search_hits()
→@staticmethod
_process_bulk_write_errors()
→@staticmethod
_deserialize_document()
→@staticmethod
_postprocess_bm25_search_results()
->@staticmethod
📝 Usage Example
How did you test it?
test_document_store.py
- Addedtest_delete_all_documents()
with and without index recreationtest_document_store_async.py
- Addedtest_delete_all_documents_async()
with and without index recreationChecklist
fix:
,feat:
,build:
,chore:
,ci:
,docs:
,style:
,refactor:
,perf:
,test:
.