davidsbatista
Contributor

@davidsbatista davidsbatista commented Sep 25, 2025

Related Issues

Proposed Changes:

  • adds a delete_all_documents() method to the OpenSearchDocumentStore class, in both synchronous and asynchronous versions
  • recreate_index=False (default): Uses delete_by_query API for faster deletion of large datasets
  • recreate_index=True: Recreates the entire index, preserving original mappings and settings
  • other small improvements:
    • _deserialize_search_hits() -> @staticmethod
    • _process_bulk_write_errors() -> @staticmethod
    • _deserialize_document() -> @staticmethod
    • _postprocess_bm25_search_results() -> @staticmethod
  • minor typo corrections in docstrings
  • added proper SPDX license headers to test files

📝 Usage Example

# Delete all documents (fast, keeps index structure)
document_store.delete_all_documents()

# Delete all documents and recreate index (slower, resets index state)
document_store.delete_all_documents(recreate_index=True)
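
For readers unfamiliar with the two modes, here is a rough sketch of what each could do against an opensearch-py style client. It is not the code in this PR: the standalone helper, the READ_ONLY_SETTINGS tuple, and the in-memory FakeClient/FakeIndices stubs are all assumptions made so the snippet runs without a cluster. The client calls it relies on (delete_by_query, indices.get, indices.delete, indices.create) are the usual opensearch-py ones.

```python
from copy import deepcopy
from typing import Any, Dict

# Settings OpenSearch generates itself; assumed here to be stripped before re-creating the index.
READ_ONLY_SETTINGS = ("uuid", "creation_date", "provided_name", "version")


def delete_all_documents(client: Any, index: str, recreate_index: bool = False) -> None:
    """Hypothetical standalone helper, not the method added in this PR."""
    if not recreate_index:
        # Remove every document but leave the index itself untouched.
        client.delete_by_query(index=index, body={"query": {"match_all": {}}})
        return

    # Capture mappings and settings before dropping the index.
    info = client.indices.get(index=index)[index]
    body = {"mappings": deepcopy(info["mappings"]), "settings": deepcopy(info["settings"])}
    for key in READ_ONLY_SETTINGS:
        body["settings"].get("index", {}).pop(key, None)

    client.indices.delete(index=index)
    client.indices.create(index=index, body=body)


class FakeIndices:
    """In-memory stand-in for client.indices, only for trying the sketch locally."""

    def __init__(self, indices: Dict[str, Dict[str, Any]]) -> None:
        self._indices = indices

    def get(self, index: str) -> Dict[str, Any]:
        meta = self._indices[index]
        return {index: {"mappings": deepcopy(meta["mappings"]), "settings": deepcopy(meta["settings"])}}

    def delete(self, index: str) -> None:
        del self._indices[index]

    def create(self, index: str, body: Dict[str, Any]) -> None:
        self._indices[index] = {"mappings": body["mappings"], "settings": body["settings"], "docs": {}}


class FakeClient:
    """In-memory stand-in for an OpenSearch client."""

    def __init__(self) -> None:
        self._indices: Dict[str, Dict[str, Any]] = {}
        self.indices = FakeIndices(self._indices)

    def delete_by_query(self, index: str, body: Dict[str, Any]) -> None:
        # The stub only understands match_all, which is all the sketch sends.
        assert body == {"query": {"match_all": {}}}
        self._indices[index]["docs"].clear()
```

In this sketch the recreate path copies whatever mappings and settings the index has at call time, so manual changes would carry over; whether the real implementation behaves the same way depends on where it reads the mappings from.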

How did you test it?

  • unit tests, integration tests + CI tests
  • test_document_store.py - Added test_delete_all_documents() with and without index recreation
  • test_document_store_async.py - Added test_delete_all_documents_async() with and without index recreation

Checklist

@github-actions github-actions bot added the integration:opensearch and type:documentation labels Sep 25, 2025
@davidsbatista davidsbatista changed the title from "Feat/adding delete all docs to open search document store" to "feat: adding the operation delete_all_documents to the OpenSearchDocumentStore" Sep 25, 2025
@davidsbatista davidsbatista marked this pull request as ready for review September 25, 2025 14:15
@davidsbatista davidsbatista requested a review from a team as a code owner September 25, 2025 14:15
@davidsbatista davidsbatista requested review from julian-risch and removed request for a team September 25, 2025 14:15
@anakin87
Member

Just a quick thought... I am not sure deleting and recreating the index is a good idea.

As a user, I'd expect delete_all_documents to only remove the data, not touch the index itself. And even though we recommend managing the DB via Haystack, some users might still have changed index attributes manually. In that case, recreating the index could cause surprises.

A bulk delete might be a safer option. Maybe it's also worth checking how this was done in Haystack 1.x.

@davidsbatista
Contributor Author

davidsbatista commented Sep 26, 2025

I looked into it, and for both Elasticsearch and OpenSearch, especially with large volumes of data, this is the most efficient way to do it. In both cases the indexes are recreated with the same mappings/settings.

The only issue is if a user changes the index after creation/population.

@davidsbatista
Contributor Author

What about something like:

def delete_all_documents(self, recreate_index=False)

By default we apply a usual delete operation, and if recreate_index=True, we just recreate the index using the mappings/settings defined by the user or the default ones.

@anakin87
Member

What about something like:

def delete_all_documents(self, recreate_index=False)

I would prefer this idea!

Member

@julian-risch julian-risch left a comment

Looks quite good to me already. My main suggestion is that if we have a try/except in delete_all_documents_async, it makes sense to add the same to the sync implementation of delete_all_documents to ensure consistent behavior.
Other than that, I have a suggestion for the tests to make sure they only test one thing. You could create a separate test, for example test_index_functionality_after_delete_all_documents for testing writing and retrieving of documents after deleting all documents. Something like:

def test_index_functionality_after_delete_all_documents(self, document_store):
    """Test that documents can be written and retrieved after delete_all_documents"""
    document_store.write_documents([Document(id="1", content="Test")])  # or add more documents
    document_store.delete_all_documents()
    
    new_doc = Document(id="2", content="New test")
    document_store.write_documents([new_doc])  # or add more documents
    assert document_store.count_documents() == 1
    
    results = document_store.filter_documents()
    assert len(results) == 1
    assert results[0].content == "New test"
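
On the first suggestion in this review (the same try/except in both variants), a minimal sketch of what consistent behavior could look like follows. Everything in it is illustrative: StoreError, SketchStore, and the two failing stub clients are hypothetical names, and the exception type actually raised by the PR is not shown in this conversation.

```python
import asyncio
from typing import Any, Dict

MATCH_ALL: Dict[str, Any] = {"query": {"match_all": {}}}


class StoreError(Exception):
    """Hypothetical error type standing in for whatever the store really raises."""


class SketchStore:
    """Illustrative store with mirrored sync and async delete methods."""

    def __init__(self, client: Any, async_client: Any, index: str = "docs") -> None:
        self._client = client
        self._async_client = async_client
        self._index = index

    def delete_all_documents(self) -> None:
        try:
            self._client.delete_by_query(index=self._index, body=MATCH_ALL)
        except Exception as exc:
            # Same wrapping as the async variant, so callers handle one error type.
            raise StoreError(f"Failed to delete all documents from index '{self._index}': {exc}") from exc

    async def delete_all_documents_async(self) -> None:
        try:
            await self._async_client.delete_by_query(index=self._index, body=MATCH_ALL)
        except Exception as exc:
            raise StoreError(f"Failed to delete all documents from index '{self._index}': {exc}") from exc


class FailingClient:
    """Stub sync client that always fails, to exercise the error path."""

    def delete_by_query(self, index: str, body: Dict[str, Any]) -> None:
        raise ConnectionError("connection refused")


class FailingAsyncClient:
    """Stub async client that always fails, to exercise the error path."""

    async def delete_by_query(self, index: str, body: Dict[str, Any]) -> None:
        raise ConnectionError("connection refused")
```

With both paths wrapped the same way, a caller can catch one error type regardless of which variant was used.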

A fast way to clear all documents from the document store while preserving any index settings and mappings.
"""
self._ensure_initialized()
assert self._client is not None
Member

It's consistent with the other methods to have this assert here. Therefore, I agree it's good to have it here.

However, I believe the only case it covers is if self._client was set at some point and then for some reason is set to None again. We have the same assert in _ensure_index_exists, which gets called as part of _ensure_initialized when it is run for the first time.
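
To make that case concrete, here is a small sketch with a hypothetical LazyStore class (a plain dict stands in for the real client). The assert mostly serves to narrow the Optional type for static checkers, and at runtime it only fires if the client is reset after initialization.

```python
from typing import Dict, Optional


class LazyStore:
    """Hypothetical store that creates its client lazily, like the pattern discussed above."""

    def __init__(self) -> None:
        self._client: Optional[Dict[str, int]] = None
        self._initialized = False

    def _ensure_initialized(self) -> None:
        # Runs the setup only once; later calls are no-ops.
        if not self._initialized:
            self._client = {"docs": 0}  # stand-in for a real client object
            self._initialized = True

    def count_documents(self) -> int:
        self._ensure_initialized()
        # Narrows Optional[...] for type checkers; fails only if _client was reset to None.
        assert self._client is not None
        return self._client["docs"]
```

Note that Python drops assert statements when run with -O, so this pattern is a typing aid rather than a runtime guarantee.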
