
Cannot update embeddings on a Wikipedia dump #1343

Closed
chrk623 opened this issue Aug 13, 2021 · 2 comments

Comments


chrk623 commented Aug 13, 2021

Might be related to #1318.

I'm currently using the enwiki-latest-pages-articles.xml.bz2 Wikimedia dump. I've tried several ways of updating the embeddings for the entire dump, but still no luck. I've removed all redirects, which leaves around 6,311,807 documents, and after I split them up with a sliding window (split_by="word", split_length=512, split_overlap=258), the total number of documents is 11,507,338.
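The splitting step is roughly the following (a sketch assuming Haystack's PreProcessor; wiki_docs stands in for the parsed, redirect-free articles):

from haystack.preprocessor import PreProcessor

processor = PreProcessor(
    split_by="word",
    split_length=512,
    split_overlap=258,
    split_respect_sentence_boundary=False,  # plain word-based sliding window
)

# wiki_docs: list of {"text": ..., "meta": {...}} dicts parsed from the dump
passages = []
for doc in wiki_docs:
    passages.extend(processor.process(doc))

document_store.write_documents(passages, index="wiki")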

The problem is that every time I run document_store.update_embeddings with a DensePassageRetriever, I get stuck at 512000/11507338 and then the process gets killed with an error:

Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/haystack/document_store/faiss.py", line 464, in load
    faiss_index = faiss.read_index(str(faiss_file_path))
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/faiss/swigfaiss.py", line 5424, in read_index
    return _swigfaiss.read_index(*args)
MemoryError: std::bad_alloc

As suggested in other issues, I've tried lowering the batch_size, but still no luck. Current state:

from haystack.document_store import FAISSDocumentStore


document_store = FAISSDocumentStore.load(
    faiss_file_path="docstore",
    sql_url="sqlite:///haystack.db",
    index="wiki"
)
document_store.get_document_count()
# 11507338
document_store.get_embedding_count()
# 5120000
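
For reference, the update step itself is roughly the following (model names and batch_size are approximations of what I'm running):

from haystack.retriever.dense import DensePassageRetriever

retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    embed_title=True,
    use_gpu=True,
)

# This is the call that stalls and eventually gets killed; lowering batch_size did not help.
document_store.update_embeddings(retriever, index="wiki", batch_size=5000)
document_store.save("docstore")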

Are there any alternative ways to run update_embeddings with data of this size?


tholor (Member) commented Aug 14, 2021

I am pretty confident that this is not related to Haystack's update_embeddings but rather to your FAISS index outgrowing your available RAM.

Possible workarounds:
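
For example, here is a rough sketch of one direction (illustrative only, not a drop-in fix): build the index with a quantized FAISS type via faiss_index_factory_str so the full float32 vectors don't all sit uncompressed in RAM. The factory string, sample size, and batch size below are placeholders that only show the shape of it.

import numpy as np

from haystack import Document
from haystack.document_store import FAISSDocumentStore

# retriever: your existing DensePassageRetriever
# passages: your ~11.5M split documents (dicts with "text" and "meta")
document_store = FAISSDocumentStore(
    sql_url="sqlite:///haystack_pq.db",        # fresh DB so it doesn't clash with the old flat index
    index="wiki",
    faiss_index_factory_str="IVF4096,PQ64",    # quantized index; tune for your data
)
document_store.write_documents(passages, index="wiki")

# IVF/PQ indexes have to be trained before vectors can be added, so train on a
# sample of passage embeddings first.
sample = [Document.from_dict(d) for d in passages[:200_000]]
sample_embeddings = np.array(retriever.embed_passages(sample))
document_store.train_index(documents=None, embeddings=sample_embeddings)

# A flat float32 index of 11.5M x 768-dim vectors needs roughly 35 GB of RAM on
# its own; the quantized vectors stay far below that.
document_store.update_embeddings(retriever, index="wiki", batch_size=5000)
document_store.save("docstore_pq")

How much quantization you can tolerate depends on the recall you need; splitting the corpus across several smaller indexes (and document stores) is another direction.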


chrk623 commented Aug 15, 2021

Makes sense.
Thank you for the references.

chrk623 closed this as completed on Aug 15, 2021