
Does DPR document store "update embeddings" utilize multiple GPUs? #1318

Closed
shihabrashid-ucr opened this issue Aug 4, 2021 · 3 comments
Labels: topic:modeling, type:feature (New feature or request)

Comments

shihabrashid-ucr commented Aug 4, 2021

I am trying to create DPR embeddings for the whole Wikipedia dataset (11 million documents).
First I ran the code on a single 16 GB GPU with 61 GB of RAM. From tqdm I could see that the whole document_store.update_embeddings() call with a batch size of 32 would take about 30 hours in total. However, after roughly 20 hours the process gets "Killed" every time, I am guessing because it runs out of RAM.
So I ran the code again, this time on a machine with 8 Tesla K80 GPUs (12 GB x 8 = 96 GB of GPU memory) and 488 GB of RAM. But now tqdm estimates 157 hours for the same batch size of 32! I ran it multiple times to make sure. So I am confused: does the update_embeddings code utilize multiple GPUs?
I followed issue #601 and understand that batch mode was introduced.
Is there any alternative way to create embeddings for 11M docs utilizing multiple GPUs?
Here is the code I am using:
```python
dpr_document_store = FAISSDocumentStore(faiss_index_factory_str="Flat",
                                        similarity="dot_product",
                                        sql_url="sqlite:///all_docs.db",
                                        index='wiki_docs')
dpr_document_store.write_documents(wiki_dict)
retriever = DensePassageRetriever(document_store=dpr_document_store,
                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
                                  max_seq_len_query=365,
                                  max_seq_len_passage=350,
                                  batch_size=32,
                                  use_gpu=True,
                                  embed_title=False,
                                  use_fast_tokenizers=True)
dpr_document_store.update_embeddings(retriever)
dpr_document_store.save('/home/ubuntu/FAISS_saves/wiki_all_docs')
```

tholor (Member) commented Aug 4, 2021

Hey @shihabrashid-ucr,

DPR's update_embeddings() does not currently support multiple GPUs. However, it makes total sense to enable them (at least via DataParallel) - I'll add it to our next sprint unless you want to provide a PR yourself here.
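For reference, the general idea would be to wrap the passage encoder in torch.nn.DataParallel so each batch gets split across the visible GPUs. Here is a minimal, generic sketch using plain transformers/PyTorch (illustrative only, not Haystack's actual update_embeddings() code):

```python
import torch
from transformers import DPRContextEncoder, DPRContextEncoderTokenizerFast

# Illustrative sketch of the DataParallel pattern, not Haystack internals.
tokenizer = DPRContextEncoderTokenizerFast.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base").to("cuda").eval()

if torch.cuda.device_count() > 1:
    # Replicate the encoder on every visible GPU; each forward pass splits
    # the batch across the replicas and gathers the resulting embeddings.
    encoder = torch.nn.DataParallel(encoder)

passages = ["Paris is the capital of France.", "Berlin is the capital of Germany."]
inputs = tokenizer(passages, padding=True, truncation=True, max_length=350, return_tensors="pt").to("cuda")
with torch.no_grad():
    embeddings = encoder(**inputs).pooler_output  # shape: (batch_size, 768)
```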

Regarding your problems with the single GPU: GPU memory shouldn't be the issue here. Can you share the error message you get there? As a temporary, hacky workaround you could also save the FAISS index every ~1 million documents. Then you at least wouldn't need to start from scratch when an error happens that late. It could be something along these lines:

...
for batch in wiki_doc_batches:
    # write and embed one chunk at a time, then persist the FAISS index
    dpr_document_store.write_documents(batch)
    dpr_document_store.update_embeddings(retriever, update_existing_embeddings=False)
    dpr_document_store.save("/home/ubuntu/FAISS_saves/wiki_all_docs")
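(wiki_doc_batches above is not an existing variable from your script, just a hypothetical chunking of wiki_dict; assuming wiki_dict is a list of document dicts, something like this would produce it:)

```python
# Hypothetical helper: split the full document list into chunks of ~1 million docs
chunk_size = 1_000_000
wiki_doc_batches = [wiki_dict[i:i + chunk_size] for i in range(0, len(wiki_dict), chunk_size)]
```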

tholor self-assigned this on Aug 4, 2021
shihabrashid-ucr (Author) commented:

The only error I am getting is "Killed" after around 25 hours.
It would be great if you could add DataParallel to update_embeddings().
Let me try out the temporary workaround and see if it works.
Thanks for all your help!

tholor (Member) commented Sep 10, 2021

Implemented in #1414

FYI @shihabrashid-ucr

tholor closed this as completed on Sep 10, 2021