
[Help Wanted] Generate LASER embeddings for a large number of sentences (15.7 million) #192

Open
NomadXD opened this issue Aug 4, 2021 · 1 comment


NomadXD commented Aug 4, 2021

For my university FYP (final-year project) on text simplification, I need to generate LASER embeddings for a large number of sentences (15.7 million). However, when I try to generate the embeddings using the SentenceEncoder in embed.py, the machine stays fully utilized for around 12 hours and then the program exits without any error (I assume because of the high CPU and GPU utilization). I'm using the SentenceEncoder in the following way.

I initialize the SentenceEncoder with the following params, using the pretrained encoder (models/bilstm.93langs.2018-12-26.pt):

```python
SentenceEncoder(encoder_path, max_tokens=3000, cpu=False, verbose=True)
```

And then I generate the LASER embeddings as follows:

```python
embeddings = encoder.encode_sentences(read_lines(bpe_filepath))
```
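Put together, the whole run is roughly the following (paths are placeholders for my actual files, and `read_lines` here just stands in for the helper I use to load the BPE-encoded lines):

```python
import numpy as np
from embed import SentenceEncoder  # LASER's source/embed.py

encoder_path = "models/bilstm.93langs.2018-12-26.pt"  # pretrained LASER encoder
bpe_filepath = "data/sentences.bpe"                   # ~15.7M BPE-encoded sentences

def read_lines(path):
    # Loads the entire BPE file into memory as a list of sentences.
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

encoder = SentenceEncoder(encoder_path, max_tokens=3000, cpu=False, verbose=True)

# Encodes all sentences in a single call and keeps the full result in memory.
embeddings = encoder.encode_sentences(read_lines(bpe_filepath))
```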

I ran this setup with the above params on a GCP Compute Engine instance with 16 cores, 102 GB of memory, and one NVIDIA Tesla T4 GPU. CPU utilization reaches 100% while GPU utilization sits around 90%. It stays like that for around 12 hours and then the process exits without any error (nothing in nohup.out).

Any idea about what could be going wrong? I've been stuck at this point for several weeks and would really appreciate it if someone could help.

cc @hoschwenk

@prasunshrestha

I also have a similar issue. Not sure if this would help, but have you tried ThreadPoolExecutor or multiprocessing to parallelize? If you are not married to LASER, there are many embedding models now based on transformer architecture (unlike LASER's BiLSTM), so the computation is much faster from the get-go.
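For what it's worth, even before bringing in multiprocessing, a plain chunked loop that saves each chunk's embeddings to disk should keep peak memory bounded and make the run resumable. A rough, untested sketch (chunk size and paths are placeholders; SentenceEncoder and the model path are as in your post):

```python
import numpy as np
from embed import SentenceEncoder  # LASER's source/embed.py

CHUNK_SIZE = 100_000  # sentences per chunk -- tune to available memory

encoder = SentenceEncoder("models/bilstm.93langs.2018-12-26.pt",
                          max_tokens=3000, cpu=False, verbose=True)

def chunks(path, size):
    # Yield successive lists of `size` sentences without loading the whole file.
    buf = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            buf.append(line.rstrip("\n"))
            if len(buf) == size:
                yield buf
                buf = []
    if buf:
        yield buf

for i, batch in enumerate(chunks("data/sentences.bpe", CHUNK_SIZE)):
    emb = encoder.encode_sentences(batch)      # one chunk at a time on the GPU
    np.save(f"embeddings_{i:05d}.npy", emb)    # partial results survive a crash
```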
