Not able to use the embedding for calculating similarity. #43
Comments
I am not sure how you are computing cosine similarities, but if you want to use PyTorch tensors in numpy, as the error suggested, you need something like:
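Presumably something along these lines, where `output` is the tensor returned by the model (a sketch, not the original snippet):

```python
# Detach from the autograd graph before converting to numpy,
# exactly as the RuntimeError suggests.
embeddings = output.detach().numpy()
```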
This is one embedding per token. You probably still need to pool them into a single vector, along the lines of the snippet below.
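A sketch of one common pooling choice (mean-pooling over the token axis; other choices, such as taking the first token, are also possible):

```python
# collapse [1, seq_len, 768] token embeddings into one [1, 768] document vector
doc_embedding = embeddings.mean(axis=1)
```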
Hey @ibeltagy, thanks for the help! I tried this and did get a result, but for some reason it's quite a strange one: quite dissimilar texts get a score above 90%, which is really unexpected (they seem to hover around 0.97, specifically). I wonder if I'm doing something wrong. Here is my code:
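A hypothetical sketch of such a pipeline (README-style encoding followed by cosine similarity; the model name, pooling choice, and sample texts here are all assumptions, not the original code):

```python
import torch
import torch.nn.functional as F
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model = LongformerModel.from_pretrained('allenai/longformer-base-4096')
model.eval()

def embed(text):
    input_ids = tokenizer(text, return_tensors='pt').input_ids
    with torch.no_grad():
        token_embeddings = model(input_ids)[0]  # [1, seq_len, 768]
    return token_embeddings.mean(dim=1)         # mean-pool to [1, 768]

text_a = "The quick brown fox jumps over the lazy dog."
text_b = "A completely unrelated sentence about stock markets."
score = F.cosine_similarity(embed(text_a), embed(text_b)).item()
print(score)  # tends to come out suspiciously high, as described above
```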
Hi @ibeltagy, I'm also having the same issue: cosine similarity is extremely high for supposedly different articles, in my case 0.98x~0.99x. My code is also similar to @youssefavx's, adapted from the README sample code with little modification. I'm using the pretrained Longformer model.
I see. I think the problem is that RoBERTa (and similarly, the pretrained Longformer) wasn't trained on the next-sentence prediction task, so the model never learned to aggregate the input into the [CLS] token.
Sorry, the example is not clear. It means "put global attention on token numbers 1, 4 and 21".
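As a concrete illustration, assuming the README convention where attention-mask value 2 marks global attention, 1 local attention, and 0 padding (and reusing `input_ids` from the earlier snippets):

```python
import torch

attention_mask = torch.ones(input_ids.shape, dtype=torch.long)  # 1 = local attention everywhere
attention_mask[:, [1, 4, 21]] = 2  # 2 = global attention on token positions 1, 4 and 21
```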
^^ this. During pretraining, the only loss that directly impacts the top-layer [CLS] embedding is next-sentence prediction, and since RoBERTa (and therefore the pretrained Longformer) dropped that loss, the [CLS] vector is never trained to summarize the whole input.
Cool, big thanks for the explanation @ibeltagy and @matt-peters, and sorry for the late reply. I tried the suggestions. If I understand correctly, sentence-transformers is trying to solve the issue you mentioned (aggregating the input into the [CLS] token)? And Longformer kind of upgrades pre-trained models to support long documents, but if the sentence embedding was bad to begin with, it would still be bad after the upgrade to the long version? So what we're trying to do is semantic search over documents; in my use case they are resumes from employees. I guess we need to somehow combine these two models, but what's the right procedure here? We have:

- sentence-transformers models, which produce good sentence embeddings but only for short inputs, and
- Longformer, which supports long documents but doesn't produce good document embeddings out of the box.

And the question is: what is the right way to combine the two? I just started learning NLP from scratch recently, sorry for all the beginner questions here, and thanks a lot for your help and contribution!
If I understand correctly, does this mean that it is not possible to use Longformer to generate document embeddings that work with cosine similarity (or any other similarity metric)?
Here are a few options sorted by expected performance:
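One option in that spirit, purely as an illustrative sketch (the model name and chunking scheme are assumptions): embed a long document chunk by chunk with a sentence-transformers model and average the chunk vectors.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer('bert-base-nli-mean-tokens')  # example model name

def embed_long(text, chunk_size=200):
    words = text.split()
    chunks = [' '.join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]
    return np.mean(st_model.encode(chunks), axis=0)  # average the chunk embeddings
```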
Looks like we addressed this issue. I will close it for now, but please feel free to reopen or create a new one if you have more questions.
This was in the case of BERT. Can we do the same with Longformer? Or can we take the last 4 hidden layers' output and do some sort of averaging, which would serve as the embedding?
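A sketch of that last-4-layers idea, assuming the Hugging Face transformers Longformer API; the exact averaging scheme is just one possibility:

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model = LongformerModel.from_pretrained('allenai/longformer-base-4096')
model.eval()

inputs = tokenizer("some long document ...", return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple: the embedding layer plus one tensor per transformer
# layer, each of shape [1, seq_len, 768]
last_four = torch.stack(outputs.hidden_states[-4:])  # [4, 1, seq_len, 768]
token_embeddings = last_four.mean(dim=0)             # average the four layers
doc_embedding = token_embeddings.mean(dim=1)         # then average over tokens -> [1, 768]
```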
First of all let me thank you for contributing this knowledge to us. It makes a lot of difference for beginners like me. :)
Now the issue: I was trying to use Longformer for calculating the similarity between a query and a list of paragraphs retrieved from my index search. The idea is to re-rank these paragraphs based on the cosine similarity between the embedding of the question and that of each individual paragraph.
However, once I have calculated the embeddings of both query and paragraph using this code:

```python
SAMPLE_TEXT = f'{tokenizer.cls_token}{SAMPLE_TEXT}{tokenizer.eos_token}'
...
output = model(input_ids, attention_mask=attention_mask)[0]
```
I get an embedding of dimension torch.Size([1, 512, 768]), and when I try to calculate the cosine similarity on these embeddings I get an error saying:
`RuntimeError: Can't call numpy() on Variable that requires grad. Use var.detach().numpy() instead.`
I do see that the error recommends using var.detach().numpy() instead of numpy() (https://stackoverflow.com/questions/55466298/pytorch-cant-call-numpy-on-variable-that-requires-grad-use-var-detach-num).
However, I am unsure where I should add this call.
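For what it's worth, a sketch of where that call would go, reusing the variables from the snippet above: detach right after the forward pass, before anything numpy-based touches the tensor.

```python
output = model(input_ids, attention_mask=attention_mask)[0]  # [1, seq_len, 768]
embedding = output.detach().numpy()  # now safe to use with numpy / scikit-learn cosine routines
```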
I am a beginner, so please pardon me if I have raised an issue unrelated to Longformer.
Thanks for the help :)