Not able to use the embedding for calculating similarity. #43

Closed
titu1992 opened this issue May 25, 2020 · 10 comments

Comments

@titu1992

First of all, let me thank you for contributing this knowledge to us. It makes a lot of difference for beginners like me. :)
Now the issue: I was trying to use Longformer to calculate the similarity between a query and a list of paragraphs retrieved from my index search. The idea is to re-rank these paragraphs based on the cosine similarity between the embedding of the question and the embedding of each individual paragraph.

However, once I have calculated the embeddings of both the query and a paragraph using this code: SAMPLE_TEXT = f'{tokenizer.cls_token}{SAMPLE_TEXT}{tokenizer.eos_token}'
...................................
......................
output = model(input_ids, attention_mask=attention_mask)[0]

I get an embedding of dimension torch.Size([1, 512, 768]), and when I try to calculate the cosine similarity on these embeddings I get this error:
RuntimeError: Can't call numpy() on Variable that requires grad. Use var.detach().numpy() instead.

I do see that the error recommends using var.detach().numpy() instead of numpy(): https://stackoverflow.com/questions/55466298/pytorch-cant-call-numpy-on-variable-that-requires-grad-use-var-detach-num

However, I am unsure where I should add this call.
I am a beginner, so please pardon me if I have raised an issue unrelated to Longformer.

Thanks for the help :)

@ibeltagy
Collaborator

ibeltagy commented May 26, 2020

I am not sure how you are computing cosine similarities, but if you want to use pytorch tensors in numpy, as the error suggested, you need something like output.detach().numpy().
You can also compute cosine similarity using pytorch functions directly.

I get an embedding of dimension torch.Size([1, 512, 768])

This is one embedding per token. You probably still need to do output[:, 0]. It returns the embedding of the first token, the tokenizer.cls_token, which kind of works as an embedding for the whole document.
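
A minimal sketch of both options, with random tensors standing in for two model outputs (only the shapes come from this thread; the variable names are illustrative):

import torch
import torch.nn.functional as F

# Stand-ins for two model outputs of shape [1, seq_len, hidden_size]
output_a = torch.randn(1, 512, 768)
output_b = torch.randn(1, 512, 768)

# Take the first token (<s>, i.e. tokenizer.cls_token) as the document embedding
emb_a = output_a[:, 0]  # shape [1, 768]
emb_b = output_b[:, 0]

# Option 1: cosine similarity directly in PyTorch, no numpy conversion needed
print(F.cosine_similarity(emb_a, emb_b))

# Option 2: detach first if you want to work in numpy, as the error message suggests
emb_a_np = emb_a.detach().cpu().numpy()
emb_b_np = emb_b.detach().cpu().numpy()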

@youssefavx

youssefavx commented Jun 8, 2020

Hey @ibeltagy, thanks for the help! I tried this and did get a result, but for some reason it's quite a strange one. Quite dissimilar texts get a score above 90%, which is really unexpected (they seem to hover around 0.97, specifically). I wonder if I'm doing something wrong. Here is my code:

from numba import jit
import numpy as np
import torch
from longformer.longformer import Longformer, LongformerConfig
from longformer.sliding_chunks import pad_to_window_size
from transformers import RobertaTokenizer
import os
os.chdir('/Volumes/Transcend2/longformer')
config = LongformerConfig.from_pretrained('longformer-large-4096')
@jit(nopython=True)
def cosine_similarity_numba(u:np.ndarray, v:np.ndarray):
    assert(u.shape[0] == v.shape[0])
    uv = 0
    uu = 0
    vv = 0
    for i in range(u.shape[0]):
        uv += u[i]*v[i]
        uu += u[i]*u[i]
        vv += v[i]*v[i]
    cos_theta = 1
    if uu!=0 and vv!=0:
        cos_theta = uv/np.sqrt(uu*vv)
    return cos_theta

# choose the attention mode 'n2', 'tvm' or 'sliding_chunks'
# 'n2': for regular n2 attention
# 'tvm': a custom CUDA kernel implementation of our sliding window attention
# 'sliding_chunks': a PyTorch implementation of our sliding window attention
config.attention_mode = 'sliding_chunks'

model = Longformer.from_pretrained('longformer-large-4096/', config=config)
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
tokenizer.model_max_length = model.config.max_position_embeddings

f = open('bio.txt', 'r')
firsttext = f.read()

f2 = open('religious.txt', 'r')
secondtext = f2.read()
documents = [firsttext, secondtext]
embeddings = []
for doc in documents:
    input_ids = torch.tensor(tokenizer.encode(doc)).unsqueeze(0)  # batch of size 1


    # TVM code doesn't work on CPU. Uncomment this if `config.attention_mode = 'tvm'`
    # model = model.cuda(); input_ids = input_ids.cuda()

    # Attention mask values -- 0: no attention, 1: local attention, 2: global attention
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device) # initialize to local attention
    attention_mask[:, [1, 4, 21,]] =  2  # Set global attention based on the task. For example,
                                         # classification: the <s> token
                                         # QA: question tokens

    # padding seqlen to the nearest multiple of 512. Needed for the 'sliding_chunks' attention
    input_ids, attention_mask = pad_to_window_size(
            input_ids, attention_mask, config.attention_window[0], tokenizer.pad_token_id)

    output = model(input_ids, attention_mask=attention_mask)[0]
    print(len(output))
    print(type(output))

    embedding = output[:, 0]
    embeddings.append(embedding[0].detach().cpu().numpy())





print(cosine_similarity_numba(embeddings[0], embeddings[1]))

@FantasyCheese

Hi @ibeltagy, I'm also having the same issue: cosine similarity is extremely high for supposedly different articles; in my case it's around 0.98–0.99. My code is similar to @youssefavx's, taken from the README sample code with little modification. I'm using torch.nn.functional.cosine_similarity here, but other cosine similarity calculations gave the same result.
On the other hand, I thought maybe I should set global attention as the comment suggests, but I have no idea what the original [1, 4, 21] means or how I should modify it.

import torch
from longformer.longformer import Longformer, LongformerConfig
from longformer.sliding_chunks import pad_to_window_size
from transformers import RobertaTokenizer
import torch.nn.functional as F

config = LongformerConfig.from_pretrained('longformer-base-4096/')
# choose the attention mode 'n2', 'tvm' or 'sliding_chunks'
# 'n2': for regular n2 attention
# 'tvm': a custom CUDA kernel implementation of our sliding window attention
# 'sliding_chunks': a PyTorch implementation of our sliding window attention
config.attention_mode = 'sliding_chunks'

model = Longformer.from_pretrained('longformer-base-4096/', config=config)
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
tokenizer.model_max_length = model.config.max_position_embeddings


def embed(text: str):
        text = f'{tokenizer.cls_token}{text}{tokenizer.eos_token}'
        input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)  # batch of size 1

        # TVM code doesn't work on CPU. Uncomment this if `config.attention_mode = 'tvm'`
        # model = model.cuda(); input_ids = input_ids.cuda()

        # Attention mask values -- 0: no attention, 1: local attention, 2: global attention
        attention_mask = torch.ones(input_ids.shape, dtype=torch.long,
                            device=input_ids.device)  # initialize to local attention
        attention_mask[:, [1, 4, 21, ]] = 2  # Set global attention based on the task. For example,
        # classification: the <s> token
        # QA: question tokens

        # padding seqlen to the nearest multiple of 512. Needed for the 'sliding_chunks' attention
        input_ids, attention_mask = pad_to_window_size(
        input_ids, attention_mask, config.attention_window[0], tokenizer.pad_token_id)

        output = model(input_ids, attention_mask=attention_mask)[0]
        return output[:, 0]

# This is for better readability; I also tried random Medium articles and still had the same issue
SAMPLE_TEXT_1 = ' '.join(['Hello world! '] * 1000)
SAMPLE_TEXT_2 = ' '.join(['Foo Bar! '] * 1000)

embedding1 = embed(SAMPLE_TEXT_1)
embedding2 = embed(SAMPLE_TEXT_2)
print(F.cosine_similarity(embedding1, embedding2)) # 0.9956

@ibeltagy
Collaborator

ibeltagy commented Jun 10, 2020

I see. I think the problem is that RoBERTa (and similarly, the pretrained Longformer) wasn't trained on the next-sentence prediction task, so the model never learned to aggregate the input into the [CLS] token. Can you try the same example but with RoBERTa (and a shorter input) and see if you have the same problem?
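
A minimal sketch of such a check with plain roberta-base from the transformers library (class and method names come from transformers, not from this repo; the inputs are just placeholders):

import torch
import torch.nn.functional as F
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')
model.eval()

def cls_embedding(text):
    input_ids = tokenizer.encode(text, return_tensors='pt')  # adds <s> ... </s>
    with torch.no_grad():
        hidden = model(input_ids)[0]  # [1, seq_len, hidden_size]
    return hidden[:, 0]  # embedding of the <s> (CLS) token

emb1 = cls_embedding('Hello world! ' * 50)
emb2 = cls_embedding('Foo Bar! ' * 50)
print(F.cosine_similarity(emb1, emb2))  # if the issue is the same, this is also close to 1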

but I have no idea what the original [1, 4, 21] means or how I should modify it.

Sorry, the example is not clear. It means "put global attention on token positions 1, 4, and 21".
1, 4, and 21 are arbitrary positions chosen just for the demo. In your case, you only want global attention on the [CLS] token, which is at position 0.
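
In the embed() code above, that would mean replacing the global-attention line with something like this (a sketch based on the snippets in this thread; the placeholder input_ids is only there so the snippet stands alone):

import torch

input_ids = torch.randint(0, 50265, (1, 512))  # placeholder for the real tokenized input

# 1 = local attention everywhere
attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device)
# 2 = global attention, here only on position 0 (the <s>/CLS token)
attention_mask[:, 0] = 2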

@matt-peters

^^ this. During pretraining, the only loss that directly impacts the top layer <s> hidden state is the MLM loss. As the first element is always <s>, the model will learn to always output the same hidden state for this element, so we expect cosine similarity to always be approximately 1 for any two input sequences.

@FantasyCheese

Cool, big thanks for the explanation @ibeltagy and @matt-peters, and sorry for the late reply. I tried roberta-base and yes, it has the same issue.

If I understand correctly, sentence-transformers is trying to solve the issue you mentioned (aggregating the input into the [CLS] token)? And Longformer kind of upgrades pre-trained models to support long documents, but if the sentence embedding was bad, it would still be bad after the upgrade to a long version?

So what we're trying to do is semantic search for documents; in my use case they are employee resumes. I guess we need to somehow combine these two models, but what's the right procedure here? We have:

  1. Base pre-trained models (BERT, RoBERTa, ALBERT, ...)
  2. Task to improve sentence embeddings from sentence-transformers: training_nli.py
  3. Task to upgrade to a long version from longformer: convert_model_to_long.ipynb
  4. Task to train on our corpus (hoping to improve our domain-specific document embeddings)

And the questions are:
a. How do we choose the base pre-trained model?
b. What should the training order be? For example 1>4>2>3, 1>3>2>4, etc.
c. What task should we train on for step 4?

I just started learning NLP from scratch recently, so sorry for the many beginner questions, and thanks a lot for your help and contributions!

@youssefavx

youssefavx commented Jun 14, 2020

If I understand correctly, does this mean that it is not possible to use longformer to generate document embeddings that work with cosine similarity (or any other metric)?

@ibeltagy
Collaborator

Here are a few options sorted by expected performance:

  • if you have training data, you can fine-tune our pretrained longformer to learn the task

  • use our script to build BertLong, a version of BERT that works with long documents. BERT was trained with the next-sentence prediction objective, so it might do a reasonable job of aggregating the document into the CLS token.

  • average or max-pool the token embeddings to get a document embedding (see the sketch below)
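
A minimal sketch of the pooling option, with random tensors standing in for the model output and attention mask built earlier in this thread (positions where the mask is 0 are padding and are excluded):

import torch

# Stand-ins: model output [1, seq_len, hidden] and attention mask [1, seq_len]
output = torch.randn(1, 512, 768)
attention_mask = torch.ones(1, 512, dtype=torch.long)
attention_mask[:, 400:] = 0  # pretend the tail is padding

mask = (attention_mask > 0).unsqueeze(-1).float()  # [1, seq_len, 1], keeps local/global tokens

# Mean-pool over non-padding tokens
mean_embedding = (output * mask).sum(dim=1) / mask.sum(dim=1)  # [1, hidden]

# Max-pool over non-padding tokens
max_embedding = output.masked_fill(mask == 0, float('-inf')).max(dim=1).values  # [1, hidden]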

@ibeltagy
Collaborator

Looks like we addressed this issue. I will close it for now but please feel free to reopen or create a new one if you have more questions.

@pratikchhapolika

pratikchhapolika commented Jul 20, 2021

I am not sure how you are computing cosine similarities, but if you want to use pytorch tensors in numpy, as the error suggested, you need something like output.detach().numpy().
You can also compute cosine similarity using pytorch functions directly.

I get an embedding of dimension torch.Size([1, 512, 768])

This is one embedding per token. You probably still need to do output[:, 0]. It returns the embedding of the first token, the tokenizer.cls_token, which kind of works as an embedding for the whole document.

This was in the case of BERT. Can we do the same with Longformer? Or can we take the outputs of the last 4 hidden layers and do some sort of averaging to serve as the embedding?
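
For illustration, a sketch of the layer-averaging idea using the Hugging Face transformers version of Longformer (class names come from transformers, not from this repo's Longformer class; whether the resulting embeddings give useful similarities is exactly the caveat discussed above):

import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model = LongformerModel.from_pretrained('allenai/longformer-base-4096')
model.eval()

input_ids = tokenizer.encode('Some long document ...', return_tensors='pt')
with torch.no_grad():
    outputs = model(input_ids, output_hidden_states=True)

# hidden_states = (embedding layer, layer 1, ..., layer N); average the last 4 layers
last_four = torch.stack(outputs.hidden_states[-4:])  # [4, 1, seq_len, hidden]
token_embeddings = last_four.mean(dim=0)  # [1, seq_len, hidden]
doc_embedding = token_embeddings.mean(dim=1)  # [1, hidden], mean over tokens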
