Not able to use the embedding for calculating similarity. #43

Closed
titu1992 opened this issue May 25, 2020 · 10 comments

Comments

@titu1992

First of all, let me thank you for contributing this knowledge to us. It makes a lot of difference for beginners like me. :)
Now the issue: I was trying to use Longformer to calculate the similarity between a query and a list of paragraphs retrieved from my index search. The idea is to re-rank these paragraphs based on the cosine similarity between the embedding of the question and the embedding of each individual paragraph.

However, once I have calculated the embeddings of both the query and a paragraph using this code: SAMPLE_TEXT = f'{tokenizer.cls_token}{SAMPLE_TEXT}{tokenizer.eos_token}'
...................................
......................
output = model(input_ids, attention_mask=attention_mask)[0]

I get an embedding of dimension torch.Size([1, 512, 768]), and when I try to calculate the cosine similarity on these embeddings I get this error:
RuntimeError: Can't call numpy() on Variable that requires grad. Use var.detach().numpy() instead.

I do see that the error recommends using var.detach().numpy() instead of numpy(): https://stackoverflow.com/questions/55466298/pytorch-cant-call-numpy-on-variable-that-requires-grad-use-var-detach-num

However, I am unsure where I should add this call.
I am a beginner, so please pardon me if I have raised an issue unrelated to Longformer.

Thanks for the help :)

@ibeltagy
Collaborator

ibeltagy commented May 26, 2020

I am not sure how you are computing cosine similarities, but if you want to use pytorch tensors in numpy, as the error suggested, you need something like output.detach().numpy().
You can also compute cosine similarity using pytorch functions directly.

I get an embedding of dimension torch.Size([1, 512, 768])

This is one embedding per token. You probably still need to do output[:, 0]. It returns the embedding of the first token, the tokenizer.cls_token, which kind of works as an embedding for the whole document.
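
A minimal sketch of both options, with random tensors standing in for two model outputs (only the shapes come from this thread; the variable names are illustrative):

import torch
import torch.nn.functional as F

# Stand-ins for two model outputs of shape [1, seq_len, hidden_size]
output_a = torch.randn(1, 512, 768)
output_b = torch.randn(1, 512, 768)

# Take the first token (<s>, i.e. tokenizer.cls_token) as the document embedding
emb_a = output_a[:, 0]  # shape [1, 768]
emb_b = output_b[:, 0]

# Option 1: cosine similarity directly in PyTorch, no numpy conversion needed
print(F.cosine_similarity(emb_a, emb_b))

# Option 2: detach first if you want to work in numpy, as the error message suggests
emb_a_np = emb_a.detach().cpu().numpy()
emb_b_np = emb_b.detach().cpu().numpy()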

@youssefavx

youssefavx commented Jun 8, 2020

Hey @ibeltagy, thanks for the help! I tried this and did get a result, but for some reason it's quite a strange one. Quite dissimilar texts get a score above 90%, which is really unexpected (they seem to hover around 0.97, specifically). I wonder if I'm doing something wrong. Here is my code:

from numba import jit
import numpy as np
import torch
from longformer.longformer import Longformer, LongformerConfig
from longformer.sliding_chunks import pad_to_window_size
from transformers import RobertaTokenizer
import os
os.chdir('/Volumes/Transcend2/longformer')
config = LongformerConfig.from_pretrained('longformer-large-4096')
@jit(nopython=True)
def cosine_similarity_numba(u:np.ndarray, v:np.ndarray):
    assert(u.shape[0] == v.shape[0])
    uv = 0
    uu = 0
    vv = 0
    for i in range(u.shape[0]):
        uv += u[i]*v[i]
        uu += u[i]*u[i]
        vv += v[i]*v[i]
    cos_theta = 1
    if uu!=0 and vv!=0:
        cos_theta = uv/np.sqrt(uu*vv)
    return cos_theta

# choose the attention mode 'n2', 'tvm' or 'sliding_chunks'
# 'n2': for regular n2 attention
# 'tvm': a custom CUDA kernel implementation of our sliding window attention
# 'sliding_chunks': a PyTorch implementation of our sliding window attention
config.attention_mode = 'sliding_chunks'

model = Longformer.from_pretrained('longformer-large-4096/', config=config)
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
tokenizer.model_max_length = model.config.max_position_embeddings

f = open('bio.txt', 'r')
firsttext = f.read()

f2 = open('religious.txt', 'r')
secondtext = f2.read()
documents = [firsttext, secondtext]
embeddings = []
for doc in documents:
    input_ids = torch.tensor(tokenizer.encode(doc)).unsqueeze(0)  # batch of size 1


    # TVM code doesn't work on CPU. Uncomment this if `config.attention_mode = 'tvm'`
    # model = model.cuda(); input_ids = input_ids.cuda()

    # Attention mask values -- 0: no attention, 1: local attention, 2: global attention
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device) # initialize to local attention
    attention_mask[:, [1, 4, 21,]] =  2  # Set global attention based on the task. For example,
                                         # classification: the <s> token
                                         # QA: question tokens

    # padding seqlen to the nearest multiple of 512. Needed for the 'sliding_chunks' attention
    input_ids, attention_mask = pad_to_window_size(
            input_ids, attention_mask, config.attention_window[0], tokenizer.pad_token_id)

    output = model(input_ids, attention_mask=attention_mask)[0]
    print(len(output))
    print(type(output))

    embedding = output[:, 0]
    embeddings.append(embedding[0].detach().cpu().numpy())





print(cosine_similarity_numba(embeddings[0], embeddings[1]))

@FantasyCheese

Hi @ibeltagy, I'm also having the same issue: cosine similarity is extremely high for supposedly different articles; in my case it's around 0.98–0.99. My code is similar to @youssefavx's, taken from the README sample code with little modification. I'm using torch.nn.functional.cosine_similarity here, but other cosine similarity calculations gave the same result.
On the other hand, I thought maybe I should set global attention as the comment suggests, but I have no idea what the original [1, 4, 21] means or how I should modify it.

import torch
from longformer.longformer import Longformer, LongformerConfig
from longformer.sliding_chunks import pad_to_window_size
from transformers import RobertaTokenizer
import torch.nn.functional as F

config = LongformerConfig.from_pretrained('longformer-base-4096/')
# choose the attention mode 'n2', 'tvm' or 'sliding_chunks'
# 'n2': for regular n2 attention
# 'tvm': a custom CUDA kernel implementation of our sliding window attention
# 'sliding_chunks': a PyTorch implementation of our sliding window attention
config.attention_mode = 'sliding_chunks'

model = Longformer.from_pretrained('longformer-base-4096/', config=config)
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
tokenizer.model_max_length = model.config.max_position_embeddings


def embed(text: str):
        text = f'{tokenizer.cls_token}{text}{tokenizer.eos_token}'
        input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)  # batch of size 1

        # TVM code doesn't work on CPU. Uncomment this if `config.attention_mode = 'tvm'`
        # model = model.cuda(); input_ids = input_ids.cuda()

        # Attention mask values -- 0: no attention, 1: local attention, 2: global attention
        attention_mask = torch.ones(input_ids.shape, dtype=torch.long,
                            device=input_ids.device)  # initialize to local attention
        attention_mask[:, [1, 4, 21, ]] = 2  # Set global attention based on the task. For example,
        # classification: the <s> token
        # QA: question tokens

        # padding seqlen to the nearest multiple of 512. Needed for the 'sliding_chunks' attention
        input_ids, attention_mask = pad_to_window_size(
        input_ids, attention_mask, config.attention_window[0], tokenizer.pad_token_id)

        output = model(input_ids, attention_mask=attention_mask)[0]
        return output[:, 0]

# This is for better readability; I also tried random Medium articles and still had the same issue
SAMPLE_TEXT_1 = ' '.join(['Hello world! '] * 1000)
SAMPLE_TEXT_2 = ' '.join(['Foo Bar! '] * 1000)

embedding1 = embed(SAMPLE_TEXT_1)
embedding2 = embed(SAMPLE_TEXT_2)
print(F.cosine_similarity(embedding1, embedding2)) # 0.9956

@ibeltagy
Collaborator

ibeltagy commented Jun 10, 2020

I see. I think the problem is that RoBERTa (and similarly, the pretrained Longformer) wasn't trained on the next-sentence prediction task, so the model never learned to aggregate the input into the [CLS] token. Can you try the same example but with RoBERTa (and a shorter input) and see if you have the same problem?
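
A minimal sketch of such a check with plain roberta-base from the transformers library (class and method names come from transformers, not from this repo; the inputs are just placeholders):

import torch
import torch.nn.functional as F
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')
model.eval()

def cls_embedding(text):
    input_ids = tokenizer.encode(text, return_tensors='pt')  # adds <s> ... </s>
    with torch.no_grad():
        hidden = model(input_ids)[0]  # [1, seq_len, hidden_size]
    return hidden[:, 0]  # embedding of the <s> (CLS) token

emb1 = cls_embedding('Hello world! ' * 50)
emb2 = cls_embedding('Foo Bar! ' * 50)
print(F.cosine_similarity(emb1, emb2))  # if the issue is the same, this is also close to 1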

but I have no idea what the original [1, 4, 21] means or how I should modify it.

Sorry, the example is not clear. It means "put global attention on token positions 1, 4, and 21".
1, 4, and 21 are arbitrary positions chosen just for the demo. In your case, you only want global attention on the [CLS] token, which is at position 0.
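
In the embed() code above, that would mean replacing the global-attention line with something like this (a sketch based on the snippets in this thread; the placeholder input_ids is only there so the snippet stands alone):

import torch

input_ids = torch.randint(0, 50265, (1, 512))  # placeholder for the real tokenized input

# 1 = local attention everywhere
attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device)
# 2 = global attention, here only on position 0 (the <s>/CLS token)
attention_mask[:, 0] = 2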

@matt-peters

^^ this. During pretraining, the only loss that directly impacts the top layer <s> hidden state is the MLM loss. As the first element is always <s>, the model will learn to always output the same hidden state for this element, so we expect cosine similarity to always be approximately 1 for any two input sequences.

@FantasyCheese

Cool, big thanks for the explanation @ibeltagy and @matt-peters, and sorry for the late reply. I tried roberta-base and yes, it has the same issue.

If I understand correctly, sentence-transformers is trying to solve the issue you mentioned (aggregating the input into the [CLS] token)? And Longformer kind of upgrades pre-trained models to support long documents, but if the sentence embedding was bad, it would still be bad after the upgrade to a long version?

So what we're trying to do is semantic search for documents; in my use case they are employee resumes. I guess we need to somehow combine these two models, but what's the right procedure here? We have:

  1. Base pre-trained models (BERT, RoBERTa, ALBERT, ...)
  2. Task to improve sentence embeddings from sentence-transformers: training_nli.py
  3. Task to upgrade to a long version from longformer: convert_model_to_long.ipynb
  4. Task to train on our corpus (hoping to improve our domain-specific document embeddings)

And the questions are:
a. How do we choose the base pre-trained model?
b. What should the training order be? For example 1>4>2>3, 1>3>2>4, etc.
c. What task should we train on for step 4?

I just started learning NLP from scratch recently, so sorry for the many beginner questions, and thanks a lot for your help and contributions!

@youssefavx

youssefavx commented Jun 14, 2020

If I understand correctly, does this mean that it is not possible to use longformer to generate document embeddings that work with cosine similarity (or any other metric)?

@ibeltagy
Collaborator

Here are a few options sorted by expected performance:

  • if you have training data, you can fine-tune our pretrained longformer to learn the task

  • use our script to build BertLong, a version of BERT that works with long documents. BERT was trained with the next-sentence prediction objective, so it might do a reasonable job of aggregating the document into the CLS token.

  • average or max-pool the token embeddings to get a document embedding (see the sketch below)
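
A minimal sketch of the pooling option, with random tensors standing in for the model output and attention mask built earlier in this thread (positions where the mask is 0 are padding and are excluded):

import torch

# Stand-ins: model output [1, seq_len, hidden] and attention mask [1, seq_len]
output = torch.randn(1, 512, 768)
attention_mask = torch.ones(1, 512, dtype=torch.long)
attention_mask[:, 400:] = 0  # pretend the tail is padding

mask = (attention_mask > 0).unsqueeze(-1).float()  # [1, seq_len, 1], keeps local/global tokens

# Mean-pool over non-padding tokens
mean_embedding = (output * mask).sum(dim=1) / mask.sum(dim=1)  # [1, hidden]

# Max-pool over non-padding tokens
max_embedding = output.masked_fill(mask == 0, float('-inf')).max(dim=1).values  # [1, hidden]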

@ibeltagy
Collaborator

Looks like we addressed this issue. I will close it for now but please feel free to reopen or create a new one if you have more questions.

@pratikchhapolika

pratikchhapolika commented Jul 20, 2021

I am not sure how you are computing cosine similarities, but if you want to use pytorch tensors in numpy, as the error suggested, you need something like output.detach().numpy().
You can also compute cosine similarity using pytorch functions directly.

I get an embedding of dimension torch.Size([1, 512, 768])

This is one embedding per token. You probably still need to do output[:, 0]. It returns the embedding of the first token, the tokenizer.cls_token, which kind of works as an embedding for the whole document.

This was in the case of BERT. Can we do the same with Longformer? Or can we take the outputs of the last 4 hidden layers and do some sort of averaging to serve as the embedding?
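
For illustration, a sketch of the layer-averaging idea using the Hugging Face transformers version of Longformer (class names come from transformers, not from this repo's Longformer class; whether the resulting embeddings give useful similarities is exactly the caveat discussed above):

import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model = LongformerModel.from_pretrained('allenai/longformer-base-4096')
model.eval()

input_ids = tokenizer.encode('Some long document ...', return_tensors='pt')
with torch.no_grad():
    outputs = model(input_ids, output_hidden_states=True)

# hidden_states = (embedding layer, layer 1, ..., layer N); average the last 4 layers
last_four = torch.stack(outputs.hidden_states[-4:])  # [4, 1, seq_len, hidden]
token_embeddings = last_four.mean(dim=0)  # [1, seq_len, hidden]
doc_embedding = token_embeddings.mean(dim=1)  # [1, hidden], mean over tokens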
