
How to use different Transformer model for each sentence embedding #328

Closed
umairspn opened this issue Jul 28, 2020 · 5 comments

Comments

@umairspn

I am trying to use a different Transformer model for each sentence passed into the SBERT code.
e.g., for sent-A I would use Transformer1 (say, Bert-base-uncased), and
for sent-B I want to use a different Transformer2 (say, Bert-base-NLI).

The rest of the process stays the same: it generates embeddings of sent-A and sent-B using the two different Transformers, applies the pooling operation, and concatenates them.

Any help will be appreciated!
Thank You.

@nreimers
Member

Hi @umairspn
Do you really need two different transformers?

One thing that works well (and simplifies a lot of things) is to prepend your input with special tokens, e.g. your input A looks like:
[INP_A]My first sent
[INP_A]My second sent

and for Input B:
[INP_B]A sentence for B
[INP_B]Another sentence for B

All the inputs are fed to the same transformer network. The transformer network applies self-attention across all layers, so it will be able to learn what the differences are between inputs from A and inputs from B. Also, the information whether this is an input for A or an input for B will be available at every token.

This has several advantages:

  • Easy to implement: just prepend your inputs with special tokens and train like before (see the sketch after this list).
  • You only need one model => saves space on GPU & disk.
  • And most importantly: the single model learns jointly from A and B. If you have two separate transformers, each only sees inputs from either A or B. But with this approach, you get a "spill-over effect" which often leads to better performance.
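
For illustration, here is a rough sketch of what this could look like with the sentence-transformers training API. The [INP_A]/[INP_B] markers are just the illustrative tokens from above, and make_example is a hypothetical helper, not part of the library:

from sentence_transformers import InputExample

def make_example(sent_a, sent_b, label):
    # Prepend a marker so the shared encoder knows which "side" each text comes from
    return InputExample(texts=["[INP_A] " + sent_a, "[INP_B] " + sent_b], label=label)

train_examples = [
    make_example("My first sent", "A sentence for B", label=0),
    make_example("My second sent", "Another sentence for B", label=1),
]

The marker strings also need to be registered with the tokenizer (see the snippet further down in this thread), otherwise they get split into word pieces.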

Best
Nils Reimers

@umairspn
Author

Hi @nreimers,
Thank you very much for your response. I completely understand the idea you presented here, and I have already started working on this to see if it works in my case.
On second thought, however, do you think it's still possible to use two different transformers with the SBERT code?
The reason is that I am pairing sent-A (a simple sentence) with sent-B (completely different in nature), training with the SoftMax loss, and using the model further for entailment tasks. So I planned to use different transformer models in this case.

@nreimers
Member

nreimers commented Jul 28, 2020

Hi @umairspn
What you describe should be no problem with prepending different special tokens to the input.

This is done quite often in machine translation, where you have one transformer network that can translate in different directions, e.g. English to Spanish and French to Chinese. This is achieved by simply adding special tokens for your input / target language.

Training this single transformer network with special tokens greatly outperforms the setup where you have independent transformer networks for your different languages.

So even if sent-B is of completely different nature, I don't see why it shouldn't work.

Back to your question:
Quite extensive code changes would be needed if you want to pass inputs to different transformer networks. In your internal sentence feature representation, you would need to encode whether this is sentence A or sentence B.

Then, you would need another model class that contains two transformer networks. Based on the sentence feature input, it could then route the input to either model A or model B.

Your starting point would be to create a new class similar to models.Transformer
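
A very rough sketch of what such a routing module could look like (this only mimics the models.Transformer interface; the "source" key in the feature dict is a hypothetical flag you would have to attach to each sentence's features yourself):

from torch import nn
from transformers import AutoModel

class DualTransformer(nn.Module):
    """Holds two Hugging Face encoders and routes each batch to one of them."""

    def __init__(self, model_name_a, model_name_b):
        super().__init__()
        self.encoder_a = AutoModel.from_pretrained(model_name_a)
        self.encoder_b = AutoModel.from_pretrained(model_name_b)

    def forward(self, features):
        # "source" is an illustrative key marking whether this batch holds
        # sent-A or sent-B inputs; the data loading has to provide it.
        encoder = self.encoder_a if features["source"] == "a" else self.encoder_b
        output = encoder(input_ids=features["input_ids"],
                         attention_mask=features["attention_mask"])
        features["token_embeddings"] = output.last_hidden_state
        return features

The pooling and loss modules would then consume token_embeddings as usual; most of the work is in making the data loading attach the routing flag and the matching tokenizer output to each sentence.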

@umairspn
Author

@nreimers I get the overall idea and it makes much more sense now. I'll go with the prepending method and work my way up to the separate-transformers method if required.
Again, thank you very much for your quick response; it was quite helpful.
I'll close the issue now. Appreciate your help!

@nreimers
Member

For completeness, here are two papers that use this approach of adding special tokens:

mBART: https://arxiv.org/abs/2001.08210 - For machine translation
https://arxiv.org/abs/2004.13969 - For Information Retrieval (embeddings)

Here is a code example of how you can add new special tokens to the BERT tokenizer:
huggingface/transformers#1413

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")

print(len(tokenizer))  # 28996
tokenizer.add_tokens(["NEW_TOKEN"])
print(len(tokenizer))  # 28997

model.resize_token_embeddings(len(tokenizer))
# The new vector is added at the end of the embedding matrix

print(model.embeddings.word_embeddings.weight[-1, :])
# Randomly initialized vector for the new token

# The embedding weight is a leaf tensor that requires grad,
# so modify it in-place inside torch.no_grad()
with torch.no_grad():
    model.embeddings.word_embeddings.weight[-1, :] = torch.zeros([model.config.hidden_size])

print(model.embeddings.word_embeddings.weight[-1, :])
# Outputs a vector of zeros of shape [768]
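
Applied to the marker tokens from earlier in this thread, the same pattern would look roughly like this (the token names are just the examples used above):

tokenizer.add_tokens(["[INP_A]", "[INP_B]"])
model.resize_token_embeddings(len(tokenizer))

# Added tokens are kept intact instead of being split into word pieces
print(tokenizer.tokenize("[INP_A] My first sent"))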

umairspn reopened this Aug 4, 2020