
How to use different Transformer model for each sentence embedding #328

Closed
umairspn opened this issue Jul 28, 2020 · 5 comments

Comments

@umairspn

I am trying to use a different Transformer model for each sentence passed into the SBERT code.
e.g., for sent-A I would use Transformer1 (say, Bert-base-uncased), and
for sent-B I want to use a different Transformer2 (say, Bert-base-NLI).

The rest of the process stays the same: it generates embeddings of sent-A and sent-B using the two different Transformers, applies the pooling operation, and concatenates them.

Any help will be appreciated!
Thank You.

@nreimers
Member

Hi @umairspn
Do you really need two different transformers?

One thing that works well (and simplifies a lot of things) is to prepend your input with special tokens, e.g. your input A looks like:
[INP_A]My first sent
[INP_A]My second sent

and for Input B:
[INP_B]A sentence for B
[INP_B]Another sentence for B

All the inputs are fed to the same transformer network. The transformer network applies self-attention across all layers, so it will be able to learn what the differences are between inputs from A and inputs from B. Also, the information whether this is an input for A or an input for B will be available at every token.

This has several advantages:

  • Easy to implement: just prepend your inputs with special tokens and train like before (see the sketch after this list).
  • You only need one model => saves space on GPU & disk.
  • And most importantly: the single model learns jointly from A and B. If you have two separate transformers, each only sees inputs from either A or B. But with this approach, you get a "spill-over effect" which often leads to better performance.
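
For illustration, here is a rough sketch of what this could look like with the sentence-transformers training API. The [INP_A]/[INP_B] markers are just the illustrative tokens from above, and make_example is a hypothetical helper, not part of the library:

from sentence_transformers import InputExample

def make_example(sent_a, sent_b, label):
    # Prepend a marker so the shared encoder knows which "side" each text comes from
    return InputExample(texts=["[INP_A] " + sent_a, "[INP_B] " + sent_b], label=label)

train_examples = [
    make_example("My first sent", "A sentence for B", label=0),
    make_example("My second sent", "Another sentence for B", label=1),
]

The marker strings also need to be registered with the tokenizer (see the snippet further down in this thread), otherwise they get split into word pieces.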

Best
Nils Reimers

@umairspn
Author

Hi @nreimers,
Thank you very much for your response. I completely understand the idea you presented here, and I have already started working on this to see if it works in my case.
On second thought, however, do you think it's still possible to use two different transformers with the SBERT code?
The reason is that I am pairing sent-A (a simple sentence) with sent-B (completely different in nature), training with the SoftMax loss, and using the model further for entailment tasks. So I planned to use different transformer models in this case.

@nreimers
Member

nreimers commented Jul 28, 2020

Hi @umairspn
What you describe should be no problem with prepending different special tokens to the input.

This is done quite often in machine translation, where you have one transformer network that can translate in different directions, e.g. English to Spanish and French to Chinese. This is achieved by simply adding special tokens for your input / target language.

Training this single transformer network with special tokens greatly outperforms the setup where you have independent transformer networks for your different languages.

So even if sent-B is of completely different nature, I don't see why it shouldn't work.

Back to your question:
Quite extensive code changes would be needed if you want to pass inputs to different transformer networks. In your internal sentence feature representation, you would need to encode whether this is sentence A or sentence B.

Then, you would need another model class that contains two transformer networks. Based on the sentence feature input, it could then route the input to either model A or model B.

Your starting point would be to create a new class similar to models.Transformer
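
A very rough sketch of what such a routing module could look like (this only mimics the models.Transformer interface; the "source" key in the feature dict is a hypothetical flag you would have to attach to each sentence's features yourself):

from torch import nn
from transformers import AutoModel

class DualTransformer(nn.Module):
    """Holds two Hugging Face encoders and routes each batch to one of them."""

    def __init__(self, model_name_a, model_name_b):
        super().__init__()
        self.encoder_a = AutoModel.from_pretrained(model_name_a)
        self.encoder_b = AutoModel.from_pretrained(model_name_b)

    def forward(self, features):
        # "source" is an illustrative key marking whether this batch holds
        # sent-A or sent-B inputs; the data loading has to provide it.
        encoder = self.encoder_a if features["source"] == "a" else self.encoder_b
        output = encoder(input_ids=features["input_ids"],
                         attention_mask=features["attention_mask"])
        features["token_embeddings"] = output.last_hidden_state
        return features

The pooling and loss modules would then consume token_embeddings as usual; most of the work is in making the data loading attach the routing flag and the matching tokenizer output to each sentence.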

@umairspn
Author

@nreimers I get the overall idea and it makes much more sense now. I'll go with the prepending method and work my way up to the separate-transformers method if required.
Again, thank you very much for your quick response; it was quite helpful.
I'll close the issue now. Appreciate your help!

@nreimers
Member

For completeness, here are two papers that use this approach of adding special tokens:

mBART: https://arxiv.org/abs/2001.08210 - For machine translation
https://arxiv.org/abs/2004.13969 - For Information Retrieval (embeddings)

Here is a code example of how you can add new special tokens to the BERT tokenizer:
huggingface/transformers#1413

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")

print(len(tokenizer))  # 28996
tokenizer.add_tokens(["NEW_TOKEN"])
print(len(tokenizer))  # 28997

model.resize_token_embeddings(len(tokenizer))
# The new vector is added at the end of the embedding matrix

print(model.embeddings.word_embeddings.weight[-1, :])
# Randomly initialized vector for the new token

# The embedding weight is a leaf tensor that requires grad,
# so modify it in-place inside torch.no_grad()
with torch.no_grad():
    model.embeddings.word_embeddings.weight[-1, :] = torch.zeros([model.config.hidden_size])

print(model.embeddings.word_embeddings.weight[-1, :])
# Outputs a vector of zeros of shape [768]
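
Applied to the marker tokens from earlier in this thread, the same pattern would look roughly like this (the token names are just the examples used above):

tokenizer.add_tokens(["[INP_A]", "[INP_B]"])
model.resize_token_embeddings(len(tokenizer))

# Added tokens are kept intact instead of being split into word pieces
print(tokenizer.tokenize("[INP_A] My first sent"))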

umairspn reopened this Aug 4, 2020