The way that we are currently using Transformers models involves taking the base encoder and extracting the full set of hidden activations (across all layers). See link. We later separately pull out only the top layer, and extract the first-token representation if we're doing a single-vector task such as classification.
Because of this workflow, we do not end up using the pretrained pooler layers in the respective models, e.g. BERT and ALBERT (RoBERTa also inherits its pooler from BERT).
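For reference, a minimal sketch of that workflow (not the repo's actual code), assuming a recent Hugging Face `transformers` version with attribute-style model outputs and `bert-base-uncased` as an illustrative checkpoint:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; any BERT-style encoder would behave the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("An example sentence.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Full set of hidden activations: a tuple of (num_layers + 1) tensors,
# each of shape [batch, seq_len, hidden_size].
all_hidden_states = outputs.hidden_states

# For a single-vector task (e.g. classification), keep only the top layer
# and its first-token ([CLS]) representation ...
cls_vector = all_hidden_states[-1][:, 0, :]

# ... which bypasses the pretrained pooler (a dense + tanh layer applied to
# the same [CLS] position) that `pooler_output` exposes:
pooled = outputs.pooler_output
```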
https://arxiv.org/pdf/1903.05987.pdf found that a diagnostic classifier on fine-tuned BERT layers achieves similar performance from layers 9-12 (MRPC) and layers 5-12 (STS-B). See Figure 1 in the linked PDF.
This suggests that the pretrained layers at the top may not be that helpful for downstream sentence-pair classification tasks.
On the other hand, we do not expect this to be a major issue, as we have seen good results when fine-tuning with this setup across several works, e.g. https://arxiv.org/abs/1812.10860 and https://arxiv.org/abs/1905.00537.