Pretraining for sequence classification #7

Open

TahaAslani opened this issue Jun 6, 2021 · 1 comment

@TahaAslani

Hi,

I am implementing fine-tuning of exBERT for sequence classification, and I have already done the pretraining on my data. However, since the pre-training Python script you have provided is only for NER, I was wondering how I should implement tokenization. Should I just load the model and tokenize my text like this?

from exBERT import BertTokenizer, BertForSequenceClassification
model = BertForSequenceClassification('path_to_config_file_of_the_OFF_THE_SHELF_MODEL',
                                      'config_and_vocab/exBERT_no_ex_vocab/bert_config_ex_s3.json',
                                      len(list_of_labels))
tokenizer = BertTokenizer('path_to_off_the_shelf_model_vocab')

and then just use it as a regular Hugging Face model, or do I have to add extra handling for the new vocabulary (tokens that start with ##)?

Thanks for your help in advance!

@sonicrux commented Apr 1, 2022

This is the way I got classification to work -

# Get your imports
import torch
from exBERT import BertForSequenceClassification, BertConfig
from transformers import BertTokenizer



# Load in your config files 
bert_config_1 = BertConfig.from_json_file('path_to_off_the_shelf_config_file')
bert_config_2 = BertConfig.from_json_file('updated_config_file_with_new_vocab_size')

# Initialize your classification object
num_labels = 2
model = BertForSequenceClassification(bert_config_1, bert_config_2, num_labels=num_labels)

# Load in your pretrained state dict
model.load_state_dict(torch.load('path_to_state_dict_from_pretraining'), strict=False)
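# (Not in the original reply) strict=False silently skips any keys that don't match,
# so it can be worth inspecting what was actually loaded, e.g.:
#   result = model.load_state_dict(torch.load('path_to_state_dict_from_pretraining'), strict=False)
#   print(result.missing_keys, result.unexpected_keys)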

# Initialize tokenizer 
tokenizer = BertTokenizer(vocab_file='path_to_augmented_vocab.txt')
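# (Not in the original reply) no extra handling is needed for the added vocabulary:
# the standard WordPiece logic simply matches against whatever is in the vocab file,
# including any new '##' pieces. A quick sanity check with a made-up domain term:
#   print(tokenizer.tokenize('your_new_domain_term'))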

# Tokenize input
input_ids = []
attention_masks = []
for sentence in sentences:
    encoded_dict = tokenizer.encode_plus(
        sentence,                    # Sentence to encode.
        add_special_tokens=True,     # Add '[CLS]' and '[SEP]'.
        max_length=512,              # Pad & truncate all sentences.
        pad_to_max_length=True,
        return_attention_mask=True,  # Construct attention masks.
        return_tensors='pt',         # Return PyTorch tensors.
    )

    # Add the encoded sentence to the list.
    input_ids.append(encoded_dict['input_ids'])

    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])

# At this point you'll convert your input_ids, attention_masks and labels to pytorch tensors
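# For example (not part of the original reply; this assumes `labels` is a plain
# Python list of integer class ids):
#   input_ids = torch.cat(input_ids, dim=0)
#   attention_masks = torch.cat(attention_masks, dim=0)
#   labels = torch.tensor(labels)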

# Get model output
# You should probably batch-ify this
(loss, logits) = model(input_ids,
                       token_type_ids=None,
                       attention_mask=attention_masks,
                       labels=labels)
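
If you do want to batch-ify it, here is a minimal sketch using a standard PyTorch DataLoader (not from the original reply; the batch size is arbitrary, and it assumes input_ids, attention_masks and labels have already been converted to tensors as above):

from torch.utils.data import TensorDataset, DataLoader

dataset = TensorDataset(input_ids, attention_masks, labels)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

for batch_input_ids, batch_masks, batch_labels in loader:
    (loss, logits) = model(batch_input_ids,
                           token_type_ids=None,
                           attention_mask=batch_masks,
                           labels=batch_labels)
    # For fine-tuning, the backward pass and optimizer step would go here.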
