Hi,
I am implementing fine-tuning of exBERT for sequence classification. I have already done the pretraining on my data. However, since the pretraining Python script you provide is only for NER, I was wondering how I should implement tokenization. Should I just load the model and tokenize my text like this?
from exBERT import BertTokenizer, BertForSequenceClassification
model = BertForSequenceClassification('path_to_config_file_of_the_OFF_THE_SHELF_MODEL', 'config_and_vocab/exBERT_no_ex_vocab/bert_config_ex_s3.json', len(list_of_labels))
tokenizer = BertTokenizer('path_to_off_the_shelf_model_vocab')
and then just use it as a regular Hugging Face model, or do I have to add extra handling for the new vocabulary (tokens that start with ##)?
Thanks for your help in advance!
# Get your imports
import torch
from exBERT import BertForSequenceClassification, BertConfig
from transformers import BertTokenizer

# Load in your config files: the off-the-shelf BERT config and the updated config with the new vocab size
bert_config_1 = BertConfig.from_json_file('path_to_off_the_shelf_config_file')
bert_config_2 = BertConfig.from_json_file('updated_config_file_with_new_vocab_size')

# Initialize your classification model
num_labels = 2
model = BertForSequenceClassification(bert_config_1, bert_config_2, num_labels=num_labels)

# Load in your pretrained state dict
model.load_state_dict(torch.load('path_to_state_dict_from_pretraining'), strict=False)

# Initialize the tokenizer with the augmented vocabulary
tokenizer = BertTokenizer(vocab_file='path_to_augmented_vocab.txt')
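# (Optional sanity check, not part of the original answer: WordPiece handles the
#  augmented vocabulary automatically, so no extra code is needed for '##'
#  continuation tokens. 'your_new_domain_term' is a placeholder for any word
#  you added to the vocab file.)
print(tokenizer.tokenize('your_new_domain_term'))  # a single token if it is in the augmented vocab, '##' pieces otherwise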
# Tokenize input
input_ids = []
attention_masks = []
for sentence in sentences:
    encoded_dict = tokenizer.encode_plus(
        sentence,                     # Sentence to encode.
        add_special_tokens=True,      # Add '[CLS]' and '[SEP]'.
        max_length=512,               # Pad & truncate all sentences.
        pad_to_max_length=True,
        return_attention_mask=True,   # Construct attention masks.
        return_tensors='pt',          # Return PyTorch tensors.
    )
    # Add the encoded sentence to the list.
    input_ids.append(encoded_dict['input_ids'])
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])
# At this point you'll convert your input_ids, attention_masks and labels to PyTorch tensors.
# Get model output
# You should probably batch-ify this
(loss, logits) = model(input_ids,
                       token_type_ids=None,
                       attention_mask=attention_masks,
                       labels=labels)
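To make the "convert to tensors and batch-ify" step concrete, here is a minimal sketch using a PyTorch DataLoader. It assumes `labels` is a Python list of integer class ids aligned with `sentences`; the variable names and batch size are placeholders, not anything prescribed by the exBERT repo:

from torch.utils.data import TensorDataset, DataLoader

# Stack the per-sentence (1, 512) tensors collected above into single tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(labels)

# Wrap everything in a DataLoader so the forward pass runs on mini-batches
# instead of the whole dataset at once.
dataset = TensorDataset(input_ids, attention_masks, labels)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

for batch_input_ids, batch_attention_masks, batch_labels in loader:
    (loss, logits) = model(batch_input_ids,
                           token_type_ids=None,
                           attention_mask=batch_attention_masks,
                           labels=batch_labels)
    # During fine-tuning, loss.backward() and optimizer.step() would go here.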