This repository contains pre-trained BERT models for the Portuguese language. The BERT Base and BERT Large Cased variants were trained on BrWaC (Brazilian Web as Corpus), a large Portuguese corpus, for 1,000,000 steps using whole-word masking. Model artifacts for TensorFlow and PyTorch can be found below.
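Whole-word masking differs from the original BERT objective in that, when a word is selected for masking, all of its WordPiece tokens are masked together instead of independently. A minimal illustrative sketch of the idea (our own simplification, not the actual pre-training code; function name and rate are ours):

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token='[MASK]'):
    """Mask whole words at a time over a WordPiece token sequence."""
    # Group token indices into words: a piece starting with "##"
    # continues the previous word.
    words = []
    for i, token in enumerate(tokens):
        if token.startswith('##') and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    for word in words:
        if random.random() < mask_prob:
            for i in word:
                masked[i] = mask_token  # mask every piece of the chosen word
    return masked

# e.g. if "ca" is chosen in ['o', 'ca', '##minho'], "##minho" is masked too
```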
The models are a product of an ongoing Master's program. The Qualifying Exam text is also included in the repository in PDF format; it contains more details about the pre-training procedure, vocabulary generation, and downstream usage in the task of Named Entity Recognition.
Model | TensorFlow checkpoint | PyTorch checkpoint | Vocabulary |
---|---|---|---|
BERTimbau Base (aka `bert-base-portuguese-cased`) | Download | Download | Download |
BERTimbau Large (aka `bert-large-portuguese-cased`) | Download | Download | Download |
The models were benchmarked on three tasks (Semantic Textual Similarity, Recognizing Textual Entailment, and Named Entity Recognition) and compared to previously published results and Multilingual BERT (mBERT). Metrics are Pearson's correlation for STS and F1-score for RTE and NER.
Task | Test Dataset | BERTimbau-Large | BERTimbau-Base | mBERT | Previous SOTA |
---|---|---|---|---|---|
STS | ASSIN2 | 0.852 | 0.836 | 0.809 | 0.83 [1] |
RTE | ASSIN2 | 90.0 | 89.2 | 86.8 | 88.3 [1] |
NER | MiniHAREM (5 classes) | 83.7 | 83.1 | 79.2 | 82.3 [2] |
NER | MiniHAREM (10 classes) | 78.5 | 77.6 | 73.1 | 74.6 [2] |
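For reference, the reported metrics can be computed with common libraries; a hedged sketch below (the actual evaluation scripts may differ, and all inputs are hypothetical):

```python
from scipy.stats import pearsonr
from sklearn.metrics import f1_score

# STS (ASSIN2): Pearson's correlation between predicted and gold scores
sts_pred = [3.2, 4.8, 1.1]  # hypothetical model outputs
sts_gold = [3.0, 5.0, 1.5]  # hypothetical gold similarity scores
pearson_r, _ = pearsonr(sts_pred, sts_gold)

# RTE (ASSIN2): F1-score over entailment labels
rte_pred = [1, 0, 1]  # hypothetical predictions
rte_gold = [1, 0, 0]
rte_f1 = f1_score(rte_gold, rte_pred)

# NER: entity-level F1 is typically computed with a sequence-labeling
# metric such as seqeval's f1_score over BIO-tagged spans.
```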
Code and instructions to reproduce the Named Entity Recognition experiments are in the `ner_evaluation/` directory.
Our PyTorch artifacts are compatible with the 🤗 Hugging Face Transformers library and are also available as community models:
```python
from transformers import AutoModel, AutoTokenizer

# Using the community models

# BERT Base
tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-base-portuguese-cased')
model = AutoModel.from_pretrained('neuralmind/bert-base-portuguese-cased')

# BERT Large
tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-large-portuguese-cased')
model = AutoModel.from_pretrained('neuralmind/bert-large-portuguese-cased')

# Or, using BertModel and BertTokenizer directly on local files
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('path/to/vocab.txt', do_lower_case=False)
model = BertModel.from_pretrained('path/to/bert_dir')  # or another BERT model class
```
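Once loaded, the model can produce contextual embeddings for a sentence. A minimal sketch, assuming the tokenizer and model loaded above (the example sentence is our own):

```python
import torch

# Encode a sentence into input IDs (adds [CLS] and [SEP] automatically)
input_ids = tokenizer.encode('Tinha uma pedra no meio do caminho.',
                             return_tensors='pt')

with torch.no_grad():
    outputs = model(input_ids)
    # First element holds the last hidden states: (batch, seq_len, hidden_size)
    last_hidden_states = outputs[0]
```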
We would like to thank Google for the Cloud credits, provided under a research grant, that allowed us to train these models.
[1] Multilingual Transformer Ensembles for Portuguese Natural Language Tasks
[2] Assessing the Impact of Contextual Embeddings for Portuguese Named Entity Recognition
```bibtex
@inproceedings{souza2020bertimbau,
  author    = {Souza, F{\'a}bio and Nogueira, Rodrigo and Lotufo, Roberto},
  title     = {{BERT}imbau: pretrained {BERT} models for {B}razilian {P}ortuguese},
  booktitle = {9th Brazilian Conference on Intelligent Systems, {BRACIS}, Rio Grande do Sul, Brazil, October 20-23 (to appear)},
  year      = {2020}
}

@article{souza2019portuguese,
  title   = {Portuguese Named Entity Recognition using BERT-CRF},
  author  = {Souza, F{\'a}bio and Nogueira, Rodrigo and Lotufo, Roberto},
  journal = {arXiv preprint arXiv:1909.10649},
  url     = {http://arxiv.org/abs/1909.10649},
  year    = {2019}
}
```