From 772871e03ea568393e2cc27142338eede719193c Mon Sep 17 00:00:00 2001 From: Stefan Schweter Date: Thu, 5 Nov 2020 22:44:37 +0100 Subject: [PATCH] =?UTF-8?q?[model=5Fcards]=20Update=20Italian=20BERT=20mod?= =?UTF-8?q?els=20and=20introduce=20new=20Italian=20XXL=20ELECTRA=20model?= =?UTF-8?q?=20=F0=9F=8E=89?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../dbmdz/bert-base-italian-cased/README.md | 60 +++++++--- .../dbmdz/bert-base-italian-uncased/README.md | 60 +++++++--- .../bert-base-italian-xxl-cased/README.md | 60 +++++++--- .../bert-base-italian-xxl-uncased/README.md | 60 +++++++--- .../README.md | 110 ++++++++++++++++++ .../README.md | 110 ++++++++++++++++++ 6 files changed, 404 insertions(+), 56 deletions(-) create mode 100644 model_cards/dbmdz/electra-base-italian-xxl-cased-discriminator/README.md create mode 100644 model_cards/dbmdz/electra-base-italian-xxl-cased-generator/README.md diff --git a/model_cards/dbmdz/bert-base-italian-cased/README.md b/model_cards/dbmdz/bert-base-italian-cased/README.md index dbe1e5587674a4..43c9de3da0c6e2 100644 --- a/model_cards/dbmdz/bert-base-italian-cased/README.md +++ b/model_cards/dbmdz/bert-base-italian-cased/README.md @@ -1,12 +1,14 @@ --- language: it license: mit +datasets: +- wikipedia --- -# 🤗 + 📚 dbmdz BERT models +# 🤗 + 📚 dbmdz BERT and ELECTRA models In this repository the MDZ Digital Library team (dbmdz) at the Bavarian State -Library open sources Italian BERT models 🎉 +Library open sources Italian BERT and ELECTRA models 🎉 # Italian BERT @@ -22,23 +24,35 @@ For the XXL Italian models, we use the same training data from OPUS and extend it with data from the Italian part of the [OSCAR corpus](https://traces1.inria.fr/oscar/). Thus, the final training corpus has a size of 81GB and 13,138,379,147 tokens. +Note: Unfortunately, a wrong vocab size was used when training the XXL models. 
+This explains the mismatch between the "real" vocab size of 31102 and the +vocab size specified in `config.json`. However, the model is working and all +evaluations were done under those circumstances. +See [this issue](https://github.com/dbmdz/berts/issues/7) for more information. + +The Italian ELECTRA model was trained on the "XXL" corpus for 1M steps in total using a batch +size of 128. We closely followed the ELECTRA training procedure used for +[BERTurk](https://github.com/stefan-it/turkish-bert/tree/master/electra). + ## Model weights Currently only PyTorch-[Transformers](https://github.com/huggingface/transformers) compatible weights are available. If you need access to TensorFlow checkpoints, please raise an issue! -| Model | Downloads -| --------------------------------------- | --------------------------------------------------------------------------------------------------------------- -| `dbmdz/bert-base-italian-cased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/vocab.txt) -| `dbmdz/bert-base-italian-uncased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/vocab.txt) -| `dbmdz/bert-base-italian-xxl-cased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/vocab.txt) -| `dbmdz/bert-base-italian-xxl-uncased` | 
[`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/vocab.txt) +| Model | Downloads +| ---------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- +| `dbmdz/bert-base-italian-cased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/vocab.txt) +| `dbmdz/bert-base-italian-uncased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/vocab.txt) +| `dbmdz/bert-base-italian-xxl-cased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/vocab.txt) +| `dbmdz/bert-base-italian-xxl-uncased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/vocab.txt) +| `dbmdz/electra-base-italian-xxl-cased-discriminator` | [`config.json`](https://s3.amazonaws.com/models.huggingface.co/bert/dbmdz/electra-base-italian-xxl-cased-discriminator/config.json) • 
[`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/electra-base-italian-xxl-cased-discriminator/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/electra-base-italian-xxl-cased-discriminator/vocab.txt) +| `dbmdz/electra-base-italian-xxl-cased-generator` | [`config.json`](https://s3.amazonaws.com/models.huggingface.co/bert/dbmdz/electra-base-italian-xxl-cased-generator/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/electra-base-italian-xxl-cased-generator/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/electra-base-italian-xxl-cased-generator/vocab.txt) ## Results For results on downstream tasks like NER or PoS tagging, please refer to -[this repository](https://github.com/stefan-it/fine-tuned-berts-seq). +[this repository](https://github.com/stefan-it/italian-bertelectra). ## Usage @@ -47,8 +61,11 @@ With Transformers >= 2.3 our Italian BERT models can be loaded like: ```python from transformers import AutoModel, AutoTokenizer -tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-cased") -model = AutoModel.from_pretrained("dbmdz/bert-base-italian-cased") +model_name = "dbmdz/bert-base-italian-cased" + +tokenizer = AutoTokenizer.from_pretrained(model_name) + +model = AutoModel.from_pretrained(model_name) ``` To load the (recommended) Italian XXL BERT models, just use: @@ -56,8 +73,23 @@ To load the (recommended) Italian XXL BERT models, just use: ```python from transformers import AutoModel, AutoTokenizer -tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased") -model = AutoModel.from_pretrained("dbmdz/bert-base-italian-xxl-cased") +model_name = "dbmdz/bert-base-italian-xxl-cased" + +tokenizer = AutoTokenizer.from_pretrained(model_name) + +model = AutoModel.from_pretrained(model_name) +``` + +To load the Italian XXL ELECTRA model (discriminator), just use: + +```python +from transformers import AutoModel, AutoTokenizer + +model_name = 
"dbmdz/electra-base-italian-xxl-cased-discriminator" + +tokenizer = AutoTokenizer.from_pretrained(model_name) + +model = AutoModel.from_pretrained(model_name) ``` # Huggingface model hub @@ -66,7 +98,7 @@ All models are available on the [Huggingface model hub](https://huggingface.co/d # Contact (Bugs, Feedback, Contribution and more) -For questions about our BERT models just open an issue +For questions about our BERT/ELECTRA models just open an issue [here](https://github.com/dbmdz/berts/issues/new) 🤗 # Acknowledgments diff --git a/model_cards/dbmdz/bert-base-italian-uncased/README.md b/model_cards/dbmdz/bert-base-italian-uncased/README.md index dbe1e5587674a4..43c9de3da0c6e2 100644 --- a/model_cards/dbmdz/bert-base-italian-uncased/README.md +++ b/model_cards/dbmdz/bert-base-italian-uncased/README.md @@ -1,12 +1,14 @@ --- language: it license: mit +datasets: +- wikipedia --- -# 🤗 + 📚 dbmdz BERT models +# 🤗 + 📚 dbmdz BERT and ELECTRA models In this repository the MDZ Digital Library team (dbmdz) at the Bavarian State -Library open sources Italian BERT models 🎉 +Library open sources Italian BERT and ELECTRA models 🎉 # Italian BERT @@ -22,23 +24,35 @@ For the XXL Italian models, we use the same training data from OPUS and extend it with data from the Italian part of the [OSCAR corpus](https://traces1.inria.fr/oscar/). Thus, the final training corpus has a size of 81GB and 13,138,379,147 tokens. +Note: Unfortunately, a wrong vocab size was used when training the XXL models. +This explains the mismatch between the "real" vocab size of 31102 and the +vocab size specified in `config.json`. However, the model is working and all +evaluations were done under those circumstances. +See [this issue](https://github.com/dbmdz/berts/issues/7) for more information. + +The Italian ELECTRA model was trained on the "XXL" corpus for 1M steps in total using a batch +size of 128. 
We closely followed the ELECTRA training procedure used for +[BERTurk](https://github.com/stefan-it/turkish-bert/tree/master/electra). + ## Model weights Currently only PyTorch-[Transformers](https://github.com/huggingface/transformers) compatible weights are available. If you need access to TensorFlow checkpoints, please raise an issue! -| Model | Downloads -| --------------------------------------- | --------------------------------------------------------------------------------------------------------------- -| `dbmdz/bert-base-italian-cased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/vocab.txt) -| `dbmdz/bert-base-italian-uncased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/vocab.txt) -| `dbmdz/bert-base-italian-xxl-cased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/vocab.txt) -| `dbmdz/bert-base-italian-xxl-uncased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/vocab.txt) +| Model | Downloads +| ---------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- +| 
`dbmdz/bert-base-italian-cased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/vocab.txt) +| `dbmdz/bert-base-italian-uncased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/vocab.txt) +| `dbmdz/bert-base-italian-xxl-cased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/vocab.txt) +| `dbmdz/bert-base-italian-xxl-uncased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/vocab.txt) +| `dbmdz/electra-base-italian-xxl-cased-discriminator` | [`config.json`](https://s3.amazonaws.com/models.huggingface.co/bert/dbmdz/electra-base-italian-xxl-cased-discriminator/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/electra-base-italian-xxl-cased-discriminator/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/electra-base-italian-xxl-cased-discriminator/vocab.txt) +| `dbmdz/electra-base-italian-xxl-cased-generator` | [`config.json`](https://s3.amazonaws.com/models.huggingface.co/bert/dbmdz/electra-base-italian-xxl-cased-generator/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/electra-base-italian-xxl-cased-generator/pytorch_model.bin) • 
[`vocab.txt`](https://cdn.huggingface.co/dbmdz/electra-base-italian-xxl-cased-generator/vocab.txt) ## Results For results on downstream tasks like NER or PoS tagging, please refer to -[this repository](https://github.com/stefan-it/fine-tuned-berts-seq). +[this repository](https://github.com/stefan-it/italian-bertelectra). ## Usage @@ -47,8 +61,11 @@ With Transformers >= 2.3 our Italian BERT models can be loaded like: ```python from transformers import AutoModel, AutoTokenizer -tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-cased") -model = AutoModel.from_pretrained("dbmdz/bert-base-italian-cased") +model_name = "dbmdz/bert-base-italian-cased" + +tokenizer = AutoTokenizer.from_pretrained(model_name) + +model = AutoModel.from_pretrained(model_name) ``` To load the (recommended) Italian XXL BERT models, just use: @@ -56,8 +73,23 @@ To load the (recommended) Italian XXL BERT models, just use: ```python from transformers import AutoModel, AutoTokenizer -tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased") -model = AutoModel.from_pretrained("dbmdz/bert-base-italian-xxl-cased") +model_name = "dbmdz/bert-base-italian-xxl-cased" + +tokenizer = AutoTokenizer.from_pretrained(model_name) + +model = AutoModel.from_pretrained(model_name) +``` + +To load the Italian XXL ELECTRA model (discriminator), just use: + +```python +from transformers import AutoModel, AutoTokenizer + +model_name = "dbmdz/electra-base-italian-xxl-cased-discriminator" + +tokenizer = AutoTokenizer.from_pretrained(model_name) + +model = AutoModel.from_pretrained(model_name) ``` # Huggingface model hub @@ -66,7 +98,7 @@ All models are available on the [Huggingface model hub](https://huggingface.co/d # Contact (Bugs, Feedback, Contribution and more) -For questions about our BERT models just open an issue +For questions about our BERT/ELECTRA models just open an issue [here](https://github.com/dbmdz/berts/issues/new) 🤗 # Acknowledgments diff --git 
a/model_cards/dbmdz/bert-base-italian-xxl-cased/README.md b/model_cards/dbmdz/bert-base-italian-xxl-cased/README.md index dbe1e5587674a4..43c9de3da0c6e2 100644 --- a/model_cards/dbmdz/bert-base-italian-xxl-cased/README.md +++ b/model_cards/dbmdz/bert-base-italian-xxl-cased/README.md @@ -1,12 +1,14 @@ --- language: it license: mit +datasets: +- wikipedia --- -# 🤗 + 📚 dbmdz BERT models +# 🤗 + 📚 dbmdz BERT and ELECTRA models In this repository the MDZ Digital Library team (dbmdz) at the Bavarian State -Library open sources Italian BERT models 🎉 +Library open sources Italian BERT and ELECTRA models 🎉 # Italian BERT @@ -22,23 +24,35 @@ For the XXL Italian models, we use the same training data from OPUS and extend it with data from the Italian part of the [OSCAR corpus](https://traces1.inria.fr/oscar/). Thus, the final training corpus has a size of 81GB and 13,138,379,147 tokens. +Note: Unfortunately, a wrong vocab size was used when training the XXL models. +This explains the mismatch between the "real" vocab size of 31102 and the +vocab size specified in `config.json`. However, the model is working and all +evaluations were done under those circumstances. +See [this issue](https://github.com/dbmdz/berts/issues/7) for more information. + +The Italian ELECTRA model was trained on the "XXL" corpus for 1M steps in total using a batch +size of 128. We closely followed the ELECTRA training procedure used for +[BERTurk](https://github.com/stefan-it/turkish-bert/tree/master/electra). + ## Model weights Currently only PyTorch-[Transformers](https://github.com/huggingface/transformers) compatible weights are available. If you need access to TensorFlow checkpoints, please raise an issue! 
-| Model | Downloads -| --------------------------------------- | --------------------------------------------------------------------------------------------------------------- -| `dbmdz/bert-base-italian-cased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/vocab.txt) -| `dbmdz/bert-base-italian-uncased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/vocab.txt) -| `dbmdz/bert-base-italian-xxl-cased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/vocab.txt) -| `dbmdz/bert-base-italian-xxl-uncased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/vocab.txt) +| Model | Downloads +| ---------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- +| `dbmdz/bert-base-italian-cased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/vocab.txt) +| `dbmdz/bert-base-italian-uncased` | 
[`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/vocab.txt) +| `dbmdz/bert-base-italian-xxl-cased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/vocab.txt) +| `dbmdz/bert-base-italian-xxl-uncased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/vocab.txt) +| `dbmdz/electra-base-italian-xxl-cased-discriminator` | [`config.json`](https://s3.amazonaws.com/models.huggingface.co/bert/dbmdz/electra-base-italian-xxl-cased-discriminator/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/electra-base-italian-xxl-cased-discriminator/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/electra-base-italian-xxl-cased-discriminator/vocab.txt) +| `dbmdz/electra-base-italian-xxl-cased-generator` | [`config.json`](https://s3.amazonaws.com/models.huggingface.co/bert/dbmdz/electra-base-italian-xxl-cased-generator/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/electra-base-italian-xxl-cased-generator/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/electra-base-italian-xxl-cased-generator/vocab.txt) ## Results For results on downstream tasks like NER or PoS tagging, please refer to -[this repository](https://github.com/stefan-it/fine-tuned-berts-seq). +[this repository](https://github.com/stefan-it/italian-bertelectra). 
## Usage @@ -47,8 +61,11 @@ With Transformers >= 2.3 our Italian BERT models can be loaded like: ```python from transformers import AutoModel, AutoTokenizer -tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-cased") -model = AutoModel.from_pretrained("dbmdz/bert-base-italian-cased") +model_name = "dbmdz/bert-base-italian-cased" + +tokenizer = AutoTokenizer.from_pretrained(model_name) + +model = AutoModel.from_pretrained(model_name) ``` To load the (recommended) Italian XXL BERT models, just use: @@ -56,8 +73,23 @@ To load the (recommended) Italian XXL BERT models, just use: ```python from transformers import AutoModel, AutoTokenizer -tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased") -model = AutoModel.from_pretrained("dbmdz/bert-base-italian-xxl-cased") +model_name = "dbmdz/bert-base-italian-xxl-cased" + +tokenizer = AutoTokenizer.from_pretrained(model_name) + +model = AutoModel.from_pretrained(model_name) +``` + +To load the Italian XXL ELECTRA model (discriminator), just use: + +```python +from transformers import AutoModel, AutoTokenizer + +model_name = "dbmdz/electra-base-italian-xxl-cased-discriminator" + +tokenizer = AutoTokenizer.from_pretrained(model_name) + +model = AutoModel.from_pretrained(model_name) ``` # Huggingface model hub @@ -66,7 +98,7 @@ All models are available on the [Huggingface model hub](https://huggingface.co/d # Contact (Bugs, Feedback, Contribution and more) -For questions about our BERT models just open an issue +For questions about our BERT/ELECTRA models just open an issue [here](https://github.com/dbmdz/berts/issues/new) 🤗 # Acknowledgments diff --git a/model_cards/dbmdz/bert-base-italian-xxl-uncased/README.md b/model_cards/dbmdz/bert-base-italian-xxl-uncased/README.md index dbe1e5587674a4..43c9de3da0c6e2 100644 --- a/model_cards/dbmdz/bert-base-italian-xxl-uncased/README.md +++ b/model_cards/dbmdz/bert-base-italian-xxl-uncased/README.md @@ -1,12 +1,14 @@ --- language: it 
license: mit +datasets: +- wikipedia --- -# 🤗 + 📚 dbmdz BERT models +# 🤗 + 📚 dbmdz BERT and ELECTRA models In this repository the MDZ Digital Library team (dbmdz) at the Bavarian State -Library open sources Italian BERT models 🎉 +Library open sources Italian BERT and ELECTRA models 🎉 # Italian BERT @@ -22,23 +24,35 @@ For the XXL Italian models, we use the same training data from OPUS and extend it with data from the Italian part of the [OSCAR corpus](https://traces1.inria.fr/oscar/). Thus, the final training corpus has a size of 81GB and 13,138,379,147 tokens. +Note: Unfortunately, a wrong vocab size was used when training the XXL models. +This explains the mismatch between the "real" vocab size of 31102 and the +vocab size specified in `config.json`. However, the model is working and all +evaluations were done under those circumstances. +See [this issue](https://github.com/dbmdz/berts/issues/7) for more information. + +The Italian ELECTRA model was trained on the "XXL" corpus for 1M steps in total using a batch +size of 128. We closely followed the ELECTRA training procedure used for +[BERTurk](https://github.com/stefan-it/turkish-bert/tree/master/electra). + ## Model weights Currently only PyTorch-[Transformers](https://github.com/huggingface/transformers) compatible weights are available. If you need access to TensorFlow checkpoints, please raise an issue! 
-| Model | Downloads -| --------------------------------------- | --------------------------------------------------------------------------------------------------------------- -| `dbmdz/bert-base-italian-cased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/vocab.txt) -| `dbmdz/bert-base-italian-uncased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/vocab.txt) -| `dbmdz/bert-base-italian-xxl-cased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/vocab.txt) -| `dbmdz/bert-base-italian-xxl-uncased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/vocab.txt) +| Model | Downloads +| ---------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- +| `dbmdz/bert-base-italian-cased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/vocab.txt) +| `dbmdz/bert-base-italian-uncased` | 
[`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/vocab.txt) +| `dbmdz/bert-base-italian-xxl-cased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/vocab.txt) +| `dbmdz/bert-base-italian-xxl-uncased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/vocab.txt) +| `dbmdz/electra-base-italian-xxl-cased-discriminator` | [`config.json`](https://s3.amazonaws.com/models.huggingface.co/bert/dbmdz/electra-base-italian-xxl-cased-discriminator/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/electra-base-italian-xxl-cased-discriminator/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/electra-base-italian-xxl-cased-discriminator/vocab.txt) +| `dbmdz/electra-base-italian-xxl-cased-generator` | [`config.json`](https://s3.amazonaws.com/models.huggingface.co/bert/dbmdz/electra-base-italian-xxl-cased-generator/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/electra-base-italian-xxl-cased-generator/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/electra-base-italian-xxl-cased-generator/vocab.txt) ## Results For results on downstream tasks like NER or PoS tagging, please refer to -[this repository](https://github.com/stefan-it/fine-tuned-berts-seq). +[this repository](https://github.com/stefan-it/italian-bertelectra). 
## Usage @@ -47,8 +61,11 @@ With Transformers >= 2.3 our Italian BERT models can be loaded like: ```python from transformers import AutoModel, AutoTokenizer -tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-cased") -model = AutoModel.from_pretrained("dbmdz/bert-base-italian-cased") +model_name = "dbmdz/bert-base-italian-cased" + +tokenizer = AutoTokenizer.from_pretrained(model_name) + +model = AutoModel.from_pretrained(model_name) ``` To load the (recommended) Italian XXL BERT models, just use: @@ -56,8 +73,23 @@ To load the (recommended) Italian XXL BERT models, just use: ```python from transformers import AutoModel, AutoTokenizer -tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased") -model = AutoModel.from_pretrained("dbmdz/bert-base-italian-xxl-cased") +model_name = "dbmdz/bert-base-italian-xxl-cased" + +tokenizer = AutoTokenizer.from_pretrained(model_name) + +model = AutoModel.from_pretrained(model_name) +``` + +To load the Italian XXL ELECTRA model (discriminator), just use: + +```python +from transformers import AutoModel, AutoTokenizer + +model_name = "dbmdz/electra-base-italian-xxl-cased-discriminator" + +tokenizer = AutoTokenizer.from_pretrained(model_name) + +model = AutoModel.from_pretrained(model_name) ``` # Huggingface model hub @@ -66,7 +98,7 @@ All models are available on the [Huggingface model hub](https://huggingface.co/d # Contact (Bugs, Feedback, Contribution and more) -For questions about our BERT models just open an issue +For questions about our BERT/ELECTRA models just open an issue [here](https://github.com/dbmdz/berts/issues/new) 🤗 # Acknowledgments diff --git a/model_cards/dbmdz/electra-base-italian-xxl-cased-discriminator/README.md b/model_cards/dbmdz/electra-base-italian-xxl-cased-discriminator/README.md new file mode 100644 index 00000000000000..43c9de3da0c6e2 --- /dev/null +++ b/model_cards/dbmdz/electra-base-italian-xxl-cased-discriminator/README.md @@ -0,0 +1,110 @@ +--- 
+language: it +license: mit +datasets: +- wikipedia +--- + +# 🤗 + 📚 dbmdz BERT and ELECTRA models + +In this repository the MDZ Digital Library team (dbmdz) at the Bavarian State +Library open sources Italian BERT and ELECTRA models 🎉 + +# Italian BERT + +The source data for the Italian BERT model consists of a recent Wikipedia dump and +various texts from the [OPUS corpora](http://opus.nlpl.eu/) collection. The final +training corpus has a size of 13GB and 2,050,057,573 tokens. + +For sentence splitting, we use NLTK (faster than spacy). +Our cased and uncased models are trained with an initial sequence length of 512 +subwords for ~2-3M steps. + +For the XXL Italian models, we use the same training data from OPUS and extend +it with data from the Italian part of the [OSCAR corpus](https://traces1.inria.fr/oscar/). +Thus, the final training corpus has a size of 81GB and 13,138,379,147 tokens. + +Note: Unfortunately, a wrong vocab size was used when training the XXL models. +This explains the mismatch between the "real" vocab size of 31102 and the +vocab size specified in `config.json`. However, the model is working and all +evaluations were done under those circumstances. +See [this issue](https://github.com/dbmdz/berts/issues/7) for more information. + +The Italian ELECTRA model was trained on the "XXL" corpus for 1M steps in total using a batch +size of 128. We closely followed the ELECTRA training procedure used for +[BERTurk](https://github.com/stefan-it/turkish-bert/tree/master/electra). + +## Model weights + +Currently only PyTorch-[Transformers](https://github.com/huggingface/transformers) +compatible weights are available. If you need access to TensorFlow checkpoints, +please raise an issue! 
+ +| Model | Downloads +| ---------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- +| `dbmdz/bert-base-italian-cased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/vocab.txt) +| `dbmdz/bert-base-italian-uncased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/vocab.txt) +| `dbmdz/bert-base-italian-xxl-cased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/vocab.txt) +| `dbmdz/bert-base-italian-xxl-uncased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/vocab.txt) +| `dbmdz/electra-base-italian-xxl-cased-discriminator` | [`config.json`](https://s3.amazonaws.com/models.huggingface.co/bert/dbmdz/electra-base-italian-xxl-cased-discriminator/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/electra-base-italian-xxl-cased-discriminator/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/electra-base-italian-xxl-cased-discriminator/vocab.txt) +| `dbmdz/electra-base-italian-xxl-cased-generator` | 
[`config.json`](https://s3.amazonaws.com/models.huggingface.co/bert/dbmdz/electra-base-italian-xxl-cased-generator/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/electra-base-italian-xxl-cased-generator/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/electra-base-italian-xxl-cased-generator/vocab.txt)
+
+## Results
+
+For results on downstream tasks like NER or PoS tagging, please refer to
+[this repository](https://github.com/stefan-it/italian-bertelectra).
+
+## Usage
+
+With Transformers >= 2.3 our Italian BERT models can be loaded like:
+
+```python
+from transformers import AutoModel, AutoTokenizer
+
+model_name = "dbmdz/bert-base-italian-cased"
+
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+
+model = AutoModel.from_pretrained(model_name)
+```
+
+To load the (recommended) Italian XXL BERT models, just use:
+
+```python
+from transformers import AutoModel, AutoTokenizer
+
+model_name = "dbmdz/bert-base-italian-xxl-cased"
+
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+
+model = AutoModel.from_pretrained(model_name)
+```
+
+To load the Italian XXL ELECTRA model (discriminator), just use:
+
+```python
+from transformers import AutoModel, AutoTokenizer
+
+model_name = "dbmdz/electra-base-italian-xxl-cased-discriminator"
+
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+
+model = AutoModel.from_pretrained(model_name)
+```
+
+# Huggingface model hub
+
+All models are available on the [Huggingface model hub](https://huggingface.co/dbmdz).
+
+# Contact (Bugs, Feedback, Contribution and more)
+
+For questions about our BERT/ELECTRA models, just open an issue
+[here](https://github.com/dbmdz/berts/issues/new) 🤗
+
+# Acknowledgments
+
+Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
+Thanks for providing access to the TFRC ❤️
+
+Thanks to the generous support from the [Hugging Face](https://huggingface.co/) team,
+it is possible to download both cased and uncased models from their S3 storage 🤗
diff --git a/model_cards/dbmdz/electra-base-italian-xxl-cased-generator/README.md b/model_cards/dbmdz/electra-base-italian-xxl-cased-generator/README.md
new file mode 100644
index 00000000000000..43c9de3da0c6e2
--- /dev/null
+++ b/model_cards/dbmdz/electra-base-italian-xxl-cased-generator/README.md
@@ -0,0 +1,110 @@
+---
+language: it
+license: mit
+datasets:
+- wikipedia
+---
+
+# 🤗 + 📚 dbmdz BERT and ELECTRA models
+
+In this repository the MDZ Digital Library team (dbmdz) at the Bavarian State
+Library open-sources Italian BERT and ELECTRA models 🎉
+
+# Italian BERT
+
+The source data for the Italian BERT model consists of a recent Wikipedia dump and
+various texts from the [OPUS corpora](http://opus.nlpl.eu/) collection. The final
+training corpus has a size of 13GB and 2,050,057,573 tokens.
+
+For sentence splitting, we use NLTK (it is faster than spaCy).
+Our cased and uncased models are trained with an initial sequence length of 512
+subwords for ~2-3M steps.
+
+For the XXL Italian models, we use the same training data from OPUS and extend
+it with data from the Italian part of the [OSCAR corpus](https://traces1.inria.fr/oscar/).
+Thus, the final training corpus has a size of 81GB and 13,138,379,147 tokens.
+
+Note: Unfortunately, an incorrect vocab size was used when training the XXL models.
+This explains the mismatch between the "real" vocab size of 31102 and the
+vocab size specified in `config.json`. However, the model works, and all
+evaluations were performed under these circumstances.
+See [this issue](https://github.com/dbmdz/berts/issues/7) for more information.
+
+The Italian ELECTRA model was trained on the "XXL" corpus for 1M steps in total using a batch
+size of 128.
We largely followed the ELECTRA training procedure as used for
+[BERTurk](https://github.com/stefan-it/turkish-bert/tree/master/electra).
+
+## Model weights
+
+Currently only PyTorch-[Transformers](https://github.com/huggingface/transformers)
+compatible weights are available. If you need access to TensorFlow checkpoints,
+please raise an issue!
+
+| Model | Downloads
+| ---------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------
+| `dbmdz/bert-base-italian-cased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/vocab.txt)
+| `dbmdz/bert-base-italian-uncased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/vocab.txt)
+| `dbmdz/bert-base-italian-xxl-cased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/vocab.txt)
+| `dbmdz/bert-base-italian-xxl-uncased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/vocab.txt)
+| `dbmdz/electra-base-italian-xxl-cased-discriminator` |
[`config.json`](https://s3.amazonaws.com/models.huggingface.co/bert/dbmdz/electra-base-italian-xxl-cased-discriminator/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/electra-base-italian-xxl-cased-discriminator/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/electra-base-italian-xxl-cased-discriminator/vocab.txt)
+| `dbmdz/electra-base-italian-xxl-cased-generator` | [`config.json`](https://s3.amazonaws.com/models.huggingface.co/bert/dbmdz/electra-base-italian-xxl-cased-generator/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/electra-base-italian-xxl-cased-generator/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/electra-base-italian-xxl-cased-generator/vocab.txt)
+
+## Results
+
+For results on downstream tasks like NER or PoS tagging, please refer to
+[this repository](https://github.com/stefan-it/italian-bertelectra).
+
+## Usage
+
+With Transformers >= 2.3 our Italian BERT models can be loaded like:
+
+```python
+from transformers import AutoModel, AutoTokenizer
+
+model_name = "dbmdz/bert-base-italian-cased"
+
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+
+model = AutoModel.from_pretrained(model_name)
+```
+
+To load the (recommended) Italian XXL BERT models, just use:
+
+```python
+from transformers import AutoModel, AutoTokenizer
+
+model_name = "dbmdz/bert-base-italian-xxl-cased"
+
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+
+model = AutoModel.from_pretrained(model_name)
+```
+
+To load the Italian XXL ELECTRA model (discriminator), just use:
+
+```python
+from transformers import AutoModel, AutoTokenizer
+
+model_name = "dbmdz/electra-base-italian-xxl-cased-discriminator"
+
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+
+model = AutoModel.from_pretrained(model_name)
+```
+
+# Huggingface model hub
+
+All models are available on the [Huggingface model hub](https://huggingface.co/dbmdz).
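Because the generator checkpoint described in this card is a masked language model, it can also be queried through the `fill-mask` pipeline. A minimal sketch (the Italian example sentence is illustrative only, and the exact predictions are not guaranteed):

```python
from transformers import pipeline

# The ELECTRA generator is a small masked LM, so fill-mask applies;
# the pipeline loads the matching tokenizer automatically.
fill_mask = pipeline(
    "fill-mask",
    model="dbmdz/electra-base-italian-xxl-cased-generator",
)

# Print the top candidate tokens for the masked position.
for prediction in fill_mask("Roma è la [MASK] d'Italia."):
    print(prediction["token_str"], round(prediction["score"], 3))
```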
+
+# Contact (Bugs, Feedback, Contribution and more)
+
+For questions about our BERT/ELECTRA models, just open an issue
+[here](https://github.com/dbmdz/berts/issues/new) 🤗
+
+# Acknowledgments
+
+Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
+Thanks for providing access to the TFRC ❤️
+
+Thanks to the generous support from the [Hugging Face](https://huggingface.co/) team,
+it is possible to download both cased and uncased models from their S3 storage 🤗