Model card for kuisailab/albert-base-arabic (huggingface#6729)
---
language: ar
datasets:
- oscar
- wikipedia
tags:
- ar
- masked-lm
- lm-head
---
# Arabic-ALBERT Base

Arabic edition of the ALBERT Base pretrained language model.
## Pretraining data

The models were pretrained on ~4.4 billion words:

- The Arabic portion of [OSCAR](https://oscar-corpus.com/) (unshuffled version of the corpus), filtered from [Common Crawl](http://commoncrawl.org/)
- A recent dump of Arabic [Wikipedia](https://dumps.wikimedia.org/backup-index.html)
__Notes on training data:__

- Our final version of the corpus contains some non-Arabic words inline, which we did not remove from sentences because doing so would hurt tasks such as NER.
- Non-Arabic characters were lowercased as a preprocessing step; since Arabic characters have no upper or lower case, there are no separate cased and uncased versions of the model.
- The corpus and vocabulary are not restricted to Modern Standard Arabic; they also contain some dialectal Arabic.
## Pretraining details

- These models were trained using Google ALBERT's GitHub [repository](https://github.com/google-research/albert) on a single TPU v3-8, provided for free by [TFRC](https://www.tensorflow.org/tfrc).
- Our pretraining procedure follows the training settings of BERT with some changes: 7M training steps with a batch size of 64, instead of 125K steps with a batch size of 4096 (see the quick comparison below).
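As a rough sanity check (our own back-of-the-envelope arithmetic, not a figure from the original card), the two settings process a comparable total number of training examples:

```python
# Back-of-the-envelope comparison of total examples processed
# (training steps x batch size); numbers taken from the bullet above.
arabic_albert_examples = 7_000_000 * 64   # = 448,000,000 examples
reference_bert_examples = 125_000 * 4096  # = 512,000,000 examples
print(arabic_albert_examples, reference_bert_examples)
```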
## Models

|                 | albert-base | albert-large | albert-xlarge |
|:---------------:|:-----------:|:------------:|:-------------:|
| Hidden Layers   | 12          | 24           | 24            |
| Attention heads | 12          | 16           | 32            |
| Hidden size     | 768         | 1024         | 2048          |
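If useful, these architecture numbers can also be read directly from the hosted configuration (a minimal sketch; the attribute names below are the standard `transformers` ALBERT config fields):

```python
from transformers import AutoConfig

# Fetch the configuration of the base checkpoint from the Hub
config = AutoConfig.from_pretrained("kuisailab/albert-base-arabic")

# These fields should match the albert-base column of the table above
print(config.num_hidden_layers, config.num_attention_heads, config.hidden_size)
```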
## Results

For further details on the models' performance, or for any other queries, please refer to [Arabic-ALBERT](https://github.com/KUIS-AI-Lab/Arabic-ALBERT/).
## How to use

You can use these models after installing `torch` or `tensorflow` together with the Hugging Face `transformers` library, and initialize them directly like this:
```python
from transformers import AutoTokenizer, AutoModel

# loading the tokenizer
base_tokenizer = AutoTokenizer.from_pretrained("kuisailab/albert-base-arabic")

# loading the model
base_model = AutoModel.from_pretrained("kuisailab/albert-base-arabic")
```
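Since the card tags the checkpoint with `masked-lm`, a quick way to try it is the `fill-mask` pipeline (a hedged usage sketch; the pipeline API is standard `transformers`, but the example sentence is purely illustrative):

```python
from transformers import pipeline

# Masked-token prediction with the same checkpoint
fill_mask = pipeline(
    "fill-mask",
    model="kuisailab/albert-base-arabic",
    tokenizer="kuisailab/albert-base-arabic",
)

# Build an Arabic sentence around the model's mask token
# ("The Arabic language [MASK] beautiful" -- illustrative only)
masked_sentence = f"اللغة العربية {fill_mask.tokenizer.mask_token} جميلة"

for prediction in fill_mask(masked_sentence):
    print(prediction["token_str"], prediction["score"])
```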
## Acknowledgement

Thanks to Google for providing a free TPU for the training process, and to Hugging Face for hosting these models on their servers 😊