Added 12 model cards for Indian Language Models (#8198)

* Create README.md
* added model cards
huggingface · Nov 2, 2020 · aa79aa4
1 parent 9bd30f7
commit aa79aa4

Showing 12 changed files with 344 additions and 0 deletions.
diff --git a/model_cards/neuralspace-reverie/indic-transformers-bn-bert/README.md b/model_cards/neuralspace-reverie/indic-transformers-bn-bert/README.md
@@ -0,0 +1,25 @@
+---
+language: 
+- bn 
+tags:
+- MaskedLM
+- Bengali
+---
+# Indic-Transformers Bengali BERT
+## Model description
+This is a BERT language model pre-trained on a monolingual training corpus of ~3 GB. The pre-training data was taken mainly from [OSCAR](https://oscar-corpus.com/).
+This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
+## Intended uses & limitations
+#### How to use
+```python
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-bn-bert')
+model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-bn-bert')
+text = "আপনি কেমন আছেন?"
+input_ids = tokenizer(text, return_tensors='pt')['input_ids']
+out = model(input_ids)[0]  # last hidden state
+print(out.shape)
+# Shape reported by the authors: [1, 6, 768] (batch size, tokens, hidden size)
+```
+#### Limitations and bias
+The original language model was trained with `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The h5 file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
diff --git a/model_cards/neuralspace-reverie/indic-transformers-bn-distilbert/README.md b/model_cards/neuralspace-reverie/indic-transformers-bn-distilbert/README.md
@@ -0,0 +1,29 @@
+---
+language: 
+- bn 
+tags:
+- MaskedLM
+- Bengali
+- DistilBERT
+- Question-Answering
+- Token Classification
+- Text Classification
+---
+# Indic-Transformers Bengali DistilBERT
+## Model description
+This is a DistilBERT language model pre-trained on a monolingual training corpus of ~6 GB. The pre-training data was taken mainly from [OSCAR](https://oscar-corpus.com/).
+This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
+## Intended uses & limitations
+#### How to use
+```python
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-bn-distilbert')
+model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-bn-distilbert')
+text = "আপনি কেমন আছেন?"
+input_ids = tokenizer(text, return_tensors='pt')['input_ids']
+out = model(input_ids)[0]  # last hidden state
+print(out.shape)
+# Shape reported by the authors: [1, 5, 768] (batch size, tokens, hidden size)
+```
+#### Limitations and bias
+The original language model was trained with `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The h5 file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
diff --git a/model_cards/neuralspace-reverie/indic-transformers-bn-roberta/README.md b/model_cards/neuralspace-reverie/indic-transformers-bn-roberta/README.md
@@ -0,0 +1,29 @@
+---
+language: 
+- bn 
+tags:
+- MaskedLM
+- Bengali
+- RoBERTa
+- Question-Answering
+- Token Classification
+- Text Classification
+---
+# Indic-Transformers Bengali RoBERTa
+## Model description
+This is a RoBERTa language model pre-trained on a monolingual training corpus of ~6 GB. The pre-training data was taken mainly from [OSCAR](https://oscar-corpus.com/).
+This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
+## Intended uses & limitations
+#### How to use
+```python
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-bn-roberta')
+model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-bn-roberta')
+text = "আপনি কেমন আছেন?"
+input_ids = tokenizer(text, return_tensors='pt')['input_ids']
+out = model(input_ids)[0]  # last hidden state
+print(out.shape)
+# Shape reported by the authors: [1, 10, 768] (batch size, tokens, hidden size)
+```
+#### Limitations and bias
+The original language model was trained with `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The h5 file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
diff --git a/model_cards/neuralspace-reverie/indic-transformers-bn-xlmroberta/README.md b/model_cards/neuralspace-reverie/indic-transformers-bn-xlmroberta/README.md
@@ -0,0 +1,29 @@
+---
+language: 
+- bn 
+tags:
+- MaskedLM
+- Bengali
+- XLMRoBERTa
+- Question-Answering
+- Token Classification
+- Text Classification
+---
+# Indic-Transformers Bengali XLMRoBERTa
+## Model description
+This is an XLMRoBERTa language model pre-trained on a monolingual training corpus of ~3 GB. The pre-training data was taken mainly from [OSCAR](https://oscar-corpus.com/).
+This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
+## Intended uses & limitations
+#### How to use
+```python
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-bn-xlmroberta')
+model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-bn-xlmroberta')
+text = "আপনি কেমন আছেন?"
+input_ids = tokenizer(text, return_tensors='pt')['input_ids']
+out = model(input_ids)[0]  # last hidden state
+print(out.shape)
+# Shape reported by the authors: [1, 5, 768] (batch size, tokens, hidden size)
+```
+#### Limitations and bias
+The original language model was trained with `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The h5 file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
diff --git a/model_cards/neuralspace-reverie/indic-transformers-hi-bert/README.md b/model_cards/neuralspace-reverie/indic-transformers-hi-bert/README.md
@@ -0,0 +1,29 @@
+---
+language: 
+- hi 
+tags:
+- MaskedLM
+- Hindi
+- BERT
+- Question-Answering
+- Token Classification
+- Text Classification
+---
+# Indic-Transformers Hindi BERT
+## Model description
+This is a BERT language model pre-trained on a monolingual training corpus of ~3 GB. The pre-training data was taken mainly from [OSCAR](https://oscar-corpus.com/).
+This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
+## Intended uses & limitations
+#### How to use
+```python
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-hi-bert')
+model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-hi-bert')
+text = "आपका स्वागत हैं"
+input_ids = tokenizer(text, return_tensors='pt')['input_ids']
+out = model(input_ids)[0]  # last hidden state
+print(out.shape)
+# Shape reported by the authors: [1, 5, 768] (batch size, tokens, hidden size)
+```
+#### Limitations and bias
+The original language model was trained with `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The h5 file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
diff --git a/model_cards/neuralspace-reverie/indic-transformers-hi-distilbert/README.md b/model_cards/neuralspace-reverie/indic-transformers-hi-distilbert/README.md
@@ -0,0 +1,29 @@
+---
+language: 
+- hi 
+tags:
+- MaskedLM
+- Hindi
+- DistilBERT
+- Question-Answering
+- Token Classification
+- Text Classification
+---
+# Indic-Transformers Hindi DistilBERT
+## Model description
+This is a DistilBERT language model pre-trained on a monolingual training corpus of ~10 GB. The pre-training data was taken mainly from [OSCAR](https://oscar-corpus.com/).
+This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
+## Intended uses & limitations
+#### How to use
+```python
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-hi-distilbert')
+model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-hi-distilbert')
+text = "आपका स्वागत हैं"
+input_ids = tokenizer(text, return_tensors='pt')['input_ids']
+out = model(input_ids)[0]  # last hidden state
+print(out.shape)
+# Shape reported by the authors: [1, 5, 768] (batch size, tokens, hidden size)
+```
+#### Limitations and bias
+The original language model was trained with `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The h5 file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
diff --git a/model_cards/neuralspace-reverie/indic-transformers-hi-roberta/README.md b/model_cards/neuralspace-reverie/indic-transformers-hi-roberta/README.md
@@ -0,0 +1,29 @@
+---
+language: 
+- hi 
+tags:
+- MaskedLM
+- Hindi
+- RoBERTa
+- Question-Answering
+- Token Classification
+- Text Classification
+---
+# Indic-Transformers Hindi RoBERTa
+## Model description
+This is a RoBERTa language model pre-trained on a monolingual training corpus of ~10 GB. The pre-training data was taken mainly from [OSCAR](https://oscar-corpus.com/).
+This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
+## Intended uses & limitations
+#### How to use
+```python
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-hi-roberta')
+model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-hi-roberta')
+text = "आपका स्वागत हैं"
+input_ids = tokenizer(text, return_tensors='pt')['input_ids']
+out = model(input_ids)[0]  # last hidden state
+print(out.shape)
+# Shape reported by the authors: [1, 11, 768] (batch size, tokens, hidden size)
+```
+#### Limitations and bias
+The original language model was trained with `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The h5 file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
diff --git a/model_cards/neuralspace-reverie/indic-transformers-hi-xlmroberta/README.md b/model_cards/neuralspace-reverie/indic-transformers-hi-xlmroberta/README.md
@@ -0,0 +1,29 @@
+---
+language: 
+- hi 
+tags:
+- MaskedLM
+- Hindi
+- XLMRoBERTa
+- Question-Answering
+- Token Classification
+- Text Classification
+---
+# Indic-Transformers Hindi XLMRoBERTa
+## Model description
+This is an XLMRoBERTa language model pre-trained on a monolingual training corpus of ~3 GB. The pre-training data was taken mainly from [OSCAR](https://oscar-corpus.com/).
+This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
+## Intended uses & limitations
+#### How to use
+```python
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-hi-xlmroberta')
+model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-hi-xlmroberta')
+text = "आपका स्वागत हैं"
+input_ids = tokenizer(text, return_tensors='pt')['input_ids']
+out = model(input_ids)[0]  # last hidden state
+print(out.shape)
+# Shape reported by the authors: [1, 5, 768] (batch size, tokens, hidden size)
+```
+#### Limitations and bias
+The original language model was trained with `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The h5 file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
diff --git a/model_cards/neuralspace-reverie/indic-transformers-te-bert/README.md b/model_cards/neuralspace-reverie/indic-transformers-te-bert/README.md
@@ -0,0 +1,29 @@
+---
+language: 
+- te
+tags:
+- MaskedLM
+- Telugu
+- BERT
+- Question-Answering
+- Token Classification
+- Text Classification
+---
+# Indic-Transformers Telugu BERT
+## Model description
+This is a BERT language model pre-trained on a monolingual training corpus of ~1.6 GB. The pre-training data was taken mainly from [OSCAR](https://oscar-corpus.com/).
+This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
+## Intended uses & limitations
+#### How to use
+```python
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-te-bert')
+model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-te-bert')
+text = "మీరు ఎలా ఉన్నారు"
+input_ids = tokenizer(text, return_tensors='pt')['input_ids']
+out = model(input_ids)[0]  # last hidden state
+print(out.shape)
+# Shape reported by the authors: [1, 5, 768] (batch size, tokens, hidden size)
+```
+#### Limitations and bias
+The original language model was trained with `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The h5 file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
diff --git a/model_cards/neuralspace-reverie/indic-transformers-te-distilbert/README.md b/model_cards/neuralspace-reverie/indic-transformers-te-distilbert/README.md
@@ -0,0 +1,29 @@
+---
+language: 
+- te
+tags:
+- MaskedLM
+- Telugu
+- DistilBERT
+- Question-Answering
+- Token Classification
+- Text Classification
+---
+# Indic-Transformers Telugu DistilBERT
+## Model description
+This is a DistilBERT language model pre-trained on a monolingual training corpus of ~2 GB. The pre-training data was taken mainly from [OSCAR](https://oscar-corpus.com/).
+This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
+## Intended uses & limitations
+#### How to use
+```python
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-te-distilbert')
+model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-te-distilbert')
+text = "మీరు ఎలా ఉన్నారు"
+input_ids = tokenizer(text, return_tensors='pt')['input_ids']
+out = model(input_ids)[0]  # last hidden state
+print(out.shape)
+# Shape reported by the authors: [1, 5, 768] (batch size, tokens, hidden size)
+```
+#### Limitations and bias
+The original language model was trained with `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The h5 file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
diff --git a/model_cards/neuralspace-reverie/indic-transformers-te-roberta/README.md b/model_cards/neuralspace-reverie/indic-transformers-te-roberta/README.md
@@ -0,0 +1,29 @@
+---
+language: 
+- te
+tags:
+- MaskedLM
+- Telugu
+- RoBERTa
+- Question-Answering
+- Token Classification
+- Text Classification
+---
+# Indic-Transformers Telugu RoBERTa
+## Model description
+This is a RoBERTa language model pre-trained on a monolingual training corpus of ~2 GB. The pre-training data was taken mainly from [OSCAR](https://oscar-corpus.com/).
+This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
+## Intended uses & limitations
+#### How to use
+```python
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-te-roberta')
+model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-te-roberta')
+text = "మీరు ఎలా ఉన్నారు"
+input_ids = tokenizer(text, return_tensors='pt')['input_ids']
+out = model(input_ids)[0]  # last hidden state
+print(out.shape)
+# Shape reported by the authors: [1, 14, 768] (batch size, tokens, hidden size)
+```
+#### Limitations and bias
+The original language model was trained with `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The h5 file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
diff --git a/model_cards/neuralspace-reverie/indic-transformers-te-xlmroberta/README.md b/model_cards/neuralspace-reverie/indic-transformers-te-xlmroberta/README.md
@@ -0,0 +1,29 @@
+---
+language: 
+- te
+tags:
+- MaskedLM
+- Telugu
+- XLMRoBERTa
+- Question-Answering
+- Token Classification
+- Text Classification
+---
+# Indic-Transformers Telugu XLMRoBERTa
+## Model description
+This is an XLMRoBERTa language model pre-trained on a monolingual training corpus of ~1.6 GB. The pre-training data was taken mainly from [OSCAR](https://oscar-corpus.com/).
+This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
+## Intended uses & limitations
+#### How to use
+```python
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-te-xlmroberta')
+model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-te-xlmroberta')
+text = "మీరు ఎలా ఉన్నారు"
+input_ids = tokenizer(text, return_tensors='pt')['input_ids']
+out = model(input_ids)[0]  # last hidden state
+print(out.shape)
+# Shape reported by the authors: [1, 5, 768] (batch size, tokens, hidden size)
+```
+#### Limitations and bias
+The original language model was trained with `PyTorch`, so using the `pytorch_model.bin` weights file is recommended. The h5 file for `TensorFlow` was generated manually with the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
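
Note on the cards above (not part of the commit): each card says the embeddings can be used for feature-based training, but none shows how to turn the `[batch, tokens, 768]` output into one vector per sentence. A minimal sketch of one common option, masked mean pooling, is given here. The function name `masked_mean_pool` and the toy data are hypothetical, and the cards do not state which pooling the authors intended. Plain Python lists stand in for the model output so the example runs without downloading any weights.

```python
from typing import List


def masked_mean_pool(hidden: List[List[List[float]]],
                     mask: List[List[int]]) -> List[List[float]]:
    """Average the token vectors of each sentence, skipping masked positions.

    hidden: [batch][tokens][hidden_size] values (assumed layout of the model output).
    mask:   [batch][tokens] with 1 for a real token and 0 for padding.
    """
    pooled = []
    for token_vectors, token_mask in zip(hidden, mask):
        # Keep only the vectors whose mask entry is non-zero.
        kept = [vec for vec, keep in zip(token_vectors, token_mask) if keep]
        if not kept:
            raise ValueError("every position in this sequence is masked out")
        size = len(kept[0])
        pooled.append([sum(vec[i] for vec in kept) / len(kept) for i in range(size)])
    return pooled


# Toy stand-in for model output: 1 sentence, 3 token positions, hidden size 2.
toy_hidden = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
toy_mask = [[1, 1, 0]]  # the last position is treated as padding
sentence_vectors = masked_mean_pool(toy_hidden, toy_mask)
print(sentence_vectors)  # [[2.0, 3.0]]
```

With a real model, the same averaging would be applied to the last hidden state and to the `attention_mask` that the tokenizer call returns alongside `input_ids`. The resulting fixed-size vectors could then be fed to a separate classifier.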