Fixed Majority of the Typos in transformers[en] Documentation (huggingface#33350)

* Fixed typo: insted to instead

* Fixed typo: relase to release

* Fixed typo: nighlty to nightly

* Fixed typos: versatible, benchamarks, becnhmark to versatile, benchmark, benchmarks

* Fixed typo in comment: quantizd to quantized

* Fixed typo: architecutre to architecture

* Fixed typo: contibution to contribution

* Fixed typo: Presequities to Prerequisites

* Fixed typo: faste to faster

* Fixed typo: extendeding to extending

* Fixed typo: segmetantion_maps to segmentation_maps

* Fixed typo: Alternativelly to Alternatively

* Fixed incorrectly defined variable: output to output_disabled

* Fixed typo in library name: tranformers.onnx to transformers.onnx

* Fixed missing import: import tensorflow as tf

* Fixed incorrectly defined variable: token_tensor to tokens_tensor

* Fixed missing import: import torch

* Fixed incorrectly defined variable and typo: uromaize to uromanize

* Fixed incorrectly defined variable and typo: uromaize to uromanize

* Fixed typo in function args: numpy.ndarry to numpy.ndarray

* Fixed Inconsistent Library Name: Torchscript to TorchScript

* Fixed Inconsistent Class Name: OneformerProcessor to OneFormerProcessor

* Fixed Inconsistent Class Named Typo: TFLNetForMultipleChoice to TFXLNetForMultipleChoice

* Fixed Inconsistent Library Name Typo: Pytorch to PyTorch

* Fixed Inconsistent Function Name Typo: captureWarning to captureWarnings

* Fixed Inconsistent Library Name Typo: Pytorch to PyTorch

* Fixed Inconsistent Class Name Typo: TrainingArgument to TrainingArguments

* Fixed Inconsistent Model Name Typo: Swin2R to Swin2SR

* Fixed Inconsistent Model Name Typo: EART to BERT

* Fixed Inconsistent Library Name Typo: TensorFLow to TensorFlow

* Fixed Broken Link for Speech Emotion Classification with Wav2Vec2

* Fixed minor missing word Typo

* Fixed minor missing word Typo

* Fixed minor missing word Typo

* Fixed minor missing word Typo

* Fixed minor missing word Typo

* Fixed minor missing word Typo

* Fixed minor missing word Typo

* Fixed minor missing word Typo

* Fixed Punctuation: Two commas

* Fixed Punctuation: No Space between XLM-R and is

* Fixed Punctuation: No Space between [~accelerate.Accelerator.backward] and method

* Added backticks to display model.fit() in codeblock

* Added backticks to display openai-community/gpt2 in codeblock

* Fixed Minor Typo: will to with

* Fixed Minor Typo: is to are

* Fixed Minor Typo: in to on

* Fixed Minor Typo: inhibits to exhibits

* Fixed Minor Typo: they need to it needs

* Fixed Minor Typo: cast the load the checkpoints To load the checkpoints

* Fixed Inconsistent Class Name Typo: TFCamembertForCasualLM to TFCamembertForCausalLM

* Fixed typo in attribute name: outputs.last_hidden_states to outputs.last_hidden_state

* Added missing verbosity level: fatal

* Fixed Minor Typo: take To takes

* Fixed Minor Typo: heuristic To heuristics

* Fixed Minor Typo: setting To settings

* Fixed Minor Typo: Content To Contents

* Fixed Minor Typo: millions To million

* Fixed Minor Typo: difference To differences

* Fixed Minor Typo: while extract To which extracts

* Fixed Minor Typo: Hereby To Here

* Fixed Minor Typo: addition To additional

* Fixed Minor Typo: supports To supported

* Fixed Minor Typo: so that benchmark results TO as a consequence, benchmark

* Fixed Minor Typo: a To an

* Fixed Minor Typo: a To an

* Fixed Minor Typo: Chain-of-though To Chain-of-thought
nnilayy authored and BernardZach committed Dec 6, 2024
1 parent 95fd17f commit f4de4b1
Showing 61 changed files with 74 additions and 71 deletions.
2 changes: 1 addition & 1 deletion docs/source/en/accelerate.md
@@ -46,7 +46,7 @@ The next step is to pass all the relevant training objects to the [`~accelerate.

## Backward

-The last addition is to replace the typical `loss.backward()` in your training loop with 🤗 Accelerate's [`~accelerate.Accelerator.backward`]method:
+The last addition is to replace the typical `loss.backward()` in your training loop with 🤗 Accelerate's [`~accelerate.Accelerator.backward`] method:

```py
>>> for epoch in range(num_epochs):
```
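For context, here is a minimal sketch of the training loop this hunk truncates, assuming `model`, `optimizer`, `scheduler`, and `train_dataloader` have already been passed through `accelerator.prepare()` (the names are illustrative):

```py
>>> for epoch in range(num_epochs):
...     for batch in train_dataloader:
...         outputs = model(**batch)
...         loss = outputs.loss
...         accelerator.backward(loss)  # replaces the usual loss.backward()
...         optimizer.step()
...         scheduler.step()
...         optimizer.zero_grad()
```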
2 changes: 1 addition & 1 deletion docs/source/en/autoclass_tutorial.md
@@ -110,7 +110,7 @@ Now you can access the `feature_maps` object from the first stage of the backbon

## AutoFeatureExtractor

-For audio tasks, a feature extractor processes the audio signal the correct input format.
+For audio tasks, a feature extractor processes the audio signal into the correct input format.

Load a feature extractor with [`AutoFeatureExtractor.from_pretrained`]:

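The loading example itself is cut off by the diff; a hedged sketch of the usual pattern, with the checkpoint chosen purely for illustration:

```py
>>> from transformers import AutoFeatureExtractor

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
```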
4 changes: 2 additions & 2 deletions docs/source/en/benchmarks.md
@@ -35,7 +35,7 @@ The classes [`PyTorchBenchmark`] and [`TensorFlowBenchmark`] allow to flexibly b

<Tip>

-Hereby, _inference_ is defined by a single forward pass, and _training_ is defined by a single forward pass and
+Here, _inference_ is defined by a single forward pass, and _training_ is defined by a single forward pass and
backward pass.

</Tip>
@@ -368,7 +368,7 @@ This section lists a couple of best practices one should be aware of when benchm
memory measurement it is recommended to run each memory benchmark in a separate process by making sure
`no_multi_processing` is set to `True`.
- One should always state the environment information when sharing the results of a model benchmark. Results can vary
-heavily between different GPU devices, library versions, etc., so that benchmark results on their own are not very
+heavily between different GPU devices, library versions, etc., as a consequence, benchmark results on their own are not very
useful for the community.


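For reference, the API these best practices apply to looks roughly like this (the model and the sizes are illustrative):

```py
>>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

>>> args = PyTorchBenchmarkArguments(
...     models=["google-bert/bert-base-uncased"], batch_sizes=[8], sequence_lengths=[128]
... )
>>> benchmark = PyTorchBenchmark(args)
>>> results = benchmark.run()
```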
2 changes: 1 addition & 1 deletion docs/source/en/bertology.md
@@ -37,5 +37,5 @@ help people access the inner representations, mainly adapted from the great work
- retrieving heads output values and gradients to be able to compute head importance score and prune head as explained
in https://arxiv.org/abs/1905.10650.

-To help you understand and use these features, we have added a specific example script: [bertology.py](https://github.com/huggingface/transformers/tree/main/examples/research_projects/bertology/run_bertology.py) while extract information and prune a model pre-trained on
+To help you understand and use these features, we have added a specific example script: [bertology.py](https://github.com/huggingface/transformers/tree/main/examples/research_projects/bertology/run_bertology.py) which extracts information and prune a model pre-trained on
GLUE.
2 changes: 1 addition & 1 deletion docs/source/en/community.md
@@ -63,7 +63,7 @@ This page regroups resources around 🤗 Transformers developed by the community
| [Evaluate LUKE on TACRED, a relation extraction dataset](https://github.com/studio-ousia/luke/blob/master/notebooks/huggingface_tacred.ipynb) | How to evaluate *LukeForEntityPairClassification* on the TACRED dataset | [Ikuya Yamada](https://github.com/ikuyamada) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/studio-ousia/luke/blob/master/notebooks/huggingface_tacred.ipynb) |
| [Evaluate LUKE on CoNLL-2003, an important NER benchmark](https://github.com/studio-ousia/luke/blob/master/notebooks/huggingface_conll_2003.ipynb) | How to evaluate *LukeForEntitySpanClassification* on the CoNLL-2003 dataset | [Ikuya Yamada](https://github.com/ikuyamada) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/studio-ousia/luke/blob/master/notebooks/huggingface_conll_2003.ipynb) |
| [Evaluate BigBird-Pegasus on PubMed dataset](https://github.com/vasudevgupta7/bigbird/blob/main/notebooks/bigbird_pegasus_evaluation.ipynb) | How to evaluate *BigBirdPegasusForConditionalGeneration* on PubMed dataset | [Vasudev Gupta](https://github.com/vasudevgupta7) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vasudevgupta7/bigbird/blob/main/notebooks/bigbird_pegasus_evaluation.ipynb) |
-| [Speech Emotion Classification with Wav2Vec2](https://github/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb) | How to leverage a pretrained Wav2Vec2 model for Emotion Classification on the MEGA dataset | [Mehrdad Farahani](https://github.com/m3hrdadfi) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb) |
+| [Speech Emotion Classification with Wav2Vec2](https://github.com/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb) | How to leverage a pretrained Wav2Vec2 model for Emotion Classification on the MEGA dataset | [Mehrdad Farahani](https://github.com/m3hrdadfi) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb) |
| [Detect objects in an image with DETR](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DETR/DETR_minimal_example_(with_DetrFeatureExtractor).ipynb) | How to use a trained *DetrForObjectDetection* model to detect objects in an image and visualize attention | [Niels Rogge](https://github.com/NielsRogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/DETR/DETR_minimal_example_(with_DetrFeatureExtractor).ipynb) |
| [Fine-tune DETR on a custom object detection dataset](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DETR/Fine_tuning_DetrForObjectDetection_on_custom_dataset_(balloon).ipynb) | How to fine-tune *DetrForObjectDetection* on a custom object detection dataset | [Niels Rogge](https://github.com/NielsRogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/DETR/Fine_tuning_DetrForObjectDetection_on_custom_dataset_(balloon).ipynb) |
| [Finetune T5 for Named Entity Recognition](https://github.com/ToluClassics/Notebooks/blob/main/T5_Ner_Finetuning.ipynb) | How to fine-tune *T5* on a Named Entity Recognition Task | [Ogundepo Odunayo](https://github.com/ToluClassics) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1obr78FY_cBmWY5ODViCmzdY6O1KB65Vc?usp=sharing) |
2 changes: 1 addition & 1 deletion docs/source/en/debugging.md
@@ -203,7 +203,7 @@ This feature can be used with any `nn.Module`-based model.
</Tip>
-If you start getting `loss=NaN` or the model inhibits some other abnormal behavior due to `inf` or `nan` in
+If you start getting `loss=NaN` or the model exhibits some other abnormal behavior due to `inf` or `nan` in
activations or weights one needs to discover where the first underflow or overflow happens and what led to it. Luckily
you can accomplish that easily by activating a special module that will do the detection automatically.
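The "special module" this passage refers to appears to be `DebugUnderflowOverflow`; a minimal sketch of attaching it, assuming `model` is already instantiated:

```py
>>> from transformers.debug_utils import DebugUnderflowOverflow

>>> debug_overflow = DebugUnderflowOverflow(model)  # hooks each forward pass to report the first inf/nan
```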
2 changes: 1 addition & 1 deletion docs/source/en/generation_strategies.md
@@ -456,7 +456,7 @@ just like in multinomial sampling. However, in assisted decoding, reducing the t
['Alice and Bob, a couple of friends of mine, who are both in the same office as']
```

-Alternativelly, you can also set the `prompt_lookup_num_tokens` to trigger n-gram based assisted decoding, as opposed
+Alternatively, you can also set the `prompt_lookup_num_tokens` to trigger n-gram based assisted decoding, as opposed
to model based assisted decoding. You can read more about it [here](https://twitter.com/joao_gante/status/1747322413006643259).
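
A short sketch of triggering prompt lookup decoding; the checkpoint and token count are illustrative:

```py
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
>>> inputs = tokenizer("Alice and Bob", return_tensors="pt")

>>> # n-gram assisted decoding: candidate tokens are looked up in the prompt itself
>>> outputs = model.generate(**inputs, prompt_lookup_num_tokens=10, max_new_tokens=20)
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```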
### DoLa Decoding

2 changes: 1 addition & 1 deletion docs/source/en/installation.md
@@ -71,7 +71,7 @@ pip install 'transformers[tf-cpu]'

M1 / ARM Users

-You will need to install the following before installing TensorFLow 2.0
+You will need to install the following before installing TensorFlow 2.0
```bash
brew install cmake
brew install pkg-config
```
8 changes: 4 additions & 4 deletions docs/source/en/main_classes/logging.md
@@ -30,7 +30,7 @@ transformers.logging.set_verbosity_info()
```

You can also use the environment variable `TRANSFORMERS_VERBOSITY` to override the default verbosity. You can set it
-to one of the following: `debug`, `info`, `warning`, `error`, `critical`. For example:
+to one of the following: `debug`, `info`, `warning`, `error`, `critical`, `fatal`. For example:

```bash
TRANSFORMERS_VERBOSITY=error ./myprogram.py
```

@@ -65,7 +65,7 @@ verbose to the most verbose), those levels (with their corresponding int values
critical errors.
- `transformers.logging.ERROR` (int value, 40): only report errors.
- `transformers.logging.WARNING` or `transformers.logging.WARN` (int value, 30): only reports error and
-warnings. This the default level used by the library.
+warnings. This is the default level used by the library.
- `transformers.logging.INFO` (int value, 20): reports error, warnings and basic information.
- `transformers.logging.DEBUG` (int value, 10): report all information.

@@ -77,10 +77,10 @@ Python has two logging systems that are often used in conjunction: `logging`, wh
which allows further classification of warnings in specific buckets, e.g., `FutureWarning` for a feature or path
that has already been deprecated and `DeprecationWarning` to indicate an upcoming deprecation.

-We use both in the `transformers` library. We leverage and adapt `logging`'s `captureWarning` method to allow
+We use both in the `transformers` library. We leverage and adapt `logging`'s `captureWarnings` method to allow
management of these warning messages by the verbosity setters above.

-What does that mean for developers of the library? We should respect the following heuristic:
+What does that mean for developers of the library? We should respect the following heuristics:
- `warnings` should be favored for developers of the library and libraries dependent on `transformers`
- `logging` should be used for end-users of the library using it in every-day projects

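A short sketch of working with these verbosity levels programmatically:

```py
>>> import transformers

>>> transformers.logging.get_verbosity()  # 30, i.e. WARNING, by default
>>> transformers.logging.set_verbosity(transformers.logging.ERROR)
>>> transformers.logging.set_verbosity_error()  # equivalent per-level helper
```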
2 changes: 1 addition & 1 deletion docs/source/en/main_classes/optimizer_schedules.md
@@ -38,7 +38,7 @@ The `.optimization` module provides:

## Schedules

-### Learning Rate Schedules (Pytorch)
+### Learning Rate Schedules (PyTorch)

[[autodoc]] SchedulerType

2 changes: 1 addition & 1 deletion docs/source/en/main_classes/output.md
@@ -42,7 +42,7 @@ an optional `attentions` attribute. Here we have the `loss` since we passed alon

<Tip>

-When passing `output_hidden_states=True` you may expect the `outputs.hidden_states[-1]` to match `outputs.last_hidden_states` exactly.
+When passing `output_hidden_states=True` you may expect the `outputs.hidden_states[-1]` to match `outputs.last_hidden_state` exactly.
However, this is not always the case. Some models apply normalization or subsequent processing to the last hidden state when it's returned.

</Tip>
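A quick way to check the two attributes against each other; a sketch assuming a plain BERT checkpoint, for which they should match:

```py
>>> import torch
>>> from transformers import AutoModel, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
>>> model = AutoModel.from_pretrained("google-bert/bert-base-uncased")
>>> inputs = tokenizer("Hello, world!", return_tensors="pt")
>>> outputs = model(**inputs, output_hidden_states=True)

>>> torch.allclose(outputs.hidden_states[-1], outputs.last_hidden_state)
True
```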
2 changes: 1 addition & 1 deletion docs/source/en/main_classes/trainer.md
@@ -18,7 +18,7 @@ rendered properly in your Markdown viewer.

The [`Trainer`] class provides an API for feature-complete training in PyTorch, and it supports distributed training on multiple GPUs/TPUs, mixed precision for [NVIDIA GPUs](https://nvidia.github.io/apex/), [AMD GPUs](https://rocm.docs.amd.com/en/latest/rocm.html), and [`torch.amp`](https://pytorch.org/docs/stable/amp.html) for PyTorch. [`Trainer`] goes hand-in-hand with the [`TrainingArguments`] class, which offers a wide range of options to customize how a model is trained. Together, these two classes provide a complete training API.

-[`Seq2SeqTrainer`] and [`Seq2SeqTrainingArguments`] inherit from the [`Trainer`] and [`TrainingArgument`] classes and they're adapted for training models for sequence-to-sequence tasks such as summarization or translation.
+[`Seq2SeqTrainer`] and [`Seq2SeqTrainingArguments`] inherit from the [`Trainer`] and [`TrainingArguments`] classes and they're adapted for training models for sequence-to-sequence tasks such as summarization or translation.
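
A minimal sketch of the seq2seq variant in use; the argument values are illustrative:

```py
>>> from transformers import Seq2SeqTrainingArguments

>>> training_args = Seq2SeqTrainingArguments(
...     output_dir="my_model",
...     predict_with_generate=True,  # seq2seq-specific: evaluate with generate()
... )
```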

<Tip warning={true}>

2 changes: 1 addition & 1 deletion docs/source/en/model_doc/camembert.md
@@ -106,7 +106,7 @@ as the information relative to the inputs and outputs.

[[autodoc]] TFCamembertModel

-## TFCamembertForCasualLM
+## TFCamembertForCausalLM

[[autodoc]] TFCamembertForCausalLM

2 changes: 1 addition & 1 deletion docs/source/en/model_doc/clipseg.md
@@ -19,7 +19,7 @@ rendered properly in your Markdown viewer.
## Overview

The CLIPSeg model was proposed in [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke
-and Alexander Ecker. CLIPSeg adds a minimal decoder on top of a frozen [CLIP](clip) model for zero- and one-shot image segmentation.
+and Alexander Ecker. CLIPSeg adds a minimal decoder on top of a frozen [CLIP](clip) model for zero-shot and one-shot image segmentation.

The abstract from the paper is the following:

2 changes: 1 addition & 1 deletion docs/source/en/model_doc/code_llama.md
@@ -34,7 +34,7 @@ This model was contributed by [ArthurZucker](https://huggingface.co/ArthurZ). Th

The `Llama2` family models, on which Code Llama is based, were trained using `bfloat16`, but the original inference uses `float16`. Let's look at the different precisions:

-* `float32`: PyTorch convention on model initialization is to load models in `float32`, no matter with which `dtype` the model weights were stored. `transformers` also follows this convention for consistency with PyTorch. This will be picked by default. If you want the `AutoModel` API to cast the load the checkpoints with the storage weights type, you must specify `torch_dtype="auto"`, e.g. `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto")`.
+* `float32`: PyTorch convention on model initialization is to load models in `float32`, no matter with which `dtype` the model weights were stored. `transformers` also follows this convention for consistency with PyTorch. This will be picked by default. If you want the `AutoModel` API to load the checkpoints with the storage weights type, you must specify `torch_dtype="auto"`, e.g. `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto")`.
* `bfloat16`: Code Llama was trained with this precision, so we recommend using it for further training or fine-tuning.
* `float16`: We recommend running inference using this precision, as it's usually faster than `bfloat16`, and evaluation metrics show no discernible degradation with respect to `bfloat16`. You can also run inference using `bfloat16`, and we recommend you check inference results with both `float16` and `bfloat16` after fine-tuning.

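A hedged sketch of the `torch_dtype="auto"` behavior described in the first bullet; the checkpoint name is assumed:

```py
>>> from transformers import AutoModelForCausalLM

>>> model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf", torch_dtype="auto")
>>> model.dtype  # matches the dtype the weights were stored in, e.g. torch.bfloat16
```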
2 changes: 1 addition & 1 deletion docs/source/en/model_doc/falcon_mamba.md
@@ -27,7 +27,7 @@ Due to its architecture, FalconMamba is significantly faster at inference and re

Tips:

-- FalconMamba is mostly based on Mamba architecutre, the same [tips and best practices](./mamba) would be relevant here.
+- FalconMamba is mostly based on Mamba architecture, the same [tips and best practices](./mamba) would be relevant here.

The model has been trained on approximately 6T tokens consisting of a mixture of many data sources such as RefinedWeb, Cosmopedia and Math data.

2 changes: 1 addition & 1 deletion docs/source/en/model_doc/hiera.md
@@ -31,7 +31,7 @@ alt="drawing" width="600"/>

<small> Hiera architecture. Taken from the <a href="https://arxiv.org/abs/2306.00989">original paper.</a> </small>

-This model was a joint contibution by [EduardoPacheco](https://huggingface.co/EduardoPacheco) and [namangarg110](https://huggingface.co/namangarg110). The original code can be found [here](https://github.com/facebookresearch/hiera).
+This model was a joint contribution by [EduardoPacheco](https://huggingface.co/EduardoPacheco) and [namangarg110](https://huggingface.co/namangarg110). The original code can be found [here](https://github.com/facebookresearch/hiera).

## Resources

2 changes: 1 addition & 1 deletion docs/source/en/model_doc/jamba.md
@@ -33,7 +33,7 @@ alt="drawing" width="600"/>

## Usage

-### Presequities
+### Prerequisites

Jamba requires you use `transformers` version 4.39.0 or higher:
```bash
pip install "transformers>=4.39.0"
```
2 changes: 1 addition & 1 deletion docs/source/en/model_doc/matcha.md
@@ -61,7 +61,7 @@ print(processor.decode(predictions[0], skip_special_tokens=True))

## Fine-tuning

-To fine-tune MatCha, refer to the pix2struct [fine-tuning notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_pix2struct.ipynb). For `Pix2Struct` models, we have found out that fine-tuning the model with Adafactor and cosine learning rate scheduler leads to faste convergence:
+To fine-tune MatCha, refer to the pix2struct [fine-tuning notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_pix2struct.ipynb). For `Pix2Struct` models, we have found out that fine-tuning the model with Adafactor and cosine learning rate scheduler leads to faster convergence:
```python
from transformers.optimization import Adafactor, get_cosine_schedule_with_warmup

```
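The code block above is cut off by the diff; a sketch of the optimizer and scheduler setup it introduces, with hyperparameters shown purely for illustration:

```python
from transformers.optimization import Adafactor, get_cosine_schedule_with_warmup

# Adafactor with a fixed learning rate, plus a cosine schedule with warmup
optimizer = Adafactor(model.parameters(), scale_parameter=False, relative_step=False, lr=0.01, weight_decay=1e-05)
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=1000, num_training_steps=40000)
```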
2 changes: 1 addition & 1 deletion docs/source/en/model_doc/mbart.md
@@ -83,7 +83,7 @@ keyword, and target text format passed with the `text_label` keyword argument.
## Overview of MBart-50

MBart-50 was introduced in the [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) paper by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav
-Chaudhary, Jiatao Gu, Angela Fan. MBart-50 is created using the original *mbart-large-cc25* checkpoint by extendeding
+Chaudhary, Jiatao Gu, Angela Fan. MBart-50 is created using the original *mbart-large-cc25* checkpoint by extending
its embedding layers with randomly initialized vectors for an extra set of 25 language tokens and then pretrained on 50
languages.

2 changes: 1 addition & 1 deletion docs/source/en/model_doc/mixtral.md
@@ -31,7 +31,7 @@ Mixtral-8x7B is the second large language model (LLM) released by [mistral.ai](h
Mixtral-8x7B is a decoder-only Transformer with the following architectural choices:

- Mixtral is a Mixture of Experts (MoE) model with 8 experts per MLP, with a total of 45 billion parameters. To learn more about mixture-of-experts, refer to the [blog post](https://huggingface.co/blog/moe).
-- Despite the model having 45 billion parameters,, the compute required for a single forward pass is the same as that of a 14 billion parameter model. This is because even though each of the experts have to be loaded in RAM (70B like ram requirement) each token from the hidden states are dispatched twice (top 2 routing) and thus the compute (the operation required at each forward computation) is just 2 X sequence_length.
+- Despite the model having 45 billion parameters, the compute required for a single forward pass is the same as that of a 14 billion parameter model. This is because even though each of the experts have to be loaded in RAM (70B like ram requirement) each token from the hidden states are dispatched twice (top 2 routing) and thus the compute (the operation required at each forward computation) is just 2 X sequence_length.

The following implementation details are shared with Mistral AI's first model [Mistral-7B](mistral):
- Sliding Window Attention - Trained with 8k context length and fixed cache size, with a theoretical attention span of 128K tokens
6 changes: 3 additions & 3 deletions docs/source/en/model_doc/mms.md
@@ -242,7 +242,7 @@ export UROMAN=$(pwd)
```

You can then pre-process the text input using the following code snippet. You can either rely on using the bash variable
-`UROMAN` to point to the uroman repository, or you can pass the uroman directory as an argument to the `uromaize` function:
+`UROMAN` to point to the uroman repository, or you can pass the uroman directory as an argument to the `uromanize` function:

```python
import torch
@@ -270,9 +270,9 @@ def uromanize(input_string, uroman_path):
return stdout.decode()[:-1]

text = "이봐 무슨 일이야"
-uromaized_text = uromanize(text, uroman_path=os.environ["UROMAN"])
+uromanized_text = uromanize(text, uroman_path=os.environ["UROMAN"])

-inputs = tokenizer(text=uromaized_text, return_tensors="pt")
+inputs = tokenizer(text=uromanized_text, return_tensors="pt")

set_seed(555) # make deterministic
with torch.no_grad():
```
2 changes: 1 addition & 1 deletion docs/source/en/model_doc/mpt.md
@@ -18,7 +18,7 @@ rendered properly in your Markdown viewer.

## Overview

-The MPT model was proposed by the [MosaicML](https://www.mosaicml.com/) team and released with multiple sizes and finetuned variants. The MPT models is a series of open source and commercially usable LLMs pre-trained on 1T tokens.
+The MPT model was proposed by the [MosaicML](https://www.mosaicml.com/) team and released with multiple sizes and finetuned variants. The MPT models are a series of open source and commercially usable LLMs pre-trained on 1T tokens.

MPT models are GPT-style decoder-only transformers with several improvements: performance-optimized layer implementations, architecture changes that provide greater training stability, and the elimination of context length limits by replacing positional embeddings with ALiBi.
