Revert "Fix doc errors and typos across the board (huggingface#8139)"
This reverts commit 9d97b80.
fabiocapsouza authored Nov 15, 2020
1 parent c3f83d0 commit 5928117
Showing 160 changed files with 364 additions and 342 deletions.
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -96,7 +96,7 @@ folder.

## Start contributing! (Pull Requests)

Before writing code, we strongly advise you to search through the existing PRs or
Before writing code, we strongly advise you to search through the existing PRs or
issues to make sure that nobody is already working on the same thing. If you are
unsure, it is always a good idea to open an issue to get some feedback.

@@ -235,7 +235,7 @@ Follow these steps to start contributing:
### Checklist

1. The title of your pull request should be a summary of its contribution;
2. If your pull request addresses an issue, please mention the issue number in
2. If your pull request addresses an issue, please mention the issue number in
the pull request description to make sure they are linked (and people
consulting the issue know you are working on it);
3. To indicate a work in progress please prefix the title with `[WIP]`. These
4 changes: 2 additions & 2 deletions docs/source/installation.md
@@ -80,9 +80,9 @@ cache home followed by ``/transformers/`` (even if you don't have PyTorch instal
So if you don't have any specific environment variable set, the cache directory will be at
``~/.cache/torch/transformers/``.

**Note:** If you have set a shell environment variable for one of the predecessors of this library
**Note:** If you have set a shell environment variable for one of the predecessors of this library
(``PYTORCH_TRANSFORMERS_CACHE`` or ``PYTORCH_PRETRAINED_BERT_CACHE``), those will be used if there is no shell
environment variable for ``TRANSFORMERS_CACHE``.
environment variable for ``TRANSFORMERS_CACHE``.
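
As a rough illustration of that precedence, here is a minimal sketch (not the library's actual resolution code) of how the three variables and the default path combine:

import os

default_cache = os.path.join(os.path.expanduser("~"), ".cache", "torch", "transformers")
cache_dir = os.getenv(
    "TRANSFORMERS_CACHE",
    os.getenv("PYTORCH_TRANSFORMERS_CACHE", os.getenv("PYTORCH_PRETRAINED_BERT_CACHE", default_cache)),
)
print(cache_dir)  # TRANSFORMERS_CACHE wins, then the two legacy variables, then the default path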

### Note on model downloads (Continuous Integration or large-scale deployments)

4 changes: 2 additions & 2 deletions docs/source/migration.md
@@ -20,7 +20,7 @@ Here is a quick summary of what you should take care of when migrating from `pyt

The main breaking change when migrating from `pytorch-pretrained-bert` to 🤗 Transformers is that the models forward method always outputs a `tuple` with various elements depending on the model and the configuration parameters.

The exact content of the tuples for each model are detailed in the models' docstrings and the [documentation](https://huggingface.co/transformers/).
The exact content of the tuples for each model are detailed in the models' docstrings and the [documentation](https://huggingface.co/transformers/).

In pretty much every case, you will be fine by taking the first element of the output as the output you previously used in `pytorch-pretrained-bert`.
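
For instance, a minimal sketch of that pattern (the checkpoint name here is only illustrative):

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model(**inputs)
sequence_output = outputs[0]  # first element of the tuple: what pytorch-pretrained-bert returned directly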

@@ -109,7 +109,7 @@ for batch in train_data:
loss.backward()
optimizer.step()

### In 🤗 Transformers, optimizer and schedules are split and instantiated like this:
### In 🤗 Transformers, optimizer and schedules are split and instantiated like this:
optimizer = AdamW(model.parameters(), lr=lr, correct_bias=False) # To reproduce BertAdam specific behavior set correct_bias=False
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps) # PyTorch scheduler
### and used like this:
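### A rough sketch of how that pair is then driven inside the loop (not the exact original snippet);
### it assumes `torch` is imported and `model`, `train_data` and `max_grad_norm` are defined as above:
for batch in train_data:
    outputs = model(batch)  # assuming labels are passed so the first output is the loss
    loss = outputs[0]
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)  # clipping is now done outside the optimizer
    optimizer.step()
    scheduler.step()  # the schedule must be stepped explicitly, unlike BertAdam's built-in warmup
    optimizer.zero_grad()
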
2 changes: 1 addition & 1 deletion docs/source/model_sharing.rst
@@ -119,7 +119,7 @@ Other files can safely be deleted.
Upload your model with the CLI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now go in a terminal and run the following command. It should be in the virtual environment where you installed 🤗
Now go in a terminal and run the following command. It should be in the virtual environment where you installed 🤗
Transformers, since that command :obj:`transformers-cli` comes from the library.

.. code-block::
4 changes: 2 additions & 2 deletions docs/source/task_summary.rst
@@ -510,8 +510,8 @@ As a default all models apply *Top-K* sampling when used in pipelines, as config
Here, the model generates a random text with a total maximal length of *50* tokens from context *"As far as I am
concerned, I will"*. The default arguments of ``PreTrainedModel.generate()`` can be directly overridden in the
pipeline, as is shown above for the argument ``max_length``.
concerned, I will"*. The default arguments of ``PreTrainedModel.generate()`` can be directly overriden in the pipeline,
as is shown above for the argument ``max_length``.
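
A short sketch of that override, mirroring the surrounding text (the default model is downloaded on first use):

from transformers import pipeline

text_generator = pipeline("text-generation")
print(text_generator("As far as I am concerned, I will", max_length=50, do_sample=False))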

Here is an example of text generation using ``XLNet`` and its tokenizer.

5 changes: 3 additions & 2 deletions examples/adversarial/utils_hans.py
@@ -291,9 +291,10 @@ def hans_convert_examples_to_features(
Args:
examples: List of ``InputExamples`` containing the examples.
label_list: List of labels. Can be obtained from the processor using the ``processor.get_labels()`` method.
max_length: Maximum example length.
tokenizer: Instance of a tokenizer that will tokenize the examples.
max_length: Maximum example length.
label_list: List of labels. Can be obtained from the processor using the ``processor.get_labels()`` method.
output_mode: String indicating the output mode. Either ``regression`` or ``classification``.
Returns:
A list of task-specific ``InputFeatures`` which can be fed to the model.
2 changes: 1 addition & 1 deletion examples/bert-loses-patience/pabee/modeling_pabee_bert.py
@@ -155,7 +155,7 @@ def forward(
extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)

# If a 2D or 3D attention mask is provided for the cross-attention
# we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
# we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
if self.config.is_decoder and encoder_hidden_states is not None:
encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()
encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
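
A self-contained sketch of what that broadcasting amounts to for an ordinary 2D padding mask (shapes only; the library's helper also handles 3D masks and dtype details):

import torch

batch_size, seq_length = 2, 7
attention_mask = torch.ones(batch_size, seq_length)   # (bs, seq_length), 1 = keep, 0 = pad
extended_mask = attention_mask[:, None, None, :]      # (bs, 1, 1, seq_length), broadcasts over heads and query positions
extended_mask = (1.0 - extended_mask) * -10000.0      # additive mask added to the attention scores
assert extended_mask.shape == (batch_size, 1, 1, seq_length)
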
4 changes: 2 additions & 2 deletions examples/deebert/src/modeling_highway_bert.py
@@ -198,7 +198,7 @@ def forward(
extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)

# If a 2D or 3D attention mask is provided for the cross-attention
# we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
# we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
if encoder_attention_mask.dim() == 3:
encoder_extended_attention_mask = encoder_attention_mask[:, None, :, :]
if encoder_attention_mask.dim() == 2:
@@ -260,7 +260,7 @@ def forward(self, encoder_outputs):

# BertModel
bmodel_output = (pooler_input, pooler_output) + encoder_outputs[1:]
# "return" bmodel_output
# "return" bodel_output

# Dropout and classification
pooled_output = bmodel_output[1]
6 changes: 3 additions & 3 deletions examples/distillation/distiller.py
@@ -265,7 +265,7 @@ def prepare_batch_clm(self, batch):
-------
token_ids: `torch.tensor(bs, seq_length)` - The token ids after the modifications for MLM.
attn_mask: `torch.tensor(bs, seq_length)` - The attention mask for the self-attention.
clm_labels: `torch.tensor(bs, seq_length)` - The causal language modeling labels. There is a -100 where there is nothing to predict.
clm_labels: `torch.tensor(bs, seq_length)` - The causal language modeling labels. There is a -100 where there is nothing to predict.
"""
token_ids, lengths = batch
token_ids, lengths = self.round_batch(x=token_ids, lengths=lengths)
@@ -401,9 +401,9 @@ def step(self, input_ids: torch.tensor, attention_mask: torch.tensor, lm_labels:
# https://github.com/peterliht/knowledge-distillation-pytorch/blob/master/model/net.py#L100
# https://github.com/peterliht/knowledge-distillation-pytorch/issues/2
if self.params.restrict_ce_to_mask:
mask = (lm_labels > -1).unsqueeze(-1).expand_as(s_logits) # (bs, seq_length, voc_size)
mask = (lm_labels > -1).unsqueeze(-1).expand_as(s_logits) # (bs, seq_length, voc_size)
else:
mask = attention_mask.unsqueeze(-1).expand_as(s_logits) # (bs, seq_length, voc_size)
mask = attention_mask.unsqueeze(-1).expand_as(s_logits) # (bs, seq_length, voc_size)
s_logits_slct = torch.masked_select(s_logits, mask) # (bs * seq_length * voc_size) modulo the 1s in mask
s_logits_slct = s_logits_slct.view(-1, s_logits.size(-1)) # (bs * seq_length, voc_size) modulo the 1s in mask
t_logits_slct = torch.masked_select(t_logits, mask) # (bs * seq_length * voc_size) modulo the 1s in mask
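
For context, a hedged sketch of how such selected logits typically feed a temperature-scaled KL distillation loss (the `temperature` value and loss object here are illustrative, not read from this file):

import torch
import torch.nn.functional as F

temperature = 2.0
kl_loss = torch.nn.KLDivLoss(reduction="batchmean")

s_logits_slct = torch.randn(12, 30522)  # (n_selected_tokens, voc_size), as built above
t_logits_slct = torch.randn(12, 30522)

loss_ce = kl_loss(
    F.log_softmax(s_logits_slct / temperature, dim=-1),  # student log-probabilities
    F.softmax(t_logits_slct / temperature, dim=-1),      # teacher probabilities
) * temperature ** 2
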
6 changes: 3 additions & 3 deletions examples/distillation/lm_seqs_dataset.py
@@ -61,7 +61,7 @@ def check(self):

def remove_long_sequences(self):
"""
Sequences that are too long are split by chunk of max_model_input_size.
Sequences that are too long are split by chunk of max_model_input_size.
"""
max_len = self.params.max_model_input_size
indices = self.lengths > max_len
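
A minimal, self-contained sketch of the chunking idea described in the docstring (the real dataset code may additionally handle special tokens at chunk boundaries):

import numpy as np

def split_long_sequence(token_ids: np.ndarray, max_model_input_size: int):
    """Split one over-long sequence into chunks of at most `max_model_input_size` tokens."""
    return [token_ids[i : i + max_model_input_size]
            for i in range(0, len(token_ids), max_model_input_size)]

chunks = split_long_sequence(np.arange(1000), max_model_input_size=512)
assert [len(c) for c in chunks] == [512, 488]
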
@@ -138,8 +138,8 @@ def print_statistics(self):
# logger.info(f'{data_len} tokens ({nb_unique_tokens} unique)')

# unk_idx = self.params.special_tok_ids['unk_token']
# nb_unknown = sum([(t==unk_idx).sum() for t in self.token_ids])
# logger.info(f'{nb_unknown} unknown tokens (covering {100*nb_unknown/data_len:.2f}% of the data)')
# nb_unknown = sum([(t==unk_idx).sum() for t in self.token_ids])
# logger.info(f'{nb_unknown} unknown tokens (covering {100*nb_unknown/data_len:.2f}% of the data)')

def batch_sequences(self, batch):
"""
4 changes: 2 additions & 2 deletions examples/distillation/scripts/extract.py
@@ -96,7 +96,7 @@
compressed_sd["lm_head.weight"] = state_dict["lm_head.weight"]

print(f"N layers selected for distillation: {std_idx}")
print(f"Number of params transferred for distillation: {len(compressed_sd.keys())}")
print(f"Number of params transfered for distillation: {len(compressed_sd.keys())}")

print(f"Save transferred checkpoint to {args.dump_checkpoint}.")
print(f"Save transfered checkpoint to {args.dump_checkpoint}.")
torch.save(compressed_sd, args.dump_checkpoint)
12 changes: 6 additions & 6 deletions examples/lxmert/modeling_frcnn.py
@@ -266,14 +266,14 @@ def find_top_rpn_proposals(
):
"""Args:
proposals (list[Tensor]): (L, N, Hi*Wi*A, 4).
pred_objectness_logits: tensors of length L.
pred_objectness_logits: tensors of length L.
nms_thresh (float): IoU threshold to use for NMS
pre_nms_topk (int): before nms
post_nms_topk (int): after nms
min_box_side_len (float): minimum proposal box side
training (bool): True if proposals are to be used in training,
Returns:
results (List[Dict]): stores post_nms_topk object proposals for image i.
results (List[Dict]): stores post_nms_topk object proposals for image i.
"""
num_images = len(images)
device = proposals[0].device
@@ -648,7 +648,7 @@ def __init__(
images (ImageList): :class:`ImageList` instance representing N input images
pred_objectness_logits (list[Tensor]): A list of L elements. Element i is a tensor of shape (N, A, Hi, W)
pred_anchor_deltas (list[Tensor]): A list of L elements. Element i is a tensor of shape (N, A*4, Hi, Wi)
anchors (list[torch.Tensor]): nested list of boxes. anchors[i][j] at (n, l) stores anchor array for feature map l
anchors (list[torch.Tensor]): nested list of boxes. anchors[i][j] at (n, l) stores anchor array for feature map l
boundary_threshold (int): if >= 0, then anchors that extend beyond the image boundary by more than boundary_thresh are not used in training.
gt_boxes (list[Boxes], optional): A list of N elements.
smooth_l1_beta (float): The transition point between L1 and L2 loss. When set to 0, the loss becomes L1. When +inf, it is ignored
@@ -1186,7 +1186,7 @@ def inference(
attr_probs_all, attrs_all = self._predict_attrs(attr_logits, preds_per_image)
features = features.split(preds_per_image, dim=0)

# fun for each image too, also I can experiment and do multiple images
# fun for each image too, also I can experiment and do multiple images
final_results = []
zipped = zip(boxes_all, obj_scores_all, attr_probs_all, attrs_all, sizes)
for i, (boxes, obj_scores, attr_probs, attrs, size) in enumerate(zipped):
@@ -1412,7 +1412,7 @@ def grid_anchors(self, grid_sizes):

def generate_cell_anchors(self, sizes=(32, 64, 128, 256, 512), aspect_ratios=(0.5, 1, 2)):
"""
anchors are continuous geometric rectangles
anchors are continuous geometric rectangles
centered on one feature map point sample.
We can later build the set of anchors
for the entire feature map by tiling these tensors
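
A hedged sketch of the standard recipe for such cell anchors, centered on the origin with equal area per size (the file above may differ in details):

import torch

def make_cell_anchors(sizes=(32, 64, 128, 256, 512), aspect_ratios=(0.5, 1, 2)):
    anchors = []
    for size in sizes:
        area = float(size) ** 2
        for ratio in aspect_ratios:
            w = (area / ratio) ** 0.5  # keep w * h == area while h / w == ratio
            h = ratio * w
            anchors.append([-w / 2.0, -h / 2.0, w / 2.0, h / 2.0])  # (x0, y0, x1, y1) around (0, 0)
    return torch.tensor(anchors)

print(make_cell_anchors().shape)  # 5 sizes x 3 ratios -> torch.Size([15, 4])
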
@@ -1865,7 +1865,7 @@ def inference(
scales_yx=None,
**kwargs,
):
# run images through backbone
# run images through backbone
original_sizes = image_shapes * scales_yx
features = self.backbone(images)

2 changes: 1 addition & 1 deletion examples/lxmert/processing_image.py
@@ -116,7 +116,7 @@ def __call__(self, images, single_image=False):
images = self.aug(images)
# transpose images and convert to torch tensors
# images = [torch.as_tensor(i.astype("float32")).permute(2, 0, 1).to(self.device) for i in images]
# now normalize before pad to avoid useless arithmetic
# now normalize before pad to avoid useless arithmetic
images = [self.normalizer(x) for x in images]
# now pad them to do the following operations
images, sizes = self.pad(images)
4 changes: 2 additions & 2 deletions examples/lxmert/utils.py
@@ -236,7 +236,7 @@ def compare(in_tensor):
), f"{sum([1 for x in np.isclose(n1, n2, rtol=0.01, atol=0.1).flatten() if x == False])/len(n1.flatten())*100:.4f} % element-wise mismatch"
raise Exception("tensors are all good")

# Hugging face functions below
# Hugging face functions below


def is_remote_url(url_or_filename):
@@ -520,7 +520,7 @@ def get_image_from_url(url):
return img


# to load legacy frcnn checkpoint from detectron
# to load legacy frcnn checkpoint from detectron
def load_frcnn_pkl_from_url(url):
fn = url.split("/")[-1]
if fn not in os.listdir(os.getcwd()):
2 changes: 1 addition & 1 deletion examples/movement-pruning/counts_parameters.py
@@ -33,7 +33,7 @@ def main(args):
remaining_count = 0 # Number of remaining (not pruned) params in the encoder
encoder_count = 0 # Number of params in the encoder

print("name".ljust(60, " "), "Remaining Weights %", "Remaining Weight")
print("name".ljust(60, " "), "Remaining Weights %", "Remaning Weight")
for name, param in st.items():
if "encoder" not in name:
continue
4 changes: 2 additions & 2 deletions examples/movement-pruning/emmental/modeling_bert_masked.py
@@ -591,7 +591,7 @@ def forward(
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0

# If a 2D or 3D attention mask is provided for the cross-attention
# we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
# we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
if self.config.is_decoder and encoder_hidden_states is not None:
encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()
encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
@@ -631,7 +631,7 @@ def forward(
) # We can specify head_mask for each layer
head_mask = head_mask.to(
dtype=next(self.parameters()).dtype
) # switch to float if need + fp16 compatibility
) # switch to float if need + fp16 compatibility
else:
head_mask = [None] * self.config.num_hidden_layers

6 changes: 3 additions & 3 deletions examples/movement-pruning/masked_run_glue.py
@@ -225,7 +225,7 @@ def train(args, train_dataset, model, tokenizer, teacher=None):
desc="Epoch",
disable=args.local_rank not in [-1, 0],
)
set_seed(args) # Added here for reproducibility
set_seed(args) # Added here for reproducibility
for _ in train_iterator:
epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
for step, batch in enumerate(epoch_iterator):
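
The `set_seed(args)` call above typically boils down to seeding every RNG in play; a hedged sketch (the actual helper is defined elsewhere in the script):

import random
import numpy as np
import torch

def set_seed_sketch(seed: int, n_gpu: int = 0):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if n_gpu > 0:
        torch.cuda.manual_seed_all(seed)  # seed every visible GPU as well
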
@@ -705,7 +705,7 @@ def main():
"--final_lambda",
default=0.0,
type=float,
help="Regularization intensity (used in conjunction with `regularization`.",
help="Regularization intensity (used in conjunction with `regulariation`.",
)

parser.add_argument("--global_topk", action="store_true", help="Global TopK on the Scores.")
@@ -816,7 +816,7 @@ def main():
if args.local_rank == -1 or args.no_cuda:
device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count()
else: # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
else: # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
torch.cuda.set_device(args.local_rank)
device = torch.device("cuda", args.local_rank)
torch.distributed.init_process_group(backend="nccl")
6 changes: 3 additions & 3 deletions examples/movement-pruning/masked_run_squad.py
@@ -231,7 +231,7 @@ def train(args, train_dataset, model, tokenizer, teacher=None):
train_iterator = trange(
epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]
)
# Added here for reproducibility
# Added here for reproducibility
set_seed(args)

for _ in train_iterator:
@@ -824,7 +824,7 @@ def main():
"--final_lambda",
default=0.0,
type=float,
help="Regularization intensity (used in conjunction with `regularization`.",
help="Regularization intensity (used in conjunction with `regulariation`.",
)

parser.add_argument("--global_topk", action="store_true", help="Global TopK on the Scores.")
@@ -977,7 +977,7 @@ def main():
if args.local_rank == -1 or args.no_cuda:
device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count()
else: # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
else: # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
torch.cuda.set_device(args.local_rank)
device = torch.device("cuda", args.local_rank)
torch.distributed.init_process_group(backend="nccl")
8 changes: 4 additions & 4 deletions examples/rag/distributed_retriever.py
@@ -16,7 +16,7 @@
class RagPyTorchDistributedRetriever(RagRetriever):
"""
A distributed retriever built on top of the ``torch.distributed`` communication package. During training all workers
initialize their own instance of the retriever, however, only the main worker loads the index into memory. The index is stored
initialize their own instance of the retriever, however, only the main worker loads the index into memory. The index is stored
in cpu memory. The index will also work well in a non-distributed setup.
Args:
@@ -45,7 +45,7 @@ def __init__(self, config, question_encoder_tokenizer, generator_tokenizer, inde

def init_retrieval(self, distributed_port: int):
"""
Retriever initialization function, needs to be called from the training process. The function sets some common parameters
Retriever initialization function, needs to be called from the training process. The function sets some common parameters
and environment variables. On top of that, (only) the main process in the process group loads the index into memory.
Args:
@@ -56,7 +56,7 @@ def init_retrieval(self, distributed_port: int):

logger.info("initializing retrieval")

# initializing a separate process group for retrieval as the default
# initializing a separate process group for retrieval as the default
# nccl backend doesn't support gather/scatter operations while gloo
# is too slow to replace nccl for the core gpu communication
if dist.is_initialized():
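
A hedged sketch of what creating that side channel can look like with `torch.distributed` (group membership simplified relative to the retriever's real code):

import torch.distributed as dist

if dist.is_available() and dist.is_initialized():
    # keep nccl for the core GPU traffic; use a gloo group for CPU-side gather/scatter
    retrieval_group = dist.new_group(ranks=list(range(dist.get_world_size())), backend="gloo")
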
@@ -101,7 +101,7 @@ def retrieve(self, question_hidden_states: np.ndarray, n_docs: int) -> Tuple[np.
n_docs (:obj:`int`):
The number of docs retrieved per query.
Output:
Output:
retrieved_doc_embeds (:obj:`np.ndarray` of shape :obj:`(batch_size, n_docs, dim)`
The retrieval embeddings of the retrieved docs per query.
doc_ids (:obj:`np.ndarray` of shape :obj:`batch_size, n_docs`)