[docs] Add docs explaining metrics. (#3498)
* Add docs explaining metrics.

* Slightly change title
stephenroller authored Mar 8, 2021
1 parent c110f73 commit 7224486
Showing 1 changed file with 43 additions and 1 deletion.
44 changes: 43 additions & 1 deletion docs/source/tutorial_metrics.md
@@ -1,9 +1,13 @@
# Understanding and adding new metrics
# Understanding and adding metrics

Author: Stephen Roller

## Introduction and Standard Metrics

:::{tip} List of metrics
If you're not sure what a metric means, refer to our [List of metrics](#list-of-metrics).
:::

ParlAI contains a number of built-in metrics that are automatically computed when
we train and evaluate models. Some of these metrics are _text generation_ metrics,
which are computed any time we generate text: these include F1, BLEU and Accuracy.
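
As a rough illustration of what these text generation metrics measure, here is a minimal sketch that assumes the `ExactMatchMetric` and `F1Metric` helpers in `parlai.core.metrics` (treat the exact signatures as approximate):

```python
from parlai.core.metrics import ExactMatchMetric, F1Metric

guess = 'hello world'
answers = ['hello there', 'hi world']

# Accuracy is an exact string match against any gold answer;
# F1 is unigram overlap with the best-matching gold answer.
print(ExactMatchMetric.compute(guess, answers).value())  # 0.0: no exact match
print(F1Metric.compute(guess, answers).value())          # 0.5: partial unigram overlap
```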
@@ -53,6 +57,7 @@ One nice thing about metrics is that they are automatically logged to the
statements into your code.



### Agent-specific metrics

Some agents include their own metrics that are computed for them. For example,
@@ -402,3 +407,40 @@ __Under the hood__: Local metrics work by including a "metrics" field in the
return message. This is a dictionary which maps field name to a metric value.
When the teacher receives the response from the model, it utilizes the metrics
field to update counters on its side.
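
As a rough sketch (not the exact internal code; `my_metric` is just a placeholder name), such a response message might look like this:

```python
from parlai.core.message import Message
from parlai.core.metrics import AverageMetric

# A model response carrying a local metric back to the teacher. The teacher
# reads the 'metrics' dict and adds each value into its own counters.
response = Message(
    {
        'text': 'Hello there!',
        'metrics': {'my_metric': AverageMetric(0.75)},  # placeholder metric
    }
)
```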

## List of Metrics

Below is a list of metrics and a brief explanation of each.

:::{note} List of metrics
If you find a metric not listed here,
please [file an issue on GitHub](https://github.com/facebookresearch/ParlAI/issues/new?assignees=&labels=Docs,Metrics&template=other.md).
:::

| Metric | Explanation |
| ----------------------- | ------------ |
| `accuracy` | Exact match text accuracy |
| `bleu-4` | BLEU-4 of the generation, under a standardized (model-independent) tokenizer |
| `clip` | Fraction of batches with clipped gradients |
| `ctpb` | Context tokens per batch |
| `ctps` | Context tokens per second |
| `exps` | Examples per second |
| `exs` | Number of examples processed since last print |
| `f1` | Unigram F1 overlap, under a standardized (model-independent) tokenizer |
| `gnorm` | Gradient norm |
| `gpu_mem` | Fraction of GPU memory used. May slightly underestimate true value. |
| `hits@1`, `hits@5`, ... | Fraction of correct choices in K guesses. (Similar to recall@K) |
| `interdistinct-1`, `interdistinct-2` | Fraction of n-grams unique across _all_ generations |
| `intradistinct-1`, `intradistinct-2` | Fraction of n-grams unique _within_ each utterance |
| `jga` | Joint Goal Accuracy |
| `loss` | Loss |
| `lr` | The most recent learning rate applied |
| `ltpb` | Label tokens per batch |
| `ltps` | Label tokens per second |
| `rouge-1`, `rouge-2`, `rouge-L` | ROUGE metrics |
| `token_acc` | Token-wise accuracy (generative only) |
| `token_em` | Utterance-level token accuracy. Roughly corresponds to perfection under greedy search (generative only) |
| `total_train_updates` | Number of SGD steps taken across all batches |
| `tpb` | Total tokens (context + label) per batch |
| `tps` | Total tokens (context + label) per second |
| `ups` | Updates per second (approximate) |
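
Most of these metrics appear in the report produced during evaluation. For example (a minimal sketch; the task, model, and example count here are arbitrary illustrative choices), an evaluation run like the following reports many of the metrics above:

```python
from parlai.scripts.eval_model import EvalModel

# Evaluate a trivial baseline on a handful of examples; the logged report
# includes metrics from the table above, such as accuracy, f1, and exs.
EvalModel.main(task='dailydialog', model='repeat_label', num_examples=10)
```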
