From 2ec1b354a162d317d362d6aaad7ce99aeee40b0a Mon Sep 17 00:00:00 2001
From: Stephen Roller
Date: Wed, 12 Aug 2020 19:25:09 -0400
Subject: [PATCH] Remove metrics tutorial from PR.

---
 docs/source/tutorial_metrics.md | 372 --------------------------------
 1 file changed, 372 deletions(-)
 delete mode 100644 docs/source/tutorial_metrics.md

diff --git a/docs/source/tutorial_metrics.md b/docs/source/tutorial_metrics.md
deleted file mode 100644
index 650ba4ceb7e..00000000000
--- a/docs/source/tutorial_metrics.md
+++ /dev/null
@@ -1,372 +0,0 @@

# Understanding and adding new metrics

Author: Stephen Roller

## Introduction and Standard Metrics

ParlAI contains a number of built-in metrics that are automatically computed when we train and evaluate models. Some of these metrics are _text generation_ metrics, which are computed any time we generate a text: this includes F1, BLEU and Accuracy.

For example, let's try a Fixed Response model, which always returns a given fixed response, and evaluate it on the DailyDialog dataset:

```
$ parlai eval_model -m fixed_response -t dailydialog --fixed-response "how may i help you ?"
... after a while ...
14:41:40 | Evaluating task dailydialog using datatype valid.
14:41:40 | creating task(s): dailydialog
14:41:41 | Finished evaluating tasks ['dailydialog'] using datatype valid
    accuracy   bleu-4   exs     f1
    .0001239  .002617  8069  .1163
```

We see that we got 0.01239% accuracy, a 0.26% BLEU-4 score, and 11.63% F1 across 8069 examples. What do those metrics mean?

- Accuracy: this is the rate of perfect, exact matches of the response, averaged across all examples in the dataset.
- BLEU-4: this is the [BLEU score](https://en.wikipedia.org/wiki/BLEU) between the predicted response and the reference response. It is measured on tokenized text, and uses NLTK to compute it.
- F1: this is the [unigram](https://en.wikipedia.org/wiki/N-gram) F1 overlap between your text and the reference response.
- exs: the number of examples we have evaluated.

If you don't see the BLEU-4 score, you may need to install NLTK with `pip install nltk`.

We can also measure ROUGE. Note that we need to `pip install py-rouge` for this functionality:

```
$ parlai eval_model -m fixed_response -t dailydialog --fixed-response "how may i help you ?" --metrics rouge
14:47:24 | creating task(s): dailydialog
14:47:31 | Finished evaluating tasks ['dailydialog'] using datatype valid
    accuracy   exs     f1  rouge_1  rouge_2  rouge_L
    .0001239  8069  .1163   .09887  .007285   .09525
```
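Both F1 and `rouge_1` above are built on unigram overlap. To make that concrete, here is a minimal sketch of a unigram F1 computation. It is an illustration only, not ParlAI's exact `F1Metric` (which also normalizes the text before comparing):

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """Toy unigram F1: harmonic mean of token precision and recall."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    # count tokens that appear in both strings (with multiplicity)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(unigram_f1("how may i help you ?", "can i help you with something ?"))  # ~0.62
```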
### Agent-specific metrics

Some agents include their own metrics that are computed for them. For example, generative models automatically compute `ppl` ([perplexity](https://en.wikipedia.org/wiki/Perplexity)) and `token_acc`, both of which measure the generative model's ability to predict individual tokens. As an example, let's evaluate the [BlenderBot](https://parl.ai/projects/recipes/) 90M model on DailyDialog:

```
$ parlai eval_model --task dailydialog -mf zoo:blender/blender_90M/model -bs 32
...
14:54:14 | Evaluating task dailydialog using datatype valid.
14:54:14 | creating task(s): dailydialog
...
15:26:19 | Finished evaluating tasks ['dailydialog'] using datatype valid
    accuracy   bleu-4   ctpb   ctps   exps   exs     f1  gpu_mem   loss       lr  ltpb  ltps    ppl  token_acc  total_train_updates    tpb    tps
           0  .002097  14202  442.5  6.446  8069  .1345    .0384  2.979  7.5e-06  3242   101  19.67      .4133               339012  17445  543.5
```

Here we see a number of extra metrics, each of which we explain below:

- `tpb`, `ctpb`, `ltpb`: stand for tokens per batch, context-tokens per batch, and label-tokens per batch. These are useful for measuring how dense the batches are, and are helpful when experimenting with [dynamic batching](tutorial_fast). tpb is always the sum of ctpb and ltpb.
- `tps`, `ctps`, `ltps`: are similar, but stand for "tokens per second". They measure how fast we are training. Similarly, `exps` measures examples per second.
- `gpu_mem`: a _rough_ estimate of how much GPU memory your model is using. This is useful for determining whether you can increase the model size or the batch size.
- `loss`: the model's loss; for generative models this is the average per-token negative log-likelihood.
- `ppl` and `token_acc`: the perplexity and per-token accuracy. These are generative performance metrics. (The relationship between `loss` and `ppl` is illustrated just below this list.)
- `total_train_updates`: the number of SGD updates this model was trained for. You will see this increase during training, but not during evaluation.
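One sanity check worth knowing: perplexity is the exponential of the average per-token loss, so the `ppl` column can be recomputed from the `loss` column (numbers taken from the table above):

```python
import math

loss = 2.979           # average per-token loss from the table above
print(math.exp(loss))  # ~19.67, matching the reported ppl
```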
## Adding custom metrics

Of course, you may wish to add your own custom metrics: whether this is because you are developing a special model, a special dataset, or otherwise want other information accessible to you. Metrics can be computed by either _the teacher_ OR _the model_. Within the model, they may be computed either _locally_ or _globally_. There are different reasons for choosing each location:

- __Teacher metrics__: This is the best spot for computing metrics that depend on a specific dataset. These metrics will only be available when evaluating on this dataset. They have the advantage of being easy to compute and understand.
- __Global metrics__: Global metrics are computed by the model, and are globally tracked. These metrics are easy to understand and track, but work poorly when doing multitasking.
- __Local metrics__: Local metrics are the model-analogue of teacher metrics. They are computed and recorded on a per-example basis, and so they work well when multitasking. They can be extremely complicated for some models, however.

We will take you through writing each of these methods in turn, and demonstrate examples of how to add these metrics in your setup.

## Teacher metrics

Teacher metrics are useful for items that depend on a specific dataset. For example, in some of our task-oriented datasets, like [`google_sgd`](https://github.com/facebookresearch/ParlAI/blob/master/parlai/tasks/google_sgd/agents.py), we want to additionally compute metrics around slots.

Teacher metrics can be added by adding the following method to your teacher:

```python
    def custom_evaluation(
        self,
        teacher_action: Message,
        labels: Optional[Tuple[str]],
        model_response: Message,
    ) -> None:
        pass
```

The signature for this method is as follows:

- `teacher_action`: this is the last message the teacher sent to the model. This likely contains a "text" and "labels" field, as well as any custom fields you might have.
- `labels`: the gold label(s). This can also be found as information in the `teacher_action`, but it is conveniently extracted for you.
- `model_response`: the full model response, including any extra fields the model may have sent.

Let's take an actual example. We will add a custom metric which calculates how often the model says the word "hello", and call it `hello_avg`.

We will add a [custom teacher](tutorial_task). For this example, we will use the `@register` syntax you may have seen in our [quickstart tutorial](tutorial_quick).

```python
from parlai.core.loader import register_teacher
from parlai.core.metrics import AverageMetric
from parlai.tasks.dailydialog.agents import DefaultTeacher as DailyDialogTeacher


@register_teacher("hello_daily")
class CustomDailyDialogTeacher(DailyDialogTeacher):
    def custom_evaluation(self, teacher_action, labels, model_response) -> None:
        if 'text' not in model_response:
            # model didn't speak, skip this example
            return
        model_text = model_response['text']
        if 'hello' in model_text:
            # count 1 / 1 messages having "hello"
            self.metrics.add('hello_avg', AverageMetric(1, 1))
        else:
            # count 0 / 1 messages having "hello"
            self.metrics.add('hello_avg', AverageMetric(0, 1))


if __name__ == '__main__':
    from parlai.scripts.eval_model import EvalModel

    EvalModel.main(
        task='hello_daily',
        model_file='zoo:blender/blender_90M/model',
        batchsize=32,
    )
```

If we run the script, we will have a new metric in our output:

```
18:07:30 | Finished evaluating tasks ['hello_daily'] using datatype valid
    accuracy   bleu-4  ctpb  ctps   exps   exs     f1  gpu_mem  hello_avg   loss   ltpb   ltps    ppl  token_acc   tpb    tps
           0  .002035  2172   230  3.351  8069  .1346   .05211      .1228  2.979  495.9  52.52  19.67      .4133  2668  282.6
```

__What is AverageMetric?__

Wait, what is this [AverageMetric](parlai.core.metrics.AverageMetric)? All metrics you want to create in ParlAI should be a [Metric](Metric) object. Metric objects define a way of instantiating the metric, a way of combining it with a like metric, and a way of rendering it as a single float value. For an AverageMetric, this means we need to define a numerator and a denominator; the combination of AverageMetrics adds their numerators and denominators separately. As we do this across all examples, the numerator will be the number of examples whose response contains "hello", and the denominator will be the total number of examples. When we go to print the metric, the division will be computed at the last second.

If you're used to writing machine learning code in one-off scripts, you may ask: why do I need to use this metric? Can't I just count and divide myself? While you can do this, your code could not be run in [_distributed mode_](tutorial_fast). If we only returned a single float, we would not be able to know whether some distributed workers received more or fewer examples than others. However, when we explicitly store the numerator and denominator, we can combine and reduce them across multiple nodes, enabling us to train on hundreds of GPUs while still ensuring correctness in all our metrics.

In addition to AverageMetric, there is also [SumMetric](parlai.core.metrics.SumMetric), which keeps a running sum. SumMetric and AverageMetric are the most common ways to construct custom metrics, but others exist as well. For a full list (and views into advanced cases), please see the [metrics API documentation](metrics_api).
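To illustrate SumMetric as well, here is a small variation on the teacher above that records a running count next to the running average. The teacher name `hello_daily_count` and the metric names `hello_avg`/`hello_cnt` are made up for this sketch:

```python
from parlai.core.loader import register_teacher
from parlai.core.metrics import AverageMetric, SumMetric
from parlai.tasks.dailydialog.agents import DefaultTeacher as DailyDialogTeacher


@register_teacher("hello_daily_count")
class HelloCountTeacher(DailyDialogTeacher):
    def custom_evaluation(self, teacher_action, labels, model_response) -> None:
        if 'text' not in model_response:
            return
        said_hello = int('hello' in model_response['text'])
        # fraction of responses containing "hello" (numerator / denominator)
        self.metrics.add('hello_avg', AverageMetric(said_hello, 1))
        # absolute number of responses containing "hello"
        self.metrics.add('hello_cnt', SumMetric(said_hello))
```

It can be evaluated exactly like `hello_daily` above, and the report will then contain both `hello_avg` and `hello_cnt`.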
## Agent (model) level metrics

In the above example, we worked on a metric defined by a Teacher. However, sometimes our models will have special metrics that only they want to compute, which we call an Agent-level metric. Perplexity is one example.

To compute model-level metrics, we can define either a global metric or a local metric. Global metrics can be computed anywhere and are easy to use, but cannot distinguish between different teachers when multitasking. We'll look at another example: counting the number of times the teacher says "hello".

### Global metrics

A global metric is computed anywhere in the model, and has an interface similar to that of the teacher:

```python
agent.global_metrics.add('my_metric', AverageMetric(1, 2))
```

Global metrics are so named because they can be recorded anywhere in agent code. For example, we can add a metric that counts the number of times the model sees the word "hello" in `observe`. We'll do this while extending the `TransformerGeneratorAgent`, so that we can combine it with the BlenderBot model we used earlier.

```python
from parlai.core.metrics import AverageMetric
from parlai.core.loader import register_agent
from parlai.agents.transformer.transformer import TransformerGeneratorAgent


@register_agent('GlobalHelloCounter')
class GlobalHelloCounterAgent(TransformerGeneratorAgent):
    def observe(self, observation):
        retval = super().observe(observation)
        if 'text' in observation:
            text = observation['text']
            self.global_metrics.add(
                'global_hello', AverageMetric(int('hello' in text), 1)
            )
        return retval


if __name__ == '__main__':
    from parlai.scripts.eval_model import EvalModel

    EvalModel.main(
        task='dailydialog',
        model='GlobalHelloCounter',
        model_file='zoo:blender/blender_90M/model',
        batchsize=32,
    )
```

Running the script, we see that our new metric appears. Note that it is different from the metric in the first half of the tutorial: that is because previously we were counting the number of times the model said hello (a lot), but now we are counting how often the dataset says hello.

```
21:57:50 | Finished evaluating tasks ['dailydialog'] using datatype valid
    accuracy   bleu-4   ctpb   ctps   exps   exs     f1  global_hello  gpu_mem   loss  ltpb   ltps    ppl  token_acc    tpb    tps
           0  .002097  14202  435.1  6.338  8069  .1345      .0009914   .02795  2.979  3242  99.32  19.67      .4133  17445  534.4
```

The global metric works well, but has some drawbacks: if we were to start training on a multitask dataset, we would not be able to distinguish the `global_hello` of the two datasets, and we could only compute the micro-average of the combination of the two. Below is an excerpt from a training log with the above agent:

```
09:14:52 | time:112s total_exs:90180 epochs:0.41
                clip  ctpb   ctps   exps   exs  global_hello  gnorm  gpu_mem   loss  lr  ltpb   ltps    ppl  token_acc  total_train_updates    tpb    tps    ups
   all             1  9831  66874  841.9  8416        .01081  2.018    .3474  5.078   1  1746  11878  163.9      .2370                  729  11577  78752  6.803
   convai2                                3434        .01081                  5.288               197.9      .2120
   dailydialog                            4982        .01081                  4.868                 130      .2620
```

Notice how `global_hello` is the same in both, because the model is unable to distinguish between the two settings. In the next section we'll show how to fix this with local metrics.

__On placement__: In the example above, we recorded the global metric inside the `observe` function. However, global metrics can be recorded from anywhere, as the sketch below illustrates.
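As a minimal sketch of that flexibility (hypothetical agent and metric names, not part of the original tutorial), here is the same pattern recording a SumMetric from `batch_act` rather than `observe`:

```python
from parlai.core.metrics import SumMetric
from parlai.core.loader import register_agent
from parlai.agents.transformer.transformer import TransformerGeneratorAgent


@register_agent('GlobalReplyCounter')
class GlobalReplyCounterAgent(TransformerGeneratorAgent):
    def batch_act(self, observations):
        replies = super().batch_act(observations)
        # keep a running total of non-empty replies the model has produced
        n_replies = sum(int(bool(r.get('text'))) for r in replies)
        self.global_metrics.add('global_replies', SumMetric(n_replies))
        return replies


if __name__ == '__main__':
    from parlai.scripts.eval_model import EvalModel

    EvalModel.main(
        task='dailydialog',
        model='GlobalReplyCounter',
        model_file='zoo:blender/blender_90M/model',
        batchsize=32,
    )
```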
### Local metrics

Having observed that global metrics cannot distinguish settings when multitasking, we would like to improve upon this. Let's add a local metric, which is recorded _per example_. By recording this metric per example, we can unambiguously identify which metrics came from which dataset, and report averages correctly.

Local metrics have a limitation: they can only be computed inside the scope of `batch_act`. This includes common places like `compute_loss` or `generate`, where we often want to instrument specific behavior.

Let's look at an example. We'll add a metric inside the `batchify` function, which is called from within `batch_act`, and is used to convert from a list of [Message](messages) objects to a [Batch](torch_agent.html#parlai.core.torch_agent.Batch) object. It is where we do things like padding, etc. We'll do something slightly different from our previous runs: in this case, we'll record, for each example, whether its input text contains the word "hello".

```python
from parlai.core.metrics import AverageMetric
from parlai.core.loader import register_agent
from parlai.agents.transformer.transformer import TransformerGeneratorAgent


@register_agent('LocalHelloCounter')
class LocalHelloCounterAgent(TransformerGeneratorAgent):
    def batchify(self, observations):
        batch = super().batchify(observations)
        if hasattr(batch, 'text_vec'):
            num_hello = ["hello" in o['text'] for o in observations]
            self.record_local_metric(
                'local_hello',
                # AverageMetric.many(seq) is shorthand for
                # [AverageMetric(item) for item in seq]
                AverageMetric.many(num_hello),
            )
        return batch


if __name__ == '__main__':
    from parlai.scripts.train_model import TrainModel

    TrainModel.main(
        task='dailydialog,convai2',
        model='LocalHelloCounter',
        dict_file='zoo:blender/blender_90M/model.dict',
        batchsize=32,
    )
```

When we run this training script, we get one such output:

```
09:49:00 | time:101s total_exs:56160 epochs:0.26
                clip  ctpb   ctps   exps   exs  gnorm  gpu_mem  local_hello   loss  lr   ltpb  ltps    ppl  token_acc  total_train_updates   tpb    tps   ups
   all             1  3676  63204  550.2  5504  2.146    .1512       .01423  4.623   1  436.2  7500  101.8      .2757                 1755  4112  70704  17.2
   convai2                                3652                       .02793  4.659               105.5      .2651
   dailydialog                            1852                       .00054  4.587               98.17      .2863
```

Notice how the `local_hello` metric can now distinguish between hellos coming from convai2 and those coming from DailyDialog? A combined average would hide the fact that one dataset has many hellos, and the other has almost none.

Local metrics are primarily worth implementing when you care about the fidelity of _train-time metrics_. During evaluation, we evaluate each dataset individually, so we can ensure global metrics are not mixed up.

__Under the hood__: Local metrics work by including a "metrics" field in the return message. This is a dictionary which maps field name to a metric value. When the teacher receives the response from the model, it utilizes the metrics field to update counters on its side. A toy illustration follows below.
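Here is a toy illustration of that idea. It is not ParlAI's actual aggregation code; it just shows how per-example Metric objects combine into a correctly weighted average:

```python
from parlai.core.metrics import AverageMetric

# Two hypothetical per-example responses, each carrying a 'metrics' dict,
# roughly what record_local_metric attaches under the hood.
responses = [
    {'text': 'hello there', 'metrics': {'local_hello': AverageMetric(1, 1)}},
    {'text': 'good morning', 'metrics': {'local_hello': AverageMetric(0, 1)}},
]

# Teacher-side aggregation boils down to adding Metric objects together:
total = None
for response in responses:
    metric = response['metrics']['local_hello']
    total = metric if total is None else total + metric

print(total.value())  # 0.5: half of these responses contained "hello"
```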