# Understanding and adding new metrics

Author: Stephen Roller

## Introduction and Standard Metrics

ParlAI contains a number of built-in metrics that are automatically computed when
we train and evaluate models. Some of these metrics are _text generation_ metrics,
which are computed any time we generate text: these include F1, BLEU, and Accuracy.

For example, let's try a Fixed Response model, which always returns a given fixed
response, and evaluate it on the DailyDialog dataset:

```
$ parlai eval_model -m fixed_response -t dailydialog --fixed-response "how may i help you ?"
... after a while ...
14:41:40 | Evaluating task dailydialog using datatype valid.
14:41:40 | creating task(s): dailydialog
14:41:41 | Finished evaluating tasks ['dailydialog'] using datatype valid
accuracy bleu-4 exs f1
.0001239 .002617 8069 .1163
```

We see that we got 0.01239% accuracy, a 0.26% BLEU-4 score, and 11.63% F1 across
8069 examples. What do those metrics mean?

- Accuracy: this is the rate of perfect, exact matches of the reference response,
  averaged across all examples in the dataset.
- BLEU-4: this is the [BLEU score](https://en.wikipedia.org/wiki/BLEU) between
  the predicted response and the reference response. It is measured on
  tokenized text, and uses NLTK to compute it.
- F1: this is the [unigram](https://en.wikipedia.org/wiki/N-gram) F1 overlap
  between your text and the reference response (see the sketch below).
- exs: the number of examples we have evaluated.
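
To build some intuition for the F1 number, here is a rough sketch of how a unigram
F1 overlap between a prediction and a reference can be computed. This is purely
illustrative and is not ParlAI's exact implementation (which also normalizes the
text before comparing):

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """Toy unigram F1: harmonic mean of unigram precision and recall."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    # number of overlapping unigrams, counting multiplicity
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# e.g. unigram_f1("how may i help you ?", "can i help you with that ?") ~= 0.62
```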

If you don't see the BLEU-4 score, you may need to install NLTK with
`pip install nltk`.

We can also measure ROUGE. Note that we need to `pip install py-rouge` for this
functionality:

```
$ parlai eval_model -m fixed_response -t dailydialog --fixed-response "how may i help you ?" --metrics rouge
14:47:24 | creating task(s): dailydialog
14:47:31 | Finished evaluating tasks ['dailydialog'] using datatype valid
accuracy exs f1 rouge_1 rouge_2 rouge_L
.0001239 8069 .1163 .09887 .007285 .09525
```

### Agent-specific metrics

Some agents include their own metrics that are computed for them. For example,
generative models automatically compute `ppl`
([perplexity](https://en.wikipedia.org/wiki/Perplexity)) and `token_acc`, both
of which measure the generative model's ability to predict individual tokens. As
an example, let's evaluate the [BlenderBot](https://parl.ai/projects/recipes/)
90M model on DailyDialog:

```
$ parlai eval_model --task dailydialog -mf zoo:blender/blender_90M/model -bs 32
...
14:54:14 | Evaluating task dailydialog using datatype valid.
14:54:14 | creating task(s): dailydialog
...
15:26:19 | Finished evaluating tasks ['dailydialog'] using datatype valid
accuracy bleu-4 ctpb ctps exps exs f1 gpu_mem loss lr ltpb ltps ppl token_acc total_train_updates tpb tps
0 .002097 14202 442.5 6.446 8069 .1345 .0384 2.979 7.5e-06 3242 101 19.67 .4133 339012 17445 543.5
```

Here we see a number of extra metrics, each of which we explain below:
- `tpb`, `ctpb`, `ltpb`: stand for tokens per batch, context-tokens per batch,
  and label-tokens per batch. These are useful for measuring how dense the
  batches are, and are helpful when experimenting with [dynamic
  batching](tutorial_fast). tpb is always the sum of ctpb and ltpb.
- `tps`, `ctps`, `ltps`: similar, but stand for "tokens per second". They
  measure how fast we are training. Similarly, `exps` measures examples per
  second.
- `gpu_mem`: measures _roughly_ how much GPU memory your model is using, but it
  is only approximate. This is useful for determining whether you can increase
  the model size or the batch size.
- `loss`: the loss metric.
- `ppl` and `token_acc`: the perplexity and per-token accuracy. These are
  generative performance metrics; see the sketch below for how `ppl` relates to
  `loss`.
- `total_train_updates`: the number of SGD updates this model was trained for.
  You will see this increase during training, but not during evaluation.
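
As a rough sketch of how `loss` and `ppl` relate: perplexity is the exponential of
the average per-token negative log-likelihood, which is what the `loss` column
reports, so the two numbers above are consistent with each other:

```python
import math

# the reported `loss` (mean negative log-likelihood per token)
avg_token_nll = 2.979
print(round(math.exp(avg_token_nll), 2))  # 19.67, matching the reported `ppl`
```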

## Adding custom metrics

Of course, you may wish to add your own custom metrics: perhaps because you are
developing a special model or a special dataset, or because you otherwise want
other information accessible to you. Metrics can be computed by either _the teacher_ OR
_the model_. Within the model, they may be computed either _locally_ or _globally_.
There are different reasons for choosing each location:

- __Teacher metrics__: This is the best spot for computing metrics that depend
  on a specific dataset. These metrics will only be available when evaluating
  on that dataset. They have the advantage of being easy to compute and
  understand.
- __Global metrics__: Global metrics are computed by the model, and are globally
  tracked. These metrics are easy to understand and track, but work poorly
  when multitasking.
- __Local metrics__: Local metrics are the model-analogue of teacher metrics.
  They are computed and recorded on a per-example basis, and so they work well
  when multitasking. They can be extremely complicated for some models, however.

We will take you through writing each of these methods in turn, and demonstrate
examples of how to add these metrics in your setup.

## Teacher metrics

Teacher metrics are useful for items that depend on a specific dataset.
For example, in some of our task-oriented datasets, like
[`google_sgd`](https://github.com/facebookresearch/ParlAI/blob/master/parlai/tasks/google_sgd/agents.py),
we want to additionally compute metrics around slots.

Teacher metrics can be added by adding the following method to your teacher:

```python
def custom_evaluation(
    self,
    teacher_action: Message,
    labels: Optional[Tuple[str]],
    model_response: Message,
) -> None:
    pass
```

The signature for this method is as follows:
- `teacher_action`: this is the last message the teacher sent to the model. This
  likely contains a "text" and a "labels" field, as well as any custom fields you
  might have.
- `labels`: the gold label(s). These can also be found as information in the
  `teacher_action`, but they are conveniently extracted for you.
- `model_response`: the full model response, including any extra fields the model
  may have sent.

Let's take an actual example. We will add a custom metric which calculates
how often the model says the word "hello", and call it `hello_avg`.

We will add a [custom teacher](tutorial_task). For this example, we will use
the `@register` syntax you may have seen in our [quickstart
tutorial](tutorial_quick).

```python
from parlai.core.loader import register_teacher
from parlai.core.metrics import AverageMetric
from parlai.tasks.dailydialog.agents import DefaultTeacher as DailyDialogTeacher


@register_teacher("hello_daily")
class CustomDailyDialogTeacher(DailyDialogTeacher):
    def custom_evaluation(
        self, teacher_action, labels, model_response
    ) -> None:
        if 'text' not in model_response:
            # model didn't speak, skip this example
            return
        model_text = model_response['text']
        if 'hello' in model_text:
            # count 1 / 1 messages having "hello"
            self.metrics.add('hello_avg', AverageMetric(1, 1))
        else:
            # count 0 / 1 messages having "hello"
            self.metrics.add('hello_avg', AverageMetric(0, 1))


if __name__ == '__main__':
    from parlai.scripts.eval_model import EvalModel

    EvalModel.main(
        task='hello_daily',
        model_file='zoo:blender/blender_90M/model',
        batchsize=32,
    )
```

If we run the script, we will have a new metric in our output:

```
18:07:30 | Finished evaluating tasks ['hello_daily'] using datatype valid
accuracy bleu-4 ctpb ctps exps exs f1 gpu_mem hello_avg loss ltpb ltps ppl token_acc tpb tps
0 .002035 2172 230 3.351 8069 .1346 .05211 .1228 2.979 495.9 52.52 19.67 .4133 2668 282.6
```

__What is AverageMetric?__

Wait, what is this
[AverageMetric](parlai.core.metrics.AverageMetric)? All metrics
you want to create in ParlAI should be a
[Metric](Metric) object. Metric objects
define a way of instantiating the metric, a way of combining it with another
metric of the same type, and a way of rendering it as a single float value. For an
AverageMetric, this means we need to define a numerator and a denominator;
combining AverageMetrics adds their numerators and denominators
separately. As we do this across all examples, the numerator will be the number
of examples containing "hello", and the denominator will be the total number of
examples. When we go to print the metric, the division is only performed at the
very end.
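
To make the combination behavior concrete, here is a small sketch using made-up
counts and the standard Metric arithmetic (`+` and `.value()`):

```python
from parlai.core.metrics import AverageMetric

# two per-example observations: one response contained "hello", one did not
first = AverageMetric(1, 1)
second = AverageMetric(0, 1)

# combining adds numerators and denominators separately: 1 / 2
combined = first + second
print(combined.value())  # 0.5 -- the division only happens when the value is read
```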

If you're used to writing machine learning code in one-off scripts, you may ask
why you need to use this Metric object at all. Can't you just count and divide
yourself? While you can do this, your code could not be run in [_distributed
mode_](tutorial_fast). If we only returned a single float, we would not be able
to know whether some distributed workers received more or fewer examples than
others. However, when we explicitly store the numerator and denominator, we can
combine and reduce them across multiple nodes, enabling us to train on hundreds
of GPUs, while still ensuring correctness in all our metrics.

In addition to AverageMetric, there is also
[SumMetric](parlai.core.metrics.SumMetric), which keeps a running
sum. SumMetric and AverageMetric are the most common ways to construct custom
metrics, but others exist as well. For a full list (and views into advanced
cases), please see the [metrics API documentation](metrics_api).
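
For instance, if we wanted a raw count of "hello" responses rather than a rate, a
sketch of the same `custom_evaluation` using SumMetric might look like this (the
metric name `hello_count` is just for illustration):

```python
from parlai.core.metrics import SumMetric


def custom_evaluation(self, teacher_action, labels, model_response) -> None:
    if 'text' not in model_response:
        # model didn't speak, skip this example
        return
    if 'hello' in model_response['text']:
        # keep a running total instead of an average
        self.metrics.add('hello_count', SumMetric(1))
```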

## Agent (model) level metrics

In the above example, we worked on a metric defined by a teacher. However,
sometimes our models have special metrics that only they can compute,
which we call agent-level metrics. Perplexity is one example.

To compute model-level metrics, we can define either a global metric or a
local metric. Global metrics can be computed anywhere and are easy to use,
but cannot distinguish between different teachers when multitasking. We'll
look at another example: counting the number of times the teacher says "hello".

### Global metrics

A global metric can be computed anywhere in the model, and has an
interface similar to that of the teacher:

```python
agent.global_metrics.add('my_metric', AverageMetric(1, 2))
```

Global metrics are so called because they can be recorded anywhere in agent
code. For example, we can add a metric that counts the number of times the
model sees the word "hello" in `observe`. We'll do this while extending
the `TransformerGeneratorAgent`, so that we can combine it with the BlenderBot
model we used earlier.

```python
from parlai.core.metrics import AverageMetric
from parlai.core.loader import register_agent
from parlai.agents.transformer.transformer import TransformerGeneratorAgent


@register_agent('GlobalHelloCounter')
class GlobalHelloCounterAgent(TransformerGeneratorAgent):
    def observe(self, observation):
        retval = super().observe(observation)
        if 'text' in observation:
            text = observation['text']
            self.global_metrics.add(
                'global_hello', AverageMetric(int('hello' in text), 1)
            )
        return retval


if __name__ == '__main__':
    from parlai.scripts.eval_model import EvalModel

    EvalModel.main(
        task='dailydialog',
        model='GlobalHelloCounter',
        model_file='zoo:blender/blender_90M/model',
        batchsize=32,
    )
```

Running the script, we see that our new metric appears. Note that it is
different from the metric in the first half of the tutorial: previously we were
counting the number of times the model said hello (a lot), but now we are
counting how often the dataset says hello.

```
21:57:50 | Finished evaluating tasks ['dailydialog'] using datatype valid
accuracy bleu-4 ctpb ctps exps exs f1 global_hello gpu_mem loss ltpb ltps ppl token_acc tpb tps
0 .002097 14202 435.1 6.338 8069 .1345 .0009914 .02795 2.979 3242 99.32 19.67 .4133 17445 534.4
```

The global metric works well, but it has some drawbacks: if we were to start
training on a multitask dataset, we would not be able to distinguish the
`global_hello` of the two datasets, and we could only compute the micro-average
of the combination of the two. Below is an excerpt from a training log with
the above agent:

```
09:14:52 | time:112s total_exs:90180 epochs:0.41
    clip ctpb ctps exps exs global_hello gnorm gpu_mem loss lr ltpb ltps ppl token_acc total_train_updates tpb tps ups
    all 1 9831 66874 841.9 8416 .01081 2.018 .3474 5.078 1 1746 11878 163.9 .2370 729 11577 78752 6.803
    convai2 3434 .01081 5.288 197.9 .2120
    dailydialog 4982 .01081 4.868 130 .2620
```

Notice how `global_hello` is the same in both, because the model is unable to
distinguish between the two settings. In the next section we'll show how to fix
this with local metrics.
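
Before moving on, here is a toy sketch (with invented counts) of why a single
global metric can only report the pooled micro-average across tasks:

```python
from parlai.core.metrics import AverageMetric

# hypothetical per-task counts: convai2 says "hello" often, dailydialog rarely
convai2_hello = AverageMetric(50, 100)      # 0.50 within convai2
dailydialog_hello = AverageMetric(1, 100)   # 0.01 within dailydialog

# a single global metric pools the counts, hiding the per-task difference
print((convai2_hello + dailydialog_hello).value())  # 0.255
```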

__On placement__: In the example above, we recorded the global metric inside
the `observe` function. However, global metrics can be recorded from anywhere.
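
For instance, a sketch of recording a global metric from inside the training loop
(assuming the standard `TorchGeneratorAgent.compute_loss` signature) might look
like this; `batches_processed` is a hypothetical metric name:

```python
from parlai.core.metrics import SumMetric
from parlai.agents.transformer.transformer import TransformerGeneratorAgent


class LossLoggingAgent(TransformerGeneratorAgent):
    def compute_loss(self, batch, return_output=False):
        # delegate the actual loss computation to the parent class
        result = super().compute_loss(batch, return_output)
        # global metrics may be recorded here just as easily as in observe()
        self.global_metrics.add('batches_processed', SumMetric(1))
        return result
```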

### Local metrics

Having observed the limitation of global metrics being unable to distinguish
settings in multitasking, we would like to improve upon this. Let's add a local
metric, which is recorded _per example_. By recording this metric per example,
we can unambiguously identify which metrics came from which dataset, and report
averages correctly.

Local metrics have a limitation: they can only be computed inside the scope of
`batch_act`. This includes common places like `compute_loss` or `generate`,
where we often want to instrument specific behavior.

Let's look at an example. We'll add a metric inside the `batchify` function,
which is called from within `batch_act`, and is used to convert from a list of
[Message](messages) objects to a
[Batch](torch_agent.html#parlai.core.torch_agent.Batch) object. It is where we do
things like padding, etc. We'll do something slightly different from our previous
runs: in this case, we'll count how many examples in the batch contain the word
"hello" in their text.

```python
from parlai.core.metrics import AverageMetric
from parlai.core.loader import register_agent
from parlai.agents.transformer.transformer import TransformerGeneratorAgent


@register_agent('LocalHelloCounter')
class LocalHelloCounterAgent(TransformerGeneratorAgent):
    def batchify(self, observations):
        batch = super().batchify(observations)
        if hasattr(batch, 'text_vec'):
            num_hello = ["hello" in o['text'] for o in observations]
            self.record_local_metric(
                'local_hello',
                # AverageMetric.many(seq) is shorthand for
                # [AverageMetric(item) for item in seq]
                AverageMetric.many(num_hello),
            )
        return batch


if __name__ == '__main__':
    from parlai.scripts.train_model import TrainModel

    TrainModel.main(
        task='dailydialog,convai2',
        model='LocalHelloCounter',
        dict_file='zoo:blender/blender_90M/model.dict',
        batchsize=32,
    )
```

When we run this training script, we get one such output:
```
09:49:00 | time:101s total_exs:56160 epochs:0.26
    clip ctpb ctps exps exs gnorm gpu_mem local_hello loss lr ltpb ltps ppl token_acc total_train_updates tpb tps ups
    all 1 3676 63204 550.2 5504 2.146 .1512 .01423 4.623 1 436.2 7500 101.8 .2757 1755 4112 70704 17.2
    convai2 3652 .02793 4.659 105.5 .2651
    dailydialog 1852 .00054 4.587 98.17 .2863
```

Notice how the `local_hello` metric can now distinguish between hellos coming from
convai2 and those coming from dailydialog? The average hides the fact that one
dataset has many hellos, and the other does not.

Local metrics are primarily worth implementing when you care about the
fidelity of _train-time metrics_. At evaluation time, we evaluate each
dataset individually, so we can ensure global metrics are not mixed up.

__Under the hood__: Local metrics work by including a "metrics" field in the
return message. This is a dictionary which maps field name to a metric value.
When the teacher receives the response from the model, it utilizes the metrics
field to update counters on its side.
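
For intuition, the reply the teacher sees might conceptually look something like
this (a simplified sketch, not the literal object):

```python
from parlai.core.metrics import AverageMetric

# A simplified sketch of what the model's reply could carry back to the teacher.
reply = {
    'text': "hello ! how are you today ?",
    'metrics': {
        # recorded via self.record_local_metric(...) in the agent
        'local_hello': AverageMetric(1, 1),
    },
}
# The teacher merges reply['metrics'] into its own counters before reporting.
```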