refactor taskmaster2 to use tod metrics #3147
Conversation
SlotF1Metric + a SlotF1Metric test, imported from an internal implementation. Will go through and refactor + test existing tasks to use this next.
Not quite sure if returning NaNs for F1 metrics is the best way of going about things, but it's probably okay as a start. Also assuming the Metrics base class will handle addition correctly. Added the test case to make sure there weren't any glaring syntax errors in how I implemented SlotMetrics. Nominally I should probably add more test cases to validate that slot-based domain metrics are working correctly, but that seems lower risk/lower priority, and I'll cover it anyway when I update existing tasks.
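For reference, a minimal sketch of the shape such a slot F1 metric might take; the class below is hypothetical, not the actual ParlAI implementation, and just illustrates the NaN-on-empty and addition behavior discussed above:

```python
class SlotF1Sketch:
    """Hypothetical sketch of an F1-style slot metric.

    Stores raw counts so that __add__ stays associative (which is what a
    Metrics base class needs), and returns NaN when nothing was scored.
    """

    def __init__(self, true_pos: int = 0, false_pos: int = 0, false_neg: int = 0):
        self.true_pos = true_pos
        self.false_pos = false_pos
        self.false_neg = false_neg

    def __add__(self, other: "SlotF1Sketch") -> "SlotF1Sketch":
        # Accumulate counts rather than averaging F1 values directly;
        # averaging per-example F1 scores would not compose correctly.
        return SlotF1Sketch(
            self.true_pos + other.true_pos,
            self.false_pos + other.false_pos,
            self.false_neg + other.false_neg,
        )

    def value(self) -> float:
        denom = 2 * self.true_pos + self.false_pos + self.false_neg
        if denom == 0:
            return float("nan")  # the "return NaNs" case: nothing scored yet
        return 2 * self.true_pos / denom
```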
`docformatter -i --pre-summary-newline --wrap-descriptions 88 --wrap-summaries 88 --make-summary-multi-line` on the relevant file
…shing so all my misspellings don't go into the commit log...
Noticed while testing on taskmaster that jga was higher than both slot_r and slot_p... put in some prints, and it turns out it was counting the {} == {} case. This doesn't really make sense to do in general, though there are some scenarios where not having any slots is the correct response... so I put in a flag to compensate.
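To illustrate the fix, a hypothetical sketch (the function and flag names here are made up, not the ones in the PR):

```python
def jga_samples(guess_slots: dict, gold_slots: dict, count_empty: bool = False):
    """Return the 0/1 joint-goal-accuracy samples for one turn.

    Skipping the {} == {} case by default keeps jga from being inflated
    above slot_p/slot_r by turns that have no slots at all; the flag
    re-enables it for tasks where "no slots" is a meaningful answer.
    """
    if not guess_slots and not gold_slots and not count_empty:
        return []  # don't count the trivially-correct empty case
    return [int(guess_slots == gold_slots)]


assert jga_samples({}, {}) == []                      # skipped by default
assert jga_samples({}, {}, count_empty=True) == [1]   # counted when opted in
assert jga_samples({'food': 'pizza'}, {'food': 'pizza'}) == [1]
```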
Your PR contains a change to a task. Please paste the results of the following command into a comment: `python tests/datatests/test_new_tasks.py`
Tested by running
```
parlai eval -t taskmaster2 -m <model> -mf <recently trained model on food-ordering> --domains food-ordering movies
```
and validating that the output made sense.
Force-pushed from a929bcf to 3edf04c
I like it, very clean. Would approve except for `add_metrics`.
```
guess=model_response['text'],
labels=labels,
teacher_domains=[domain],
delex_guess=delex_text,
```
This is good. Heads up that delex vs. lexicalized may need to be an option in the future, but I'm good with this as is.
Ah yeah, it's implemented as an Optional inside of NlgMetrics right now, so that should already be fine.
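As a sketch of that Optional shape (hypothetical signature; the real NlgMetrics fields may differ):

```python
from typing import Optional


def compute_nlg_metrics(
    guess: str,
    label: str,
    delex_guess: Optional[str] = None,
    delex_label: Optional[str] = None,
) -> dict:
    # Always score the lexicalized text; score the delexicalized text
    # only when it is provided, which leaves room to make delex vs.
    # lexicalized a proper option later without changing call sites.
    metrics = {'exact_match': float(guess == label)}
    if delex_guess is not None and delex_label is not None:
        metrics['delex_exact_match'] = float(delex_guess == delex_label)
    return metrics
```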
```
bleu_metric = BleuMetric.compute(delex_text, [delex_label])
self.metrics.add('delex_bleu', bleu_metric)
self.metrics.add(f'{domain}_delex_bleu', bleu_metric)
self.metrics.add_metrics(
```
Okay, I like this. I just think there's no implementation of `add_metrics`.
It's in the `metrics_collections` branch (`slot_teachers` is rebased onto that, while this one is rebased on `slot_teachers`).
This may be, uh, me trying to recreate diff stacks from Mercurial when that's not how things work in git land. Maybe the lesson I should be learning here is not to try to mirror diff stacks by using branches on top of branches... 🙃
For context, this is the branch that includes the `add_metrics` implementation: #3145
(Given GitHub's UI... next time, perhaps I'll just do all of these changes in one pull request, since it's not super nice with stacked branches.)
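For reference, the kind of merge `add_metrics` implies could look something like this (a hypothetical sketch, not the actual implementation in #3145):

```python
class MetricsSketch:
    """Stand-in for a metrics container; the real class lives in ParlAI."""

    def __init__(self):
        self._data = {}

    def add(self, key, value):
        # Relies on each metric value defining __add__, as discussed above.
        self._data[key] = self._data[key] + value if key in self._data else value

    def add_metrics(self, other: 'MetricsSketch') -> None:
        # Fold every metric from another collection into this one.
        for key, value in other._data.items():
            self.add(key, value)
```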
Ah yeah, stacked diffs aren't a thing in GitHub. Sometimes I do PRs on top of PRs, but generally not.
There's a tool called ghstack that the pytorch people use to emulate them fairly successfully, but I don't use it myself.
Force-pushed from fea599b to 6b9a0a4
* Add test for interactive_web
* Spinlock
* Hm.
* Lint.
* Allow missing init opt opts
* Add part of unit test
* Work on unit test
* Test fixes
* Fix second test
* Fix test
* Check obsolete arg does not exist
…ltiple metrics be added to it. See #3138 for context and use.
…addressing comments = adding comments + fixing grammar, mostly).
…3145)
* Add notion of metrics collections, which can have other Metrics or multiple metrics added to it. See #3138 for context and use.
* Right, having different arguments for the same function isn't a thing in Python... (alas, that's what I get for mostly coding in C++ for the past few years. :P)
* Fixed a bug while integrating into taskmaster2
* Address comments (get rid of separate class, add func to Metrics directly)
* Actually do the things from the last comment
Abandoning this; will do internally and sync once it's all nice.