Conversation
I think there's room to discuss alternatives to the problem buckets flag. I don't have anything immediately in mind, but perhaps after the weekend.
```python
if self.use_problem_buckets:
    dialog_has_problems = False
```
As a note, I'm finding the numerous portions throughout that are gated by `use_problem_buckets` somewhat hard to follow. In this case, we're even creating a variable to be used by other scopes only if the attribute is set. It could be that the overall `compile_results()` code is rather monolithic, so I have a hard time seeing where it could be broken down or where the `self.use_problem_buckets` idea could be pulled out.

Not necessarily blocking, especially if you believe this to be the only type of branching that could occur in these analysis scripts, but worth discussing.
Yes, I agree that gating functionality with `self.use_problem_buckets` adds complexity to an already very complex method, `.compile_results()`: it's not ideal, but to my mind this is the most straightforward way to achieve this without completely rewriting the method.

Soon I'll be having a larger discussion about how we want to structure all analysis code going forward, so my hope is that, in the medium term, this code will be overhauled more thoroughly to make it less monolithic; thus, I see this PR as a stopgap solution to provide needed functionality. Happy to discuss if you think there is a better stopgap solution for this :)
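For the sake of discussion, one alternative stopgap would be a template-method style, where the bucket-specific steps live in small hook methods that a no-buckets compiler simply doesn't override. This is only a hypothetical sketch; the class and method names below are invented for illustration and don't match the actual `compile_results.py` code:

```python
# Hypothetical sketch (names invented; not the actual ParlAI code):
# pulling problem-bucket logic into an overridable hook so the shared
# compile loop carries no `use_problem_buckets` branching.

class BaseCompiler:
    def __init__(self, dialogs):
        self.dialogs = dialogs

    def compile_results(self):
        results = []
        for dialog in self.dialogs:
            stats = {'num_turns': len(dialog)}
            # Bucket handling is delegated to a hook instead of being
            # gated inline on a use_problem_buckets attribute.
            stats.update(self._compute_bucket_stats(dialog))
            results.append(stats)
        return results

    def _compute_bucket_stats(self, dialog):
        """Hook: the base compiler computes no bucket stats."""
        return {}


class ProblemBucketCompiler(BaseCompiler):
    def _compute_bucket_stats(self, dialog):
        # Only this subclass knows about problem buckets, so variables
        # like dialog_has_problems never leak into the base method.
        dialog_has_problems = any(turn.get('problem_data') for turn in dialog)
        return {'dialog_has_problems': dialog_has_problems}


if __name__ == '__main__':
    sample = [[{'text': 'hi'}, {'text': 'hello', 'problem_data': ['repetitive']}]]
    print(ProblemBucketCompiler(sample).compile_results())
    # -> [{'num_turns': 2, 'dialog_has_problems': True}]
```

The tradeoff is more indirection, so it may not be worth it for a stopgap.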
If you have a brainstorming session planned for that overhaul, then I don't see a need to delay this step.
My only real concern here is addressed above, and this is noted as a stopgap solution to unblock some level of abstraction for this script.
Patch description
Generalize the analysis code for the `model_chat` crowdsourcing task so that it can analyze HITs of human+model chats in which we disabled either personas or the buckets that annotate problems with the model's responses. This will allow this code to be more easily subclassed for analyzing other human+model chat tasks that don't use personas and/or annotation buckets.

Minor changes:
- Add a `._add_additional_per_turn_stats()` method, to allow subclasses to add additional stats to the output dataframe (see the sketch at the end of this description)

(Apologies for the very big PR! The bulk of the pertinent changes are in `parlai/crowdsourcing/tasks/model_chat/analysis/compile_results.py`, and most of the other changes modify sample input/output files for CI checks.)

Testing steps
pytest tests/crowdsourcing/tasks/model_chat/test_model_chat_analysis.py
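As an illustration of the new hook mentioned above, here is a hypothetical sketch of how a subclass might override `_add_additional_per_turn_stats()`. The dataframe-in/dataframe-out signature, the stub base class, and the column names are all assumptions made for illustration, not taken from the PR:

```python
# Hypothetical sketch of subclassing the new per-turn-stats hook.
# The signature (a per-turn dataframe in, a dataframe out) is assumed
# and may not match the actual compile_results.py code.
import pandas as pd


class BaseCompilerStub:
    """Stand-in for the real compiler class, just to make this runnable."""

    def _add_additional_per_turn_stats(self, df: pd.DataFrame) -> pd.DataFrame:
        # Base implementation: no extra per-turn stats.
        return df


class MyTaskCompiler(BaseCompilerStub):
    def _add_additional_per_turn_stats(self, df: pd.DataFrame) -> pd.DataFrame:
        # Example extra stat: word count of each response turn.
        df = df.copy()
        df['response_length'] = df['text'].str.split().str.len()
        return df


if __name__ == '__main__':
    turns = pd.DataFrame({'text': ['hi there', 'how are you today']})
    print(MyTaskCompiler()._add_additional_per_turn_stats(turns))
```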