-
Notifications
You must be signed in to change notification settings - Fork 2.1k
CMU_DoG #3593
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for putting this up!
# For seen/unseen split, the full set of dialogs is split | ||
# across train, valid, test seen, and test unseen | ||
for split in ['train', 'valid', 'test']: | ||
datafiles.append(_datafile(split, SplitType.SEEN)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[No-op, just a comment] As a sidenote,
list(train_seen + valid_seen + test_seen + test_unseen) = list(train_original_deduped + valid_original_deduped + test_original_deduped)
so we could do that... but the code is probably clearer this way.
(On that note, this does mean that we will be duplicate counting conversations when we do F1 on the original split, but it seems like we're deciding to be okay with that.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
list(train_seen + valid_seen + test_seen + test_unseen) = list(train_original_deduped + valid_original_deduped + test_original_deduped)
I'm leaning toward leaving these separate for code clarity. But it might be good to add this relationship as a unit test so that it's documented.
we will be duplicate counting conversations when we do F1 on the original split
When you say "original split" you mean the one that isn't deduplicated, right? I guess I'm kind of assuming that if someone is intentionally using the split with duplicate conversations, it's probably desirable that the conversations be duplicated in the metric calculations as well.
Maybe I should add a warning print if people use the original
split type, and point them to your issue (festvox/datasets-CMU_DoG#2) so they know what they're getting into.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree with what you say spencer - yes, if we're providing the original split, we should calculate metrics on the duplicates as well. the warning below looks good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! see comments, though as this is merged i think the build comment is the only one necessary to address (others are mostly nits)
# For seen/unseen split, the full set of dialogs is split | ||
# across train, valid, test seen, and test unseen | ||
for split in ['train', 'valid', 'test']: | ||
datafiles.append(_datafile(split, SplitType.SEEN)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree with what you say spencer - yes, if we're providing the original split, we should calculate metrics on the duplicates as well. the warning below looks good!
"--cmu-dog-only-with-knowledge", | ||
type=bool, | ||
default=True, | ||
help="Optionally train only the sides of the conversation that have access to knowledge.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: as this defaults to True, and is presumably the desired arg, might be good to update the help comment here, like "Set False to optionally train both sides of conversation"
cmu_dog.add_argument( | ||
"--cmu-dog-include-knowledge-keys", | ||
type=str, | ||
default='cast,critical_response,director,genre,introduction,movieName,rating,year', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we make movieName
be the default here? or at least the recommended
? as at least for our future experimentation we'll like to use that split.
alternatively, maybe we can provide a teacher, like MovieNameTeacher
, which sets this arg for you (so you can do something like parlai dd -t cmu_dog:movie_name
and it'll work appropriately)
'http://parl.ai/downloads/cmu_dog/cmu_dog.tar.gz', | ||
'cmu_dog.tar.gz', | ||
'30d2bac0dae6b4e4c0b94ba581fffaf1acb46838480f7ad6736ad03d9312ae9d', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
messaged offline about this but we should be downloading the dataset from the source, and building the splits ourselves (in the build
function below)
Patch description
Adding CMU_DoG dataset. This is mostly a copypasta of code that wasn't yet checked in, and was originally written mostly by Moya and myself.
Testing steps
Unit tests