This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

CMU_DoG #3593

Merged
merged 8 commits into from
Apr 19, 2021

Conversation

Contributor

@spencerp spencerp commented Apr 14, 2021

Patch description
Adding CMU_DoG dataset. This is mostly a copypasta of code that wasn't yet checked in, and was originally written mostly by Moya and myself.

Testing steps

λ parlai dd -t cmu_dog
λ parlai dd -t cmu_dog --cmu-dog-split-type original
λ parlai dd -t cmu_dog --cmu-dog-split-type seen --datatype test
λ parlai dd -t cmu_dog --cmu-dog-split-type unseen --datatype test
λ parlai dd -t cmu_dog --cmu-dog-include-knowledge-keys movieName
λ parlai dd -t cmu_dog --cmu-dog-provide-movie-context False --mutators knowledge_only_when_updated,prepend_knowledge_to_message
λ parlai dd -t cmu_dog --cmu-dog-only-with-knowledge False --cmu-dog-rating 1 --cmu-dog-multi-msg-delimiter "__" --cmu-dog-fact-delimiter "/"

Unit tests

λ pytest -k test_zootasks
λ pytest --force-regen parlai/tasks/cmu_dog/test.py  # fails initially, as expected
λ pytest parlai/tasks/cmu_dog/test.py  # succeeds

@spencerp spencerp marked this pull request as ready for review April 15, 2021 18:18
Contributor

@moyapchen moyapchen left a comment

Thanks for putting this up!

# For seen/unseen split, the full set of dialogs is split
# across train, valid, test seen, and test unseen
for split in ['train', 'valid', 'test']:
    datafiles.append(_datafile(split, SplitType.SEEN))

Contributor

[No-op, just a comment] As a sidenote,

list(train_seen + valid_seen + test_seen + test_unseen) = list(train_original_deduped + valid_original_deduped + test_original_deduped)

so we could do that... but the code is probably clearer this way.

(On that note, this does mean that we will be duplicate counting conversations when we do F1 on the original split, but it seems like we're deciding to be okay with that.)

Contributor Author

> list(train_seen + valid_seen + test_seen + test_unseen) = list(train_original_deduped + valid_original_deduped + test_original_deduped)

I'm leaning toward leaving these separate for code clarity. But it might be good to add this relationship as a unit test so that it's documented.
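
A unit test along these lines could document that relationship. This is only a minimal sketch with made-up conversation IDs: `dedupe`, `all_ids`, and the two dictionaries are hypothetical stand-ins rather than helpers from the task code, and it compares sets of IDs instead of ordered lists.

```python
def dedupe(conversation_ids):
    """Drop repeated conversation IDs, keeping first occurrences in order."""
    return list(dict.fromkeys(conversation_ids))


def all_ids(splits):
    """Collect the set of conversation IDs across every fold of a split type."""
    return {cid for fold in splits.values() for cid in fold}


# Made-up IDs for illustration only; real data would come from the task's files.
original = {
    "train": ["c1", "c2", "c2", "c3"],
    "valid": ["c4", "c1"],
    "test": ["c5", "c6", "c6"],
}
seen_unseen = {
    "train": ["c1", "c2", "c3"],
    "valid": ["c4"],
    "test_seen": ["c5"],
    "test_unseen": ["c6"],
}


def test_seen_unseen_covers_deduped_original():
    # Deduplicate each original fold, then compare overall coverage.
    deduped_original = {name: dedupe(ids) for name, ids in original.items()}
    assert all_ids(seen_unseen) == all_ids(deduped_original)


test_seen_unseen_covers_deduped_original()
```

Comparing sets keeps the test independent of file ordering; a stricter version could also assert that no ID appears in more than one seen/unseen fold.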

> we will be duplicate counting conversations when we do F1 on the original split

When you say "original split" you mean the one that isn't deduplicated, right? I guess I'm kind of assuming that if someone is intentionally using the split with duplicate conversations, it's probably desirable that the conversations be duplicated in the metric calculations as well.

Maybe I should add a warning print if people use the original split type, and point them to your issue (festvox/datasets-CMU_DoG#2) so they know what they're getting into.
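
A minimal sketch of what such a warning might look like, not the code that was merged. `warn_if_original_split` is a hypothetical helper, and only `SplitType.SEEN` appears in the diff; the other enum members are assumed from the `--cmu-dog-split-type` values used in the testing steps.

```python
import logging
from enum import Enum

logger = logging.getLogger(__name__)


class SplitType(Enum):
    # SEEN appears in the diff; the other members are assumed from the CLI values.
    ORIGINAL = "original"
    SEEN = "seen"
    UNSEEN = "unseen"


def warn_if_original_split(split_type):
    """Log a warning and return True when the non-deduplicated split is selected."""
    if split_type is SplitType.ORIGINAL:
        logger.warning(
            "The original CMU_DoG split contains duplicate conversations "
            "(see festvox/datasets-CMU_DoG#2); metrics will count them more than once."
        )
        return True
    return False
```

Returning a boolean is only there to make the helper easy to test; the call site would typically ignore it.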

Contributor

agree with what you say spencer - yes, if we're providing the original split, we should calculate metrics on the duplicates as well. the warning below looks good!

@spencerp spencerp merged commit 7d26830 into master Apr 19, 2021
@spencerp spencerp deleted the cmu-dog branch April 19, 2021 15:13
Contributor

@klshuster klshuster left a comment

thanks! see comments, though as this is merged i think the build comment is the only one necessary to address (others are mostly nits)

"--cmu-dog-only-with-knowledge",
type=bool,
default=True,
help="Optionally train only the sides of the conversation that have access to knowledge.",

Contributor

nit: as this defaults to True, and is presumably the desired arg, might be good to update the help comment here, like "Set False to optionally train both sides of conversation"
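
For illustration, the reworded help text might read as follows. This sketch uses plain `argparse` as a stand-in for the project's own parser, and `str2bool` is a hypothetical helper added here because plain `argparse` with `type=bool` would parse the string "False" as True.

```python
import argparse


def str2bool(value):
    """Parse common true/false spellings into a bool."""
    lowered = value.lower()
    if lowered in ("true", "t", "1", "yes", "y"):
        return True
    if lowered in ("false", "f", "0", "no", "n"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


parser = argparse.ArgumentParser()
parser.add_argument(
    "--cmu-dog-only-with-knowledge",
    type=str2bool,
    default=True,
    help="Set False to train on both sides of the conversation instead of "
    "only the side that has access to knowledge.",
)

# Explicitly disabling the flag yields False; omitting it keeps the default.
args = parser.parse_args(["--cmu-dog-only-with-knowledge", "False"])
print(args.cmu_dog_only_with_knowledge)
```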

cmu_dog.add_argument(
    "--cmu-dog-include-knowledge-keys",
    type=str,
    default='cast,critical_response,director,genre,introduction,movieName,rating,year',

Contributor

should we make movieName the default here, or at least the recommended value? at least for our future experimentation we'd like to use that setting.

alternatively, maybe we can provide a teacher, like MovieNameTeacher, which sets this arg for you (so you can do something like parlai dd -t cmu_dog:movie_name and it'll work appropriately)
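
A rough sketch of the teacher idea, under loud assumptions: `BaseCmuDogTeacher` is a hypothetical stand-in that only stores its options (the real teacher class and its constructor are not shown in this thread), and the option key is assumed to be the flag name with dashes turned into underscores.

```python
import copy


class BaseCmuDogTeacher:
    """Hypothetical stand-in for the task's teacher; it only stores its options."""

    def __init__(self, opt):
        self.opt = opt


class MovieNameTeacher(BaseCmuDogTeacher):
    """Variant that always restricts the knowledge keys to movieName."""

    def __init__(self, opt):
        opt = copy.deepcopy(opt)  # avoid mutating the caller's options
        opt["cmu_dog_include_knowledge_keys"] = "movieName"  # assumed option key
        super().__init__(opt)


caller_opt = {"cmu_dog_include_knowledge_keys": "cast,director,genre"}
teacher = MovieNameTeacher(caller_opt)
print(teacher.opt["cmu_dog_include_knowledge_keys"])
```

Copying the options before overriding them keeps the caller's dictionary unchanged, which matters if the same options are reused to build other teachers.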

Comment on lines +12 to +14
'http://parl.ai/downloads/cmu_dog/cmu_dog.tar.gz',
'cmu_dog.tar.gz',
'30d2bac0dae6b4e4c0b94ba581fffaf1acb46838480f7ad6736ad03d9312ae9d',

Contributor

messaged offline about this but we should be downloading the dataset from the source, and building the splits ourselves (in the build function below)

4 participants