CMU_DoG #3593

spencerp · 2021-04-14T23:34:59Z

Patch description
Adding the CMU_DoG dataset. This is mostly a copy-paste of code that wasn't yet checked in; it was originally written mostly by Moya and me.

Testing steps

λ parlai dd -t cmu_dog
λ parlai dd -t cmu_dog --cmu-dog-split-type original
λ parlai dd -t cmu_dog --cmu-dog-split-type seen --datatype test
λ parlai dd -t cmu_dog --cmu-dog-split-type unseen --datatype test
λ parlai dd -t cmu_dog --cmu-dog-include-knowledge-keys movieName
λ parlai dd -t cmu_dog --cmu-dog-provide-movie-context False --mutators knowledge_only_when_updated,prepend_knowledge_to_message
λ parlai dd -t cmu_dog --cmu-dog-only-with-knowledge False --cmu-dog-rating 1 --cmu-dog-multi-msg-delimiter "__" --cmu-dog-fact-delimiter "/"

Unit tests

λ pytest -k test_zootasks
λ pytest --force-regen parlai/tasks/cmu_dog/test.py  # fails initially, as expected
λ pytest parlai/tasks/cmu_dog/test.py  # succeeds

moyapchen

Thanks for putting this up!

moyapchen · 2021-04-17T14:21:16Z

parlai/tasks/cmu_dog/agents.py

+        # For seen/unseen split, the full set of dialogs is split
+        # across train, valid, test seen, and test unseen
+        for split in ['train', 'valid', 'test']:
+            datafiles.append(_datafile(split, SplitType.SEEN))


[No-op, just a comment] As a sidenote,

list(train_seen + valid_seen + test_seen + test_unseen) = list(train_original_deduped + valid_original_deduped + test_original_deduped)

so we could do that... but the code is probably clearer this way.

(On that note, this does mean that we will be double-counting conversations when we compute F1 on the original split, but it seems like we're deciding to be okay with that.)

spencerp · Apr 19, 2021

> list(train_seen + valid_seen + test_seen + test_unseen) = list(train_original_deduped + valid_original_deduped + test_original_deduped)

I'm leaning toward leaving these separate for code clarity. But it might be good to add this relationship as a unit test so that it's documented.
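A rough sketch of what such a unit test could check, not the test that landed in this PR. It assumes each conversation has a stable ID and that "deduped" means dropping repeats within each fold of the original split; `dedupe` and `seen_unseen_matches_deduped_original` are hypothetical names, and the fold contents would come from the real data files.

```python
from collections import Counter
from typing import Dict, Iterable, List


def dedupe(conversation_ids: Iterable[str]) -> List[str]:
    """Drop repeated conversation IDs, keeping first-seen order."""
    return list(dict.fromkeys(conversation_ids))


def seen_unseen_matches_deduped_original(
    seen_unseen: Dict[str, List[str]],
    original: Dict[str, List[str]],
) -> bool:
    """Compare the two split families as multisets of conversation IDs.

    `seen_unseen` maps fold names (train/valid/test seen, test unseen) to IDs;
    `original` maps the original train/valid/test folds to IDs.
    """
    from_seen_unseen = Counter(
        cid for fold in seen_unseen.values() for cid in fold
    )
    # Assumption: deduplication is applied per fold of the original split.
    from_original = Counter(
        cid for fold in original.values() for cid in dedupe(fold)
    )
    return from_seen_unseen == from_original
```

Comparing `Counter` objects rather than sets means a conversation that shows up twice on one side but once on the other still fails the check.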

> we will be duplicate counting conversations when we do F1 on the original split

When you say "original split" you mean the one that isn't deduplicated, right? I guess I'm kind of assuming that if someone is intentionally using the split with duplicate conversations, it's probably desirable that the conversations be duplicated in the metric calculations as well.

Maybe I should add a warning print if people use the original split type, and point them to your issue (festvox/datasets-CMU_DoG#2) so they know what they're getting into.
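One possible shape for that warning, as a sketch only. `SplitType.SEEN` appears in the diff above, but the other enum members, their string values (taken from the `--cmu-dog-split-type` flags in the testing steps), the message wording, and `warn_if_original_split` are all assumptions; the standard `logging` module stands in for whatever logger the task actually uses.

```python
import logging
from enum import Enum

logger = logging.getLogger(__name__)


class SplitType(Enum):
    # Stand-in for the PR's SplitType; member values are assumed.
    ORIGINAL = "original"
    SEEN = "seen"
    UNSEEN = "unseen"


ORIGINAL_SPLIT_WARNING = (
    "The original CMU_DoG split contains duplicate conversations, and "
    "metrics computed on it will count them more than once. "
    "See festvox/datasets-CMU_DoG#2 for details."
)


def warn_if_original_split(split_type: SplitType) -> bool:
    """Log a warning for the original split; return True if one was logged."""
    if split_type is SplitType.ORIGINAL:
        logger.warning(ORIGINAL_SPLIT_WARNING)
        return True
    return False
```

Returning a bool keeps the helper easy to assert on in a test without capturing log output.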


klshuster

thanks! see comments, though as this is merged i think the build comment is the only one necessary to address (others are mostly nits)

klshuster · 2021-04-19T15:12:38Z

parlai/tasks/cmu_dog/agents.py

+        # For seen/unseen split, the full set of dialogs is split
+        # across train, valid, test seen, and test unseen
+        for split in ['train', 'valid', 'test']:
+            datafiles.append(_datafile(split, SplitType.SEEN))


agree with what you say spencer - yes, if we're providing the original split, we should calculate metrics on the duplicates as well. the warning below looks good!

klshuster · 2021-04-19T15:14:30Z

parlai/tasks/cmu_dog/agents.py

+            "--cmu-dog-only-with-knowledge",
+            type=bool,
+            default=True,
+            help="Optionally train only the sides of the conversation that have access to knowledge.",


nit: as this defaults to True, and is presumably the desired arg, might be good to update the help comment here, like "Set False to optionally train both sides of conversation"
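A sketch of how the reworded help text might read, using plain `argparse` rather than ParlAI's parser. The `str2bool` converter and `build_parser` are hypothetical helpers for this illustration: with plain `argparse`, `type=bool` would turn any non-empty string (including "False") into `True`, so an explicit converter is used here. Whether ParlAI's parser handles `type=bool` specially is not assumed either way.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse common true/false spellings; hypothetical helper for this sketch."""
    lowered = value.strip().lower()
    if lowered in {"true", "t", "1", "yes", "y"}:
        return True
    if lowered in {"false", "f", "0", "no", "n"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--cmu-dog-only-with-knowledge",
        type=str2bool,
        default=True,
        # Help text leads with the default behavior, then says how to change it.
        help=(
            "Train only on the side of the conversation that has access to "
            "knowledge. Set to False to train on both sides."
        ),
    )
    return parser
```

With this parser, passing `--cmu-dog-only-with-knowledge False` yields `False`, matching the invocation in the testing steps.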

klshuster · 2021-04-19T15:16:10Z

parlai/tasks/cmu_dog/agents.py

+        cmu_dog.add_argument(
+            "--cmu-dog-include-knowledge-keys",
+            type=str,
+            default='cast,critical_response,director,genre,introduction,movieName,rating,year',


should we make movieName the default here? or at least the recommended value? at least for our future experimentation we'd like to use that setting.

alternatively, maybe we can provide a teacher, like MovieNameTeacher, which sets this arg for you (so you can do something like parlai dd -t cmu_dog:movie_name and it'll work appropriately)
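A minimal sketch of the subclass idea. `BaseCmuDogTeacher` is a hypothetical stand-in for the teacher added in this PR, and the option key `cmu_dog_include_knowledge_keys` is assumed from the flag name; the real teacher's constructor and option handling may differ.

```python
from typing import Any, Dict, List, Optional


class BaseCmuDogTeacher:
    """Hypothetical stand-in for the CMU_DoG teacher in this PR."""

    def __init__(
        self, opt: Dict[str, Any], shared: Optional[Dict[str, Any]] = None
    ) -> None:
        self.opt = opt
        keys = opt.get("cmu_dog_include_knowledge_keys", "")
        # The flag is a comma-separated string; split it into a list of keys.
        self.knowledge_keys: List[str] = [k for k in keys.split(",") if k]


class MovieNameTeacher(BaseCmuDogTeacher):
    """Pins the knowledge keys to movieName regardless of what was passed."""

    def __init__(
        self, opt: Dict[str, Any], shared: Optional[Dict[str, Any]] = None
    ) -> None:
        opt = dict(opt)  # copy so the caller's options are left untouched
        opt["cmu_dog_include_knowledge_keys"] = "movieName"
        super().__init__(opt, shared)
```

Overriding the option in the constructor keeps the command-line default unchanged for everyone else while giving the movie-name setup a short task name.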

klshuster · 2021-04-19T15:20:21Z

parlai/tasks/cmu_dog/build.py

+        'http://parl.ai/downloads/cmu_dog/cmu_dog.tar.gz',
+        'cmu_dog.tar.gz',
+        '30d2bac0dae6b4e4c0b94ba581fffaf1acb46838480f7ad6736ad03d9312ae9d',


messaged offline about this but we should be downloading the dataset from the source, and building the splits ourselves (in the build function below)
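If the download is switched to the upstream source, the expected hash in this entry would have to be recomputed for the new archive. The 64-character hex string in the diff is the length of a SHA-256 digest; assuming that is the algorithm, a new value could be computed and checked roughly like this. `sha256_of_file` and `matches_checksum` are hypothetical helper names, not part of the build code.

```python
import hashlib
from pathlib import Path
from typing import Union


def sha256_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Union[str, Path], expected_hex: str) -> bool:
    """Compare a file's digest against an expected hex string, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```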

spencerp added 2 commits April 14, 2021 16:07

porting cmu_dog from internal to public

c067bbd

correct file name

66a3cf9

spencerp requested review from klshuster and moyapchen April 14, 2021 23:34

facebook-github-bot added the CLA Signed label Apr 14, 2021

spencerp added 3 commits April 15, 2021 10:54

correct rare f1 reference corpus for seen/unseen

f46a488

Merge branch 'master' into cmu-dog

81fbea2

add tests

f7ffff5

spencerp marked this pull request as ready for review April 15, 2021 18:18

spencerp added 2 commits April 16, 2021 13:36

move build()

8ce12f3

bump data version

9986e8d

moyapchen approved these changes Apr 17, 2021


original warning + word count unittest

da418d9

spencerp merged commit 7d26830 into master Apr 19, 2021

spencerp deleted the cmu-dog branch April 19, 2021 15:13

klshuster reviewed Apr 19, 2021
