RFC: Data Mutators #3434

stephenroller · 2021-02-02T21:43:19Z

stephenroller
Feb 2, 2021

Proposal for Data mutators

We would like a way to apply generic data transformations to ParlAI teachers, without having to subclass a specific teacher and override its code.

Examples:

Swapping speakers: swap which speaker supplies the text and which supplies the label. For example, in Wizard of Wikipedia a useful teacher is the Apprentice teacher, and in task-oriented dialogue (TOD) we may want to train on User utterances.
Flatten: concatenate the history into each example's text, so that every turn becomes a standalone example.
Data augmentations: automatically perform substitutions (at either training time or test time) to expand the training data.

Existing solutions

Wrappers
-t wrapper is used in place. There are open questions about how accurate this is, and whether it is compatible with arbitrary teachers, since it relies on get().

Teacher overriding
Frequently we do this by subclassing a teacher and implementing the change in the subclass. How this is done depends quite a bit on the teacher's abstract base class: ChunkTeacher, FixedDialogTeacher or DialogTeacher.

Requirements

Composable, e.g. --mutator shuffle_turns,flatten
We may want to adjust the number of episodes/examples on the fly (Flatten is one example)
Mutations are ideally online, so that:
1. they don't delay training
2. mutations involving randomness get multiple chances at training time (e.g. a different random outcome on each epoch)
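
A minimal sketch of how the composability and online requirements could fit together, assuming each mutator is simply a function from a stream of (message, is_new_episode) pairs to another such stream. Plain dicts stand in for ParlAI's Message, and `compose`, `uppercase` and `first_turn_only` are hypothetical names used only for illustration.

```python
from functools import reduce
from typing import Callable, Dict, Iterator, Tuple

# Stand-in types: a plain dict plays the role of ParlAI's Message here.
Example = Tuple[Dict[str, str], bool]  # (message, is_new_episode)
Mutation = Callable[[Iterator[Example]], Iterator[Example]]


def compose(*mutations: Mutation) -> Mutation:
    """Chain mutations left to right; nothing runs until the result is iterated."""
    def composed(examples: Iterator[Example]) -> Iterator[Example]:
        return reduce(lambda stream, mutate: mutate(stream), mutations, examples)
    return composed


def uppercase(examples: Iterator[Example]) -> Iterator[Example]:
    # 1:1 mutation: same number of examples and episodes.
    for message, is_new_episode in examples:
        yield {**message, "text": message["text"].upper()}, is_new_episode


def first_turn_only(examples: Iterator[Example]) -> Iterator[Example]:
    # Changes the number of examples: keep only the first turn of each episode.
    for message, is_new_episode in examples:
        if is_new_episode:
            yield message, True


data = [
    ({"text": "hi"}, True),
    ({"text": "how are you"}, False),
    ({"text": "bye"}, True),
]
pipeline = compose(uppercase, first_turn_only)
result = list(pipeline(iter(data)))
```

Because every stage is a generator, examples are mutated one at a time as the training loop pulls them, rather than in a preprocessing pass.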

Proposed UI

Any time a teacher is used, the --mutator option is available. Multiple mutators can be composed by passing a comma-separated list.

Examples:

parlai train_model -t convai2 --mutator flatten
parlai eval_model -t multiwoz --mutator shuffle_turns
parlai token_stats -t babi --mutator shuffle_turns,flatten
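
One way the comma-separated --mutator value could be resolved is a small name-to-class registry. This is only a sketch: the decorator name `register_mutator` follows the example implementations later in this thread, while `MUTATOR_REGISTRY`, `parse_mutators` and the empty placeholder classes are hypothetical and not existing ParlAI code.

```python
from typing import Callable, Dict, List, Type

# Hypothetical registry mapping a command-line name to a mutator class.
MUTATOR_REGISTRY: Dict[str, Type] = {}


def register_mutator(name: str) -> Callable[[Type], Type]:
    """Class decorator that records a mutator under the given name."""
    def decorator(cls: Type) -> Type:
        MUTATOR_REGISTRY[name] = cls
        return cls
    return decorator


def parse_mutators(arg: str) -> List[Type]:
    """Turn 'shuffle_turns,flatten' into the matching classes, in order."""
    names = [name.strip() for name in arg.split(",") if name.strip()]
    unknown = [name for name in names if name not in MUTATOR_REGISTRY]
    if unknown:
        raise ValueError(f"Unknown mutator(s): {', '.join(unknown)}")
    return [MUTATOR_REGISTRY[name] for name in names]


@register_mutator("shuffle_turns")
class ShuffleTurns:  # placeholder body, for illustration only
    pass


@register_mutator("flatten")
class Flatten:  # placeholder body, for illustration only
    pass


selected = parse_mutators("shuffle_turns,flatten")
```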

Unanswered: what about multitasking?

Proposed API

Three separate abstract mutator classes are made available, with increasing complexity and power:

ExampleMutator: A 1:1 mapping of examples; episode boundaries are preserved. Good for simple formatting changes. episode_done is hidden from the user.
EpisodeMutator: Maintains episode boundaries, but lets you change the examples within an episode arbitrarily. episode_done is hidden from the user.
TeacherMutator: Arbitrary mutations. Roughly equivalent to overriding setup_data.

from abc import abstractmethod
from typing import Iterator, List, Tuple

from parlai.core.message import Message

class ExampleMutator(AbstractMutator):
    @abstractmethod
    def example_mutation(self, example: Message) -> Message:
        # Performs f(example) and returns the new example. episode_done is left unchanged.
        pass

class EpisodeMutator(AbstractMutator):
    @abstractmethod
    def episode_mutation(self, episode: List[Message]) -> List[Message]:
        # Returns a new conversation. The returned list is a new full episode.
        # episode_done is stripped from all messages before they are seen, and added back automatically to the returned values.
        pass

class TeacherMutator(AbstractMutator):
    @abstractmethod
    def teacher_mutation(self, examples: Iterator[Tuple[Message, bool]]) -> Iterator[Tuple[Message, bool]]:
        # Arbitrary episode manipulation. The user simply reads from the iterator (similar to `setup_data`) and yields new things.
        pass
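
To make the relationship between the three levels more concrete, here is a rough sketch of how the two simpler classes could be layered on the same (message, is_new_episode) stream that TeacherMutator sees. It is an assumption about one possible implementation, not a description of the eventual design: plain dicts stand in for Message, the `__call__` and `_emit` plumbing is invented for illustration, the classes are standalone rather than subclasses of AbstractMutator, and episode boundaries are carried by the is_new_episode flag instead of being written back as an episode_done field.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Tuple

Message = Dict[str, object]  # stand-in for ParlAI's Message
Stream = Iterator[Tuple[Message, bool]]  # (message, is_new_episode)


def _hide_episode_done(message: Message) -> Message:
    return {k: v for k, v in message.items() if k != "episode_done"}


class ExampleMutator(ABC):
    @abstractmethod
    def example_mutation(self, example: Message) -> Message:
        ...

    def __call__(self, examples: Stream) -> Stream:
        for message, is_new_episode in examples:
            # The user never sees episode_done; the boundary flag passes through.
            yield self.example_mutation(_hide_episode_done(message)), is_new_episode


class EpisodeMutator(ABC):
    @abstractmethod
    def episode_mutation(self, episode: List[Message]) -> List[Message]:
        ...

    def __call__(self, examples: Stream) -> Stream:
        episode: List[Message] = []
        for message, is_new_episode in examples:
            if is_new_episode and episode:
                yield from self._emit(episode)
                episode = []
            episode.append(_hide_episode_done(message))
        if episode:
            yield from self._emit(episode)

    def _emit(self, episode: List[Message]) -> Stream:
        # The first message of the returned list starts the new episode.
        for i, message in enumerate(self.episode_mutation(episode)):
            yield message, i == 0


class Reverse(EpisodeMutator):
    def episode_mutation(self, episode: List[Message]) -> List[Message]:
        return episode[::-1]


data = [
    ({"text": "a", "episode_done": False}, True),
    ({"text": "b", "episode_done": True}, False),
    ({"text": "c", "episode_done": True}, True),
]
out = list(Reverse()(iter(data)))
```

In this sketch the user-written `episode_mutation` only ever deals with a list of messages, while the base class handles grouping the stream into episodes and re-marking the boundaries.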

stephenroller · 2021-02-02T23:51:39Z

stephenroller
Feb 2, 2021
Author

One comment from a viewer: we may want to clarify when it is better to create a new teacher and when it is better to create a new mutator.

0 replies

stephenroller · 2021-02-03T03:04:05Z

stephenroller
Feb 3, 2021
Author

Example implementation of a turn shuffler:

import random
from typing import List

@register_mutator("shuffle_turns")
class ShuffleMutator(EpisodeMutator):
    def __init__(self, opt: Opt):
        super().__init__(opt)
        self.rng = random.Random(42)

    def episode_mutation(self, episode: List[Message]) -> List[Message]:
        # Shuffle a copy so the caller's list is not modified in place.
        shuffled = list(episode)
        self.rng.shuffle(shuffled)
        return shuffled

Example of a within-turn word shuffler:

@register_mutator("word_shuffle")
class WordShuffleMutator(ExampleMutator):
    def __init__(self, opt: Opt):
        super().__init__(opt)
        self.rng = random.Random(42)

    def example_mutation(self, example: Message) -> Message:
        tokens = example['text'].split()  # switch to a standard tokenizer here?
        self.rng.shuffle(tokens)
        retval = example.copy()
        retval.force_set('text', ' '.join(tokens))
        return retval

Example of a flattener:

@register_mutator("flatten")
class FlattenMutator(TeacherMutator):
    def teacher_mutation(self, examples):
        context = []
        for message, is_new_episode in examples:
            if is_new_episode:
                # Start a fresh history at every episode boundary.
                context = []
            context.append(message['text'])
            m = message.copy()
            m.force_set('text', '\n'.join(context))
            # Each flattened example becomes its own single-turn episode.
            yield m, True
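
A quick check of what the flattener produces on a toy two-turn episode. To keep it runnable without ParlAI, plain dicts stand in for Message (so `force_set` becomes a dict copy) and the logic is written as a bare generator function rather than a registered class; both substitutions are assumptions made for illustration.

```python
from typing import Dict, Iterator, Tuple

Example = Tuple[Dict[str, str], bool]  # (message, is_new_episode)


def flatten(examples: Iterator[Example]) -> Iterator[Example]:
    context = []
    for message, is_new_episode in examples:
        if is_new_episode:
            context = []
        context.append(message["text"])
        # Copy the message and overwrite its text with the running history.
        yield {**message, "text": "\n".join(context)}, True


episode = [
    ({"text": "Hello", "label": "Hi there"}, True),
    ({"text": "How are you?", "label": "Good, thanks"}, False),
]
flat = list(flatten(iter(episode)))
for message, is_new_episode in flat:
    print(repr(message["text"]), is_new_episode)
# 'Hello' True
# 'Hello\nHow are you?' True
```

Note that, as written, only the text fields accumulate; earlier labels are not added to the context.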

0 replies

emilydinan · 2021-02-03T13:40:33Z

emilydinan
Feb 3, 2021

Advantage of mutator over simple teacher changes is not clear to me. Would also like some clarification around when one would be better than another (e.g. when would you want to override setup_data or whatever vs. adding a custom abstract mutation).

These days the most common "mutation" I perform to a teacher is adding control tokens -- would something like this be easily supported with this idea? Maybe a good idea to crowdsource other common ideas such that they can be explicitly supported.

2 replies

stephenroller Feb 3, 2021
Author

Re-usability is the main argument. For example, when I look at your MD Gender teachers, instead of writing a fresh teacher per-dataset, perhaps you could've just written one or two mutators (pronoun subs, etc).

Mutators are primarily useful when you have multiple datasets or dataset variants and you want to apply the same mutation to all of them.
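
On the control-token question: that looks like a natural fit for the example-level API. A minimal sketch, assuming the token is simply appended to the text; the class name, the `token` argument and the dict stand-in for Message are all hypothetical, and a real mutator would more likely derive the token from each example or from opt.

```python
from typing import Dict

Message = Dict[str, str]  # stand-in for ParlAI's Message


class ControlTokenMutator:
    """Example-level mutation: append a fixed control token to the text."""

    def __init__(self, token: str):
        self.token = token

    def example_mutation(self, example: Message) -> Message:
        mutated = dict(example)  # copy so the original example is untouched
        mutated["text"] = f"{example['text']} {self.token}"
        return mutated


mutator = ControlTokenMutator("<polite>")
original = {"text": "tell me a joke", "label": "why did the chicken..."}
mutated = mutator.example_mutation(original)
```

Written once, the same mutation could then be applied to any teacher that exposes a text field.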

emilydinan Feb 3, 2021

Unfortunately for MDGender, every single dataset was annotated and/or constructed differently, so that may not have been possible. For the Queens are Powerful Too paper that would have been possible though, and we have a wrapper-like thing implemented.

stephenroller · 2021-02-03T13:59:36Z

stephenroller
Feb 3, 2021
Author

Another possible mutator might be randomly sampling the data
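
A rough sketch of what that could look like at the teacher level, keeping each whole episode with some probability. The function name and the `keep_prob` and `seed` parameters are hypothetical, and plain dicts stand in for Message.

```python
import random
from typing import Dict, Iterator, Tuple

Example = Tuple[Dict[str, str], bool]  # (message, is_new_episode)


def sample_episodes(
    examples: Iterator[Example], keep_prob: float, seed: int = 42
) -> Iterator[Example]:
    """Keep each whole episode with probability keep_prob."""
    rng = random.Random(seed)
    keep = False
    for message, is_new_episode in examples:
        if is_new_episode:
            # Decide once per episode so that episodes are never split.
            keep = rng.random() < keep_prob
        if keep:
            yield message, is_new_episode


# Ten toy episodes of two turns each.
data = [({"text": f"ep{e} turn{t}"}, t == 0) for e in range(10) for t in range(2)]
everything = list(sample_episodes(iter(data), keep_prob=1.0))
nothing = list(sample_episodes(iter(data), keep_prob=0.0))
half = list(sample_episodes(iter(data), keep_prob=0.5))
```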

2 replies

ankitade Feb 3, 2021

Another useful mutator might be concatenating / stitching 2 or more dialogs

stephenroller Feb 4, 2021
Author

How do you think this one might work?

chinnadhurai · 2021-02-04T19:04:18Z

chinnadhurai
Feb 4, 2021

Appreciate the granularity of EpisodeMutator and ExampleMutator. It would also be good to think about ways to add nontrivial mutators like synonym mutators, paraphrase mutators, ASR error proxy mutators, and external knowledge retrieval mutators.

0 replies

stephenroller · 2021-03-01T21:12:15Z

stephenroller
Mar 1, 2021
Author

Implemented in #3448

0 replies
