
Add prompts for DiaBLa dataset #759


Merged: 27 commits into bigscience-workshop:eval-hackathon, May 22, 2022

Conversation

@rbawden (Contributor) commented Apr 29, 2022:

Added a range of prompts for different tasks:

  • sentence-level translation
  • analogy-based translation
  • contextual translation (with context in different languages, automatically vs. manually translated)
  • detection of errors in translations
  • identification of machine-translated output compared to human-produced translations

@rbawden changed the title from "Add prompte for DiaBLa dataset" to "Add prompts for DiaBLa dataset" on Apr 29, 2022
@rbawden (Contributor, Author) commented May 9, 2022:

I have added a variety of different tasks, but I am a bit unsure whether newlines should be included in prompts or not, and how this will be handled by the different models. I wouldn't want them to be removed and the lines then concatenated together. Does anybody have any advice?

@stephenbach self-assigned this on May 12, 2022
@stephenbach (Member) commented:

Thanks! A few questions and observations:

  • Is it necessary to have separate prompts for en2fr, fr2en, and their combination? It seems like, if you want to track metrics separately, you could add this to the evaluation code? That would also save a lot of computation.
  • In the MT prompts (at least MT1), there are some examples where the languages are mixed, meaning that the source is a mix of French and English, and then the sentence in one of the languages is repeated in the target. For example: [screenshot: Screen Shot 2022-05-16 at 9 52 45 PM]
    Is that deliberate?

@stephenbach (Member) commented:

Oh, and one other thought related to your question about newlines: good to include them, but IIRC some models will tokenize them as regular old whitespace, so for something like dialog, it will probably be helpful to have explicit indicators for each speaker.
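
For illustration, here is a minimal Jinja sketch of what explicit speaker indicators could look like (the field names dialogue_history, orig and ref are placeholders, not necessarily the real DiaBLa columns):

```jinja
{# Sketch only: prefix every turn with a speaker tag so the dialogue
   structure survives even if a model collapses newlines into ordinary
   whitespace. Speakers are assumed to alternate here. #}
{% for turn in dialogue_history %}
{{ loop.cycle("A:", "B:") }} {{ turn }}
{% endfor %}
A: {{ orig }}
Translate A's last utterance into French. |||
{{ ref }}
```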

@rbawden (Contributor, Author) commented May 17, 2022:

Thanks! A few questions and observations:

  • Is it necessary to have separate prompts for en2fr, fr2en, and their combination? It seems like, if you want to track metrics separately, you could add this to the evaluation code? That would also save a lot of computation.

I agree, and realised this when we were discussing the fact that it would be computationally expensive. I think I could include just the templates for both directions and then calculate the individual direction scores based on the outputs. I will remove the directional templates now.

  • In the MT prompts (at least MT1), there are some examples where the languages are mixed, meaning that the source is a mix of French and English, and then the sentence in one of the languages is repeated in the target. For example: [screenshot: Screen Shot 2022-05-16 at 9 52 45 PM]
    Is that deliberate?

Yes, although I'm not sure it's the best way to do things. The idea is to translate the previous sentence and then ask the model to translate the current sentence. Sometimes the previous sentence is already in the target language, so I left it as it was, which results in an identical translation. I wanted to do this to test the setting where only the original utterances are included, but I also have a version where I include the same language each time (using the MT version of the context if its language is different from that of the current sentence), since this is the version of the contextual sentence seen by the current speaker.

Do you think this will pose problems for the model?
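
For illustration, a hedged Jinja sketch of the contextual pattern described above (prev_orig, prev_ref, orig and ref are placeholder field names, not the actual column names):

```jinja
{# Sketch only: the previous utterance and its reference translation serve
   as an in-context example; the current sentence is left for the model.
   If the previous utterance is already in the target language, the
   "example" pair is simply the sentence repeated, as discussed above. #}
Given the previous sentence "{{ prev_orig }}", translated as "{{ prev_ref }}",
translate the following sentence: "{{ orig }}" |||
{{ ref }}
```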

… to be done separately later. Also some slight modifs such as removing excess words and quotes
@rbawden (Contributor, Author) commented May 17, 2022:

I have updated the templates according to your suggestion about not including individual language directions!
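
For illustration, a hedged sketch of how a single template can cover both directions by reading the language of the example itself (utterance_meta.lang is an assumption for the relevant field name):

```jinja
{# Sketch only: the target language is derived from the (assumed) language
   field, so one template handles both en->fr and fr->en; per-direction
   scores can then be recovered from the outputs in the evaluation code. #}
{% if utterance_meta.lang == "english" %}{% set target = "French" %}{% else %}{% set target = "English" %}{% endif %}
Translate this into {{ target }}: {{ orig }} |||
{{ ref }}
```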

@stephenbach (Member) commented:

Thanks! Just a few more things:

  • You could also take the "both directions" out of the remaining prompt names now, just so they're a little shorter.
  • Regarding the mixing of task formats, I don't know whether it will pose a problem for certain models, but I also am unclear what it adds. It seems confusing from a training/testing perspective to mix formats in the same prompt. Is there a way to adjust them so that they have a consistent format?
  • For all places where you're selecting an element, please use choice not random. So for example in classify errors use {% set label = range(0,5)|choice %} instead of {% set label = range(0,5)|random %}.
  • For the discriminate prompts, those should be combined into a single prompt where the order is chosen randomly (using choice). While it won't make a difference for evaluation, someone might want to train on them, and having the answer always be A or always be B won't work well.
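
To make the last two points concrete, here is a hedged sketch of a single discriminate prompt with the order drawn via choice (orig, mt and ref are placeholder field names):

```jinja
{# Sketch only: the A/B order is chosen with the choice filter, so the gold
   answer is not always the same letter. #}
{% set mt_first = [true, false] | choice %}
Which of the following is the machine translation of "{{ orig }}"?
A: {% if mt_first %}{{ mt }}{% else %}{{ ref }}{% endif %}
B: {% if mt_first %}{{ ref }}{% else %}{{ mt }}{% endif %} |||
{% if mt_first %}A{% else %}B{% endif %}
```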

@rbawden (Contributor, Author) commented May 17, 2022:

Thanks! Just a few more things:

  • You could also take the "both directions" out of the remaining prompt names now, just so they're a little shorter.

OK, I have done this and renamed some of them to make them clearer.

  • Regarding the mixing of task formats, I don't know whether it will pose a problem for certain models, but I also am unclear what it adds. It seems confusing from a training/testing perspective to mix formats in the same prompt. Is there a way to adjust them so that they have a consistent format?

I am not sure what you mean by mixing formats in the same prompt. Do you mean the fact that some tasks are MT and others are more like classification? If so, I don't think this should pose much of a problem, given that this is just an evaluation set and not meant for training. Does it matter if there are different tasks that test different aspects of the dataset? Or do you mean the inclusion of more or less context?

  • For all places where you're selecting an element, please use choice not random. So for example in classify errors use {% set label = range(0,5)|choice %} instead of {% set label = range(0,5)|random %}.

Thank you! I have updated this.

  • For the discriminate prompts, those should be combined into a single prompt where the order is chosen randomly (using choice). While it won't make a difference for evaluation, someone might want to train on them, and having the answer always be A or always be B won't work well.

I have combined these, as I think that should be ok; the dataset was designed as a test set, so I wasn't expecting people to train on it. I created the two prompts with the idea that always testing both orderings would make comparisons between models more direct, but there are probably enough examples to do this with a single prompt.

Thanks again!

@stephenbach (Member) commented:

Awesome, changes look great!

I am not sure what you mean by mixing formats in the same prompt. Do you mean the fact that some tasks are MT and others are more like classification? If so, I don't think this should pose much of a problem, given that this is just an evaluation set and not meant for training. Does it matter if there are different tasks that test different aspects of the dataset? Or do you mean the inclusion of more or less context?

Whoops, sorry, I said this in an unclear way. I was referring back to this part:

Yes, although I'm not sure it's the best way to do things. The idea is to translate the previous sentence and then ask the model to translate the current sentence. Sometimes the previous sentence is already in the target language, so I left it as it was, which results in an identical translation. I wanted to do this to test the setting where only the original utterances are included, but I also have a version where I include the same language each time (using the MT version of the context if its language is different from that of the current sentence), since this is the version of the contextual sentence seen by the current speaker.

Rereading what you said, I think I misunderstood it. If I understand correctly now, you're saying that you want to provide an example of a translated sentence using the previous sentence, but depending on where the example is in the dialog, the previous sentence might already be in the target language. There's also a version where you use the translated version so that the language direction (e.g., en to fr) is consistent. Got it, that all makes sense.

Is it desired to give an example of translation though? So far in promptsource, all our prompts are zero-shot. It sounds like this makes it a one-shot problem?

@rbawden (Contributor, Author) commented May 19, 2022:

Awesome, changes look great!

I am not sure what you mean by mixing formats in the same prompt. Do you mean the fact that some tasks are MT and others are more like classification? If so, I don't think this should pose much of a problem, given that this is just an evaluation set and not meant for training. Does it matter if there are different tasks that test different aspects of the dataset? Or do you mean the inclusion of more or less context?

Whoops, sorry, I said this in an unclear way. I was referring back to this part:

Yes, although I'm not sure it's the best way to do things. The idea is to translate the previous sentence and then ask the model to translate the current sentence. Sometimes the previous sentence is already in the target language, so I left it as it was, which results in an identical translation. I wanted to do this to test the setting where only the original utterances are included, but I also have a version where I include the same language each time (using the MT version of the context if its language is different from that of the current sentence), since this is the version of the contextual sentence seen by the current speaker.

Rereading what you said, I think I misunderstood it. If I understand correctly now, you're saying that you want to provide an example of a translated sentence using the previous sentence, but depending on where the example is in the dialog, the previous sentence might already be in the target language. There's also a version where you use the translated version so that the language direction (e.g., en to fr) is consistent. Got it, that all makes sense.

Is it desired to give an example of translation though? So far in promptsource, all our prompts are zero-shot. It sounds like this makes it a one-shot problem?

I agree that this makes it a little more few-shot in some of these cases, although the examples given necessarily come from the directly preceding context (and not from another dialogue). However, it is actually quite a realistic scenario: the participants were interacting via MT systems, so at any given point the MT system has access to the original utterances and the machine-translated versions of the context, with the aim of producing something like the reference for the given sentence. Each sentence is produced in the context of having seen the other participants' machine-translated utterances, so I wanted to replicate this scenario too. I included the scenario of having the original context (when it is in the same language) and the reference context (when it is in a different language) for comparison purposes. Do you think this sounds ok?

@rbawden (Contributor, Author) commented May 20, 2022:

Hi there! Is this ok to merge or do I need to do some more updates?

@stephenbach (Member) left a review comment:


LGTM! @awebson confirmed that the MT tasks with the various contexts are ok to merge.

@stephenbach (Member) commented:

Thanks @rbawden for your work with this dataset! Merging now!

@stephenbach merged commit fb4c27d into bigscience-workshop:eval-hackathon on May 22, 2022