
Add prompts for DiaBLa dataset #759


Merged: 27 commits into bigscience-workshop:eval-hackathon, May 22, 2022

Conversation

@rbawden (Contributor) commented Apr 29, 2022:

Added a range of prompts for different tasks:

  • sentence-level translation
  • analogy-based translation
  • contextual translation (with context in different languages, automatically vs. manually translated)
  • detection of errors in translations
  • identification of machine-translated output compared to human-produced translations

@rbawden changed the title from "Add prompte for DiaBLa dataset" to "Add prompts for DiaBLa dataset" on Apr 29, 2022
@rbawden (Contributor, Author) commented May 9, 2022:

I have added a variety of different tasks, but I am a bit unsure whether newlines should be included in prompts or not, and how this will be handled by the different models. I wouldn't want them to be removed and the lines then concatenated together. Does anybody have any advice?

@stephenbach self-assigned this on May 12, 2022
@stephenbach (Member) commented:

Thanks! A few questions and observations:

  • Is it necessary to have separate prompts for en2fr, fr2en, and their combination? It seems like, if you want to track metrics separately, you could add this to the evaluation code? That would also save a lot of computation.
  • In the MT prompts (at least MT1), there are some examples where the languages are mixed, meaning that the source is a mix of French and English, and then the sentence in one of the languages is repeated in the target. For example: [screenshot: Screen Shot 2022-05-16 at 9 52 45 PM]
    Is that deliberate?

@stephenbach (Member) commented:

Oh, and one other thought related to your question about newlines: good to include them, but IIRC some models will tokenize them as regular old whitespace, so for something like dialog, it will probably be helpful to have explicit indicators for each speaker.
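
For illustration, here is a minimal Jinja sketch of what explicit speaker indicators could look like (the field names dialogue_history, orig and ref are placeholders, not necessarily the real DiaBLa columns):

```jinja
{# Sketch only: prefix every turn with a speaker tag so the dialogue
   structure survives even if a model collapses newlines into ordinary
   whitespace. Speakers are assumed to alternate here. #}
{% for turn in dialogue_history %}
{{ loop.cycle("A:", "B:") }} {{ turn }}
{% endfor %}
A: {{ orig }}
Translate A's last utterance into French. |||
{{ ref }}
```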

@rbawden (Contributor, Author) commented May 17, 2022:

Thanks! A few questions and observations:

  • Is it necessary to have separate prompts for en2fr, fr2en, and their combination? It seems like, if you want to track metrics separately, you could add this to the evaluation code? That would also save a lot of computation.

I agree, and realised this when we were discussing the fact that it would be computationally expensive. I think I could include just the templates for both directions and then calculate the individual direction scores based on the outputs. I will remove the directional templates now.

  • In the MT prompts (at least MT1), there are some examples where the languages are mixed, meaning that the source is a mix of French and English, and then the sentence in one of the languages is repeated in the target. For example: [screenshot: Screen Shot 2022-05-16 at 9 52 45 PM]
    Is that deliberate?

Yes, although I'm not sure it's the best way to do things. The idea is to translate the previous sentence and then ask the model to translate the current sentence. Sometimes the previous sentence is already in the target language, so I left it as it was, which results in an identical translation. I wanted to do this to test the setting where only the original utterances are included, but I also have a version where I include the same language each time (using the MT version of the context if its language is different from that of the current sentence), since this is the version of the contextual sentence seen by the current speaker.

Do you think this will pose problems for the model?
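
For illustration, a hedged Jinja sketch of the contextual pattern described above (prev_orig, prev_ref, orig and ref are placeholder field names, not the actual column names):

```jinja
{# Sketch only: the previous utterance and its reference translation serve
   as an in-context example; the current sentence is left for the model.
   If the previous utterance is already in the target language, the
   "example" pair is simply the sentence repeated, as discussed above. #}
Given the previous sentence "{{ prev_orig }}", translated as "{{ prev_ref }}",
translate the following sentence: "{{ orig }}" |||
{{ ref }}
```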

… to be done separately later. Also some slight modifs such as removing excess words and quotes
@rbawden (Contributor, Author) commented May 17, 2022:

I have updated the templates according to your suggestion about not including individual language directions!
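
For illustration, a hedged sketch of how a single template can cover both directions by reading the language of the example itself (utterance_meta.lang is an assumption for the relevant field name):

```jinja
{# Sketch only: the target language is derived from the (assumed) language
   field, so one template handles both en->fr and fr->en; per-direction
   scores can then be recovered from the outputs in the evaluation code. #}
{% if utterance_meta.lang == "english" %}{% set target = "French" %}{% else %}{% set target = "English" %}{% endif %}
Translate this into {{ target }}: {{ orig }} |||
{{ ref }}
```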

@stephenbach (Member) commented:

Thanks! Just a few more things:

  • You could also take the "both directions" out of the remaining prompt names now, just so they're a little shorter.
  • Regarding the mixing of task formats, I don't know whether it will pose a problem for certain models, but I also am unclear what it adds. It seems confusing from a training/testing perspective to mix formats in the same prompt. Is there a way to adjust them so that they have a consistent format?
  • For all places where you're selecting an element, please use choice not random. So for example in classify errors use {% set label = range(0,5)|choice %} instead of {% set label = range(0,5)|random %}.
  • For the discriminate prompts, those should be combined into a single prompt where the order is chosen randomly (using choice). While it won't make a difference for evaluation, someone might want to train on them, and having the answer always be A or always be B won't work well.
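
To make the last two points concrete, here is a hedged sketch of a single discriminate prompt with the order drawn via choice (orig, mt and ref are placeholder field names):

```jinja
{# Sketch only: the A/B order is chosen with the choice filter, so the gold
   answer is not always the same letter. #}
{% set mt_first = [true, false] | choice %}
Which of the following is the machine translation of "{{ orig }}"?
A: {% if mt_first %}{{ mt }}{% else %}{{ ref }}{% endif %}
B: {% if mt_first %}{{ ref }}{% else %}{{ mt }}{% endif %} |||
{% if mt_first %}A{% else %}B{% endif %}
```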

@rbawden (Contributor, Author) commented May 17, 2022:

Thanks! Just a few more things:

  • You could also take the "both directions" out of the remaining prompt names now, just so they're a little shorter.

OK, I have done this and renamed some of them to make them clearer.

  • Regarding the mixing of task formats, I don't know whether it will pose a problem for certain models, but I also am unclear what it adds. It seems confusing from a training/testing perspective to mix formats in the same prompt. Is there a way to adjust them so that they have a consistent format?

I am not sure what you mean by mixing formats in the same prompt. Do you mean the fact that some tasks are MT and others are more like classification? If so, I don't think this should pose much of a problem, given that this is just an evaluation set and not meant for training. Does it matter if there are different tasks that test different aspects of the dataset? Or do you mean the inclusion of more or less context?

  • For all places where you're selecting an element, please use choice not random. So for example in classify errors use {% set label = range(0,5)|choice %} instead of {% set label = range(0,5)|random %}.

Thank you! I have updated this.

  • For the discriminate prompts, those should be combined into a single prompt where the order is chosen randomly (using choice). While it won't make a difference for evaluation, someone might want to train on them, and having the answer always be A or always be B won't work well.

I have combined these, as I think that should be ok; the dataset was designed as a test set, so I wasn't expecting people to train on it. I created the two prompts with the idea that always testing both orderings would make comparisons between models more direct, but there are probably enough examples to do this with a single prompt.

Thanks again!

@stephenbach (Member) commented:

Awesome, changes look great!

I am not sure what you mean by mixing formats in the same prompt. Do you mean the fact that some tasks are MT and others are more like classification? If so, I don't think this should pose much of a problem, given that this is just an evaluation set and not meant for training. Does it matter if there are different tasks that test different aspects of the dataset? Or do you mean the inclusion of more or less context?

Whoops, sorry, I said this in an unclear way. I was referring back to this part:

Yes, although I'm not sure it's the best way to do things. The idea is to translate the previous sentence and then ask the model to translate the current sentence. Sometimes the previous sentence is already in the target language, so I left it as it was, which results in an identical translation. I wanted to do this to test the setting where only the original utterances are included, but I also have a version where I include the same language each time (using the MT version of the context if its language is different from that of the current sentence), since this is the version of the contextual sentence seen by the current speaker.

Rereading what you said, I think I misunderstood it. If I understand correctly now, you're saying that you want to provide an example of a translated sentence using the previous sentence, but depending on where the example is in the dialog, the previous sentence might already be in the target language. There's also a version where you use the translated version so that the language direction (e.g., en to fr) is consistent. Got it, that all makes sense.

Is it desired to give an example of translation though? So far in promptsource, all our prompts are zero-shot. It sounds like this makes it a one-shot problem?

@rbawden (Contributor, Author) commented May 19, 2022:

Awesome, changes look great!

I am not sure what you mean by mixing formats in the same prompt. Do you mean the fact that some tasks are MT and others are more like classification? If so, I don't think this should pose much of a problem, given that this is just an evaluation set and not meant for training. Does it matter if there are different tasks that test different aspects of the dataset? Or do you mean the inclusion of more or less context?

Whoops, sorry, I said this in an unclear way. I was referring back to this part:

Yes, although I'm not sure it's the best way to do things. The idea is to translate the previous sentence and then ask the model to translate the current sentence. Sometimes the previous sentence is already in the target language, so I left it as it was, which results in an identical translation. I wanted to do this to test the setting where only the original utterances are included, but I also have a version where I include the same language each time (using the MT version of the context if its language is different from that of the current sentence), since this is the version of the contextual sentence seen by the current speaker.

Rereading what you said, I think I misunderstood it. If I understand correctly now, you're saying that you want to provide an example of a translated sentence using the previous sentence, but depending on where the example is in the dialog, the previous sentence might already be in the target language. There's also a version where you use the translated version so that the language direction (e.g., en to fr) is consistent. Got it, that all makes sense.

Is it desired to give an example of translation though? So far in promptsource, all our prompts are zero-shot. It sounds like this makes it a one-shot problem?

I agree that this makes it a little more few-shot in some of these cases, although the examples given necessarily come from the directly preceding context (and not from another dialogue). However, it is actually quite a realistic scenario: the participants were interacting via MT systems, so at any given point the MT system has access to the original utterances and the machine-translated versions of the context, with the aim of producing something like the reference for the given sentence. Each sentence is produced in the context of having seen the other participants' machine-translated utterances, so I wanted to replicate this scenario too. I included the scenario of having the original context (when it is in the same language) and the reference context (when it is in a different language) for comparison purposes. Do you think this sounds ok?

@rbawden (Contributor, Author) commented May 20, 2022:

Hi there! Is this ok to merge or do I need to do some more updates?

@stephenbach (Member) left a review comment:


LGTM! @awebson confirmed that the MT tasks with the various contexts are ok to merge.

@stephenbach (Member) commented:

Thanks @rbawden for your work with this dataset! Merging now!

@stephenbach merged commit fb4c27d into bigscience-workshop:eval-hackathon on May 22, 2022