update dataset entry to support list of list type (#2879)
#2827

In the [PR that introduced RM support for the dataset entry
class](#2867) I forgot that with RM we'll have multiple answers per
question, i.e. `[Q1, (A11, A12)]`, but I had introduced questions and
answers as flat `list` types, so there was no way to connect a question
to its answers: given `questions=[Q1, Q2]` and
`answers=[A11, A12, A21, A22]`, nothing tells us that `A11, A12` belong
to `Q1` and `A21, A22` to `Q2`. So I introduced a `list[list[str]]` type
for the answers, which lets us connect questions and answers by index:
with `questions=[Q1, Q2]` and `answers=[[A11, A12], [A21, A22]]`,
`answers[0]` belongs to `questions[0]`. Note that this is backwards
compatible, since `answers` is a union type of
`list[str] | list[list[str]]`. Also added tests for this.
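
For illustration, a minimal sketch of the new pairing, mirroring the tests added in this commit (the `eos_token` value is just an example):

```python
from model_training.custom_datasets.entities import Mode
from model_training.custom_datasets.formatting import DatasetEntry

# New nested form: answers[i] holds the ranked answers to questions[i].
entry = DatasetEntry(
    questions=["Q1", "Q2"],
    answers=[["A11", "A12"], ["A21", "A22"]],
)

# RM mode expects a single question and returns it together with its
# ranked answer list (see _get_formatted_rm below).
rm_entry = DatasetEntry(questions=["Q1"], answers=[["A11", "A12"]])
question, answers = rm_entry.get_formatted(mode=Mode.rm, eos_token="<|endofline|>")

# The old flat form still validates, since answers is a union of
# list[str] and list[list[str]].
legacy = DatasetEntry(questions=["Q1"], answers=["A1"])
```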


Ran `python check_dataset_appearances.py -d webgpt --cache_dir .cache
--mode rm`
and found one entry with an empty question:
```python
DatasetEntry(
    questions=[''],
    answers=[['Lebensraum is a German geopolitical concept that means "living space." The term was originally used to support colonialism, and was later adapted by Nazi leader Adolf Hitler to support his quest for German expansion to the east . German geographer and ethnographer Friedrich Ratzel first published an essay called "Der Lebensraum" ("The Living Space") in 1901, in which he posited that all people, animals, and plants need to expand their living space in order to survive . According to Ratzel, species that successfully adapted to one location would spread naturally to others . Hitler believed that Germany required Lebensraum in order to survive, and this conviction that this living space could be gained only in the east and, specifically, from Russia, shaped his policy after his take-over of power in Germany in 1933 . The Nazi Generalplan Ost policy (\'Master Plan for the East\') was based on the tenets of Lebensraum . It stipulated that Germany required a Lebensraum necessary for its survival and that most of the indigenous populations of Central and Eastern Europe would have to be removed permanently (either through mass deportation to Siberia, extermination, or enslavement) .', 'There are several ways to unblock blocked websites. One way is to use a good web-based proxy server . Another way is to type in the URL of the blocked site you want to access in the address bar, and then press Go or Enter . The web content will be sent to the proxy server where it can then be viewed from your device . This may make browsing a bit slower, but you should still be able to access any of your favorite websites . Another way to unblock blocked websites is to use a VPN (Virtual Private Network) . A VPN can be used to access region-restricted websites, shield your web browsing activities on public WiFi networks, and more .']],
    context=None,
    lang=None,
    length=None,
    quality=None,
    humor=None,
    creativity=None,
)
```
So this was the result: 
```bash
'Found the following occurances in TRAIN webgpt:'
{re.compile('^[\\s\\n]*$'): ['']}
```
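
For context, here is a minimal sketch of the kind of scan that produces this output. It is an assumption about what `check_dataset_appearances.py` does, not its actual implementation:

```python
import re

# Hypothetical re-implementation of the blank-question scan; the real
# check_dataset_appearances.py may differ.
BLANK = re.compile(r"^[\s\n]*$")

def find_blank_questions(entries) -> dict:
    """Map the blank pattern to every question it matches.

    `entries` is assumed to be an iterable of DatasetEntry objects.
    """
    hits: dict = {}
    for entry in entries:
        for question in entry.questions:
            if BLANK.match(question):
                hits.setdefault(BLANK, []).append(question)
    return hits
```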
CloseChoice authored Apr 27, 2023
1 parent 618ad47 commit 3e06b48
Showing 6 changed files with 143 additions and 47 deletions.
30 changes: 18 additions & 12 deletions model/model_training/README.md
@@ -212,21 +212,27 @@ deepspeed trainer_sft.py --configs defaults your-model-name --deepspeed
Here is an incomplete overview of datasets for sft:

<!-- prettier-ignore -->
dataset_name | train_counts | eval_counts | total_counts
----------------------------------------------------------------

<!-- prettier-ignore -->
-webgpt | 15662 | 3916 | 19578
-squad_v2 | 130319 | 11873 | 142192
-adversarial_qa | 30000 | 3000 | 33000
-trivia_qa_nocontext | 138384 | 17944 | 156328
-xsum | 204045 | 11332 | 215377
-cnn_dailymail | 287113 | 13368 | 300481
-multi_news | 44972 | 5622 | 50594
-scitldr | 1992 | 619 | 2611
-joke | 301 | 76 | 377
-gsm8k | 7473 | 1319 | 8792
-dive_mt | 6192 | 1548 | 7740
+joke | 301 | 76 | 377
+webgpt | 14251 | 3563 | 17814
+gpt4all | 313552 | 78388 | 391940
+alpaca | 41361 | 10346 | 51707
+code_alpaca | 16017 | 4004 | 20021
+vicuna | 46939 | 11735 | 58674
+minimath | 2304 | 576 | 2880
+humaneval_mbpp_codegen_qa | 472 | 119 | 591
+humaneval_mbpp_testgen_qa | 472 | 119 | 591
+grade_school_math_instructions | 7033 | 1759 | 8792
+recipes | 3797 | 950 | 4747
+cmu_wiki_qa | 1288 | 322 | 1610
+oa_wiki_qa_bart_10000row | 8000 | 2000 | 10000
+prosocial_dialogue | 157160 | 26983 | 184143
+explain_prosocial | 360708 | 61248 | 421956
+soda | 924102 | 231026 | 1155128
+oa_leet10k | 18728 | 4683 | 23411

This list can be generated with the following command, but beware that this
downloads all available datasets (>100GB):
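
The command itself is not shown in this excerpt. As a reference, one of the example invocations this commit adds as comments at the bottom of check_dataset_counts.py:

```bash
python check_dataset_counts.py --datasets joke webgpt gpt4all alpaca code_alpaca vicuna minimath humaneval_mbpp_codegen_qa humaneval_mbpp_testgen_qa grade_school_math_instructions recipes cmu_wiki_qa oa_wiki_qa_bart_10000row prosocial_dialogue explain_prosocial soda oa_leet10k --mode sft
```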
43 changes: 41 additions & 2 deletions model/model_training/check_dataset_counts.py
@@ -1,10 +1,13 @@
import argparse
from collections import Counter
from enum import Enum
from pathlib import Path
from typing import Any

import pandas as pd
import yaml
from langdetect import DetectorFactory, detect
from model_training.custom_datasets.formatting import DatasetEntry
from model_training.utils.utils import _strtobool, get_dataset


@@ -54,6 +57,7 @@ def argument_parsing(notebook=False, notebook_args=None):
)
parser.add_argument("--mode", dest="mode", type=Mode, choices=list(Mode))
parser.add_argument("--output_path", dest="output_path", default="dataset_counts.csv")
parser.add_argument("--detect_language", default=False, action="store_true")

if notebook:
args, remaining = parser.parse_known_args(notebook_args)
@@ -93,6 +97,7 @@ def argument_parsing(notebook=False, notebook_args=None):
conf["output_path"] = args.output_path
conf["datasets_extra"] = []
conf["datasets"] = datasets_list
conf["detect_language"] = args.detect_language
# Override config from command-line
parser = argparse.ArgumentParser()
for key, value in conf.items():
@@ -111,9 +116,37 @@ def argument_parsing(notebook=False, notebook_args=None):
if __name__ == "__main__":
args = argument_parsing()

train, evals = get_dataset(args, mode=args.mode)
train, evals = get_dataset(args, mode=args.mode.value)
overview_df = pd.DataFrame(columns=["dataset_name", "train_counts", "eval_counts", "total_counts"])
for idx, dataset_name in enumerate(args.datasets):
language_df = pd.DataFrame()
if args.detect_language:
DetectorFactory.seed = 0
for idx, (dataset_name, dataset) in enumerate(evals.items()):
train_lang = Counter()
if args.detect_language:
length = len(dataset)
for idx1, row in enumerate(dataset):
if idx1 % 1000 == 0:
print(f"{idx1} of {length} of ds {dataset_name}.")
try:
if isinstance(row, (list, tuple)):
train_lang += Counter([detect(k) for k in row])
elif isinstance(row, DatasetEntry):
train_lang += Counter([detect(k) for k in row.questions if k])
if isinstance(row.answers[0], list):
for answers in row.answers:
train_lang += Counter([detect(k) for k in answers if k])
else:
train_lang += Counter([detect(k) for k in row.answers if k])
else:
raise ValueError(
f"Did not expect the type {type(row)}. Should be either list, tuple or DatasetEntry."
)
except Exception as e:
print(e)
train_lang = dict(train_lang)
train_lang["dataset_name"] = dataset_name
language_df = pd.concat([language_df, pd.DataFrame([train_lang])])
eval_count = len(evals.get(dataset_name, []))
overview_df.loc[idx] = [
dataset_name,
@@ -122,4 +155,10 @@ def argument_parsing(notebook=False, notebook_args=None):
len(train.datasets[idx]) + eval_count,
]
print(overview_df)
print(language_df)
overview_df.to_csv(args.output_path, index=False)
language_df.to_csv("language_counts.csv", index=False)

# python check_dataset_counts.py --datasets joke webgpt gpt4all alpaca code_alpaca vicuna minimath humaneval_mbpp_codegen_qa humaneval_mbpp_testgen_qa grade_school_math_instructions recipes cmu_wiki_qa oa_wiki_qa_bart_10000row prosocial_dialogue explain_prosocial soda oa_leet10k --mode sft
# python check_dataset_counts.py --datasets joke webgpt alpaca code_alpaca vicuna minimath humaneval_mbpp_codegen_qa humaneval_mbpp_testgen_qa grade_school_math_instructions recipes cmu_wiki_qa oa_wiki_qa_bart_10000row prosocial_dialogue oa_leet10k --mode sft
# python check_dataset_counts.py --datasets joke webgpt --mode sft
31 changes: 23 additions & 8 deletions model/model_training/custom_datasets/formatting.py
@@ -1,11 +1,13 @@
from itertools import zip_longest
from random import shuffle
from random import random, shuffle

from langcodes import Language
from model_training.custom_datasets.entities import Mode
from pydantic import BaseModel, validator
from pydantic.fields import ModelField

SYSTEM_PROPERTY_DROP_PROBA = 0.5

QA_SPECIAL_TOKENS = {
"Question": "<|prompter|>",
"Answer": "<|assistant|>",
@@ -25,7 +27,7 @@ def format_system_prefix(prefix, eos_token):

class DatasetEntry(BaseModel):
questions: list[str]
answers: list[str]
answers: list[str] | list[list[str]]
context: str | None = None
lang: str | None = None
length: int | None = None
@@ -56,17 +58,24 @@ def system_tag(self, eos_token: str) -> str | None:
relevant_system_infos = [
(k, v)
for k, v in self.__dict__.items()
if k not in ["questions", "answers"] and v is not None and str(v).replace("\n", "")
if k not in ["questions", "answers"]
and v is not None
and str(v).replace("\n", "")
and random() > SYSTEM_PROPERTY_DROP_PROBA
]
if len(relevant_system_infos) > 0:
shuffle(relevant_system_infos)
system_tag_key_values = "\n".join([f"{k}: {v}" for k, v in relevant_system_infos])
system_tag = f"{QA_SPECIAL_TOKENS['System']}{system_tag_key_values}\n{eos_token}"
return system_tag

def _get_formatted_rm(self, eos_token: str, max_replies: str, system_tag: None | str):
assert len(self.answers) > 1
answers = self.answers[:max_replies]
def _get_formatted_rm(self, eos_token: str, max_replies: int, system_tag: None | str):
if isinstance(self.answers[0], list):
answers = self.answers[0]
else:
answers = self.answers
assert len(answers) > 1 and max_replies > 1
answers = answers[:max_replies]
match len(self.questions):
case 0:
question = ""
@@ -79,7 +88,7 @@ def _get_formatted_rm(self, eos_token: str, max_replies: str, system_tag: None |
raise ValueError("Received more than one question in RM mode. This is unexpected. Aborting")
if system_tag is not None:
question = f"{system_tag}{question}"
return (question, answers) # NotImplementedError("This is currently not implemented.")
return (question, answers)

def get_formatted(self, mode: Mode, eos_token: str, **kwargs) -> str | list[str] | tuple[str, list[str]]:
system_tag = self.system_tag(eos_token)
@@ -97,7 +106,13 @@ def get_formatted(self, mode: Mode, eos_token: str, **kwargs) -> str | list[str]
qa_list = [system_tag]
else:
qa_list = list()
for q, a in zip_longest(self.questions, self.answers):
# check if this is a RM capable dataset (so it has multiple answers to the same question)
# and if so, extract just the highest scoring answer
if isinstance(self.answers[0], list):
answers = [answer[0] for answer in self.answers]
else:
answers = self.answers
for q, a in zip_longest(self.questions, answers):
match (q, a):
case (str(), str()):
qa_list.extend(
30 changes: 13 additions & 17 deletions model/model_training/custom_datasets/qa_datasets.py
@@ -194,8 +194,7 @@ def __init__(self, mode: str = "sft", max_answers: int = 5) -> None:

dataset = load_dataset("openai/webgpt_comparisons")

self.questions = []
self.answers = []
self.rows = []

question_answer_dict = defaultdict(dict)

@@ -208,25 +207,18 @@ def __init__(self, mode: str = "sft", max_answers: int = 5) -> None:
question_answer_dict[question][answer_1] = row["score_1"]

for question, answers in question_answer_dict.items():
self.questions.append(question)
# Sort answer dict with the highest score first (hence the prefactor -1).
# Then take only the first `max_answers` elements (usually there are just
# 2, but there are examples where we have more)
answers_sorted = [x[0] for x in sorted(answers.items(), key=lambda x: -1 * x[1])]
self.answers.append(answers_sorted[:max_answers])
self.rows.append(DatasetEntry(questions=[question], answers=[answers_sorted[:max_answers]], lang="en"))

def __len__(self) -> int:
return len(self.questions)
return len(self.rows)

def __getitem__(self, index) -> list[str] | tuple[list[str], list[str]]:
question = self.questions[index]
answers = self.answers[index]
if self.mode == "sft":
return [question, answers[0]]
elif self.mode == "rm":
return ([question], answers)
elif self.mode == "rl":
return (question,)
def __getitem__(self, index) -> DatasetEntry:
dialogue = self.rows[index]
return dialogue


class SODA(Dataset):
@@ -436,7 +428,7 @@ def load_alpaca_dataset(
generator = Generator()
generator.manual_seed(manual_seed)

def process_split(dataset: Subset) -> list[tuple[str, str]]:
def process_split(dataset: Subset, set_lang_as_eng: bool = False) -> list[tuple[str, str]]:
data = []

for row in dataset:
@@ -448,7 +440,11 @@ def process_split(dataset: Subset) -> list[tuple[str, str]]:
if (_filter_by_words(input_) is None) or (_filter_by_words(row["output"]) is None):
continue

data.append(DatasetEntry(questions=[input_], answers=[row["output"]]))
if set_lang_as_eng is True:
ds_entry = DatasetEntry(questions=[input_], answers=[row["output"]], lang="en")
else:
ds_entry = DatasetEntry(questions=[input_], answers=[row["output"]])
data.append(ds_entry)
return data

if dataset_name == "alpaca":
@@ -527,7 +523,7 @@ def __init__(self, cache_dir: str | Path, mode: str = "sft", input_max_length: i
)["train"]
for data in dataset:
if (qa := self.process_vicuna_conversations(data, input_max_length=input_max_length)) is not None:
self.pairs.append(DatasetEntry(questions=qa[0], answers=qa[1]))
self.pairs.append(DatasetEntry(questions=qa[0], answers=qa[1], lang="en"))

def __len__(self) -> int:
return len(self.pairs)
37 changes: 37 additions & 0 deletions model/model_training/tests/test_formatting.py
@@ -4,6 +4,43 @@
from model_training.custom_datasets.formatting import QA_SPECIAL_TOKENS, DatasetEntry


def test_dataset_entry_rm_mode():
ds_entry = DatasetEntry(
questions=["Instruction A."],
answers=[["Highest Scored Answer to A.", "Second Highest Scored Answer to A"]],
)

eos = "<|endofline|>"
formatted_rm = ds_entry.get_formatted(mode=Mode.rm, eos_token=eos)
expected_rm = (
["<|prompter|>Instruction A.<|endofline|>"],
[
"<|assistant|>Highest Scored Answer to A.<|endofline|>",
"<|assistant|>Second Highest Scored Answer to A<|endofline|>",
],
)
assert formatted_rm == expected_rm


def test_dataset_entry_sft_mode_compatible_with_rm():
ds_entry = DatasetEntry(
questions=["Instruction A.", "Followup Instruction B."],
answers=[
["Highest Scored Answer to A.", "Second Highest Scored Answer to A"],
["Highest Scored Answer to B.", "Second Highest Scored Answer to B"],
],
)
eos = "<|endofline|>"
formatted_sft = ds_entry.get_formatted(mode=Mode.sft, eos_token=eos)
expected_sft = [
f"{QA_SPECIAL_TOKENS['Question']}{ds_entry.questions[0]}{eos}",
f"{QA_SPECIAL_TOKENS['Answer']}{ds_entry.answers[0][0]}{eos}",
f"{QA_SPECIAL_TOKENS['Question']}{ds_entry.questions[1]}{eos}",
f"{QA_SPECIAL_TOKENS['Answer']}{ds_entry.answers[1][0]}{eos}",
]
assert formatted_sft == expected_sft


def test_dataset_entry_formatting_missing_lang():
ds_entry = DatasetEntry(
questions=["What is the capital of France?"],
19 changes: 11 additions & 8 deletions model/model_training/tests/test_ranking_collator.py
@@ -33,43 +33,46 @@ def pythia_tokenizer():
def test_ranking_collator_system_tag(pythia_tokenizer):
first_example = DatasetEntry(
questions=["First instruction."],
answers=["Answer to first instruction.", "Answer to first instruction."],
answers=[["Answer to first instruction.", "Answer to first instruction."]],
lang="en",
quality=0.7,
)
second_example = DatasetEntry(
questions=["Second instruction."],
answers=["Answer to second instruction.", "Answer to second instruction."],
answers=[["Answer to second instruction.", "Answer to second instruction."]],
humor=0.1,
length=1000,
)
examples = [first_example, second_example]
rdc = RankingDataCollator(tokenizer=pythia_tokenizer, padding=True)
batch, cu_lens = rdc(examples=examples)
assert len(batch) == 2
assert cu_lens == [0, len(first_example.answers), len(first_example.answers) + len(second_example.answers)]
assert batch.data["attention_mask"].shape[0] == 4 # we have 5 replies in total
assert cu_lens == [0, len(first_example.answers[0]), len(first_example.answers[0]) + len(second_example.answers[0])]
assert batch.data["attention_mask"].shape[0] == 4 # we have 4 replies in total
assert batch.data["input_ids"].shape == batch.data["attention_mask"].shape
eos = pythia_tokenizer.eos_token

# check each instruction
first_example_first_answer_decoded = pythia_tokenizer.decode(batch.data["input_ids"][0])
f"{QA_SPECIAL_TOKENS['Question']}{first_example.questions[0]}{eos}{QA_SPECIAL_TOKENS['Answer']}{first_example.answers[0]}{eos}" in first_example_first_answer_decoded
f"{QA_SPECIAL_TOKENS['Question']}{first_example.questions[0]}{eos}{QA_SPECIAL_TOKENS['Answer']}{first_example.answers[0][0]}{eos}" in first_example_first_answer_decoded
"lang: en" in first_example_first_answer_decoded
"quality: 0.7" in first_example_first_answer_decoded

first_example_second_answer_decoded = pythia_tokenizer.decode(batch.data["input_ids"][1])
f"{QA_SPECIAL_TOKENS['Question']}{first_example.questions[0]}{eos}{QA_SPECIAL_TOKENS['Answer']}{first_example.answers[1]}{eos}" in first_example_second_answer_decoded
f"{QA_SPECIAL_TOKENS['Question']}{first_example.questions[0]}{eos}{QA_SPECIAL_TOKENS['Answer']}{first_example.answers[0][1]}{eos}" in first_example_second_answer_decoded
"lang: en" in first_example_second_answer_decoded
"quality: 0.7" in first_example_second_answer_decoded

second_example_first_answer_decoded = pythia_tokenizer.decode(batch.data["input_ids"][2])
f"{QA_SPECIAL_TOKENS['Question']}{second_example.questions[0]}{eos}{QA_SPECIAL_TOKENS['Answer']}{second_example.answers[0]}{eos}" in second_example_first_answer_decoded
f"{QA_SPECIAL_TOKENS['Question']}{second_example.questions[0]}{eos}{QA_SPECIAL_TOKENS['Answer']}{second_example.answers[0][0]}{eos}" in second_example_first_answer_decoded
"humor: 0.1" in second_example_first_answer_decoded
"length: 1000" in second_example_first_answer_decoded

second_example_second_answer_decoded = pythia_tokenizer.decode(batch.data["input_ids"][2])
f"{QA_SPECIAL_TOKENS['Question']}{second_example.questions[0]}{eos}{QA_SPECIAL_TOKENS['Answer']}{second_example.answers[0]}{eos}" in second_example_second_answer_decoded
f"{QA_SPECIAL_TOKENS['Question']}{second_example.questions[0]}{eos}{QA_SPECIAL_TOKENS['Answer']}{second_example.answers[0][0]}{eos}" in second_example_second_answer_decoded
"humor: 0.1" in second_example_second_answer_decoded
"length: 1000" in second_example_second_answer_decoded

