update dataset entry to support list of list type (#2879)
#2827

In the [PR that introduced RM support for the dataset entry
class](#2867) I forgot that with RM we'll have multiple answers per
question, i.e. `[Q1, (A11, A12)]`, but I had introduced questions and
answers as flat `list` types, so there was no way to connect a question
to its answers: given `questions=[Q1, Q2]` and
`answers=[A11, A12, A21, A22]`, nothing tells us that `A11, A12` belong
to `Q1` and `A21, A22` to `Q2`. So I introduced a `list[list[str]]` type
for the answers, which lets us connect questions and answers by index:
with `questions=[Q1, Q2]` and `answers=[[A11, A12], [A21, A22]]`,
`answers[0]` belongs to `questions[0]`. Note that this is backwards
compatible, since `answers` is a union type of
`list[str] | list[list[str]]`. Also added tests for this.
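
For illustration, a minimal sketch of the new pairing, mirroring the tests added in this commit (the `eos_token` value is just an example):

```python
from model_training.custom_datasets.entities import Mode
from model_training.custom_datasets.formatting import DatasetEntry

# New nested form: answers[i] holds the ranked answers to questions[i].
entry = DatasetEntry(
    questions=["Q1", "Q2"],
    answers=[["A11", "A12"], ["A21", "A22"]],
)

# RM mode expects a single question and returns it together with its
# ranked answer list (see _get_formatted_rm below).
rm_entry = DatasetEntry(questions=["Q1"], answers=[["A11", "A12"]])
question, answers = rm_entry.get_formatted(mode=Mode.rm, eos_token="<|endofline|>")

# The old flat form still validates, since answers is a union of
# list[str] and list[list[str]].
legacy = DatasetEntry(questions=["Q1"], answers=["A1"])
```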


Ran `python check_dataset_appearances.py -d webgpt --cache_dir .cache
--mode rm`
and found one entry with an empty question:
```python
DatasetEntry(
    questions=[''],
    answers=[['Lebensraum is a German geopolitical concept that means "living space." The term was originally used to support colonialism, and was later adapted by Nazi leader Adolf Hitler to support his quest for German expansion to the east . German geographer and ethnographer Friedrich Ratzel first published an essay called "Der Lebensraum" ("The Living Space") in 1901, in which he posited that all people, animals, and plants need to expand their living space in order to survive . According to Ratzel, species that successfully adapted to one location would spread naturally to others . Hitler believed that Germany required Lebensraum in order to survive, and this conviction that this living space could be gained only in the east and, specifically, from Russia, shaped his policy after his take-over of power in Germany in 1933 . The Nazi Generalplan Ost policy (\'Master Plan for the East\') was based on the tenets of Lebensraum . It stipulated that Germany required a Lebensraum necessary for its survival and that most of the indigenous populations of Central and Eastern Europe would have to be removed permanently (either through mass deportation to Siberia, extermination, or enslavement) .', 'There are several ways to unblock blocked websites. One way is to use a good web-based proxy server . Another way is to type in the URL of the blocked site you want to access in the address bar, and then press Go or Enter . The web content will be sent to the proxy server where it can then be viewed from your device . This may make browsing a bit slower, but you should still be able to access any of your favorite websites . Another way to unblock blocked websites is to use a VPN (Virtual Private Network) . A VPN can be used to access region-restricted websites, shield your web browsing activities on public WiFi networks, and more .']],
    context=None,
    lang=None,
    length=None,
    quality=None,
    humor=None,
    creativity=None,
)
```
So this was the result: 
```bash
'Found the following occurances in TRAIN webgpt:'
{re.compile('^[\\s\\n]*$'): ['']}
```
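
For context, here is a minimal sketch of the kind of scan that produces this output. It is an assumption about what `check_dataset_appearances.py` does, not its actual implementation:

```python
import re

# Hypothetical re-implementation of the blank-question scan; the real
# check_dataset_appearances.py may differ.
BLANK = re.compile(r"^[\s\n]*$")

def find_blank_questions(entries) -> dict:
    """Map the blank pattern to every question it matches.

    `entries` is assumed to be an iterable of DatasetEntry objects.
    """
    hits: dict = {}
    for entry in entries:
        for question in entry.questions:
            if BLANK.match(question):
                hits.setdefault(BLANK, []).append(question)
    return hits
```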
CloseChoice authored Apr 27, 2023
1 parent 618ad47 commit 3e06b48
Showing 6 changed files with 143 additions and 47 deletions.
30 changes: 18 additions & 12 deletions model/model_training/README.md
@@ -212,21 +212,27 @@ deepspeed trainer_sft.py --configs defaults your-model-name --deepspeed
Here is an incomplete overview of datasets for sft:

<!-- prettier-ignore -->
dataset_name | train_counts | eval_counts | total_counts
----------------------------------------------------------------

<!-- prettier-ignore -->
-webgpt | 15662 | 3916 | 19578
-squad_v2 | 130319 | 11873 | 142192
-adversarial_qa | 30000 | 3000 | 33000
-trivia_qa_nocontext | 138384 | 17944 | 156328
-xsum | 204045 | 11332 | 215377
-cnn_dailymail | 287113 | 13368 | 300481
-multi_news | 44972 | 5622 | 50594
-scitldr | 1992 | 619 | 2611
-joke | 301 | 76 | 377
-gsm8k | 7473 | 1319 | 8792
-dive_mt | 6192 | 1548 | 7740
+joke | 301 | 76 | 377
+webgpt | 14251 | 3563 | 17814
+gpt4all | 313552 | 78388 | 391940
+alpaca | 41361 | 10346 | 51707
+code_alpaca | 16017 | 4004 | 20021
+vicuna | 46939 | 11735 | 58674
+minimath | 2304 | 576 | 2880
+humaneval_mbpp_codegen_qa | 472 | 119 | 591
+humaneval_mbpp_testgen_qa | 472 | 119 | 591
+grade_school_math_instructions | 7033 | 1759 | 8792
+recipes | 3797 | 950 | 4747
+cmu_wiki_qa | 1288 | 322 | 1610
+oa_wiki_qa_bart_10000row | 8000 | 2000 | 10000
+prosocial_dialogue | 157160 | 26983 | 184143
+explain_prosocial | 360708 | 61248 | 421956
+soda | 924102 | 231026 | 1155128
+oa_leet10k | 18728 | 4683 | 23411

This list can be generated with the following command, but beware that this
downloads all available datasets (>100GB):
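
The command itself is not shown in this excerpt. As a reference, one of the example invocations this commit adds as comments at the bottom of check_dataset_counts.py:

```bash
python check_dataset_counts.py --datasets joke webgpt gpt4all alpaca code_alpaca vicuna minimath humaneval_mbpp_codegen_qa humaneval_mbpp_testgen_qa grade_school_math_instructions recipes cmu_wiki_qa oa_wiki_qa_bart_10000row prosocial_dialogue explain_prosocial soda oa_leet10k --mode sft
```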
43 changes: 41 additions & 2 deletions model/model_training/check_dataset_counts.py
@@ -1,10 +1,13 @@
import argparse
from collections import Counter
from enum import Enum
from pathlib import Path
from typing import Any

import pandas as pd
import yaml
from langdetect import DetectorFactory, detect
from model_training.custom_datasets.formatting import DatasetEntry
from model_training.utils.utils import _strtobool, get_dataset


@@ -54,6 +57,7 @@ def argument_parsing(notebook=False, notebook_args=None):
)
parser.add_argument("--mode", dest="mode", type=Mode, choices=list(Mode))
parser.add_argument("--output_path", dest="output_path", default="dataset_counts.csv")
parser.add_argument("--detect_language", default=False, action="store_true")

if notebook:
args, remaining = parser.parse_known_args(notebook_args)
@@ -93,6 +97,7 @@ def argument_parsing(notebook=False, notebook_args=None):
conf["output_path"] = args.output_path
conf["datasets_extra"] = []
conf["datasets"] = datasets_list
conf["detect_language"] = args.detect_language
# Override config from command-line
parser = argparse.ArgumentParser()
for key, value in conf.items():
@@ -111,9 +116,37 @@ def argument_parsing(notebook=False, notebook_args=None):
if __name__ == "__main__":
args = argument_parsing()

train, evals = get_dataset(args, mode=args.mode)
train, evals = get_dataset(args, mode=args.mode.value)
overview_df = pd.DataFrame(columns=["dataset_name", "train_counts", "eval_counts", "total_counts"])
for idx, dataset_name in enumerate(args.datasets):
language_df = pd.DataFrame()
if args.detect_language:
DetectorFactory.seed = 0
for idx, (dataset_name, dataset) in enumerate(evals.items()):
train_lang = Counter()
if args.detect_language:
length = len(dataset)
for idx1, row in enumerate(dataset):
if idx1 % 1000 == 0:
print(f"{idx1} of {length} of ds {dataset_name}.")
try:
if isinstance(row, (list, tuple)):
train_lang += Counter([detect(k) for k in row])
elif isinstance(row, DatasetEntry):
train_lang += Counter([detect(k) for k in row.questions if k])
if isinstance(row.answers[0], list):
for answers in row.answers:
train_lang += Counter([detect(k) for k in answers if k])
else:
train_lang += Counter([detect(k) for k in row.answers if k])
else:
raise ValueError(
f"Did not expect the type {type(row)}. Should be either list, tuple or DatasetEntry."
)
except Exception as e:
print(e)
train_lang = dict(train_lang)
train_lang["dataset_name"] = dataset_name
language_df = pd.concat([language_df, pd.DataFrame([train_lang])])
eval_count = len(evals.get(dataset_name, []))
overview_df.loc[idx] = [
dataset_name,
@@ -122,4 +155,10 @@ def argument_parsing(notebook=False, notebook_args=None):
len(train.datasets[idx]) + eval_count,
]
print(overview_df)
print(language_df)
overview_df.to_csv(args.output_path, index=False)
language_df.to_csv("language_counts.csv", index=False)

# python check_dataset_counts.py --datasets joke webgpt gpt4all alpaca code_alpaca vicuna minimath humaneval_mbpp_codegen_qa humaneval_mbpp_testgen_qa grade_school_math_instructions recipes cmu_wiki_qa oa_wiki_qa_bart_10000row prosocial_dialogue explain_prosocial soda oa_leet10k --mode sft
# python check_dataset_counts.py --datasets joke webgpt alpaca code_alpaca vicuna minimath humaneval_mbpp_codegen_qa humaneval_mbpp_testgen_qa grade_school_math_instructions recipes cmu_wiki_qa oa_wiki_qa_bart_10000row prosocial_dialogue oa_leet10k --mode sft
# python check_dataset_counts.py --datasets joke webgpt --mode sft
31 changes: 23 additions & 8 deletions model/model_training/custom_datasets/formatting.py
@@ -1,11 +1,13 @@
from itertools import zip_longest
from random import shuffle
from random import random, shuffle

from langcodes import Language
from model_training.custom_datasets.entities import Mode
from pydantic import BaseModel, validator
from pydantic.fields import ModelField

SYSTEM_PROPERTY_DROP_PROBA = 0.5

QA_SPECIAL_TOKENS = {
"Question": "<|prompter|>",
"Answer": "<|assistant|>",
@@ -25,7 +27,7 @@ def format_system_prefix(prefix, eos_token):

class DatasetEntry(BaseModel):
questions: list[str]
answers: list[str]
answers: list[str] | list[list[str]]
context: str | None = None
lang: str | None = None
length: int | None = None
@@ -56,17 +58,24 @@ def system_tag(self, eos_token: str) -> str | None:
relevant_system_infos = [
(k, v)
for k, v in self.__dict__.items()
if k not in ["questions", "answers"] and v is not None and str(v).replace("\n", "")
if k not in ["questions", "answers"]
and v is not None
and str(v).replace("\n", "")
and random() > SYSTEM_PROPERTY_DROP_PROBA
]
if len(relevant_system_infos) > 0:
shuffle(relevant_system_infos)
system_tag_key_values = "\n".join([f"{k}: {v}" for k, v in relevant_system_infos])
system_tag = f"{QA_SPECIAL_TOKENS['System']}{system_tag_key_values}\n{eos_token}"
return system_tag

def _get_formatted_rm(self, eos_token: str, max_replies: str, system_tag: None | str):
assert len(self.answers) > 1
answers = self.answers[:max_replies]
def _get_formatted_rm(self, eos_token: str, max_replies: int, system_tag: None | str):
if isinstance(self.answers[0], list):
answers = self.answers[0]
else:
answers = self.answers
assert len(answers) > 1 and max_replies > 1
answers = answers[:max_replies]
match len(self.questions):
case 0:
question = ""
@@ -79,7 +88,7 @@ def _get_formatted_rm(self, eos_token: str, max_replies: str, system_tag: None |
raise ValueError("Received more than one question in RM mode. This is unexpected. Aborting")
if system_tag is not None:
question = f"{system_tag}{question}"
return (question, answers) # NotImplementedError("This is currently not implemented.")
return (question, answers)

def get_formatted(self, mode: Mode, eos_token: str, **kwargs) -> str | list[str] | tuple[str, list[str]]:
system_tag = self.system_tag(eos_token)
@@ -97,7 +106,13 @@ def get_formatted(self, mode: Mode, eos_token: str, **kwargs) -> str | list[str]
qa_list = [system_tag]
else:
qa_list = list()
for q, a in zip_longest(self.questions, self.answers):
# check if this is a RM capable dataset (so it has multiple answers to the same question)
# and if so, extract just the highest scoring answer
if isinstance(self.answers[0], list):
answers = [answer[0] for answer in self.answers]
else:
answers = self.answers
for q, a in zip_longest(self.questions, answers):
match (q, a):
case (str(), str()):
qa_list.extend(
30 changes: 13 additions & 17 deletions model/model_training/custom_datasets/qa_datasets.py
@@ -194,8 +194,7 @@ def __init__(self, mode: str = "sft", max_answers: int = 5) -> None:

dataset = load_dataset("openai/webgpt_comparisons")

self.questions = []
self.answers = []
self.rows = []

question_answer_dict = defaultdict(dict)

@@ -208,25 +207,18 @@ def __init__(self, mode: str = "sft", max_answers: int = 5) -> None:
question_answer_dict[question][answer_1] = row["score_1"]

for question, answers in question_answer_dict.items():
self.questions.append(question)
# Sort answer dict with the highest score first (hence the prefactor -1).
# Then take only the first `max_answers` elements (usually there are just
# 2, but there are examples where we have more)
answers_sorted = [x[0] for x in sorted(answers.items(), key=lambda x: -1 * x[1])]
self.answers.append(answers_sorted[:max_answers])
self.rows.append(DatasetEntry(questions=[question], answers=[answers_sorted[:max_answers]], lang="en"))

def __len__(self) -> int:
return len(self.questions)
return len(self.rows)

def __getitem__(self, index) -> list[str] | tuple[list[str], list[str]]:
question = self.questions[index]
answers = self.answers[index]
if self.mode == "sft":
return [question, answers[0]]
elif self.mode == "rm":
return ([question], answers)
elif self.mode == "rl":
return (question,)
def __getitem__(self, index) -> DatasetEntry:
dialogue = self.rows[index]
return dialogue


class SODA(Dataset):
@@ -436,7 +428,7 @@ def load_alpaca_dataset(
generator = Generator()
generator.manual_seed(manual_seed)

def process_split(dataset: Subset) -> list[tuple[str, str]]:
def process_split(dataset: Subset, set_lang_as_eng: bool = False) -> list[tuple[str, str]]:
data = []

for row in dataset:
@@ -448,7 +440,11 @@ def process_split(dataset: Subset) -> list[tuple[str, str]]:
if (_filter_by_words(input_) is None) or (_filter_by_words(row["output"]) is None):
continue

data.append(DatasetEntry(questions=[input_], answers=[row["output"]]))
if set_lang_as_eng is True:
ds_entry = DatasetEntry(questions=[input_], answers=[row["output"]], lang="en")
else:
ds_entry = DatasetEntry(questions=[input_], answers=[row["output"]])
data.append(ds_entry)
return data

if dataset_name == "alpaca":
@@ -527,7 +523,7 @@ def __init__(self, cache_dir: str | Path, mode: str = "sft", input_max_length: i
)["train"]
for data in dataset:
if (qa := self.process_vicuna_conversations(data, input_max_length=input_max_length)) is not None:
self.pairs.append(DatasetEntry(questions=qa[0], answers=qa[1]))
self.pairs.append(DatasetEntry(questions=qa[0], answers=qa[1], lang="en"))

def __len__(self) -> int:
return len(self.pairs)
37 changes: 37 additions & 0 deletions model/model_training/tests/test_formatting.py
@@ -4,6 +4,43 @@
from model_training.custom_datasets.formatting import QA_SPECIAL_TOKENS, DatasetEntry


def test_dataset_entry_rm_mode():
ds_entry = DatasetEntry(
questions=["Instruction A."],
answers=[["Highest Scored Answer to A.", "Second Highest Scored Answer to A"]],
)

eos = "<|endofline|>"
formatted_rm = ds_entry.get_formatted(mode=Mode.rm, eos_token=eos)
expected_rm = (
["<|prompter|>Instruction A.<|endofline|>"],
[
"<|assistant|>Highest Scored Answer to A.<|endofline|>",
"<|assistant|>Second Highest Scored Answer to A<|endofline|>",
],
)
assert formatted_rm == expected_rm


def test_dataset_entry_sft_mode_compatible_with_rm():
ds_entry = DatasetEntry(
questions=["Instruction A.", "Followup Instruction B."],
answers=[
["Highest Scored Answer to A.", "Second Highest Scored Answer to A"],
["Highest Scored Answer to B.", "Second Highest Scored Answer to B"],
],
)
eos = "<|endofline|>"
formatted_sft = ds_entry.get_formatted(mode=Mode.sft, eos_token=eos)
expected_sft = [
f"{QA_SPECIAL_TOKENS['Question']}{ds_entry.questions[0]}{eos}",
f"{QA_SPECIAL_TOKENS['Answer']}{ds_entry.answers[0][0]}{eos}",
f"{QA_SPECIAL_TOKENS['Question']}{ds_entry.questions[1]}{eos}",
f"{QA_SPECIAL_TOKENS['Answer']}{ds_entry.answers[1][0]}{eos}",
]
assert formatted_sft == expected_sft


def test_dataset_entry_formatting_missing_lang():
ds_entry = DatasetEntry(
questions=["What is the capital of France?"],
19 changes: 11 additions & 8 deletions model/model_training/tests/test_ranking_collator.py
@@ -33,43 +33,46 @@ def pythia_tokenizer():
def test_ranking_collator_system_tag(pythia_tokenizer):
first_example = DatasetEntry(
questions=["First instruction."],
answers=["Answer to first instruction.", "Answer to first instruction."],
answers=[["Answer to first instruction.", "Answer to first instruction."]],
lang="en",
quality=0.7,
)
second_example = DatasetEntry(
questions=["Second instruction."],
answers=["Answer to second instruction.", "Answer to second instruction."],
answers=[["Answer to second instruction.", "Answer to second instruction."]],
humor=0.1,
length=1000,
)
examples = [first_example, second_example]
rdc = RankingDataCollator(tokenizer=pythia_tokenizer, padding=True)
batch, cu_lens = rdc(examples=examples)
assert len(batch) == 2
assert cu_lens == [0, len(first_example.answers), len(first_example.answers) + len(second_example.answers)]
assert batch.data["attention_mask"].shape[0] == 4 # we have 5 replies in total
assert cu_lens == [0, len(first_example.answers[0]), len(first_example.answers[0]) + len(second_example.answers[0])]
assert batch.data["attention_mask"].shape[0] == 4 # we have 4 replies in total
assert batch.data["input_ids"].shape == batch.data["attention_mask"].shape
eos = pythia_tokenizer.eos_token

# check each instruction
first_example_first_answer_decoded = pythia_tokenizer.decode(batch.data["input_ids"][0])
f"{QA_SPECIAL_TOKENS['Question']}{first_example.questions[0]}{eos}{QA_SPECIAL_TOKENS['Answer']}{first_example.answers[0]}{eos}" in first_example_first_answer_decoded
f"{QA_SPECIAL_TOKENS['Question']}{first_example.questions[0]}{eos}{QA_SPECIAL_TOKENS['Answer']}{first_example.answers[0][0]}{eos}" in first_example_first_answer_decoded
"lang: en" in first_example_first_answer_decoded
"quality: 0.7" in first_example_first_answer_decoded

first_example_second_answer_decoded = pythia_tokenizer.decode(batch.data["input_ids"][1])
f"{QA_SPECIAL_TOKENS['Question']}{first_example.questions[0]}{eos}{QA_SPECIAL_TOKENS['Answer']}{first_example.answers[1]}{eos}" in first_example_second_answer_decoded
f"{QA_SPECIAL_TOKENS['Question']}{first_example.questions[0]}{eos}{QA_SPECIAL_TOKENS['Answer']}{first_example.answers[0][1]}{eos}" in first_example_second_answer_decoded
"lang: en" in first_example_second_answer_decoded
"quality: 0.7" in first_example_second_answer_decoded

second_example_first_answer_decoded = pythia_tokenizer.decode(batch.data["input_ids"][2])
f"{QA_SPECIAL_TOKENS['Question']}{second_example.questions[0]}{eos}{QA_SPECIAL_TOKENS['Answer']}{second_example.answers[0]}{eos}" in second_example_first_answer_decoded
f"{QA_SPECIAL_TOKENS['Question']}{second_example.questions[0]}{eos}{QA_SPECIAL_TOKENS['Answer']}{second_example.answers[0][0]}{eos}" in second_example_first_answer_decoded
"humor: 0.1" in second_example_first_answer_decoded
"length: 1000" in second_example_first_answer_decoded

second_example_second_answer_decoded = pythia_tokenizer.decode(batch.data["input_ids"][2])
f"{QA_SPECIAL_TOKENS['Question']}{second_example.questions[0]}{eos}{QA_SPECIAL_TOKENS['Answer']}{second_example.answers[0]}{eos}" in second_example_second_answer_decoded
f"{QA_SPECIAL_TOKENS['Question']}{second_example.questions[0]}{eos}{QA_SPECIAL_TOKENS['Answer']}{second_example.answers[0][0]}{eos}" in second_example_second_answer_decoded
"humor: 0.1" in second_example_second_answer_decoded
"length: 1000" in second_example_second_answer_decoded

