-
Notifications
You must be signed in to change notification settings - Fork 518
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: Add support of multiple datasets in config #889
Changes from 6 commits
c46dd6c
83eff03
c7fcdab
da9df43
d6913c9
46ea2bc
90a6294
abed501
09078c4
071456d
c4af9ea
c1307bf
78d871d
ccd99cf
61c6d22
1ba70ac
c0746d5
913b352
c44badf
26b2997
4faa4d4
0d11d88
90731a3
854eb05
f7a3f95
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we still need to make these changes to the test samples to use Dataset across all the individual dataset test files? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we need to keep the |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -7,6 +7,7 @@ | |
from unittest.mock import patch | ||
|
||
import pytest | ||
from datasets import Dataset | ||
|
||
from tests.test_utils import get_assets_path | ||
from torchtune.data._common import CROSS_ENTROPY_IGNORE_IDX | ||
|
@@ -28,17 +29,19 @@ def test_label_no_masking(self, load_dataset, tokenizer): | |
""" | ||
|
||
# mock the call to HF datasets | ||
load_dataset.return_value = [ | ||
{ | ||
"instruction": "Give three tips for staying healthy.", | ||
"input": "", | ||
"output": ( | ||
"1.Eat a balanced diet and make sure to include plenty of fruits and vegetables." | ||
"2. Exercise regularly to keep your body active and strong." | ||
"3. Get enough sleep and maintain a consistent sleep schedule." | ||
), | ||
} | ||
] | ||
load_dataset.return_value = Dataset.from_list( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What's the story with these There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Technically it's a more accurate return type of load dataset, I think it's ok to just leave these out and stick to primitives for simplicity, but I don't have a strong opinion here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've invested considerable time into understanding how to test my new dataset class. It was initially unclear regarding the required format and content of its elements. Hence, I suggest providing clear guidelines to save fellow programmers time, elucidating the expected format for dataset elements. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks, this is very good feedback. I agree that the contracts of various dataset components are not always obvious and take some time to sort through. Aside from improving live docs and better code comments, I'm open to any suggestions you have on how to make this clearer based on your experience. |
||
[ | ||
{ | ||
"instruction": "Give three tips for staying healthy.", | ||
"input": "", | ||
"output": ( | ||
"1.Eat a balanced diet and make sure to include plenty of fruits and vegetables." | ||
"2. Exercise regularly to keep your body active and strong." | ||
"3. Get enough sleep and maintain a consistent sleep schedule." | ||
), | ||
} | ||
] | ||
) | ||
|
||
alpaca_ds = alpaca_dataset(tokenizer=tokenizer) | ||
input, labels = alpaca_ds[0] | ||
|
@@ -55,17 +58,19 @@ def test_label_masking(self, load_dataset, tokenizer): | |
""" | ||
|
||
# mock the call to HF datasets | ||
load_dataset.return_value = [ | ||
{ | ||
"instruction": "Give three tips for staying healthy.", | ||
"input": "", | ||
"output": ( | ||
"1.Eat a balanced diet and make sure to include plenty of fruits and vegetables." | ||
"2. Exercise regularly to keep your body active and strong." | ||
"3. Get enough sleep and maintain a consistent sleep schedule." | ||
), | ||
} | ||
] | ||
load_dataset.return_value = Dataset.from_list( | ||
[ | ||
{ | ||
"instruction": "Give three tips for staying healthy.", | ||
"input": "", | ||
"output": ( | ||
"1.Eat a balanced diet and make sure to include plenty of fruits and vegetables." | ||
"2. Exercise regularly to keep your body active and strong." | ||
"3. Get enough sleep and maintain a consistent sleep schedule." | ||
), | ||
} | ||
] | ||
) | ||
|
||
alpaca_ds = alpaca_dataset(tokenizer=tokenizer, train_on_input=False) | ||
|
||
|
@@ -90,17 +95,19 @@ def test_alpaca_clean(self, load_dataset, tokenizer): | |
""" | ||
|
||
# mock the call to HF datasets | ||
load_dataset.return_value = [ | ||
{ | ||
"instruction": "Give three tips for staying healthy.", | ||
"input": "", | ||
"output": ( | ||
"1.Eat a balanced diet and make sure to include plenty of fruits and vegetables." | ||
"2. Exercise regularly to keep your body active and strong." | ||
"3. Get enough sleep and maintain a consistent sleep schedule." | ||
), | ||
} | ||
] | ||
load_dataset.return_value = Dataset.from_list( | ||
[ | ||
{ | ||
"instruction": "Give three tips for staying healthy.", | ||
"input": "", | ||
"output": ( | ||
"1.Eat a balanced diet and make sure to include plenty of fruits and vegetables." | ||
"2. Exercise regularly to keep your body active and strong." | ||
"3. Get enough sleep and maintain a consistent sleep schedule." | ||
), | ||
} | ||
] | ||
) | ||
|
||
alpaca_ds = alpaca_cleaned_dataset(tokenizer=tokenizer) | ||
input, labels = alpaca_ds[0] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can also just directly check if it's a ListConfig. If it's a single dataset then this might fail
I also wonder if there's a better way to handle this so we don't have to repeat this if-else check across all recipes...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, i've replaced it with
ListConfig
check.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this logic can be moved (for example) to
config.instantiate
method, but I guess it will break single responsibility principle. So I suggest leaving it as is.