-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Add HuggingFace datasets support #3570
Conversation
Thanks @meganung for your PR. Just let me know if there is anything we can do on Hugging Face Datasets to help you with the integration. 🤗 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small comments, larger comments in 1:1
…ove this to the development side
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great this is really close. Just small refactors left.
parlai/tasks/huggingface/agents.py
Outdated
hf_path: str = 'glue' | ||
hf_name: str = 'cola' | ||
hf_text_fields: List[str] = ['sentence'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's make these three default to NotImplemented
in the abstract teacher, and override them specifically in the glue/cola teacher
The abstract teacher should raise a NotImplementedError
if any of the values are NotImplemented
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use property accessor methods to do this, you'll have the added benefit that these fields are immutable. e.g.:
@property
def hf_text_fields(self) -> List[str]:
return ['sentence']
Right now if someone does:
fields = self.hf_text_fields
fields.append['another']
self.hf_text_fields
will contain 'another', which might not be obvious
parlai/tasks/huggingface/agents.py
Outdated
def __init__(self, opt, shared=None): | ||
self.fold = DatatypeHelper.fold(opt['datatype']) | ||
self.hf_split = self.hf_splits_mapping[self.fold] | ||
opt['datafile'] = self.hf_split |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make this os.path.join(opt['datapath'], 'huggingface', self.hf_path, self.hf_name, self.fold)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is used for caching the number of items in the dataset and we'll have a name collision otherwise
parlai/tasks/huggingface/agents.py
Outdated
candidates = [row[l] for l in pre_candidates] | ||
yield (query, [label], None, candidates), True | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Carve out a new GlueColaTeacher
that fills in the particular values for that task
parlai/tasks/huggingface/test.py
Outdated
|
||
|
||
class TestDefaultTeacher(AutoTeacherTest): | ||
task = 'huggingface' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change to glue_cola
(and the class name too)
parlai/tasks/huggingface/agents.py
Outdated
query = '\n'.join(text_arr) | ||
|
||
# construct label and candidates | ||
label = row[self.hf_label_field] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactor this into a _get_label_value
method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! I'd lean toward merging this soon and then iterating on it as we have more concrete use cases.
Will let @stephenroller accept since he has more task/teacher context.
parlai/tasks/huggingface/agents.py
Outdated
hf_name: str = 'cola' | ||
hf_text_fields: List[str] = ['sentence'] | ||
hf_label_field: str = 'label' | ||
hf_splits_mapping: dict = {'train': 'train', 'valid': 'validation', 'test': 'test'} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit; since this dictionary has an implicit schema, it might be a good candidate for a TypedDict
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about just dict[str, str]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The value of a TypedDict that I'm anticipating is being able to define a comprehensive set of possible keys. So no one tries to assign to or read from "validation", for example.
parlai/tasks/huggingface/agents.py
Outdated
hf_path: str = 'glue' | ||
hf_name: str = 'cola' | ||
hf_text_fields: List[str] = ['sentence'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use property accessor methods to do this, you'll have the added benefit that these fields are immutable. e.g.:
@property
def hf_text_fields(self) -> List[str]:
return ['sentence']
Right now if someone does:
fields = self.hf_text_fields
fields.append['another']
self.hf_text_fields
will contain 'another', which might not be obvious
parlai/tasks/huggingface/agents.py
Outdated
if self.render_text_field: | ||
text_part = col + ': ' + text_part | ||
text_arr.append(text_part) | ||
query = '\n'.join(text_arr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In some other tasks we use a configurable self.delimiter
instead of hard-coding '\n'
since some models expect different separators.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, just move those methods out!
parlai/tasks/huggingface/agents.py
Outdated
Manually override if needed. | ||
""" | ||
|
||
def _get_text_value(row) -> Tuple[str, Dict[str, str]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move these into the class so they're overridable
hf_splits_mapping = {'train': 'train', 'valid': 'validation', 'test': 'test'} | ||
|
||
|
||
class DefaultTeacher(ColaTeacher): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would like us to add the rest of GLUE, but that can be the next PR if you want
parlai/tasks/huggingface/agents.py
Outdated
|
||
|
||
class DefaultTeacher(AbstractHuggingFaceTeacher): | ||
def __init__(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use a standard constructor
Adding support for HuggingFace datasets.
The
AbstractHuggingFaceTeacher
is written as an abstract class. To use, extend the Teacher and set these properties:hf_name
= name parameter passed into hugging face load_dataset functionhf_text_fields
= list of names of the data fields from the dataset to be included in the text/queryhf_label_field
= name of the data field from the hf dataset that specifies the label of the episodehf_splits_mapping
= dictionary mapping with the keys 'train', 'valid', and 'test', that map to thenames of the splits of the hf dataset.
The default
setup_data
function was written with glue and superglue datasets in mind. If the hugging face dataset is not formatted in a way that the currentsetup_data
function can handle, manually override this function.Glue task is created with the Glue Cola teacher added.