Skip to content
This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

Add HuggingFace datasets support #3570

Merged
merged 24 commits into from
Apr 26, 2021
Merged

Add HuggingFace datasets support #3570

merged 24 commits into from
Apr 26, 2021

Conversation

meganung
Copy link
Contributor

@meganung meganung commented Apr 5, 2021

Adding support for HuggingFace datasets.

The AbstractHuggingFaceTeacher is written as an abstract class. To use, extend the Teacher and set these properties:
hf_name = name parameter passed into hugging face load_dataset function
hf_text_fields = list of names of the data fields from the dataset to be included in the text/query
hf_label_field = name of the data field from the hf dataset that specifies the label of the episode
hf_splits_mapping = dictionary mapping with the keys 'train', 'valid', and 'test', that map to the
names of the splits of the hf dataset.

The default setup_data function was written with glue and superglue datasets in mind. If the hugging face dataset is not formatted in a way that the current setup_data function can handle, manually override this function.

Glue task is created with the Glue Cola teacher added.

@albertvillanova
Copy link

Thanks @meganung for your PR.

Just let me know if there is anything we can do on Hugging Face Datasets to help you with the integration. 🤗

Copy link
Contributor

@stephenroller stephenroller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comments, larger comments in 1:1

parlai/tasks/huggingface/agents.py Outdated Show resolved Hide resolved
parlai/tasks/huggingface/agents.py Outdated Show resolved Hide resolved
parlai/tasks/huggingface/agents.py Outdated Show resolved Hide resolved
parlai/tasks/huggingface/agents.py Outdated Show resolved Hide resolved
parlai/tasks/huggingface/agents.py Show resolved Hide resolved
parlai/tasks/huggingface/agents.py Outdated Show resolved Hide resolved
Copy link
Contributor

@stephenroller stephenroller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great this is really close. Just small refactors left.

Comment on lines 30 to 32
hf_path: str = 'glue'
hf_name: str = 'cola'
hf_text_fields: List[str] = ['sentence']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's make these three default to NotImplemented in the abstract teacher, and override them specifically in the glue/cola teacher

The abstract teacher should raise a NotImplementedError if any of the values are NotImplemented

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use property accessor methods to do this, you'll have the added benefit that these fields are immutable. e.g.:

@property
def hf_text_fields(self) -> List[str]:
    return ['sentence']

Right now if someone does:

fields = self.hf_text_fields
fields.append['another']

self.hf_text_fields will contain 'another', which might not be obvious

parlai/tasks/huggingface/agents.py Outdated Show resolved Hide resolved
def __init__(self, opt, shared=None):
self.fold = DatatypeHelper.fold(opt['datatype'])
self.hf_split = self.hf_splits_mapping[self.fold]
opt['datafile'] = self.hf_split
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this os.path.join(opt['datapath'], 'huggingface', self.hf_path, self.hf_name, self.fold)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is used for caching the number of items in the dataset and we'll have a name collision otherwise

parlai/tasks/huggingface/agents.py Outdated Show resolved Hide resolved
parlai/tasks/huggingface/agents.py Outdated Show resolved Hide resolved
candidates = [row[l] for l in pre_candidates]
yield (query, [label], None, candidates), True


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Carve out a new GlueColaTeacher that fills in the particular values for that task



class TestDefaultTeacher(AutoTeacherTest):
task = 'huggingface'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to glue_cola (and the class name too)

parlai/tasks/huggingface/agents.py Outdated Show resolved Hide resolved
query = '\n'.join(text_arr)

# construct label and candidates
label = row[self.hf_label_field]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactor this into a _get_label_value method

parlai/tasks/huggingface/agents.py Outdated Show resolved Hide resolved
Copy link
Contributor

@spencerp spencerp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! I'd lean toward merging this soon and then iterating on it as we have more concrete use cases.

Will let @stephenroller accept since he has more task/teacher context.

hf_name: str = 'cola'
hf_text_fields: List[str] = ['sentence']
hf_label_field: str = 'label'
hf_splits_mapping: dict = {'train': 'train', 'valid': 'validation', 'test': 'test'}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit; since this dictionary has an implicit schema, it might be a good candidate for a TypedDict

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about just dict[str, str]?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value of a TypedDict that I'm anticipating is being able to define a comprehensive set of possible keys. So no one tries to assign to or read from "validation", for example.

parlai/tasks/huggingface/agents.py Outdated Show resolved Hide resolved
Comment on lines 30 to 32
hf_path: str = 'glue'
hf_name: str = 'cola'
hf_text_fields: List[str] = ['sentence']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use property accessor methods to do this, you'll have the added benefit that these fields are immutable. e.g.:

@property
def hf_text_fields(self) -> List[str]:
    return ['sentence']

Right now if someone does:

fields = self.hf_text_fields
fields.append['another']

self.hf_text_fields will contain 'another', which might not be obvious

parlai/tasks/huggingface/agents.py Outdated Show resolved Hide resolved
if self.render_text_field:
text_part = col + ': ' + text_part
text_arr.append(text_part)
query = '\n'.join(text_arr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some other tasks we use a configurable self.delimiter instead of hard-coding '\n' since some models expect different separators.

Copy link
Contributor

@stephenroller stephenroller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, just move those methods out!

Manually override if needed.
"""

def _get_text_value(row) -> Tuple[str, Dict[str, str]]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move these into the class so they're overridable

hf_splits_mapping = {'train': 'train', 'valid': 'validation', 'test': 'test'}


class DefaultTeacher(ColaTeacher):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would like us to add the rest of GLUE, but that can be the next PR if you want



class DefaultTeacher(AbstractHuggingFaceTeacher):
def __init__():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a standard constructor

@meganung meganung merged commit 6120596 into master Apr 26, 2021
@meganung meganung deleted the add-hf branch April 26, 2021 17:01
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants