Add new run_swag example #9175

Merged
sgugger merged 5 commits into master from run_swag
Dec 18, 2020

Conversation

Collaborator

@sgugger sgugger commented Dec 17, 2020

What does this PR do?

This PR adds a new example for multiple-choice using Trainer and Datasets, and moves the older one to the legacy folder.

contexts: list of str. The untokenized text of the first sequence (context of corresponding question).
endings: list of str. multiple choice's options. Its length must be equal to contexts' length.
label: (Optional) string. The label of the example. This should be
specified for train and dev examples, but not for test examples.
Contributor

shouldn't there be a tab?

Suggested change
specified for train and dev examples, but not for test examples.
specified for train and dev examples, but not for test examples.

Collaborator Author

This is not written by me, I just copied it there. (We don't care since it's not rendered in the docs.)
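The fields listed in the quoted docstring can be sketched as a small dataclass. This is only an illustration: the class name `MultipleChoiceExample` is hypothetical and not taken from the PR, and only the three fields visible in the excerpt are modeled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MultipleChoiceExample:
    """Hypothetical container mirroring the fields named in the docstring."""

    contexts: List[str]          # untokenized first sequences, one per choice
    endings: List[str]           # candidate endings, one per choice
    label: Optional[str] = None  # set for train/dev examples, None for test

    def __post_init__(self):
        # The docstring requires as many endings as contexts.
        if len(self.contexts) != len(self.endings):
            raise ValueError(
                f"Expected as many endings as contexts, got "
                f"{len(self.endings)} endings for {len(self.contexts)} contexts."
            )


example = MultipleChoiceExample(
    contexts=["She opened the door."] * 2,
    endings=["She walked in.", "The door sang."],
    label="0",
)
```

Validating the lengths at construction time makes the docstring's constraint fail early instead of during tokenization.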


cached_features_file = os.path.join(
    data_dir,
    "cached_{}_{}_{}_{}".format(
Contributor

no f-strings? ;-)

Collaborator Author

Again, not my file ;-)
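For readers unfamiliar with the suggestion, the sketch below shows that an f-string yields the same file name as `str.format`. The four values are hypothetical placeholders, since the excerpt does not show the arguments actually passed to `.format()`.

```python
import os

# Hypothetical placeholder values; the real arguments are not shown in the excerpt.
data_dir = "data"
mode, tokenizer_name, max_seq_length, task = "train", "BertTokenizer", 128, "swag"

# Style used in the legacy file.
with_format = os.path.join(
    data_dir,
    "cached_{}_{}_{}_{}".format(mode, tokenizer_name, max_seq_length, task),
)

# Equivalent f-string, as hinted at in the review comment.
with_fstring = os.path.join(
    data_dir,
    f"cached_{mode}_{tokenizer_name}_{max_seq_length}_{task}",
)

assert with_format == with_fstring
```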


def get_train_examples(self, data_dir):
    """See base class."""
    logger.info("LOOKING AT {} train".format(data_dir))
Contributor

why capitalized here?

Collaborator Author

Still not my file :-p



class RaceProcessor(DataProcessor):
    """Processor for the RACE data set."""
Contributor

If possible I think it makes a lot of sense to link to the dataset in datasets if available here



class SwagProcessor(DataProcessor):
    """Processor for the SWAG data set."""
Contributor

link to dataset in datasets



class ArcProcessor(DataProcessor):
    """Processor for the ARC data set (request from allennlp)."""
Contributor

link to dataset if available

def __post_init__(self):
    if self.train_file is not None:
        extension = self.train_file.split(".")[-1]
        assert extension in ["csv", "json"], "`train_file` should be a csv or a json file."
Contributor

There can also be files in JSON format that don't have the .json extension (I've seen that quite a bit when porting datasets). Maybe it would be better to have a try/except here?

Collaborator Author

The script is not universal, so I'd leave it to users to make the change if they have a JSON file without the .json extension. The bottom line is that if they did not prepare the file themselves, it's unlikely to work out of the box, since the data is expected to have the same format as swag on datasets.
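A user who does want the try/except behavior discussed in this thread could adapt the check along the following lines. This is a sketch and not part of the PR: `guess_data_format` is a hypothetical helper, it reads the whole file into memory, and it treats both a single JSON document and JSON Lines as "json".

```python
import json
from pathlib import Path


def guess_data_format(path):
    """Hypothetical helper: return "csv" or "json" for a data file.

    A known extension is trusted as-is. Otherwise the content is parsed as
    JSON (whole file first, then line by line) before giving up.
    """
    extension = Path(path).suffix.lower().lstrip(".")
    if extension in ("csv", "json"):
        return extension

    text = Path(path).read_text(encoding="utf-8")
    try:
        json.loads(text)  # a single JSON document
        return "json"
    except json.JSONDecodeError:
        pass

    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError(f"{path} is empty, cannot guess its format.")
    try:
        for line in lines:  # JSON Lines: one JSON value per line
            json.loads(line)
    except json.JSONDecodeError:
        raise ValueError(
            f"{path} has no csv/json extension and does not parse as JSON."
        ) from None
    return "json"
```

Sniffing the content costs a full read of the file, which is one reason to keep the plain extension check as the default.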

Contributor

@patrickvonplaten left a comment

Nice! Left some comments, mostly nits

Member

@LysandreJik left a comment

Yes, LGTM! Great!

| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/master/examples/multiple-choice) | SWAG, RACE, ARC | ✅ | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb)
| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/master/examples/multiple-choice) | SWAG, RACE, ARC | ✅ | ✅ | | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb)
Member

This is a satisfying diff :)

examples/multiple-choice/README.md (outdated, resolved)
@@ -0,0 +1,349 @@
# coding=utf-8
# Copyright The HuggingFace Team and The HuggingFace Inc. team. All rights reserved.
Member

I think we usually put the date here (2020) ?

examples/multiple-choice/run_swag.py (outdated, resolved)
# https://huggingface.co/docs/datasets/loading_datasets.html.

# Load pretrained model and tokenizer
#
Member

Same octothorp-related question

Comment on lines +274 to +287
# Flatten out
first_sentences = sum(first_sentences, [])
second_sentences = sum(second_sentences, [])

# Tokenize
tokenized_examples = tokenizer(
    first_sentences,
    second_sentences,
    truncation=True,
    max_length=data_args.max_seq_length,
    padding="max_length" if data_args.pad_to_max_length else False,
)
# Un-flatten
return {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}
Member

This is nice, and understandable. Good job!
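The flatten, tokenize, un-flatten pattern in the excerpt can be tried without any model dependencies. The sketch below makes several assumptions: `toy_tokenizer` is a stand-in for the real tokenizer, the way `first_sentences` is built (each context repeated once per choice) is not shown in the excerpt and is a guess, and `itertools.chain.from_iterable` replaces `sum(..., [])` for flattening.

```python
from itertools import chain

NUM_CHOICES = 4  # the excerpt regroups results in chunks of 4


def toy_tokenizer(first, second):
    """Stand-in tokenizer: whitespace-splits each (first, second) pair."""
    return {
        "tokens": [a.split() + ["[SEP]"] + b.split() for a, b in zip(first, second)]
    }


def preprocess(contexts, endings):
    # Assumption: each context is repeated once per candidate ending.
    first_sentences = [[context] * NUM_CHOICES for context in contexts]
    second_sentences = endings

    # Flatten: one long list of pairs so the tokenizer sees a flat batch.
    first_flat = list(chain.from_iterable(first_sentences))
    second_flat = list(chain.from_iterable(second_sentences))

    tokenized = toy_tokenizer(first_flat, second_flat)

    # Un-flatten: regroup every NUM_CHOICES consecutive entries per example.
    return {
        k: [v[i : i + NUM_CHOICES] for i in range(0, len(v), NUM_CHOICES)]
        for k, v in tokenized.items()
    }


batch = preprocess(
    contexts=["A man sits down.", "The dog barks."],
    endings=[
        ["He eats.", "He flies.", "He melts.", "He sings."],
        ["It runs.", "It reads.", "It sleeps.", "It drives."],
    ],
)
```

Here `batch["tokens"]` holds one entry per example, each containing `NUM_CHOICES` token lists, which is the nested shape a multiple-choice model expects.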

sgugger and others added 2 commits December 18, 2020 14:10
#

Member

I'm satisfied.

@sgugger sgugger merged commit 9a25c5b into master Dec 18, 2020
@sgugger sgugger deleted the run_swag branch December 18, 2020 19:19
thevasudevgupta pushed a commit to thevasudevgupta/transformers that referenced this pull request Dec 23, 2020
* Add new run_swag example

* Add check

* Add sample

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Very important change to make Lysandre happy

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
guyrosin pushed a commit to guyrosin/transformers that referenced this pull request Jan 15, 2021