Add new run_swag example #9175
contexts: list of str. The untokenized text of the first sequence (context of corresponding question).
endings: list of str. multiple choice's options. Its length must be equal to contexts' length.
label: (Optional) string. The label of the example. This should be
specified for train and dev examples, but not for test examples.
shouldn't there be a tab?
specified for train and dev examples, but not for test examples.
specified for train and dev examples, but not for test examples.
This is not written by me, I just copied it there. (We don't care since it's not rendered in the docs.)
cached_features_file = os.path.join(
    data_dir,
    "cached_{}_{}_{}_{}".format(
no f-strings? ;-)
Again, not my file ;-)
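For reference, the `str.format` call in the hunk above has a direct f-string equivalent. This is a minimal sketch only: the hunk truncates the actual arguments, so `mode`, `model_name`, `max_seq_length`, and `task` below are hypothetical stand-ins, not the values used in the file under review.

```python
import os

# Hypothetical values; the real arguments are cut off in the diff hunk.
data_dir = "data"
mode, model_name, max_seq_length, task = "train", "bert", 128, "swag"

# Original style: positional str.format placeholders.
cached_old = os.path.join(
    data_dir,
    "cached_{}_{}_{}_{}".format(mode, model_name, max_seq_length, task),
)

# Equivalent f-string version: the names appear inline in the template.
cached_new = os.path.join(
    data_dir,
    f"cached_{mode}_{model_name}_{max_seq_length}_{task}",
)
```

Both expressions build the same path; the f-string simply keeps each value next to the position where it is used.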
def get_train_examples(self, data_dir):
    """See base class."""
    logger.info("LOOKING AT {} train".format(data_dir))
why capitalized here?
Still not my file :-p
class RaceProcessor(DataProcessor):
    """Processor for the RACE data set."""
If possible, I think it makes a lot of sense to link to the dataset in datasets, if available, here.
class SwagProcessor(DataProcessor):
    """Processor for the SWAG data set."""
link to dataset in datasets
class ArcProcessor(DataProcessor):
    """Processor for the ARC data set (request from allennlp)."""
link to dataset if available
def __post_init__(self):
    if self.train_file is not None:
        extension = self.train_file.split(".")[-1]
        assert extension in ["csv", "json"], "`train_file` should be a csv or a json file."
There can also be files in .json format that don't have the .json extension (I saw that quite a bit when porting datasets...). Maybe better to have a try/except here?
The script is not universal, so I'd leave it to users to make the change if they have a JSON file without the .json extension. The bottom line is that if they did not prepare the file themselves, it's unlikely to work out of the box, since the data is expected to have the same format as swag on datasets.
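For readers who do want the fallback discussed in this thread, here is one way it could look. This is a sketch under stated assumptions, not what the PR merged: `infer_file_format` is a hypothetical helper that does not exist in `run_swag.py`, and it only recognizes a file containing a single JSON document (a JSON Lines file would need line-by-line parsing instead).

```python
import json
import os
import tempfile


def infer_file_format(path):
    """Return "csv" or "json" for `path` (hypothetical helper, not part of run_swag.py)."""
    extension = os.path.splitext(path)[1].lstrip(".").lower()
    if extension in ("csv", "json"):
        return extension
    # Fallback for JSON files saved without a .json extension:
    # try to parse the content before giving up.
    try:
        with open(path, encoding="utf-8") as f:
            json.load(f)
    except (json.JSONDecodeError, UnicodeDecodeError) as err:
        raise ValueError(f"`{path}` should be a csv or a json file.") from err
    return "json"


# Usage: a JSON file written without any extension is still detected.
with tempfile.TemporaryDirectory() as tmp_dir:
    extensionless = os.path.join(tmp_dir, "train")
    with open(extensionless, "w", encoding="utf-8") as f:
        json.dump([{"sent1": "A person is cooking."}], f)
    detected = infer_file_format(extensionless)
```

Raising `ValueError` instead of using `assert` is a deliberate choice in this sketch, since assertions are skipped when Python runs with `-O`.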
Nice! Left some comments, mostly nits
Yes, LGTM! Great!
| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/master/examples/multiple-choice) | SWAG, RACE, ARC | ✅ | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb) |
| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/master/examples/multiple-choice) | SWAG, RACE, ARC | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb) |
This is a satisfying diff :)
@@ -0,0 +1,349 @@
# coding=utf-8
# Copyright The HuggingFace Team and The HuggingFace Inc. team. All rights reserved.
I think we usually put the date here (2020) ?
examples/multiple-choice/run_swag.py
Outdated
# https://huggingface.co/docs/datasets/loading_datasets.html.

# Load pretrained model and tokenizer
#
Same octothorp-related question
# Flatten out
first_sentences = sum(first_sentences, [])
second_sentences = sum(second_sentences, [])

# Tokenize
tokenized_examples = tokenizer(
    first_sentences,
    second_sentences,
    truncation=True,
    max_length=data_args.max_seq_length,
    padding="max_length" if data_args.pad_to_max_length else False,
)
# Un-flatten
return {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}
This is nice, and understandable. Good job!
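The flatten / tokenize / un-flatten pattern in this hunk can be tried out without a real tokenizer. In the sketch below, `fake_tokenizer` is a hypothetical stand-in that returns one entry per (first, second) pair, which is the only property the regrouping step relies on; the sentences and `num_choices` are illustrative values, not data from the PR.

```python
def fake_tokenizer(first, second):
    """Stand-in for a real tokenizer: one output entry per (first, second) pair."""
    return {"input_ids": [[len(a), len(b)] for a, b in zip(first, second)]}


num_choices = 4

# Two examples, each already expanded into four (context, ending) pairs.
first_sentences = [["ctx A"] * num_choices, ["ctx BB"] * num_choices]
second_sentences = [[f"end {i}" for i in range(num_choices)] for _ in range(2)]

# Flatten out: sum(list_of_lists, []) concatenates the inner lists.
flat_first = sum(first_sentences, [])
flat_second = sum(second_sentences, [])

# Tokenize all pairs in a single call.
tokenized = fake_tokenizer(flat_first, flat_second)

# Un-flatten: regroup every `num_choices` consecutive entries into one example.
grouped = {
    k: [v[i : i + num_choices] for i in range(0, len(v), num_choices)]
    for k, v in tokenized.items()
}
```

After regrouping, each key maps to one list per example, and each of those lists holds one entry per choice. `sum(..., [])` is fine for small batches; `itertools.chain.from_iterable` is a linear-time alternative if the lists get large.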
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
examples/multiple-choice/run_swag.py
Outdated
#
I'm satisfied.
* Add new run_swag example
* Add check
* Add sample
* Apply suggestions from code review
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Very important change to make Lysandre happy
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
What does this PR do?
This PR adds a new example for multiple-choice using Trainer and Datasets, and moves the older one to the legacy folder.