Update run_glue for do_predict with local test data (#9442) #9486
Conversation
The error messages from CircleCI are:

I'm sorry, but I'd like to ask whether these failures are related to my changes.
The failures seem spurious indeed. I've left a comment about moving the test dataset creation a bit, but overall it looks good to me.
```python
if data_args.task_name is None and data_args.test_file is not None:
    extension = data_args.test_file.split(".")[-1]
    assert extension in ["csv", "json"], "`test_file` should be a csv or a json file."
    if data_args.test_file.endswith(".csv"):
        # Loading a dataset from a local csv file
        test_dataset = load_dataset("csv", data_files={"test": data_args.test_file})
    else:
        # Loading a dataset from a local json file
        test_dataset = load_dataset("json", data_files={"test": data_args.test_file})
```
Can we put those lines earlier, with the validation dataset? This way the `map` will be done together with the other datasets. I think we can do something nice by creating `data_files = {"train": data_args.train_file, "validation": data_args.validation_file}` and then adding the `"test"` key if the `test_file` is passed, as sketched below.
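A minimal sketch of what this suggestion could look like (the variable names are assumed to follow run_glue.py; this is an illustration, not the final diff):

```python
from datasets import load_dataset

# Build the dict of local files for the splits that are always present...
data_files = {"train": data_args.train_file, "validation": data_args.validation_file}

# ...and add the "test" key only when a local test file is passed, so that a
# single load_dataset call (and the later .map) covers every split at once.
if training_args.do_predict and data_args.test_file is not None:
    data_files["test"] = data_args.test_file

# "csv" or "json", depending on the extension of the provided files.
extension = data_args.train_file.split(".")[-1]
datasets = load_dataset(extension, data_files=data_files)
```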
Thank you for your comment! I've addressed the review. The number of nested `if` statements in the code has increased, but I think readability may have improved in terms of "whether to use a GLUE task or local files" (see the skeleton below). What do you think?
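For illustration, the top-level branching described here looks roughly like this (a sketch only, not the exact PR diff):

```python
from datasets import load_dataset

if data_args.task_name is not None:
    # GLUE task: download the dataset for the given task from the hub.
    datasets = load_dataset("glue", data_args.task_name)
else:
    # Local files: build data_files as sketched above, then load the csv/json files.
    ...
```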
I also added logger output so that a user can confirm the local files they specified are being loaded correctly. If this is superfluous, please let me know and I will remove it.
Looking good! Left two more small comments and it should be good to merge!
```python
if training_args.do_predict:
    if data_args.test_file is not None:
        extension = data_args.test_file.split(".")[-1]
        assert extension in ["csv", "json"], "`test_file` should be a csv or a json file."
```
The extension will need to be the same as for the training and validation files, so we should adapt this assert to test that.
Following your comments, the `assert` now checks that the `test_file` has the same extension as the `train_file`. Also, I noticed there was no check that the `validation_file` has the same extension as the `train_file`, so I added one as well (see the sketch below). Is this change OK?
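A minimal sketch of the extension checks described above (assuming the argument names from run_glue.py; the actual diff may differ):

```python
# Derive the reference extension from the training file.
train_extension = data_args.train_file.split(".")[-1]
assert train_extension in ["csv", "json"], "`train_file` should be a csv or a json file."

# The validation file must use the same extension as the training file.
validation_extension = data_args.validation_file.split(".")[-1]
assert (
    validation_extension == train_extension
), "`validation_file` should have the same extension (csv or json) as `train_file`."

# For --do_predict, a local test file must also match the training file's extension.
if training_args.do_predict and data_args.test_file is not None:
    test_extension = data_args.test_file.split(".")[-1]
    assert (
        test_extension == train_extension
    ), "`test_file` should have the same extension (csv or json) as `train_file`."
```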
```python
datasets = load_dataset(
    "json", data_files=data_files
)
```
Suggested change:

```python
datasets = load_dataset("json", data_files=data_files)
```
Can fit in one line now :-)
Thank you!
It seems the old code from before the auto-formatting was left in place.
I have applied the auto-format in b2936c3; could you please check that it now fits on one line?
```python
for key in data_files.keys():
    logger.info(f"load a local file for {key}: {data_files[key]}")
```
This is good info to log, thanks for adding!
Great, thanks for adding!
@sgugger @LysandreJik
What does this PR do?
Currently, run_glue.py cannot use the test set (`do_predict`) unless we give it a GLUE task name. This PR allows us to use a local test dataset.
As commented in #9442, I tried to achieve this functionality with only simple changes.
If we pass `--do_predict` without adding the specific parameters, we get an error message saying that we need either the GLUE task name or the path to a local test file.
Fixes #9442
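As an illustration of that error path (a sketch only; the message text and argument names are assumptions, not the exact code in run_glue.py):

```python
# If --do_predict is requested but neither a GLUE task name nor a local test
# file was provided, fail early with an explicit message.
if training_args.do_predict:
    if data_args.task_name is None and data_args.test_file is None:
        raise ValueError(
            "Need either a GLUE task name or a local test file (`--test_file`) for `do_predict`."
        )
```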
Who can review?
@sgugger
Thank you for your kind comments on the issue.
I have tried to keep it simple and hope it is acceptable as an example script.