Script to download and restore annotated data #4856
Looks reasonable for the most part; a couple of strings to check.
@@ -1,4 +1,38 @@
#!/bin/bash
Ty for putting this together!
reasoning = reasonings[quu1["key"]]["reasoning"]
quu1["gpt-3"] = reasoning
# TODO: TMP, remove!
quu1["dataset"] = "semevalcommonsense_gpt3_expl" |
Is this the right path?
lol, thank you for looking! This is not a path; it's a key-value pair in the output JSON that I added to match your files exactly, so I could easily run "diff". Will remove!
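An alternative to adding a temporary key just to make "diff" pass is to compare the records while ignoring that key. A minimal sketch, assuming both outputs are JSON-lines text (one object per line); the helper names `load_records` and `same_records` are hypothetical and not part of this PR:

```python
import json


def load_records(lines, ignore_keys=()):
    """Parse JSON-lines text, dropping the given keys from every record."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for key in ignore_keys:
            record.pop(key, None)
        records.append(record)
    return records


def same_records(lines_a, lines_b, ignore_keys=("dataset",)):
    """True when both inputs hold the same records, ignoring ignore_keys."""
    return load_records(lines_a, ignore_keys) == load_records(lines_b, ignore_keys)


# Made-up sample records, for illustration only.
mine = ['{"key": "q1", "gpt-3": "because"}']
theirs = ['{"key": "q1", "gpt-3": "because", "dataset": "semevalcommonsense_gpt3_expl"}']
match = same_records(mine, theirs)
```

Open file objects iterate over lines, so two files can be passed in directly in place of the lists.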
data = json.loads(line.strip())
blob = {}
blob["premise"] = data["question"]
blob["hypothesis"] = ( |
lol, I wonder if we could get rid of this and the other code that strips this
parse_cosmos(input_file, model_output_reasoning, save_file)
print(f"Saved COSMOSQA dataset in {save_file}")
elif dataset == 'semevalcommonsense':
# input_file = '/private/home/aslic/scorer/data/semevalcomsense/train_filter_maryam.xml' |
"maryam" in the path
thank you!
# this file will contain a command to download annotated datasets into "roscoe_data/annotated" folder
echo "Sorry, Pending data release approval"

PATH_TO_DATA="./projects/roscoe/roscoe_data" |
nit: Usually data is downloaded to ParlAI/data/{project}, where the path to that data/ folder is found in opt['datapath']. I think one reason is that data/ is in the .gitignore (and there may be other assumptions coded in, in other places).
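A rough sketch of what that convention could look like, assuming `opt` is a plain dict carrying a 'datapath' entry as described above. The helper name `annotated_data_dir`, the "roscoe" project folder, and the example `opt` value are illustrative assumptions, not code from this PR:

```python
import os


def annotated_data_dir(opt, project="roscoe"):
    """Build <datapath>/<project>/annotated from the data root in opt['datapath']."""
    return os.path.join(opt["datapath"], project, "annotated")


# Hypothetical opt for illustration; the real dict would come from ParlAI's
# argument parsing rather than being written by hand.
opt = {"datapath": os.path.join("ParlAI", "data")}
target = annotated_data_dir(opt)
```

Deriving the folder from opt['datapath'] keeps downloads under the data/ folder that .gitignore already covers, instead of inside the project source tree.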
Patch description
Updating ROSCOE with script to download and restore human-annotated data.
Updating links.
Testing steps
% python projects/roscoe/roscoe_data/restore_annotated.py
...
Saved SEMEVAL dataset in ./projects/roscoe/roscoe_data/generated/semevalcommonsense.json
Saved GSM8K dataset in ./projects/roscoe/roscoe_data/generated/gsm8k.json
Note: The SEMEVAL dataset is not actually released yet, so it is commented out in the code.