This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

Script to download and restore annotated data #4856

Merged 5 commits on Nov 3, 2022

@Golovneva Golovneva commented Nov 1, 2022

Patch description
Updating ROSCOE with script to download and restore human-annotated data.
Updating links.

Testing steps
% python projects/roscoe/roscoe_data/restore_annotated.py
...
Saved SEMEVAL dataset in ./projects/roscoe/roscoe_data/generated/semevalcommonsense.json
Saved GSM8K dataset in ./projects/roscoe/roscoe_data/generated/gsm8k.json

Note: The SEMEVAL dataset has not actually been released yet, so it is commented out in the code.

@Golovneva Golovneva marked this pull request as ready for review November 3, 2022 13:46

@moyapchen moyapchen left a comment

Looks reasonable for the most part, couple of strings

@@ -1,4 +1,38 @@
#!/bin/bash

Ty for putting this together!

reasoning = reasonings[quu1["key"]]["reasoning"]
quu1["gpt-3"] = reasoning
# TODO: TMP, remove!
quu1["dataset"] = "semevalcommonsense_gpt3_expl"

Is this the right path?

@Golovneva (author) replied:

lol, thank you for looking! This is not a path; it's a key-value pair I added to the output JSON so that it would exactly match your files and I could easily run "diff". Will remove!
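
The comparison described in this reply can also be done without writing a temporary key into the output. The following is a minimal sketch, not code from this PR: `load_json_lines` and `records_match` are hypothetical helpers, and the one-JSON-object-per-line layout is an assumption about the files being compared.

```python
import json


def load_json_lines(path):
    """Read one JSON object per non-empty line (assumed file layout)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def records_match(generated, reference, ignore_keys=("dataset",)):
    """Compare two lists of dicts, skipping any key listed in ignore_keys."""

    def strip(record):
        return {k: v for k, v in record.items() if k not in ignore_keys}

    if len(generated) != len(reference):
        return False
    return all(strip(g) == strip(r) for g, r in zip(generated, reference))
```

Calling `records_match(load_json_lines(new_file), load_json_lines(old_file))` would then report whether the two files agree on everything except the ignored keys, so the extra "dataset" entry never has to be written.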

data = json.loads(line.strip())
blob = {}
blob["premise"] = data["question"]
blob["hypothesis"] = (

lol, I wonder if we could get rid of this and the other code that strips this

parse_cosmos(input_file, model_output_reasoning, save_file)
print(f"Saved COSMOSQA dataset in {save_file}")
elif dataset == 'semevalcommonsense':
# input_file = '/private/home/aslic/scorer/data/semevalcomsense/train_filter_maryam.xml'

"maryam" in the path

@Golovneva (author) replied:

thank you!

@Golovneva Golovneva merged commit 5f41ba7 into main Nov 3, 2022
@Golovneva Golovneva deleted the olggol/ha-sets branch November 3, 2022 16:08
# this file will contain a command to download annotated datasets into "roscoe_data/annotated" folder
echo "Sorry, Pending data release approval"

PATH_TO_DATA="./projects/roscoe/roscoe_data"

nit: Usually data is downloaded to ParlAI/data/{project}, where the path to that data/ folder is found in opt['datapath']. I think one reason is that data/ is in the .gitignore (and there may be other assumptions coded in elsewhere).
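
As an illustration of the convention described in this comment, here is a minimal sketch of deriving the download location from `opt['datapath']` instead of a hard-coded `PATH_TO_DATA`. The helper names `project_data_dir` and `annotated_dir` are hypothetical, `opt` is a plain dict standing in for the real options object, and the `annotated` subfolder name is borrowed from the script comment above; none of this is taken from the merged code.

```python
import os


def project_data_dir(opt, project="roscoe"):
    """Return <datapath>/<project>; `opt` is a plain dict here (assumption)."""
    return os.path.join(opt["datapath"], project)


def annotated_dir(opt):
    """Return <datapath>/roscoe/annotated, creating it if it does not exist."""
    path = os.path.join(project_data_dir(opt), "annotated")
    os.makedirs(path, exist_ok=True)
    return path
```

With this layout the downloaded files would land under the data/ folder that the comment says is already covered by .gitignore, rather than inside the projects/ tree.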

4 participants