Build a data pipeline for running unlabelled papers through the labeller #45
Comments
@dsmith111 just created a script that may help convert data into the MATCH format: https://github.com/nasa-petal/PeTaL-labeller/tree/main/scripts/lens-cleaner
@bruffridge do we need any follow-ups or further scripting for this issue? I know migrating to Lambda was mentioned.
Eventually the plan is for this to run in Lambda. For now, we just need a Python script that downloads the JSON for every paper from MAG and transforms it into the format expected by MATCH.
Some metrics for consideration: 5,000 papers per API request (more than that might time out, based on my testing). Total: ~5 hours, ~28 GB, and 2,600 requests to pull down 13,000,000 biology papers using the MAG API.
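As a rough sanity check on those totals, here is a minimal sketch of the arithmetic. It assumes 1 GB = 1000 MB and that time and download size are spread evenly across requests, neither of which was measured per request; the constant names are made up for illustration.

```python
import math

# Figures quoted in the comment above.
TOTAL_PAPERS = 13_000_000
PAPERS_PER_REQUEST = 5_000
TOTAL_HOURS = 5   # approximate
TOTAL_GB = 28     # approximate

# Number of requests needed to cover every paper.
num_requests = math.ceil(TOTAL_PAPERS / PAPERS_PER_REQUEST)

# Average cost per request, assuming an even spread (an assumption).
seconds_per_request = TOTAL_HOURS * 3600 / num_requests
mb_per_request = TOTAL_GB * 1000 / num_requests

print(num_requests, round(seconds_per_request, 1), round(mb_per_request, 1))
```

That works out to roughly 7 seconds and 11 MB per request on average, which may be useful when budgeting for a Lambda timeout later.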
The papers returned are consistently ordered between requests even without a sortby parameter, which means we can probably use the offset and count parameters to chunk through the full dataset.
1st API request:
2nd API request:
3rd API request:
~5 hours later, 2600th API request:
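The chunking idea could be sketched like this. `iter_chunks` is a hypothetical helper, not something in the repo, and it relies on the observed (not guaranteed) stable ordering between requests.

```python
def iter_chunks(total, chunk_size=5000):
    """Yield (offset, count) pairs that cover `total` items in order."""
    for offset in range(0, total, chunk_size):
        # The last chunk may be smaller than chunk_size.
        yield offset, min(chunk_size, total - offset)


# 13,000,000 papers at 5,000 per request.
chunks = list(iter_chunks(13_000_000))
print(len(chunks), chunks[0], chunks[1], chunks[-1])
```

Since the ordering is only an observation, it may be worth de-duplicating on paper Id afterwards in case results shift between requests.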
This is the API request to get all biology papers from MAG API.
https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate?expr=And(Ty='0',Or(Composite(J.JN=='biomimetics'), Composite(F.FN=='biology')))&model=latest&count=10&offset=0&attributes=Id,DOI,Ti,VFN,F.FN,AA.AuId,AW,RId
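For scripting, the same request can be parameterised on offset and count. This is only a sketch that builds the URL; `build_evaluate_url` is a hypothetical name, the default count of 5000 comes from the timing comment, and any authentication the API needs is not handled here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate"
# Query expression and attribute list copied from the request above.
EXPR = "And(Ty='0',Or(Composite(J.JN=='biomimetics'), Composite(F.FN=='biology')))"
ATTRIBUTES = "Id,DOI,Ti,VFN,F.FN,AA.AuId,AW,RId"


def build_evaluate_url(offset, count=5000):
    """Return the evaluate URL for one page of results."""
    params = {
        "expr": EXPR,
        "model": "latest",
        "count": count,
        "offset": offset,
        "attributes": ATTRIBUTES,
    }
    # urlencode percent-encodes the expression so it is safe in a URL.
    return BASE_URL + "?" + urlencode(params)


print(build_evaluate_url(offset=5000))
```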
These will be the papers we will run through our labeller. If 13 million papers is technically infeasible or difficult to process (@elkong can help weigh in on this), let me know and we can figure out how to filter this down further.
Put them in this JSON format.
Example:
paper = MAG paper ID
mag = MAG normalized field of study name F.FN (all lowercase, spaces replaced by underscores)
venue = MAG venue full name VFN
author = array of MAG author IDs
reference = array of MAG paper IDs
text = title + abstract (tokenize the text, remove all punctuation, and convert all characters to lowercase)
label = empty array
How to clean text in python: https://machinelearningmastery.com/clean-text-machine-learning-python/
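The field mapping and text cleaning could look roughly like the sketch below. Several things here are assumptions: the entity is assumed to be a dict shaped like `{"Id": ..., "Ti": ..., "VFN": ..., "F": [{"FN": ...}], "AA": [{"AuId": ...}], "RId": [...]}`, the abstract is passed in as a plain string because how to obtain it is not settled here, and `text` is stored as space-joined tokens. `clean_text` and `to_match_record` are hypothetical names.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Lowercase, split on whitespace, and strip punctuation from each token."""
    tokens = text.lower().split()
    tokens = [t.translate(_PUNCT_TABLE) for t in tokens]
    # Drop tokens that were punctuation only.
    return [t for t in tokens if t]


def to_match_record(entity, abstract=""):
    """Map one MAG entity (assumed shape) to the fields listed above."""
    title = entity.get("Ti", "")
    return {
        "paper": entity["Id"],
        "mag": [f["FN"].lower().replace(" ", "_") for f in entity.get("F", [])],
        "venue": entity.get("VFN", ""),
        "author": [a["AuId"] for a in entity.get("AA", [])],
        "reference": entity.get("RId", []),
        "text": " ".join(clean_text(title + " " + abstract)),
        "label": [],
    }
```

Note that stripping punctuation this way joins hyphenated words ("gecko-inspired" becomes "geckoinspired"); replacing punctuation with spaces instead is an easy change if MATCH expects the parts as separate tokens.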
Then put the JSON on a single line per paper in a .json file.
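Writing one JSON object per line is straightforward with the standard library. A minimal sketch, with `write_json_lines` as a hypothetical helper name and the output path left up to the caller:

```python
import json


def write_json_lines(records, path):
    """Write each record as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps emits no raw newlines, so each record stays on one line.
            f.write(json.dumps(record) + "\n")
```

Because `records` can be any iterable, a generator can be passed in so the full 13 million papers never need to be held in memory at once.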