
Build a data pipeline for running unlabelled papers through the labeller #45

bruffridge opened this issue Jun 17, 2021 · 5 comments


bruffridge commented Jun 17, 2021

  1. Get paper data for the ~13 million biology papers from the MAG API. See #65 ("Use open-source Snorkel to create labelling functions to expand our training dataset") for more details about interfacing with the API.

This is the API request to get all biology papers from the MAG API:
https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate?expr=And(Ty='0',Or(Composite(J.JN=='biomimetics'), Composite(F.FN=='biology')))&model=latest&count=10&offset=0&attributes=Id,DOI,Ti,VFN,F.FN,AA.AuId,AW,RId

These are the papers we will run through our labeller. If 13 million papers are technically infeasible or difficult to process (@elkong can weigh in on this), let me know and we can figure out how to filter them down further. A minimal fetch sketch is below.
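A minimal fetch sketch for step 1, not the final pipeline: it assumes the standard Cognitive Services Ocp-Apim-Subscription-Key header and that the response JSON lists papers under an "entities" key (worth verifying against real output); MAG_KEY is a placeholder, not a real key.

# Sketch only: fetch one page of biology papers from the MAG evaluate endpoint.
import requests

MAG_ENDPOINT = "https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate"
MAG_KEY = "YOUR_SUBSCRIPTION_KEY"  # placeholder

params = {
    "expr": "And(Ty='0',Or(Composite(J.JN=='biomimetics'), Composite(F.FN=='biology')))",
    "model": "latest",
    "count": 5000,   # papers per request; see the metrics comment further down
    "offset": 0,
    "attributes": "Id,DOI,Ti,VFN,F.FN,AA.AuId,AW,RId",
}

resp = requests.get(
    MAG_ENDPOINT,
    params=params,
    headers={"Ocp-Apim-Subscription-Key": MAG_KEY},
    timeout=120,
)
resp.raise_for_status()
entities = resp.json()["entities"]  # one dict per paper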

  2. Transform the paper data into the format expected by MATCH.

Put each paper in this JSON format.

Example:

{
  "paper": "2133743025",
  "doi": "10.1016/J.CUB.2007.07.011"
  "mag": [
    "microtubule_polymerization", "microtubule", "tubulin", "guanosine_triphosphate", "growth_rate", "gtp'", "optical_tweezers", "biophysics", "dimer", "biology"
  ],
  "venue": "Current biology",
  "author": [
    "2305659199", "2275630009", "2294310593", "1706693917", "2152058803"
  ],
  "reference": [
    "2002430130", "2089645884", "1848121837"
  ],
  "text": "microtubule assembly dynamics at the nanoscale background the labile nature of microtubules is critical for establishing cellular morphology and motility yet the molecular basis of assembly remains unclear here we use optical tweezers to track microtubule polymerization against microfabricated barriers permitting unprecedented spatial resolution",
  "label": []
}

paper = MAG paper Id
mag = array of MAG normalized field of study names, F.FN (all lowercase, spaces replaced by underscores)
venue = MAG venue full name, VFN
author = array of MAG author Ids, AA.AuId
reference = array of MAG paper Ids, RId
text = title + abstract (tokenize the text, remove all punctuation, and convert all characters to lowercase)
label = empty array

How to clean text in Python: https://machinelearningmastery.com/clean-text-machine-learning-python/
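A minimal cleaning sketch along those lines (whitespace tokenization, punctuation stripping, lowercasing); clean_text is just an illustrative helper name:

import string

def clean_text(title, abstract):
    # Tokenize on whitespace, strip punctuation from each token, lowercase.
    table = str.maketrans("", "", string.punctuation)
    tokens = [tok.translate(table).lower() for tok in f"{title} {abstract}".split()]
    return " ".join(tok for tok in tokens if tok)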

Then write the JSON for each paper on a single line in a .json file:

{ "paper": "2133743025","venue": "Current biology",...}
{ "paper": "2002430130","venue": "Journal of Experimental Biology",...}
...
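A sketch of the whole transform, assuming the response fields follow the attributes requested in the URL above (Id, DOI, Ti, VFN, F.FN, AA.AuId, AW, RId) and reusing the clean_text and entities names from the sketches above; the exact response shape should be verified against real API output.

import json

def to_match_record(entity):
    # Map one MAG entity to the MATCH record layout described above.
    return {
        "paper": str(entity["Id"]),
        "doi": entity.get("DOI", ""),
        "mag": [f["FN"].replace(" ", "_") for f in entity.get("F", [])],
        "venue": entity.get("VFN", ""),
        "author": [str(a["AuId"]) for a in entity.get("AA", [])],
        "reference": [str(r) for r in entity.get("RId", [])],
        "text": clean_text(entity.get("Ti", ""), " ".join(entity.get("AW", []))),
        "label": [],
    }

# One JSON object per line.
with open("papers.json", "w") as out:
    for entity in entities:
        out.write(json.dumps(to_match_record(entity)) + "\n")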
bruffridge changed the title from "Build a data pipeline for running papers through the labeller" to "Build a data pipeline for running unlabelled papers through the labeller" on Jun 17, 2021
bruffridge commented

@dsmith111 just created a script that may help convert data into the MATCH format: https://github.com/nasa-petal/PeTaL-labeller/tree/main/scripts/lens-cleaner

Simarkohli24 commented
@bruffridge do we need any follow-ups or further scripting for this issue? I know migrating to Lambda was mentioned.


bruffridge commented Jun 22, 2021

Eventually the plan is for this to run in Lambda. For now, we just need a Python script that downloads the JSON for every paper from MAG and transforms it into the format expected by MATCH.

bruffridge commented

Some metrics for consideration:

5000 papers per API request (larger requests may time out, based on my testing)
~7 seconds per API request
~10.81 MB per request

Total: ~5 hours, ~28 GB, and 2,600 requests to pull down 13,000,000 biology papers using the MAG API.
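A quick back-of-the-envelope check of those totals:

papers, per_request = 13_000_000, 5_000
n_requests = papers // per_request      # 2,600 requests
hours = n_requests * 7 / 3600           # ~5.1 hours
gigabytes = n_requests * 10.81 / 1000   # ~28.1 GB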


bruffridge commented Jul 8, 2021

The papers returned are consistently ordered between requests even without a sortby parameter, which means we can probably use the offset and count parameters to chunk through the full dataset (see the sketch after the walkthrough below).

1st API request: count=5000, offset=0
2nd API request: count=5000, offset=5000
3rd API request: count=5000, offset=10000

... ~5 hours later ...

2600th API request: count=5000, offset=12,995,000
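A paging sketch that builds on the fetch and transform sketches earlier in this issue; the 13,000,000 total is the estimate from the metrics comment, not an exact count, so the loop also stops when a page comes back empty.

COUNT = 5_000
TOTAL = 13_000_000  # rough estimate of the number of biology papers

with open("papers.json", "w") as out:
    for offset in range(0, TOTAL, COUNT):
        params.update(count=COUNT, offset=offset)
        resp = requests.get(
            MAG_ENDPOINT,
            params=params,
            headers={"Ocp-Apim-Subscription-Key": MAG_KEY},
            timeout=120,
        )
        resp.raise_for_status()
        entities = resp.json().get("entities", [])
        if not entities:
            break  # ran off the end of the result set
        for entity in entities:
            out.write(json.dumps(to_match_record(entity)) + "\n")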
