
Build a data pipeline for running unlabelled papers through the labeller #45

bruffridge opened this issue Jun 17, 2021 · 5 comments


bruffridge commented Jun 17, 2021

  1. Get paper data for the ~13 million biology papers from the MAG API. See #65 ("Use open-source Snorkel to create labelling functions to expand our training dataset") for more details about interfacing with the API.

This is the API request to get all biology papers from the MAG API:
https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate?expr=And(Ty='0',Or(Composite(J.JN=='biomimetics'), Composite(F.FN=='biology')))&model=latest&count=10&offset=0&attributes=Id,DOI,Ti,VFN,F.FN,AA.AuId,AW,RId

These are the papers we will run through our labeller. If 13 million papers are technically infeasible or difficult to process (@elkong can weigh in on this), let me know and we can figure out how to filter them down further. A minimal fetch sketch is below.
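A minimal fetch sketch for step 1, not the final pipeline: it assumes the standard Cognitive Services Ocp-Apim-Subscription-Key header and that the response JSON lists papers under an "entities" key (worth verifying against real output); MAG_KEY is a placeholder, not a real key.

# Sketch only: fetch one page of biology papers from the MAG evaluate endpoint.
import requests

MAG_ENDPOINT = "https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate"
MAG_KEY = "YOUR_SUBSCRIPTION_KEY"  # placeholder

params = {
    "expr": "And(Ty='0',Or(Composite(J.JN=='biomimetics'), Composite(F.FN=='biology')))",
    "model": "latest",
    "count": 5000,   # papers per request; see the metrics comment further down
    "offset": 0,
    "attributes": "Id,DOI,Ti,VFN,F.FN,AA.AuId,AW,RId",
}

resp = requests.get(
    MAG_ENDPOINT,
    params=params,
    headers={"Ocp-Apim-Subscription-Key": MAG_KEY},
    timeout=120,
)
resp.raise_for_status()
entities = resp.json()["entities"]  # one dict per paper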

  2. Transform the paper data into the format expected by MATCH.

Put each paper in this JSON format.

Example:

{
  "paper": "2133743025",
  "doi": "10.1016/J.CUB.2007.07.011"
  "mag": [
    "microtubule_polymerization", "microtubule", "tubulin", "guanosine_triphosphate", "growth_rate", "gtp'", "optical_tweezers", "biophysics", "dimer", "biology"
  ],
  "venue": "Current biology",
  "author": [
    "2305659199", "2275630009", "2294310593", "1706693917", "2152058803"
  ],
  "reference": [
    "2002430130", "2089645884", "1848121837"
  ],
  "text": "microtubule assembly dynamics at the nanoscale background the labile nature of microtubules is critical for establishing cellular morphology and motility yet the molecular basis of assembly remains unclear here we use optical tweezers to track microtubule polymerization against microfabricated barriers permitting unprecedented spatial resolution",
  "label": []
}

paper = MAG paper Id
mag = array of MAG normalized field of study names, F.FN (all lowercase, spaces replaced by underscores)
venue = MAG venue full name, VFN
author = array of MAG author Ids, AA.AuId
reference = array of MAG paper Ids, RId
text = title + abstract (tokenize the text, remove all punctuation, and convert all characters to lowercase)
label = empty array

How to clean text in Python: https://machinelearningmastery.com/clean-text-machine-learning-python/
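A minimal cleaning sketch along those lines (whitespace tokenization, punctuation stripping, lowercasing); clean_text is just an illustrative helper name:

import string

def clean_text(title, abstract):
    # Tokenize on whitespace, strip punctuation from each token, lowercase.
    table = str.maketrans("", "", string.punctuation)
    tokens = [tok.translate(table).lower() for tok in f"{title} {abstract}".split()]
    return " ".join(tok for tok in tokens if tok)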

Then write the JSON for each paper on a single line in a .json file:

{ "paper": "2133743025","venue": "Current biology",...}
{ "paper": "2002430130","venue": "Journal of Experimental Biology",...}
...
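A sketch of the whole transform, assuming the response fields follow the attributes requested in the URL above (Id, DOI, Ti, VFN, F.FN, AA.AuId, AW, RId) and reusing the clean_text and entities names from the sketches above; the exact response shape should be verified against real API output.

import json

def to_match_record(entity):
    # Map one MAG entity to the MATCH record layout described above.
    return {
        "paper": str(entity["Id"]),
        "doi": entity.get("DOI", ""),
        "mag": [f["FN"].replace(" ", "_") for f in entity.get("F", [])],
        "venue": entity.get("VFN", ""),
        "author": [str(a["AuId"]) for a in entity.get("AA", [])],
        "reference": [str(r) for r in entity.get("RId", [])],
        "text": clean_text(entity.get("Ti", ""), " ".join(entity.get("AW", []))),
        "label": [],
    }

# One JSON object per line.
with open("papers.json", "w") as out:
    for entity in entities:
        out.write(json.dumps(to_match_record(entity)) + "\n")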
bruffridge changed the title from "Build a data pipeline for running papers through the labeller" to "Build a data pipeline for running unlabelled papers through the labeller" on Jun 17, 2021
bruffridge commented

@dsmith111 just created a script that may help convert data into the MATCH format: https://github.com/nasa-petal/PeTaL-labeller/tree/main/scripts/lens-cleaner

Simarkohli24 commented
@bruffridge do we need any follow-ups or further scripting for this issue? I know migrating to Lambda was mentioned.


bruffridge commented Jun 22, 2021

Eventually the plan is for this to run in Lambda. For now, we just need a Python script that downloads the JSON for every paper from MAG and transforms it into the format expected by MATCH.

bruffridge commented

Some metrics for consideration:

5000 papers per API request (larger requests may time out, based on my testing)
~7 seconds per API request
~10.81 MB per request

Total: ~5 hours, ~28 GB, and 2,600 requests to pull down 13,000,000 biology papers using the MAG API.
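A quick back-of-the-envelope check of those totals:

papers, per_request = 13_000_000, 5_000
n_requests = papers // per_request      # 2,600 requests
hours = n_requests * 7 / 3600           # ~5.1 hours
gigabytes = n_requests * 10.81 / 1000   # ~28.1 GB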


bruffridge commented Jul 8, 2021

The papers returned are consistently ordered between requests even without a sortby parameter, which means we can probably use the offset and count parameters to chunk through the full dataset (see the sketch after the walkthrough below).

1st API request: count=5000, offset=0
2nd API request: count=5000, offset=5000
3rd API request: count=5000, offset=10000

... ~5 hours later ...

2600th API request: count=5000, offset=12,995,000
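A paging sketch that builds on the fetch and transform sketches earlier in this issue; the 13,000,000 total is the estimate from the metrics comment, not an exact count, so the loop also stops when a page comes back empty.

COUNT = 5_000
TOTAL = 13_000_000  # rough estimate of the number of biology papers

with open("papers.json", "w") as out:
    for offset in range(0, TOTAL, COUNT):
        params.update(count=COUNT, offset=offset)
        resp = requests.get(
            MAG_ENDPOINT,
            params=params,
            headers={"Ocp-Apim-Subscription-Key": MAG_KEY},
            timeout=120,
        )
        resp.raise_for_status()
        entities = resp.json().get("entities", [])
        if not entities:
            break  # ran off the end of the result set
        for entity in entities:
            out.write(json.dumps(to_match_record(entity)) + "\n")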
