diff --git a/docs/src/user/tasks.rst b/docs/src/user/tasks.rst index ff229041..1787d0be 100644 --- a/docs/src/user/tasks.rst +++ b/docs/src/user/tasks.rst @@ -18,6 +18,242 @@ Predictions are generated by *Model Developers* when they define :meth:`rafiki.m and received by *Application Users* as predictions to their queries sent to *Inference Jobs*. +QUESTION_ANSWERING +-------------------------------------------------------------------- + + +Dataset Format +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +:ref:`dataset-type:IMAGE_FILES` + + +The dataset can be used to fine-tune the SQuAD pre-trained BERT model. + +- The dataset is a zip archive of folders containing JSON files. JSON files under different folders are automatically read together. + +The dataset structure, JSON schema and metadata CSV examples are given below. + +An example of the dataset structure: + +.. code-block:: text + + /DATASET_NAME.zip + │ + ├──arxiv + │ └──arxiv + │ └──pdf_json + │ ├── 003d2e515e1aaf06f0052769953e861ed8e56608.json(97.51 KB) + │ ├── 00a407540a8bdd6d7425bd8a561eb21d69682511.json(48.45 KB) + │ ...(788 files) + │ + ├──biorxiv_medrxiv + │ └──biorxiv_medrxiv + │ └──pdf_json + │ ├── 0015023cc06b5362d332b3baf348d11567ca2fbb.json(71.27 KB) + │ ├── 001b4a31684c8fc6e2cfbb70304354978317c429.json(126.12 KB) + │ ...(2670 files) + │ + ├──comm_use_subset + │ └──comm_use_subset + │ ├──pdf_json + │ │ ├── 000b7d1517ceebb34e1e3e817695b6de03e2fa78.json(12.06 KB) + │ │ ...(9918 files) + │ │ + │ └──pmc_json + │ ├── PMC1054884.xml.json(97.67 KB) + │ ...(9540 files) + │ + ├──custom_license + │ └──custom_license + │ ├──pdf_json + │ │ ├── 0001418189999fea7f7cbe3e82703d71c85a6fe5.json(48.76 KB) + │ │ ...(32.5k files) + │ │ + │ └──pmc_json + │ ├── PMC1065028.xml.json(16.53 KB) + │ ...(11.0k files) + │ + ├──noncomm_use_subset + │ └──noncomm_use_subset + │ ├──pdf_json + │ │ ├── 0036b28fddf7e93da0970303672934ea2f9944e7.json(708.8 KB) + │ │ ...(2584 files) + │ │ + │ └──pmc_json + │ ├── 
PMC1616946.xml.json + │ ...(2311 files) + │ + └──metadata.csv + +- Each JSON file includes ``abstract`` and ``body_text``, which provide the list of paragraphs in the abstract and the list of paragraphs in the full body, respectively; both can be used for question answering. Each JSON file also includes ``paper_id``, the 40-character SHA-1 of the PDF. + +An example of the JSON schema for full-text documents: + +.. code-block:: text + + { + "paper_id": , # 40-character sha1 of the PDF + "metadata": { + "title": , + "authors": [ # list of author dicts, in order + { + "first": , + "middle": , + "last": , + "suffix": , + "affiliation": , + "email": + }, + ... + ], + "abstract": [ # list of paragraphs in the abstract + { + "text": , + "cite_spans": [ # list of character indices of inline citations + # e.g. citation "[7]" occurs at positions 151-154 in "text" + # linked to bibliography entry BIBREF3 + { + "start": 151, + "end": 154, + "text": "[7]", + "ref_id": "BIBREF3" + }, + ... + ], + "ref_spans": , # e.g. inline reference to "Table 1" + "section": "Abstract" + }, + ... + ], + "body_text": [ # list of paragraphs in full body + # paragraph dicts look the same as above + { + "text": , + "cite_spans": [], + "ref_spans": [], + "eq_spans": [], + "section": "Introduction" + }, + ... + { + ..., + "section": "Conclusion" + } + ], + "bib_entries": { + "BIBREF0": { + "ref_id": , + "title": , + "authors": # same structure as earlier, + # but without `affiliation` or `email` + "year": , + "venue": , + "volume": , + "issn": , + "pages": , + "other_ids": { + "DOI": [ + + ] + } + }, + "BIBREF1": {}, + ... + "BIBREF25": {} + }, + "ref_entries": { + "FIGREF0": { + "text": , # figure caption text + "type": "figure" + }, + ... + "TABREF13": { + "text": , # table caption text + "type": "table" + } + }, + "back_matter": # same structure as body_text + } + } + + +- ``metadata.csv`` gives additional information, such as authors, title, journal and publish_time, and maps to the JSON files by ``sha`` values. 
``cord_uid`` values are unique and serve as the entry identifier. Note that in some cases one entry has several PDFs/shas: one corresponding to the main article, and possibly additional ones corresponding to supporting materials for the article. + +.. note:: + + (1) Metadata for papers from these sources is combined: CZI, PMC, BioRxiv/MedRxiv. (total records 29500) + - CZI 1236 records + - PMC 27337 + - bioRxiv 566 + - medRxiv 361 + (2) 17K of the paper records have PDFs, and the hashes of the PDFs are in 'sha' + (3) For PMC sourced papers, one paper's metadata can be associated with one or more PDFs/shas under that paper - a PDF/sha corresponding to the main article, and possibly additional PDF/shas corresponding to supporting materials for the article. + (4) 13K of the PDFs were processed with fulltext ('has_full_text'=True) + (5) Various 'keys' are populated with the metadata: + - 'pmcid': populated for all PMC paper records (27337 non null) + - 'doi': populated for all BioRxiv/MedRxiv paper records and most of the other records (26357 non null) + - 'WHO #Covidence': populated for all CZI records and none of the other records (1236 non null) + - 'pubmed_id': populated for some of the records + - 'Microsoft Academic Paper ID': populated for some of the records + +An example of a ``metadata.csv`` (85.15 MB) entry: + ===================== ===================== + Column Names Column Values + ===================== ===================== + cord_uid zjufx4fo + sha b2897e1277f56641193a6db73825f707eed3e4c9 + source_x PMC + title Sequence requirements for RNA strand transfer during nidovirus ... + doi 10.1093/emboj/20.24.7220 + pmcid PMC125340 + pubmed_id 11742998 + license unk + abstract Nidovirus subgenomic mRNAs contain a leader sequence derived ... + publish_time 2001-12-17 + ===================== ===================== + +Query Format +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +.. 
note:: + + - The pretrained model should be fine-tuned with a dataset first to adapt to particular question domains when necessary. + - Otherwise, following the question, the input should contain relevant information (a context paragraph, candidate answers, or both), whether or not it addresses the question. + +The query is in JSON format. When relevant information is provided in the query, the question always comes first, followed by the additional information. ``\n`` separators are used between the different parts of the input. + +.. code-block:: text + + { + 'questions': ['At what speed did the turbine operate? \n (Nikola_Tesla) On his 50th birthday in 1906, .... several of his bladeless turbine engines were tested at 100–5,000 hp.', + 'What does Paul McCartney think about his music? \n LAS VEGAS, Nevada (CNN) -- Former Beatles Paul McCartney and Ringo Starr clowned around and marveled at their band's amazing impact in an interview Tuesday on CNN's "Larry King Live." ... McCartney said the early Beatles knew they were a good band and were pretty sure of themselves, but Starr said, "We thought we'd be really big in Liverpool." ', + 'The author tells us that to succeed in a project you are in charge of, you should _ . \n (A) make everyone work for you (B) get everyone willing to help you (C) let people know you have the final say (D) keep sending out orders to them \n If you're in charge of a project, the key to success is getting everyone to want to help you. ... You and your team can discover the answers to problems together. ', + 'is the isle of man a part of great britain? \n (Isle of Man) In 1266, the island became part of Scotland under the Treaty of Perth, after being ruled by Norway.' + ], + + 'target_answers':['16,000 rpm', + 'very good', + 'get everyone willing to help you', + 'no' + ] + } + +Prediction Format +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The output is in JSON format. + +.. 
code-block:: text + + {'answers':['16,000 rpm', + 'very good', + 'get everyone willing to help you', + 'no' + ]} + + + IMAGE_CLASSIFICATION -------------------------------------------------------------------- @@ -248,4 +484,4 @@ A `Base64-encoded `_ string of the bytes o Prediction Format ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -A string, representing the predicted transcript for the audio. \ No newline at end of file +A string, representing the predicted transcript for the audio.