The following documents the data accessible on Zenodo, which is available as a full version and as a permissively licensed subset.
unarXive is distributed as an XZ-compressed TAR archive. You can unpack it using either of the following methods.
- On the command line using tar -xJf.
- Graphically with 7-Zip.
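The archive can also be unpacked from Python. The following is a minimal sketch using the standard library's tarfile module. The function name unpack_archive and any paths passed to it are hypothetical and not part of unarXive.

```python
import tarfile
from pathlib import Path


def unpack_archive(archive_path, target_dir):
    """Extract an XZ-compressed TAR archive and return the extracted .jsonl paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # "r:xz" opens the archive for reading with XZ (LZMA) decompression.
    with tarfile.open(archive_path, mode="r:xz") as archive:
        # Use the safer "data" extraction filter where the running Python supports it.
        if hasattr(tarfile, "data_filter"):
            archive.extractall(target, filter="data")
        else:
            archive.extractall(target)
    return sorted(target.rglob("*.jsonl"))
```

The full archive is large, so extracting it needs correspondingly more disk space and time than a small test archive.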
Each decompressed file arXiv_src_<yy><mm>_<num>.jsonl is in the JSON Lines format. This means each line is a JSON object and can, for example, be loaded in Python with json.loads(some_line).
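As a minimal sketch of line-by-line loading, the generator below yields one paper per line without reading the whole file into memory. The name iter_papers is hypothetical, and UTF-8 encoding of the files is an assumption.

```python
import json


def iter_papers(jsonl_path):
    """Yield one paper dictionary per non-empty line of a JSON Lines file."""
    # Assumption: the files are UTF-8 encoded.
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline at the end of the file
                yield json.loads(line)
```

For example, iterating over iter_papers("arXiv_src_2105_034.jsonl") and reading paper["paper_id"] from each item would list the arXiv IDs contained in that file.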
Papers are represented as shown in the example below, which is an excerpt from the paper 2105.05862 from the file arXiv_src_2105_034.jsonl. Full documentation of the data fields is given further down.
Paper object
{
  "paper_id": "2105.05862",
  "_pdf_hash": None,
  "_source_hash": "b7d5f27b5c8abc3bd8a44d875899fdc0d945a604",
  "_source_name": "2105.05862.gz",
  "metadata": {...},
  "discipline": "Physics",
  "abstract": {...},
  "body_text": [...],
  "bib_entries": {...},
  "ref_entries": {...}
}
One example paragraph in body_text is:
{
  "section": "Memory wave form",
  "sec_number": "2.1",
  "sec_type": "subsection",
  "content_type": "paragraph",
  "text": "The gauge choice leading us to this solution does not fix "
          "completely all the gauge freedom and an additional constraint "
          "should be imposed to leave only the physical degrees of freedom. "
          "This is done by projecting the source tensor {{formula:7fd88bcd-"
          "9013-433d-9756-b874472530d9}} into its transverse-traceless (TT) "
          "components (see for example {{cite:80dbb6c8b9c12f561a8e585faceac5f"
          "4e104d60d}}). Doing this and without loss of generality, we will "
          "use the following very well known ansatz for the source term "
          "proposed in {{cite:bc9a8ca19785627a087ae0c01abe155c22388e16}}\n",
  "cite_spans": [...],
  "ref_spans": [...]
}
where {{formula:7fd88bcd-9013-433d-9756-b874472530d9}} refers in ref_entries to the mathematical notation
{
  "latex": "S_{\\mu \\nu }",
  "type": "formula"
}
and {{cite:bc9a8ca19785627a087ae0c01abe155c22388e16}}, for example, refers in bib_entries to the bibliographical reference
{
"bib_entry_raw": "R. Epstein, The Generation of Gravitational Radiation by "
"Escaping Supernova Neutrinos, Astrophys. J. 223 (1978) "
"1037.",
"contained_arXiv_ids": [],
"contained_links": [
{
"url": "https://doi.org/10.1086/156337",
"text": "Astrophys. J. 223 (1978) 1037.",
"start": 87,
"end": 117
}
],
"discipline": "Physics",
"ids": {...}
}
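The lookups described above can be sketched in Python as follows. This is an illustration only: it assumes that ref_entries and bib_entries are dictionaries keyed by the identifier inside each marker, as the excerpts suggest, and it handles only the formula and cite markers shown in the example. The function name resolve_markers and the bracketed output format for citations are hypothetical choices.

```python
import re

# Matches markers such as {{formula:<key>}} or {{cite:<key>}}.
MARKER = re.compile(r"\{\{(\w+):([^{}]+)\}\}")


def resolve_markers(text, ref_entries, bib_entries):
    """Replace formula markers with LaTeX and cite markers with the raw reference."""

    def substitute(match):
        kind, key = match.group(1), match.group(2)
        if kind == "formula" and key in ref_entries:
            # Fall back to the unchanged marker if the entry has no "latex" field.
            return ref_entries[key].get("latex", match.group(0))
        if kind == "cite" and key in bib_entries:
            return "[" + bib_entries[key].get("bib_entry_raw", "") + "]"
        # Unknown marker kinds and unmatched keys are left unchanged.
        return match.group(0)

    return MARKER.sub(substitute, text)
```

Calling resolve_markers(paragraph["text"], paper["ref_entries"], paper["bib_entries"]) on the example paragraph would put the LaTeX string and the raw reference strings in place of the markers whose keys are present.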
Papers (JSON objects saved as a single line in a JSONL file) have the following format.
- paper_id: arXiv ID of the paper
- _pdf_hash: always None
- _source_hash: SHA1 hash of the arXiv source file
- _source_name: name of the arXiv source file
- metadata: paper metadata from kaggle.com/datasets/Cornell-University/arxiv
- discipline: scientific discipline of the paper
- abstract: paper abstract copied from metadata
- body_text: list of paper content sections (paragraphs, listings, etc.)
  - section: section name
  - sec_number: section number
  - sec_type: section type (section, subsection, etc.)
  - content_type: content type (paragraph, listing, etc.)
  - text: text content
  - cite_spans: list of citation markers
    - start: starting character offset in text
    - end: ending character offset in text
    - text: surface text
    - ref_id: dictionary key for linked content in bib_entries
  - ref_spans: list of referenced non-textual content (figures, formulas, etc.)
    - start: starting character offset in text
    - end: ending character offset in text
    - text: surface text
    - ref_id: dictionary key for linked content in ref_entries
- bib_entries: list of bibliographic references
  - bib_entry_raw: raw bibliographic reference string
  - contained_arXiv_ids: list of linked arXiv papers
    - id: ID of linked arXiv paper
    - text: text segment in reference that the link was attached to
    - start: starting character offset in bib_entry_raw
    - end: ending character offset in bib_entry_raw
  - contained_links: list of embedded links
    - url: URL of link
    - text: text segment in reference that the link was attached to
    - start: starting character offset in bib_entry_raw
    - end: ending character offset in bib_entry_raw
  - discipline: scientific discipline of the cited paper
  - ids: matched identifiers of referenced paper
    - open_alex_id: referenced paper’s OpenAlex ID
    - sem_open_alex_id: referenced paper’s SemOpenAlex ID
    - pubmed_id: referenced paper’s PubMed ID
    - pmc_id: referenced paper’s PMC ID
    - doi: referenced paper’s DOI
- ref_entries: list of non-textual content (figures, formulas, etc.)
  - type: content type
  - caption: table/figure caption (if type is table or figure)
  - latex: content of LaTeX math mode (if type is formula)
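To illustrate how these fields connect, the sketch below follows cite_spans to bib_entries and collects the DOIs of cited references, and it slices bib_entry_raw with the offsets of a contained link. It assumes that bib_entries is a dictionary keyed by ref_id, as the example excerpts suggest, and that identifier fields may be missing or empty. In the example reference shown earlier, offsets 87 and 117 delimit exactly the link text, which is consistent with the end offset being exclusive as in Python slicing. The function names cited_dois and link_surface_text are hypothetical.

```python
def cited_dois(paper):
    """Return a sorted list of DOIs of references cited in the paper's body text."""
    bib_entries = paper.get("bib_entries") or {}
    dois = set()
    for paragraph in paper.get("body_text", []):
        for span in paragraph.get("cite_spans", []):
            # ref_id is the dictionary key of the cited reference in bib_entries.
            entry = bib_entries.get(span.get("ref_id"))
            if not entry:
                continue
            doi = (entry.get("ids") or {}).get("doi")
            if doi:  # skip references without a matched DOI
                dois.add(doi)
    return sorted(dois)


def link_surface_text(bib_entry, link):
    """Slice bib_entry_raw with a link's offsets, treating end as exclusive."""
    return bib_entry["bib_entry_raw"][link["start"]:link["end"]]
```

Comparing link_surface_text(entry, link) with link["text"] is a cheap consistency check before relying on the offsets.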