This is the complete MedMentions dataset. See also additional documentation here.
- data: Data files ...
  - corpus_pubtator.txt.gz: The annotated data, in PubTator format, gzip-ed.
  - corpus_pubtator_pmids_all.txt: A list of PMIDs for all the documents in the annotated corpus, one per line.
  - corpus_pubtator_pmids_trng.txt, corpus_pubtator_pmids_dev.txt, corpus_pubtator_pmids_test.txt: A 60% / 20% / 20% random split of the documents in the corpus for use in training, dev (or validation), and testing, e.g. for building an entity linking model.
- stats: Some additional statistics.
  - TypeMentionStats.xlsx: A breakdown of the number of concepts by UMLS Semantic Type, and their coverage in mentions in the MedMentions corpus.
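
The files listed above can be read with a few lines of Python. The following is a minimal sketch, not an official loader. It assumes the usual PubTator layout (a `PMID|t|title` line, a `PMID|a|abstract` line, then one tab-separated line per mention with six columns: PMID, start offset, end offset, mention text, comma-separated semantic types, concept ID, with a blank line between documents). This README does not spell out that layout, so check it against the actual file. The names `Mention`, `Document`, `parse_pubtator`, `read_pmids` and `load_split` are hypothetical, and the sample record is made up for illustration.

```python
import gzip
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Set


@dataclass
class Mention:
    start: int                 # character offset where the mention begins
    end: int                   # character offset just past the mention
    text: str
    semantic_types: List[str]  # assumed to be comma-separated in the file
    concept_id: str


@dataclass
class Document:
    pmid: str
    title: str = ""
    abstract: str = ""
    mentions: List[Mention] = field(default_factory=list)


def parse_pubtator(lines: Iterable[str]) -> Iterator[Document]:
    """Yield one Document per blank-line-separated block (assumed layout)."""
    doc = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current document, if any.
            if doc is not None:
                yield doc
                doc = None
            continue
        parts = line.split("|", 2)
        if len(parts) == 3 and parts[1] in ("t", "a") and "\t" not in parts[0]:
            pmid, kind, text = parts
            if doc is None:
                doc = Document(pmid=pmid)
            if kind == "t":
                doc.title = text
            else:
                doc.abstract = text
            continue
        fields = line.split("\t")
        if len(fields) == 6 and doc is not None:
            _pmid, start, end, text, types, concept_id = fields
            doc.mentions.append(
                Mention(int(start), int(end), text, types.split(","), concept_id)
            )
    if doc is not None:
        yield doc  # the last document may lack a trailing blank line


def read_pmids(path: str) -> Set[str]:
    """Read a one-PMID-per-line file into a set."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


def load_split(corpus_gz_path: str, pmids_path: str) -> List[Document]:
    """Return the documents whose PMID appears in the given split file."""
    wanted = read_pmids(pmids_path)
    with gzip.open(corpus_gz_path, "rt", encoding="utf-8") as fh:
        return [d for d in parse_pubtator(fh) if d.pmid in wanted]


# Made-up record in the assumed layout (not taken from the corpus).
sample = (
    "00000001|t|Aspirin in headache\n"
    "00000001|a|Aspirin relieves headache.\n"
    "00000001\t0\t7\tAspirin\tT000\tC0000000\n"
    "00000001\t11\t19\theadache\tT000,T001\tC0000001\n"
    "\n"
)
sample_docs = list(parse_pubtator(sample.splitlines()))
```

With the real files, a call such as `load_split("data/corpus_pubtator.txt.gz", "data/corpus_pubtator_pmids_trng.txt")` would return the training documents, assuming the files sit in a `data` directory relative to the working directory.
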
Description | Stat | Avg / share |
---|---|---|
Number of Concepts in UMLS 2017-AA Active | 3,271,124 | |
Number of Semantic Types (incl. UnknownType) | 128 | |
Number of Annotated Docs in MedMentions | 4,392 | |
Total number of Mentioned Concepts | 34,724 | (1.06% of UMLS) |
Total number of Mentions in MedMentions | 352,496 | (80.3 / doc) |
Total number of Tokens (PTB via StanfordNLP) | 1,176,058 | (267.8 / doc) |
Number of Annotated Tokens | 579,839 | (132.0 / doc) |
Proportion of tokens annotated | 49.3% | (1.6 / mention) |
For comparison, the BioCreative V Chemical-Disease Relation Task Corpus (BC5-CDR) is a smaller set of 1,500 papers annotated only with Chemical and Disease entity mentions from the MeSH ontology (along with chemical-induced disease, or CID, relations). Entity mentions cover about 11.8% of the tokens in that corpus, at an average of 25.9 annotated tokens per document.
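
The derived figures in the MedMentions statistics table follow directly from its raw counts. As a quick sanity check, the sketch below recomputes them; the variable names are illustrative, and the counts are copied from the table.

```python
# Raw counts copied from the statistics table.
umls_concepts = 3_271_124
num_docs = 4_392
mentioned_concepts = 34_724
mentions = 352_496
tokens = 1_176_058
annotated_tokens = 579_839

# Derived quantities reported in the table's last column.
derived = {
    "concept_coverage_pct": 100 * mentioned_concepts / umls_concepts,
    "mentions_per_doc": mentions / num_docs,
    "tokens_per_doc": tokens / num_docs,
    "annotated_tokens_per_doc": annotated_tokens / num_docs,
    "annotated_token_pct": 100 * annotated_tokens / tokens,
    "annotated_tokens_per_mention": annotated_tokens / mentions,
}

for name, value in derived.items():
    print(f"{name}: {value:.2f}")
```

Rounded to the precision used in the table, these values reproduce the reported 1.06%, 80.3, 267.8, 132.0, 49.3% and 1.6.
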