MedMentions: Full Dataset

This is the complete MedMentions dataset. See also additional documentation here.

Contents

  • data: Data files ...

    • corpus_pubtator.txt.gz: The annotated data, in PubTator format, gzip-ed.
    • corpus_pubtator_pmids_all.txt: A list of PMIDs for all the documents in the annotated corpus, one per line.
    • corpus_pubtator_pmids_trng.txt, corpus_pubtator_pmids_dev.txt, corpus_pubtator_pmids_test.txt: A 60% / 20% / 20% random split of the documents in the corpus for use in training, dev (or validation), and testing, e.g. for building an entity linking model.
  • stats: Some additional statistics.

    • TypeMentionStats.xlsx: A breakdown of the number of concepts by UMLS Semantic Type, and how many of them are covered by mentions in the MedMentions corpus.
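The files listed above can be combined to load one split of the corpus. The following is a minimal sketch, not an official loader. It assumes the usual PubTator layout: a `PMID|t|title` line, a `PMID|a|abstract` line, then tab-separated annotation lines with six fields (PMID, start offset, end offset, mention text, semantic type(s), concept ID), with documents separated by blank lines. The names `Mention`, `Document`, `parse_pubtator`, `read_pmids`, and `load_split` are hypothetical helpers introduced here for illustration.

```python
import gzip
import re
from dataclasses import dataclass, field

# Matches title ("t") and abstract ("a") lines such as "12345|t|Some title".
TEXT_LINE = re.compile(r"(\d+)\|([ta])\|(.*)", re.S)


@dataclass
class Mention:
    start: int           # character offset where the mention starts
    end: int             # character offset where the mention ends
    text: str            # the mention string
    semantic_types: str  # kept as the raw field (assumed to hold type codes)
    concept_id: str      # the linked concept identifier


@dataclass
class Document:
    pmid: str
    title: str = ""
    abstract: str = ""
    mentions: list = field(default_factory=list)


def parse_pubtator(lines):
    """Yield Document objects from an iterable of PubTator-format lines."""
    doc = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current document.
            if doc is not None:
                yield doc
                doc = None
            continue
        match = TEXT_LINE.fullmatch(line)
        if match:
            pmid, kind, text = match.groups()
            if doc is None:
                doc = Document(pmid=pmid)
            if kind == "t":
                doc.title = text
            else:
                doc.abstract = text
            continue
        fields = line.split("\t")
        if doc is not None and len(fields) >= 6:
            _, start, end, text, types, concept_id = fields[:6]
            doc.mentions.append(
                Mention(int(start), int(end), text, types, concept_id)
            )
    if doc is not None:
        # The file may not end with a blank line.
        yield doc


def read_pmids(path):
    """Read a PMID list file (one PMID per line) into a set of strings."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


def load_split(corpus_path, pmids_path):
    """Return the documents of the gzip-ed corpus whose PMID is in the list."""
    wanted = read_pmids(pmids_path)
    with gzip.open(corpus_path, "rt", encoding="utf-8") as fh:
        return [doc for doc in parse_pubtator(fh) if doc.pmid in wanted]
```

With the files in place, a call such as `load_split("data/corpus_pubtator.txt.gz", "data/corpus_pubtator_pmids_dev.txt")` would return the dev documents. The semantic type field is left as a raw string because its exact encoding is an assumption here; split it further only after checking the data.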

Some Corpus Statistics

| Description | Stat | Avg |
| --- | --- | --- |
| Number of Concepts in UMLS 2017-AA Active | 3,271,124 | |
| Number of Semantic Types (incl. UnknownType) | 128 | |
| Number of Annotated Docs in MedMentions | 4,392 | |
| Total number of Mentioned Concepts | 34,724 | (1.06% of UMLS) |
| Total number of Mentions in MedMentions | 352,496 | (80.3 / doc) |
| Total number of Tokens (PTB via StanfordNLP) | 1,176,058 | (267.8 / doc) |
| Number of Annotated Tokens | 579,839 | (132.0 / doc) |
| Proportion of tokens annotated | 49.3% | (1.6 / mention) |
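
The percentages and per-document averages in these statistics follow directly from the raw counts. The short sketch below recomputes them as a sanity check; the variable names are chosen here for illustration, and the only inputs are the counts reported above.

```python
# Raw counts as reported in the corpus statistics.
docs = 4_392
umls_concepts = 3_271_124
mentioned_concepts = 34_724
mentions = 352_496
tokens = 1_176_058
annotated_tokens = 579_839

# Derived figures: each is a simple ratio of two of the counts above.
derived = {
    "mentioned concepts as % of UMLS": 100 * mentioned_concepts / umls_concepts,
    "mentions per doc": mentions / docs,
    "tokens per doc": tokens / docs,
    "annotated tokens per doc": annotated_tokens / docs,
    "% of tokens annotated": 100 * annotated_tokens / tokens,
    "annotated tokens per mention": annotated_tokens / mentions,
}

for name, value in derived.items():
    print(f"{name}: {value:.2f}")
```

Rounded to the precision used in the statistics, these ratios reproduce the reported 1.06%, 80.3, 267.8, 132.0, 49.3%, and 1.6.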

As a comparison, the BioCreative V Chemical-Disease Relation Task Corpus (BC5-CDR) is a smaller set of 1,500 papers annotated only with Chemical and Disease entity mentions from the MeSH ontology (along with CID relations). Entity mentions cover about 11.8% of the tokens in the corpus, at an average of 25.9 annotated tokens per document.