MedMentions: Full Dataset

This is the complete MedMentions dataset. See also additional documentation here.

data: Data files ...
- corpus_pubtator.txt.gz: The annotated data, in PubTator format, gzip-ed.
- corpus_pubtator_pmids_all.txt: A list of PMIDs for all the documents in the annotated corpus, one per line.
- corpus_pubtator_pmids_trng.txt, corpus_pubtator_pmids_dev.txt, corpus_pubtator_pmids_test.txt: A 60% / 20% / 20% random split of the documents in the corpus into training, dev (or validation), and test sets, e.g. for building an entity linking model.
stats: Some additional statistics.
- TypeMentionStats.xlsx: A breakdown of number of concepts by UMLS Semantic Type, and their coverage in mentions in the MedMentions corpus.
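
As a rough illustration of how the data files fit together, the sketch below reads the gzip-ed corpus and one of the PMID split files. It assumes the usual PubTator layout (`PMID|t|title` and `PMID|a|abstract` text lines, followed by tab-separated annotation lines of the form PMID, start offset, end offset, mention text, semantic type, concept ID); the function names and the `data/` paths are illustrative and should be adjusted to your checkout.

```python
import gzip
from collections import defaultdict


def parse_pubtator(lines):
    """Parse PubTator-style lines into {pmid: {"title", "abstract", "mentions"}}.

    Assumed layout (not verified against every line of the corpus):
      PMID|t|Title text
      PMID|a|Abstract text
      PMID<TAB>start<TAB>end<TAB>mention<TAB>semantic type<TAB>concept ID
    """
    docs = defaultdict(lambda: {"title": "", "abstract": "", "mentions": []})
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # blank lines separate documents
        parts = line.split("|", 2)
        if len(parts) == 3 and parts[0].isdigit() and parts[1] in ("t", "a"):
            pmid, kind, text = parts
            docs[pmid]["title" if kind == "t" else "abstract"] = text
            continue
        fields = line.split("\t")
        if len(fields) >= 6:
            pmid, start, end, text, sem_type, concept = fields[:6]
            docs[pmid]["mentions"].append(
                (int(start), int(end), text, sem_type, concept)
            )
    return dict(docs)


def load_corpus(path="data/corpus_pubtator.txt.gz"):
    """Read the gzip-ed corpus file as text and parse it."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return parse_pubtator(fh)


def read_pmids(path):
    """Read a PMID list file (one PMID per line) into a set of strings."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


def select_split(docs, pmids):
    """Keep only the documents whose PMID is in the given split."""
    return {pmid: doc for pmid, doc in docs.items() if pmid in pmids}
```

For example, `select_split(load_corpus(), read_pmids("data/corpus_pubtator_pmids_trng.txt"))` would return the training documents, under the path assumptions above.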

Some Corpus Statistics

Description	Stat	avg
Number of Concepts in UMLS 2017-AA Active	3,271,124
Number of Semantic Types (incl. UnknownType)	128
Number of Annotated Docs in MedMentions	4,392
Total number of Mentioned Concepts	34,724	(1.06% of UMLS)
Total number of Mentions in MedMentions	352,496	(80.3 / doc)
Total Number of Tokens (PTB via StanfordNLP)	1,176,058	(267.8 / doc)
Number of Annotated Tokens	579,839	(132.0 / doc)
Proportion of tokens annotated	49.3%	(1.6 / mention)
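
The figures in parentheses follow from the raw counts in the table. The short sketch below recomputes them as a sanity check; the variable names are illustrative, and the only inputs are the counts listed above.

```python
# Raw counts copied from the statistics table.
umls_concepts = 3_271_124
num_docs = 4_392
mentioned_concepts = 34_724
num_mentions = 352_496
num_tokens = 1_176_058
annotated_tokens = 579_839

derived = {
    # Share of UMLS concepts that are mentioned at least once, in percent.
    "concept_coverage_pct": round(100 * mentioned_concepts / umls_concepts, 2),
    # Per-document averages.
    "mentions_per_doc": round(num_mentions / num_docs, 1),
    "tokens_per_doc": round(num_tokens / num_docs, 1),
    "annotated_tokens_per_doc": round(annotated_tokens / num_docs, 1),
    # Share of all tokens covered by an annotation, in percent.
    "annotated_token_pct": round(100 * annotated_tokens / num_tokens, 1),
    # Average length of a mention, in tokens.
    "tokens_per_mention": round(annotated_tokens / num_mentions, 1),
}

for name, value in derived.items():
    print(f"{name}: {value}")
```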

As a comparison, the BioCreative V Chemical-Disease Relation Task Corpus (BC5-CDR) is a smaller set of 1,500 papers annotated only with Chemical and Disease entity mentions from the MeSH ontology (along with CID relations). Entity mentions cover about 11.8% of the tokens in the corpus, at an average of 25.9 annotated tokens per document.
