BigBIO (BigScience Biomedical) is an open library of biomedical dataloaders built using Huggingface's (🤗) datasets
library for data-centric machine learning. Our goals include:
- Lightweight, programmatic access to biomedical datasets at scale
- Promoting reproducibility in data processing
- Better documentation for dataset provenance, licensing, and other key attributes
- Easier generation of meta-datasets for natural language prompting, multi-task learning
Currently BigBIO provides support for:
- 126+ biomedical datasets
- 10+ languages
- 12 task categories
- Harmonized dataset schemas by task type
- Metadata on licensing, coarse/fine-grained task types, domain, and more!
- Tutorials
- Materializing Meta-datasets
- Prompt Engineering and Evaluation
- BigBIO Report Card (Unit tests, Benchmarks, and more)
- Streamlit Visualization Demo
- BigBIO Data Cards
- Volunteer Project Board: Implement or suggest new datasets
- Contributor Guide
- Task Schema Overview
just want to use the package? pip install from github
pip install git+https://github.com/bigscience-workshop/biomedical.git
want to develop? pip install from cloned repo
git clone git@github.com:bigscience-workshop/biomedical.git
cd biomedical
pip install -e .
Start by creating an instance of BigBioConfigHelpers
.
This will help locate and filter datasets available in the bigbio package.
from bigbio.dataloader import BigBioConfigHelpers
conhelps = BigBioConfigHelpers()
print("found {} dataset configs from {} datasets".format(
len(conhelps),
len(conhelps.available_dataset_names)
))
Each bigbio dataset has at least one source config and one bigbio config. Source configs attempt to preserve the original structure of the dataset while bigbio configs are normalized into one of several bigbio task schemas
The conhelps
container has one element for every config of every dataset.
The config helper for the source config of the
BioCreative V Chemical-Disease Relation
dataset looks like this,
BigBioConfigHelper(
script='/home/galtay/repos/biomedical/bigbio/biodatasets/bc5cdr/bc5cdr.py',
dataset_name='bc5cdr',
tasks=[
<Tasks.NAMED_ENTITY_RECOGNITION: 'NER'>,
<Tasks.NAMED_ENTITY_DISAMBIGUATION: 'NED'>,
<Tasks.RELATION_EXTRACTION: 'RE'>
],
languages=[<Lang.EN: 'English'>],
config=BigBioConfig(
name='bc5cdr_source',
version=1.5.16,
data_dir=None,
data_files=None,
description='BC5CDR source schema',
schema='source',
subset_id='bc5cdr'
),
is_local=False,
is_bigbio_schema=False,
bigbio_schema_caps=None,
is_large=False,
is_resource=False,
is_default=True,
is_broken=False,
bigbio_version='1.0.0',
source_version='01.05.16',
citation='@article{DBLP:journals/biodb/LiSJSWLDMWL16,\n author = {Jiao Li and\n Yueping Sun and\n
Robin J. Johnson and\n Daniela Sciaky and\n Chih{-}Hsuan Wei and\n Robert Leaman and\n
Allan Peter Davis and\n Carolyn J. Mattingly and\n Thomas C. Wiegers and\n Zhiyong Lu},\n
title = {BioCreative {V} {CDR} task corpus: a resource for chemical disease\n relation extraction},\n journal =
{Database J. Biol. Databases Curation},\n volume = {2016},\n year = {2016},\n url =
{https://doi.org/10.1093/database/baw068},\n doi = {10.1093/database/baw068},\n timestamp = {Thu, 13 Aug 2020 12:41:41
+0200},\n biburl = {https://dblp.org/rec/journals/biodb/LiSJSWLDMWL16.bib},\n bibsource = {dblp computer science bibliography,
https://dblp.org}\n}\n',
description='The BioCreative V Chemical Disease Relation (CDR) dataset is a large annotated text corpus of\nhuman annotations of
all chemicals, diseases and their interactions in 1,500 PubMed articles.\n',
homepage='http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/',
license='Public Domain Mark 1.0'
)
Each config helper provides a wrapper to the
load_dataset
function from Huggingface's datasets package.
This wrapper will automatically populate the first two arguments of load_dataset
,
path
: path to the dataloader scriptname
: name of the dataset configuration
If you have a specific dataset and config in mind, you can,
- fetch the helper from
conhelps
with thefor_config_name
method - load the dataset using the
load_dataset
wrapper
bc5cdr_source = conhelps.for_config_name("bc5cdr_source").load_dataset()
bc5cdr_bigbio = conhelps.for_config_name("bc5cdr_bigbio_kb").load_dataset()
This wrapper function will pass through any other kwargs you may need to use.
For example data_dir
for datasets that are not public,
n2c2_2011_source = conhelps.for_config_name("n2c2_2011_source").load_dataset(data_dir="/path/to/n2c2_2011/data")
Some useful dataset discovery tools are available from the BigBioConfigHeplers
# all dataset name
dataset_names = conhelps.available_dataset_names
# all dataset config names
ds_config_names = [helper.config.name for helper in conhelps]
You can use any attribute of a BigBioConfigHelper
to filter the collection.
Here are some examples,
# public bigbio configs that are not extra large
bb_public_helpers = conhelps.filtered(
lambda x:
x.is_bigbio_schema
and not x.is_local
and not x.is_large
)
# source versions of all n2c2 datasets
n2c2_source_helpers = conhelps.filtered(
lambda x:
x.dataset_name.startswith("n2c2")
and not x.is_bigbio_schema
)
from bigbio.utils.constants import Tasks
nli_helpers = conhelps.filtered(
lambda x: Tasks.TEXTUAL_ENTAILMENT in x.tasks
)
You can iterate over any instance of BigBioConfigHelpers
and store the loaded datasets
in a container (e.g. a dictionary),
bb_public_datasets = {
helper.config.name: helper.load_dataset()
for helper in bb_public_helpers
}
The BigBioConfigHelper
provides a get_metadata
method that will calculate
schema specific metadata for configs implementing a BigBIO schema.
For example,
helper = conhelps.for_config_name('scitail_bigbio_te')
metadata = helper.get_metadata()
print(metadata)
{'train': BigBioTeMetadata(samples_count=23596, premise_char_count=2492695, hypothesis_char_count=1669028, label_counter={'neutral': 14994, 'entailment': 8602}), 'test': BigBioTeMetadata(samples_count=2126, premise_char_count=216196, hypothesis_char_count=153547, label_counter={'neutral': 1284, 'entailment': 842}), 'validation': BigBioTeMetadata(samples_count=1304, premise_char_count=138239, hypothesis_char_count=97320, label_counter={'entailment': 657, 'neutral': 647})}
At it's core, the bigbio package is a collection of dataloader scripts written to be used with the datasets package.
All of these scripts live in dataset specific directories in the bigbio/biodatasets
folder.
If you cloned the repo, you will find the bigbio
directory at the top level of the repo.
If you pip installed the package from github, the bigbio
directory will be in
the site-packages
directory of your python environment.
You can use them with the load_dataset function as you would any other local dataloader script if you like. For example, if you cloned the repo
from datasets import load_dataset
dsd = load_dataset("bigbio/biodatasets/anat_em/anat_em.py", name="anat_em_bigbio_kb")
In addition there is a convenience function that will provide the kwargs that should be used,
load_dataset_kwargs = conhelps.for_config_name("n2c2_2011_source").get_load_dataset_kwargs(data_dir="path/to_data")
print(load_dataset_kwargs)
{
'path': '/home/user/repos/biomedical/bigbio/biodatasets/n2c2_2011/n2c2_2011.py',
'name': 'n2c2_2011_source',
'data_dir': 'path/to/data'
}
dsd = load_dataset(**load_dataset_kwargs)
BigBIO includes support for almost all datasets included in other popular English biomedical benchmarks.
Task Type | Dataset | BigBIO (ours) | BLUE | BLURB | BoX | DUA needed |
---|---|---|---|---|---|---|
NER | BC2GM | ✓ | ✓ | ✓ | ||
NER | BC5-chem | ✓ | ✓ | ✓ | ✓ | |
NER | BC5-disease | ✓ | ✓ | ✓ | ✓ | |
NER | EBM PICO | ✓ | ✓ | |||
NER | JNLPBA | ✓ | ✓ | ✓ | ||
NER | NCBI-disease | ✓ | ✓ | ✓ | ||
RE | ChemProt | ✓ | ✓ | ✓ | ✓ | |
RE | DDI | ✓ | ✓ | ✓ | ✓ | |
RE | GAD | ✓ | ✓ | |||
QA | PubMedQA | ✓ | ✓ | ✓ | ||
QA | BioASQ | ✓ | ✓ | ✓ | ✓ | |
DC | HoC | ✓ | ✓ | ✓ | ✓ | |
STS | BIOSSES | ✓ | ✓ | ✓ | ||
STS | MedSTS | * | ✓ | ✓ | ||
NER | n2c2 2010 | ✓ | ✓ | ✓ | ✓ | |
NER | ShARe/CLEF 2013 | * | ✓ | ✓ | ||
NLI | MedNLI | ✓ | ✓ | ✓ | ||
NER | n2c2 deid 2006 | ✓ | ✓ | ✓ | ||
DC | n2c2 RFHD 2014 | ✓ | ✓ | ✓ | ||
NER | AnatEM | ✓ | ✓ | |||
NER | BC4CHEMD | ✓ | ✓ | |||
NER | BioNLP09 | ✓ | ✓ | |||
NER | BioNLP11EPI | ✓ | ✓ | |||
NER | BioNLP11ID | ✓ | ✓ | |||
NER | BioNLP13CG | ✓ | ✓ | |||
NER | BioNLP13GE | ✓ | ✓ | |||
NER | BioNLP13PC | ✓ | ✓ | |||
NER | CRAFT | * | ✓ | |||
NER | Ex-PTM | ✓ | ✓ | |||
NER | Linnaeus | ✓ | ✓ | |||
POS | GENIA | * | ✓ | |||
SA | Medical Drugs | ✓ | ✓ | |||
SR | COVID | private | ||||
SR | Cooking | private | ||||
SR | HRT | private | ||||
SR | Accelerometer | private | ||||
SR | Acromegaly | private |
* denotes dataset implementation in-progress
If you use BigBIO in your work, please cite
@article{fries2022bigbio,
title = {
BigBIO: A Framework for Data-Centric Biomedical Natural Language
Processing
},
author = {
Fries, Jason Alan and Weber, Leon and Seelam, Natasha and Altay,
Gabriel and Datta, Debajyoti and Garda, Samuele and Kang, Myungsun
and Su, Ruisi and Kusa, Wojciech and Cahyawijaya, Samuel and others
},
journal = {arXiv preprint arXiv:2206.15076},
year = 2022
}
BigBIO is a open source, community effort made possible through the efforts of many volunteers as part of BigScience and the Biomedical Hackathon.