🤖 BioOntoBERT 🤖 Combining BERT with Biomedical Ontologies

This repository provides the code and instructions for pre-training the BioOntoBERT model, which integrates BERT with knowledge from biomedical ontologies. The model is pre-trained using a corpus generated by Onto2Sen from biomedical ontologies and then fine-tuned on the MedMCQA dataset. BioOntoBERT demonstrates enhanced performance over baseline BERT models, including PubMedBERT, in biomedical multiple-choice question-answering tasks. Remarkably, it achieves this with only 0.7% of the pre-training data used for PubMedBERT, showcasing its efficiency and improved accuracy.

Introduction

BioOntoBERT is a domain-specific language model for the biomedical domain. It is pre-trained on a corpus generated from biomedical ontologies using the Onto2Sen methodology, which helps the model capture domain-specific context and semantics. The pre-trained model is then fine-tuned on the MedMCQA dataset, a benchmark for biomedical question answering, to improve its performance on that task.

Requirements

Python (>=3.6)
PyTorch (>=1.6)
Transformers library (Hugging Face)
CUDA (optional but recommended for faster training)
MedMCQA dataset

Install the required packages using:

pip install torch torchvision transformers
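
After installing, it can help to confirm that the interpreter and packages are actually visible before starting a long training run. The following is a minimal sketch, not part of the repository: the helper name `check_environment` is hypothetical, and it only checks that the packages can be located, not their versions.

```python
import importlib.util
import sys

# Packages named in the install command above.
REQUIRED_PACKAGES = ("torch", "torchvision", "transformers")


def check_environment(packages=REQUIRED_PACKAGES, min_python=(3, 6)):
    """Report whether Python is new enough and which packages can be located."""
    report = {"python_ok": sys.version_info >= min_python}
    for name in packages:
        # find_spec returns None when a top-level package cannot be found.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'MISSING'}")
```

Any entry reported as MISSING suggests re-running the install command in the same environment that will run the training scripts.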

Usage

Follow the steps below to pre-train BioOntoBERT on the Onto2Sen biomedical corpus and fine-tune it on the MedMCQA dataset:

Pre-training

Data Preparation: Prepare the Onto2Sen-generated biomedical corpus in text format for pre-training.
Model Configuration: Modify the pre-training configuration in pretrain_config.json to set hyperparameters, paths, and other settings.
Run Pre-training: Execute the pre-training script.
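
To illustrate the Model Configuration step, here is a rough sketch of how a JSON configuration could be loaded and sanity-checked before pre-training. Every key name, path, and default value below is hypothetical (including the choice of `bert-base-uncased` as the starting checkpoint); the actual schema of pretrain_config.json in this repository may differ, so treat this as a pattern rather than the repository's format.

```python
import json
from pathlib import Path

# Hypothetical defaults; the real keys in pretrain_config.json may differ.
DEFAULT_PRETRAIN_CONFIG = {
    "corpus_path": "data/onto2sen_corpus.txt",  # assumed location of the Onto2Sen text corpus
    "output_dir": "checkpoints/bioontobert",    # assumed checkpoint directory
    "base_model": "bert-base-uncased",          # assumed starting checkpoint
    "max_seq_length": 128,
    "batch_size": 32,
    "learning_rate": 5e-5,
    "num_train_epochs": 3,
}


def load_config(path, defaults=DEFAULT_PRETRAIN_CONFIG):
    """Merge overrides from a JSON file onto the defaults and sanity-check them."""
    config = dict(defaults)
    config_file = Path(path)
    if config_file.exists():
        overrides = json.loads(config_file.read_text(encoding="utf-8"))
        unknown = set(overrides) - set(defaults)
        if unknown:
            # Fail early on typos instead of silently ignoring a setting.
            raise KeyError(f"Unknown config keys: {sorted(unknown)}")
        config.update(overrides)
    if config["batch_size"] <= 0 or config["learning_rate"] <= 0:
        raise ValueError("batch_size and learning_rate must be positive")
    return config
```

Rejecting unknown keys is a deliberate choice in this sketch: a misspelled hyperparameter then stops the run immediately rather than training with an unintended default.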

Fine-tuning

Data Preparation: Obtain the MedMCQA dataset and preprocess it for fine-tuning.
Model Configuration: Adjust the fine-tuning configuration in finetune_config.json according to your hardware and preferences.
Run Fine-tuning: Start fine-tuning the pre-trained BioOntoBERT model.
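
The Data Preparation step for fine-tuning can be sketched as follows: each multiple-choice question is expanded into one (question, option) pair per answer choice, with a 0-based label marking the correct one. The field names (`question`, `opa` to `opd`, `cop`) are assumed from the public MedMCQA release and should be checked against your copy of the data; the function name and the example record are made up for illustration and are not taken from this repository or the dataset.

```python
# Assumed MedMCQA field names; adjust if your copy of the dataset differs.
OPTION_FIELDS = ("opa", "opb", "opc", "opd")


def to_multiple_choice(record, label_is_one_based=False):
    """Turn one MedMCQA-style record into (question, option) pairs plus a 0-based label."""
    question = record["question"].strip()
    pairs = [(question, record[field].strip()) for field in OPTION_FIELDS]
    # Some distributions number the correct option from 1, others from 0.
    label = int(record["cop"]) - (1 if label_is_one_based else 0)
    if not 0 <= label < len(OPTION_FIELDS):
        raise ValueError(f"Label {label} out of range for record {record.get('id')}")
    return {"pairs": pairs, "label": label}


example = {
    "id": "demo-1",  # made-up record for illustration only
    "question": "Which vitamin deficiency causes scurvy?",
    "opa": "Vitamin A",
    "opb": "Vitamin B12",
    "opc": "Vitamin C",
    "opd": "Vitamin D",
    "cop": 2,
}

formatted = to_multiple_choice(example)
print(formatted["label"], formatted["pairs"][formatted["label"]])
```

Each pair can then be tokenized as a two-segment input and scored by a multiple-choice head such as `BertForMultipleChoice` from the Transformers library; whether the fine-tuning script in this repository uses that class is an assumption.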

Results

The table above shows that BioOntoBERT outperforms other pre-trained BERT models while using only 158 MB of pre-training data derived from biomedical ontologies.

