In this project, we aim to replicate the deep learning model presented in the 2017 paper PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts. The goal is to classify each sentence of a medical abstract into its role (background, objective, method, results, or conclusions) to facilitate efficient reading and comprehension of research articles.
The growing number of randomized controlled trial (RCT) papers poses challenges in efficiently reviewing the literature, particularly for unstructured abstracts. We aim to address this by developing an NLP model that classifies abstract sentences into their respective roles, aiding researchers in quickly skimming through the literature while allowing for in-depth exploration when needed.
We start by downloading the PubMed 200k RCT dataset from GitHub, which serves as the foundation for our model training and evaluation.
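A minimal download sketch, assuming the dataset is laid out like the public pubmed-rct GitHub repository (the base URL and the 20k-sample split used here are assumptions; the full 200k split is distributed differently):

```python
import urllib.request
from pathlib import Path

# Assumed repo layout, based on the public pubmed-rct GitHub mirror.
BASE_URL = ("https://raw.githubusercontent.com/"
            "Franck-Dernoncourt/pubmed-rct/master/PubMed_20k_RCT")

def split_url(split: str) -> str:
    """Build the raw-file URL for a dataset split ('train', 'dev', 'test')."""
    return f"{BASE_URL}/{split}.txt"

def download_split(split: str, out_dir: str = "data") -> Path:
    """Download one split into out_dir and return the local path."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    dest = out / f"{split}.txt"
    if not dest.exists():  # skip re-download on repeated runs
        urllib.request.urlretrieve(split_url(split), dest)
    return dest
```

Keeping the URL construction separate from the download makes the path logic easy to test without network access.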
We develop a preprocessing function to prepare the dataset for modeling, including tokenization and embedding creation.
We create a baseline classifier built on TF-IDF features to establish a benchmark that our deep learning experiments must beat.
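One common way to build such a baseline is a scikit-learn pipeline pairing a TF-IDF vectorizer with Multinomial Naive Bayes (the classifier choice and the toy sentences below are illustrative assumptions, not the paper's setup):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in data; real runs would use the parsed train/dev splits.
train_sentences = [
    "To investigate the efficacy of the new drug.",
    "Patients were randomly assigned to two groups.",
    "The treatment group showed significant improvement.",
    "These findings support wider clinical use.",
]
train_labels = ["OBJECTIVE", "METHODS", "RESULTS", "CONCLUSIONS"]

# TF-IDF features feeding a Naive Bayes classifier: a fast,
# surprisingly strong baseline for sentence classification.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
baseline.fit(train_sentences, train_labels)
preds = baseline.predict(["Subjects were assigned to a control group."])
```

Because the whole pipeline is a single estimator, the same `fit`/`predict` interface carries over unchanged when swapping in the real dataset.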
We experiment with various deep learning models, incorporating different combinations of token embeddings, character embeddings, pretrained embeddings, and positional embeddings.
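For the character-embedding experiments, sentences first need to be re-expressed as character sequences a standard text vectorizer can consume. A small helper sketch (the space-joined representation is one common convention, not the only one):

```python
import string

def char_split(sentence: str) -> str:
    """Space-join the characters of a sentence so a word-level text
    vectorizer treats each character as its own 'token'."""
    return " ".join(list(sentence))

# An assumed character alphabet for sizing the char-embedding vocabulary;
# out-of-vocabulary characters are left to the downstream vectorizer.
CHAR_VOCAB = list(string.ascii_lowercase + string.digits + string.punctuation)
```

Reusing the word-level vectorizer on these character strings keeps the token and character pipelines symmetrical.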
We build a multi-input model, replicating the architecture outlined in the research paper, which combines several input types: token embeddings, character embeddings, and positional features.
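At the tensor level, the fusion step of such a multi-input model amounts to concatenating the per-sentence feature vectors before the classification head. A NumPy sketch (the one-hot depths of 15 and 20 and the embedding sizes in the comments are assumptions commonly used when replicating this paper, not values from the source):

```python
import numpy as np

def one_hot(index: int, depth: int) -> np.ndarray:
    """One-hot encode index at a fixed depth, clipping overly long
    abstracts into the last bucket (mirrors tf.one_hot with clipping)."""
    vec = np.zeros(depth, dtype=np.float32)
    vec[min(index, depth - 1)] = 1.0
    return vec

def fuse_features(token_emb, char_emb, line_number, total_lines,
                  line_depth=15, total_depth=20):
    """Concatenate sentence-level features the way a Keras Concatenate
    layer would just before the final softmax classifier."""
    return np.concatenate([
        token_emb,                            # e.g. 128-d sentence vector
        char_emb,                             # e.g. 64-d char-level vector
        one_hot(line_number, line_depth),     # position within abstract
        one_hot(total_lines, total_depth),    # abstract length
    ])
```

Seeing the fused vector's layout explicitly makes it easier to debug shape mismatches in the full Keras model.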
We evaluate the models and inspect the "most wrong" predictions, those the model misclassifies with the highest confidence, to gain insights and improve model performance.
Finally, we use our trained model to make predictions on unseen PubMed abstracts, demonstrating the model's practical application.
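Before the trained model can score an unseen abstract, the raw text must be split into sentences and given the same positional features used in training. A naive sketch (the `". "` split is a deliberate simplification; a real run might use a proper sentence segmenter such as spaCy):

```python
def prepare_abstract(abstract: str) -> list[dict]:
    """Split an unseen abstract into sentences and attach the
    positional features the model expects at inference time."""
    sentences = [s.strip()
                 for s in abstract.replace("\n", " ").split(". ")
                 if s.strip()]
    return [
        {"text": s, "line_number": i, "total_lines": len(sentences)}
        for i, s in enumerate(sentences)
    ]
```

The output dicts mirror the training samples, so the same feature pipeline can be applied unchanged before calling the model's predict step.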