This repository is the official implementation of "Self-supervised context-aware Covid-19 document exploration through atlas grounding" by Dusan Grujicic*, Gorjan Radevski*, Tinne Tuytelaars and Matthew Blaschko, presented at the NLP COVID-19 Workshop at ACL 2020.
See our Cord-19 Explorer and our Cord-19 Visualizer tools.
* Equal contribution
If you are using Poetry, navigating to the project root directory and running `poetry install` will suffice. Otherwise, a `requirements.txt` file is provided, so you can install all dependencies with `pip install -r requirements.txt`. However, if you just want to download the trained models or dataset splits, make sure `gdown` is installed. It comes with the project dependencies; otherwise, install it with `pip install gdown`.
The data we use to perform the research consists of the splits used for training, validating and testing the model, together with a 3D human model.
The training, validation and test splits obtained from the original dataset can be downloaded with `gdown` using the code snippet below:
```shell
gdown "https://drive.google.com/uc?id=1kLvbRVzyR-66lrfzLfeFd3k9-l_S_Cl4" -O data/cord_dataset_train.json
gdown "https://drive.google.com/open?id=1mnlcI5HwgY9RaCqPyWmpEeftnqIxAUQQ" -O data/cord_dataset_val.json
gdown "https://drive.google.com/uc?id=18VSbspzB2VjxDdLaVSNyFB-GZAvEopGE" -O data/cord_dataset_test.json
```
Instructions for obtaining the human atlas can be found on the Voxel-Man website. The obtained model contains images of the male head (`head.zip`) and torso (`innerorgans.zip`). The unzipped `innerorgans/` directory contains a text file with organs and their segmentation labels, and three directories: `CT/`, `labels/` and `rgb/`.

The `innerorgans/labels/` directory contains slices of the human atlas in the form of `.tiff` images, where the grayscale level represents the segmentation label of each organ. It is used for training and evaluating the model, and should be moved to the `data/` directory of the project prior to running the scripts.
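Since each segmentation label is encoded as a grayscale value, a quick sanity check on the atlas slices is to count the labels present in one. Below is a minimal sketch; the slice filename in the comment is an assumed example, so the runnable demo uses a synthetic array instead:

```python
import numpy as np

def label_counts(slice_array):
    """Return {grayscale label: pixel count} for one atlas slice."""
    labels, counts = np.unique(slice_array, return_counts=True)
    return dict(zip(labels.tolist(), counts.tolist()))

# With the real atlas you would load a slice instead, e.g.:
#   from PIL import Image
#   slice_array = np.array(Image.open("data/labels/slice_0001.tiff"))
# (the exact filename pattern is an assumption)
demo = np.array([[0, 0, 7], [7, 7, 3]], dtype=np.uint8)
print(label_counts(demo))  # {0: 2, 3: 1, 7: 3}
```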
The four required json files, `organ2ind.json`, `ind2organ.json`, `organ2label.json` and `organ2alias.json`, which contain the dictionaries related to the organs in the human atlas, can be downloaded and extracted by running:
```shell
gdown "https://drive.google.com/uc?id=18qxmrOovy1_Cd4ceUNLPKTQUHf3RRs1r" -O data/data_organs_cord.zip
unzip -qq data/data_organs_cord.zip
rm data/data_organs_cord.zip
```
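As a sketch of how these dictionaries might be consumed (their exact contents are an assumption based on the filenames), the snippet below checks that `ind2organ.json` inverts `organ2ind.json`; the demo uses toy dictionaries rather than the real files:

```python
import json

def check_inverse(organ2ind, ind2organ):
    """Sanity-check that ind2organ inverts organ2ind (assuming ind2organ
    has string keys, since JSON object keys are always strings)."""
    return all(ind2organ[str(ind)] == organ for organ, ind in organ2ind.items())

# On the real files you would load them first, e.g.:
#   organ2ind = json.load(open("data/data_organs_cord/organ2ind.json"))
#   ind2organ = json.load(open("data/data_organs_cord/ind2organ.json"))
toy_organ2ind = {"liver": 0, "heart": 1}
toy_ind2organ = {"0": "liver", "1": "heart"}
print(check_inverse(toy_organ2ind, toy_ind2organ))  # True
```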
Details of the steps (removals, mergers of organ segmentation labels, and renamings) that resulted in these json files can be found here. Three additional json files need to be generated after obtaining the human atlas and moving the `labels/` directory with the images to the `data/` directory of the project. This can be done by running the following script:

```shell
python src/generate_voxel_dict.py --organs_dir_path "data/data_organs_cord"\
 --voxelman_images_path "data/labels"
```
This script should generate three additional json files, `organ2voxels.json`, `organ2voxels_eroded.json` and `organ2summary.json`, and place them in the `data/data_organs_cord/` directory.
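Assuming `organ2voxels.json` maps each organ to a list of `[x, y, z]` voxel coordinates (an assumption about its layout, not something the repository documents), a quick sanity check could compute each organ's centroid:

```python
import json
import numpy as np

def organ_centroids(organ2voxels):
    """Map organ name -> mean voxel coordinate, assuming each entry
    is a list of [x, y, z] coordinates."""
    return {organ: np.mean(voxels, axis=0).tolist()
            for organ, voxels in organ2voxels.items()}

# On the real file:
#   organ2voxels = json.load(open("data/data_organs_cord/organ2voxels.json"))
toy = {"liver": [[0, 0, 0], [2, 0, 0]], "heart": [[1, 1, 1]]}
print(organ_centroids(toy))  # {'liver': [1.0, 0.0, 0.0], 'heart': [1.0, 1.0, 1.0]}
```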
To train a new model on the training data split, from the root project directory run:
```shell
python src/train_mapping_reg.py --batch_size 128\
 --save_model_path "models/cord_basebert_grounding.pt"\
 --save_intermediate_model_path "models/intermediate_cord_basebert_grounding.pt"\
 --train_json_path "data/cord_dataset_train.json"\
 --val_json_path "data/cord_dataset_val.json"\
 --epochs 20\
 --bert_name "bert-base-uncased"\
 --loss_type "all_voxels"\
 --organs_dir_path "data/data_organs_cord"\
 --learning_rate 2e-5
```
The script will train a model for 20 epochs and save the model that reports the lowest distance to the nearest voxel on the validation set to `models/cord_basebert_grounding.pt`. Furthermore, keeping the other arguments as they are while changing `--bert_name` to `bert-base-uncased`, `emilyalsentzer/Bio_ClinicalBERTpytorch`, `allenai/scibert_scivocab_uncased` or `emilyalsentzer/Bio_ClinicalBERT` will reproduce the `BertBase`, `BioBert`, `SciBert` and `ClinicalBert` models from the paper, respectively. To train the model we use for the Cord-19 Explorer tool, change `--bert_name` to `google/bert_uncased_L-4_H-512_A-8`, `--learning_rate` to `5e-5` and `--epochs` to `50`.
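Putting those changes together, the Cord-19 Explorer training run would look roughly like the following (the `--save_model_path` and `--save_intermediate_model_path` values are illustrative choices, not prescribed by the paper):

```shell
python src/train_mapping_reg.py --batch_size 128\
 --save_model_path "models/cord_smallbert_grounding.pt"\
 --save_intermediate_model_path "models/intermediate_cord_smallbert_grounding.pt"\
 --train_json_path "data/cord_dataset_train.json"\
 --val_json_path "data/cord_dataset_val.json"\
 --epochs 50\
 --bert_name "google/bert_uncased_L-4_H-512_A-8"\
 --loss_type "all_voxels"\
 --organs_dir_path "data/data_organs_cord"\
 --learning_rate 5e-5
```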
To perform inference on the test data split, from the root project directory run:

```shell
python src/inference_mapping_reg.py --batch_size 128\
 --checkpoint_path "models/cord_basebert_grounding.pt"\
 --test_json_path "data/cord_dataset_test.json"\
 --bert_name "bert-base-uncased"\
 --organs_dir_path "data/data_organs_cord"
```

The script will perform inference with the trained model saved at `models/cord_basebert_grounding.pt` and report:
- Distance to the nearest voxel of the nearest correct organ (NVD).
- Distance to the nearest correct organ voxel, calculated only on the samples for which the projection falls outside the organ volume (NVD-O).
- The rate at which sentences are grounded within the volume of the correct organ, which we denote as the Inside Organ Ratio (IOR).

Both NVD and NVD-O are calculated in centimeters.
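For clarity, the three metrics can be sketched as follows. Treating "inside the organ" as the rounded prediction coinciding with an organ voxel is an assumption made for illustration, not necessarily the exact membership test used by the evaluation script:

```python
import numpy as np

def nvd(pred, organ_voxels):
    """NVD: distance (cm) from a predicted 3D point to the nearest voxel
    of the nearest correct organ; organ_voxels is an (N, 3) array."""
    return float(np.linalg.norm(organ_voxels - pred, axis=1).min())

def inside_organ(pred, organ_voxels):
    """Assumed membership test: the rounded prediction hits an organ voxel."""
    return tuple(np.round(pred).astype(int)) in {tuple(v) for v in organ_voxels}

def evaluate(preds, voxel_sets):
    """Aggregate NVD, NVD-O and IOR over samples (a sketch of the
    reported metrics, not the actual evaluation code)."""
    dists = [nvd(p, v) for p, v in zip(preds, voxel_sets)]
    inside = [inside_organ(p, v) for p, v in zip(preds, voxel_sets)]
    outside = [d for d, i in zip(dists, inside) if not i]
    return {
        "NVD": float(np.mean(dists)),
        "NVD-O": float(np.mean(outside)) if outside else 0.0,
        "IOR": float(np.mean(inside)),
    }

voxels = np.array([[0, 0, 0], [1, 0, 0]])  # toy organ volume
print(evaluate([np.array([0.1, 0.0, 0.0]), np.array([3.0, 0.0, 0.0])],
               [voxels, voxels]))
```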
All models used to report the results in the paper can be downloaded with `gdown` using the code snippet below:
```shell
gdown "https://drive.google.com/uc?id=17_2g3kWndZI64WpGSR4EZEIK2qBzLrtI" -O models/cord_basebert_grounding.pt
gdown "https://drive.google.com/uc?id=17nUZ0Iym6q7U83kO9QowdmCzvQlp7Cce" -O models/cord_biobert_grounding.pt
gdown "https://drive.google.com/uc?id=1_WxTKu7qJ0sF5oLqniYnTMUVIFcJ1pPJ" -O models/cord_scibert_grounding.pt
gdown "https://drive.google.com/uc?id=144TyLhPmPnZNH88hP4WHLzAC4So7OvFU" -O models/cord_clinicalbert_grounding.pt
gdown "https://drive.google.com/uc?id=11OHi9wETRPAHUTIH4p6BqZY3gH6NJtve" -O models/cord_smallbert_grounding.pt
```
If you found this code useful, or use some of our resources in your work, we would appreciate it if you cite our paper:
```bibtex
@inproceedings{grujicic-radevski-covid-20,
  title     = {Self-supervised context-aware {Covid-19} document exploration through atlas grounding},
  author    = {Dusan Grujicic and Gorjan Radevski and Tinne Tuytelaars and Matthew Blaschko},
  year      = {2020},
  booktitle = {Proceedings of the 1st Workshop on {NLP} for {COVID-19} at {ACL} 2020},
  month     = jul,
  volume    = {1},
  address   = {Online},
  publisher = {Association for Computational Linguistics},
  abstract  = {In this paper, we aim to develop a self-supervised grounding of Covid-related medical text based on the actual spatial relationships between the referred anatomical concepts. More specifically, we learn to project sentences into a physical space defined by a three-dimensional anatomical atlas, allowing for a visual approach to navigating Covid-related literature. We design a straightforward and empirically effective training objective to reduce the curated data dependency issue. We use BERT as the main building block of our model and perform a quantitative analysis that demonstrates that the model learns a context-aware mapping. We illustrate two potential use-cases for our approach, one in interactive, 3D data exploration, and the other in document retrieval. To accelerate research in this direction, we make public all trained models, codebase and the developed tools, which can be accessed at https://github.com/gorjanradevski/macchina/.},
}
```
Everything is licensed under the MIT License.