This repository contains the AI models built around the SODA-data datasets.
For transparency, we provide bash scripts that automatically run the experiments shown in the papers related to this and the aforementioned SODA-data repository.
These scripts are located in the `scripts` folder of the repository.
Everything is ready to run on a Linux machine with GPU support using the provided Docker container.
Currently, we provide models for:
- Named-entity recognition
- Semantic identification of empirical roles of entities
- Panelization of figure captions
All these models can be found in the `token_classification` folder.
- Clone the repository to your local computer.
- Generate a virtual environment and activate it:

```bash
python -m venv /path/to/new/virtual/environment
source /path/to/new/virtual/environment/bin/activate
```
- Make sure that you have a docker-compose version that supports Compose file format 2.3 or higher:

```bash
docker-compose --version
>> docker-compose version 1.29.2, build unknown
```
- Build the Docker container, start it, and open a shell inside it:

```bash
docker-compose build --force-rm --no-cache
docker-compose up -d
docker-compose exec nlp bash
```
- From inside the container you can run any command:

```bash
# Run all the available experiments
sh scripts/run_all.sh

# Run NER
sh scripts/ner.sh

# Run the panelization task
sh scripts/panelization.sh

# Run the semantic roles tasks
sh scripts/geneprod.sh
sh scripts/smallmol.sh
```
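When you are done, exit the container shell and shut the services down:

```bash
exit                  # leave the container shell
docker-compose down   # stop and remove the containers
```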
You can also run your own version of the models. Any model can be trained from the command line by passing the same `TrainingArguments` as in HuggingFace. This means that the models can even be automatically uploaded to your own HuggingFace account, provided a valid token is given.
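For example, assuming the trainer forwards the standard HuggingFace hub `TrainingArguments`, an upload could be configured roughly like this (the model id and token are placeholders, not values from this repository):

```bash
# Hypothetical sketch using the standard HuggingFace hub TrainingArguments.
# "your-username/your-model" and $HF_TOKEN are placeholders.
python -m soda_model.token_classification.trainer \
    --dataset_id "EMBO/SourceData" \
    --task NER \
    --do_train \
    --push_to_hub \
    --hub_model_id "your-username/your-model" \
    --hub_token "$HF_TOKEN"
```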
Example of how to train a NER model using this repository from inside the container:
```bash
python -m soda_model.token_classification.trainer \
    --dataset_id "EMBO/SourceData" \
    --task NER \
    --version 1.0.0 \
    --from_pretrained microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract \
    --ner_labels all \
    --filter_empty \
    --max_length 512 \
    --max_steps 50 \
    --masking_probability 1.0 \
    --replacement_probability 1.0 \
    --classifier_dropout 0.2 \
    --do_train \
    --do_predict \
    --use_crf \
    --report_to none \
    --truncation \
    --padding "longest" \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 4 \
    --evaluation_strategy "no" \
    --save_strategy "no" \
    --results_file "test_crf_"
```
```bash
# Empirical roles of gene products.
# Shows which tokens belong to GENEPROD; no masking is needed.
# Can be used for SMALL_MOLECULE by changing ROLES_GP to ROLES_SM.
python -m soda_model.token_classification.trainer \
    --dataset_id "EMBO/SourceData" \
    --task ROLES_GP \
    --version 1.0.0 \
    --from_pretrained microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract \
    --ner_labels all \
    --filter_empty \
    --max_length 512 \
    --num_train_epochs 2.0 \
    --masking_probability 0.0 \
    --replacement_probability 0.0 \
    --classifier_dropout 0.2 \
    --do_train \
    --do_predict \
    --report_to none \
    --truncation \
    --padding "longest" \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 32 \
    --evaluation_strategy "no" \
    --save_strategy "no" \
    --results_file "roles_gp_no_masking_" \
    --use_is_category
```
```bash
# Empirical roles of gene products.
# Masks the GENEPROD entities.
# Can be used for SMALL_MOLECULE by changing ROLES_GP to ROLES_SM.
python -m soda_model.token_classification.trainer \
    --dataset_id "EMBO/SourceData" \
    --task ROLES_GP \
    --version 1.0.0 \
    --from_pretrained microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract \
    --ner_labels all \
    --filter_empty \
    --max_length 512 \
    --num_train_epochs 2.0 \
    --masking_probability 1.0 \
    --replacement_probability 1.0 \
    --classifier_dropout 0.2 \
    --do_train \
    --do_predict \
    --report_to none \
    --truncation \
    --padding "longest" \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 32 \
    --evaluation_strategy "no" \
    --save_strategy "no" \
    --results_file "roles_gp_masking_"
```
```bash
# Assigns empirical roles to GENEPROD and SMALL_MOLECULE simultaneously.
# It does not mask the entities, but adds an indicator
# of which entity the tokens belong to.
python -m soda_model.token_classification.trainer \
    --dataset_id "EMBO/SourceData" \
    --task ROLES_MULTI \
    --version 1.0.0 \
    --from_pretrained microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract \
    --ner_labels all \
    --filter_empty \
    --max_length 512 \
    --num_train_epochs 2.0 \
    --masking_probability 0.0 \
    --replacement_probability 0.0 \
    --classifier_dropout 0.2 \
    --do_train \
    --do_predict \
    --report_to none \
    --truncation \
    --padding "longest" \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 32 \
    --evaluation_strategy "no" \
    --save_strategy "no" \
    --results_file "roles_multi_"
```
```bash
# Like the above, but it marks the entities in the text
# with the string passed as --entity_identifier.
python -m soda_model.token_classification.trainer \
    --dataset_id "EMBO/SourceData" \
    --task ROLES_MULTI \
    --version 1.0.0 \
    --from_pretrained microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract \
    --ner_labels all \
    --filter_empty \
    --max_length 512 \
    --num_train_epochs 2.0 \
    --masking_probability 0.0 \
    --replacement_probability 0.0 \
    --classifier_dropout 0.2 \
    --do_train \
    --do_predict \
    --report_to none \
    --truncation \
    --padding "longest" \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 32 \
    --evaluation_strategy "no" \
    --save_strategy "no" \
    --results_file "roles_multi_" \
    --entity_identifier "(*&"
```
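Beyond the provided scripts, the underlying dataset can be inspected directly with the HuggingFace `datasets` library. A minimal sketch; the configuration name `"NER"` is an assumption based on the task names above:

```python
from datasets import load_dataset

# Minimal sketch for inspecting the training data.
# The configuration name "NER" is an assumption based on the task names above.
dataset = load_dataset("EMBO/SourceData", "NER")

print(dataset)              # available splits and their sizes
print(dataset["train"][0])  # one tokenized example with its labels
```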
This project has been set up using PyScaffold 4.4. For details and usage information on PyScaffold see https://pyscaffold.org/.