This repository contains the AI models built around the SODA-data datasets.
For transparency, we provide bash scripts that automatically run the experiments shown in the papers related to this and the aforementioned SODA-data repository.
These scripts are located in the `scripts` folder of the repository.
Everything is ready to run on a Linux machine with GPU support using the provided Docker container.
Currently, we provide models for:
- Named-entity recognition
- Semantic identification of empirical roles of entities
- Panelization of figure captions
All these models can be found in the `token_classification` folder.
- Clone the repository to your local computer.
- Generate a virtual environment and activate it:

```bash
python -m venv /path/to/new/virtual/environment
source /path/to/new/virtual/environment/bin/activate
```
- Make sure that you have a docker-compose version that supports Compose file format 2.3 or higher:

```bash
docker-compose --version
>> docker-compose version 1.29.2, build unknown
```
- Build the Docker container, start it, and open a shell inside it:

```bash
docker-compose build --force-rm --no-cache
docker-compose up -d
docker-compose exec nlp bash
```
- From inside the container you can run any command:

```bash
# Run all the available experiments
sh scripts/run_all.sh

# Run NER
sh scripts/ner.sh

# Run the panelization task
sh scripts/panelization.sh

# Run the semantic roles tasks
sh scripts/geneprod.sh
sh scripts/smallmol.sh
```
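When you are done, exit the container shell and shut the services down:

```bash
exit                  # leave the container shell
docker-compose down   # stop and remove the containers
```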
You can also run your own version of the models. Any model can be trained from the command line by passing the same `TrainingArguments` as in HuggingFace. This means that the models can even be automatically uploaded to your own HuggingFace account, provided a valid token is given.
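For example, assuming the trainer forwards the standard HuggingFace hub `TrainingArguments`, an upload could be configured roughly like this (the model id and token are placeholders, not values from this repository):

```bash
# Hypothetical sketch using the standard HuggingFace hub TrainingArguments.
# "your-username/your-model" and $HF_TOKEN are placeholders.
python -m soda_model.token_classification.trainer \
    --dataset_id "EMBO/SourceData" \
    --task NER \
    --do_train \
    --push_to_hub \
    --hub_model_id "your-username/your-model" \
    --hub_token "$HF_TOKEN"
```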
Example of how to train a NER model using this repository from inside the container:
```bash
python -m soda_model.token_classification.trainer \
    --dataset_id "EMBO/SourceData" \
    --task NER \
    --version 1.0.0 \
    --from_pretrained microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract \
    --ner_labels all \
    --filter_empty \
    --max_length 512 \
    --max_steps 50 \
    --masking_probability 1.0 \
    --replacement_probability 1.0 \
    --classifier_dropout 0.2 \
    --do_train \
    --do_predict \
    --use_crf \
    --report_to none \
    --truncation \
    --padding "longest" \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 4 \
    --evaluation_strategy "no" \
    --save_strategy "no" \
    --results_file "test_crf_"
```
```bash
# Empirical roles of gene products.
# Shows which tokens belong to GENEPROD; no masking is needed.
# Can be used for SMALL_MOLECULE by changing ROLES_GP to ROLES_SM.
python -m soda_model.token_classification.trainer \
    --dataset_id "EMBO/SourceData" \
    --task ROLES_GP \
    --version 1.0.0 \
    --from_pretrained microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract \
    --ner_labels all \
    --filter_empty \
    --max_length 512 \
    --num_train_epochs 2.0 \
    --masking_probability 0.0 \
    --replacement_probability 0.0 \
    --classifier_dropout 0.2 \
    --do_train \
    --do_predict \
    --report_to none \
    --truncation \
    --padding "longest" \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 32 \
    --evaluation_strategy "no" \
    --save_strategy "no" \
    --results_file "roles_gp_no_masking_" \
    --use_is_category
```
```bash
# Empirical roles of gene products.
# Masks the GENEPROD entities.
# Can be used for SMALL_MOLECULE by changing ROLES_GP to ROLES_SM.
python -m soda_model.token_classification.trainer \
    --dataset_id "EMBO/SourceData" \
    --task ROLES_GP \
    --version 1.0.0 \
    --from_pretrained microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract \
    --ner_labels all \
    --filter_empty \
    --max_length 512 \
    --num_train_epochs 2.0 \
    --masking_probability 1.0 \
    --replacement_probability 1.0 \
    --classifier_dropout 0.2 \
    --do_train \
    --do_predict \
    --report_to none \
    --truncation \
    --padding "longest" \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 32 \
    --evaluation_strategy "no" \
    --save_strategy "no" \
    --results_file "roles_gp_masking_"
```
```bash
# Assigns empirical roles to GENEPROD and SMALL_MOLECULE simultaneously.
# It does not mask the entities, but adds an indicator
# of which entity the tokens belong to.
python -m soda_model.token_classification.trainer \
    --dataset_id "EMBO/SourceData" \
    --task ROLES_MULTI \
    --version 1.0.0 \
    --from_pretrained microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract \
    --ner_labels all \
    --filter_empty \
    --max_length 512 \
    --num_train_epochs 2.0 \
    --masking_probability 0.0 \
    --replacement_probability 0.0 \
    --classifier_dropout 0.2 \
    --do_train \
    --do_predict \
    --report_to none \
    --truncation \
    --padding "longest" \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 32 \
    --evaluation_strategy "no" \
    --save_strategy "no" \
    --results_file "roles_multi_"
```
```bash
# Like the above, but it marks the entities in the text
# with the string passed as --entity_identifier.
python -m soda_model.token_classification.trainer \
    --dataset_id "EMBO/SourceData" \
    --task ROLES_MULTI \
    --version 1.0.0 \
    --from_pretrained microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract \
    --ner_labels all \
    --filter_empty \
    --max_length 512 \
    --num_train_epochs 2.0 \
    --masking_probability 0.0 \
    --replacement_probability 0.0 \
    --classifier_dropout 0.2 \
    --do_train \
    --do_predict \
    --report_to none \
    --truncation \
    --padding "longest" \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 32 \
    --evaluation_strategy "no" \
    --save_strategy "no" \
    --results_file "roles_multi_" \
    --entity_identifier "(*&"
```
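Beyond the provided scripts, the underlying dataset can be inspected directly with the HuggingFace `datasets` library. A minimal sketch; the configuration name `"NER"` is an assumption based on the task names above:

```python
from datasets import load_dataset

# Minimal sketch for inspecting the training data.
# The configuration name "NER" is an assumption based on the task names above.
dataset = load_dataset("EMBO/SourceData", "NER")

print(dataset)              # available splits and their sizes
print(dataset["train"][0])  # one tokenized example with its labels
```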
This project has been set up using PyScaffold 4.4. For details and usage information on PyScaffold see https://pyscaffold.org/.