Skip to content

This repository contains the code associated with the participation of the Lasige-BioTM team in both sub-tracks of ProfNER.

Notifications You must be signed in to change notification settings

lasigeBioTM/LASIGE-participation-in-ProfNER

Repository files navigation

LASIGE-participation-in-ProfNER

The track ProfNER-ST: Identification of professions & occupations in Health-related Social Media" in the context of the #SMM4H 2021 included two different sub-tracks:

  • Track A: Tweet binary classification
  • Track B: NER offset detection and classification

This repository contains the code associated with the participation of the Lasige-BioTM team in both sub-tracks of ProfNER.

Draft schema with the pipeline


1. Setup

1.1. Data

To get the necessary data (ProfNER corpus, occupations gazeteer, ...) execute the following script:


./get_data.sh

1.2. Requirements

To install the necessary requirements execute the following script:


pip install -r requirements.txt


2. Preprocessing

To perform data augmentation in the train set (train_spacy.txt) using nlpaug library


python src/data_augmentation.py

Output: train_spacy.txt + train_key.txt + train_random.txt + train_synonym.txt in dir "profner/subtask-2/BIO/


3. MER

Python implementation of MER

3.1. To create and process lexicons for MER

The following lexicons are created and processed for MER:

  • 1st lexicon "profesionShort": it includes mentions belonging to "PROFESION" category in train files + synonyms (output in "profesion_list.txt")

  • 2nd lexicon "profesion": mentions belonging to "PROFESION" category in train files + synonyms, and entities in profner-gazetteer.tsv + synonyms

  • 3rd lexicon "situacion": mentions belonging to "SITUACION_LABORAL" category in train files (output in "situacion_laboral_list.txt")

  • 4th lexicon "actividad": mentions belonging to "ACTIVIDAD" category in train files (output in "actividad_list.txt")

  • 5th lexicon "figurativa": mentions belonging to "FIGURATIVA" category in train files (output in "figurativa_list.txt")

Run the script:


python src/mer/mer_annotate.py <mode>

Arg:

  • : if it is the first run, has value "lexicon", otherwise has value "predict"

3.2. Tweet classification and Named Entity Recognition

To recognize entities in test set, classify tweets, and generate predictions file for both sub-tracks run the same script with a different value for the first argument:


python src/mer/mer_annotate.py predict

Output: "valid_task1.txt" and "valid_task2_txt" with predictions for sub-track 7a and 7b, respectively.


4. FLAIR tagger

FLAIR framework

4.1. Preprocessing

To prepare train files for FLAIR:


python src/flair/flair_pre_process.py 

4.2. Training

To train the NER tagger:


python src/flair/train_ner_model.py <model>

Arg :

Output in "resources/taggers/"

4.3. Prediction

To recognize entities in test set and generate the output file:


python src/flair/predict_ner.py <model>

Arg :

Output TSV file in "/evaluation/flair_subtask_2/"

3.4. Tweet classification

To determine if a tweet in test set contains a mention of occupation:

python src/flair/flair_classification_tweet.py <model>

Arg :

Output TSV file in "/evaluation/flair_subtask_1/"

About

This repository contains the code associated with the participation of the Lasige-BioTM team in both sub-tracks of ProfNER.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published