The track "ProfNER-ST: Identification of professions & occupations in Health-related Social Media", organized in the context of the #SMM4H 2021 shared task, included two different sub-tracks:
- Track A: Tweet binary classification
- Track B: NER offset detection and classification
This repository contains the code associated with the participation of the Lasige-BioTM team in both sub-tracks of ProfNER.
Draft schema of the pipeline
To get the necessary data (ProfNER corpus, occupations gazetteer, ...), execute the following script:
./get_data.sh
To install the necessary requirements, run:
pip install -r requirements.txt
To perform data augmentation on the train set (train_spacy.txt) using the nlpaug library, run:
python src/data_augmentation.py
Output: train_spacy.txt, train_key.txt, train_random.txt, and train_synonym.txt in "profner/subtask-2/BIO/"
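For illustration, the three augmented variants could be produced with nlpaug along these lines (the specific augmenters and the Spanish WordNet option are assumptions; the actual logic lives in src/data_augmentation.py):

```python
# Hedged sketch: sentence-level augmentation with nlpaug.
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw

text = "los sanitarios trabajan sin descanso"

# keyboard-typo noise (plausible source of train_key.txt)
print(nac.KeyboardAug(aug_char_p=0.1).augment(text))

# random word swaps (plausible source of train_random.txt)
print(naw.RandomWordAug(action="swap").augment(text))

# WordNet synonym replacement (plausible source of train_synonym.txt);
# lang="spa" assumes the Spanish Open Multilingual WordNet is available via NLTK
print(naw.SynonymAug(aug_src="wordnet", lang="spa").augment(text))
```

Because the train file is BIO-tagged, the augmented sentences also need their labels re-aligned with the new tokens; how that is handled is up to src/data_augmentation.py.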
The following lexicons are created and processed for MER (a sketch of the mention extraction follows the list):
- 1st lexicon "profesionShort": mentions belonging to the "PROFESION" category in the train files + synonyms (output in "profesion_list.txt")
- 2nd lexicon "profesion": mentions belonging to the "PROFESION" category in the train files + synonyms, plus the entities in profner-gazetteer.tsv + synonyms
- 3rd lexicon "situacion": mentions belonging to the "SITUACION_LABORAL" category in the train files (output in "situacion_laboral_list.txt")
- 4th lexicon "actividad": mentions belonging to the "ACTIVIDAD" category in the train files (output in "actividad_list.txt")
- 5th lexicon "figurativa": mentions belonging to the "FIGURATIVA" category in the train files (output in "figurativa_list.txt")
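The following minimal sketch shows how the category mentions can be pulled out of a BIO-annotated train file; the two-column token/tag layout and the file paths are assumptions, and synonym expansion and the gazetteer merge are omitted:

```python
# Hedged sketch: extract category mentions from a BIO file (token<TAB>tag per line).
def extract_mentions(bio_file, category):
    """Collect the surface forms tagged B-<category>/I-<category>."""
    mentions, current = set(), []
    with open(bio_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                      # sentence boundary
                if current:
                    mentions.add(" ".join(current))
                    current = []
                continue
            token, tag = line.split("\t")[:2]
            if tag == f"B-{category}":
                if current:
                    mentions.add(" ".join(current))
                current = [token]
            elif tag == f"I-{category}" and current:
                current.append(token)
            elif current:
                mentions.add(" ".join(current))
                current = []
    if current:
        mentions.add(" ".join(current))
    return mentions

profesion = extract_mentions("profner/subtask-2/BIO/train_spacy.txt", "PROFESION")
with open("profesion_list.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(sorted(profesion)))
```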
Run the script:
python src/mer/mer_annotate.py <mode>
Arg:
- <mode>: use "lexicon" on the first run (to create and process the lexicons); afterwards use "predict" (see the sketch below)
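If the MER step is built on the merpy package (LASIGE's minimal entity recognizer), the two modes correspond roughly to creating/processing the lexicons and then tagging text with them; the sketch below works under that assumption, with illustrative file and lexicon names:

```python
# Hedged sketch assuming the MER annotation uses merpy
# (https://github.com/lasigeBioTM/merpy).
import merpy

# "lexicon" mode: load a mention list and pre-process it once
terms = open("profesion_list.txt", encoding="utf-8").read().splitlines()
merpy.create_lexicon(terms, "profesion")
merpy.process_lexicon("profesion")

# "predict" mode: tag new tweets with the processed lexicon
tweet = "mi madre es enfermera en un hospital de Madrid"
print(merpy.get_entities(tweet, "profesion"))
# e.g. [['12', '21', 'enfermera']]: character offsets and matched surface form
```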
To recognize entities in the test set, classify the tweets, and generate the prediction files for both sub-tracks, run the same script with a different value for the first argument:
python src/mer/mer_annotate.py predict
Output: "valid_task1.txt" and "valid_task2_txt" with predictions for sub-track 7a and 7b, respectively.
FLAIR framework
To prepare the train files for FLAIR:
python src/flair/flair_pre_process.py
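The pre-processed files can then be read with FLAIR's ColumnCorpus; the two-column token/NER layout and the dev file name below are assumptions:

```python
# Sketch: load the BIO files as a FLAIR corpus.
from flair.datasets import ColumnCorpus

columns = {0: "text", 1: "ner"}  # token in column 0, BIO tag in column 1
corpus = ColumnCorpus(
    "profner/subtask-2/BIO/",
    columns,
    train_file="train_spacy.txt",
    dev_file="valid_spacy.txt",  # hypothetical dev file name
)
print(corpus)
```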
To train the NER tagger:
python src/flair/train_ner_model.py <model>
Arg:
- "base": Spanish FLAIR embeddings
- "twitter": FastText Spanish COVID-19 CBOW uncased embeddings (Download)
- "medium": combination of the previous embeddings
Output in "resources/taggers/"
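As an illustration of the "base" configuration, a pair of stacked Spanish FLAIR embeddings can feed a BiLSTM-CRF SequenceTagger; the hyper-parameters and paths below are illustrative, and the "twitter"/"medium" options would swap in or add the FastText COVID-19 embeddings:

```python
# Hedged sketch: train the "base" NER tagger with FLAIR.
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

corpus = ColumnCorpus("profner/subtask-2/BIO/", {0: "text", 1: "ner"},
                      train_file="train_spacy.txt")

embeddings = StackedEmbeddings([
    FlairEmbeddings("es-forward"),
    FlairEmbeddings("es-backward"),
])

# older FLAIR releases expose make_tag_dictionary; newer ones use make_label_dictionary
tag_dictionary = corpus.make_tag_dictionary(tag_type="ner")

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="ner",
)

ModelTrainer(tagger, corpus).train("resources/taggers/base", max_epochs=10)
```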
To recognize entities in the test set and generate the output file:
python src/flair/predict_ner.py <model>
Arg:
- "base": Spanish FLAIR embeddings
- "twitter": FastText Spanish COVID-19 CBOW uncased embeddings (Download)
- "medium": combination of the previous embeddings
Output TSV file in "/evaluation/flair_subtask_2/"
To determine whether a tweet in the test set contains a mention of an occupation:
python src/flair/flair_classification_tweet.py <model>
Arg:
- "base": Spanish FLAIR embeddings
- "twitter": FastText Spanish COVID-19 CBOW uncased embeddings (Download)
- "medium": combination of the previous embeddings
Output TSV file in "/evaluation/flair_subtask_1/"