The track "ProfNER-ST: Identification of professions & occupations in Health-related Social Media", organized in the context of the #SMM4H 2021 shared task, included two different sub-tracks:
- Track A: Tweet binary classification
- Track B: NER offset detection and classification
This repository contains the code associated with the participation of the Lasige-BioTM team in both sub-tracks of ProfNER.
Draft schema of the pipeline
To get the necessary data (ProfNER corpus, occupations gazetteer, ...), execute the following script:
./get_data.sh
To install the necessary requirements, run:
pip install -r requirements.txt
To perform data augmentation on the train set (train_spacy.txt) using the nlpaug library, run:
python src/data_augmentation.py
Output: train_spacy.txt, train_key.txt, train_random.txt, and train_synonym.txt in "profner/subtask-2/BIO/"
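For illustration, the three augmented variants could be produced with nlpaug along these lines (the specific augmenters and the Spanish WordNet option are assumptions; the actual logic lives in src/data_augmentation.py):

```python
# Hedged sketch: sentence-level augmentation with nlpaug.
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw

text = "los sanitarios trabajan sin descanso"

# keyboard-typo noise (plausible source of train_key.txt)
print(nac.KeyboardAug(aug_char_p=0.1).augment(text))

# random word swaps (plausible source of train_random.txt)
print(naw.RandomWordAug(action="swap").augment(text))

# WordNet synonym replacement (plausible source of train_synonym.txt);
# lang="spa" assumes the Spanish Open Multilingual WordNet is available via NLTK
print(naw.SynonymAug(aug_src="wordnet", lang="spa").augment(text))
```

Because the train file is BIO-tagged, the augmented sentences also need their labels re-aligned with the new tokens; how that is handled is up to src/data_augmentation.py.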
The following lexicons are created and processed for MER (a sketch of the mention extraction follows the list):
- 1st lexicon "profesionShort": mentions belonging to the "PROFESION" category in the train files + synonyms (output in "profesion_list.txt")
- 2nd lexicon "profesion": mentions belonging to the "PROFESION" category in the train files + synonyms, plus the entities in profner-gazetteer.tsv + synonyms
- 3rd lexicon "situacion": mentions belonging to the "SITUACION_LABORAL" category in the train files (output in "situacion_laboral_list.txt")
- 4th lexicon "actividad": mentions belonging to the "ACTIVIDAD" category in the train files (output in "actividad_list.txt")
- 5th lexicon "figurativa": mentions belonging to the "FIGURATIVA" category in the train files (output in "figurativa_list.txt")
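The following minimal sketch shows how the category mentions can be pulled out of a BIO-annotated train file; the two-column token/tag layout and the file paths are assumptions, and synonym expansion and the gazetteer merge are omitted:

```python
# Hedged sketch: extract category mentions from a BIO file (token<TAB>tag per line).
def extract_mentions(bio_file, category):
    """Collect the surface forms tagged B-<category>/I-<category>."""
    mentions, current = set(), []
    with open(bio_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                      # sentence boundary
                if current:
                    mentions.add(" ".join(current))
                    current = []
                continue
            token, tag = line.split("\t")[:2]
            if tag == f"B-{category}":
                if current:
                    mentions.add(" ".join(current))
                current = [token]
            elif tag == f"I-{category}" and current:
                current.append(token)
            elif current:
                mentions.add(" ".join(current))
                current = []
    if current:
        mentions.add(" ".join(current))
    return mentions

profesion = extract_mentions("profner/subtask-2/BIO/train_spacy.txt", "PROFESION")
with open("profesion_list.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(sorted(profesion)))
```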
Run the script:
python src/mer/mer_annotate.py <mode>
Arg:
- <mode>: use "lexicon" on the first run (to create and process the lexicons); afterwards use "predict" (see the sketch below)
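If the MER step is built on the merpy package (LASIGE's minimal entity recognizer), the two modes correspond roughly to creating/processing the lexicons and then tagging text with them; the sketch below works under that assumption, with illustrative file and lexicon names:

```python
# Hedged sketch assuming the MER annotation uses merpy
# (https://github.com/lasigeBioTM/merpy).
import merpy

# "lexicon" mode: load a mention list and pre-process it once
terms = open("profesion_list.txt", encoding="utf-8").read().splitlines()
merpy.create_lexicon(terms, "profesion")
merpy.process_lexicon("profesion")

# "predict" mode: tag new tweets with the processed lexicon
tweet = "mi madre es enfermera en un hospital de Madrid"
print(merpy.get_entities(tweet, "profesion"))
# e.g. [['12', '21', 'enfermera']]: character offsets and matched surface form
```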
To recognize entities in the test set, classify the tweets, and generate the prediction files for both sub-tracks, run the same script with a different value for the first argument:
python src/mer/mer_annotate.py predict
Output: "valid_task1.txt" and "valid_task2_txt" with predictions for sub-track 7a and 7b, respectively.
FLAIR framework
To prepare the train files for FLAIR:
python src/flair/flair_pre_process.py
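The pre-processed files can then be read with FLAIR's ColumnCorpus; the two-column token/NER layout and the dev file name below are assumptions:

```python
# Sketch: load the BIO files as a FLAIR corpus.
from flair.datasets import ColumnCorpus

columns = {0: "text", 1: "ner"}  # token in column 0, BIO tag in column 1
corpus = ColumnCorpus(
    "profner/subtask-2/BIO/",
    columns,
    train_file="train_spacy.txt",
    dev_file="valid_spacy.txt",  # hypothetical dev file name
)
print(corpus)
```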
To train the NER tagger:
python src/flair/train_ner_model.py <model>
Arg:
- "base": Spanish FLAIR embeddings
- "twitter": FastText Spanish COVID-19 CBOW uncased embeddings (Download)
- "medium": combination of the previous embeddings
Output in "resources/taggers/"
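As an illustration of the "base" configuration, a pair of stacked Spanish FLAIR embeddings can feed a BiLSTM-CRF SequenceTagger; the hyper-parameters and paths below are illustrative, and the "twitter"/"medium" options would swap in or add the FastText COVID-19 embeddings:

```python
# Hedged sketch: train the "base" NER tagger with FLAIR.
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

corpus = ColumnCorpus("profner/subtask-2/BIO/", {0: "text", 1: "ner"},
                      train_file="train_spacy.txt")

embeddings = StackedEmbeddings([
    FlairEmbeddings("es-forward"),
    FlairEmbeddings("es-backward"),
])

# older FLAIR releases expose make_tag_dictionary; newer ones use make_label_dictionary
tag_dictionary = corpus.make_tag_dictionary(tag_type="ner")

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="ner",
)

ModelTrainer(tagger, corpus).train("resources/taggers/base", max_epochs=10)
```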
To recognize entities in the test set and generate the output file:
python src/flair/predict_ner.py <model>
Arg:
- "base": Spanish FLAIR embeddings
- "twitter": FastText Spanish COVID-19 CBOW uncased embeddings (Download)
- "medium": combination of the previous embeddings
Output TSV file in "/evaluation/flair_subtask_2/"
To determine whether a tweet in the test set contains a mention of an occupation:
python src/flair/flair_classification_tweet.py <model>
Arg:
- "base": Spanish FLAIR embeddings
- "twitter": FastText Spanish COVID-19 CBOW uncased embeddings (Download)
- "medium": combination of the previous embeddings
Output TSV file in "/evaluation/flair_subtask_1/"