EasyNLP is a toolkit that enables easy text preprocessing, feature extraction, and neural model building to help you speed up your Natural Language Processing (NLP) research.
The project is structured as follows:
- Feature Extractions : This package contains feature extraction methods
- Models : This package contains different ready-to-use model architectures
- preprocessing : This folder contains text preprocessing methods
- Scrappping : This folder contains scraping methods
- Training : This folder contains directly usable training loops and methods
- LM TRAINING : This folder contains a built-in script you can call to adapt BERT base models to your dataset
- Examples : This folder contains examples of how we used the pipeline for crisis classification and tweet stigmatisation
- DataVisualisation : In this folder we show examples of how we used data visualisation and transformation to improve our understanding of the problem
.
├── ...
├── EasyNLP
│ ├── Feature Extractions
│ ├── Models
│ ├── preprocessing
│ ├── Scrappping
│ ├── Training
│ ├── LM TRAINING
│ ├── Examples
│   └── DataVisualisation
├── setup.py
└── ...
This project has a main pipeline with several sideline pieces of analysis. The steps of the main flow are:
- Scraping data : You can either scrape tweets or enhance existing tweet features from just their tweet IDs, using feature_extraction
>>> import datetime
>>> from easy_nlp.scrapping.twitter_scrap import Scrapper
>>> scrapper = Scrapper()
>>> keywords = ['innondation', 'occitanie', 'crises']
>>> lang = ['fr']
>>> begindate = datetime.datetime(2020, 5, 17)
>>> enddate = datetime.datetime.now()
>>> limit = 150
>>> scrapper.get_tweets_df(keywords, lang, begindate, enddate, limit)
- Preprocessing : Preprocess your data using the preprocessing module
>>> # df is a pandas DataFrame and "TEXT" is the column holding the raw text
>>> from easy_nlp.preprocessing import TextPreprocessing
>>> text_preprocessing = TextPreprocessing(df, "TEXT")
>>> text_preprocessing.fit_transform()
- Feature extraction : Extract features from your tweets using the feature_extraction module
>>> from easy_nlp.feature_extraction import FeaturesExtraction
>>> features_extractor = FeaturesExtraction(df, "TEXT")
>>> features_extractor.fit_transform()
The package also lets you tokenize texts as BERT input; BERT models require attention masks and tokens encoded as IDs.
>>> from easy_nlp.feature_extraction import BertInput
>>> from transformers import FlaubertTokenizer
>>> tokenizer = FlaubertTokenizer.from_pretrained('flaubert-base-cased')
>>> bert_input = BertInput(tokenizer)
>>> X_train = bert_input.fit_transform(sentences_train)
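For reference, the sketch below shows, directly with the Hugging Face tokenizer, roughly what such a BERT input consists of. The encode_plus call is standard transformers API rather than part of EasyNLP, and the example sentence is only an illustration:
>>> # Hedged illustration: token IDs and attention mask produced for one sentence
>>> enc = tokenizer.encode_plus("Inondation en Occitanie ce matin",
                            max_length=128,
                            pad_to_max_length=True)
>>> input_ids = enc["input_ids"]            # tokens encoded as IDs
>>> attention_mask = enc["attention_mask"]  # 1 for real tokens, 0 for padding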
- Training : Train your models using our training loops on our predefined models. We enhanced BERT models using several techniques:
- BERT + LSTM
- BERT + Custom Sampling
- BERT + CNN
- BERT + ADAPTED ON LM
- BERT + FEATURES
- BERT + CROSS LINGUAL TRAINING
>>> import torch
>>> from transformers import FlaubertModel
>>> from easy_nlp.models import BasicBertForClassification
>>> # n_class, train_dataloader, validation_dataloader and criterion are assumed
>>> # to be defined beforehand for your dataset
>>> base_model = FlaubertModel.from_pretrained('flaubert-base-cased')
>>> model = BasicBertForClassification(base_model, n_class)
>>> from transformers import AdamW, get_linear_schedule_with_warmup
>>> from easy_nlp.training import fit
>>> optimizer = AdamW(model.parameters(),
                  lr = 1e-5, # args.learning_rate - default is 5e-5, our notebook had 2e-5
                  eps = 1e-8 # args.adam_epsilon - default is 1e-8.
                )
>>> # Number of training epochs (the BERT authors recommend between 2 and 4)
>>> epochs = 5
>>> # Total number of training steps is number of batches * number of epochs.
>>> total_steps = len(train_dataloader) * epochs
>>> # Create the learning rate scheduler.
>>> scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)
>>> fit(model, train_dataloader, validation_dataloader, epochs, torch.device('cuda'), optimizer, scheduler, criterion)
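To illustrate the idea behind one of these variants, here is a minimal sketch of a BERT + FEATURES style head written in plain PyTorch. This is not the EasyNLP implementation; the class name, constructor arguments, and hidden size are assumptions for illustration only. It concatenates the first-token ([CLS]) representation with handcrafted features (e.g. those produced by FeaturesExtraction) before the final classification layer:

import torch
import torch.nn as nn

class BertWithFeatures(nn.Module):
    """Hypothetical sketch: concatenate the BERT [CLS] vector with extra features."""

    def __init__(self, bert, n_features, n_class, hidden_size=768):
        super().__init__()
        self.bert = bert
        self.classifier = nn.Linear(hidden_size + n_features, n_class)

    def forward(self, input_ids, attention_mask, features):
        # FlauBERT/BERT models return the sequence of hidden states as the first output
        hidden_states = self.bert(input_ids, attention_mask=attention_mask)[0]
        cls_vector = hidden_states[:, 0, :]               # representation of the first token
        combined = torch.cat([cls_vector, features], dim=1)
        return self.classifier(combined)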
As side projects we have:
- The training of our own BERT_ADAPTED_LM (Code); a hedged sketch of this language-model adaptation follows this list
- The visualisation of training, using tensorboardX
- Training of ULMFiT using fastai
- Exploring LASER with synthetic data augmentation.
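For the BERT_ADAPTED_LM side project, the sketch below shows one common way to adapt a FlauBERT model to a tweet corpus with masked-language-model fine-tuning, using the Hugging Face Trainer API. This is not the repository's LM TRAINING script; the file path, output directory, and hyperparameters are placeholders:

from transformers import (FlaubertTokenizer, FlaubertWithLMHeadModel,
                          LineByLineTextDataset, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = FlaubertTokenizer.from_pretrained('flaubert-base-cased')
model = FlaubertWithLMHeadModel.from_pretrained('flaubert-base-cased')

# Placeholder path: a plain-text file with one raw tweet per line
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path='tweets.txt',
                                block_size=128)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                mlm=True,
                                                mlm_probability=0.15)
training_args = TrainingArguments(output_dir='./bert_adapted_lm',
                                  num_train_epochs=3,
                                  per_device_train_batch_size=16)
trainer = Trainer(model=model,
                  args=training_args,
                  data_collator=data_collator,
                  train_dataset=dataset)
trainer.train()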
$ git clone https://github.com/Moumeneb1/IRIT_INTERNSHIP.git
$ pip install IRIT_INTERNSHIP/
This library is intended for NLP enthusiasts who want to start working on classification problems quickly without struggling with common NLP pipeline modules. These out-of-the-box pipeline modules will help you build a baseline solution quickly, which you can then improve with proper fine-tuning.
You can look at the Examples folder to see how to use it.
If you want to use LM training, we advise you to install NVIDIA Apex (for amp mixed-precision support).
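One common way to install it is the Python-only build from the NVIDIA repository (adapt the flags to your CUDA setup):
$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ pip install -v --no-cache-dir ./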
If you want to use GloVe and fastText embeddings, we invite you to use the package on nlpcrisis.
The pipeline I developed was used to improve results with state-of-the-art models on crisis tweet classification. We now have a paper showing these improvements; if you use this library, please cite it 🤗:
@article{Kozlowski-et-al2020,
title = "A three-level classification of French tweets in ecological crises",
journal = "Information Processing & Management",
volume = "57",
number = "5",
pages = "102284",
year = "2020",
issn = "0306-4573",
doi = "https://doi.org/10.1016/j.ipm.2020.102284",
url = "http://www.sciencedirect.com/science/article/pii/S0306457320300650",
author = "Diego Kozlowski and Elisa Lannelongue and Frédéric Saudemont and Farah Benamara and Alda Mari and Véronique Moriceau and Abdelmoumene Boumadane",
keywords = "Crisis response from social media, Machine learning, Natural language processing, Transfer learning",
}