
Language-Identification-Tun: a project to predict the language of a text (Arabic / English / French / Tunizi / code-switching)

General Introduction: What is language identification in NLP?

In natural language processing, language identification or language guessing is the problem of determining which natural language given content is in. Computational approaches to this problem view it as a special case of text categorization, solved with various statistical methods.


Project Workflow

We will be using the following workflow:

workflow

1. Data collection

data_collection

a. Scrape Comments from Youtube

This task was done locally in a Python environment, using Selenium with a browser WebDriver.
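A minimal sketch of such a scraper, assuming Chrome and chromedriver are installed; the video URL and the "#content-text" selector are assumptions, and YouTube's markup changes often:

```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
driver.get("https://www.youtube.com/watch?v=VIDEO_ID")  # hypothetical video URL
time.sleep(3)

# Scroll several times so the lazily loaded comments appear.
for _ in range(10):
    driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
    time.sleep(2)

# "#content-text" is the comment-body selector at the time of writing; it may change.
comments = [el.text for el in driver.find_elements(By.CSS_SELECTOR, "#content-text")]
driver.quit()

with open("youtube_comments.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(comments))
```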

b. Collecting Public Dataset

We have 2 different public datasets:

  1. TUNIZI: (dataset in Tunisian Arabizi) https://github.com/chaymafourati/TUNIZI-Sentiment-Analysis-Tunisian-Arabizi-Dataset
  2. TSAC: (mix of Arabic, French, and Arabizi) https://github.com/fbougares/TSAC

c. Annotation

This task was completed in class.

Assign each sentence a label from one of these 5 classes:

  • Arabic: all the letters/words are in Arabic
  • French: all the letters/words are in French
  • English: all the letters/words are in English
  • Tunizi: words are written in the Tunisian Arabizi (Latin chars with numerics)
  • Code-switching: the sentence mixes two or more languages.

2. Data preparation

data_preparation

  • We will merge all data files into one data frame and ensure the type of each column
  • The final data will contain just 2 columns: "text" and "label"
  • The "text" column should be a string, the "label" column should be an integer
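A minimal sketch of this merge step with pandas (the file names are assumptions):

```python
import pandas as pd

# Hypothetical file names: one file per collected source.
files = ["youtube_comments.csv", "tunizi.csv", "tsac.csv"]

# Merge everything into one data frame with only the two expected columns.
df = pd.concat(
    [pd.read_csv(f, usecols=["text", "label"]) for f in files],
    ignore_index=True,
)

# Ensure the column types: "text" as string, "label" as integer.
df["text"] = df["text"].astype(str)
df["label"] = df["label"].astype(int)

df.to_csv("all_data.csv", index=False)
```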

3. Data cleaning

data_cleaning

In this step, we will clean the text in our data file:

  1. Delete duplicate rows and NaN values in the labels column.
  2. Change the type of the data (the text column must be string and the label column must be integer).
  3. Clean the text data from URLs, emojis, punctuation (?,:!..), symbols, newlines and tabs. Example: "To know more about this website: https://Hamza.example.com"
  4. Remove accented characters: é, à, ...
  5. Reduce repeated characters: eyyyyyy (meaning "yes") ==> ey
  6. Remove extra whitespace: "How are you doing ?"
  7. Case conversion: str.lower()
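A minimal sketch of these cleaning steps with pandas and the standard library (the file names are assumptions; digits are deliberately kept because Tunizi uses them as letters):

```python
import re
import unicodedata
import pandas as pd

def clean_text(text: str) -> str:
    text = str(text).lower()                                # case conversion
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)      # URLs
    text = re.sub(r"[\n\t]", " ", text)                     # newlines and tabs
    # Remove accents (é -> e, à -> a) by dropping combining marks; Arabic letters survive.
    text = "".join(c for c in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(c))
    text = re.sub(r"[^\w\s]", " ", text)                    # punctuation, symbols, emojis
    text = re.sub(r"(.)\1{2,}", r"\1", text)                # eyyyyyy -> ey
    return re.sub(r"\s+", " ", text).strip()                # extra whitespace

df = pd.read_csv("all_data.csv")                    # hypothetical merged file
df = df.drop_duplicates().dropna(subset=["label"])  # duplicate rows and NaN labels
df["text"] = df["text"].astype(str).map(clean_text)
df["label"] = df["label"].astype(int)
df.to_csv("clean_data.csv", index=False)
```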

4. Data visualization

By writing some Python code, we can extract some useful information about our data: graph1 graph2 graph3 graph4

  • The data is not balanced: there is a major difference between the English/French classes and Tunizi in terms of the number of comments and the number of unique words.
  • Code-switching/Tunizi differ from the other classes: they have the largest number of unique words, and their max/mean sentence length is considerable.
  • The data must be balanced in terms of the number of words and unique words (vocabulary) across all the text.
  • To estimate how many comments we should add to each class, we will use a new per-class feature alpha (a computation sketch follows below):
  • alpha = (number of unique words / mean_all_comments)
  • number of unique words: the vocabulary size (unique words) of each class
  • mean_all_comments: the mean length of all comments

graph5
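A minimal sketch of how alpha could be computed per class with pandas (assuming the cleaned frame from the previous step):

```python
import pandas as pd

df = pd.read_csv("clean_data.csv")  # hypothetical cleaned dataset

# Mean length (in words) of all comments, over the whole dataset.
mean_all_comments = df["text"].str.split().str.len().mean()

# Number of unique words (vocabulary size) in each class.
vocab_size = df.groupby("label")["text"].apply(
    lambda texts: len(set(" ".join(texts).split()))
)

# alpha = unique words of the class / mean comment length over all comments
alpha = vocab_size / mean_all_comments
print(alpha)
```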

We need a data augmentation action for the English/French/Arabic classes (+ ~2000 comments for Arabic, ~3500 comments for English and French) and for the code-switching class (+ ~2000 comments).

5. Data augmentation

data_augmentation

We are thinking about balancing our data distribution by adding more labeled data to the code-switching/English/French classes. There are several methods:

  1. Using public datasets
  2. Generate text using OpenAi or any others tools
  3. Scrape more data from social media/ blog sites/ journal or magazine
  4. Back translation/ Synonym Replacement/ Random Insertion/ Random Swap / Random Deletion/ Shuffle Sentences Transform using NLPAug Library (https://neptune.ai/blog/data-augmentation-nlp)
  • Our French/English data does not contain a sufficient amount of text to apply the 4th technique
  • Scraping more text would require extra time to annotate and check the data

The best solution is to use public datasets for the English/French/Arabic text and text generation for the code-switching text.


a. Collecting a public dataset: from Hugging Face 🙂

Link: https://huggingface.co/datasets/papluca/language-identification/blob/main/train.csv

  • The data distribution image
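A minimal sketch of loading this dataset with the Hugging Face datasets library and keeping only the languages we need (the "labels"/"text" column names and language codes follow the dataset card, but treat them as assumptions):

```python
from datasets import load_dataset

ds = load_dataset("papluca/language-identification", split="train")

# Keep only the rows whose language code we want to add to our data.
keep = {"en", "fr", "ar"}
subset = ds.filter(lambda row: row["labels"] in keep)
print(subset)
```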

b. Generating data: code-switching text. We will discuss several tools ⚡

  1. OpenAi (https://openai.com/api/)
  2. Using a custom LSTM model to generate text : the training data will be our data from Arabic/French/English text (https://www.analyticsvidhya.com/blog/2018/03/text-generation-using-python-nlp/)

We will try the OpenAI API; a minimal sketch follows below.
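This sketch uses the current OpenAI Python SDK; the model name and the prompt are assumptions, and the generated comments would still need manual checking:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Write 5 short social-media comments that mix Tunisian Arabizi, French and "
    "English in the same sentence (code-switching). One comment per line."
)

# The model name is an assumption; any chat-completion model is called the same way.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)

generated = response.choices[0].message.content.splitlines()
print(generated)
```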

6. Data validation

image

We will validate our dataset before modeling.

  1. Deep data cleaning:

a. Clean text: URLs, emojis, punctuation (?,:!..), symbols, newlines and tabs ... ✊

b. Clean languages: validate the language characters and convert numeric patterns (e.g. 3, 7, 9 in Arabizi) to letters 🛑

c. Stop words: remove or keep? ❎

  • What are stop words? 🤔 The words that are generally filtered out before processing a natural language are called stop words. These are actually the most common words in any language (like articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text.

  • As we mentioned, the stop words form a bag of words specific to each language, and that is an interesting bag of words!

  • So we can't remove them if we want to predict the language of a text.

  • Let's take an example: text = "Hamza is a clever person, mais he is stupid!"

  • This text is a code-switching text (English and French). If we remove the stop words ('mais' is a French stop word), the text becomes purely English! (See the sketch below.)
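A small illustration with NLTK's stop-word lists (a sketch; it assumes the nltk package and its stopwords corpus are installed):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

text = "Hamza is a clever person, mais he is stupid!"
tokens = text.lower().replace(",", "").replace("!", "").split()

fr_en_stopwords = set(stopwords.words("french")) | set(stopwords.words("english"))
kept = [t for t in tokens if t not in fr_en_stopwords]

# 'mais', 'is', 'a' and 'he' are filtered out: the French cue disappears and the
# remaining words ("hamza", "clever", "person", "stupid") look purely English.
print(kept)
```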

  2. Data visualisation 🎨

image

7. Data modeling: build the classifier

image

We will apply machine learning to our data to predict the language of the text.

Our task is multi-class text classification; there are several methods:

  • the old-fashioned Bag-of-Words (with TF-IDF or CountVectorizer)
  • the cutting-edge language models (with BERT).

image

A. Using the old-fashioned Bag-of-Words (with TF-IDF or CountVectorizer) 🧯

  • The text feature extraction methods will be TF-IDF and count vectorization
  • The classifiers will be Stochastic Gradient Descent and Naive Bayes (a pipeline sketch follows below)
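A minimal sketch of one such combination with scikit-learn (the TF-IDF + SGD pipeline; swapping in CountVectorizer or MultinomialNB follows the same pattern, and the file name is an assumption):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("clean_data.csv")  # hypothetical cleaned and augmented dataset
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # word unigrams + bigrams
    ("clf", SGDClassifier(random_state=42)),
])

pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test)))
```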

a. Count vectorizer

i) Naive Bayes

image image image

The F1-score of the 4th class (code-switching) is too low: 0.61, while the rest of the predictions are very good, so our problem is in the code-switching label! Let's look at the other combinations.

ii) Stochastic Gradient Descent

image image image

b. Tf-idf

i) Naive Bayes

image image image

ii) Stochastic Gradient Descent

image image image

Interpretation :

Bag-of-Words gives us a macro-averaged F-score of 0.89, but poor results on the 4th label (recall = 0.77 for this class)

  • Best combination: TF-IDF + Stochastic Gradient Descent

B. Using the cutting edge Language models (with BERT) πŸ™‚

  • In order to complete a text classification task, you can use BERT in 3 different ways:
  1. Train it all from scratch and use it as a classifier.
  2. Extract the word embeddings and use them in an embedding layer (like with Word2Vec).
  3. Fine-tune the pre-trained model (transfer learning).

We will go with the latter and do transfer learning from a pre-trained, lighter version of BERT called DistilBERT (66 million parameters instead of 110 million!).

  • We will then build the deep learning model with transfer learning from the pre-trained BERT. Basically, we will summarize the output of BERT into one vector with average pooling and then add two final dense layers to predict the probability of each language (a model sketch follows below).
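A minimal sketch of that architecture with the Hugging Face transformers library and Keras (the checkpoint name, maximum length and layer sizes are assumptions):

```python
import tensorflow as tf
from transformers import DistilBertTokenizerFast, TFDistilBertModel

MAX_LEN = 64      # assumed maximum sequence length
NUM_CLASSES = 5   # Arabic, French, English, Tunizi, code-switching

checkpoint = "distilbert-base-multilingual-cased"  # assumed multilingual checkpoint
tokenizer = DistilBertTokenizerFast.from_pretrained(checkpoint)
bert = TFDistilBertModel.from_pretrained(checkpoint)

# Two integer inputs: token ids and the attention mask.
input_ids = tf.keras.layers.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.layers.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

# DistilBERT returns one hidden vector per token; average them into one sentence vector.
hidden = bert(input_ids, attention_mask=attention_mask).last_hidden_state
pooled = tf.keras.layers.GlobalAveragePooling1D()(hidden)

# Two final dense layers, the last one giving one probability per language class.
x = tf.keras.layers.Dense(64, activation="relu")(pooled)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# The tokenizer produces the two inputs from raw text:
enc = tokenizer(["3aslema, ça va? all good"], padding="max_length",
                truncation=True, max_length=MAX_LEN, return_tensors="tf")
preds = model.predict([enc["input_ids"], enc["attention_mask"]])
```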

image image image

Interpretation :

BERT gives us a macro-averaged F-score of 0.91, and also a better result on the 4th class (recall = 0.86 for this class)

8. Conclusion

BERT reached an accuracy of 0.93 and a macro-averaged F-score of 0.91.

We can make further improvements, such as using TunBERT (a BERT model pre-trained on the Tunisian language) to get more accurate results, especially between the code-switching and Tunizi classes.