This repository contains a Spanish-English translator based on a Transformer model. I trained it on Google Cloud with 114k examples and a customized training loop.
A visualization of the Transformer architecture. Source: nature.com
It is not a notebook but a modular object-oriented project with a simple TUI that allows interaction with the translator.
The translator at inference time. Source: Own
This project is a practical exercise to increase my knowledge and technical skills in NLP. For this reason, if you want to use it, keep in mind that it is not perfect: its purpose is learning.
You can read the article related to this repository here. In the article, I detail the challenges I faced when coding the translator, delve into crucial concepts of the Transformer architecture, and share some practical advice for those in the same boat as me.
Enjoy!
It replicates the standard Transformer architecture, as shown in the image below:
From the image:
- The TUI loop invokes the training process through the `train` method. If it detects a saved model, it will load it for inference. Otherwise, it will read the configuration values (`tbset.ini`) and instantiate a `Trainer` object.
- The `Trainer` object instantiates a `Transformer`, which creates the architecture with an Encoder, Decoder, and Dense layer. It also implements a custom training loop. The training process (invoked with `trainer.train()`) includes the following actions:
  - Dataset workout: a pipeline that downloads and prepares the dataset, creates batches, tokenizes each batch (if vocabularies are missing, it will create them using a `BertTokenizer`), and prefetches the dataset (a pipeline sketch appears further below, next to the dataset description).
  - Instantiate the Transformer model and set up the optimizer.
  - Restore any previous checkpoint, if one exists.
  - Run the custom loop to calculate the gradients from the predictions and update the model's parameters (see the training-step sketch after this list).
  - Save the trained model for later use.
- Each `call` to the Transformer object will (see the forward-pass sketch after this list):
  - Create the padding and look-ahead masks for the current source and target language training batches.
  - Calculate the Encoder's output for the current source language training batch.
  - Calculate the Decoder's output from the current target language training batch and the Encoder's output.
  - Return as a prediction the output of the Dense layer applied to the Decoder's output.
- The Encoder is composed of a stack of `EncoderLayer` objects. These layers perform the Multi-Headed Attention calculations for the current input and pass the results to the following layer.
- The Decoder is composed of a stack of `DecoderLayer` objects. These layers have three sub-blocks:
  - The first one attends to the current target language training batch or to the output of the previous Decoder layer.
  - The second one attends to the output of the first sub-block plus the output of the Encoder's last layer.
  - The third one is an FFN that processes the output of the second sub-block.
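To make the forward pass concrete, here is a minimal sketch of what a `call` to the Transformer could look like. It is illustrative only: the mask helpers, constructor arguments, and layer call signatures are assumptions based on the description above and the standard TensorFlow Transformer tutorial, not the exact code in this repository.

```python
import tensorflow as tf

def create_padding_mask(seq):
    # 1 where the token id is 0 (padding); broadcastable to the attention logits.
    mask = tf.cast(tf.math.equal(seq, 0), tf.float32)
    return mask[:, tf.newaxis, tf.newaxis, :]                # (batch, 1, 1, seq_len)

def create_look_ahead_mask(size):
    # Upper-triangular matrix of ones that hides future target tokens.
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

class Transformer(tf.keras.Model):
    def __init__(self, encoder, decoder, target_vocab_size):
        super().__init__()
        self.encoder = encoder                               # stack of EncoderLayer objects
        self.decoder = decoder                               # stack of DecoderLayer objects
        self.final_layer = tf.keras.layers.Dense(target_vocab_size)

    def call(self, inp, tar, training=False):
        # 1) Masks for the current source/target training batches.
        enc_padding_mask = create_padding_mask(inp)
        dec_padding_mask = create_padding_mask(inp)
        look_ahead_mask = tf.maximum(create_padding_mask(tar),
                                     create_look_ahead_mask(tf.shape(tar)[1]))

        # 2) Encoder output for the source batch.
        enc_output = self.encoder(inp, training, enc_padding_mask)

        # 3) Decoder output from the target batch and the Encoder output.
        dec_output = self.decoder(tar, enc_output, training,
                                  look_ahead_mask, dec_padding_mask)

        # 4) The final Dense layer turns the Decoder output into the prediction logits.
        return self.final_layer(dec_output)
```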
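Zooming in on one of those layers, below is a minimal `DecoderLayer` sketch built on `tf.keras.layers.MultiHeadAttention`, with defaults matching the hyperparameters listed later in this README. The repository's actual layers may be wired differently; in particular, for brevity the look-ahead mask is built inside the layer here (using Keras's convention, where 1 means the position may be attended), whereas the project creates the masks at the Transformer level as described above.

```python
import tensorflow as tf

class DecoderLayer(tf.keras.layers.Layer):
    """Illustrative layer with the three sub-blocks described above."""

    def __init__(self, d_model=128, num_heads=8, dff=512, dropout_rate=0.1):
        super().__init__()
        self.self_attn = tf.keras.layers.MultiHeadAttention(num_heads, d_model // num_heads)
        self.cross_attn = tf.keras.layers.MultiHeadAttention(num_heads, d_model // num_heads)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation="relu"),
            tf.keras.layers.Dense(d_model),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout = tf.keras.layers.Dropout(dropout_rate)

    def call(self, x, enc_output, training=False):
        seq_len = tf.shape(x)[1]
        # Keras MultiHeadAttention treats 1 as "may attend", so the look-ahead
        # mask is the lower-triangular part of a matrix of ones.
        look_ahead = tf.linalg.band_part(tf.ones((1, seq_len, seq_len)), -1, 0)

        # 1) Attend to the target batch (or the previous Decoder layer's output).
        attn1 = self.self_attn(query=x, value=x, attention_mask=look_ahead)
        out1 = self.norm1(x + self.dropout(attn1, training=training))

        # 2) Attend to the Encoder's last-layer output.
        attn2 = self.cross_attn(query=out1, value=enc_output)
        out2 = self.norm2(out1 + self.dropout(attn2, training=training))

        # 3) Point-wise FFN over the second sub-block's output.
        return self.norm3(out2 + self.dropout(self.ffn(out2), training=training))
```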
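Finally, here is a minimal sketch of the custom training step, assuming a masked cross-entropy loss and the `Transformer` call signature sketched above. Again, names and signatures are assumptions for illustration, not the repository's exact code.

```python
import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction="none")

def loss_function(real, pred):
    # Ignore padding positions (token id 0) when averaging the loss.
    mask = tf.cast(tf.math.not_equal(real, 0), pred.dtype)
    loss = loss_object(real, pred) * mask
    return tf.reduce_sum(loss) / tf.reduce_sum(mask)

@tf.function
def train_step(transformer, optimizer, inp, tar):
    # Teacher forcing: the Decoder sees the target shifted right,
    # and the loss compares its predictions against the target shifted left.
    tar_inp, tar_real = tar[:, :-1], tar[:, 1:]
    with tf.GradientTape() as tape:
        predictions = transformer(inp, tar_inp, training=True)
        loss = loss_function(tar_real, predictions)
    gradients = tape.gradient(loss, transformer.trainable_variables)
    optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))
    return loss
```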
The `tbset.ini` configuration file manages all hyperparameters and important variable values. These are the values I used to train the translator:
```ini
[TRN_HYPERP]
num_layers = 4
d_model = 128
num_heads = 8
dff = 512
dropout_rate = 0.1
ckpt_path = tbset/local.multivac/checkpoints
save_path = tbset/local.multivac/saved_model
epochs = 400

[DATASET_HYPERP]
dwn_destination = tbset/local.multivac/dataset
vocab_path = tbset/local.multivac/dataset
buffer_size = 2000
batch_size = 64
vocab_size = 8000
num_examples = 114000
```
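As an illustration of how these values can be consumed, here is a short sketch using Python's standard `configparser`; the repository may load the file differently, so treat this as an assumption rather than its actual loading code.

```python
import configparser

config = configparser.ConfigParser()
config.read("tbset.ini")

# Section and key names come from the file above; casting is up to the caller.
num_layers = config["TRN_HYPERP"].getint("num_layers")
d_model = config["TRN_HYPERP"].getint("d_model")
dropout_rate = config["TRN_HYPERP"].getfloat("dropout_rate")
batch_size = config["DATASET_HYPERP"].getint("batch_size")

print(num_layers, d_model, dropout_rate, batch_size)   # 4 128 0.1 64
```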
These parameters produce a smaller model than the base model from Vaswani et al. (which uses 6 layers, d_model = 512, and dff = 2048):
> We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models (described on the bottom line of table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps (3.5 days).
I chose the OPUS dataset, which is available in the TensorFlow Datasets catalog. It contains a collection of translated texts from the web, with both formal and informal examples. My first idea was to merge two of its corpora, Books and Subtitles. Sadly, the Books corpus is not yet available in the TensorFlow catalog.
For this reason, I trained the model with the Subtitles corpus only, which produced acceptable results given the particularities of this type of content.
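For reference, here is a minimal sketch of the kind of "dataset workout" pipeline described earlier: batching, tokenizing with a `BertTokenizer` from `tensorflow-text`, and prefetching. The vocabulary file paths and the in-memory sentence pairs are hypothetical placeholders; the actual project downloads the OPUS Subtitles corpus and builds the vocabularies when they are missing.

```python
import tensorflow as tf
import tensorflow_text as text

# Hypothetical wordpiece vocabulary files (the project builds them if missing).
es_tokenizer = text.BertTokenizer("es_vocab.txt", lower_case=True)
en_tokenizer = text.BertTokenizer("en_vocab.txt", lower_case=True)

def tokenize_pairs(es, en):
    # BertTokenizer returns a RaggedTensor (batch, words, wordpieces);
    # merge the last two axes and pad to a dense batch.
    es = es_tokenizer.tokenize(es).merge_dims(-2, -1).to_tensor()
    en = en_tokenizer.tokenize(en).merge_dims(-2, -1).to_tensor()
    return es, en

# Toy in-memory pairs standing in for the downloaded OPUS examples.
pairs = (["hola mundo", "¿cómo estás?"], ["hello world", "how are you?"])
dataset = (tf.data.Dataset.from_tensor_slices(pairs)
           .shuffle(2000)              # buffer_size from tbset.ini
           .batch(64)                  # batch_size from tbset.ini
           .map(tokenize_pairs, num_parallel_calls=tf.data.AUTOTUNE)
           .prefetch(tf.data.AUTOTUNE))
```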
I trained the model in a VM from Google Cloud. The technical characteristics of this VM were:
- CPUs: 4
- Memory: 15 GB
- GPU: One Tesla T4
The training time was almost nine hours (400 epochs and 114K training examples). The final loss was 1.1077 with an accuracy of 0.7303.
Coding this project with a modular, object-oriented mindset was challenging. For this reason, and because of quota limitations, I discarded Kaggle and Colab. I also wanted a real business-like experience, so I launched a VM on GCP with a preloaded TensorFlow 2.6 image, added the additional dependencies, and connected my IDE to this remote interpreter, deploying my code and training the model on the cloud.
This repository contains the model's parameters after training, so it is easy to use the translator for inference:
- Clone the repository
- Install the dependencies
- Run the module with `$ python -m tbset`
- There are two commands available. `train` will train the model from scratch. If a saved model is detected, it won't be possible to continue unless you delete the related files from the disk, including the stored checkpoints. The `translate` command will use a saved model for inference (see the illustrative decoding sketch after this list).
- To exit the TUI, press Ctrl-D
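As a rough idea of what inference involves under the hood, below is a generic greedy-decoding sketch for a sequence-to-sequence Transformer. It is not the repository's `translate` implementation: the tokenizer objects, the `transformer` callable, and the start/end token ids are assumptions for illustration only.

```python
import tensorflow as tf

def greedy_translate(transformer, es_tokenizer, en_tokenizer,
                     sentence, start_id, end_id, max_length=40):
    # Tokenize the Spanish sentence into a (1, seq_len) batch of ids.
    encoder_input = es_tokenizer.tokenize([sentence]).merge_dims(-2, -1).to_tensor()

    # Start the English output with the start id and grow it token by token.
    output = tf.constant([[start_id]], dtype=tf.int64)
    for _ in range(max_length):
        predictions = transformer(encoder_input, output, training=False)
        next_id = tf.argmax(predictions[:, -1:, :], axis=-1)   # last-position logits
        output = tf.concat([output, next_id], axis=-1)
        if int(next_id) == end_id:
            break

    # Convert ids back to words (dropping the leading start id) and join them.
    words = en_tokenizer.detokenize(output[:, 1:])
    return tf.strings.reduce_join(words, separator=" ", axis=-1)
```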
Dependency-wise, I have found it tricky to use `tensorflow-text`. I'd recommend starting with the desired version of `tensorflow-text`: it will install the appropriate version of `tensorflow`, which it requires to work. If you are using a preinstalled VM, be sure the `tensorflow-text` version matches the `tensorflow` version.

As I instantiated a `tensorflow` 2.6.0 VM, these are the dependencies I added:
- tensorflow-text 2.6.0
- tensorflow-datasets 4.3.0
- prompt-toolkit 3.0.20
You can check this repository or read some of my blog posts. Have fun! :)