Language modelling on the WikiHow dataset with a decoder-only transformer implemented in PyTorch. Text is read from a static source (a text file), then processed and tokenized for training. After training, the model can be used for inference through a CLI tool, a web app and an HTTP API.
- Data, training and model configuration managed with a single TOML file
- HTTP API (FastAPI) and web application (Streamlit) for testing the model
- Metrics logging with Weights & Biases
- Clone the repository and create a new Python virtual environment,
$> git clone --depth=1 https://github.com/shubham0204/language-modelling-with-pytorch
$> cd language-modelling-with-pytorch
$> python -m venv project_env
$> source project_env/bin/activate
After activating the virtual environment, install Python dependencies,
$> (project_env) pip install -r requirements.txt
- The dataset is included in the `dataset/` directory as a text file. The next step is to execute the `process_data.py` script, which reads the articles from the text file and transforms them into a tokenized corpus of sentences. `process_data.py` requires an input directory (where the text file is) and an output directory to store the tokenized corpus along with the `index2word` and `word2index` mappings, serialized with Python's `pickle`. The input and output directories are specified with `data_path` and `data_tensors_path` respectively in the project's configuration file `project_config.toml`. `data_path` defaults to the `dataset/` directory in the project's root.

$> (project_env) python process_data.py
- After `process_data.py` has executed, the `vocab_size` parameter is updated in `project_config.toml`. We're now ready to train the model. The parameters within the `train` group in `project_config.toml` control the training process. See Understanding `project_config.toml` for more details about the `train` parameters.
The `project_config.toml` file contains all settings required by the project and is read by nearly every script in it. Keeping a global configuration in a single TOML file gives better control over the project: every setting can be viewed and modified in one place.
While developing the project, I wrote a blog post, Managing Deep Learning Models Easily With TOML Configurations. Do check it out.
The following sections describe each setting that can be changed through `project_config.toml`.
`train` parameters (an illustrative snippet follows the list):

- `num_train_iter` (`int`): The number of iterations to be performed on the training dataset. Note that an iteration refers to the forward pass of a single batch of data, followed by a backward pass to update the parameters.
- `num_test_iter` (`int`): The number of iterations to be performed on the test dataset.
- `test_interval` (`int`): The number of iterations after which testing is performed.
- `batch_size` (`int`): Number of samples present in a batch.
- `learning_rate` (`float`): The learning rate used by the `optimizer` in `train.py`.
- `checkpoint_path` (`str`): Path where checkpoints will be saved during training.
- `wandb_logging_enabled` (`bool`): Enable/disable logging to the Weights & Biases console in `train.py`.
- `wandb_project_name` (`str`): If logging is enabled, the name of the `project` to be used for W&B.
- `resume_training` (`bool`): If `True`, a checkpoint will be loaded and training will be resumed.
- `resume_training_checkpoint_path` (`str`): If `resume_training = True`, the checkpoint will be loaded from this path.
- `compile_model` (`bool`): Whether to use `torch.compile` to speed up training in `train.py`.
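A minimal sketch of how this group might look in `project_config.toml`; the keys are the ones described above, but all values here are placeholders, not recommended settings:

```toml
[train]
num_train_iter = 10000                # total training iterations (one batch per iteration)
num_test_iter = 100                   # iterations run on the test dataset
test_interval = 500                   # run testing after every 500 training iterations
batch_size = 32
learning_rate = 0.001
checkpoint_path = "checkpoints/"      # checkpoints are saved here during training
wandb_logging_enabled = false         # enable to log metrics to Weights & Biases
wandb_project_name = "language-modelling-with-pytorch"
resume_training = false
resume_training_checkpoint_path = ""  # used only when resume_training is enabled
compile_model = false                 # pass the model through torch.compile
```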
Data parameters (an illustrative snippet follows the list):

- `vocab_size` (`int`): Number of tokens in the vocabulary. This variable is set when the `process_data.py` script is executed.
- `test_split` (`float`): Fraction of the data used for testing the model.
- `seq_length` (`int`): Context length of the model. Input sequences of length `seq_length` are produced in `train.py` to train the model.
- `data_path` (`str`): Path of the text file containing the articles.
- `data_tensors_path` (`str`): Path of the directory where tensors of the processed data will be stored.
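A sketch of these settings; the `data` group name and all values are illustrative assumptions, except `data_path`, whose `dataset/` default is stated above:

```toml
[data]                               # group name is an assumption
vocab_size = 32000                   # written by process_data.py, not set by hand
test_split = 0.1
seq_length = 128
data_path = "dataset/"               # the stated default
data_tensors_path = "data_tensors/"  # placeholder path
```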
Model parameters (an illustrative snippet follows the list):

- `embedding_dim` (`int`): Dimension of the output embedding for `torch.nn.Embedding`.
- `num_blocks` (`int`): Number of blocks to be used in the transformer model. A single block contains `MultiHeadAttention`, `LayerNorm` and `Linear` layers. See `layers.py`.
- `num_heads_in_block` (`int`): Number of heads used in `MultiHeadAttention`.
- `dropout` (`float`): Dropout rate for the transformer.
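A sketch of the model settings; the `model` group name and the values are illustrative assumptions:

```toml
[model]                  # group name is an assumption
embedding_dim = 256      # output size of torch.nn.Embedding
num_blocks = 4           # transformer blocks, see layers.py
num_heads_in_block = 8   # heads per MultiHeadAttention
dropout = 0.1
```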
Deployment parameters (an illustrative snippet follows the list):

- `host` (`str`): Host IP used to deploy the API endpoints.
- `port` (`int`): Port through which the API endpoints will be exposed.
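A sketch of the deployment settings; the `deploy` group name is an assumption, and the values simply mirror the `curl` example further below:

```toml
[deploy]            # group name is an assumption
host = "127.0.0.1"  # host IP for the API endpoints
port = 8000         # port the API is exposed on
```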
The trained ML model can be deployed in two ways,
The model can be used easily with a Streamlit app,
$> (project_env) streamlit run app.py
In the app, we need to select a model for inference, the `data_tensors_path` (required for tokenization) and other parameters like the number of words to generate and the temperature.
The model can be accessed through a REST API built with FastAPI,
$> (project_env) uvicorn api:server
A `GET` request to the `/predict` endpoint, with query parameters `prompt`, `num_tokens` and `temperature`, generates a response from the model,
curl --location 'http://127.0.0.1:8000/predict?prompt=prompt_text_here&temperature=1.0&num_tokens=100'