Villager is a chatbot built on top of a distilled version of GPT-2 using the HuggingFace transformers library. The bot is trained on Topical-Chat, an open-source dataset made available by Amazon, covering roughly 45,000 back-and-forth conversations. Villager is a step above the average conversational chatbot in that it generates its own replies just by reading what the user says to it. This bot is a step in my thesis research in the fields of NLP and NLG.
Villager implements the part of natural language processing called Natural Language Generation (NLG). As opposed to the conventional method, where the bot looks up pre-stored replies to a set of expected questions in a text file, the bot generates its own replies based on the context of the question asked by the user.
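For contrast, the conventional retrieval approach described above can be sketched in a few lines; this is illustrative only and not part of Villager's code:

#retrieval-style bot (illustrative sketch, not part of this repository)
canned_replies = {
    'hello': 'Hi there!',
    'how are you?': 'I am fine, thank you.',
}

def retrieval_reply(question):
    #fails for anything outside the pre-stored set of questions
    return canned_replies.get(question.lower().strip(), "Sorry, I don't understand.")

Villager has no such lookup table: every reply is sampled from the language model.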
The code is split into 5 scripts:
- dataset_preparation.py
- model_train.ipynb
- run_experiments.sh
- model_generator.py
- villager.py
The first script reads the data gathered from the Topical-Chat repository and converts it into a form readable by our distilled GPT-2 model, written to a file called processsed.txt. The script then reads the intermediate data back from this file and cleans it into a trainable format.
#dataset_preparation.py
import pandas as pd

df = pd.DataFrame()
temp, User, Bot = [], [], []

#read the intermediate file produced from the Topical-Chat data
with open('Dataset/processsed.txt', 'r', encoding = "UTF-8") as f:
    temp = f.readlines()

#pair consecutive agent_1/agent_2 turns as user/bot exchanges;
#x[9:] strips the "agent_1: "/"agent_2: " prefix from each line
for idx, x in enumerate(temp):
    if "agent_1" in x and idx + 1 < len(temp) and "agent_2" in temp[idx+1]:
        User.append(x[9:])
    elif "agent_2" in x and "agent_1" in temp[idx-1]:
        Bot.append(x[9:])

df['User'] = User
df['Bot'] = Bot

#join each exchange into one training line, terminated by GPT-2's end-of-text token
special_token = ' <|endoftext|> '
df['train_param'] = 'User: ' + df.User + 'Bot: ' + df.Bot + special_token

#split into training and validation sets
dataset_train = df[:90000].train_param.values
dataset_val = df[90000:].train_param.values
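These split arrays then need to land in the plain-text files that run_experiments.sh points the trainer at. The snippet above ends with the arrays in memory, so the final writing step presumably looks something like the following sketch (the file paths are taken from run_experiments.sh; the exact format is an assumption):

#assumed final step of dataset_preparation.py: write the splits to disk
with open('Dataset/dataset_train.txt', 'w', encoding = "UTF-8") as f:
    f.write('\n'.join(dataset_train))
with open('Dataset/dataset_val.txt', 'w', encoding = "UTF-8") as f:
    f.write('\n'.join(dataset_val))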
The data is now split into training and validation sets for the bot. Using Google Colab, the next file makes use of the transformers package made available by HuggingFace: it installs the library, loads a distilled version of GPT-2, and fine-tunes it on the dataset we prepared.
#model_train.ipynb
#install the transformers library from the HuggingFace repository;
#the distilled GPT-2 model comes from HuggingFace via this library
!git clone https://github.com/huggingface/transformers
!cd transformers; pip3 install .

#run the bash file that trains the model over 4 epochs, checkpointing every 500 steps
!bash run_experiments.sh
After installation, the run_experiments.sh file is executed, which trains the model over 4 epochs. This script and run_lm_finetuning.py are stock training scripts provided by HuggingFace.
#run_experiments.sh
mkdir experiments

for epoch in 4
do
    python run_lm_finetuning.py \
        --model_name_or_path distilgpt2 \
        --model_type gpt2 \
        --train_data_file /content/Dataset/dataset_train.txt \
        --output_dir experiments/epochs_$epoch \
        --do_train \
        --per_device_train_batch_size 2 \
        --overwrite_output_dir \
        --num_train_epochs $epoch
done
This produces our final trained model, which is stored in Models/epochs_4/. The next file, model_generator.py, loads this model with the transformers package and wraps it in a reply-generation function.
Additionally, the number of epochs can be increased to improve the model, system performance permitting, by editing the loop in run_experiments.sh. The folder Models/20K_steps/ houses another model, trained for 20,000 steps (roughly 8 epochs).
#model_generator.py
from transformers import GPT2LMHeadModel, GPT2Tokenizer

#load the fine-tuned model and its tokenizer from the training output
tokenizer = GPT2Tokenizer.from_pretrained('Models/epochs_4/')
model = GPT2LMHeadModel.from_pretrained('Models/epochs_4/')

def get_reply(user_input):
    #frame the user's message the same way the training lines were framed
    prompt_text = f'User: {user_input}\nBot:'
    encoded_prompt = tokenizer.encode(prompt_text,
                                      add_special_tokens = False,
                                      return_tensors = 'pt')
    #sample candidate replies with top-k / nucleus sampling
    output_sequences = model.generate(
        input_ids = encoded_prompt,
        max_length = 200,
        temperature = 0.9,
        top_k = 20,
        top_p = 0.9,
        repetition_penalty = 1.0,
        do_sample = True,
        num_return_sequences = 4
    )
    #decode the first candidate and trim it to the bot's turn
    #(how the full script picks among the 4 candidates may differ)
    text = tokenizer.decode(output_sequences[0], skip_special_tokens = True)
    return text.split('Bot:', 1)[1].split('User:')[0].strip()
The function get_reply() generates a reply for anything the user says to the bot. The last file, villager.py, hosts the bot on Discord for conversations: it imports the model from model_generator and feeds it user messages through the Discord API. It also appends every conversation between the bot and a user to a log file, Logs.txt.
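The generator can be sanity-checked on its own, without Discord; a minimal sketch, assuming the trained model is in place and model_generator.py is importable (villager.py imports it under the alias Generator):

#quick local test of the generator, no Discord required
import model_generator as Generator

print(Generator.get_reply('what do you do for fun?'))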
#villager.py
#constantly runs, takes messages from users and responds using the trained model
@client.event
async def on_message(message):
    #ignore the bot's own messages so it never replies to itself
    if message.author == client.user:
        return
    msg = message.content.strip()
    #reply to anything that is not a '!' command
    if not msg.startswith('!'):
        reply = Generator.get_reply(msg)
        await message.channel.send(reply)
        #a log of every conversation is appended to Logs.txt
        with open('Logs.txt', 'a', encoding = "UTF-8") as f:
            f.write(f'User: {msg}\nVillager#9524: {reply}\n')
    await client.process_commands(message)
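The handler above assumes a client object already exists; the rest of villager.py's setup is not shown here, but a minimal sketch using discord.py's commands extension looks like the following (the token placeholder and the Generator import alias are assumptions, and the intents lines apply to discord.py 2.x):

#minimal sketch of the surrounding setup in villager.py
import discord
from discord.ext import commands
import model_generator as Generator

#discord.py 2.x requires the message-content intent to read messages
intents = discord.Intents.default()
intents.message_content = True
client = commands.Bot(command_prefix = '!', intents = intents)

#...the on_message handler above goes here...

client.run('YOUR_BOT_TOKEN')  #placeholder; supply your own bot token

A sample conversation with the bot, as recorded in Logs.txt: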
User: have you read the bible?
Villager#9524: No, I have not. I have not read the bible.
User: are you religious?
Villager#9524: I am. Do you follow the NBA?
User: is NBA your religion?
Villager#9524: I don't think so, but it sure does have an interesting history, like the New Zealand team that invented the game!
User: how often do you go to the church?
Villager#9524: Every year I do it's mostly for the sacrament for me to be in and out. It's a really nice thing to be able to have a small congregation.
The bot is named Villager as a gag, after the Minecraft Villagers who only speak gibberish. The model is trained on a tiny, one-dimensional dataset for only 4 epochs, but it serves as a proof of concept that such bots could be put to industry-specific use in the future.
To run the bot locally, the following dependencies and libraries are required:
- TensorFlow
- PyTorch
- transformers (from HuggingFace)
- pandas
- discord (discord.py)
The pytorch_model.bin file containing the trained model isn't in the repository because of its size. A link to download it is given in the Models/epochs_4/model.txt file; to run the bot locally, download the model and save it in that same directory. Alternatively, run the model_train.ipynb file on Google Colab and a downloadable model will be trained in roughly 1.5 hours.