Customer churn, the loss of current customers, is a problem faced by a wide range of companies. When trying to retain customers, it is in a company's best interest to focus its efforts on customers who are more likely to leave, but that requires a way to detect at-risk customers before they have decided to leave. Users prone to churn often leave clues to their disposition in user behavior and customer support chat logs, which can be detected and understood using Natural Language Processing (NLP) tools.
Here, we demonstrate how to build a churn prediction model that leverages both text and structured (numerical and categorical) data, which we call a bi-modal model architecture. We use Amazon SageMaker to prepare, build, and train the model. Detecting customers who are likely to churn is only part of the battle; finding the root cause is essential to actually solving the issue. Since we are interested not only in the likelihood of a customer churning but also in the driving factors, we complement the prediction model with an analysis of feature importance for both text and non-text inputs.
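To make the bi-modal idea concrete, below is a minimal PyTorch sketch of one way to fuse a text encoder with structured features. It is illustrative only: the class name, feature dimensions, and the choice of a [CLS] embedding plus MLP head are our assumptions, not necessarily the exact architecture used in this repo.

import torch
import torch.nn as nn
from transformers import AutoModel

class BiModalChurnModel(nn.Module):
    """Illustrative bi-modal classifier: text encoder + structured features."""

    def __init__(self, text_model_name="bert-base-uncased", n_structured=20, hidden=64):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(text_model_name)
        text_dim = self.encoder.config.hidden_size  # 768 for bert-base
        # The classification head sees the text embedding concatenated
        # with the numerical/categorical features.
        self.head = nn.Sequential(
            nn.Linear(text_dim + n_structured, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # churn / no-churn logits
        )

    def forward(self, input_ids, attention_mask, structured):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = out.last_hidden_state[:, 0]  # [CLS] token embedding
        fused = torch.cat([cls_embedding, structured], dim=1)
        return self.head(fused)

Concatenation at the embedding level is the simplest fusion strategy; the shared head can then weigh evidence from both modalities jointly.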
The categorical and numerical data comes from the Kaggle competition Customer Churn Prediction 2020 and is combined with a synthetic text dataset we created using GPT-2.
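For reference, synthetic chat text can be generated with the Hugging Face text-generation pipeline along the following lines; the prompt and sampling parameters below are placeholders, not the settings used to build the released dataset.

# Sketch of synthetic chat generation with GPT-2 (prompt and parameters
# are illustrative, not those used for the actual dataset).
from transformers import pipeline, set_seed

set_seed(42)  # reproducible sampling
generator = pipeline("text-generation", model="gpt2")
prompt = "Customer: I have been having issues with my service and"
samples = generator(prompt, max_length=60, num_return_sequences=3, do_sample=True)
for s in samples:
    print(s["generated_text"])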
For a detailed walkthrough of the approach, see the accompanying Medium / Towards Data Science blog post.
To get started, clone the repository and set up the environment:
git clone https://github.com/aws-samples/churn-prediction-with-text-and-interpretability.git
conda create -n py39 python=3.9
conda activate py39
cd churn-prediction-with-text-and-interpretability
pip install -r requirements.txt
- Download the categorical/numerical data from Customer Churn Prediction 2020 (a Kaggle account may be required). Download train.csv and store it in the data folder; if you use the Kaggle CLI, see the snippet after this list.
- Run the script that combines the categorical data with the synthetic text data (in ../scripts):
python create_dataset.py
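If you have the Kaggle CLI installed and configured with an API token, the download step can also be scripted. This assumes you have joined the competition on the Kaggle website first; the competition slug and zip file name below are our assumptions based on the competition name.

kaggle competitions download -c customer-churn-prediction-2020 -p data
unzip data/customer-churn-prediction-2020.zip -d data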
An example notebook that runs the entire pipeline and prints/visualizes the results is included in ../notebook.
The Python scripts to prepare the data, train and evaluate the model, and interpret the model are stored in ../scripts. The parameters used for training and interpreting the model are stored in ../model/params.yaml.
- Prepare the data:
python preprocess.py
- Train and evaluate the model:
python train.py
- Interpret the trained model (text):
python interpret.py --churn 1 --speaker Customer
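For the text inputs, token-level importance is commonly estimated with attribution methods such as Integrated Gradients. The sketch below uses Captum's LayerIntegratedGradients on a BERT classifier; it illustrates the general technique, not necessarily the exact method implemented in interpret.py, and the example sentence and class index are placeholders.

# Sketch: token attributions for a BERT churn classifier via Integrated
# Gradients (illustrative; not necessarily what interpret.py does).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from captum.attr import LayerIntegratedGradients

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

def forward_func(input_ids, attention_mask):
    return model(input_ids=input_ids, attention_mask=attention_mask).logits

text = "I am unhappy with the service and want to cancel my contract."
enc = tokenizer(text, return_tensors="pt")
# Baseline input: same length, all [PAD] tokens.
baseline = torch.full_like(enc["input_ids"], tokenizer.pad_token_id)

lig = LayerIntegratedGradients(forward_func, model.bert.embeddings)
attrs = lig.attribute(
    inputs=enc["input_ids"],
    baselines=baseline,
    additional_forward_args=(enc["attention_mask"],),
    target=1,  # index of the "churn" class (placeholder)
)
scores = attrs.sum(dim=-1).squeeze(0)  # collapse the embedding dimension
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"].squeeze(0))
for token, score in zip(tokens, scores.tolist()):
    print(f"{token:>12}  {score:+.3f}")

Tokens with large positive scores push the prediction toward the churn class, which is how phrases like "want to cancel" can surface as drivers.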
Packages:
- See requirements.txt for the full list of dependencies.
Datasets:
- Customer Churn Prediction 2020 (with synthetic text dataset)
Models:
- GPT-2: Radford, Alec; Wu, Jeffrey; Child, Rewon; Luan, David; Amodei, Dario; Sutskever, Ilya. "Language Models are Unsupervised Multitask Learners."
- BERT: Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding."
- Sentence-BERT: Reimers, Nils; Gurevych, Iryna. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks."
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.