Code for MVNO Purification and Optimization

Introduction

This repository contains our code for the technical optimization of a commercial MVNO (Xiaomi Mobile) in three key aspects: data usage prediction, customer churn prediction, and fraud detection. It also includes a small sample dataset, which Xiaomi Mobile has allowed us to disclose after negotiation. The code has been adjusted for the sample dataset, so everything in this repo can be run on the samples.

Codebase Organization

The entire codebase is available in our GitHub repo.

Sample Dataset

The sample dataset covers 100 active customers randomly selected from the Xiaomi Mobile database in July 2020. The attributes fall into three categories: historical data usage, characteristics of churned users, and features of scammers, as shown in Data.

Dataset samples-data-usage contains the historical data usage (up to 18 months) of the customers:

Attribute	Description
`id`	Unique ID generated to identify a user (cannot be related to the user's true identity)
`month_i`	Data usage of a user (in MB) in the i-th month, i ∈ [1,18]

Dataset samples-churn-features contains the 8-dimensional features obtained through feature correlation analysis (a few samples in this dataset lack some features). Age is discretized in intervals of 10 years: for example, 20 means 20 to 30 years old, and the largest discrete value, 60, means more than 60 years old:

Attribute	Description
`id`	Unique ID generated to identify a user (cannot be related to the user's true identity)
`account_balance`	Account balance (in RMB) of a user in the current month
`data_plan`	Monthly data (in GB) of a user's data plan
`last_month_expense`	Expense (in RMB) of a user in the last month
`lifetime`	Lifetime (in months) of a user
`age`	Discretized age of a user
`avg_delta_expense`	Average delta of the monthly expenses (in RMB) of a user
`gender`	Gender of a user
`expense_variance`	Variance of the monthly expenses (in RMB) of a user
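
The age discretization described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `discretize_age` is hypothetical, and it assumes half-open decade bins (so 20 covers ages 20 through 29) with everything from 60 upward capped at 60. The source does not say how ages below 20 are binned, so the sketch simply floors them to their decade as well.

```python
def discretize_age(age: int) -> int:
    """Map a raw age to its decade bin, capped at 60 (assumed binning rule)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # Floor to the start of the decade, then cap at the largest bin.
    return min((age // 10) * 10, 60)


for raw_age in (27, 30, 59, 60, 83):
    print(raw_age, "->", discretize_age(raw_age))
```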

Dataset samples-scammer-features contains the 9-dimensional features extracted through risk-return analysis of scammers combined with feature correlation analysis:

Attribute	Description
`id`	Unique ID generated to identify a user (cannot be related to the user's true identity)
`call_count_per_day`	Average number of outgoing calls per day of a user
`avg_duration`	Average duration (in seconds) of outgoing calls of a user
`number_count`	Number of phone numbers in outgoing calls of a user
`rate`	Average call frequency per called phone number for outgoing calls of a user
`day_gap`	Interval (in days) between first bill time and activation time of a user
`first_cellId_percent`	Usage ratio (%) of the most involved BS for outgoing calls of a user
`city_count`	Number of cities in outgoing calls of a user
`first_city_percent`	Usage ratio (%) of the most called city for outgoing calls of a user
`province_count`	Number of provinces in outgoing calls of a user

Developers can train models with the code for Data Usage Prediction, Customer Churn Prediction, and Fraud Detection using the sample dataset. However, performance is limited compared with the full dataset because the sample dataset contains so few samples.

Data Usage Prediction

We formulate data usage prediction as a supervised time series forecasting problem. We employ a robust statistical method (Grubbs' test) and neighbor mean interpolation to complement the ML-based method (RandomForest). The proposed approach achieves an average prediction accuracy of 93.3% on the full dataset.
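
The pipeline described above can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the code in this repository: the helper names (`grubbs_outlier_index`, `neighbor_mean_fill`, `make_windows`), the significance level of 0.05, the 3-month lag window, and the usage values are all hypothetical. It flags outliers with a two-sided Grubbs' test, replaces each flagged month with the mean of its neighbors, turns the cleaned series into lagged windows, and fits scikit-learn's `RandomForestRegressor`.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestRegressor


def grubbs_outlier_index(x, alpha=0.05):
    """Index of the most extreme value if the two-sided Grubbs' test rejects, else None."""
    x = np.asarray(x, dtype=float)
    n = x.size
    if n < 3:
        return None
    mean, std = x.mean(), x.std(ddof=1)
    if std == 0:
        return None
    idx = int(np.argmax(np.abs(x - mean)))
    g = abs(x[idx] - mean) / std
    # Critical value from the t distribution with n - 2 degrees of freedom.
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return idx if g > g_crit else None


def neighbor_mean_fill(x, idx):
    """Return a copy of x with x[idx] replaced by the mean of its adjacent values."""
    x = np.asarray(x, dtype=float).copy()
    neighbours = [x[j] for j in (idx - 1, idx + 1) if 0 <= j < x.size]
    x[idx] = float(np.mean(neighbours))
    return x


def make_windows(series, n_lags=3):
    """Turn a 1-D series into (previous n_lags months, next month) pairs."""
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = np.array(series[n_lags:])
    return X, y


# Made-up monthly usage (MB) with one abnormal spike in the fifth month.
usage = np.array([500, 520, 510, 530, 5000, 515, 525, 505, 540, 535, 520, 545], dtype=float)

# Grubbs' test finds one outlier at a time, so repeat it a bounded number of times.
cleaned = usage.copy()
for _ in range(len(usage)):
    idx = grubbs_outlier_index(cleaned)
    if idx is None:
        break
    cleaned = neighbor_mean_fill(cleaned, idx)

X, y = make_windows(cleaned, n_lags=3)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Forecast the next month from the three most recent months.
next_month = float(model.predict(cleaned[-3:].reshape(1, -1))[0])
print(cleaned, next_month)
```

With so few months per user, a sketch like this only shows the data flow; the accuracy figure quoted above refers to the authors' method on the full dataset.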

Customer Churn Prediction

We use the extracted features to proactively predict customer churn with a RandomForest classifier. Additionally, we employ an under-sampling method called One-Sided Selection to mitigate the negative impact of imbalanced positive and negative samples in the full dataset. With the full dataset, we achieve both high precision (96.3%) and high recall (97.8%).

Fraud Detection

Because this task is also affected by class imbalance in our dataset, we adopt a resampling-based method named SMOTETomek combined with BalancedRandomForestClassifier, which achieves a precision of 90% and a recall of 92.4% on the full dataset.

Platform Requirements

python==3.8

numpy>=1.20.0

scikit-learn>=0.24.0

imbalanced-learn>=0.7.0
