
Description:
Customer retention is a key priority for any business. Multiple factors drive customer churn, and understanding these factors enables proactive churn management. A combination of data processing and statistics can help uncover the likely reasons for churn and identify customers at risk.
For this example, we use the telecom data set from the IBM community website (https://community.watsonanalytics.com/resources/). Customers go through a complex decision-making process before subscribing to any one of the numerous telecom services.
The goal of the predictive model in this post is to identify the set of customers who have a high probability of unsubscribing from the service. For this model, we use personal details, demographic information, pricing, and plan information. We will also identify the set of independent variables that are related to a customer unsubscribing from the service.
Data Description:
- The dataset has 7,043 rows with 21 features.
- Independent variables considered for this exercise:
  - Customer demographics (age, gender, marital status, location, etc.)
  - Billing information (monthly and yearly payment)
  - Voice and data services (phone service, multiple lines, internet service, online security, device protection, tech support)
  - Contract type
  - Bill payment mode
- Response/dependent variable considered for the model:
  - Value ‘1’ indicates UNSUBSCRIBED customers
  - Value ‘0’ indicates ACTIVE customers
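A minimal sketch of loading the data and encoding the response with pandas follows; the CSV file name and the "Yes"/"No" encoding of the Churn column are assumptions based on the public IBM data set, and the notebook may differ:

import pandas as pd

# Load the IBM telecom churn data set (file name is an assumption)
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

# Encode the response: 1 = UNSUBSCRIBED (churned), 0 = ACTIVE
# (assumes the raw Churn column holds "Yes"/"No" strings)
df["Churn"] = (df["Churn"] == "Yes").astype(int)

print(df.shape)  # expected: (7043, 21)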
Source code:
The source code for this exercise is available at https://github.com/Innovyt/Machine-Learning-Model/blob/master/Telecom%20Customer%20Churn%20Analysis.ipynb
Predictive Model:
Logistic Regression:
For this exercise, we use the logistic regression algorithm. Logistic regression is useful for establishing a relationship between a binary outcome and a group of continuous and/or categorical predictor variables. Pseudo-R² measures can also be used to gauge how much of the variation in the dependent variable is explained by the independent variables.
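Under the hood, logistic regression passes a linear combination of the predictors through the logistic (sigmoid) function to produce a probability. A minimal sketch of that mapping (the weights w and intercept b here are illustrative, not values from the fitted model):

import numpy as np

def sigmoid(z):
    # Logistic function: maps any real-valued score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Churn probability for a feature vector x with weights w and intercept b:
# p = sigmoid(np.dot(w, x) + b); predict class 1 (churn) when p >= 0.5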
Training the model:

from sklearn.linear_model import LogisticRegression

# Fit logistic regression on the training split
lr = LogisticRegression()
lr.fit(X_train, y_train)
# Predict churn labels for the test split
y_pred = lr.predict(X_test)
# For a classifier, score() returns the mean accuracy on the test set
accuracy = lr.score(X_test, y_test)
print("accuracy is {}".format(accuracy))
ROC: evaluates the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity). The higher the area under the curve (AUC), the better the predictive power of the model.
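A hedged sketch of computing the AUC from the model's predicted probabilities with scikit-learn (this mirrors, but is not necessarily identical to, what the notebook does):

from sklearn.metrics import roc_auc_score, roc_curve

# Predicted probability of the positive class (churn = 1)
y_prob = lr.predict_proba(X_test)[:, 1]

# Area under the ROC curve; values closer to 1.0 indicate better discrimination
auc = roc_auc_score(y_test, y_prob)
print("AUC is {:.3f}".format(auc))

# Points for plotting the ROC curve, if desired
fpr, tpr, thresholds = roc_curve(y_test, y_prob)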
Other evaluation metrics reported on the test set:
- mean_squared_error is 0.345635202271
- mean_absolute_error is 0.345635202271
- explained_variance_score is -0.728474919601
- r2 score is -0.791188969636
- jaccard similarity is 0.654364797729
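These figures are the output of scikit-learn metric functions applied to the 0/1 test labels and predictions. A hedged sketch of computing them (exact function names vary across scikit-learn versions; jaccard_score was jaccard_similarity_score in older releases):

from sklearn import metrics

# Regression-style metrics computed on 0/1 labels, reported for reference
print("mean_squared_error is {}".format(metrics.mean_squared_error(y_test, y_pred)))
print("mean_absolute_error is {}".format(metrics.mean_absolute_error(y_test, y_pred)))
print("explained_variance_score is {}".format(metrics.explained_variance_score(y_test, y_pred)))
print("r2 score is {}".format(metrics.r2_score(y_test, y_pred)))
# In older scikit-learn this was jaccard_similarity_score
print("jaccard similarity is {}".format(metrics.jaccard_score(y_test, y_pred)))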
Partitioning the Data:
Training: 70% of the data is used to train the model.
Testing: 30% of the data is used to test the model.
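A minimal sketch of this 70/30 split, assuming the feature matrix X and target vector y have already been prepared (the random_state value is an assumption, included for reproducibility):

from sklearn.model_selection import train_test_split

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)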
Prediction Accuracy & Model Selection:
It is common to try multiple models. We chose this model based on prediction accuracy and the impact of Type I error.
Class | Precision | Recall | F1-score | Support
0 | 0.84 | 0.90 | 0.87 | 1061
1 | 0.61 | 0.47 | 0.53 | 348
Avg/Total | 0.78 | 0.79 | 0.79 | 1409
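These per-class figures match the layout of scikit-learn's classification report; a hedged sketch of producing the same table:

from sklearn.metrics import classification_report

# Per-class precision, recall, F1-score, and support on the test split
print(classification_report(y_test, y_pred))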
Technology:
Python, Pandas, Scikit-Learn