This dataset captures the results of a series of direct marketing campaigns by “BANK”, an international banking institution. The campaigns were based on phone calls; often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').
1. Run a classifier that predicts the subscription of a term deposit using 90:10, 80:20, and 70:30 split strategies.
#preparations -------------------------------------------------------------------------
library(tidyverse)
library(fastDummies)
library(visdat)
library(caret) #confusion matrix package
library(rpart.plot)
library(rpart)
dfbank=read.csv("https://raw.githubusercontent.com/jcbonilla/BusinessAnalytics/master/BAData/bank_marketing.csv",stringsAsFactors = TRUE) #load data (note: no trailing space in the URL)
dfbank[dfbank==""]=NA
vis_miss(dfbank,warn_large_data=FALSE) #no NA value
summary(dfbank)
#90-10---------------------------------------------------------------------------------
set.seed(97) # setting seed to reproduce results of random sampling
split=(.9) #90/10
trainingRowIndex = sample(1:nrow(dfbank),(split)*nrow(dfbank)) # row indices for training data
trainingData = dfbank[trainingRowIndex, ] # model training data
testData = dfbank[-trainingRowIndex, ] # test data
dtree = rpart(y ~age+job+marital+default+housing+loan+contact+duration+pdays+previous+poutcome+nr.employed, data=trainingData, method="class",parms=list(split="information"), control=rpart.control(minsplit=2, cp=0.001))
plotcp(dtree) # when cp=0.006 the model performs best
dtree$cptable
dtree$variable.importance # the importance of variables in this model is duration>nr.employed>pdays
dtree.pruned = prune(dtree, cp=.006) # pruning the decision tree
prp(dtree.pruned, type = 2, extra = 104,fallen.leaves = TRUE, main="Decision Tree") #plot the decision tree
dtree.pred = predict(dtree.pruned, testData, type="class") #predict based on model
conMatrix = confusionMatrix(data=dtree.pred, reference = testData$y) #Confusion Matrix
print(conMatrix)
The plot shows there are no missing values in this dataset.
The summary describes the basic characteristics of the dataset. The outcome is imbalanced: about 90% of clients did not subscribe ('no') and only about 10% did ('yes'). An accuracy near 90% therefore does not by itself indicate a good model, because always predicting 'no' would already achieve roughly 90% accuracy.
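The imbalance and the majority-class baseline can be checked directly. A quick sketch, assuming dfbank and its outcome column y are loaded as above:

```r
# Class distribution of the outcome: roughly 90% "no" vs 10% "yes"
prop.table(table(dfbank$y))

# Baseline accuracy of a trivial model that always predicts the majority class
max(prop.table(table(dfbank$y)))
```

Any candidate model should clear this baseline, and recall on the 'yes' class matters more than raw accuracy here.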
This plot shows that the model performs best at cp = 0.006, so we use cp = 0.006 when pruning the decision tree.
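Instead of reading cp off the plot, the value can also be taken programmatically from the tree's cptable, choosing the cp that minimizes the cross-validated error. A sketch, assuming dtree has been fit as above:

```r
# Pick the cp with the lowest cross-validated error (xerror) from the cp table
best.cp = dtree$cptable[which.min(dtree$cptable[, "xerror"]), "CP"]
best.cp

# Prune with the automatically selected cp
dtree.pruned = prune(dtree, cp = best.cp)
```

Because xerror comes from rpart's internal cross-validation, the selected cp can vary slightly between runs and seeds; it should land close to the value read from plotcp().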
The result shows the variable importance ranking in this model: duration > nr.employed > pdays > ........
This is the decision tree plot after the pruning.
From this plot, we notice that the tree first splits the data on nr.employed, then on duration and pdays.
This is the confusion matrix for predictions on the test data; the accuracy is 92.09%.
Precision Rate = TP/(TP+FP)= 65.38%
Recall Rate= TP/(TP+FN)= 54.33%
From these results we can see that this model often predicts 'no' when the true label is 'yes': the recall on the 'yes' class is low.
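The precision and recall above treat 'yes' as the positive class. caret can report them directly when the positive level is set explicitly, avoiding hand calculation from the matrix cells. A sketch, assuming dtree.pred and testData from above:

```r
# Recompute the confusion matrix with "yes" as the positive class;
# mode = "prec_recall" prints Precision/Recall instead of Sensitivity/Specificity
cm.yes = confusionMatrix(data = dtree.pred, reference = testData$y,
                         positive = "yes", mode = "prec_recall")
cm.yes$byClass[c("Precision", "Recall", "F1")]
```

Note that by default caret takes the first factor level ("no") as the positive class, so its printed Sensitivity is not the 'yes' recall unless positive = "yes" is supplied.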
#80/20---------------------------------------------------------------------------------
set.seed(97)
split=(.8) #80/20
trainingRowIndex = sample(1:nrow(dfbank),(split)*nrow(dfbank)) # row indices for training data
trainingData = dfbank[trainingRowIndex, ] # model training data
testData = dfbank[-trainingRowIndex, ] # test data
dtree = rpart(y ~age+job+marital+default+housing+loan+contact+duration+pdays+previous+poutcome+nr.employed, data=trainingData, method="class",parms=list(split="information"), control=rpart.control(minsplit=2, cp=0.001))
plotcp(dtree) # when cp=0.0069 the model performs best
dtree$cptable
dtree$variable.importance # the importance of variables in this model is duration>nr.employed>pdays
dtree.pruned = prune(dtree, cp=0.0069) # pruning the decision tree
prp(dtree.pruned, type = 2, extra = 104,fallen.leaves = TRUE, main="Decision Tree") #plot the decision tree
dtree.pred = predict(dtree.pruned, testData, type="class") #predict based on model
conMatrix = confusionMatrix(data=dtree.pred, reference = testData$y) #Confusion Matrix
print(conMatrix)
This plot shows that the model performs best at cp = 0.0069, so we use cp = 0.0069 when pruning the decision tree.
The result shows the variable importance ranking in this model: duration > nr.employed > pdays > ........; in other words, duration influences the result the most.
This is the decision tree plot after the pruning.
From this plot, we notice that the tree first splits the data on nr.employed, then on different ranges of duration and pdays.
This is the confusion matrix for predictions on the test data; the accuracy is 91.49%.
Precision Rate = TP/(TP+FP)= 64.77%
Recall Rate= TP/(TP+FN)= 53.98%
From these results we can see that, as with the 90/10 split, this model often predicts 'no' when the true label is 'yes'.
#70/30---------------------------------------------------------------------------------
set.seed(97)
split=(.7) #70/30
trainingRowIndex = sample(1:nrow(dfbank),(split)*nrow(dfbank)) # row indices for training data
trainingData = dfbank[trainingRowIndex, ] # model training data
testData = dfbank[-trainingRowIndex, ] # test data
dtree = rpart(y ~age+job+marital+default+housing+loan+contact+duration+pdays+previous+poutcome+nr.employed, data=trainingData, method="class",parms=list(split="information"), control=rpart.control(minsplit=2, cp=0.001))
plotcp(dtree) # when cp=0.01 the model performs best
dtree$cptable
dtree$variable.importance # the importance of variables in this model is duration>nr.employed>pdays
dtree.pruned = prune(dtree, cp=0.01) # pruning the decision tree
prp(dtree.pruned, type = 2, extra = 104,fallen.leaves = TRUE, main="Decision Tree") #plot the decision tree
dtree.pred = predict(dtree.pruned, testData, type="class") #predict based on model
conMatrix = confusionMatrix(data=dtree.pred, reference = testData$y) #Confusion Matrix
print(conMatrix)
This plot shows that the model performs best at cp = 0.01, so we use cp = 0.01 when pruning the decision tree.
The result shows the variable importance ranking in this model: duration > nr.employed > pdays > ........
This is the decision tree plot after the pruning.
From this plot, we notice that the tree first splits the data on nr.employed, then on different ranges of duration and pdays.
This is the confusion matrix for predictions on the test data; the accuracy is 91.27%.
Precision Rate = TP/(TP+FP)= 62.54%
Recall Rate= TP/(TP+FN)= 55.65%
From these results we can see that this model, too, often predicts 'no' when the true label is 'yes'.
2. Interpret the classification trees and describe their accuracy.
See the answers in question 1.
3. Which model would you recommend and why?
The 90/10 model is the most accurate, but I would choose the 80/20 split: a very large training fraction leaves only a small test set for evaluation, making the reported accuracy less reliable and increasing the risk of overfitting.
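Since the three runs above differ only in the split fraction, the comparison can be condensed into a single loop that reports accuracy and 'yes'-class recall side by side. A sketch, assuming dfbank and the libraries loaded above (the pruning cp is chosen from the cptable here, so exact numbers may differ slightly from the runs above):

```r
# Compare 90/10, 80/20, and 70/30 splits with the same seed and model formula
form = y ~ age + job + marital + default + housing + loan + contact +
  duration + pdays + previous + poutcome + nr.employed

for (split in c(0.9, 0.8, 0.7)) {
  set.seed(97)
  idx = sample(1:nrow(dfbank), split * nrow(dfbank))
  fit = rpart(form, data = dfbank[idx, ], method = "class",
              parms = list(split = "information"),
              control = rpart.control(minsplit = 2, cp = 0.001))
  # Prune at the cp minimizing cross-validated error
  best.cp = fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
  pred = predict(prune(fit, cp = best.cp), dfbank[-idx, ], type = "class")
  cm = confusionMatrix(data = pred, reference = dfbank[-idx, ]$y,
                       positive = "yes")
  cat(sprintf("%.0f/%.0f split: accuracy %.4f, recall %.4f\n",
              100 * split, 100 * (1 - split),
              cm$overall["Accuracy"], cm$byClass["Recall"]))
}
```

If the metrics stay close across splits, that is evidence the model is stable and the choice of split matters less than the class imbalance discussed above.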