
PROBLEM 2: Bank Marketing

This dataset captures the results of a series of direct marketing campaigns of “BANK”, an international banking institution. The campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to assess if the product (bank term deposit) would be ('yes') or not ('no') subscribed.
1. Run a classifier that predicts the subscription of a term deposit using 90:10, 80:20, and 70:30 split strategies.
#preparations -------------------------------------------------------------------------
library(tidyverse)
library(fastDummies)
library(visdat)
library(caret) #confusion matrix package
library(rpart.plot)
library(rpart)

dfbank=read.csv("https://raw.githubusercontent.com/jcbonilla/BusinessAnalytics/master/BAData/bank_marketing.csv",stringsAsFactors = TRUE)#load data
dfbank[dfbank==""]=NA
vis_miss(dfbank,warn_large_data=FALSE) #no NA value
summary(dfbank)

#90-10---------------------------------------------------------------------------------
set.seed(97) # setting seed to reproduce results of random sampling
split=(.9) #90/10
trainingRowIndex = sample(1:nrow(dfbank),(split)*nrow(dfbank)) # row indices for training data
trainingData = dfbank[trainingRowIndex, ] # model training data
testData = dfbank[-trainingRowIndex, ] # test data


dtree = rpart(y ~age+job+marital+default+housing+loan+contact+duration+pdays+previous+poutcome+nr.employed, data=trainingData, method="class",parms=list(split="information"), control=rpart.control(minsplit=2, cp=0.001))
plotcp(dtree) # when cp=0.006 the model performs best
dtree$cptable
dtree$variable.importance # the importance of variables in this model is duration>nr.employed>pdays
dtree.pruned = prune(dtree, cp=.006) # pruning the decision tree

prp(dtree.pruned, type = 2, extra = 104,fallen.leaves = TRUE, main="Decision Tree") #plot the decision tree
dtree.pred = predict(dtree.pruned, testData, type="class") #predict based on model

conMatrix = confusionMatrix(data=dtree.pred, reference = testData$y) #Confusion Matrix
print(conMatrix)
The missing-value plot shows that the dataset contains no NA values.

The summary shows that the target is imbalanced: about 90% of clients answered 'no' and only about 10% answered 'yes'. An accuracy near 90% is therefore not evidence of a good model, because predicting 'no' for every client would already be about 90% accurate.
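The baseline argument above can be checked in a few lines. This is a minimal sketch that uses a made-up vector `y` with roughly the same 90/10 balance; with the real data, `y` would be `dfbank$y`.

```r
# Hypothetical target vector with roughly the 90/10 imbalance described above.
# With the real data this would be dfbank$y.
y <- factor(c(rep("no", 900), rep("yes", 100)), levels = c("no", "yes"))

# Share of each class
class_share <- prop.table(table(y))
print(class_share)

# Accuracy of a "model" that always predicts the majority class
majority_class <- names(which.max(class_share))
baseline_accuracy <- mean(y == majority_class)
print(baseline_accuracy)
```

Any tree worth keeping should beat this baseline, which is why precision and recall on the 'yes' class are reported alongside accuracy.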
The complexity-parameter plot shows that the cross-validated error is lowest at about cp = 0.006, so we prune the decision tree with cp = 0.006.
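The cp value can also be read from the cp table instead of from the plot. The sketch below is an illustration, not the procedure used in this report: it fits a tree on the `kyphosis` data frame that ships with rpart (a stand-in for the bank data) and picks the row with the lowest cross-validated error (`xerror`).

```r
library(rpart)

# Stand-in data: the kyphosis data frame that ships with rpart.
# With the bank data, `fit` would be the `dtree` object fitted above.
set.seed(97)  # cross-validation inside rpart is random
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class",
             control = rpart.control(minsplit = 2, cp = 0.001))

cp_table <- fit$cptable

# Row with the lowest cross-validated error (column "xerror")
best_row <- which.min(cp_table[, "xerror"])
best_cp  <- cp_table[best_row, "CP"]

# Prune the full tree back to that complexity
pruned <- prune(fit, cp = best_cp)
print(best_cp)
```

`plotcp` also draws a horizontal line one standard error above the minimum; choosing the simplest tree below that line is a common alternative to taking the minimum itself.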


The variable importance output ranks duration first, then nr.employed, then pdays, followed by the remaining predictors.


This is the decision tree plot after pruning.
From this plot, we see that the tree first splits the data on nr.employed (abbreviated as nr.emplo in the plot), then on duration and pdays.

This is the confusion matrix for the predictions on the test data. The accuracy is 92.09%.
Precision = TP/(TP+FP) = 65.38%
Recall = TP/(TP+FN) = 54.33%
With 'yes' as the positive class, a recall of 54.33% means the model predicts 'no' for about 46% of the clients who actually subscribed.
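The precision and recall formulas above can be applied to a 2x2 table as follows. The counts in this sketch are hypothetical and chosen only for illustration; they are not the counts from the confusion matrix in this report.

```r
# Hypothetical predictions and labels for illustration only.
predicted <- factor(c(rep("yes", 60), rep("no", 40),    # actual "yes"
                      rep("yes", 30), rep("no", 870)),  # actual "no"
                    levels = c("no", "yes"))
actual    <- factor(c(rep("yes", 100), rep("no", 900)),
                    levels = c("no", "yes"))

# Rows are predictions, columns are the true labels
cm <- table(Prediction = predicted, Reference = actual)

TP <- cm["yes", "yes"]  # predicted yes, actually yes
FP <- cm["yes", "no"]   # predicted yes, actually no
FN <- cm["no",  "yes"]  # predicted no, actually yes

precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
miss_rate <- 1 - recall  # share of actual "yes" cases predicted as "no"
```

Note that caret's `confusionMatrix` treats the first factor level as the positive class when there are two levels, which here would be 'no'. Passing `positive = "yes"` makes the printed sensitivity refer to 'yes', and `mode = "prec_recall"` adds precision and recall to the output.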


#80/20---------------------------------------------------------------------------------
set.seed(97)
split=(.8) #80/20
trainingRowIndex = sample(1:nrow(dfbank),(split)*nrow(dfbank)) # row indices for training data
trainingData = dfbank[trainingRowIndex, ] # model training data
testData = dfbank[-trainingRowIndex, ] # test data


dtree = rpart(y ~age+job+marital+default+housing+loan+contact+duration+pdays+previous+poutcome+nr.employed, data=trainingData, method="class",parms=list(split="information"), control=rpart.control(minsplit=2, cp=0.001))
plotcp(dtree) # when cp=0.0069 the model performs best
dtree$cptable
dtree$variable.importance # the importance of variables in this model is duration>nr.employed>pdays
dtree.pruned = prune(dtree, cp=0.0069) # pruning the decision tree

prp(dtree.pruned, type = 2, extra = 104,fallen.leaves = TRUE, main="Decision Tree") #plot the decision tree
dtree.pred = predict(dtree.pruned, testData, type="class") #predict based on model

conMatrix = confusionMatrix(data=dtree.pred, reference = testData$y) #Confusion Matrix
print(conMatrix)
The complexity-parameter plot shows that the cross-validated error is lowest at about cp = 0.0069, so we prune the decision tree with cp = 0.0069.

The variable importance output ranks duration first, then nr.employed, then pdays, followed by the remaining predictors. In other words, duration has the largest influence on the prediction.



This is the decision tree plot after pruning.
From this plot, we see that the tree first splits the data on nr.employed (abbreviated as nr.emplo in the plot), then on different ranges of duration and pdays.

This is the confusion matrix for the predictions on the test data. The accuracy is 91.49%.
Precision = TP/(TP+FP) = 64.77%
Recall = TP/(TP+FN) = 53.98%
With 'yes' as the positive class, a recall of 53.98% means the model predicts 'no' for about 46% of the clients who actually subscribed.


#70/30---------------------------------------------------------------------------------
set.seed(97)
split=(.7) #70/30
trainingRowIndex = sample(1:nrow(dfbank),(split)*nrow(dfbank)) # row indices for training data
trainingData = dfbank[trainingRowIndex, ] # model training data
testData = dfbank[-trainingRowIndex, ] # test data


dtree = rpart(y ~age+job+marital+default+housing+loan+contact+duration+pdays+previous+poutcome+nr.employed, data=trainingData, method="class",parms=list(split="information"), control=rpart.control(minsplit=2, cp=0.001))
plotcp(dtree) # when cp=0.01 the model performs best
dtree$cptable
dtree$variable.importance # the importance of variables in this model is duration>nr.employed>pdays
dtree.pruned = prune(dtree, cp=0.01) # pruning the decision tree

prp(dtree.pruned, type = 2, extra = 104,fallen.leaves = TRUE, main="Decision Tree") #plot the decision tree
dtree.pred = predict(dtree.pruned, testData, type="class") #predict based on model

conMatrix = confusionMatrix(data=dtree.pred, reference = testData$y) #Confusion Matrix
print(conMatrix)
The complexity-parameter plot shows that the cross-validated error is lowest at about cp = 0.01, so we prune the decision tree with cp = 0.01.
The variable importance output ranks duration first, then nr.employed, then pdays, followed by the remaining predictors.

This is the decision tree plot after pruning.
From this plot, we see that the tree first splits the data on nr.employed (abbreviated as nr.emplo in the plot), then on different ranges of duration and pdays.

This is the confusion matrix for the predictions on the test data. The accuracy is 91.27%.
Precision = TP/(TP+FP) = 62.54%
Recall = TP/(TP+FN) = 55.65%
With 'yes' as the positive class, a recall of 55.65% means the model predicts 'no' for about 44% of the clients who actually subscribed.

2. Interpret the classification trees and describe their accuracy.
See answers in question 1
3. Which model would you recommend and why?
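The 90/10 model has the highest test accuracy (92.09%), but I would choose the 80/20 model. With only 10% of the rows held out, the 90/10 accuracy rests on a small test set and depends more on which rows happen to land in it, so its small lead over the 80/20 model (91.49%) may not reflect a truly better model.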
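To compare the three splits side by side, the split, fit, prune, and evaluate steps can be wrapped in one function. The sketch below is an illustration under stated assumptions: `evaluate_split` is a hypothetical helper, the `kyphosis` data frame that ships with rpart stands in for the bank data, and cp is taken from the lowest cross-validated error rather than read from the plot.

```r
library(rpart)

# Hypothetical helper: fit, prune, and evaluate a tree for one train/test ratio.
# `target` is the name of the outcome column, `positive` the class of interest.
evaluate_split <- function(data, formula, target, positive, split, seed = 97) {
  set.seed(seed)
  n_train   <- floor(split * nrow(data))
  train_idx <- sample(seq_len(nrow(data)), n_train)
  train <- data[train_idx, ]
  test  <- data[-train_idx, ]

  fit <- rpart(formula, data = train, method = "class",
               parms = list(split = "information"),
               control = rpart.control(minsplit = 2, cp = 0.001))
  # Prune at the cp with the lowest cross-validated error
  best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
  pred <- predict(prune(fit, cp = best_cp), test, type = "class")

  truth <- test[[target]]
  tp <- sum(pred == positive & truth == positive)
  data.frame(split     = split,
             n_test    = nrow(test),
             accuracy  = mean(pred == truth),
             precision = tp / sum(pred == positive),  # NaN if no positive predictions
             recall    = tp / sum(truth == positive)) # NaN if no positives in test set
}

# Stand-in data: kyphosis ships with rpart. For the bank data, the call would
# use dfbank, the formula from the report, target = "y", positive = "yes".
results <- do.call(rbind, lapply(c(0.9, 0.8, 0.7), function(s) {
  evaluate_split(kyphosis, Kyphosis ~ Age + Number + Start,
                 target = "Kyphosis", positive = "present", split = s)
}))
print(results)
```

Calling the helper with several `seed` values gives a rough sense of how much each split's test metrics move with the random partition.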

