R Classification - Ensemble / ROCR

Classification - Ensemble (Bagging, Boosting, Random Forest) / ROCR
Reference 1 : Ensemble techniques
Reference 2 : ROCR, Lift chart

Ensemble

A decision tree is unstable: a small change in the data can change the fitted model substantially.
An ensemble method builds several predictive models from the given data and combines them into a single final model.

Reference 1 - Bagging and Boosting
Reference 2 - random forest
Reference 3 - random forest and gradient boosting

1. Decision Tree - Ensemble - Bagging

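Generate several bootstrap samples from the given data --> fit a predictive model to each sample --> combine them into the final model
In practice the population distribution behind the training data is unknown, so the true average predictive model cannot be computed.
Bagging treats the training data as if it were the population and averages the models fitted to resamples of it, which reduces variance and can improve predictive performance.
It is generally most useful when the base model overfits.
bootstrap : samples of the same size as the raw data, drawn from it at random with replacement
voting : combine the outputs of several models and take the majority class as the final result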
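As a small illustration of the bootstrap definition above, the sketch below (base R only, with an arbitrary seed and an arbitrary 1000 repetitions) checks how many distinct original rows end up in a bootstrap sample of iris. Because rows are drawn with replacement, each sample is expected to contain only about 63% of the original rows.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
n <- nrow(iris)

# Fraction of distinct original rows in each of 1000 bootstrap samples
unique_frac <- replicate(1000, {
  idx <- sample(n, size = n, replace = TRUE)
  length(unique(idx)) / n
})

mean(unique_frac)                # empirical average
1 - (1 - 1/n)^n                  # theoretical expectation for n = 150
```

The rows left out of each sample are what random forest later uses for its out-of-bag (OOB) error estimate.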

# Generate bootstrap samples
data_boot1 <- iris[sample(1:nrow(iris), replace = T), ]
data_boot2 <- iris[sample(1:nrow(iris), replace = T), ]
data_boot3 <- iris[sample(1:nrow(iris), replace = T), ]
data_boot4 <- iris[sample(1:nrow(iris), replace = T), ]
data_boot5 <- iris[sample(1:nrow(iris), replace = T), ]

# Modeling (ctree() comes from the party package)
library(party)
tree1 <- ctree(Species ~ ., data_boot1)
tree2 <- ctree(Species ~ ., data_boot2)
tree3 <- ctree(Species ~ ., data_boot3)
tree4 <- ctree(Species ~ ., data_boot4)
tree5 <- ctree(Species ~ ., data_boot5)

plot(tree1)

plot(tree5)

pred1 <- predict(tree1, iris)
pred2 <- predict(tree2, iris)
pred3 <- predict(tree3, iris)
pred4 <- predict(tree4, iris)
pred5 <- predict(tree5, iris)

# Collect the predictions of each tree
test <- data.frame(Species = iris$Species, pred1, pred2, pred3, pred4, pred5)
head(test)

##   Species  pred1  pred2  pred3  pred4  pred5
## 1  setosa setosa setosa setosa setosa setosa
## 2  setosa setosa setosa setosa setosa setosa
## 3  setosa setosa setosa setosa setosa setosa
## 4  setosa setosa setosa setosa setosa setosa
## 5  setosa setosa setosa setosa setosa setosa
## 6  setosa setosa setosa setosa setosa setosa

# Majority vote over the 5 classifiers to get the final result
funcResultValue <- function(x) {
    result <- NULL
    for (i in 1:nrow(x)) {
        xtab <- table(t(x[i, ]))
        rvalue <- names(sort(xtab, decreasing = T)[1])
        result <- c(result, rvalue)
    }
    return(result)
}

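library(caret)   # confusionMatrix()
# The voted labels are character strings; recent versions of caret expect factors with matching levels
test$result <- factor(funcResultValue(test[ , 2:6]), levels = levels(test$Species))
confusionMatrix(test$result, test$Species)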

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         47         1
##   virginica       0          3        49
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9733          
##                  95% CI : (0.9331, 0.9927)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.96            
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9400           0.9800
## Specificity                 1.0000            0.9900           0.9700
## Pos Pred Value              1.0000            0.9792           0.9423
## Neg Pred Value              1.0000            0.9706           0.9898
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3133           0.3267
## Detection Prevalence        0.3333            0.3200           0.3467
## Balanced Accuracy           1.0000            0.9650           0.9750
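The voting loop above can also be written without growing a vector row by row. The following is a minimal sketch using `apply()`; `majority_vote` and the toy `votes` data frame are hypothetical names introduced here for illustration, not part of the original analysis.

```r
# Hypothetical helper: row-wise majority vote over a data frame of class labels
majority_vote <- function(preds) {
  apply(as.matrix(preds), 1, function(row) {
    counts <- table(row)
    names(counts)[which.max(counts)]   # ties go to the alphabetically first label
  })
}

# Toy predictions from three models for three observations
votes <- data.frame(
  m1 = c("setosa", "virginica",  "versicolor"),
  m2 = c("setosa", "versicolor", "versicolor"),
  m3 = c("setosa", "virginica",  "virginica"),
  stringsAsFactors = FALSE
)

unname(majority_vote(votes))
# "setosa" "virginica" "versicolor"
```

With an odd number of models and two classes ties cannot occur; with three classes, as in iris, the tie-breaking rule matters and should be chosen deliberately.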

2. Decision Tree - Ensemble - Boosting

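Combines weak predictive models into a strong one. It can reduce training error quickly and easily.
The goal is to classify better by giving more weight to the observations that were misclassified.
Drawback : the second, third, ... classifiers see only part of the original data, so the training data must be large.
AdaBoost : for a binary classification problem, weights are assigned to n classifiers that are each slightly better than random guessing, and they are combined into the final classifier.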
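To make the weighting idea concrete, here is a minimal sketch of one round of the standard AdaBoost weight update on a toy case (10 observations, 3 of them misclassified; both numbers are assumptions chosen for illustration). The hand-written `boost()` function later in this section uses a simpler rule and does not follow this formula.

```r
# Toy setup (assumed): 10 training cases with equal weights, 3 misclassified
n    <- 10
w    <- rep(1 / n, n)
miss <- c(TRUE, TRUE, TRUE, rep(FALSE, 7))

err   <- sum(w[miss])                     # weighted error rate = 0.3
alpha <- 0.5 * log((1 - err) / err)       # weight of this weak classifier in the final vote

# Raise the weights of misclassified cases, lower the others, then renormalise
w_new <- w * exp(ifelse(miss, alpha, -alpha))
w_new <- w_new / sum(w_new)

sum(w_new[miss])                          # 0.5
```

After the update the misclassified cases carry exactly half of the total weight, so the next weak classifier cannot do well by repeating the same mistakes.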

library(tree)

data(kyphosis, package = "rpart")   # children who have had corrective spinal surgery
data <- kyphosis
head(data)

##   Kyphosis Age Number Start
## 1   absent  71      3     5
## 2   absent 158      3    14
## 3  present 128      4     5
## 4   absent   2      5     1
## 5   absent   1      4    15
## 6   absent   1      2    16

totalCount <- nrow(data)
totalCount

## [1] 81

boost <- function(k, compare) {
    # Start with the same sampling probability for every row
    pr <- rep(1/totalCount, totalCount)
    
    # Stores the predicted probabilities and the accuracy of each model
    result <- matrix(0, k, 3) # k rows, 3 columns
    
    # Build k tree models
    for (j in 1:k) {
        # Unlike bagging, rows are sampled with the probabilities in pr
        data.boost <- data[sample(1:totalCount, prob = pr, replace = T), ]
        
        # Fit a tree to the sampled data
        data.tree <- tree(Kyphosis ~ ., data.boost)
        
        # Stores the predicted class of every row
        pred <- matrix(0, totalCount, 1)
        
        for (i in 1:totalCount) {
            # predict() returns the absent / present probabilities
            if (predict(data.tree, data[i, ])[ , 1] > 0.5) {
                pred[i, 1] <- "absent"
            } else {
                pred[i, 1] <- "present"
            }
        }
        
        # Predicted probabilities for the single test case (compare)
        result[j, 1] <- predict(data.tree, compare)[ , 1]
        result[j, 2] <- predict(data.tree, compare)[ , 2]
        result[j, 3] <- length(which(as.matrix(data)[ , 1] == pred)) / totalCount # accuracy on the full data
        
        pr <- rep(1/totalCount, totalCount)
        # Double the sampling probability of misclassified rows for the next iteration
        pr[as.matrix(data)[ , 1] != pred] <- 2/totalCount   
    }
    return(result)
}

# Run 10 iterations, using row 80 as the case to predict
boost.result <- boost(10, data[80, ])
boost.result

##            [,1]      [,2]      [,3]
##  [1,] 1.0000000 0.0000000 0.8641975
##  [2,] 0.5555556 0.4444444 0.8148148
##  [3,] 0.1000000 0.9000000 0.7283951
##  [4,] 0.1428571 0.8571429 0.8271605
##  [5,] 0.2857143 0.7142857 0.8271605
##  [6,] 0.0000000 1.0000000 0.8271605
##  [7,] 0.0000000 1.0000000 0.8518519
##  [8,] 0.6000000 0.4000000 0.8271605
##  [9,] 0.0000000 1.0000000 0.8518519
## [10,] 1.0000000 0.0000000 0.8518519

a <- t(boost.result[,1])%*%(boost.result[,3])    # accuracy-weighted score for absent
b <- t(boost.result[,2])%*%(boost.result[,3])    # accuracy-weighted score for present
a;b

##          [,1]
## [1,] 3.092357

##          [,1]
## [1,] 5.179248

# b is larger than a, so row 80 is finally predicted as present
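The values a and b are accuracy-weighted sums rather than probabilities, which is why they exceed 1. As a sketch, the snippet below rebuilds them from the numbers printed in `boost.result` above and divides by the total weight so that the two shares sum to 1. The variable names are introduced here for illustration only.

```r
# Values copied from the printed boost.result above
p_absent  <- c(1, 0.5555556, 0.1, 0.1428571, 0.2857143, 0, 0, 0.6, 0, 1)
p_present <- 1 - p_absent
accuracy  <- c(0.8641975, 0.8148148, 0.7283951, 0.8271605, 0.8271605,
               0.8271605, 0.8518519, 0.8271605, 0.8518519, 0.8518519)

score_absent  <- sum(p_absent  * accuracy)   # same quantity as a
score_present <- sum(p_present * accuracy)   # same quantity as b

# Dividing by the total weight turns the scores into shares that sum to 1
shares <- c(absent = score_absent, present = score_present) / sum(accuracy)
round(shares, 3)
```

The ranking is unchanged by the normalisation, so the final prediction is still present, with a share of roughly 0.63 against 0.37.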

3. Decision Tree - Ensemble - Random Forest

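Builds models with more randomness than bagging or boosting and then combines them linearly into the final model.
It can be run without removing any variables, even when there are very many input variables.
The final result is hard to interpret, but the predictive performance is good.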
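The extra randomness comes from considering only a random subset of the predictors at each split, on top of the bootstrap sampling of rows. A minimal sketch in base R, assuming the randomForest default for classification of `floor(sqrt(p))` candidate variables per split (the seed is arbitrary):

```r
predictors <- setdiff(names(iris), "Species")
p    <- length(predictors)          # 4 predictors
mtry <- floor(sqrt(p))              # 2 candidate variables per split

set.seed(42)                        # arbitrary seed
# Candidate variables for one split: a random subset of size mtry
candidates <- sample(predictors, mtry)
candidates
```

This is the value reported as "No. of variables tried at each split" when a fitted randomForest model is printed.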

idx <- sample(2, nrow(iris), replace = T, prob = c(0.7, 0.3))
trainData <- iris[idx == 1, ]
testData <- iris[idx == 2, ]

library(randomForest)

# ntree = 100 : grow 100 trees.
# proximity = T : also compute the proximity matrix between observations
model <- randomForest(Species ~ ., data = trainData, ntree = 100, proximity = T)
model

## 
## Call:
##  randomForest(formula = Species ~ ., data = trainData, ntree = 100,      proximity = T) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 3.85%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         38          0         0  0.00000000
## versicolor      0         31         1  0.03125000
## virginica       0          3        31  0.08823529

table(trainData$Species, predict(model))

##             
##              setosa versicolor virginica
##   setosa         38          0         0
##   versicolor      0         31         1
##   virginica       0          3        31
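The OOB error rate and the per-class errors in the model printout follow directly from the confusion matrix. The sketch below recomputes them from the counts printed above (rows are the actual classes).

```r
# Counts copied from the OOB confusion matrix printed above (rows = actual class)
classes <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(38,  0,  0,
                0, 31,  1,
                0,  3, 31),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual = classes, predicted = classes))

oob_error   <- 1 - sum(diag(cm)) / sum(cm)    # 4 misclassified out of 104
class_error <- 1 - diag(cm) / rowSums(cm)     # error within each actual class

round(100 * oob_error, 2)                     # 3.85
round(class_error, 4)
```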

importance(model)    # Mean decrease in Gini impurity. Variables with higher values contribute most to separating the classes.

##              MeanDecreaseGini
## Sepal.Length         5.408025
## Sepal.Width          1.272408
## Petal.Length        30.947899
## Petal.Width         30.721860

plot(model, main = "randomForest model of iris")

# The error stabilizes once there are more than about 40 trees.

varImpPlot(model)   # shows the relative importance of the variables

pred <- predict(model, newdata = testData)
table(testData$Species, pred)

##             pred
##              setosa versicolor virginica
##   setosa         12          0         0
##   versicolor      0         16         2
##   virginica       0          0        16

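# Margins are based on the model's OOB votes for the training data, so the training labels are passed
plot(margin(model, trainData$Species))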

ROCR

library(C50)
library(ROCR)

data(churn)     # churn data set as shipped with the C50 version used here: customers who switch service providers
summary(churnTrain)

##      state      account_length          area_code    international_plan
##  WV     : 106   Min.   :  1.0   area_code_408: 838   no :3010          
##  MN     :  84   1st Qu.: 74.0   area_code_415:1655   yes: 323          
##  NY     :  83   Median :101.0   area_code_510: 840                     
##  AL     :  80   Mean   :101.1                                          
##  OH     :  78   3rd Qu.:127.0                                          
##  OR     :  78   Max.   :243.0                                          
##  (Other):2824                                                          
##  voice_mail_plan number_vmail_messages total_day_minutes total_day_calls
##  no :2411        Min.   : 0.000        Min.   :  0.0     Min.   :  0.0  
##  yes: 922        1st Qu.: 0.000        1st Qu.:143.7     1st Qu.: 87.0  
##                  Median : 0.000        Median :179.4     Median :101.0  
##                  Mean   : 8.099        Mean   :179.8     Mean   :100.4  
##                  3rd Qu.:20.000        3rd Qu.:216.4     3rd Qu.:114.0  
##                  Max.   :51.000        Max.   :350.8     Max.   :165.0  
##                                                                         
##  total_day_charge total_eve_minutes total_eve_calls total_eve_charge
##  Min.   : 0.00    Min.   :  0.0     Min.   :  0.0   Min.   : 0.00   
##  1st Qu.:24.43    1st Qu.:166.6     1st Qu.: 87.0   1st Qu.:14.16   
##  Median :30.50    Median :201.4     Median :100.0   Median :17.12   
##  Mean   :30.56    Mean   :201.0     Mean   :100.1   Mean   :17.08   
##  3rd Qu.:36.79    3rd Qu.:235.3     3rd Qu.:114.0   3rd Qu.:20.00   
##  Max.   :59.64    Max.   :363.7     Max.   :170.0   Max.   :30.91   
##                                                                     
##  total_night_minutes total_night_calls total_night_charge
##  Min.   : 23.2       Min.   : 33.0     Min.   : 1.040    
##  1st Qu.:167.0       1st Qu.: 87.0     1st Qu.: 7.520    
##  Median :201.2       Median :100.0     Median : 9.050    
##  Mean   :200.9       Mean   :100.1     Mean   : 9.039    
##  3rd Qu.:235.3       3rd Qu.:113.0     3rd Qu.:10.590    
##  Max.   :395.0       Max.   :175.0     Max.   :17.770    
##                                                          
##  total_intl_minutes total_intl_calls total_intl_charge
##  Min.   : 0.00      Min.   : 0.000   Min.   :0.000    
##  1st Qu.: 8.50      1st Qu.: 3.000   1st Qu.:2.300    
##  Median :10.30      Median : 4.000   Median :2.780    
##  Mean   :10.24      Mean   : 4.479   Mean   :2.765    
##  3rd Qu.:12.10      3rd Qu.: 6.000   3rd Qu.:3.270    
##  Max.   :20.00      Max.   :20.000   Max.   :5.400    
##                                                       
##  number_customer_service_calls churn     
##  Min.   :0.000                 yes: 483  
##  1st Qu.:1.000                 no :2850  
##  Median :1.000                           
##  Mean   :1.563                           
##  3rd Qu.:2.000                           
##  Max.   :9.000                           
##

# Modeling
c5_options <- C5.0Control(winnow = FALSE, noGlobalPruning = FALSE)
model <- C5.0(churn ~ ., data = churnTrain, control = c5_options, rules = FALSE)
summary(model)

## 
## Call:
## C5.0.formula(formula = churn ~ ., data = churnTrain, control =
##  c5_options, rules = FALSE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sun Mar 19 20:59:53 2017
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 3333 cases (20 attributes) from undefined.data
## 
## Decision tree:
## 
## total_day_minutes > 264.4:
## :...voice_mail_plan = yes:
## :   :...international_plan = no: no (45/1)
## :   :   international_plan = yes: yes (8/3)
## :   voice_mail_plan = no:
## :   :...total_eve_minutes > 187.7:
## :       :...total_night_minutes > 126.9: yes (94/1)
## :       :   total_night_minutes <= 126.9:
## :       :   :...total_day_minutes <= 277: no (4)
## :       :       total_day_minutes > 277: yes (3)
## :       total_eve_minutes <= 187.7:
## :       :...total_eve_charge <= 12.26: no (15/1)
## :           total_eve_charge > 12.26:
## :           :...total_day_minutes <= 277:
## :               :...total_night_minutes <= 224.8: no (13)
## :               :   total_night_minutes > 224.8: yes (5/1)
## :               total_day_minutes > 277:
## :               :...total_night_minutes > 151.9: yes (18)
## :                   total_night_minutes <= 151.9:
## :                   :...account_length <= 123: no (4)
## :                       account_length > 123: yes (2)
## total_day_minutes <= 264.4:
## :...number_customer_service_calls > 3:
##     :...total_day_minutes <= 160.2:
##     :   :...total_eve_charge <= 19.83: yes (79/3)
##     :   :   total_eve_charge > 19.83:
##     :   :   :...total_day_minutes <= 120.5: yes (10)
##     :   :       total_day_minutes > 120.5: no (13/3)
##     :   total_day_minutes > 160.2:
##     :   :...total_eve_charge > 12.05: no (130/24)
##     :       total_eve_charge <= 12.05:
##     :       :...total_eve_calls <= 125: yes (16/2)
##     :           total_eve_calls > 125: no (3)
##     number_customer_service_calls <= 3:
##     :...international_plan = yes:
##         :...total_intl_calls <= 2: yes (51)
##         :   total_intl_calls > 2:
##         :   :...total_intl_minutes <= 13.1: no (173/7)
##         :       total_intl_minutes > 13.1: yes (43)
##         international_plan = no:
##         :...total_day_minutes <= 223.2: no (2221/60)
##             total_day_minutes > 223.2:
##             :...total_eve_charge <= 20.5: no (295/22)
##                 total_eve_charge > 20.5:
##                 :...voice_mail_plan = yes: no (20)
##                     voice_mail_plan = no:
##                     :...total_night_minutes > 174.2: yes (50/8)
##                         total_night_minutes <= 174.2:
##                         :...total_day_minutes <= 246.6: no (12)
##                             total_day_minutes > 246.6:
##                             :...total_day_charge <= 43.33: yes (4)
##                                 total_day_charge > 43.33: no (2)
## 
## 
## Evaluation on training data (3333 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      27  136( 4.1%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     365   118    (a): class yes
##      18  2832    (b): class no
## 
## 
##  Attribute usage:
## 
##  100.00% total_day_minutes
##   93.67% number_customer_service_calls
##   87.73% international_plan
##   20.73% total_eve_charge
##    8.97% voice_mail_plan
##    8.01% total_intl_calls
##    6.48% total_intl_minutes
##    6.33% total_night_minutes
##    4.74% total_eve_minutes
##    0.57% total_eve_calls
##    0.18% account_length
##    0.18% total_day_charge
## 
## 
## Time: 0.1 secs

# The tree has too many branches, so drop variables with low attribute usage and refit
drops <- c("total_day_charge", "account_length", "total_eve_calls", "total_day_calls", "total_eve_minutes")
churnTrain2 <- churnTrain[, !(names(churnTrain) %in% drops)]

model2 <- C5.0(churn ~ ., data = churnTrain2, control = c5_options, rules = FALSE)
summary(model2)

## 
## Call:
## C5.0.formula(formula = churn ~ ., data = churnTrain2, control
##  = c5_options, rules = FALSE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sun Mar 19 20:59:53 2017
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 3333 cases (15 attributes) from undefined.data
## 
## Decision tree:
## 
## total_day_minutes > 264.4:
## :...voice_mail_plan = yes:
## :   :...international_plan = no: no (45/1)
## :   :   international_plan = yes:
## :   :   :...total_day_minutes <= 275.4: yes (4)
## :   :       total_day_minutes > 275.4: no (4/1)
## :   voice_mail_plan = no:
## :   :...total_eve_charge > 15.95:
## :       :...total_night_minutes > 126.9: yes (94/1)
## :       :   total_night_minutes <= 126.9:
## :       :   :...total_day_minutes <= 277: no (4)
## :       :       total_day_minutes > 277: yes (3)
## :       total_eve_charge <= 15.95:
## :       :...total_eve_charge <= 12.26: no (15/1)
## :           total_eve_charge > 12.26:
## :           :...total_day_minutes <= 277:
## :               :...total_night_minutes <= 224.8: no (13)
## :               :   total_night_minutes > 224.8: yes (5/1)
## :               total_day_minutes > 277:
## :               :...total_night_minutes <= 151.9: no (6/2)
## :                   total_night_minutes > 151.9: yes (18)
## total_day_minutes <= 264.4:
## :...number_customer_service_calls > 3:
##     :...total_day_minutes > 160.2:
##     :   :...total_eve_charge <= 12.05: yes (19/5)
##     :   :   total_eve_charge > 12.05: no (130/24)
##     :   total_day_minutes <= 160.2:
##     :   :...total_eve_charge <= 19.83: yes (79/3)
##     :       total_eve_charge > 19.83:
##     :       :...total_day_minutes <= 120.5: yes (10)
##     :           total_day_minutes > 120.5: no (13/3)
##     number_customer_service_calls <= 3:
##     :...international_plan = yes:
##         :...total_intl_calls <= 2: yes (51)
##         :   total_intl_calls > 2:
##         :   :...total_intl_minutes <= 13.1: no (173/7)
##         :       total_intl_minutes > 13.1: yes (43)
##         international_plan = no:
##         :...total_day_minutes <= 223.2: no (2221/60)
##             total_day_minutes > 223.2:
##             :...total_eve_charge <= 20.5: no (295/22)
##                 total_eve_charge > 20.5:
##                 :...voice_mail_plan = yes: no (20)
##                     voice_mail_plan = no:
##                     :...total_night_minutes > 174.2: yes (50/8)
##                         total_night_minutes <= 174.2:
##                         :...total_day_minutes <= 246.6: no (12)
##                             total_day_minutes > 246.6:
##                             :...total_day_minutes <= 254.9: yes (4)
##                                 total_day_minutes > 254.9: no (2)
## 
## 
## Evaluation on training data (3333 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      26  139( 4.2%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     362   121    (a): class yes
##      18  2832    (b): class no
## 
## 
##  Attribute usage:
## 
##  100.00% total_day_minutes
##   93.67% number_customer_service_calls
##   87.73% international_plan
##   23.76% total_eve_charge
##    8.97% voice_mail_plan
##    8.01% total_intl_calls
##    6.48% total_intl_minutes
##    6.33% total_night_minutes
## 
## 
## Time: 0.1 secs

plot(model2, type = "simple")

pred_train <- predict(model2, churnTrain2, type = "class")
confusionMatrix(pred_train, churnTrain2$churn)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  yes   no
##        yes  362   18
##        no   121 2832
##                                           
##                Accuracy : 0.9583          
##                  95% CI : (0.9509, 0.9648)
##     No Information Rate : 0.8551          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8154          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7495          
##             Specificity : 0.9937          
##          Pos Pred Value : 0.9526          
##          Neg Pred Value : 0.9590          
##              Prevalence : 0.1449          
##          Detection Rate : 0.1086          
##    Detection Prevalence : 0.1140          
##       Balanced Accuracy : 0.8716          
##                                           
##        'Positive' Class : yes             
##
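The summary statistics reported by `confusionMatrix()` can be checked by hand from the four cell counts. The sketch below uses the training counts printed above with "yes" as the positive class; `kappa_hat` and `p_chance` are names introduced here for illustration.

```r
# Counts copied from the training confusion matrix above ("yes" is the positive class)
TP <- 362; FP <- 18      # predicted yes
FN <- 121; TN <- 2832    # predicted no
n  <- TP + FP + FN + TN

accuracy    <- (TP + TN) / n
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)

# Cohen's kappa: agreement beyond what the row/column totals would give by chance
p_chance  <- ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / n^2
kappa_hat <- (accuracy - p_chance) / (1 - p_chance)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, kappa = kappa_hat), 4)
```

The high accuracy is driven largely by the majority class (no); the sensitivity of about 0.75 shows that a quarter of the actual churners are missed even on the training data.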

# Test
head(churnTest)

##   state account_length     area_code international_plan voice_mail_plan
## 1    HI            101 area_code_510                 no              no
## 2    MT            137 area_code_510                 no              no
## 3    OH            103 area_code_408                 no             yes
## 4    NM             99 area_code_415                 no              no
## 5    SC            108 area_code_415                 no              no
## 6    IA            117 area_code_415                 no              no
##   number_vmail_messages total_day_minutes total_day_calls total_day_charge
## 1                     0              70.9             123            12.05
## 2                     0             223.6              86            38.01
## 3                    29             294.7              95            50.10
## 4                     0             216.8             123            36.86
## 5                     0             197.4              78            33.56
## 6                     0             226.5              85            38.51
##   total_eve_minutes total_eve_calls total_eve_charge total_night_minutes
## 1             211.9              73            18.01               236.0
## 2             244.8             139            20.81                94.2
## 3             237.3             105            20.17               300.3
## 4             126.4              88            10.74               220.6
## 5             124.0             101            10.54               204.5
## 6             141.6              68            12.04               223.0
##   total_night_calls total_night_charge total_intl_minutes total_intl_calls
## 1                73              10.62               10.6                3
## 2                81               4.24                9.5                7
## 3               127              13.51               13.7                6
## 4                82               9.93               15.7                2
## 5               107               9.20                7.7                4
## 6                90              10.04                6.9                5
##   total_intl_charge number_customer_service_calls churn
## 1              2.86                             3    no
## 2              2.57                             0    no
## 3              3.70                             1    no
## 4              4.24                             1    no
## 5              2.08                             2    no
## 6              1.86                             1    no

churnTest$pred <- predict(model2, churnTest, type = "class")        # predicted class
churnTest$pred_prob <- predict(model2, churnTest, type = "prob")    # class probabilities
confusionMatrix(churnTest$pred, churnTest$churn)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  yes   no
##        yes  146    9
##        no    78 1434
##                                         
##                Accuracy : 0.9478        
##                  95% CI : (0.936, 0.958)
##     No Information Rate : 0.8656        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.7421        
##  Mcnemar's Test P-Value : 3.091e-13     
##                                         
##             Sensitivity : 0.65179       
##             Specificity : 0.99376       
##          Pos Pred Value : 0.94194       
##          Neg Pred Value : 0.94841       
##              Prevalence : 0.13437       
##          Detection Rate : 0.08758       
##    Detection Prevalence : 0.09298       
##       Balanced Accuracy : 0.82277       
##                                         
##        'Positive' Class : yes           
##

# Model Evaluation by ROCR chart
head(churnTest$pred_prob)

##          yes        no
## 1 0.02706792 0.9729321
## 2 0.01114727 0.9888527
## 3 0.02488945 0.9751106
## 4 0.02706792 0.9729321
## 5 0.02706792 0.9729321
## 6 0.07481390 0.9251861

c5_pred <- prediction(churnTest$pred_prob[, "yes"], churnTest$churn)
c5_model.perf <- performance(c5_pred, "tpr", "fpr")
# True positive rate (tpr) = Sensitivity
# False positive rate (fpr) = 1 - Specificity
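Each point on the ROC curve is a (fpr, tpr) pair at one cutoff. As a sketch, the point for the default cutoff of 0.5 can be computed from the test-set confusion matrix printed earlier; the same pair should appear among the x.values and y.values of the performance object.

```r
# Counts copied from the test-set confusion matrix above (cutoff 0.5, positive class "yes")
TP <- 146; FN <- 78
FP <- 9;   TN <- 1434

tpr <- TP / (TP + FN)   # sensitivity
fpr <- FP / (FP + TN)   # 1 - specificity

round(c(fpr = fpr, tpr = tpr), 6)
```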

c5_model.perf

## An object of class "performance"
## Slot "x.name":
## [1] "False positive rate"
## 
## Slot "y.name":
## [1] "True positive rate"
## 
## Slot "alpha.name":
## [1] "Cutoff"
## 
## Slot "x.values":
## [[1]]
##  [1] 0.000000000 0.000000000 0.000000000 0.000000000 0.001386001
##  [6] 0.002079002 0.002079002 0.002772003 0.002772003 0.002772003
## [11] 0.005544006 0.006237006 0.006237006 0.006930007 0.009009009
## [16] 0.049203049 0.158697159 0.169092169 0.227304227 0.228690229
## [21] 0.975744976 0.989604990 0.992376992 0.995148995 1.000000000
## 
## 
## Slot "y.values":
## [[1]]
##  [1] 0.0000000 0.1026786 0.1830357 0.3616071 0.4017857 0.5491071 0.5625000
##  [8] 0.5669643 0.6294643 0.6339286 0.6428571 0.6517857 0.6562500 0.6562500
## [15] 0.6696429 0.7410714 0.7991071 0.8258929 0.8392857 0.8392857 0.9821429
## [22] 0.9866071 0.9955357 1.0000000 1.0000000
## 
## 
## Slot "alpha.values":
## [[1]]
##  [1]        Inf 0.98355605 0.98056624 0.98047278 0.95499550 0.95181143
##  [7] 0.92226496 0.82898290 0.82637087 0.78622863 0.70724572 0.69081909
## [13] 0.30641636 0.22898290 0.22463675 0.18431233 0.07481390 0.07155716
## [19] 0.04106273 0.02898290 0.02706792 0.02488945 0.01114727 0.01035104
## [25] 0.00690069

plot(c5_model.perf, col = "red")

AUROC <- performance(c5_pred, "auc")
AUROC@y.values

## [[1]]
## [1] 0.8804094

# 0.8804094 : Good
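The AUC is the area under the ROC curve, so it can be approximated from the curve's points with the trapezoidal rule. The sketch below uses the x.values and y.values printed above (copied by hand, so rounded to the printed precision) and should land very close to the value ROCR reports.

```r
# Values copied from the x.values / y.values slots printed above
fpr <- c(0, 0, 0, 0, 0.001386001, 0.002079002, 0.002079002, 0.002772003,
         0.002772003, 0.002772003, 0.005544006, 0.006237006, 0.006237006,
         0.006930007, 0.009009009, 0.049203049, 0.158697159, 0.169092169,
         0.227304227, 0.228690229, 0.975744976, 0.989604990, 0.992376992,
         0.995148995, 1)
tpr <- c(0, 0.1026786, 0.1830357, 0.3616071, 0.4017857, 0.5491071, 0.5625,
         0.5669643, 0.6294643, 0.6339286, 0.6428571, 0.6517857, 0.65625,
         0.65625, 0.6696429, 0.7410714, 0.7991071, 0.8258929, 0.8392857,
         0.8392857, 0.9821429, 0.9866071, 0.9955357, 1, 1)

# Trapezoidal rule: width of each step times the mean height of its two ends
k   <- length(fpr)
auc <- sum(diff(fpr) * (tpr[-1] + tpr[-k]) / 2)
round(auc, 4)
```

Most of the area comes from the single long segment between fpr 0.23 and 0.98, where many test cases share the same predicted probability.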

c5_model.lift <- performance(c5_pred, "lift", "rpp")  # rpp : Rate of positive predictions
plot(c5_model.lift, col = "red")
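Lift compares the churn rate among the cases flagged as positive with the overall churn rate; equivalently it is tpr divided by rpp. A minimal sketch for the default cutoff, using the test-set confusion matrix counts printed earlier:

```r
# Counts copied from the test-set confusion matrix above (positive class "yes")
TP <- 146; FP <- 9
FN <- 78;  TN <- 1434
n  <- TP + FP + FN + TN

rpp        <- (TP + FP) / n      # rate of positive predictions
tpr        <- TP / (TP + FN)
precision  <- TP / (TP + FP)
prevalence <- (TP + FN) / n

lift <- tpr / rpp                # identical to precision / prevalence
round(c(rpp = rpp, lift = lift), 3)
```

A lift of about 7 at an rpp of about 0.09 means that the 9% of customers the model flags contain churners at roughly seven times the base rate.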


woosa7
