R Classification - Decision Tree

Supervised Learning - a target variable exists
 1) Classification: the target variable is categorical
 2) Estimation/Prediction: the target variable is continuous

Unsupervised Learning - no target variable; uses the internal structure of the data
 1) Clustering
 2) Association rules
 3) Sequence rules


--------------------------------------------------------------
Classification
--------------------------------------------------------------

A supervised learning technique.
Assigns data to predefined groups (the target variable is categorical).
- Decision Tree
- Neural Networks
- SVM (Support Vector Machines)
- Ensemble Methods
- kNN (k-Nearest Neighbors)
- Logistic Regression
- Bayesian classification (Naive Bayes & Bayesian Belief Networks)

--------------------------------------------------------------
1. Decision Tree

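(1) Uses of Decision Trees

- Segmentation/classification: split the data into a few groups or grades that share similar characteristics
- Prediction: find rules in the data and use them to predict future events
- Dimensionality reduction and variable selection: pick out the variables that most affect the target variable
- Identifying interaction effects between variables
- Merging categories of the target variable, or binning (discretizing) a continuous target variable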

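(2) Characteristics of Decision Trees

- The results are easy to explain.
- Building the model is not computationally complex.
- Large data sets can be processed quickly.
- Relatively insensitive to outliers and noisy data.
- Not strongly affected by irrelevant variables.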

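(3) Decision Tree Algorithms

a. CART (Classification and Regression Tree)
 - The most widely used algorithm.
 - The target variable can be categorical or continuous. Binary splits only.
 - Impurity measure: the Gini index for a categorical target, variance for a continuous target.
 - Can search for the best split not only among individual input variables but also among linear combinations of input variables.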
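
To make the Gini index concrete, the following base R sketch computes the impurity of the iris class labels before and after one candidate binary split. The helper names `gini` and `split_gini` and the threshold 1.9 are chosen here for illustration only; this is not how any CART implementation is actually coded, just the arithmetic behind the criterion.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2)
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted Gini impurity of the two child nodes after splitting x at a threshold
split_gini <- function(x, labels, threshold) {
  left <- x <= threshold
  n <- length(labels)
  (sum(left) / n) * gini(labels[left]) +
    (sum(!left) / n) * gini(labels[!left])
}

gini(iris$Species)                                 # impurity of the root node
split_gini(iris$Petal.Length, iris$Species, 1.9)   # impurity after the split
```

With three equally frequent classes the root impurity is 2/3. The split at Petal.Length <= 1.9 isolates all 50 setosa rows, which leaves a weighted impurity of 1/3. A CART-style search would compare many such thresholds and keep the one with the lowest value.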

b. C4.5 & C5.0
 - Multiway splits are possible.
 - Nominal target variable.
 - Impurity measure: the entropy index.
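
The entropy measure can be sketched in base R in the same way. The helpers `entropy` and `info_gain` below are hypothetical names for this illustration. C4.5 itself normalizes the gain into a gain ratio, which the sketch leaves out for brevity.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]              # drop empty classes so 0 * log2(0) is never evaluated
  -sum(p * log2(p))
}

# Information gain of a binary split of x at a threshold
info_gain <- function(x, labels, threshold) {
  left <- x <= threshold
  n <- length(labels)
  after <- (sum(left) / n) * entropy(labels[left]) +
    (sum(!left) / n) * entropy(labels[!left])
  entropy(labels) - after
}

entropy(iris$Species)                             # log2(3) for three balanced classes
info_gain(iris$Petal.Length, iris$Species, 1.9)   # gain from isolating setosa
```

The root entropy is log2(3), about 1.585 bits. After the split, the left node is pure and the right node holds two balanced classes (1 bit), so the gain is log2(3) - 2/3, about 0.918 bits.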

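c. CHAID (Chi-squared Automatic Interaction Detection)
 - Stops growing the tree at a suitable size instead of pruning it afterwards.
 - Categorical target variable when the chi-squared statistic is used (an F-test is used for a continuous target); input variables must be categorical, so continuous inputs are binned first.
 - Split criterion: the chi-squared statistic.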
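
The chi-squared statistic mentioned above can be computed in base R from a contingency table between a binned input and the class labels. This is only a sketch of the test that such a criterion is built on, not an implementation of CHAID. The cut point 1.9 and the variable names are assumptions made for the example.

```r
# Bin a continuous input into two groups, then cross-tabulate with the classes
petal_group <- cut(iris$Petal.Length,
                   breaks = c(-Inf, 1.9, Inf),
                   labels = c("<=1.9", ">1.9"))
tab <- table(petal_group, iris$Species)
tab

# Pearson chi-squared test of independence between the binned input and Species
chi <- chisq.test(tab)
chi$statistic
chi$p.value
```

A larger statistic (smaller p-value) indicates a stronger association between the candidate input and the classes, which is why it can be used to rank candidate splits.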

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

1. Decision Tree with the party package

library(party)
library(caret)

# Split into train / test data (6:4 or 7:3)
# No seed is set, so the split and the output below change from run to run.
idx <- sample(2, nrow(iris), replace = TRUE, prob = c(0.6, 0.4))
table(idx)

## idx
##  1  2 
## 98 52

train_1 <- iris[idx == 1, ]
test_1 <- iris[idx == 2, ]
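
Because `sample(2, ..., prob = ...)` assigns each row independently, the split is only approximately 6:4 (98 and 52 rows above) and differs on every run. If an exact, reproducible split is preferred, one possible base R variant is sketched below. The seed value 42 and the names `train_idx`, `train_s`, and `test_s` are arbitrary choices for this example.

```r
set.seed(42)                                    # arbitrary seed, for reproducibility only
n <- nrow(iris)
train_idx <- sample(n, size = round(0.6 * n))   # exactly 60% of the row indices

train_s <- iris[train_idx, ]
test_s  <- iris[-train_idx, ]

c(train = nrow(train_s), test = nrow(test_s))
```

This gives 90 training rows and 60 test rows with no overlap. For a split that also preserves the class proportions, caret's `createDataPartition()` is a commonly used alternative.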

# Fit the model on the training data
tree_model <- ctree(Species ~ ., data = train_1)
tree_model

## 
##   Conditional inference tree with 4 terminal nodes
## 
## Response:  Species 
## Inputs:  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width 
## Number of observations:  98 
## 
## 1) Petal.Length <= 1.9; criterion = 1, statistic = 91.023
##   2)*  weights = 33 
## 1) Petal.Length > 1.9
##   3) Petal.Width <= 1.6; criterion = 1, statistic = 45.1
##     4) Petal.Length <= 4.6; criterion = 0.997, statistic = 11.075
##       5)*  weights = 27 
##     4) Petal.Length > 4.6
##       6)*  weights = 9 
##   3) Petal.Width > 1.6
##     7)*  weights = 29

plot(tree_model)

plot(tree_model, type = "simple")
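
In the printed tree, `criterion` is 1 minus the p-value of the association test at that node, and `ctree` splits a node only when this value exceeds `mincriterion` (0.95 by default). The sketch below raises that threshold through `ctree_control()`. It is fitted on the full iris data so that it runs on its own, and the value 0.99 and the object names are illustrative assumptions.

```r
library(party)

# Require stronger evidence (1 - p-value above 0.99) before a node is split
strict_ctrl <- ctree_control(mincriterion = 0.99)
strict_tree <- ctree(Species ~ ., data = iris, controls = strict_ctrl)

strict_tree
```

A higher `mincriterion` tends to produce a smaller tree, since weakly supported splits are rejected.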

# Compare predicted and actual classes
table(train_1$Species)                        # actual classes

## 
##     setosa versicolor  virginica 
##         33         34         31

train_1$pred <- predict(tree_model)
confusionMatrix(train_1$pred, train_1$Species)

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         33          0         0
##   versicolor      0         33         3
##   virginica       0          1        28
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9592          
##                  95% CI : (0.8988, 0.9888)
##     No Information Rate : 0.3469          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9387          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9706           0.9032
## Specificity                 1.0000            0.9531           0.9851
## Pos Pred Value              1.0000            0.9167           0.9655
## Neg Pred Value              1.0000            0.9839           0.9565
## Prevalence                  0.3367            0.3469           0.3163
## Detection Rate              0.3367            0.3367           0.2857
## Detection Prevalence        0.3367            0.3673           0.2959
## Balanced Accuracy           1.0000            0.9619           0.9442

# Validate on the test data
test_1$pred <- predict(tree_model, newdata = test_1)
confusionMatrix(test_1$pred, test_1$Species)

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         17          0         0
##   versicolor      0         15         1
##   virginica       0          1        18
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9615          
##                  95% CI : (0.8679, 0.9953)
##     No Information Rate : 0.3654          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9422          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9375           0.9474
## Specificity                 1.0000            0.9722           0.9697
## Pos Pred Value              1.0000            0.9375           0.9474
## Neg Pred Value              1.0000            0.9722           0.9697
## Prevalence                  0.3269            0.3077           0.3654
## Detection Rate              0.3269            0.2885           0.3462
## Detection Prevalence        0.3269            0.3077           0.3654
## Balanced Accuracy           1.0000            0.9549           0.9585
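
The headline numbers in this output follow directly from the confusion matrix. As a check, the sketch below re-enters the test-set counts shown above by hand and recomputes accuracy, sensitivity, and positive predictive value in base R. The object name `cm` is chosen for this example.

```r
classes <- c("setosa", "versicolor", "virginica")

# Rows are predictions, columns are the reference (actual) classes
cm <- matrix(c(17,  0,  0,
                0, 15,  1,
                0,  1, 18),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

accuracy    <- sum(diag(cm)) / sum(cm)   # correct / total = 50 / 52
sensitivity <- diag(cm) / colSums(cm)    # correct / actual count, per class
ppv         <- diag(cm) / rowSums(cm)    # correct / predicted count, per class

round(accuracy, 4)
round(sensitivity, 4)
round(ppv, 4)
```

This reproduces Accuracy 0.9615, Sensitivity 1.0000 / 0.9375 / 0.9474, and Pos Pred Value 1.0000 / 0.9375 / 0.9474 from the `confusionMatrix` output.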

2. Decision Tree with the C50 package

library(C50)

# Split into train / test data (6:4)
# No seed is set, so the split and the output below change from run to run.
idx <- sample(2, nrow(iris), replace = TRUE, prob = c(0.6, 0.4))
table(idx)

## idx
##  1  2 
## 93 57

train_2 <- iris[idx == 1, ]
test_2 <- iris[idx == 2, ]

# Modeling: no predictor winnowing, global pruning enabled, tree (not rule) output
c5_options <- C5.0Control(winnow = FALSE, noGlobalPruning = FALSE)
c5_model <- C5.0(Species ~ ., data = train_2, control = c5_options, rules = FALSE)
summary(c5_model)

## 
## Call:
## C5.0.formula(formula = Species ~ ., data = train_2, control =
##  c5_options, rules = FALSE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sun Mar 19 20:59:48 2017
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 93 cases (5 attributes) from undefined.data
## 
## Decision tree:
## 
## Petal.Length <= 1.7: setosa (25)
## Petal.Length > 1.7:
## :...Petal.Width <= 1.7: versicolor (37/4)
##     Petal.Width > 1.7: virginica (31)
## 
## 
## Evaluation on training data (93 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##       3    4( 4.3%)   <<
## 
## 
##     (a)   (b)   (c)    <-classified as
##    ----  ----  ----
##      25                (a): class setosa
##            33          (b): class versicolor
##             4    31    (c): class virginica
## 
## 
##  Attribute usage:
## 
##  100.00% Petal.Length
##   73.12% Petal.Width
## 
## 
## Time: 0.0 secs

plot(c5_model)
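
The "Attribute usage" table in the summary can also be retrieved as a data frame with `C5imp()`, which is convenient for the variable selection use mentioned earlier. The sketch below fits a separate model on the full iris data so that it is self-contained, so its percentages will not match the training-split summary above. The object names are illustrative.

```r
library(C50)

# Fit on the full iris data so the sketch runs on its own
imp_model <- C5.0(Species ~ ., data = iris)

# Percentage of training cases that reach a split using each predictor
imp <- C5imp(imp_model, metric = "usage")
imp
```

Predictors with 0% usage never appear in the tree and are candidates for removal.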

# Check the fit on the training data
train_2$pred <- predict(c5_model, newdata = train_2)
confusionMatrix(train_2$pred, train_2$Species)

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         25          0         0
##   versicolor      0         33         4
##   virginica       0          0        31
## 
## Overall Statistics
##                                           
##                Accuracy : 0.957           
##                  95% CI : (0.8935, 0.9882)
##     No Information Rate : 0.3763          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9349          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.8857
## Specificity                 1.0000            0.9333           1.0000
## Pos Pred Value              1.0000            0.8919           1.0000
## Neg Pred Value              1.0000            1.0000           0.9355
## Prevalence                  0.2688            0.3548           0.3763
## Detection Rate              0.2688            0.3548           0.3333
## Detection Prevalence        0.2688            0.3978           0.3333
## Balanced Accuracy           1.0000            0.9667           0.9429

# Validate on the test data
test_2$pred <- predict(c5_model, newdata = test_2)
confusionMatrix(test_2$pred, test_2$Species)

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         23          0         0
##   versicolor      2         16         1
##   virginica       0          1        14
## 
## Overall Statistics
##                                         
##                Accuracy : 0.9298        
##                  95% CI : (0.83, 0.9805)
##     No Information Rate : 0.4386        
##     P-Value [Acc > NIR] : 4.453e-15     
##                                         
##                   Kappa : 0.8928        
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 0.9200            0.9412           0.9333
## Specificity                 1.0000            0.9250           0.9762
## Pos Pred Value              1.0000            0.8421           0.9333
## Neg Pred Value              0.9412            0.9737           0.9762
## Prevalence                  0.4386            0.2982           0.2632
## Detection Rate              0.4035            0.2807           0.2456
## Detection Prevalence        0.4035            0.3333           0.2632
## Balanced Accuracy           0.9600            0.9331           0.9548


woosa7