탐색적 인자분석 Factor analysis

탐색적 인자분석(Exploratory Factor Analysis)

Factor Analysis : 인자분석 = 요인분석

심리학, 행동과학 등 많은 분야에서 직접 측정이 불가한 주된 관심의 개념을 간접적으로 측정
잠재변수 (latent variable) : 직접 측정이 불가하지만 간접적으로 측정할 수 있는 변수
(예) 시험점수 --> 지능

인자분석 : 측정변수와 잠재변수 사이의 관계를 밝히는 것.
탐색적 인자분석 : 어떤 측정변수가 어떤 인자에 관련된다는 특정한 가정 없이 조사
확인적 인자분석 : 사전에 가정된 특정한 인자 모형에 대해 측정하여 변수 사이의 공분산 또는 상관관계가 적합한지 검정.

f : 공통인자 common factor
u : 각 변수의 특정인자 specific factor. 공통인자를 완벽히 측정하지 못하는데서 오는 오차로 가정.
f, u 는 측정불가능
인자분석에서 추정하는 것은 분산 (공통인자의 분산 + 특정인자의 분산)
가정 : u 들간의 상호독립성, u와 f의 독립성

* 1 factor
X1 = L1*f1 + U1
var(X1) = L1**2 + var(U1) = 1

l : 인자적재값 (facor loading) - X와 f의 상관계수. 각 변수가 공통인자를 얼마나 반영하는지에 대한 계수.
li**2 : 공통인자에 의해 발생하는 분산의 비율 = Communality.
Communality 가 1에 가까우면 변수 Xi가 공통인자를 잘 반영한 것.
var(ui) 공통인자에 의해 설명되지 않는 부분. 특정인자의 분산.

* 2 factor
X1 = L11*f1 + L12*f2 + U1
var(x1) = L11**2 + L12^2 + var(U1)

# Example
app <- read.table("data/Applicant.TXT", header = T)
head(app)

##   ID X1.FL. X2.APP. X3.AA. X4.LA. X5.SC. X6.LC. X7.HON. X8.SMS. X9.EXP.
## 1  1      6       7      2      5      8      7       8       8       3
## 2  2      9      10      5      8     10      9       9      10       5
## 3  3      7       8      3      6      9      8       9       7       4
## 4  4      5       6      8      5      6      5       9       2       8
## 5  5      6       8      8      8      4      4       9       5       8
## 6  6      7       7      7      6      8      7      10       5       9
##   X10.DRV. X11.AMB. X12.GSP. X13.POT. X14.KJ. X15.SUIT.
## 1        8        9        7        5       7        10
## 2        9        9        8        8       8        10
## 3        9        9        8        6       8        10
## 4        4        5        8        7       6         5
## 5        5        5        8        8       7         7
## 6        6        5        8        6       6         6

app <- app[,-1]   # ID 제거

library(psych)
describe(app)

##           vars  n mean   sd median trimmed  mad min max range  skew
## X1.FL.       1 48 6.00 2.67    6.0    6.10 2.97   0  10    10 -0.24
## X2.APP.      2 48 7.08 1.97    7.0    7.22 1.48   3  10     7 -0.74
## X3.AA.       3 48 7.08 1.99    7.0    7.17 1.48   2  10     8 -0.37
## X4.LA.       4 48 6.15 2.81    7.0    6.35 2.97   0  10    10 -0.60
## X5.SC.       5 48 6.94 2.42    8.0    7.12 1.48   1  10     9 -0.72
## X6.LC.       6 48 6.31 3.17    8.0    6.58 2.97   0  10    10 -0.63
## X7.HON.      7 48 8.04 2.53    9.0    8.53 1.48   0  10    10 -1.72
## X8.SMS.      8 48 4.85 3.44    4.5    4.83 3.71   0  10    10  0.17
## X9.EXP.      9 48 4.23 3.31    3.0    4.08 2.97   0  10    10  0.56
## X10.DRV.    10 48 5.31 2.95    5.0    5.28 4.45   0  10    10  0.11
## X11.AMB.    11 48 5.98 2.94    6.0    6.10 4.45   0  10    10 -0.22
## X12.GSP.    12 48 6.25 3.04    7.0    6.50 2.22   0  10    10 -0.77
## X13.POT.    13 48 5.69 3.18    6.0    5.83 2.97   0  10    10 -0.49
## X14.KJ.     14 48 5.56 2.66    5.0    5.67 2.97   0  10    10 -0.27
## X15.SUIT.   15 48 5.96 3.30    6.0    6.15 4.45   0  10    10 -0.24
##           kurtosis   se
## X1.FL.       -0.81 0.39
## X2.APP.      -0.29 0.28
## X3.AA.       -0.46 0.29
## X4.LA.       -0.56 0.40
## X5.SC.       -0.55 0.35
## X6.LC.       -0.89 0.46
## X7.HON.       2.50 0.37
## X8.SMS.      -1.35 0.50
## X9.EXP.      -1.12 0.48
## X10.DRV.     -1.29 0.43
## X11.AMB.     -1.08 0.42
## X12.GSP.     -0.52 0.44
## X13.POT.     -1.01 0.46
## X14.KJ.      -0.28 0.38
## X15.SUIT.    -1.24 0.48

pairs.panels(app)

app_s <- scale(app)

fa1 <- factanal(app_s, 4)        # cor
print(fa1, digits = 2, sort = T)

## 
## Call:
## factanal(x = app_s, factors = 4)
## 
## Uniquenesses:
##    X1.FL.   X2.APP.    X3.AA.    X4.LA.    X5.SC.    X6.LC.   X7.HON. 
##      0.44      0.68      0.52      0.18      0.12      0.20      0.34 
##   X8.SMS.   X9.EXP.  X10.DRV.  X11.AMB.  X12.GSP.  X13.POT.   X14.KJ. 
##      0.14      0.36      0.23      0.14      0.15      0.09      0.00 
## X15.SUIT. 
##      0.25 
## 
## Loadings:
##           Factor1 Factor2 Factor3 Factor4
## X5.SC.     0.92            0.14          
## X6.LC.     0.84    0.11    0.29          
## X8.SMS.    0.88    0.26                  
## X10.DRV.   0.77    0.39    0.17          
## X11.AMB.   0.90    0.18                  
## X12.GSP.   0.79    0.28    0.35    0.15  
## X13.POT.   0.74    0.35    0.43    0.25  
## X1.FL.     0.13    0.72    0.11   -0.12  
## X9.EXP.            0.78            0.17  
## X15.SUIT.  0.36    0.77            0.14  
## X4.LA.     0.23    0.24    0.84          
## X7.HON.    0.25   -0.22    0.74          
## X3.AA.             0.13            0.68  
## X14.KJ.    0.42    0.39    0.55   -0.60  
## X2.APP.    0.46    0.14    0.24    0.16  
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings       5.57    2.47    2.10    1.01
## Proportion Var    0.37    0.16    0.14    0.07
## Cumulative Var    0.37    0.54    0.68    0.74
## 
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 84 on 51 degrees of freedom.
## The p-value is 0.00247

fa2 <- factanal(app, 4)          # cov
print(fa2, digits = 2, sort = T)

## 
## Call:
## factanal(x = app, factors = 4)
## 
## Uniquenesses:
##    X1.FL.   X2.APP.    X3.AA.    X4.LA.    X5.SC.    X6.LC.   X7.HON. 
##      0.44      0.68      0.52      0.18      0.12      0.20      0.34 
##   X8.SMS.   X9.EXP.  X10.DRV.  X11.AMB.  X12.GSP.  X13.POT.   X14.KJ. 
##      0.14      0.36      0.23      0.14      0.15      0.09      0.00 
## X15.SUIT. 
##      0.25 
## 
## Loadings:
##           Factor1 Factor2 Factor3 Factor4
## X5.SC.     0.92            0.14          
## X6.LC.     0.84    0.11    0.29          
## X8.SMS.    0.88    0.26                  
## X10.DRV.   0.77    0.39    0.17          
## X11.AMB.   0.90    0.18                  
## X12.GSP.   0.79    0.28    0.35    0.15  
## X13.POT.   0.74    0.35    0.43    0.25  
## X1.FL.     0.13    0.72    0.11   -0.12  
## X9.EXP.            0.78            0.17  
## X15.SUIT.  0.36    0.77            0.14  
## X4.LA.     0.23    0.24    0.84          
## X7.HON.    0.25   -0.22    0.74          
## X3.AA.             0.13            0.68  
## X14.KJ.    0.42    0.39    0.55   -0.60  
## X2.APP.    0.46    0.14    0.24    0.16  
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings       5.57    2.47    2.10    1.01
## Proportion Var    0.37    0.16    0.14    0.07
## Cumulative Var    0.37    0.54    0.68    0.74
## 
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 84 on 51 degrees of freedom.
## The p-value is 0.00247

# X6.LC 의 Communality = 0.84**2 + 0.11**2 + 0.29**2 = 0.80 (4개 factor에 의해 80% 설명됨)
# X6.LC 의 Uniquenesse = 1 - Communality = 0.20
# 즉, Uniquenesse 값이 작은 변수가 공통인자에 대한 설명력이 높다.

1 - fa1$uniquenesses # = Communality

##    X1.FL.   X2.APP.    X3.AA.    X4.LA.    X5.SC.    X6.LC.   X7.HON. 
## 0.5572552 0.3152922 0.4794454 0.8154054 0.8806047 0.8022471 0.6611150 
##   X8.SMS.   X9.EXP.  X10.DRV.  X11.AMB.  X12.GSP.  X13.POT.   X14.KJ. 
## 0.8618314 0.6430829 0.7743963 0.8630327 0.8474177 0.9104711 0.9950000 
## X15.SUIT. 
## 0.7484600

# X5에서 X13까지 factor1에 대한 로딩값이 크다.
# X1,9,15는 factor2에 대한 로딩값이 크다.

# x14 = 0.42*f1 + 0.39*f2 + 0.55 * f3 - 0.60 * f4

# loading matrix 세로 제곱합 : 각 factor의 설명력 (SS loadings) 총분산에서 factor가 설명해주는 양
# Factor1 = 5.57 / 15 (변수갯수) = 0.37

# Cumulative Var = 0.74 : 4개의 factor에 의해 원변수 변동량의 74%가 설명된다.

# Test of the hypothesis that 4 factors are sufficient. = H0 (귀무가설)
# p-value is 0.00247 < 0.05  귀무가설 기각 --> 4개로는 충분하지 않다.
# factor 갯수 선택시 이 방법에 의존하지는 말 것.
# 요인분석은 factor에 대한 해석이 가장 중요하다.

# factor loading 값의 산점도

load = fa1$loadings
plot(load, type = "n")
text(load, labels = colnames(app_s), cex = 0.7)

# fator rotation (인자 회전)

# 인자적재값의 구별이 쉬운 고유벡터를 찾아 회전

# (1) 직교 회전 (orthogonal ratation) : 회전된 인자들이 서로 상관되지 않도록 제약
# varimax : 한 공통인자에 대해 각 변수가 가지는 인자적재값 제곱의 분산이 최대가 되도록 변환
#           loadings matrix 각 열의 분산을 최대화 (가로 방향)
# quartimax : 한 변수가 각각의 공통인자에서 차지하는 비중의 제곱에 대한 분산을 최대화
#             loadings matrix 각 행의 분산을 최대화 (세로 방향)

# (2) 사각 회전 (oblique rotation) : 상관된 인자들을 허용
# oblimin : 인자들 사이의 상관성 정도를 제어
# promax : 회전에 의해 적재값을 어떤 승수로 올리는 방법. 인자들 사이에 낮은 상관성을 갖도록 함.


fa1 = factanal(app, 4)   # default : varimax
fa2 = factanal(app, 4, rotation = "none")
print(fa2, digits = 2, sort = T)

## 
## Call:
## factanal(x = app, factors = 4, rotation = "none")
## 
## Uniquenesses:
##    X1.FL.   X2.APP.    X3.AA.    X4.LA.    X5.SC.    X6.LC.   X7.HON. 
##      0.44      0.68      0.52      0.18      0.12      0.20      0.34 
##   X8.SMS.   X9.EXP.  X10.DRV.  X11.AMB.  X12.GSP.  X13.POT.   X14.KJ. 
##      0.14      0.36      0.23      0.14      0.15      0.09      0.00 
## X15.SUIT. 
##      0.25 
## 
## Loadings:
##           Factor1 Factor2 Factor3 Factor4
## X4.LA.     0.70            0.12    0.55  
## X10.DRV.   0.67    0.54           -0.15  
## X14.KJ.    0.99   -0.10                  
## X5.SC.     0.55    0.64   -0.40          
## X6.LC.     0.60    0.65   -0.13          
## X8.SMS.    0.63    0.64           -0.21  
## X11.AMB.   0.62    0.65   -0.14   -0.19  
## X12.GSP.   0.62    0.66            0.11  
## X13.POT.   0.62    0.67    0.18    0.22  
## X1.FL.     0.47            0.55   -0.18  
## X9.EXP.    0.24    0.19    0.72   -0.19  
## X15.SUIT.  0.44    0.37    0.62   -0.17  
## X7.HON.    0.46           -0.27    0.60  
## X2.APP.    0.33    0.43            0.14  
## X3.AA.    -0.27    0.48    0.34    0.26  
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings       5.02    3.45    1.65    1.03
## Proportion Var    0.33    0.23    0.11    0.07
## Cumulative Var    0.33    0.57    0.67    0.74
## 
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 84 on 51 degrees of freedom.
## The p-value is 0.00247

library(psych)
library(GPArotation)

fa3 = fa(app, 4, rotate = "quartimax")
print(fa3, digits = 2, sort = T)

## Factor Analysis using method =  minres
## Call: fa(r = app, nfactors = 4, rotate = "quartimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##           item  MR2   MR3   MR4   MR1   h2    u2 com
## X11.AMB.    11 0.92  0.00 -0.11  0.08 0.86 0.137 1.0
## X8.SMS.      8 0.92  0.08 -0.11  0.08 0.86 0.138 1.1
## X5.SC.       5 0.89 -0.28 -0.07  0.07 0.88 0.120 1.2
## X12.GSP.    12 0.89  0.12  0.18 -0.10 0.85 0.153 1.2
## X6.LC.       6 0.89 -0.05  0.10 -0.04 0.80 0.198 1.0
## X13.POT.    13 0.87  0.21  0.28 -0.18 0.91 0.089 1.4
## X10.DRV.    10 0.84  0.22  0.00  0.11 0.77 0.226 1.2
## X2.APP.      2 0.52  0.06  0.15 -0.14 0.32 0.685 1.3
## X9.EXP.      9 0.23  0.77 -0.05 -0.07 0.64 0.356 1.2
## X15.SUIT.   15 0.51  0.70 -0.01 -0.05 0.75 0.251 1.8
## X1.FL.       1 0.28  0.65  0.08  0.21 0.56 0.443 1.6
## X4.LA.       4 0.44  0.13  0.76  0.14 0.82 0.185 1.8
## X7.HON.      7 0.36 -0.31  0.66  0.04 0.66 0.339 2.0
## X14.KJ.     14 0.59  0.19  0.40  0.67 1.00 0.005 2.8
## X3.AA.       3 0.11  0.19  0.04 -0.65 0.48 0.520 1.2
## 
##                        MR2  MR3  MR4  MR1
## SS loadings           6.85 1.89 1.36 1.05
## Proportion Var        0.46 0.13 0.09 0.07
## Cumulative Var        0.46 0.58 0.67 0.74
## Proportion Explained  0.61 0.17 0.12 0.09
## Cumulative Proportion 0.61 0.78 0.91 1.00
## 
## Mean item complexity =  1.5
## Test of the hypothesis that 4 factors are sufficient.
## 
## The degrees of freedom for the null model are  105  and the objective function was  15.68 with Chi Square of  645.32
## The degrees of freedom for the model are 51  and the objective function was  2.18 
## 
## The root mean square of the residuals (RMSR) is  0.03 
## The df corrected root mean square of the residuals is  0.05 
## 
## The harmonic number of observations is  48 with the empirical chi square  11.41  with prob <  1 
## The total number of observations was  48  with Likelihood Chi Square =  84  with prob <  0.0025 
## 
## Tucker Lewis Index of factoring reliability =  0.864
## RMSEA index =  0.014  and the 90 % confidence intervals are  0.014 0.159
## BIC =  -113.43
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                 MR2  MR3  MR4  MR1
## Correlation of scores with factors             0.99 0.93 0.93 0.97
## Multiple R square of scores with factors       0.97 0.86 0.86 0.94
## Minimum correlation of possible factor scores  0.95 0.73 0.71 0.89

# MR : factor loadings
# h2 : communality
# u2 : uniquiness. specific factor 분산

fa.diagram(fa3)

# Practice

stock <- read.csv("data/stock_price.csv", header = T)
stock <- stock[, -1]
head(stock)

##     JPMorgan   Citibank WellsFargo RoyalDutchShell ExxonMobil
## 1  0.0130338 -0.0078431 -0.0031889      -0.0447693  0.0052151
## 2  0.0084862  0.0166886 -0.0062100       0.0119560  0.0134890
## 3 -0.0179153 -0.0086393  0.0100360       0.0000000 -0.0061428
## 4  0.0215589 -0.0034858  0.0174353      -0.0285917 -0.0069534
## 5  0.0108225  0.0037167 -0.0101345       0.0291900  0.0409751
## 6  0.0101713 -0.0121978 -0.0083768       0.0137083  0.0029895

# 1. 적절한 공통인자의 수를 구하시오

fa01 <- factanal(stock, 2)
print(fa01, digits = 2, sort = T)

## 
## Call:
## factanal(x = stock, factors = 2)
## 
## Uniquenesses:
##        JPMorgan        Citibank      WellsFargo RoyalDutchShell 
##            0.42            0.27            0.54            0.00 
##      ExxonMobil 
##            0.53 
## 
## Loadings:
##                 Factor1 Factor2
## JPMorgan        0.76           
## Citibank        0.82    0.23   
## WellsFargo      0.67    0.11   
## RoyalDutchShell 0.11    0.99   
## ExxonMobil      0.11    0.68   
## 
##                Factor1 Factor2
## SS loadings       1.72    1.51
## Proportion Var    0.34    0.30
## Cumulative Var    0.34    0.65
## 
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 1.97 on 1 degree of freedom.
## The p-value is 0.16

# fa02 <- factanal(stock, 3)  # 3 factors are too many for 5 variables

    # common factor 갯수 = 2


# 2. 다양한 Rotation을 적용한 것과 하지 않은 것의 인자적재값을 비교하고 
#    적절하다고 판단되는 결과를 고르시오.

library(GPArotation)

fa_varimax = fa(stock, 2, rotate = "varimax")
fa_quartimax = fa(stock, 2, rotate = "quartimax")

print(fa_varimax, digits = 2, sort = T)

## Factor Analysis using method =  minres
## Call: fa(r = stock, nfactors = 2, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                 item  MR2  MR1   h2    u2 com
## Citibank           2 0.82 0.23 0.73 0.273 1.2
## JPMorgan           1 0.76 0.03 0.58 0.418 1.0
## WellsFargo         3 0.67 0.11 0.46 0.543 1.1
## RoyalDutchShell    4 0.11 0.99 1.00 0.005 1.0
## ExxonMobil         5 0.11 0.68 0.47 0.530 1.1
## 
##                        MR2  MR1
## SS loadings           1.72 1.51
## Proportion Var        0.34 0.30
## Cumulative Var        0.34 0.65
## Proportion Explained  0.53 0.47
## Cumulative Proportion 0.53 1.00
## 
## Mean item complexity =  1.1
## Test of the hypothesis that 2 factors are sufficient.
## 
## The degrees of freedom for the null model are  10  and the objective function was  1.74 with Chi Square of  173.31
## The degrees of freedom for the model are 1  and the objective function was  0.02 
## 
## The root mean square of the residuals (RMSR) is  0.02 
## The df corrected root mean square of the residuals is  0.06 
## 
## The harmonic number of observations is  103 with the empirical chi square  0.79  with prob <  0.38 
## The total number of observations was  103  with Likelihood Chi Square =  1.97  with prob <  0.16 
## 
## Tucker Lewis Index of factoring reliability =  0.939
## RMSEA index =  0.01  and the 90 % confidence intervals are  NA 0.301
## BIC =  -2.66
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                 MR2  MR1
## Correlation of scores with factors             0.90 1.00
## Multiple R square of scores with factors       0.82 0.99
## Minimum correlation of possible factor scores  0.64 0.98

print(fa_quartimax, digits = 2, sort = T)

## Factor Analysis using method =  minres
## Call: fa(r = stock, nfactors = 2, rotate = "quartimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                 item  MR2  MR1   h2    u2 com
## Citibank           2 0.82 0.22 0.73 0.273 1.1
## JPMorgan           1 0.76 0.02 0.58 0.418 1.0
## WellsFargo         3 0.67 0.10 0.46 0.543 1.0
## RoyalDutchShell    4 0.12 0.99 1.00 0.005 1.0
## ExxonMobil         5 0.12 0.68 0.47 0.530 1.1
## 
##                        MR2  MR1
## SS loadings           1.74 1.50
## Proportion Var        0.35 0.30
## Cumulative Var        0.35 0.65
## Proportion Explained  0.54 0.46
## Cumulative Proportion 0.54 1.00
## 
## Mean item complexity =  1.1
## Test of the hypothesis that 2 factors are sufficient.
## 
## The degrees of freedom for the null model are  10  and the objective function was  1.74 with Chi Square of  173.31
## The degrees of freedom for the model are 1  and the objective function was  0.02 
## 
## The root mean square of the residuals (RMSR) is  0.02 
## The df corrected root mean square of the residuals is  0.06 
## 
## The harmonic number of observations is  103 with the empirical chi square  0.79  with prob <  0.38 
## The total number of observations was  103  with Likelihood Chi Square =  1.97  with prob <  0.16 
## 
## Tucker Lewis Index of factoring reliability =  0.939
## RMSEA index =  0.01  and the 90 % confidence intervals are  NA 0.301
## BIC =  -2.66
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                 MR2  MR1
## Correlation of scores with factors             0.90 1.00
## Multiple R square of scores with factors       0.82 0.99
## Minimum correlation of possible factor scores  0.64 0.98

bank = apply(stock[, 1:3], 1, mean)
oil = apply(stock[, 4:5], 1, mean)

plot(bank, type = "l", ylim = c(-0.1,0.1))
lines(oil, col = "red")
legend("topright", c("bank", "oil"), lty = 1, col = 1:2)

탐색적 인자분석 Factor analysis

woosa7