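Exploratory Factor Analysis
Factor analysis: the Korean terms 인자분석 and 요인분석 are synonyms for it.
In psychology, the behavioral sciences, and many other fields, the concept of main interest often cannot be measured directly, so it is measured indirectly.
Latent variable: a variable that cannot be measured directly but can be measured indirectly through observed variables.
(Example) test scores --> intelligence
Factor analysis: uncovers the relationship between the measured variables and the latent variables.
Exploratory factor analysis: investigates that relationship without assuming in advance which measured variable relates to which factor.
Confirmatory factor analysis: tests whether a specific factor model assumed in advance fits the covariances or correlations among the measured variables.
f : common factor
u : specific factor of each variable. Treated as the error that arises because the variable does not measure the common factor perfectly.
f and u cannot be observed.
What factor analysis estimates is the variance decomposition (variance due to the common factors + variance of the specific factor).
Assumptions: the u's are mutually independent, and u is independent of f.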
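* 1 factor
X1 = L1*f1 + U1
var(X1) = L1**2 + var(U1) = 1   (X1 standardized, var(f1) = 1)
Li : factor loading, the correlation between Xi and f. It measures how strongly each variable reflects the common factor.
Li**2 : proportion of variance produced by the common factor = communality.
A communality close to 1 means that Xi reflects the common factor well.
var(Ui) : the part not explained by the common factor, i.e. the variance of the specific factor.
* 2 factors
X1 = L11*f1 + L12*f2 + U1
var(X1) = L11**2 + L12**2 + var(U1)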
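The variance decomposition above can be checked with a small simulation. This is only a sketch: the loadings 0.7 and 0.5, the sample size, and all variable names are hypothetical values chosen for illustration, and the factors are assumed to be independent standard normal variables.

```r
# Simulate one standardized variable driven by two independent common factors.
set.seed(1)
n   <- 100000
l1  <- 0.7                 # hypothetical loading on f1
l2  <- 0.5                 # hypothetical loading on f2
psi <- 1 - l1^2 - l2^2     # specific variance chosen so that var(x) = 1

f1 <- rnorm(n)                     # common factor 1, variance 1
f2 <- rnorm(n)                     # common factor 2, variance 1
u  <- rnorm(n, sd = sqrt(psi))     # specific factor, independent of f1 and f2
x  <- l1 * f1 + l2 * f2 + u

var_x       <- var(x)              # should be close to 1
communality <- l1^2 + l2^2         # share of variance due to the common factors
cor_x_f1    <- cor(x, f1)          # should be close to the loading l1

print(round(c(var_x = var_x, communality = communality, cor_x_f1 = cor_x_f1), 3))
```

With these assumed values the communality is 0.74, so about 26% of the variance of x is left to the specific factor.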
# Example
app <- read.table("data/Applicant.TXT", header = T)
head(app)
## ID X1.FL. X2.APP. X3.AA. X4.LA. X5.SC. X6.LC. X7.HON. X8.SMS. X9.EXP.
## 1 1 6 7 2 5 8 7 8 8 3
## 2 2 9 10 5 8 10 9 9 10 5
## 3 3 7 8 3 6 9 8 9 7 4
## 4 4 5 6 8 5 6 5 9 2 8
## 5 5 6 8 8 8 4 4 9 5 8
## 6 6 7 7 7 6 8 7 10 5 9
## X10.DRV. X11.AMB. X12.GSP. X13.POT. X14.KJ. X15.SUIT.
## 1 8 9 7 5 7 10
## 2 9 9 8 8 8 10
## 3 9 9 8 6 8 10
## 4 4 5 8 7 6 5
## 5 5 5 8 8 7 7
## 6 6 5 8 6 6 6
app <- app[, -1] # drop the ID column
library(psych)
describe(app)
## vars n mean sd median trimmed mad min max range skew
## X1.FL. 1 48 6.00 2.67 6.0 6.10 2.97 0 10 10 -0.24
## X2.APP. 2 48 7.08 1.97 7.0 7.22 1.48 3 10 7 -0.74
## X3.AA. 3 48 7.08 1.99 7.0 7.17 1.48 2 10 8 -0.37
## X4.LA. 4 48 6.15 2.81 7.0 6.35 2.97 0 10 10 -0.60
## X5.SC. 5 48 6.94 2.42 8.0 7.12 1.48 1 10 9 -0.72
## X6.LC. 6 48 6.31 3.17 8.0 6.58 2.97 0 10 10 -0.63
## X7.HON. 7 48 8.04 2.53 9.0 8.53 1.48 0 10 10 -1.72
## X8.SMS. 8 48 4.85 3.44 4.5 4.83 3.71 0 10 10 0.17
## X9.EXP. 9 48 4.23 3.31 3.0 4.08 2.97 0 10 10 0.56
## X10.DRV. 10 48 5.31 2.95 5.0 5.28 4.45 0 10 10 0.11
## X11.AMB. 11 48 5.98 2.94 6.0 6.10 4.45 0 10 10 -0.22
## X12.GSP. 12 48 6.25 3.04 7.0 6.50 2.22 0 10 10 -0.77
## X13.POT. 13 48 5.69 3.18 6.0 5.83 2.97 0 10 10 -0.49
## X14.KJ. 14 48 5.56 2.66 5.0 5.67 2.97 0 10 10 -0.27
## X15.SUIT. 15 48 5.96 3.30 6.0 6.15 4.45 0 10 10 -0.24
## kurtosis se
## X1.FL. -0.81 0.39
## X2.APP. -0.29 0.28
## X3.AA. -0.46 0.29
## X4.LA. -0.56 0.40
## X5.SC. -0.55 0.35
## X6.LC. -0.89 0.46
## X7.HON. 2.50 0.37
## X8.SMS. -1.35 0.50
## X9.EXP. -1.12 0.48
## X10.DRV. -1.29 0.43
## X11.AMB. -1.08 0.42
## X12.GSP. -0.52 0.44
## X13.POT. -1.01 0.46
## X14.KJ. -0.28 0.38
## X15.SUIT. -1.24 0.48
pairs.panels(app)
app_s <- scale(app)
fa1 <- factanal(app_s, 4) # standardized input, so the correlation matrix is analysed
print(fa1, digits = 2, sort = T)
##
## Call:
## factanal(x = app_s, factors = 4)
##
## Uniquenesses:
## X1.FL. X2.APP. X3.AA. X4.LA. X5.SC. X6.LC. X7.HON.
## 0.44 0.68 0.52 0.18 0.12 0.20 0.34
## X8.SMS. X9.EXP. X10.DRV. X11.AMB. X12.GSP. X13.POT. X14.KJ.
## 0.14 0.36 0.23 0.14 0.15 0.09 0.00
## X15.SUIT.
## 0.25
##
## Loadings:
## Factor1 Factor2 Factor3 Factor4
## X5.SC. 0.92 0.14
## X6.LC. 0.84 0.11 0.29
## X8.SMS. 0.88 0.26
## X10.DRV. 0.77 0.39 0.17
## X11.AMB. 0.90 0.18
## X12.GSP. 0.79 0.28 0.35 0.15
## X13.POT. 0.74 0.35 0.43 0.25
## X1.FL. 0.13 0.72 0.11 -0.12
## X9.EXP. 0.78 0.17
## X15.SUIT. 0.36 0.77 0.14
## X4.LA. 0.23 0.24 0.84
## X7.HON. 0.25 -0.22 0.74
## X3.AA. 0.13 0.68
## X14.KJ. 0.42 0.39 0.55 -0.60
## X2.APP. 0.46 0.14 0.24 0.16
##
## Factor1 Factor2 Factor3 Factor4
## SS loadings 5.57 2.47 2.10 1.01
## Proportion Var 0.37 0.16 0.14 0.07
## Cumulative Var 0.37 0.54 0.68 0.74
##
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 84 on 51 degrees of freedom.
## The p-value is 0.00247
fa2 <- factanal(app, 4) # raw input; factanal() still analyses the correlation matrix, so the output matches fa1
print(fa2, digits = 2, sort = T)
##
## Call:
## factanal(x = app, factors = 4)
##
## Uniquenesses:
## X1.FL. X2.APP. X3.AA. X4.LA. X5.SC. X6.LC. X7.HON.
## 0.44 0.68 0.52 0.18 0.12 0.20 0.34
## X8.SMS. X9.EXP. X10.DRV. X11.AMB. X12.GSP. X13.POT. X14.KJ.
## 0.14 0.36 0.23 0.14 0.15 0.09 0.00
## X15.SUIT.
## 0.25
##
## Loadings:
## Factor1 Factor2 Factor3 Factor4
## X5.SC. 0.92 0.14
## X6.LC. 0.84 0.11 0.29
## X8.SMS. 0.88 0.26
## X10.DRV. 0.77 0.39 0.17
## X11.AMB. 0.90 0.18
## X12.GSP. 0.79 0.28 0.35 0.15
## X13.POT. 0.74 0.35 0.43 0.25
## X1.FL. 0.13 0.72 0.11 -0.12
## X9.EXP. 0.78 0.17
## X15.SUIT. 0.36 0.77 0.14
## X4.LA. 0.23 0.24 0.84
## X7.HON. 0.25 -0.22 0.74
## X3.AA. 0.13 0.68
## X14.KJ. 0.42 0.39 0.55 -0.60
## X2.APP. 0.46 0.14 0.24 0.16
##
## Factor1 Factor2 Factor3 Factor4
## SS loadings 5.57 2.47 2.10 1.01
## Proportion Var 0.37 0.16 0.14 0.07
## Cumulative Var 0.37 0.54 0.68 0.74
##
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 84 on 51 degrees of freedom.
## The p-value is 0.00247
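The two outputs above are identical because `factanal()` works on the correlation matrix whether or not the data are standardized first. A minimal sketch of that behaviour on simulated data follows; the data, the loadings, and the column scales are made-up assumptions, not the applicant data.

```r
# Simulated data: six variables, two independent factors (hypothetical values).
set.seed(42)
n  <- 500
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- cbind(
  a1 = 0.8 * f1 + rnorm(n, sd = 0.6),
  a2 = 0.7 * f1 + rnorm(n, sd = 0.7),
  a3 = 0.6 * f1 + rnorm(n, sd = 0.8),
  b1 = 0.8 * f2 + rnorm(n, sd = 0.6),
  b2 = 0.7 * f2 + rnorm(n, sd = 0.7),
  b3 = 0.6 * f2 + rnorm(n, sd = 0.8)
)
# Put the columns on very different scales.
X <- sweep(X, 2, c(1, 10, 100, 1, 10, 100), `*`)

fit_raw    <- factanal(X, factors = 2)         # raw data
fit_scaled <- factanal(scale(X), factors = 2)  # standardized data

# Largest absolute difference between the two loading matrices.
max_diff <- max(abs(unclass(fit_raw$loadings) - unclass(fit_scaled$loadings)))
print(max_diff)
```

The difference should be zero up to numerical noise, which is why calling `scale()` before `factanal()` is optional.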
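# Communality of X6.LC. = 0.84**2 + 0.11**2 + 0.29**2 = 0.80 (80% of its variance is explained by the 4 factors;
# the fourth loading is below the print cutoff of 0.1 and is shown as blank)
# Uniqueness of X6.LC. = 1 - communality = 0.20
# So a variable with a small uniqueness is well explained by the common factors.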
1 - fa1$uniquenesses # = Communality
## X1.FL. X2.APP. X3.AA. X4.LA. X5.SC. X6.LC. X7.HON.
## 0.5572552 0.3152922 0.4794454 0.8154054 0.8806047 0.8022471 0.6611150
## X8.SMS. X9.EXP. X10.DRV. X11.AMB. X12.GSP. X13.POT. X14.KJ.
## 0.8618314 0.6430829 0.7743963 0.8630327 0.8474177 0.9104711 0.9950000
## X15.SUIT.
## 0.7484600
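The communalities can also be computed directly from the loading matrix as row sums of squared loadings. The sketch below uses R's built-in `ability.cov` dataset instead of the applicant data so that it runs on its own; the choice of two factors is only for illustration.

```r
# Fit a 2-factor model to the built-in ability.cov covariance list.
fit <- factanal(covmat = ability.cov, factors = 2)

L           <- unclass(fit$loadings)   # plain numeric loading matrix
communality <- rowSums(L^2)            # row sums of squared loadings
uniqueness  <- fit$uniquenesses        # estimated specific variances

# For each variable, communality + uniqueness should be close to 1.
print(round(cbind(communality, uniqueness, total = communality + uniqueness), 3))
```

The `total` column is approximately 1 for every variable, which is the same identity as `1 - fa1$uniquenesses` used above.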
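# X5, X6, X8, X10, X11, X12 and X13 load heavily on Factor1.
# X1, X9 and X15 load heavily on Factor2.
# X14 = 0.42*f1 + 0.39*f2 + 0.55*f3 - 0.60*f4 + u14
# Column sums of squares of the loading matrix (SS loadings): the explanatory power of each factor, i.e. how much of the total variance the factor explains.
# Proportion Var of Factor1 = 5.57 / 15 (number of variables) = 0.37
# Cumulative Var = 0.74 : the 4 factors explain 74% of the variation in the original variables.
# "Test of the hypothesis that 4 factors are sufficient." is H0 (the null hypothesis).
# p-value 0.00247 < 0.05, so H0 is rejected --> 4 factors are not sufficient.
# Do not rely on this test alone when choosing the number of factors.
# In factor analysis, being able to interpret the factors matters most.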
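The SS loadings, the proportion of variance, and the sufficiency test discussed above can be reproduced by hand. The following sketch again uses the built-in `ability.cov` data (6 variables, 112 observations) rather than the applicant data, and compares 1 and 2 factors purely as an illustration.

```r
p    <- nrow(ability.cov$cov)                        # number of variables
fit2 <- factanal(covmat = ability.cov, factors = 2)
L    <- unclass(fit2$loadings)

ss_loadings <- colSums(L^2)          # column sums of squared loadings
prop_var    <- ss_loadings / p       # share of total variance per factor
cum_var     <- cumsum(prop_var)      # cumulative share
print(round(rbind(ss_loadings, prop_var, cum_var), 3))

# p-value of "k factors are sufficient" for k = 1 and k = 2.
p_values <- sapply(1:2, function(k) {
  unname(factanal(covmat = ability.cov, factors = k)$PVAL)
})
names(p_values) <- c("1 factor", "2 factors")
print(signif(p_values, 3))
```

A small p-value rejects the hypothesis that k factors are sufficient. As noted above, this test is one input among several and does not replace interpretability.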
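# Scatter plot of the factor loadings (Factor1 vs Factor2)
loadings1 <- fa1$loadings[, 1:2]  # renamed so that base::load() is not masked
plot(loadings1, type = "n")
text(loadings1, labels = colnames(app_s), cex = 0.7)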
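# Factor rotation
# Rotate the factor axes so that the loadings become easier to tell apart.
# (1) Orthogonal rotation: constrains the rotated factors to stay uncorrelated
# varimax : for each common factor, maximizes the variance of the squared loadings across the variables,
# i.e. the variance within each column of the loadings matrix (simplifies the factors)
# quartimax : for each variable, maximizes the variance of the squared loadings across the common factors,
# i.e. the variance within each row of the loadings matrix (simplifies the variables)
# (2) Oblique rotation: allows correlated factors
# oblimin : controls the degree of correlation among the factors
# promax : raises the varimax loadings to a power to build a target pattern; the resulting factors tend to be only weakly correlated.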
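To see what a rotation changes and what it leaves alone, the sketch below rotates one unrotated solution with `varimax()` and `promax()` from the stats package. It uses the built-in `ability.cov` data and two factors as illustrative assumptions. The factor correlation formula in the last step is assumed to match the one `factanal()` reports for oblique rotations.

```r
# Unrotated 2-factor solution on the built-in ability.cov data.
fit0 <- factanal(covmat = ability.cov, factors = 2, rotation = "none")
L0   <- unclass(fit0$loadings)

vm <- varimax(L0)   # orthogonal rotation
pm <- promax(L0)    # oblique rotation

L_vm <- unclass(vm$loadings)
L_pm <- unclass(pm$loadings)

# An orthogonal rotation redistributes the loadings but keeps communalities.
h2_none <- rowSums(L0^2)
h2_vm   <- rowSums(L_vm^2)
print(round(cbind(h2_none, h2_vm), 3))

# The varimax rotation matrix is orthogonal: t(T) %*% T is the identity.
print(round(crossprod(vm$rotmat), 3))

# Factor correlations implied by the promax rotation matrix
# (assumption: same formula as in factanal's printed "Factor Correlations").
Phi <- solve(crossprod(pm$rotmat))
print(round(Phi, 3))
```

The communalities are identical before and after varimax, so rotation affects interpretation only, not fit.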
fa1 <- factanal(app, 4) # default rotation: varimax
fa2 <- factanal(app, 4, rotation = "none")
print(fa2, digits = 2, sort = T)
##
## Call:
## factanal(x = app, factors = 4, rotation = "none")
##
## Uniquenesses:
## X1.FL. X2.APP. X3.AA. X4.LA. X5.SC. X6.LC. X7.HON.
## 0.44 0.68 0.52 0.18 0.12 0.20 0.34
## X8.SMS. X9.EXP. X10.DRV. X11.AMB. X12.GSP. X13.POT. X14.KJ.
## 0.14 0.36 0.23 0.14 0.15 0.09 0.00
## X15.SUIT.
## 0.25
##
## Loadings:
## Factor1 Factor2 Factor3 Factor4
## X4.LA. 0.70 0.12 0.55
## X10.DRV. 0.67 0.54 -0.15
## X14.KJ. 0.99 -0.10
## X5.SC. 0.55 0.64 -0.40
## X6.LC. 0.60 0.65 -0.13
## X8.SMS. 0.63 0.64 -0.21
## X11.AMB. 0.62 0.65 -0.14 -0.19
## X12.GSP. 0.62 0.66 0.11
## X13.POT. 0.62 0.67 0.18 0.22
## X1.FL. 0.47 0.55 -0.18
## X9.EXP. 0.24 0.19 0.72 -0.19
## X15.SUIT. 0.44 0.37 0.62 -0.17
## X7.HON. 0.46 -0.27 0.60
## X2.APP. 0.33 0.43 0.14
## X3.AA. -0.27 0.48 0.34 0.26
##
## Factor1 Factor2 Factor3 Factor4
## SS loadings 5.02 3.45 1.65 1.03
## Proportion Var 0.33 0.23 0.11 0.07
## Cumulative Var 0.33 0.57 0.67 0.74
##
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 84 on 51 degrees of freedom.
## The p-value is 0.00247
library(psych)
library(GPArotation)
fa3 = fa(app, 4, rotate = "quartimax")
print(fa3, digits = 2, sort = T)
## Factor Analysis using method = minres
## Call: fa(r = app, nfactors = 4, rotate = "quartimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## item MR2 MR3 MR4 MR1 h2 u2 com
## X11.AMB. 11 0.92 0.00 -0.11 0.08 0.86 0.137 1.0
## X8.SMS. 8 0.92 0.08 -0.11 0.08 0.86 0.138 1.1
## X5.SC. 5 0.89 -0.28 -0.07 0.07 0.88 0.120 1.2
## X12.GSP. 12 0.89 0.12 0.18 -0.10 0.85 0.153 1.2
## X6.LC. 6 0.89 -0.05 0.10 -0.04 0.80 0.198 1.0
## X13.POT. 13 0.87 0.21 0.28 -0.18 0.91 0.089 1.4
## X10.DRV. 10 0.84 0.22 0.00 0.11 0.77 0.226 1.2
## X2.APP. 2 0.52 0.06 0.15 -0.14 0.32 0.685 1.3
## X9.EXP. 9 0.23 0.77 -0.05 -0.07 0.64 0.356 1.2
## X15.SUIT. 15 0.51 0.70 -0.01 -0.05 0.75 0.251 1.8
## X1.FL. 1 0.28 0.65 0.08 0.21 0.56 0.443 1.6
## X4.LA. 4 0.44 0.13 0.76 0.14 0.82 0.185 1.8
## X7.HON. 7 0.36 -0.31 0.66 0.04 0.66 0.339 2.0
## X14.KJ. 14 0.59 0.19 0.40 0.67 1.00 0.005 2.8
## X3.AA. 3 0.11 0.19 0.04 -0.65 0.48 0.520 1.2
##
## MR2 MR3 MR4 MR1
## SS loadings 6.85 1.89 1.36 1.05
## Proportion Var 0.46 0.13 0.09 0.07
## Cumulative Var 0.46 0.58 0.67 0.74
## Proportion Explained 0.61 0.17 0.12 0.09
## Cumulative Proportion 0.61 0.78 0.91 1.00
##
## Mean item complexity = 1.5
## Test of the hypothesis that 4 factors are sufficient.
##
## The degrees of freedom for the null model are 105 and the objective function was 15.68 with Chi Square of 645.32
## The degrees of freedom for the model are 51 and the objective function was 2.18
##
## The root mean square of the residuals (RMSR) is 0.03
## The df corrected root mean square of the residuals is 0.05
##
## The harmonic number of observations is 48 with the empirical chi square 11.41 with prob < 1
## The total number of observations was 48 with Likelihood Chi Square = 84 with prob < 0.0025
##
## Tucker Lewis Index of factoring reliability = 0.864
## RMSEA index = 0.014 and the 90 % confidence intervals are 0.014 0.159
## BIC = -113.43
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
## MR2 MR3 MR4 MR1
## Correlation of scores with factors 0.99 0.93 0.93 0.97
## Multiple R square of scores with factors 0.97 0.86 0.86 0.94
## Minimum correlation of possible factor scores 0.95 0.73 0.71 0.89
# MR2, MR3, MR4, MR1 : factor loadings (MR = minimum residual, the extraction method used by fa())
# h2 : communality
# u2 : uniqueness, the variance of the specific factor
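The `com` column in the output is the item complexity. The psych documentation describes it as Hofmann's index, (sum of squared loadings)^2 divided by the sum of fourth powers of the loadings. A small sketch for X14.KJ. follows, using the loadings copied from the printed table. Because those values are rounded to two digits, h2 and u2 differ slightly from the printed 1.00 and 0.005.

```r
# Loadings of X14.KJ. copied from the printed fa3 table (rounded to 2 digits).
l <- c(MR2 = 0.59, MR3 = 0.19, MR4 = 0.40, MR1 = 0.67)

h2  <- sum(l^2)                 # communality
u2  <- 1 - h2                   # uniqueness
com <- sum(l^2)^2 / sum(l^4)    # item complexity (Hofmann's index)

print(round(c(h2 = h2, u2 = u2, com = com), 2))
```

A complexity near 1 means the variable loads on essentially one factor. X14.KJ. comes out at about 2.8, which matches the table and reflects its spread over several factors.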
fa.diagram(fa3)
# Practice
stock <- read.csv("data/stock_price.csv", header = T)
stock <- stock[, -1]
head(stock)
## JPMorgan Citibank WellsFargo RoyalDutchShell ExxonMobil
## 1 0.0130338 -0.0078431 -0.0031889 -0.0447693 0.0052151
## 2 0.0084862 0.0166886 -0.0062100 0.0119560 0.0134890
## 3 -0.0179153 -0.0086393 0.0100360 0.0000000 -0.0061428
## 4 0.0215589 -0.0034858 0.0174353 -0.0285917 -0.0069534
## 5 0.0108225 0.0037167 -0.0101345 0.0291900 0.0409751
## 6 0.0101713 -0.0121978 -0.0083768 0.0137083 0.0029895
# 1. Find an appropriate number of common factors.
fa01 <- factanal(stock, 2)
print(fa01, digits = 2, sort = T)
##
## Call:
## factanal(x = stock, factors = 2)
##
## Uniquenesses:
## JPMorgan Citibank WellsFargo RoyalDutchShell
## 0.42 0.27 0.54 0.00
## ExxonMobil
## 0.53
##
## Loadings:
## Factor1 Factor2
## JPMorgan 0.76
## Citibank 0.82 0.23
## WellsFargo 0.67 0.11
## RoyalDutchShell 0.11 0.99
## ExxonMobil 0.11 0.68
##
## Factor1 Factor2
## SS loadings 1.72 1.51
## Proportion Var 0.34 0.30
## Cumulative Var 0.34 0.65
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 1.97 on 1 degree of freedom.
## The p-value is 0.16
# fa02 <- factanal(stock, 3) # 3 factors are too many for 5 variables
# Number of common factors = 2
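Why 3 factors are too many for 5 variables follows from the degrees of freedom of the factor model, ((p - k)^2 - (p + k)) / 2 for p variables and k factors. The helper name `fa_dof` below is hypothetical and only wraps this formula.

```r
# Degrees of freedom of a k-factor model for p variables.
fa_dof <- function(p, k) ((p - k)^2 - (p + k)) / 2

fa_dof(15, 4)  # 51, as in the applicant output
fa_dof(5, 2)   # 1, as in the stock output
fa_dof(5, 3)   # -2, negative, so the model cannot be fitted
```

With 5 variables, 2 is therefore the largest number of factors that still leaves a positive number of degrees of freedom for the test.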
# 2. Compare the factor loadings obtained with various rotations and without rotation,
# and choose the result you judge most appropriate.
library(GPArotation)
fa_varimax = fa(stock, 2, rotate = "varimax")
fa_quartimax = fa(stock, 2, rotate = "quartimax")
print(fa_varimax, digits = 2, sort = T)
## Factor Analysis using method = minres
## Call: fa(r = stock, nfactors = 2, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## item MR2 MR1 h2 u2 com
## Citibank 2 0.82 0.23 0.73 0.273 1.2
## JPMorgan 1 0.76 0.03 0.58 0.418 1.0
## WellsFargo 3 0.67 0.11 0.46 0.543 1.1
## RoyalDutchShell 4 0.11 0.99 1.00 0.005 1.0
## ExxonMobil 5 0.11 0.68 0.47 0.530 1.1
##
## MR2 MR1
## SS loadings 1.72 1.51
## Proportion Var 0.34 0.30
## Cumulative Var 0.34 0.65
## Proportion Explained 0.53 0.47
## Cumulative Proportion 0.53 1.00
##
## Mean item complexity = 1.1
## Test of the hypothesis that 2 factors are sufficient.
##
## The degrees of freedom for the null model are 10 and the objective function was 1.74 with Chi Square of 173.31
## The degrees of freedom for the model are 1 and the objective function was 0.02
##
## The root mean square of the residuals (RMSR) is 0.02
## The df corrected root mean square of the residuals is 0.06
##
## The harmonic number of observations is 103 with the empirical chi square 0.79 with prob < 0.38
## The total number of observations was 103 with Likelihood Chi Square = 1.97 with prob < 0.16
##
## Tucker Lewis Index of factoring reliability = 0.939
## RMSEA index = 0.01 and the 90 % confidence intervals are NA 0.301
## BIC = -2.66
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
## MR2 MR1
## Correlation of scores with factors 0.90 1.00
## Multiple R square of scores with factors 0.82 0.99
## Minimum correlation of possible factor scores 0.64 0.98
print(fa_quartimax, digits = 2, sort = T)
## Factor Analysis using method = minres
## Call: fa(r = stock, nfactors = 2, rotate = "quartimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## item MR2 MR1 h2 u2 com
## Citibank 2 0.82 0.22 0.73 0.273 1.1
## JPMorgan 1 0.76 0.02 0.58 0.418 1.0
## WellsFargo 3 0.67 0.10 0.46 0.543 1.0
## RoyalDutchShell 4 0.12 0.99 1.00 0.005 1.0
## ExxonMobil 5 0.12 0.68 0.47 0.530 1.1
##
## MR2 MR1
## SS loadings 1.74 1.50
## Proportion Var 0.35 0.30
## Cumulative Var 0.35 0.65
## Proportion Explained 0.54 0.46
## Cumulative Proportion 0.54 1.00
##
## Mean item complexity = 1.1
## Test of the hypothesis that 2 factors are sufficient.
##
## The degrees of freedom for the null model are 10 and the objective function was 1.74 with Chi Square of 173.31
## The degrees of freedom for the model are 1 and the objective function was 0.02
##
## The root mean square of the residuals (RMSR) is 0.02
## The df corrected root mean square of the residuals is 0.06
##
## The harmonic number of observations is 103 with the empirical chi square 0.79 with prob < 0.38
## The total number of observations was 103 with Likelihood Chi Square = 1.97 with prob < 0.16
##
## Tucker Lewis Index of factoring reliability = 0.939
## RMSEA index = 0.01 and the 90 % confidence intervals are NA 0.301
## BIC = -2.66
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
## MR2 MR1
## Correlation of scores with factors 0.90 1.00
## Multiple R square of scores with factors 0.82 0.99
## Minimum correlation of possible factor scores 0.64 0.98
# Average the returns of the stocks that load on each factor (banks: MR2, oil: MR1)
bank <- rowMeans(stock[, 1:3])
oil <- rowMeans(stock[, 4:5])
plot(bank, type = "l", ylim = c(-0.1, 0.1))
lines(oil, col = "red")
legend("topright", c("bank", "oil"), lty = 1, col = 1:2)