금감원 전자공시(DART) 크롤링
[FinDA] 금융감독원 전자공시(DART) 크롤링. html을 로컬에 저장. BeautifulSoup를 사용하여 태그 분석. DB update.
전자공시 웹페이지
http://dart.fss.or.kr/dsac001/search.ax
- selectDate : 날짜
- Cookie : DSAC001_MAXRESULTS=100;
- 전자공시 목록을 조회하려면 Cookie를 사용해야 한다…!
wget을 사용하여 html 파일 다운받기
- https://eternallybored.org/misc/wget/
- 윈도우 64비트용 다운로드
- Anaconda3 폴더에 wget64.exe 파일 복사
! wget64 http://dart.fss.or.kr/dsac001/search.ax?selectDate=20170405 \
--header="Cookie:DSAC001_MAXRESULTS=5000;" \
-O DART-20170405.html
--2017-04-07 16:58:50-- http://dart.fss.or.kr/dsac001/search.ax?selectDate=20170405
Resolving dart.fss.or.kr (dart.fss.or.kr)... 61.73.60.200
Connecting to dart.fss.or.kr (dart.fss.or.kr)|61.73.60.200|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'DART-20170405.html'
0K .......... .......... .......... .......... .......... 796K
50K .......... .......... .......... .......... .......... 2.58M
100K .......... .......... .......... .......... .......... 3.81M
150K .......... .......... .......... .......... .......... 3.83M
200K .......... .......... .......... .......... .......... 3.49M
250K .......... .......... .......... .......... .......... 2.34M
300K .......... .......... .......... .......... .......... 4.48M
350K .......... .......... .......... .......... .......... 3.10M
400K .......... .......... .......... .......... .......... 2.58M
450K .......... .......... .......... .......... .......... 2.41M
500K .......... .......... .......... .......... .......... 2.07M
550K .......... .......... .......... .......... .......... 1.70M
600K .......... .......... .......... .......... .......... 1.66M
650K .......... .......... .......... .......... .......... 1.06M
700K .......... .......... .......... .......... .......... 175K
750K .......... .......... .......... .......... .......... 1.13M
800K .......... .......... .......... .......... .......... 1.44M
850K .......... .......... .......... .......... .......... 3.02M
900K .......... .......... .......... .......... .......... 2.31M
950K .......... .......... .......... .......... .......... 2.84M
1000K .......... .......... .......... .......... .......... 2.96M
1050K .......... .......... .......... .......... .......... 2.59M
1100K .......... .......... .......... .......... .......... 2.89M
1150K .......... .......... .......... .......... .......... 2.72M
1200K .......... .......... .......... .......... .......... 2.35M
1250K .......... .......... .......... .......... .......... 2.90M
1300K .......... .......... .......... .......... .......... 3.30M
1350K .......... .......... .......... .......... .......... 3.00M
1400K .......... .......... .......... .......... .......... 3.74M
1450K .......... .......... .......... .......... .......... 2.54M
1500K .......... .......... .......... .......... .......... 3.56M
1550K .......... .......... .......... .......... .......... 3.44M
1600K .......... .......... .......... .......... .......... 2.99M
1650K .......... .......... .......... .......... .......... 3.04M
1700K .......... .......... .......... .......... .......... 3.48M
1750K .......... .......... .......... .......... .......... 2.87M
1800K .......... .......... .......... .......... .......... 3.52M
1850K .......... .......... .......... .......... .......... 2.86M
1900K .......... .......... .......... .......... .......... 3.89M
1950K .......... .......... .......... .......... .......... 3.27M
2000K .......... .......... .......... .......... .......... 3.92M
2050K .......... .......... .......... .......... .......... 2.51M
2100K .......... .......... .......... .......... .......... 3.51M
2150K .......... .......... .......... .......... .......... 4.05M
2200K .......... .......... .......... .......... .......... 3.24M
2250K .......... .......... .......... .......... .......... 3.54M
2300K .......... .. 9.75M=1.2s
2017-04-07 16:58:54 (1.94 MB/s) - 'DART-20170405.html' saved [2367734]
request 를 이용하여 wget 함수 만들기
import requests
import os
from datetime import datetime, timedelta
data_dir = './dart_html/' # 파일 저장할 폴더
url_templete = 'http://dart.fss.or.kr/dsac001/search.ax?selectDate=%s'
headers = {'Cookie':'DSAC001_MAXRESULTS=5000;'}
# wget 함수
def wget(url, local_filename):
r = requests.get(url, headers=headers, stream=True)
f = open(local_filename, 'wb')
for chunk in r.iter_content(chunk_size=1024):
if chunk:
f.write(chunk)
f.flush()
return local_filename
# 크롤링할 날짜 범위
start_day = datetime(2017, 4, 1)
end_day = datetime(2017, 4, 5)
delta = end_day - start_day
for i in range(delta.days + 1):
d = start_day + timedelta(days=i) # 1일씩 증가
rdate = d.strftime('%Y%m%d')
filename = 'DART-' + rdate + '.html' # 저장할 로컬 파일명
wget(url_templete % rdate, data_dir + filename) # 크롤링
fsize = os.path.getsize(data_dir + filename) / (1024 ** 2)
print ("%s (%.2f MB) downloaded" % (filename, fsize))
DART-20170401.html (0.00 MB) downloaded
DART-20170402.html (0.00 MB) downloaded
DART-20170403.html (2.60 MB) downloaded
DART-20170404.html (2.08 MB) downloaded
DART-20170405.html (2.26 MB) downloaded
html 분석
from bs4 import BeautifulSoup
fname = "./dart_html/DART-20170405.html"
with open(fname, encoding='utf-8') as f:
text = f.read()
soup = BeautifulSoup(text, "lxml")
건수 & 날짜
# body-div-p-b
sel_list = soup.select('div p b')
count_str = sel_list[0].text
count_str
'\n\t\t\n\t\t\n\t\t\n\t\t\n \n \n\t\t전체\n\t\t \xa02,241건\n\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t(2017년 04월 05일)\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\n\t'
count_str = "".join(count_str.split())
count_str
'전체2,241건(2017년04월05일)'
전자공시 목록
# table
table = soup.find('table')
trs = table.findAll('tr')
print ("총 건수:", len(trs))
총 건수: 2242
# 0 (첫번째)는 테이블 헤더이므로 제외.
for tr in trs[1:6]:
tds = tr.findAll('td')
print(tds[0].text.strip(), end=" " ) # 시간
print(tds[1].text.strip(), end=" " ) # 공시대상회사
print(tds[1].img['alt']) # 이미지 태그 (유가증권, 코스닥, 코넥스 등)
print(" ".join(tds[2].text.split()) )
print('http://dart.fss.or.kr/' + tds[2].a['href'] ) # 링크
print(tds[3].text.strip(), end=" ") # 제출인
print(tds[4].text.strip().replace('.', '-') ) # 접수일자
print()
18:29 SK가스 유가증권시장
증권신고서(채무증권)
http://dart.fss.or.kr//dsaf001/main.do?rcpNo=20170405002607
SK가스 2017-04-05
18:11 오백볼트 코넥스시장
[기재정정]감사보고서제출
http://dart.fss.or.kr//dsaf001/main.do?rcpNo=20170405601087
오백볼트 2017-04-05
18:06 우리카드이천십칠의일유동화전문유한회사 기타법인
[기재정정]자산양도등의등록신청서
http://dart.fss.or.kr//dsaf001/main.do?rcpNo=20170317000629
우리카드 2017-03-17
18:06 디비에르메스유동화전문유한회사 기타법인
[첨부추가]자산양도등의등록신청서
http://dart.fss.or.kr//dsaf001/main.do?rcpNo=20170322000224
동부증권 2017-03-22
17:59 스테코 기타법인
감사보고서 (2016.12)
http://dart.fss.or.kr//dsaf001/main.do?rcpNo=20170405002553
삼일회계법인 2017-04-05
MySQL Table Update
html 파일을 DataFrame으로 변환
stock_dart Table
`doc_id` varchar(25) NOT NULL,
`date` datetime DEFAULT NULL,
`corp_name` varchar(50) DEFAULT NULL,
`market` varchar(50) DEFAULT NULL,
`title` varchar(255) DEFAULT NULL,
`link` varchar(128) DEFAULT NULL,
`reporter` varchar(50) DEFAULT NULL,
UNIQUE KEY `doc_id` (`doc_id`)
import pandas as pd
fname = "./dart_html/DART-20170405.html"
with open(fname, encoding='utf-8') as f:
text = f.read()
soup = BeautifulSoup(text, "lxml")
# 공시 갯수
table = soup.find('table')
trs = table.findAll('tr') # <tr> </tr> : 1개의 공시.
counts = len(trs)
counts
2242
# 1개 공시의 html 구조
tr = trs[1]
tr
<tr>
<td class="cen_txt">
18:29
</td>
<td>
<span class="nobr1" style="max-width:150px;">
<img alt="유가증권시장" src="/images/ico_kospi.gif" title="유가증권시장"/>
<a href="/dsae001/selectPopup.ax?selectKey=00144164" onclick="openCorpInfo('00144164'); return false;" title="SK가스 기업개황 새창">
SK가스
</a>
</span>
</td>
<td>
<a href="/dsaf001/main.do?rcpNo=20170405002607" id="r_20170405002607" onclick="openReportViewer('20170405002607'); return false;" title="증권신고서(채무증권) 공시뷰어 새창">증권신고서(채무증권)
</a>
</td>
<td title="SK가스"><div class="nobr" style="width:95px">SK가스</div></td>
<td class="cen_txt">2017.04.05</td>
<td class="cen_txt end"></td>
</tr>
# 각 컬럼별 데이터 추출
tds = tr.findAll('td')
link = 'http://dart.fss.or.kr' + tds[2].a['href']
doc_id = link.split('main.do?rcpNo=')[1]
time = tds[0].text.strip()
date = tds[4].text.strip().replace('.', '-') + ' ' + time
corp_name = tds[1].text.strip()
market = tds[1].img['alt']
title = " ".join(tds[2].text.split())
reporter = tds[3].text.strip()
dart_dict = {'doc_id':[doc_id],'date':[date],'corp_name':[corp_name],
'market':[market],'title':[title],'link':[link],'reporter':[reporter]}
df = pd.DataFrame(dart_dict)
df
corp_name | date | doc_id | link | market | reporter | title | |
---|---|---|---|---|---|---|---|
0 | SK가스 | 2017-04-05 18:29 | 20170405002607 | http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20... | 유가증권시장 | SK가스 | 증권신고서(채무증권) |
# 전체 데이터를 DataFrame으로 만들기
if counts > 0:
link_list = []
docid_list = []
date_list = []
corp_list = []
market_list = []
title_list = []
reporter_list = []
for tr in trs[1:]:
tds = tr.findAll('td')
link = 'http://dart.fss.or.kr' + tds[2].a['href']
doc_id = link.split('main.do?rcpNo=')[1]
time = tds[0].text.strip()
date = tds[4].text.strip().replace('.', '-') + ' ' + time
corp_name = tds[1].text.strip()
market = tds[1].img['alt']
title = " ".join(tds[2].text.split())
reporter = tds[3].text.strip()
link_list.append(link)
docid_list.append(doc_id)
date_list.append(date)
corp_list.append(corp_name)
market_list.append(market)
title_list.append(title)
reporter_list.append(reporter)
dart_dict = {'doc_id':docid_list,'date':date_list,'corp_name':corp_list,'market':market_list,'title':title_list,
'link':link_list,'reporter':reporter_list}
df_dart = pd.DataFrame(dart_dict)
df_dart.head()
corp_name | date | doc_id | link | market | reporter | title | |
---|---|---|---|---|---|---|---|
0 | SK가스 | 2017-04-05 18:29 | 20170405002607 | http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20... | 유가증권시장 | SK가스 | 증권신고서(채무증권) |
1 | 오백볼트 | 2017-04-05 18:11 | 20170405601087 | http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20... | 코넥스시장 | 오백볼트 | [기재정정]감사보고서제출 |
2 | 우리카드이천십칠의일유동화전문유한회사 | 2017-03-17 18:06 | 20170317000629 | http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20... | 기타법인 | 우리카드 | [기재정정]자산양도등의등록신청서 |
3 | 디비에르메스유동화전문유한회사 | 2017-03-22 18:06 | 20170322000224 | http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20... | 기타법인 | 동부증권 | [첨부추가]자산양도등의등록신청서 |
4 | 스테코 | 2017-04-05 17:59 | 20170405002553 | http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20... | 기타법인 | 삼일회계법인 | 감사보고서 (2016.12) |
df_dart.shape
(2241, 7)
df_dart.loc[df_dart.doc_id == '20170403000932']
corp_name | date | doc_id | link | market | reporter | title | |
---|---|---|---|---|---|---|---|
1187 | 제이니몹 | 2017-04-03 14:26 | 20170403000932 | http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20... | 기타법인 | 제이니몹 | [첨부추가]사업보고서 (2016.12) |
Database Table Update
import mysql.connector
from sqlalchemy import create_engine
con_str = 'mysql+mysqlconnector://woosa7:finda888@localhost/findb'
engine = create_engine(con_str, echo=False)
df_dart.to_sql('stock_dart', engine, if_exists='append', index=False) # If table exists, insert data. Create if does not exist.