[FinDA] 금융감독원 전자공시(DART) 크롤링. html을 로컬에 저장. BeautifulSoup를 사용하여 태그 분석. DB update.


전자공시 웹페이지

http://dart.fss.or.kr/dsac001/search.ax

  • selectDate : 날짜
  • Cookie : DSAC001_MAXRESULTS=100;
  • 전자공시 목록을 조회하려면 Cookie를 사용해야 한다…!

img

wget을 사용하여 html 파일 다운받기

  • https://eternallybored.org/misc/wget/
  • 윈도우 64비트용 다운로드
  • Anaconda3 폴더에 wget64.exe 파일 복사
! wget64 http://dart.fss.or.kr/dsac001/search.ax?selectDate=20170405 \
--header="Cookie:DSAC001_MAXRESULTS=5000;" \
-O DART-20170405.html
--2017-04-07 16:58:50--  http://dart.fss.or.kr/dsac001/search.ax?selectDate=20170405
Resolving dart.fss.or.kr (dart.fss.or.kr)... 61.73.60.200
Connecting to dart.fss.or.kr (dart.fss.or.kr)|61.73.60.200|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'DART-20170405.html'

     0K .......... .......... .......... .......... ..........  796K
    50K .......... .......... .......... .......... .......... 2.58M
   100K .......... .......... .......... .......... .......... 3.81M
   150K .......... .......... .......... .......... .......... 3.83M
   200K .......... .......... .......... .......... .......... 3.49M
   250K .......... .......... .......... .......... .......... 2.34M
   300K .......... .......... .......... .......... .......... 4.48M
   350K .......... .......... .......... .......... .......... 3.10M
   400K .......... .......... .......... .......... .......... 2.58M
   450K .......... .......... .......... .......... .......... 2.41M
   500K .......... .......... .......... .......... .......... 2.07M
   550K .......... .......... .......... .......... .......... 1.70M
   600K .......... .......... .......... .......... .......... 1.66M
   650K .......... .......... .......... .......... .......... 1.06M
   700K .......... .......... .......... .......... ..........  175K
   750K .......... .......... .......... .......... .......... 1.13M
   800K .......... .......... .......... .......... .......... 1.44M
   850K .......... .......... .......... .......... .......... 3.02M
   900K .......... .......... .......... .......... .......... 2.31M
   950K .......... .......... .......... .......... .......... 2.84M
  1000K .......... .......... .......... .......... .......... 2.96M
  1050K .......... .......... .......... .......... .......... 2.59M
  1100K .......... .......... .......... .......... .......... 2.89M
  1150K .......... .......... .......... .......... .......... 2.72M
  1200K .......... .......... .......... .......... .......... 2.35M
  1250K .......... .......... .......... .......... .......... 2.90M
  1300K .......... .......... .......... .......... .......... 3.30M
  1350K .......... .......... .......... .......... .......... 3.00M
  1400K .......... .......... .......... .......... .......... 3.74M
  1450K .......... .......... .......... .......... .......... 2.54M
  1500K .......... .......... .......... .......... .......... 3.56M
  1550K .......... .......... .......... .......... .......... 3.44M
  1600K .......... .......... .......... .......... .......... 2.99M
  1650K .......... .......... .......... .......... .......... 3.04M
  1700K .......... .......... .......... .......... .......... 3.48M
  1750K .......... .......... .......... .......... .......... 2.87M
  1800K .......... .......... .......... .......... .......... 3.52M
  1850K .......... .......... .......... .......... .......... 2.86M
  1900K .......... .......... .......... .......... .......... 3.89M
  1950K .......... .......... .......... .......... .......... 3.27M
  2000K .......... .......... .......... .......... .......... 3.92M
  2050K .......... .......... .......... .......... .......... 2.51M
  2100K .......... .......... .......... .......... .......... 3.51M
  2150K .......... .......... .......... .......... .......... 4.05M
  2200K .......... .......... .......... .......... .......... 3.24M
  2250K .......... .......... .......... .......... .......... 3.54M
  2300K .......... ..                                          9.75M=1.2s

2017-04-07 16:58:54 (1.94 MB/s) - 'DART-20170405.html' saved [2367734]

request 를 이용하여 wget 함수 만들기

import requests
import os
from datetime import datetime, timedelta
data_dir = './dart_html/'   # 파일 저장할 폴더
url_templete = 'http://dart.fss.or.kr/dsac001/search.ax?selectDate=%s'
headers = {'Cookie':'DSAC001_MAXRESULTS=5000;'}
# wget 함수
def wget(url, local_filename):
    r = requests.get(url, headers=headers, stream=True)
    f = open(local_filename, 'wb')
    for chunk in r.iter_content(chunk_size=1024): 
        if chunk:
            f.write(chunk)
            f.flush()
    return local_filename
# 크롤링할 날짜 범위
start_day = datetime(2017, 4, 1)
end_day = datetime(2017, 4, 5)
delta = end_day - start_day
for i in range(delta.days + 1):
    d = start_day + timedelta(days=i)        # 1일씩 증가
    rdate = d.strftime('%Y%m%d')
    filename = 'DART-' + rdate + '.html'     # 저장할 로컬 파일명
    wget(url_templete % rdate, data_dir + filename)   # 크롤링
    fsize = os.path.getsize(data_dir + filename) / (1024 ** 2)
    print ("%s (%.2f MB) downloaded" % (filename, fsize))
DART-20170401.html (0.00 MB) downloaded
DART-20170402.html (0.00 MB) downloaded
DART-20170403.html (2.60 MB) downloaded
DART-20170404.html (2.08 MB) downloaded
DART-20170405.html (2.26 MB) downloaded

html 분석

img

from bs4 import BeautifulSoup
fname = "./dart_html/DART-20170405.html"

with open(fname, encoding='utf-8') as f:
    text = f.read()
    soup = BeautifulSoup(text, "lxml")

건수 & 날짜

# body-div-p-b
sel_list = soup.select('div p b')
count_str = sel_list[0].text
count_str
'\n\t\t\n\t\t\n\t\t\n\t\t\n        \n        \n\t\t전체\n\t\t \xa02,241건\n\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t(2017년 04월 05일)\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\n\t'
count_str = "".join(count_str.split())
count_str
'전체2,241건(2017년04월05일)'

img

전자공시 목록

# table
table = soup.find('table')
trs = table.findAll('tr')
print ("총 건수:", len(trs))
총 건수: 2242
# 0 (첫번째)는 테이블 헤더이므로 제외.
for tr in trs[1:6]:
    tds = tr.findAll('td')
    print(tds[0].text.strip(), end=" " )          # 시간
    print(tds[1].text.strip(), end=" " )          # 공시대상회사
    print(tds[1].img['alt'])                      # 이미지 태그 (유가증권, 코스닥, 코넥스 등)
    print(" ".join(tds[2].text.split()) )
    print('http://dart.fss.or.kr/' + tds[2].a['href'] )   # 링크
    print(tds[3].text.strip(), end=" ")                   # 제출인
    print(tds[4].text.strip().replace('.', '-') )         # 접수일자
    print()
18:29 SK가스 유가증권시장
증권신고서(채무증권)
http://dart.fss.or.kr//dsaf001/main.do?rcpNo=20170405002607
SK가스 2017-04-05

18:11 오백볼트 코넥스시장
[기재정정]감사보고서제출
http://dart.fss.or.kr//dsaf001/main.do?rcpNo=20170405601087
오백볼트 2017-04-05

18:06 우리카드이천십칠의일유동화전문유한회사 기타법인
[기재정정]자산양도등의등록신청서
http://dart.fss.or.kr//dsaf001/main.do?rcpNo=20170317000629
우리카드 2017-03-17

18:06 디비에르메스유동화전문유한회사 기타법인
[첨부추가]자산양도등의등록신청서
http://dart.fss.or.kr//dsaf001/main.do?rcpNo=20170322000224
동부증권 2017-03-22

17:59 스테코 기타법인
감사보고서 (2016.12)
http://dart.fss.or.kr//dsaf001/main.do?rcpNo=20170405002553
삼일회계법인 2017-04-05

MySQL Table Update

html 파일을 DataFrame으로 변환

stock_dart Table

  `doc_id` varchar(25) NOT NULL,
  `date` datetime DEFAULT NULL,
  `corp_name` varchar(50) DEFAULT NULL,
  `market` varchar(50) DEFAULT NULL,
  `title` varchar(255) DEFAULT NULL,
  `link` varchar(128) DEFAULT NULL,
  `reporter` varchar(50) DEFAULT NULL,
  UNIQUE KEY `doc_id` (`doc_id`)
import pandas as pd
fname = "./dart_html/DART-20170405.html"

with open(fname, encoding='utf-8') as f:
    text = f.read()
    soup = BeautifulSoup(text, "lxml")
# 공시 갯수
table = soup.find('table')
trs = table.findAll('tr')     # <tr> </tr> : 1개의 공시.
counts = len(trs)
counts
2242
# 1개 공시의 html 구조
tr = trs[1]
tr
<tr>
<td class="cen_txt">
						18:29
					</td>
<td>
<span class="nobr1" style="max-width:150px;">
<img alt="유가증권시장" src="/images/ico_kospi.gif" title="유가증권시장"/> 
							<a href="/dsae001/selectPopup.ax?selectKey=00144164" onclick="openCorpInfo('00144164'); return false;" title="SK가스 기업개황 새창">
								SK가스
							</a>
</span>
</td>
<td>
<a href="/dsaf001/main.do?rcpNo=20170405002607" id="r_20170405002607" onclick="openReportViewer('20170405002607'); return false;" title="증권신고서(채무증권) 공시뷰어 새창">증권신고서(채무증권)
							
  							
						</a>
</td>
<td title="SK가스"><div class="nobr" style="width:95px">SK가스</div></td>
<td class="cen_txt">2017.04.05</td>
<td class="cen_txt end"></td>
</tr>
# 각 컬럼별 데이터 추출
tds = tr.findAll('td')

link = 'http://dart.fss.or.kr' + tds[2].a['href']
doc_id = link.split('main.do?rcpNo=')[1]

time = tds[0].text.strip()
date = tds[4].text.strip().replace('.', '-') + ' ' + time

corp_name = tds[1].text.strip()
market = tds[1].img['alt']

title = " ".join(tds[2].text.split())

reporter = tds[3].text.strip()

dart_dict = {'doc_id':[doc_id],'date':[date],'corp_name':[corp_name],
'market':[market],'title':[title],'link':[link],'reporter':[reporter]}
df = pd.DataFrame(dart_dict)
df
corp_name date doc_id link market reporter title
0 SK가스 2017-04-05 18:29 20170405002607 http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20... 유가증권시장 SK가스 증권신고서(채무증권)
# 전체 데이터를 DataFrame으로 만들기
if counts > 0:
    link_list = []
    docid_list = []
    date_list = []
    corp_list = []
    market_list = []
    title_list = []
    reporter_list = []
    
    for tr in trs[1:]:
        tds = tr.findAll('td')
        link = 'http://dart.fss.or.kr' + tds[2].a['href']
        doc_id = link.split('main.do?rcpNo=')[1]
        time = tds[0].text.strip()
        date = tds[4].text.strip().replace('.', '-') + ' ' + time
        corp_name = tds[1].text.strip()
        market = tds[1].img['alt']
        title = " ".join(tds[2].text.split())
        reporter = tds[3].text.strip()
        
        link_list.append(link)
        docid_list.append(doc_id)
        date_list.append(date)
        corp_list.append(corp_name)
        market_list.append(market)
        title_list.append(title)
        reporter_list.append(reporter)
        
    dart_dict = {'doc_id':docid_list,'date':date_list,'corp_name':corp_list,'market':market_list,'title':title_list,
                 'link':link_list,'reporter':reporter_list}
    df_dart = pd.DataFrame(dart_dict)    
df_dart.head()
corp_name date doc_id link market reporter title
0 SK가스 2017-04-05 18:29 20170405002607 http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20... 유가증권시장 SK가스 증권신고서(채무증권)
1 오백볼트 2017-04-05 18:11 20170405601087 http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20... 코넥스시장 오백볼트 [기재정정]감사보고서제출
2 우리카드이천십칠의일유동화전문유한회사 2017-03-17 18:06 20170317000629 http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20... 기타법인 우리카드 [기재정정]자산양도등의등록신청서
3 디비에르메스유동화전문유한회사 2017-03-22 18:06 20170322000224 http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20... 기타법인 동부증권 [첨부추가]자산양도등의등록신청서
4 스테코 2017-04-05 17:59 20170405002553 http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20... 기타법인 삼일회계법인 감사보고서 (2016.12)
df_dart.shape
(2241, 7)
df_dart.loc[df_dart.doc_id == '20170403000932']
corp_name date doc_id link market reporter title
1187 제이니몹 2017-04-03 14:26 20170403000932 http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20... 기타법인 제이니몹 [첨부추가]사업보고서 (2016.12)

Database Table Update

import mysql.connector
from sqlalchemy import create_engine
con_str = 'mysql+mysqlconnector://woosa7:finda888@localhost/findb'
engine = create_engine(con_str, echo=False)
df_dart.to_sql('stock_dart', engine, if_exists='append', index=False)  # If table exists, insert data. Create if does not exist.