Day 2 (Data Crawling)

Python / Introductory Machine Learning Course with Python

hyunjoo 2021. 7. 15. 04:26

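#Data crawling: fetching a web page as-is and extracting data from it

#HTML: the structure and content of a site

#JSP: server integration, data storage, and various other features

#CSS: visual styling, the "Photoshop" of a page

Because crawling only retrieves a page's content, knowledge of HTML alone is sufficient.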

<Lotto Data Crawling>

>import requests     
>from bs4 import BeautifulSoup  
>import pandas as pd  

>url=requests.get("https://search.naver.com/search.naver?where=nexearch&sm=top_hty&fbm=1&ie=utf8&query=%EB%A1%9C%EB%98%90+777%ED%9A%8C")
>url
<Response [200]>   #200: OK // 4xx: client-side error, e.g. 404 page not found // 5xx: server-side error
>url.text           #Running this shows the raw, unorganized data. It is HTML, but it was fetched as a plain string, so it has to be parsed into an HTML object.
<!doctype html> <html lang="ko"> <head> <meta charset="utf-8"> <meta name="referrer" content="always"> 
me="description" lang="ko" content="\'로또 777회\'의 네이버 통합검색 결과입니다.">
...omitted....
<title>로또 777회 : 네이버 통합검색</title> <link rel="shortcut icon" href="https://ssl.pstatic.net/sstatic/search/favicon/favicon_191118_pc.ico">  
<link rel="search" type="application/opensearchdescription+xml" href="https://ssl.pstatic.net/sstatic/search/opensearch-description.https.xml" title="Naver" />
<link rel="stylesheet" type="text/css" href="https://ssl.pstatic.net/sstatic/search/pc/css/search1_210708.css"> <link rel="stylesheet" type=

#requests is a module for requesting the content at a URL.
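
The status codes noted next to the `<Response [200]>` output can also be checked in code. The following is a minimal sketch using only the standard library's `http.HTTPStatus`; the helper name `describe_status` is made up for illustration and is not part of requests.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the code, its standard phrase, and the class it belongs to."""
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirection"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "informational or unknown"
    try:
        phrase = HTTPStatus(code).phrase  # e.g. "OK", "Not Found"
    except ValueError:
        phrase = "Unknown"  # code is not a registered HTTP status
    return f"{code} {phrase} ({category})"


for code in (200, 404, 500):
    print(describe_status(code))
```

With requests, the number to pass in would be `url.status_code`.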

#BeautifulSoup is a Python library for working with HTML.

#The data fetched from the URL is raw and unorganized. It is also stored as a string, so it has to be converted into an HTML object.

#How to convert it to HTML

>html=BeautifulSoup(url.text, 'html.parser')   #naming the parser explicitly avoids bs4's parser-guessing warning
>html
<!DOCTYPE html>
<html lang="ko"> <head> <meta charset="utf-8"/> <meta content="always" name="referrer"/> <meta content="telephone=no,address=no,email=no" name="format-detection"/> <meta content="width=device-width,initial-scale=1.0,maximum-scale=2.0" name="viewport"/> 
    if (typeof NAVER === "undefined") {
      NAVER = {};
    }
  
    .....omitted....

#Tag name: div, class attribute value: num_box

>html.select('div.num_box') #select() returns a list, so the element has to be taken out of it by index.
[<div class="num_box"> <span class="num ball6">6</span> <span class="num ball12">12</span> <span class="num ball17">17</span> <span class="num ball21">21</span> <span class="num ball34">34</span> <span class="num ball37">37</span> <span class="bonus">보너스번호</span> <span class="num ball18">18</span> <a class="btn_num" href="https://www.dhlottery.co.kr/gameResult.do?method=myWin" nocr="" onclick="return goOtherCR(this, 'a=nco_x5e*1.contents&amp;r=1&amp;i=0011AD9E_0000009BBC09&amp;u=' + urlencode(this.href));" target="_blank">내 번호 당첨조회</a> </div>]


>html.select('div.num_box')[0]
<div class="num_box"> <span class="num ball6">6</span> <span class="num ball12">12</span> <span class="num ball17">17</span> <span class="num ball21">21</span> <span class="num ball34">34</span> <span class="num ball37">37</span> <span class="bonus">보너스번호</span> <span class="num ball18">18</span> <a class="btn_num" href="https://www.dhlottery.co.kr/gameResult.do?method=myWin" nocr="" onclick="return goOtherCR(this, 'a=nco_x5e*1.contents&amp;r=1&amp;i=0011AD9E_0000009BBC09&amp;u=' + urlencode(this.href));" target="_blank">내 번호 당첨조회</a> </div>

#Each number sits in a span tag

>lotto_number=html.select('div.num_box')[0].select('span')
>lotto_number
[<span class="num ball6">6</span>,
 <span class="num ball12">12</span>,
 <span class="num ball17">17</span>,
 <span class="num ball21">21</span>,
 <span class="num ball34">34</span>,
 <span class="num ball37">37</span>,
 <span class="bonus">보너스번호</span>,
 <span class="num ball18">18</span>]
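
To make it clearer what `select('div.num_box')` followed by `select('span')` is doing, here is a rough sketch of the same idea written with the standard library's `html.parser` instead of BeautifulSoup. The class name `NumBoxSpanParser` and the shortened `sample` string are assumptions made for illustration; the real page is much larger.

```python
from html.parser import HTMLParser


class NumBoxSpanParser(HTMLParser):
    """Collect the text of every <span> inside <div class="num_box">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0    # > 0 while we are inside the num_box div
        self.in_span = False  # True while we are inside a span of interest
        self.spans = []       # collected span texts

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            # Enter the target div, or track a div nested inside it.
            if self.div_depth > 0 or "num_box" in classes:
                self.div_depth += 1
        elif tag == "span" and self.div_depth > 0:
            self.in_span = True
            self.spans.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "span":
            self.in_span = False

    def handle_data(self, data):
        if self.in_span:
            self.spans[-1] += data


# Shortened stand-in for the HTML shown in the output above.
sample = ('<div class="num_box"> <span class="num ball6">6</span> '
          '<span class="num ball12">12</span> <span class="bonus">보너스번호</span> '
          '<span class="num ball18">18</span> </div>')

parser = NumBoxSpanParser()
parser.feed(sample)
parser.close()
print(parser.spans)
```

BeautifulSoup does all of this bookkeeping internally, which is why the one-line `select` calls are enough in practice.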

#Delete the bonus-number label (the span at index 6, whose text is 보너스번호)

>lotto_number=html.select('div.num_box')[0].select('span')
>del lotto_number[6]
>lotto_number
[<span class="num ball6">6</span>,
 <span class="num ball12">12</span>,
 <span class="num ball17">17</span>,
 <span class="num ball21">21</span>,
 <span class="num ball34">34</span>,
 <span class="num ball37">37</span>,
 <span class="num ball18">18</span>]
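
Deleting by the fixed index 6 works as long as the label always sits in that position. As a hedged alternative, the label can be dropped by checking whether the text is numeric. The sketch below uses plain strings as stand-ins for the `.text` of each span; the variable names are illustrative only.

```python
# Stand-ins for the .text values of the eight spans shown earlier.
span_texts = ['6', '12', '17', '21', '34', '37', '보너스번호', '18']

# Keep only entries made up of digits; the label is filtered out
# wherever it appears in the list.
numbers_only = [t for t in span_texts if t.strip().isdigit()]
print(numbers_only)
```

With the real tags this would read roughly `[s for s in lotto_number if s.text.strip().isdigit()]`.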

#Here, "text" means the lotto numbers (6, 12, 17, 21, 34, 37, 18)

#Print only the text

>for i in lotto_number:
>    print(i.text)
6
12
17
21
34
37
18

#Create a list named box and store the lotto numbers in it

>box=[]
>for i in lotto_number:
>   box.append(i.text)
>box
['6', '12', '17', '21', '34', '37', '18']
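
Note that the values in `box` are strings, not numbers. A small sketch of converting them to integers and separating the six main numbers from the bonus number follows; the names `numbers`, `main_numbers`, and `bonus` are introduced here for illustration.

```python
box = ['6', '12', '17', '21', '34', '37', '18']

# Convert every string to an int so the values can be sorted or counted.
numbers = [int(s) for s in box]

# The first six entries are the main numbers; the last one is the bonus.
main_numbers, bonus = numbers[:6], numbers[6]
print(main_numbers, bonus)
```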

#Create a list to hold all the data and fetch the data for lotto draws 1 through 100

>import requests
>from bs4 import BeautifulSoup
>import pandas as pd
>
>total=[]
>
>for n in range(1,101):
>
>    url=requests.get(f"https://search.naver.com/search.naver?where=nexearch&sm=top_hty&fbm=1&ie=utf8&query=%EB%A1%9C%EB%98%90+{n}%ED%9A%8C")
>    html=BeautifulSoup(url.text, 'html.parser')
>
>    lotto_number=html.select('div.num_box')[0].select('span')
>    del lotto_number[6]
>
>    box=[]
>
>    for i in lotto_number:
>      box.append(int(i.text))

#for n in range(1,101): lotto draws 1 through 100

#In the address "https://search.naver.com/search.naver?where=nexearch&sm=top_hty&fbm=1&ie=utf8&query=%EB%A1%9C%EB%98%90+777%ED%9A%8C"

'777' stands for draw 777; replacing it with n makes it possible to look up draws 1 through 100.

#Put f in front of the address string to make it an f-string, so that {n} is replaced with the value of n
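
The percent-encoded parts of the address are just the Korean words 로또 ("lotto") and 회 ("draw") in UTF-8. The sketch below rebuilds the same URL with an f-string and `urllib.parse.quote`; the function name `build_search_url` is an assumption for illustration.

```python
from urllib.parse import quote


def build_search_url(n: int) -> str:
    """Build the Naver search URL for lotto draw number n."""
    # quote() percent-encodes the Korean text as UTF-8 bytes.
    query = f"{quote('로또')}+{n}{quote('회')}"
    return ("https://search.naver.com/search.naver"
            f"?where=nexearch&sm=top_hty&fbm=1&ie=utf8&query={query}")


print(build_search_url(777))
```

Calling `build_search_url(777)` reproduces the address used at the start of the post, and any other draw number can be passed in the same way.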

#Full code that fetches the lotto numbers for draws 1 through 100, and its output

>import requests
>from bs4 import BeautifulSoup
>import pandas as pd
>
>total=[]
>
>for n in range(1,101):
>
>    url=requests.get(f"https://search.naver.com/search.naver?where=nexearch&sm=top_hty&fbm=1&ie=utf8&query=%EB%A1%9C%EB%98%90+{n}%ED%9A%8C")
>    html=BeautifulSoup(url.text, 'html.parser')
>
>    lotto_number=html.select('div.num_box')[0].select('span')
>    del lotto_number[6]
>
>    box=[]
>
>    for i in lotto_number:
>      box.append(int(i.text))
>
>    total.append(box)
>
>    print('Lotto draw {} saved: {}'.format(n,box))
Lotto draw 1 saved: [10, 23, 29, 33, 37, 40, 16]
Lotto draw 2 saved: [9, 13, 21, 25, 32, 42, 2]
Lotto draw 3 saved: [11, 16, 19, 21, 27, 31, 30]
Lotto draw 4 saved: [14, 27, 30, 31, 40, 42, 2]
Lotto draw 5 saved: [16, 24, 29, 40, 41, 42, 3]
Lotto draw 6 saved: [14, 15, 26, 27, 40, 42, 34]
Lotto draw 7 saved: [2, 9, 16, 25, 26, 40, 42]
....omitted.....
Lotto draw 97 saved: [6, 7, 14, 15, 20, 36, 3]
Lotto draw 98 saved: [6, 9, 16, 23, 24, 32, 43]
Lotto draw 99 saved: [1, 3, 10, 27, 29, 37, 11]
Lotto draw 100 saved: [1, 7, 11, 23, 37, 42, 6]
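
The loop above stops with an error as soon as one page lacks a `div.num_box` or contains unexpected text. A hedged sketch of a more forgiving loop is shown below. `collect_draws`, `fetch_numbers`, and the fake fetcher are all hypothetical names; the fetcher stands in for the requests and BeautifulSoup steps so the example runs without network access.

```python
import time


def collect_draws(draw_numbers, fetch_numbers, delay_seconds=0.0):
    """Fetch each draw, skipping draws whose page could not be parsed.

    fetch_numbers is a hypothetical callable: draw number -> list of ints.
    """
    total = []   # successfully parsed draws
    failed = []  # draw numbers that raised an error
    for n in draw_numbers:
        try:
            box = fetch_numbers(n)
        except (IndexError, ValueError):
            # IndexError: select(...)[0] found no div.num_box
            # ValueError: int() met non-numeric text
            failed.append(n)
            continue
        total.append(box)
        if delay_seconds > 0:
            time.sleep(delay_seconds)  # pause between requests
    return total, failed


# Stand-in fetcher so the sketch runs offline.
fake_pages = {1: [10, 23, 29, 33, 37, 40, 16], 2: [9, 13, 21, 25, 32, 42, 2]}


def fake_fetch(n):
    if n not in fake_pages:
        raise IndexError(f"no num_box found for draw {n}")
    return fake_pages[n]


total, failed = collect_draws(range(1, 4), fake_fetch)
print(total)
print(failed)
```

In real use, `fetch_numbers` would wrap the request, parse, and `int(i.text)` steps from the code above, and a small `delay_seconds` keeps the loop from sending 100 requests back to back.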

#Show all the data stored in total

>total
[[10, 23, 29, 33, 37, 40, 16],
 [9, 13, 21, 25, 32, 42, 2],
 [11, 16, 19, 21, 27, 31, 30],
 [14, 27, 30, 31, 40, 42, 2],
 [16, 24, 29, 40, 41, 42, 3],
 ...omitted....
 [1, 3, 8, 21, 22, 31, 20],
 [6, 7, 14, 15, 20, 36, 3],
 [6, 9, 16, 23, 24, 32, 43],
 [1, 3, 10, 27, 29, 37, 11],
 [1, 7, 11, 23, 37, 42, 6]]

#Showing the data as a table / Excel sheet

#numpy,pandas,tensorflow
#pandas calls a table a DataFrame.

>df=pd.DataFrame(total, columns=['번호1','번호2','번호3','번호4','번호5','번호6','보너스번호'])
>df

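#Save the total data held in the df variable to an Excel file (writing .xlsx requires an Excel engine such as openpyxl to be installed)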

>df.to_excel('lotto.xlsx')
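
If pandas or an Excel engine is not available, the same table can be written as CSV, which Excel also opens. This is a swapped-in alternative using the standard library's `csv` module, not the method used in the course. The two sample rows are the first two draws from the output above, and the text is written to an in-memory buffer rather than a file.

```python
import csv
import io

columns = ['번호1', '번호2', '번호3', '번호4', '번호5', '번호6', '보너스번호']
total = [[10, 23, 29, 33, 37, 40, 16],
         [9, 13, 21, 25, 32, 42, 2]]

# Write the header row followed by one row per draw.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(columns)
writer.writerows(total)
csv_text = buffer.getvalue()

# Read the text back to confirm the table survived the round trip.
rows = list(csv.reader(io.StringIO(csv_text)))
print(rows[0])
print(rows[1])
```

To write an actual file, open one with `open('lotto.csv', 'w', newline='', encoding='utf-8-sig')` and pass it to `csv.writer`; the `utf-8-sig` encoding helps Excel display the Korean column names correctly.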
