[Study Notes][Inflearn] Understanding the Big Picture of Python and Crawling in Data Science and IT (2)

많루 2021. 1. 1. 16:47

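※ Study notes

Study date	January 2020
Study site	Inflearn
Course title	Python Introduction and Crawling Basics Bootcamp (2020 update) [Easy! Down to solid materials!]
Study time	72 lectures / 941 minutes
Comments	- I enrolled to pick up tips for the cases where crawling does not work as intended, so I skip what overlaps with Sparta Coding Club and take the parts I need first - Concepts are explained in great detail, which suits a beginner like me; it feels like reviewing what I already learned while filling in the gaps - Organized into micro units of 10 to 20 minutes
Study module	Crawling and web basics - hands-on part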

Crawling and web basics__Learning through pattern practice: crawling based on an understanding of HTML/CSS

* Crawling pattern code

import requests
from bs4 import BeautifulSoup

# Download the page and parse the HTML
res = requests.get('https://news.v.daum.net/v/20170615203441266')
soup = BeautifulSoup(res.content, 'html.parser')

# Find every <span> whose class is txt_info and print its text
mydata = soup.find_all('span', 'txt_info')
for item in mydata:
    print(item.get_text())

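* meta tags are often used so that search engines can find the page easily

* When the same tag is used in several places in the HTML, use its class to tell them apart

* When the part you want to crawl uses several classes, pass the class value exactly as written, spaces included

* To get the text of a found tag, use item.get_text() or item.string (mydata returned by find_all() is a list, so call these on each item)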
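
The class-related notes above can be tried without a network connection. The following is a minimal sketch on a made-up HTML string (the markup and its text are assumptions for illustration, not the real Daum page); it shows a single-class match, a multi-class match with the class value passed as written, and how get_text() and .string differ.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page
html = '''
<div>
  <span class="txt_info">Reporter Kim</span>
  <span class="txt_info date">2017.06.15</span>
  <p class="body main text">Article <b>body</b></p>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Tag name plus one class: matches every <span> that has the txt_info class
infos = [item.get_text() for item in soup.find_all('span', 'txt_info')]

# Several classes: pass the class attribute exactly as written, spaces included
body = soup.find('p', 'body main text')

print(infos)
print(body.get_text())  # text of the tag and all of its children
print(body.string)      # None here, because <p> has more than one child node
```

When a tag contains only a single piece of text, .string and get_text() give the same result; get_text() is the safer choice when the tag may contain nested tags.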

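Crawling and web basics__Learning through pattern practice: real-world crawling and powerful crawling tips

* Open Chrome Developer Tools

: On Windows, Ctrl + Shift + I or F12

* Right-click the area you want, then choose Copy > Copy outerHTML to copy it

* Use find() to extract a larger enclosing HTML tag first, then use find_all() on the extracted data to get the part you want

* For text preprocessing: use the strip() and split() string methods

* Numbering the items of a list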

for index, title in enumerate(titles):
    print(str(index+1)+'.', title.get_text().split()[0].strip())
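
The tips above (narrow with find(), then find_all(), then clean up with strip()/split() and number with enumerate()) can be combined as in the sketch below. The HTML string, the id value ranking, and the item names are assumptions made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: two lists, only the second one is wanted
html = '''
<ul id="nav"><li>Home</li></ul>
<ul id="ranking">
  <li>  Laptop   new </li>
  <li>  Monitor  up </li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

# Step 1: narrow down to the enclosing tag with find()
section = soup.find('ul', id='ranking')
# Step 2: extract the wanted tags only inside that section
titles = section.find_all('li')

lines = []
for index, title in enumerate(titles):
    # strip() removes surrounding whitespace, split()[0] keeps the first word
    name = title.get_text().strip().split()[0]
    lines.append(str(index + 1) + '. ' + name)

print('\n'.join(lines))
```

Searching inside section instead of soup keeps the "Home" item of the other list out of the result.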

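Crawling and web basics__Crawling with CSS selectors

* Pass a tag name, a CSS class name, and so on to select(); the result is returned as a list

* Separate descendant tags with a space

* You do not need to write every level of tags, but to match only a tag directly below another one, use > as in ul > a

* Use . for a class name; for several classes, chain them with . as well

For an id, use #

* An object returned by find() or select() supports both find() and select(), so the two can be mixed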
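
Each selector rule above can be checked on a small document. This is a sketch on a made-up HTML string (the id menu and the classes item and hot are assumptions for illustration only).

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only to demonstrate the selector rules
html = '''
<ul id="menu">
  <li class="item hot"><a href="/a">A</a></li>
  <li class="item"><span><a href="/b">B</a></span></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

# Space: any descendant, intermediate levels can be omitted
all_links = [a.get_text() for a in soup.select('ul a')]
# >: only a direct child (the second <a> sits inside a <span>, so it is skipped)
direct_links = [a.get_text() for a in soup.select('li > a')]
# .class.class: a tag that has both classes
hot_items = [li.a.get_text() for li in soup.select('li.item.hot')]
# #id: select() still returns a list, here with one element
menu = soup.select('#menu')

# Mixing the two APIs: select() on a find() result and the reverse
nested = soup.find('li').select('a')
items = soup.select_one('#menu').find_all('li')

print(all_links, direct_links, hot_items, len(menu), len(nested), len(items))
```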

Straight into real crawling: crawling the Naver Shopping site

# Naver Shopping best100 popular search terms

import requests
from bs4 import BeautifulSoup

res = requests.get('https://search.shopping.naver.com/best100v2/main.nhn')
soup = BeautifulSoup(res.content, 'html.parser')

items = soup.select('#popular_srch_lst > li > span.txt > a')
for item in items:
    print(item.get_text())

# Naver Shopping product titles for a specific category

import requests
from bs4 import BeautifulSoup

res = requests.get('https://search.shopping.naver.com/search/all?frm=NVBT100&origQuery=%EB%8B%A4%EC%9D%B4%EC%96%B4%EB%A6%AC&pagingIndex=2&pagingSize=40&productSet=total&query=%EB%8B%A4%EC%9D%B4%EC%96%B4%EB%A6%AC&sort=rel&timestamp=&viewType=thumb')
soup = BeautifulSoup(res.content, 'html.parser')

items = soup.select('div.imgList_title__3yJlT > a')

for item in items:
    print(item.get_text())

# Naver Finance popular search stocks

import requests
from bs4 import BeautifulSoup

res = requests.get('https://finance.naver.com/sise/')
soup = BeautifulSoup(res.content, 'html.parser')

items = soup.select('#popularItemList > li > a')
for item in items:
    print(item.get_text())

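Another crawling technique: understanding how to use the urllib library

* urllib + bs4 used to be the common pairing, but requests + bs4 is more common these days

Some existing code still uses urllib, and its encoding handling differs

* Crawl with the requests library, and try urllib only when there is a problem

from urllib.request import urlopen  # only the urlopen function from urllib's request module is needed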
from bs4 import BeautifulSoup

res = urlopen('https://davelee-fun.github.io/')
soup = BeautifulSoup(res, 'html.parser')

items = soup.select('h4.card-text')
for item in items:
    print(item.get_text().strip())
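
One way to see the encoding difference mentioned above: urlopen() hands back raw bytes from read(), and decoding them is up to the caller. The sketch below assumes a temporary local file as a stand-in for a real site (the file name page.html and its content are made up), so it runs without a network connection.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen

# Hypothetical page content with a non-ASCII character
html = '<h4 class="card-text"> Café menu </h4>'

with tempfile.TemporaryDirectory() as tmp_dir:
    page = Path(tmp_dir) / 'page.html'
    page.write_bytes(html.encode('utf-8'))

    # urlopen() also accepts file:// URLs, which stands in for a web page here
    with urlopen(page.as_uri()) as res:
        raw = res.read()  # bytes, not str

# The caller chooses the encoding explicitly
decoded = raw.decode('utf-8')

print(type(raw).__name__)
print(decoded)
```

With requests, res.content is the same kind of raw bytes, while res.text is already decoded using the encoding requests determined for the response.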

Technique for crawling several pages at once

* Use the page number in the URL together with a for loop

import requests
from bs4 import BeautifulSoup

for page_num in range(10):
    if page_num == 0:
        res = requests.get('https://davelee-fun.github.io/')
    else:
        res = requests.get('https://davelee-fun.github.io/page' + str(page_num + 1))
    soup = BeautifulSoup(res.content, 'html.parser')

    data = soup.select('h4.card-text')
    for item in data:
        print(item.get_text().strip())
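
The URL logic of the loop above can be pulled out into a small helper and checked before any request is sent. build_page_urls is a hypothetical name introduced for this sketch; the URL pattern is the one used in the notes (first page is the base URL, later pages append page2, page3, ...).

```python
def build_page_urls(base_url, page_count):
    """Return the list of page URLs in the order they would be crawled."""
    urls = []
    for page_num in range(page_count):
        if page_num == 0:
            # The first page has no page number in its address
            urls.append(base_url)
        else:
            # range() starts at 0, so page_num 1 corresponds to page2
            urls.append(base_url + 'page' + str(page_num + 1))
    return urls


urls = build_page_urls('https://davelee-fun.github.io/', 3)
for url in urls:
    print(url)
```

Building the list first makes an off-by-one mistake in the page number easy to spot.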

Saving crawled data to an Excel file

* Uses the openpyxl library

* Define a function

import openpyxl
def write_excel_template(filename, sheetname, listdata):
    excel_file = openpyxl.Workbook()  # create a new Excel workbook
    excel_sheet = excel_file.active  # select the active sheet

    if sheetname != '':  # rename the sheet only when a name is given
        excel_sheet.title = sheetname

    for item in listdata:
        excel_sheet.append(item)
    excel_file.save(filename)
    excel_file.close()

* Crawl and save

import requests
from bs4 import BeautifulSoup

product_lists = list()

for page_num in range(10):
    if page_num == 0:
        res = requests.get('https://davelee-fun.github.io/')
    else:
        res = requests.get('https://davelee-fun.github.io/page'+str(page_num+1))
    soup = BeautifulSoup(res.content,'html.parser')

    data=soup.select('div.card')
    for item in data:
        product_name = item.select_one('div.card-body h4.card-text')
        product_date = item.select_one('div.wrapfooter span.post-date')
        product_info = [product_name.get_text().strip(),product_date.get_text()]
        product_lists.append(product_info)

write_excel_template('tmp.xlsx','상품정보',product_lists)

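* Loading the Excel file

import openpyxl

excel_file = openpyxl.load_workbook('tmp.xlsx')  # open the file saved above
excel_sheet = excel_file.active

for item in excel_sheet.rows:
    print(item[0].value, item[1].value)  # add .value to read the value stored in a cell

excel_file.close()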
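
Saving and loading can be checked together as a round trip. The sketch below uses made-up rows instead of crawled data and writes to a temporary folder so nothing is left behind; the sheet name products and the row values are assumptions for illustration.

```python
import os
import tempfile

import openpyxl

# Hypothetical rows standing in for crawled [name, date] pairs
rows = [['Product A', '2020-01-01'], ['Product B', '2020-01-02']]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'tmp.xlsx')

    # Write: one append() call per row
    workbook = openpyxl.Workbook()
    sheet = workbook.active
    sheet.title = 'products'
    for row in rows:
        sheet.append(row)
    workbook.save(path)
    workbook.close()

    # Read back: look the sheet up by name and take .value from each cell
    loaded = openpyxl.load_workbook(path)
    loaded_sheet = loaded['products']
    read_back = [[cell.value for cell in row] for row in loaded_sheet.rows]
    loaded.close()

print(read_back)
```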

[Sparta Coding Club] Week 3 review and assignment log

많루 2020. 10. 19. 00:24

※ Review notes

* Official Python style guide

- https://www.python.org/dev/peps/pep-0008/

- https://docs.python.org/ko/3/tutorial/controlflow.html#intermezzo-coding-style

* List and Dictionary

a_list = []  # create an empty list

a_list.append('a')  # add a value to the list

a_dict = {}  # create an empty dictionary
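
A short sketch of how the two containers are typically filled and combined; the keys and values (name, age, bob, carol) are made up for the example.

```python
# List: ordered values, added with append()
a_list = []
a_list.append('apple')
a_list.append('pear')

# Dictionary: values stored under keys
a_dict = {}
a_dict['name'] = 'bob'
a_dict['age'] = 21

# A list of dictionaries is a common shape for scraped records
people = [a_dict, {'name': 'carol', 'age': 30}]
names = [person['name'] for person in people]

print(a_list)
print(a_dict)
print(names)
```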

* Python functions

def function_name(required_parameters):
    write the commands to run, in order

# Calling it

function_name(required_parameters)

* Python conditional

def is_even(num):  # is_even is the name of a function that takes num as a parameter
    if num % 2 == 0:  # if the remainder of num divided by 2 is 0
        return True  # return True
    else:  # otherwise,
        return False  # return False

# JavaScript conditional

function is_even(num) {
    if (num % 2 == 0) {
        return true;
    } else {
        return false;
    }
}

* Python loop

fruits = ['apple', 'pear', 'melon']

for fruit in fruits:  # fruit is an arbitrary name
    print(fruit)

# JavaScript loop

let fruits = ['apple', 'pear', 'melon']

for (let i = 0; i < fruits.length; i++) {
    console.log(fruits[i])
}

* Counting with a Python conditional and loop

fruits = ['apple', 'pear', 'pear', 'persimmon', 'watermelon', 'tangerine', 'strawberry', 'apple', 'pear', 'watermelon']

def count_fruits(name):
    count = 0
    for fruit in fruits:
        if fruit == name:
            count += 1
    return count

subak_count = count_fruits('watermelon')
print(subak_count)  # print the number of watermelons

# Counting with a JavaScript conditional and loop

let fruits = ['apple', 'pear', 'pear', 'persimmon', 'watermelon', 'tangerine', 'strawberry', 'apple', 'pear', 'watermelon']

let count = 0;

for (let i = 0; i < fruits.length; i++) {
    let fruit = fruits[i];
    if (fruit == 'watermelon') {
        count += 1;
    }
}
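
For comparison with the hand-written Python counting loop, the standard library's collections.Counter does the same tally for every item in one pass. This is an alternative sketch, not what the course used; the list mirrors the fruit list from the notes in English.

```python
from collections import Counter

fruits = ['apple', 'pear', 'pear', 'persimmon', 'watermelon',
          'tangerine', 'strawberry', 'apple', 'pear', 'watermelon']

# Counter builds a mapping of item -> number of occurrences
counts = Counter(fruits)

print(counts['watermelon'])
print(counts['pear'])
print(counts['mango'])  # a missing item counts as 0 instead of raising an error
```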

* Web scraping

- Package to install: beautifulsoup4, a tool for easily scraping HTML code

- To print the text inside a tag → tag.text

To print an attribute of a tag → tag['attribute']

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
data = requests.get('https://sports.news.naver.com/kbaseball/record/index.nhn?category=kbo', headers=headers)

# Use the BeautifulSoup library to turn the HTML into a form that is easy to search
# The soup variable now holds the HTML in an easy-to-parse form
# From here, extract the parts you need in code

soup = BeautifulSoup(data.text, 'html.parser')

* Removing surrounding whitespace from crawled content: .text.strip()
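
The three access patterns noted in this section (tag.text, tag['attribute'], and .text.strip()) can be seen side by side in the sketch below. The HTML string, the class team, and the link target are made up for the example and do not come from the Naver page.

```python
from bs4 import BeautifulSoup

# Hypothetical table cell standing in for part of a scraped page
html = '<table><tr><td class="team"><a href="/team/doosan">  Doosan  </a></td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

link = soup.select_one('td.team a')

raw_text = link.text       # text inside the tag, whitespace included
name = link.text.strip()   # same text with surrounding whitespace removed
href = link['href']        # value of the href attribute

print(repr(raw_text))
print(name)
print(href)
```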
