Crawling Naver Movie Ratings with Beautiful Soup

1. Naver Movie Ratings

1.1 Naver Movie Ratings

https://movie.naver.com/movie/sdb/rank/rmovie.nhn?sel=cur&date=20201021
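Let's crawl movie ratings from Naver Movie.
This is for learning purposes, so crawl only as much as will not put load on the server. Crawling too heavily may lead Naver to restrict your access.
Always check robots.txt before crawling.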
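As a rough illustration of the robots.txt check, the standard library's urllib.robotparser can evaluate a rule set against a URL. The helper name is_allowed and the sample rules below are made up for this sketch; they are not Naver's actual robots.txt, which you would need to download and inspect yourself.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, for illustration only (not a real site's robots.txt)
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "*", "https://example.com/movie/rank"))  # True
print(is_allowed(SAMPLE_ROBOTS, "*", "https://example.com/private/x"))   # False
```

For a live site, RobotFileParser also offers set_url and read to download the file directly.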

1.2 Looking at the URL

'https://movie.naver.com/movie/sdb/rank/rmovie.nhn?sel=cur&date=20201021'

'https://movie.naver.com/movie/sdb/rank/rmovie.nhn?sel=cur&date=20201021'

The 20201021 at the end of the URL appears to be a date; changing it loads the ranking page for a different day.
As this shows, a web page's URL carries a lot of information.
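To see what the URL contains, it can be split into its parts with the standard library. This is only a small sketch; the meaning of the sel parameter is not documented here, so it is simply printed as-is.

```python
from urllib.parse import urlparse, parse_qs

url = 'https://movie.naver.com/movie/sdb/rank/rmovie.nhn?sel=cur&date=20201021'

parts = urlparse(url)
params = parse_qs(parts.query)  # each value is a list, since keys may repeat

print(parts.netloc)  # movie.naver.com
print(parts.path)    # /movie/sdb/rank/rmovie.nhn
print(params)        # {'sel': ['cur'], 'date': ['20201021']}
```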

1.3 Viewing a Single Page with Beautiful Soup

from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd

url = 'https://movie.naver.com/movie/sdb/rank/rmovie.nhn?sel=cur&date=20201021'
page = urlopen(url)

soup = BeautifulSoup(page, 'html.parser')
soup

<!DOCTYPE html>

<html lang="ko">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="http://imgmovie.naver.com/today/naverme/naverme_profile.jpg" property="me2:image">
<meta content="네이버영화 " property="me2:post_tag">
<meta content="네이버영화" property="me2:category1"/>
<meta content="" property="me2:category2"/>
<meta content="랭킹 : 네이버 영화" property="og:title"/>
<meta content="영화, 영화인, 예매, 박스오피스 랭킹 정보 제공" property="og:description"/>
<meta content="article" property="og:type"/>
<meta content="https://movie.naver.com/movie/sdb/rank/rmovie.nhn?sel=cur&amp;date=20201021" property="og:url"/>
<meta content="http://static.naver.net/m/movie/icons/OG_270_270.png" property="og:image"/><!-- http://static.naver.net/m/movie/im/navermovie.jpg -->
<meta content="http://imgmovie.naver.com/today/naverme/naverme_profile.jpg" property="og:article:thumbnailUrl"/>
<meta content="네이버 영화" property="og:article:author"/>
<meta content="https://movie.naver.com/" property="og:article:author:url"/>
<link href="https://ssl.pstatic.net/static/m/movie/icons/naver_movie_favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<title>랭킹 : 네이버 영화</title>
<link href="/common/css/movie_tablet.css?20201015140005" rel="stylesheet" type="text/css"/>
...

window.addEventListener('pageshow', function(event) { lcs_do(); });

document.addEventListener('click', function (event) {
	var welSource = event.srcElement;	// jindo.$Element(oEvent.element);
	if (!document.getElementById("gnb").contains(welSource)) {
		gnbAllLayerClose();
	}
});
</script>
<!-- //Footer -->
</div>
</body>
</html> 

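The HTML is fetched with urlopen and parsed with Beautiful Soup.

From this content, we extract the movie titles and ratings.

1.4 Getting Movie Titles

Inspecting the page with Chrome DevTools shows that the movie titles are in div tags with the class tit5.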

soup.find_all('div', 'tit5')[0].a.string

'소년시절의 너'

Beautiful Soup's find_all method finds every title element.
Here we take only the first match ([0]) and read the title text from its a tag via string.

movie_name = [i.a.string for i in soup.find_all('div', 'tit5')]
movie_name

['소년시절의 너',
 '브레이크 더 사일런스: 더 무비',
 '울지마 톤즈',
 '다시 태어나도 우리',
 '언더독',
 '그대, 고맙소 : 김호중 생애 첫 팬미팅 무비',
 '우리들',
 '스파이더맨: 뉴 유니버스',
 '톰보이',
 '사랑과 영혼',
 '파수꾼',
 '제리 맥과이어',
 '삼진그룹 영어토익반',
 '공범자들',
 '타오르는 여인의 초상',
 '박하사탕',
 '인생 후르츠',
 '남매의 여름밤',
 '윤희에게',
 '비투스',
 '아웃포스트',
 '담보',
 '너의 이름은.',
 '소공녀',
 '벌새',
 '마미',
 '브리짓 존스의 일기',
 '찬실이는 복도 많지',
 '69세',
 '라라랜드',
 '기생충',
 '아무르',
 '로렌스 애니웨이 ',
 '3:10 투 유마',
 '검객',
 '주디',
 '리스본행 야간열차',
 '테넷',
 '경계선',
 '프란시스 하',
 '신문기자',
 '위크엔드 인 파리',
 '블레이드 러너 2049',
 '페이트 스테이 나이트 헤븐즈필 제2장 로스트 버터플라이',
 '환상의 빛',
 '날씨의 아이',
 '라붐',
 '21 브릿지: 테러 셧다운',
 '한여름의 판타지아',
 '다만 악에서 구하소서']

A list comprehension collects every movie title on the page.
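The same pattern can be tried offline on a small hand-written snippet. The HTML below is invented for this sketch and only imitates the tit5 / point class names described in this post. It also uses get_text(strip=True) instead of string, which trims stray whitespace such as the trailing space seen in one of the titles above.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates the structure described in this post
SAMPLE_HTML = """
<table>
  <tr><td><div class="tit5"><a href="#">Movie A </a></div></td><td class="point">9.39</td></tr>
  <tr><td><div class="tit5"><a href="#">Movie B</a></div></td><td class="point">9.36</td></tr>
</table>
"""

soup_sample = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# A string as the second positional argument of find_all filters by CSS class
titles = [div.a.get_text(strip=True) for div in soup_sample.find_all('div', 'tit5')]
points = [td.get_text(strip=True) for td in soup_sample.find_all('td', 'point')]

print(titles)  # ['Movie A', 'Movie B']
print(points)  # ['9.39', '9.36']
```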

1.4 Getting Ratings

The ratings are in td tags with the class point.

soup.find_all('td', 'point')[0].string

'9.39'

As with the titles, find_all and string are used to get the rating.

movie_point = [i.string for i in soup.find_all('td', 'point')]
movie_point

['9.39',
 '9.36',
 '9.35',
 '9.34',
 '9.30',
 '9.29',
 '9.26',
 '9.20',
 '9.20',
 '9.19',
 '9.18',
 '9.16',
 '9.15',
 '9.10',
 '9.06',
 '9.03',
 '9.02',
 '9.02',
 '8.98',
 '8.94',
 '8.93',
 '8.86',
 '8.78',
 '8.77',
 '8.76',
 '8.69',
 '8.68',
 '8.67',
 '8.63',
 '8.61',
 '8.49',
 '8.48',
 '8.44',
 '8.41',
 '8.36',
 '8.35',
 '8.31',
 '8.27',
 '8.17',
 '8.14',
 '8.10',
 '8.06',
 '7.99',
 '7.98',
 '7.98',
 '7.97',
 '7.93',
 '7.91',
 '7.81',
 '7.64']

This collects all the ratings on the page.

1.5 Generating Dates

'https://movie.naver.com/movie/sdb/rank/rmovie.nhn?sel=cur&date=20201021'

'https://movie.naver.com/movie/sdb/rank/rmovie.nhn?sel=cur&date=20201021'

So far we have data from a single page.
As noted earlier, changing date loads a different page, so we vary the date to build each URL.

date = pd.date_range('2020.09.01', periods=60, freq='D')
date

DatetimeIndex(['2020-09-01', '2020-09-02', '2020-09-03', '2020-09-04',
               '2020-09-05', '2020-09-06', '2020-09-07', '2020-09-08',
               '2020-09-09', '2020-09-10', '2020-09-11', '2020-09-12',
               '2020-09-13', '2020-09-14', '2020-09-15', '2020-09-16',
               '2020-09-17', '2020-09-18', '2020-09-19', '2020-09-20',
               '2020-09-21', '2020-09-22', '2020-09-23', '2020-09-24',
               '2020-09-25', '2020-09-26', '2020-09-27', '2020-09-28',
               '2020-09-29', '2020-09-30', '2020-10-01', '2020-10-02',
               '2020-10-03', '2020-10-04', '2020-10-05', '2020-10-06',
               '2020-10-07', '2020-10-08', '2020-10-09', '2020-10-10',
               '2020-10-11', '2020-10-12', '2020-10-13', '2020-10-14',
               '2020-10-15', '2020-10-16', '2020-10-17', '2020-10-18',
               '2020-10-19', '2020-10-20', '2020-10-21', '2020-10-22',
               '2020-10-23', '2020-10-24', '2020-10-25', '2020-10-26',
               '2020-10-27', '2020-10-28', '2020-10-29', '2020-10-30'],
              dtype='datetime64[ns]', freq='D')

Dates are generated with Pandas' date_range.
Pass the start date, the number of dates to generate (periods), and the frequency (freq); 'D' means daily.
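As an alternative sketch, date_range also accepts an explicit end date instead of a count. The example below compares the two forms for the 45-day span from September 1 to October 15, 2020; the variable names are chosen just for this illustration.

```python
import pandas as pd

# Same range expressed two ways: by number of periods, or by end date
by_count = pd.date_range('2020-09-01', periods=45, freq='D')
by_end = pd.date_range(start='2020-09-01', end='2020-10-15', freq='D')

print(len(by_end))              # 45
print(by_count.equals(by_end))  # True
```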

print(date[0].strftime('%y-%m-%d'))
print(date[0].strftime('%y.%m.%d'))
print(date[0].strftime('%y%m%d'))
print(date[0].strftime('%Y%m%d'))

Datetime values can be turned into a string in any format with strftime.
The URL needs the last format, %Y%m%d, which gives 20200901.
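For reference, here is what each of those four format strings produces for September 1, 2020. This sketch uses the standard library's datetime.date, which accepts the same format codes as a Pandas Timestamp; build_url is a hypothetical helper name introduced only for this example.

```python
from datetime import date

URL_TEMPLATE = 'https://movie.naver.com/movie/sdb/rank/rmovie.nhn?sel=cur&date={date}'


def build_url(day: date) -> str:
    """Fill the URL template with the day in YYYYMMDD form (hypothetical helper)."""
    return URL_TEMPLATE.format(date=day.strftime('%Y%m%d'))


first = date(2020, 9, 1)
print(first.strftime('%y-%m-%d'))  # 20-09-01
print(first.strftime('%y.%m.%d'))  # 20.09.01
print(first.strftime('%y%m%d'))    # 200901
print(first.strftime('%Y%m%d'))    # 20200901
print(build_url(first))
```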

1.6 Getting Movie Titles and Ratings Across Multiple Dates

import time

movie_date = []
movie_name = []
movie_point = []
date = pd.date_range('2020.09.01', periods=45, freq='D')

for today in date:
    # URL template; {date} is filled with the day in YYYYMMDD form
    html = 'https://movie.naver.com/movie/sdb/rank/rmovie.nhn?sel=cur&date={date}'
    response = urlopen(html.format(date = today.strftime('%Y%m%d')))
    soup = BeautifulSoup(response, 'html.parser')
    
    # Repeat the date once per rating found so the three lists stay aligned
    movie_date.extend([today] * len(soup.find_all('td', 'point')))
    movie_name.extend([i.a.string for i in soup.find_all('div', 'tit5')])
    movie_point.extend([i.string for i in soup.find_all('td', 'point')])
    
    print(str(today))
    # Pause between requests so the server is not overloaded
    time.sleep(0.5)

2020-09-01 00:00:00
2020-09-02 00:00:00
2020-09-03 00:00:00
2020-09-04 00:00:00
2020-09-05 00:00:00
2020-09-06 00:00:00
2020-09-07 00:00:00
2020-09-08 00:00:00
2020-09-09 00:00:00
2020-09-10 00:00:00
2020-09-11 00:00:00
2020-09-12 00:00:00
2020-09-13 00:00:00
2020-09-14 00:00:00
2020-09-15 00:00:00
2020-09-16 00:00:00
2020-09-17 00:00:00
2020-09-18 00:00:00
2020-09-19 00:00:00
2020-09-20 00:00:00
2020-09-21 00:00:00
2020-09-22 00:00:00
2020-09-23 00:00:00
2020-09-24 00:00:00
2020-09-25 00:00:00
2020-09-26 00:00:00
2020-09-27 00:00:00
2020-09-28 00:00:00
2020-09-29 00:00:00
2020-09-30 00:00:00
2020-10-01 00:00:00
2020-10-02 00:00:00
2020-10-03 00:00:00
2020-10-04 00:00:00
2020-10-05 00:00:00
2020-10-06 00:00:00
2020-10-07 00:00:00
2020-10-08 00:00:00
2020-10-09 00:00:00
2020-10-10 00:00:00
2020-10-11 00:00:00
2020-10-12 00:00:00
2020-10-13 00:00:00
2020-10-14 00:00:00
2020-10-15 00:00:00

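For now, this collects movie ratings from September 1 to October 15, 2020.
Crawling pages back to back with no pause can overload the server or get your access blocked, so add a time.sleep between requests.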
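One possible way to package the pause, offered only as a sketch and not as the approach used in this post, is a small wrapper that sleeps after every request and retries a couple of times on network errors. The name fetch_politely and its parameters (delay, retries, opener) are assumptions made for this example.

```python
import time
from urllib.error import URLError
from urllib.request import urlopen


def fetch_politely(url, delay=0.5, retries=2, opener=urlopen):
    """Fetch url and return its bytes, pausing after every attempt.

    opener is injectable so the function can be exercised without a network.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            with opener(url) as response:
                body = response.read()
            time.sleep(delay)  # pause before the caller's next request
            return body
        except URLError as err:
            last_error = err
            time.sleep(delay * (attempt + 1))  # wait a little longer each retry
    raise last_error
```

The returned bytes can be passed straight to BeautifulSoup(body, 'html.parser') in place of the response object.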

len(movie_date), len(movie_name), len(movie_point)

(2158, 2158, 2158)

In total, 2,158 ratings were collected.

1.7 Creating the DataFrame

movie = pd.DataFrame({'date' : movie_date, 'name': movie_name, 'point' : movie_point})
movie.tail()

	date	name	point
2153	2020-10-15	언힌지드	6.69
2154	2020-10-15	죽지않는 인간들의 밤	6.62
2155	2020-10-15	강철비2: 정상회담	5.01
2156	2020-10-15	국제수사	4.87
2157	2020-10-15	뮬란	4.20

movie.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2158 entries, 0 to 2157
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    2158 non-null   datetime64[ns]
 1   name    2158 non-null   object        
 2   point   2158 non-null   object        
dtypes: datetime64[ns](1), object(2)
memory usage: 50.7+ KB

Convert the ratings to float.

movie['point'] = movie['point'].astype(float)
movie.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2158 entries, 0 to 2157
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    2158 non-null   datetime64[ns]
 1   name    2158 non-null   object        
 2   point   2158 non-null   float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 50.7+ KB

The rating column is now float64.
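astype(float) raises an error if any scraped value is not a valid number. If that ever happens, pd.to_numeric with errors='coerce' is one alternative worth knowing; the sketch below uses made-up values, including a deliberately invalid 'N/A', to show the difference.

```python
import pandas as pd

# Made-up values; 'N/A' stands in for a cell that failed to parse
raw_points = pd.Series(['9.39', '8.27', 'N/A'])

# Invalid entries become NaN instead of raising ValueError
numeric_points = pd.to_numeric(raw_points, errors='coerce')

print(numeric_points.dtype)         # float64
print(numeric_points.isna().sum())  # 1
```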

movie.to_csv('./data/naver_movie_points_20201022.csv', sep = ',', encoding = 'utf-8')

The crawled data is saved to a CSV file.
Otherwise, the crawled data is lost if the kernel restarts.
The file is also available on GitHub: https://raw.githubusercontent.com/hmkim312/datas/main/navermoviepoints/naver_movie_points_20201022.csv

2. Data Preprocessing

2.1 Data Load

import numpy as np
import pandas as pd

movie = pd.read_csv('https://raw.githubusercontent.com/hmkim312/datas/main/navermoviepoints/naver_movie_points_20201022.csv', index_col = 0)
movie.tail()

	date	name	point
2153	2020-10-15	언힌지드	6.69
2154	2020-10-15	죽지않는 인간들의 밤	6.62
2155	2020-10-15	강철비2: 정상회담	5.01
2156	2020-10-15	국제수사	4.87
2157	2020-10-15	뮬란	4.20

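Passing index_col = 0 uses the first CSV column (the index saved by to_csv) as the DataFrame index, so it is not loaded as an extra data column.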
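The effect of index_col can be checked with an in-memory round trip. This is a small sketch with made-up rows; io.StringIO stands in for the CSV file.

```python
import io
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B'], 'point': [9.1, 8.2]})
csv_text = df.to_csv()  # with no path, to_csv returns the CSV as a string

# Without index_col, the saved index comes back as an unnamed data column
with_extra = pd.read_csv(io.StringIO(csv_text))
# With index_col=0, the first column is used as the index again
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))  # ['Unnamed: 0', 'name', 'point']
print(list(restored.columns))    # ['name', 'point']
```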

2.2 Summing Ratings

movie_unique = pd.pivot_table(movie, index=['name'], values='point', aggfunc='sum')
movie_unique.sort_values('point', ascending = False).head(10)

	point
name
소년시절의 너	422.21
사랑과 영혼	413.48
제리 맥과이어	412.20
극장판 짱구는 못말려: 신혼여행 허리케인~ 사라진 아빠!	393.58
브리짓 존스의 일기	390.57
69세	388.31
500일의 썸머	378.90
라라랜드	378.84
테넷	372.53
타오르는 여인의 초상	371.08

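With the movie title as the index, the ratings are summed and the top 10 are shown in descending order. Because this is a sum over days, movies that stayed on the chart longer rank higher.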
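To make the effect of summing concrete, the sketch below compares sum, mean, and count on made-up data. Movie A appears on three days with a lower rating, while movie B appears once with a higher rating.

```python
import pandas as pd

# Made-up rows: A is charted on three days, B on only one
sample = pd.DataFrame({
    'name': ['A', 'A', 'A', 'B'],
    'point': [8.0, 8.0, 8.0, 9.5],
})

summary = sample.groupby('name')['point'].agg(['sum', 'mean', 'count'])
print(summary)

print(summary['sum'].idxmax())   # A  (most days on the chart)
print(summary['mean'].idxmax())  # B  (highest average rating)
```

Which aggregation is appropriate depends on whether the ranking should reward time on the chart or the rating itself.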

2.3 DataFrame Query

movie.query('name == ["테넷"]')

	date	name	point
32	2020-09-01	테넷	8.37
78	2020-09-02	테넷	8.36
129	2020-09-03	테넷	8.34
179	2020-09-04	테넷	8.33
225	2020-09-05	테넷	8.34
277	2020-09-06	테넷	8.31
326	2020-09-07	테넷	8.29
373	2020-09-08	테넷	8.28
418	2020-09-09	테넷	8.28
463	2020-09-10	테넷	8.28
509	2020-09-11	테넷	8.27
560	2020-09-12	테넷	8.27
610	2020-09-13	테넷	8.27
656	2020-09-14	테넷	8.27
706	2020-09-15	테넷	8.27
750	2020-09-16	테넷	8.27
794	2020-09-17	테넷	8.27
841	2020-09-18	테넷	8.28
887	2020-09-19	테넷	8.27
932	2020-09-20	테넷	8.27
975	2020-09-21	테넷	8.27
1018	2020-09-22	테넷	8.26
1055	2020-09-23	테넷	8.26
1096	2020-09-24	테넷	8.27
1146	2020-09-25	테넷	8.27
1196	2020-09-26	테넷	8.26
1247	2020-09-27	테넷	8.26
1293	2020-09-28	테넷	8.26
1340	2020-09-29	테넷	8.26
1392	2020-09-30	테넷	8.26
1442	2020-10-01	테넷	8.26
1492	2020-10-02	테넷	8.26
1542	2020-10-03	테넷	8.27
1589	2020-10-04	테넷	8.26
1643	2020-10-05	테넷	8.26
1698	2020-10-06	테넷	8.27
1747	2020-10-07	테넷	8.27
1800	2020-10-08	테넷	8.27
1849	2020-10-09	테넷	8.27
1899	2020-10-10	테넷	8.27
1945	2020-10-11	테넷	8.27
1990	2020-10-12	테넷	8.27
2038	2020-10-13	테넷	8.27
2088	2020-10-14	테넷	8.27
2138	2020-10-15	테넷	8.27

DataFrame.query can be used to filter rows.
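For comparison, the same filter can be written with a boolean mask, and query can reference a Python variable with the @ prefix. The rows below are a tiny made-up sample, not the crawled data.

```python
import pandas as pd

sample = pd.DataFrame({
    'name': ['테넷', '담보', '테넷'],
    'point': [8.37, 8.86, 8.36],
})

title = '테넷'
by_query = sample.query('name == @title')  # @ refers to a local variable
by_mask = sample[sample['name'] == title]  # equivalent boolean indexing

print(by_query.equals(by_mask))  # True
print(len(by_query))             # 2
```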

2.4 Plotting Rating Changes by Date

import matplotlib.pyplot as plt

plt.figure(figsize=(24, 8))
plt.plot(movie.query('name == ["테넷"]')['date'],
         movie.query('name == ["테넷"]')['point'])
plt.legend(labels = ['point'])
plt.xticks(rotation=45)
plt.grid()
plt.show()

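This shows how the rating of Tenet (테넷) changed over time.
The x-axis labels are long, so they are rotated by 45 degrees.
The curve looks like it moves a lot, but the y-axis shows the change is small: the highest and lowest ratings differ by only about 0.1 points.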
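The size of that swing can also be checked numerically rather than read off the axis. The values below are a subset of the 테넷 ratings listed in the query output above, including its highest and lowest values.

```python
import pandas as pd

# Subset of the ratings shown in the query output (includes the max and the min)
tenet_points = pd.Series([8.37, 8.36, 8.34, 8.33, 8.29, 8.28, 8.27, 8.26])

spread = round(tenet_points.max() - tenet_points.min(), 2)
print(spread)  # 0.11
```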

2.5 Reshaping the Movie Data

# Name the columns explicitly; values = 'point' yields one column per movie title
movie_pivot = movie.pivot_table(index = 'date', columns = 'name', values = 'point')
movie_pivot.tail()

name	500일의 썸머	69세	가버나움	감쪽같은 그녀	강철비2: 정상회담	검객	경계선	국제수사	그대, 고맙소 : 김호중 생애 첫 팬미팅 무비	그래비티	...	포드 V 페라리	폭스캐처	프란시스 하	피아노	피아니스트	피아니스트의 전설	피터와 드래곤	하녀	항거:유관순 이야기	홀리 모터스
date
2020-10-11	8.42	8.63	NaN	NaN	NaN	8.36	8.17	NaN	9.47	8.29	...	9.49	NaN	8.14	NaN	9.32	NaN	8.25	NaN	NaN	NaN
2020-10-12	8.42	8.63	NaN	NaN	NaN	8.36	8.17	NaN	9.45	NaN	...	9.49	NaN	8.14	NaN	9.32	NaN	NaN	NaN	NaN	7.52
2020-10-13	8.42	8.63	NaN	NaN	5.01	8.35	8.17	4.89	9.46	NaN	...	NaN	NaN	8.14	NaN	9.32	NaN	NaN	NaN	NaN	7.52
2020-10-14	8.42	8.63	NaN	NaN	5.01	8.36	8.17	4.88	9.44	NaN	...	NaN	NaN	8.14	NaN	9.32	NaN	NaN	NaN	NaN	7.52
2020-10-15	8.42	8.63	NaN	NaN	5.01	8.36	8.17	4.87	9.43	NaN	...	9.49	NaN	8.14	NaN	9.32	NaN	NaN	NaN	NaN	7.52

5 rows × 132 columns

The crawled data is pivoted so that each movie becomes a column.
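A miniature version of the same reshaping, using made-up rows, shows where the NaN values come from: a movie that has no row for a given date gets NaN in that cell.

```python
import pandas as pd

# Made-up long-format rows: movie B has no entry on 2020-10-15
sample = pd.DataFrame({
    'date': ['2020-10-14', '2020-10-14', '2020-10-15'],
    'name': ['A', 'B', 'A'],
    'point': [8.4, 7.5, 8.5],
})

# One row per date, one column per movie title
wide = sample.pivot_table(index='date', columns='name', values='point')
print(wide)
print(pd.isna(wide.loc['2020-10-15', 'B']))  # True
```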

2.6 Visualizing Ratings for Selected Movies

movie_pivot.columns

Index(['500일의 썸머', '69세', '가버나움', '감쪽같은 그녀', '강철비2: 정상회담', '검객', '경계선', '국제수사',
       '그대, 고맙소 : 김호중 생애 첫 팬미팅 무비', '그래비티',
       ...
       '포드 V 페라리', '폭스캐처', '프란시스 하', '피아노', '피아니스트', '피아니스트의 전설', '피터와 드래곤',
       '하녀', '항거:유관순 이야기', '홀리 모터스'],
      dtype='object', name='name', length=132)

target_col = ['극장판 짱구는 못말려: 신혼여행 허리케인~ 사라진 아빠!', '테넷', '라라랜드', '동주']

plt.figure(figsize=(12,8))
plt.plot(movie_pivot[target_col])
plt.legend(target_col, loc = 'best')
plt.tick_params(bottom = False, labelbottom = False)
plt.show()

You can pick out just the movies you are interested in and compare their ratings.
Gaps in a line mean there is no rating data for those dates.
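Selecting columns by title raises a KeyError if a title was never on the chart during the crawled period. One defensive option, sketched here with made-up data and hypothetical variable names, is to keep only the titles that actually exist as columns.

```python
import pandas as pd

# Made-up pivoted table with two movie columns
wide = pd.DataFrame(
    {'A': [8.4, 8.5], 'B': [7.5, None]},
    index=['2020-10-14', '2020-10-15'],
)

wanted = ['A', 'C']  # 'C' is not a column in this table
available = [title for title in wanted if title in wide.columns]
missing = [title for title in wanted if title not in wide.columns]

print(available)  # ['A']
print(missing)    # ['C']
selected = wide[available]  # safe: every requested column exists
```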

2.7 Saving to Excel

movie_pivot.to_excel('./data/naver_movie_points_pivot_20201022.xlsx')

The pivoted table is saved with to_excel; in the sheet, dates become rows and movie titles become columns.
