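Building a Recommendation System with Movie Data (Recommendations)

1. Recommendation Systems

1.1 Recommendation Systems

Services such as YouTube and Melon now have recommendation systems built in.
They also play an important role in online shopping malls and content services.
Recommendation systems fall broadly into two types: content-based filtering and collaborative filtering.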

1.2 Content-Based Filtering

When a user likes a particular item, this approach recommends other items that are similar to it.

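1.3 Nearest-Neighbor Collaborative Filtering

Predicts ratings for items a user has not yet rated, based on accumulated user behavior data.
User-based: "Customers similar to you also bought these products."
Item-based: "Other customers who chose this product also bought these products."
In general, item-based collaborative filtering tends to be more accurate than user-based, for reasons such as:
- Two people liking a few of the same movies is not enough to conclude that they share the same taste
- Very famous movies are often watched regardless of taste
- Users often do not rate what they watch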
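
The difference between the two views can be sketched on a toy rating matrix. The users, movies, and ratings below are made up for illustration, and unrated cells are simply stored as 0, which is a simplification.

```python
import math

# Hypothetical toy data: one list of ratings per user, one position per movie.
# 0 means "not rated".
ratings = {
    "user_a": [5, 4, 0],
    "user_b": [4, 5, 1],
    "user_c": [0, 1, 5],
    "user_d": [5, 5, 0],
}
movies = ["movie_x", "movie_y", "movie_z"]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# User-based view: compare rows (one vector per user).
user_sim_ab = cosine(ratings["user_a"], ratings["user_b"])
user_sim_ac = cosine(ratings["user_a"], ratings["user_c"])

# Item-based view: compare columns (one vector per movie).
columns = list(zip(*ratings.values()))
item_sim_xy = cosine(columns[0], columns[1])
item_sim_xz = cosine(columns[0], columns[2])

print(f"user_a vs user_b: {user_sim_ab:.3f}, user_a vs user_c: {user_sim_ac:.3f}")
print(f"movie_x vs movie_y: {item_sim_xy:.3f}, movie_x vs movie_z: {item_sim_xz:.3f}")
```

The same similarity function serves both approaches; only the direction in which the matrix is read changes.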

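1.4 Latent Factor Collaborative Filtering

Derives latent factors from the user-item rating matrix.
The rating matrix is factorized into latent factors for users and latent factors for items; multiplying the two factor matrices back together produces predicted ratings for items that have not been rated yet.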
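
As a rough sketch of this idea, the snippet below factorizes a small made-up rating matrix with plain gradient descent. The matrix, the number of latent factors `k`, the learning rate, and the regularization strength are all arbitrary assumptions, and gradient descent is just one of several ways to do the factorization.

```python
import numpy as np

# Hypothetical rating matrix (users x items); np.nan marks ratings not given yet.
R = np.array([
    [5.0, 3.0, np.nan, 1.0],
    [4.0, np.nan, np.nan, 1.0],
    [1.0, 1.0, np.nan, 5.0],
    [np.nan, 1.0, 5.0, 4.0],
])
observed = ~np.isnan(R)
R_obs = np.where(observed, R, 0.0)

rng = np.random.default_rng(0)
k = 2  # number of latent factors (arbitrary choice for this sketch)
P = rng.normal(scale=0.1, size=(R.shape[0], k))  # user latent factors
Q = rng.normal(scale=0.1, size=(R.shape[1], k))  # item latent factors


def observed_rmse(P, Q):
    """RMSE between R and P @ Q.T, measured on the observed cells only."""
    err = (R_obs - P @ Q.T)[observed]
    return float(np.sqrt(np.mean(err ** 2)))


rmse_before = observed_rmse(P, Q)

lr, reg = 0.01, 0.01
for _ in range(5000):
    # The error is zeroed on unobserved cells so they do not drive the update.
    err = np.where(observed, R_obs - P @ Q.T, 0.0)
    P, Q = P + lr * (err @ Q - reg * P), Q + lr * (err.T @ P - reg * Q)

rmse_after = observed_rmse(P, Q)
pred = P @ Q.T  # dense matrix, including values for the previously empty cells

print("RMSE on observed ratings:", round(rmse_before, 3), "->", round(rmse_after, 3))
print(np.round(pred, 2))
```

The cells that were `nan` in `R` now hold values in `pred`; those are the predicted ratings.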

2. Content-Based Filtering Practice

2.1 TMDB 5000 Movie Dataset

https://www.kaggle.com/tmdb/tmdb-movie-metadata

The TMDB 5000 movie dataset from Kaggle.
It contains information on 4,803 movies.

2.2 Loading the Data

import pandas as pd
import numpy as np

movies = pd.read_csv('https://media.githubusercontent.com/media/hmkim312/datas/main/tmdb5000/tmdb_5000_movies.csv')
print(movies.shape)
movies.head()

(4803, 20)

	budget	genres	homepage	id	keywords	original_language	original_title	overview	popularity	production_companies	production_countries	release_date	revenue	runtime	spoken_languages	status	tagline	title	vote_average	vote_count
0	237000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	http://www.avatarmovie.com/	19995	[{"id": 1463, "name": "culture clash"}, {"id":...	en	Avatar	In the 22nd century, a paraplegic Marine is di...	150.437577	[{"name": "Ingenious Film Partners", "id": 289...	[{"iso_3166_1": "US", "name": "United States o...	2009-12-10	2787965087	162.0	[{"iso_639_1": "en", "name": "English"}, {"iso...	Released	Enter the World of Pandora.	Avatar	7.2	11800
1	300000000	[{"id": 12, "name": "Adventure"}, {"id": 14, "...	http://disney.go.com/disneypictures/pirates/	285	[{"id": 270, "name": "ocean"}, {"id": 726, "na...	en	Pirates of the Caribbean: At World's End	Captain Barbossa, long believed to be dead, ha...	139.082615	[{"name": "Walt Disney Pictures", "id": 2}, {"...	[{"iso_3166_1": "US", "name": "United States o...	2007-05-19	961000000	169.0	[{"iso_639_1": "en", "name": "English"}]	Released	At the end of the world, the adventure begins.	Pirates of the Caribbean: At World's End	6.9	4500
2	245000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	http://www.sonypictures.com/movies/spectre/	206647	[{"id": 470, "name": "spy"}, {"id": 818, "name...	en	Spectre	A cryptic message from Bond’s past sends him o...	107.376788	[{"name": "Columbia Pictures", "id": 5}, {"nam...	[{"iso_3166_1": "GB", "name": "United Kingdom"...	2015-10-26	880674609	148.0	[{"iso_639_1": "fr", "name": "Fran\u00e7ais"},...	Released	A Plan No One Escapes	Spectre	6.3	4466
3	250000000	[{"id": 28, "name": "Action"}, {"id": 80, "nam...	http://www.thedarkknightrises.com/	49026	[{"id": 849, "name": "dc comics"}, {"id": 853,...	en	The Dark Knight Rises	Following the death of District Attorney Harve...	112.312950	[{"name": "Legendary Pictures", "id": 923}, {"...	[{"iso_3166_1": "US", "name": "United States o...	2012-07-16	1084939099	165.0	[{"iso_639_1": "en", "name": "English"}]	Released	The Legend Ends	The Dark Knight Rises	7.6	9106
4	260000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	http://movies.disney.com/john-carter	49529	[{"id": 818, "name": "based on novel"}, {"id":...	en	John Carter	John Carter is a war-weary, former military ca...	43.926995	[{"name": "Walt Disney Pictures", "id": 2}]	[{"iso_3166_1": "US", "name": "United States o...	2012-03-07	284139100	132.0	[{"iso_639_1": "en", "name": "English"}]	Released	Lost in our world, found in another.	John Carter	6.1	2124

2.3 Selecting Columns

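# .copy() makes movies_df independent of movies, so the column assignments
# in later steps do not trigger pandas' SettingWithCopyWarning.
movies_df = movies[['id', 'title', 'genres', 'vote_average',
                    'vote_count', 'popularity', 'keywords', 'overview']].copy()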
movies_df.head()

	id	title	genres	vote_average	vote_count	popularity	keywords	overview
0	19995	Avatar	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	7.2	11800	150.437577	[{"id": 1463, "name": "culture clash"}, {"id":...	In the 22nd century, a paraplegic Marine is di...
1	285	Pirates of the Caribbean: At World's End	[{"id": 12, "name": "Adventure"}, {"id": 14, "...	6.9	4500	139.082615	[{"id": 270, "name": "ocean"}, {"id": 726, "na...	Captain Barbossa, long believed to be dead, ha...
2	206647	Spectre	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	6.3	4466	107.376788	[{"id": 470, "name": "spy"}, {"id": 818, "name...	A cryptic message from Bond’s past sends him o...
3	49026	The Dark Knight Rises	[{"id": 28, "name": "Action"}, {"id": 80, "nam...	7.6	9106	112.312950	[{"id": 849, "name": "dc comics"}, {"id": 853,...	Following the death of District Attorney Harve...
4	49529	John Carter	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	6.1	2124	43.926995	[{"id": 818, "name": "based on novel"}, {"id":...	John Carter is a war-weary, former military ca...

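Only the columns needed for the exercise are kept.
id : unique movie id
title : movie title
genres : movie genres
vote_average : average rating
vote_count : number of votes
popularity : popularity score
keywords : keywords
overview : movie synopsis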

2.4 A Caveat About the Data

movies_df[['genres']][:1].values

array([['[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]']],
      dtype=object)

Each genres and keywords cell holds what looks like a list of dicts.

2.5 Data Stored as Strings

from ast import literal_eval

code = """(1,2, {'foo', 'bar'})"""
code, type(code)

("(1,2, {'foo', 'bar'})", str)

genres and keywords are likewise stored as str, not as actual Python objects.

literal_eval(code), type(literal_eval(code))

((1, 2, {'bar', 'foo'}), tuple)

literal_eval parses the string into the Python object it describes, in this case a tuple.
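
A minimal sketch of the same call on a string shaped like a genres cell. The string below is a shortened, hand-written sample rather than a row read from the file.

```python
from ast import literal_eval

# Hand-written sample shaped like one cell of the genres column.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = literal_eval(raw)           # str -> list of dict
names = [d["name"] for d in parsed]  # keep only the genre names
print(type(parsed).__name__, names)

# literal_eval only accepts Python literals. Anything else raises an error
# instead of being executed, which is what makes it safer than eval().
try:
    literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
print("non-literal input rejected:", rejected)
```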

2.6 Restoring genres and keywords to Lists of Dicts

from ast import literal_eval
import warnings; warnings.filterwarnings('ignore')

movies_df['genres'] = movies_df['genres'].apply(literal_eval)
movies_df['keywords'] = movies_df['keywords'].apply(literal_eval)
movies_df

	id	title	genres	vote_average	vote_count	popularity	keywords	overview
0	19995	Avatar	[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...	7.2	11800	150.437577	[{'id': 1463, 'name': 'culture clash'}, {'id':...	In the 22nd century, a paraplegic Marine is di...
1	285	Pirates of the Caribbean: At World's End	[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...	6.9	4500	139.082615	[{'id': 270, 'name': 'ocean'}, {'id': 726, 'na...	Captain Barbossa, long believed to be dead, ha...
2	206647	Spectre	[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...	6.3	4466	107.376788	[{'id': 470, 'name': 'spy'}, {'id': 818, 'name...	A cryptic message from Bond’s past sends him o...
3	49026	The Dark Knight Rises	[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...	7.6	9106	112.312950	[{'id': 849, 'name': 'dc comics'}, {'id': 853,...	Following the death of District Attorney Harve...
4	49529	John Carter	[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...	6.1	2124	43.926995	[{'id': 818, 'name': 'based on novel'}, {'id':...	John Carter is a war-weary, former military ca...
...	...	...	...	...	...	...	...	...
4798	9367	El Mariachi	[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...	6.6	238	14.269792	[{'id': 5616, 'name': 'united states–mexico ba...	El Mariachi just wants to play his guitar and ...
4799	72766	Newlyweds	[{'id': 35, 'name': 'Comedy'}, {'id': 10749, '...	5.9	5	0.642552	[]	A newlywed couple's honeymoon is upended by th...
4800	231617	Signed, Sealed, Delivered	[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...	7.0	6	1.444476	[{'id': 248, 'name': 'date'}, {'id': 699, 'nam...	"Signed, Sealed, Delivered" introduces a dedic...
4801	126186	Shanghai Calling	[]	5.7	7	0.857008	[]	When ambitious New York attorney Sam is sent t...
4802	25975	My Date with Drew	[{'id': 99, 'name': 'Documentary'}]	6.3	16	1.929883	[{'id': 1523, 'name': 'obsession'}, {'id': 224...	Ever since the second grade when he first saw ...

4803 rows × 8 columns

literal_eval turns keywords and genres into lists of dicts.

2.7 Using the dict Values as Features

movies_df['genres'] = movies_df['genres'].apply(lambda x : [y['name'] for y in x])
movies_df['keywords'] = movies_df['keywords'].apply(lambda x : [y['name'] for y in x])
movies_df[['genres', 'keywords']][:2]

	genres	keywords
0	[Action, Adventure, Fantasy, Science Fiction]	[culture clash, future, space war, space colon...
1	[Adventure, Fantasy, Action]	[ocean, drug abuse, exotic island, east india ...

Each dict held an id and a name; a list comprehension inside a lambda keeps only the name values to use as features.

2.8 Joining the Genre Words into a Single String

movies_df['genres_literal'] = movies_df['genres'].apply(lambda x : (' ').join(x))
movies_df.head()

	id	title	genres	vote_average	vote_count	popularity	keywords	overview	genres_literal
0	19995	Avatar	[Action, Adventure, Fantasy, Science Fiction]	7.2	11800	150.437577	[culture clash, future, space war, space colon...	In the 22nd century, a paraplegic Marine is di...	Action Adventure Fantasy Science Fiction
1	285	Pirates of the Caribbean: At World's End	[Adventure, Fantasy, Action]	6.9	4500	139.082615	[ocean, drug abuse, exotic island, east india ...	Captain Barbossa, long believed to be dead, ha...	Adventure Fantasy Action
2	206647	Spectre	[Action, Adventure, Crime]	6.3	4466	107.376788	[spy, based on novel, secret agent, sequel, mi...	A cryptic message from Bond’s past sends him o...	Action Adventure Crime
3	49026	The Dark Knight Rises	[Action, Crime, Drama, Thriller]	7.6	9106	112.312950	[dc comics, crime fighter, terrorist, secret i...	Following the death of District Attorney Harve...	Action Crime Drama Thriller
4	49529	John Carter	[Action, Adventure, Science Fiction]	6.1	2124	43.926995	[based on novel, mars, medallion, space travel...	John Carter is a war-weary, former military ca...	Action Adventure Science Fiction

join creates a new column in which the genres list becomes a single space-separated string.

2.9 Applying CountVectorizer to genres

from sklearn.feature_extraction.text import CountVectorizer

# An integer min_df must be >= 1 in recent scikit-learn versions; 1 still keeps every term.
count_vect = CountVectorizer(min_df=1, ngram_range=(1, 2))
genre_mat = count_vect.fit_transform(movies_df['genres_literal'])
print(genre_mat.shape)

(4803, 276)

CountVectorizer produces 276 vocabulary terms (single words and two-word pairs, because of ngram_range=(1, 2)) for the 4,803 movies.
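
To make the effect of ngram_range=(1, 2) concrete, here is a simplified pure-Python stand-in for the counting step, applied to the genre strings of the first three movies. It assumes that lowercasing and splitting on whitespace is close enough to CountVectorizer's default tokenization for these strings; it is not the library's actual implementation.

```python
def unigrams_and_bigrams(text):
    """Lowercase, split on whitespace, and return the 1-grams followed by the 2-grams."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


docs = [
    "Action Adventure Fantasy Science Fiction",  # Avatar
    "Adventure Fantasy Action",                  # Pirates of the Caribbean: At World's End
    "Action Adventure Crime",                    # Spectre
]

# Vocabulary: every distinct 1-gram or 2-gram seen in any document.
vocab = sorted({term for doc in docs for term in unigrams_and_bigrams(doc)})

# Count matrix: one row per document, one column per vocabulary term.
counts = [[unigrams_and_bigrams(doc).count(term) for term in vocab] for doc in docs]

print(len(vocab), vocab)
for row in counts:
    print(row)
```

Word order matters for the two-word terms: "adventure fantasy" and "fantasy action" are separate columns.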

2.10 Cosine Similarity

from sklearn.metrics.pairwise import cosine_similarity

genre_sim = cosine_similarity(genre_mat, genre_mat)
print(genre_sim.shape)
print(genre_sim[:2])

(4803, 4803)
[[1.         0.59628479 0.4472136  ... 0.         0.         0.        ]
 [0.59628479 1.         0.4        ... 0.         0.         0.        ]]

Cosine similarity gives a similarity score for every pair of movies.
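
The first few values printed above can be checked by hand. The sketch below assumes each movie's 1-grams and 2-grams occur at most once, so every count is 0 or 1 and the cosine reduces to (shared terms) / sqrt(terms in A * terms in B). The helper names are made up for this example.

```python
import math


def ngram_set(text):
    """Set of lowercase 1-grams and 2-grams in a whitespace-separated string."""
    tokens = text.lower().split()
    return set(tokens) | {" ".join(pair) for pair in zip(tokens, tokens[1:])}


def cosine_binary(a, b):
    """Cosine similarity of two 0/1 vectors, each given as its set of active terms."""
    return len(a & b) / math.sqrt(len(a) * len(b))


avatar = ngram_set("Action Adventure Fantasy Science Fiction")
pirates = ngram_set("Adventure Fantasy Action")
spectre = ngram_set("Action Adventure Crime")

sim_avatar_pirates = cosine_binary(avatar, pirates)
sim_avatar_spectre = cosine_binary(avatar, spectre)
sim_pirates_spectre = cosine_binary(pirates, spectre)

print(round(sim_avatar_pirates, 8), round(sim_avatar_spectre, 8), round(sim_pirates_spectre, 8))
```

Avatar and Pirates share four terms (action, adventure, fantasy, "adventure fantasy") out of 9 and 5, giving 4 / sqrt(45), which matches the 0.59628479 in the printed matrix.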

2.11 Sorting genre_sim in Descending Order

genre_sim_sorted_ind = genre_sim.argsort()[:, ::-1]
print(genre_sim_sorted_ind[:1])

[[   0 3494  813 ... 3038 3037 2401]]
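
What argsort()[:, ::-1] does is easier to see on a small matrix. The similarity values below are made up.

```python
import numpy as np

# Hypothetical 3x3 similarity matrix (row i holds movie i's similarity to every movie).
sim = np.array([
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.5],
    [0.7, 0.5, 1.0],
])

# argsort() returns, per row, the column indices in ascending order of value;
# reversing each row with [:, ::-1] turns that into descending order.
sorted_ind = sim.argsort()[:, ::-1]
print(sorted_ind)
```

Each row now lists movie indices from most to least similar. When several movies share exactly the same similarity value, the order among them is not something to rely on, which may be why a query movie does not always come first in its own row.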

2.12 A Function That Returns Recommended Movies as a DataFrame

def find_sim_movie(df, sorted_ind, title_name, top_n = 10):
    title_movie = df[df['title'] == title_name]
    
    title_index = title_movie.index.values
    similar_indexes = sorted_ind[title_index, :(top_n)]
    
    print(similar_indexes)
    similar_indexes = similar_indexes.reshape(-1)
    
    return df.iloc[similar_indexes]

2.13 Finding Similar Movies

similar_movies = find_sim_movie(movies_df, genre_sim_sorted_ind, 'The Godfather', 10)
similar_movies[['title', 'vote_average']]

[[2731 1243 3636 1946 2640 4065 1847 4217  883 3866]]

	title	vote_average
2731	The Godfather: Part II	8.3
1243	Mean Streets	7.2
3636	Light Sleeper	5.7
1946	The Bad Lieutenant: Port of Call - New Orleans	6.0
2640	Things to Do in Denver When You're Dead	6.7
4065	Mi America	0.0
1847	GoodFellas	8.2
4217	Kids	6.8
883	Catch Me If You Can	7.7
3866	City of God	8.1

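The Godfather: Part II looks right, but the ones below it are questionable.
Something is off: Mi America, for example, has a vote_average of 0.0.

2.14 Checking vote_average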

movies_df[['title', 'vote_average', 'vote_count']].sort_values('vote_average', ascending = False)[:10]

	title	vote_average	vote_count
3519	Stiff Upper Lips	10.0	1
4247	Me You and Five Bucks	10.0	2
4045	Dancer, Texas Pop. 81	10.0	1
4662	Little Big Top	10.0	1
3992	Sardaarji	9.5	2
2386	One Man's Hero	9.3	2
2970	There Goes My Baby	8.5	2
1881	The Shawshank Redemption	8.5	8205
2796	The Prisoner of Zenda	8.4	11
3337	The Godfather	8.4	5893

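Looking at the ratings together with the vote counts exposes the problem.
Some movies have a rating of 10 but only 1 or 2 votes.

2.15 Choosing a Weighting for Movie Selection

Weighted rating = (v / (v + m)) * R + (m / (v + m)) * C

v : number of votes for the individual movie
m : minimum number of votes required to be rated
R : mean rating of the individual movie
C : mean rating across all movies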

2.16 Setting the Overall Mean Rating and the Minimum Vote Count (60th Percentile)

C = movies_df['vote_average'].mean()
m = movies_df['vote_count'].quantile(0.6)
print('C:', round(C, 3), 'm:', round(m,3))

C: 6.092 m: 370.2

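The mean rating is about 6, and the 60th percentile of the vote count is about 370.
Movies with fewer than about 370 votes are not dropped; their weighted rating is pulled more than halfway toward the overall mean C.

2.17 A Function to Compute the Weighted Rating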

def weighted_vote_average(record):
    v = record['vote_count']
    R = record['vote_average']
    
    return ((v/(v+m)) * R) + ((m/(m+v))*C)

The weighting formula above, written as a function. It reads m and C from the variables defined in 2.16.
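
A quick hand check of the formula, using the rounded C and m printed in 2.16 and two rows from the table in 2.14. The function name and default arguments are made up for this sketch; because C and m are rounded, the results agree with the full computation only to a few decimal places.

```python
C = 6.092  # mean rating over all movies (rounded value printed in 2.16)
m = 370.2  # 60th percentile of vote_count (value printed in 2.16)


def weighted_rating(v, R, m=m, C=C):
    """Blend a movie's own mean rating R with the global mean C, weighted by vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C


# vote_count and vote_average taken from the table in 2.14.
shawshank = weighted_rating(v=8205, R=8.5)       # many votes: stays close to 8.5
stiff_upper_lips = weighted_rating(v=1, R=10.0)  # one vote: collapses toward C

print(round(shawshank, 3), round(stiff_upper_lips, 3))
```

With thousands of votes the movie's own rating dominates; with a single vote the result is almost entirely the global mean.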

2.18 Recomputing

movies_df['weighted_vote'] = movies_df.apply(weighted_vote_average, axis = 1)
movies_df.head()

	id	title	genres	vote_average	vote_count	popularity	keywords	overview	genres_literal	weighted_vote
0	19995	Avatar	[Action, Adventure, Fantasy, Science Fiction]	7.2	11800	150.437577	[culture clash, future, space war, space colon...	In the 22nd century, a paraplegic Marine is di...	Action Adventure Fantasy Science Fiction	7.166301
1	285	Pirates of the Caribbean: At World's End	[Adventure, Fantasy, Action]	6.9	4500	139.082615	[ocean, drug abuse, exotic island, east india ...	Captain Barbossa, long believed to be dead, ha...	Adventure Fantasy Action	6.838594
2	206647	Spectre	[Action, Adventure, Crime]	6.3	4466	107.376788	[spy, based on novel, secret agent, sequel, mi...	A cryptic message from Bond’s past sends him o...	Action Adventure Crime	6.284091
3	49026	The Dark Knight Rises	[Action, Crime, Drama, Thriller]	7.6	9106	112.312950	[dc comics, crime fighter, terrorist, secret i...	Following the death of District Attorney Harve...	Action Crime Drama Thriller	7.541095
4	49529	John Carter	[Action, Adventure, Science Fiction]	6.1	2124	43.926995	[based on novel, mars, medallion, space travel...	John Carter is a war-weary, former military ca...	Action Adventure Science Fiction	6.098838

The function is applied to every row.

2.19 Sorting by Weighted Rating

movies_df[['title', 'vote_average', 'weighted_vote', 'vote_count']].sort_values('weighted_vote', ascending = False)[:10]

	title	vote_average	weighted_vote	vote_count
1881	The Shawshank Redemption	8.5	8.396052	8205
3337	The Godfather	8.4	8.263591	5893
662	Fight Club	8.3	8.216455	9413
3232	Pulp Fiction	8.3	8.207102	8428
65	The Dark Knight	8.2	8.136930	12002
1818	Schindler's List	8.3	8.126069	4329
3865	Whiplash	8.3	8.123248	4254
809	Forrest Gump	8.2	8.105954	7927
2294	Spirited Away	8.3	8.105867	3840
2731	The Godfather: Part II	8.3	8.079586	3338

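The top weighted rating is about 8.40, and highly rated movies with only a handful of votes have dropped out of the top 10.

2.20 Updating the Similar-Movie Function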

def find_sim_movie(df, sorted_ind, title_name, top_n = 10):
    title_movie = df[df['title'] == title_name]
    title_index = title_movie.index.values
    
    similar_indexes = sorted_ind[title_index, :(top_n*2)]
    similar_indexes = similar_indexes.reshape(-1)
    
    similar_indexes = similar_indexes[similar_indexes != title_index]
    
    return df.iloc[similar_indexes].sort_values('weighted_vote', ascending =False)[:top_n]

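Compared with the earlier function, this one fetches top_n*2 candidates, removes the query movie itself, and sorts by the weighted rating before returning top_n.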
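
The fetch-more-then-re-rank pattern can be shown without pandas. The titles and scores below are hypothetical, and `recommend` is a stand-in written for this sketch, not the function above.

```python
# Hypothetical candidates already ordered by genre similarity (most similar first).
# Each entry is (title, weighted_vote); the first one is the query movie itself.
ranked_by_similarity = [
    ("query movie", 8.2),
    ("movie a", 5.1),
    ("movie b", 7.9),
    ("movie c", 6.4),
    ("movie d", 8.0),
    ("movie e", 4.2),
    ("movie f", 7.5),
]


def recommend(ranked, query_title, top_n):
    """Take 2 * top_n similar candidates, drop the query, then re-rank by weighted_vote."""
    candidates = ranked[: top_n * 2]
    candidates = [c for c in candidates if c[0] != query_title]
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:top_n]


top = recommend(ranked_by_similarity, "query movie", top_n=3)
print(top)
```

Note that "movie f" is never considered even though its score is high, because it falls outside the 2 * top_n similarity window. The window size controls the trade-off between similarity and rating.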

2.21 Finding Movies Similar to The Godfather

similar_movies = find_sim_movie(movies_df, genre_sim_sorted_ind, 'The Godfather', 10)
similar_movies[['title', 'vote_average', 'weighted_vote']]

	title	vote_average	weighted_vote
2731	The Godfather: Part II	8.3	8.079586
1847	GoodFellas	8.2	7.976937
3866	City of God	8.1	7.759693
1663	Once Upon a Time in America	8.2	7.657811
883	Catch Me If You Can	7.7	7.557097
281	American Gangster	7.4	7.141396
4041	This Is England	7.4	6.739664
1149	American Hustle	6.8	6.717525
1243	Mean Streets	7.2	6.626569
2839	Rounders	6.9	6.530427

This looks somewhat better than before, probably thanks to the weighted rating.

3. Item-Based Nearest-Neighbor Collaborative Filtering

3.1 MovieLens Data

https://grouplens.org/datasets/movielens/latest/

Data such as the users who rated movies and their movie ratings.
Download the small version, which is about 1 MB.

3.2 Loading the Data

import pandas as pd
import numpy as np

movies = pd.read_csv('https://media.githubusercontent.com/media/hmkim312/datas/main/ml-latest-small/movies.csv')
ratings = pd.read_csv('https://media.githubusercontent.com/media/hmkim312/datas/main/ml-latest-small/ratings.csv')

print(movies.shape)
print(ratings.shape)

(9742, 3)
(100836, 4)

movies.head()

	movieId	title	genres
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy
1	2	Jumanji (1995)	Adventure\|Children\|Fantasy
2	3	Grumpier Old Men (1995)	Comedy\|Romance
3	4	Waiting to Exhale (1995)	Comedy\|Drama\|Romance
4	5	Father of the Bride Part II (1995)	Comedy

ratings.head()

	userId	movieId	rating	timestamp
0	1	1	4.0	964982703
1	1	3	4.0	964981247
2	1	6	4.0	964982224
3	1	47	5.0	964983815
4	1	50	5.0	964982931

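movies has 9,742 rows with the movie title and genres.
ratings has 100,836 rows, one per rating a user gave a movie.

The two raw tables need to be reshaped into a single user-by-movie rating matrix.

3.3 Pivot Table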

ratings = ratings[['userId', 'movieId', 'rating']]
ratings_matrix = ratings.pivot_table('rating', index = 'userId', columns = 'movieId')
ratings_matrix.head()

movieId	1	2	3	4	5	6	7	8	9	10	...	193565	193567	193571	193573	193579	193581	193583	193585	193587	193609
userId
1	4.0	NaN	4.0	NaN	NaN	4.0	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	4.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 9724 columns

This gives the ratings per user and movie.
The columns are movie ids, which are not readable, so the titles are merged in from movies.

3.4 Merging ratings and movies on movieId

rating_movies = pd.merge(ratings, movies, on = 'movieId')
rating_movies.head()

	userId	movieId	rating	title	genres
0	1	1	4.0	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy
1	5	1	4.0	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy
2	7	1	4.5	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy
3	15	1	2.5	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy
4	17	1	4.5	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy

Now that each rating has a title attached, pivot again.

3.5 Pivot Table 2

ratings = ratings[['userId', 'movieId', 'rating']]
ratings_matrix = rating_movies.pivot_table('rating', index = 'userId', columns = 'title')
ratings_matrix.head()

title	'71 (2014)	'Hellboy': The Seeds of Creation (2004)	'Round Midnight (1986)	'Salem's Lot (2004)	'Til There Was You (1997)	'Tis the Season for Love (2015)	'burbs, The (1989)	'night Mother (1986)	(500) Days of Summer (2009)	*batteries not included (1987)	...	Zulu (2013)	[REC] (2007)	[REC]² (2009)	[REC]³ 3 Génesis (2012)	anohana: The Flower We Saw That Day - The Movie (2013)	eXistenZ (1999)	xXx (2002)	xXx: State of the Union (2005)	¡Three Amigos! (1986)	À nous la liberté (Freedom for Us) (1931)
userId
1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	4.0	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 9719 columns

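The matrix now shows the rating each user gave each movie, by title.
There are 9,719 title columns versus 9,724 movieId columns, which suggests a few movieIds share a title; pivot_table averages their ratings by default.
It does contain NaN values, though.

3.6 Handling NaN Values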

ratings_matrix = ratings_matrix.fillna(0)
ratings_matrix.head()

title	'71 (2014)	'Hellboy': The Seeds of Creation (2004)	'Round Midnight (1986)	'Salem's Lot (2004)	'Til There Was You (1997)	'Tis the Season for Love (2015)	'burbs, The (1989)	'night Mother (1986)	(500) Days of Summer (2009)	*batteries not included (1987)	...	Zulu (2013)	[REC] (2007)	[REC]² (2009)	[REC]³ 3 Génesis (2012)	anohana: The Flower We Saw That Day - The Movie (2013)	eXistenZ (1999)	xXx (2002)	xXx: State of the Union (2005)	¡Three Amigos! (1986)	À nous la liberté (Freedom for Us) (1931)
userId
1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	4.0	0.0
2	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
3	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
4	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
5	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

5 rows × 9719 columns

fillna replaces the NaN values with 0.

3.7 Transposing the Matrix

ratings_matrix_T = ratings_matrix.transpose()
ratings_matrix_T.head()

userId	1	2	3	4	5	6	7	8	9	10	...	601	602	603	604	605	606	607	608	609	610
title
'71 (2014)	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	4.0
'Hellboy': The Seeds of Creation (2004)	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
'Round Midnight (1986)	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
'Salem's Lot (2004)	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
'Til There Was You (1997)	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

5 rows × 610 columns

cosine_similarity compares rows, so the matrix is transposed to put one movie in each row.
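
A small NumPy sketch of why the transpose matters. `cosine_rows` is a hand-rolled helper written for this example (not scikit-learn's function), and the rating matrix is made up.

```python
import numpy as np


def cosine_rows(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    Xn = X / norms
    return Xn @ Xn.T


# Hypothetical zero-filled rating matrix: 3 users (rows) x 4 movies (columns).
R = np.array([
    [4.0, 0.0, 5.0, 0.0],
    [5.0, 1.0, 4.0, 0.0],
    [0.0, 5.0, 0.0, 4.0],
])

user_sim = cosine_rows(R)    # rows are users  -> shape (3, 3)
item_sim = cosine_rows(R.T)  # rows are movies -> shape (4, 4)

print(user_sim.shape, item_sim.shape)
print(np.round(item_sim, 3))
```

Without the transpose the result would be a user-by-user matrix, which is the user-based variant rather than the item-based one used here.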

3.8 Similarity Results

from sklearn.metrics.pairwise import cosine_similarity

item_sim = cosine_similarity(ratings_matrix_T, ratings_matrix_T)
item_sim_df = pd.DataFrame(data = item_sim, index=ratings_matrix.columns, columns=ratings_matrix.columns)

print(item_sim_df.shape)
item_sim_df.head()

(9719, 9719)

title	'71 (2014)	'Hellboy': The Seeds of Creation (2004)	'Round Midnight (1986)	'Salem's Lot (2004)	'Til There Was You (1997)	'Tis the Season for Love (2015)	'burbs, The (1989)	'night Mother (1986)	(500) Days of Summer (2009)	*batteries not included (1987)	...	Zulu (2013)	[REC] (2007)	[REC]² (2009)	[REC]³ 3 Génesis (2012)	anohana: The Flower We Saw That Day - The Movie (2013)	eXistenZ (1999)	xXx (2002)	xXx: State of the Union (2005)	¡Three Amigos! (1986)	À nous la liberté (Freedom for Us) (1931)
title
'71 (2014)	1.0	0.000000	0.000000	0.000000	0.000000	0.0	0.000000	0.0	0.141653	0.0	...	0.0	0.342055	0.543305	0.707107	0.0	0.0	0.139431	0.327327	0.0	0.0
'Hellboy': The Seeds of Creation (2004)	0.0	1.000000	0.707107	0.000000	0.000000	0.0	0.000000	0.0	0.000000	0.0	...	0.0	0.000000	0.000000	0.000000	0.0	0.0	0.000000	0.000000	0.0	0.0
'Round Midnight (1986)	0.0	0.707107	1.000000	0.000000	0.000000	0.0	0.176777	0.0	0.000000	0.0	...	0.0	0.000000	0.000000	0.000000	0.0	0.0	0.000000	0.000000	0.0	0.0
'Salem's Lot (2004)	0.0	0.000000	0.000000	1.000000	0.857493	0.0	0.000000	0.0	0.000000	0.0	...	0.0	0.000000	0.000000	0.000000	0.0	0.0	0.000000	0.000000	0.0	0.0
'Til There Was You (1997)	0.0	0.000000	0.000000	0.857493	1.000000	0.0	0.000000	0.0	0.000000	0.0	...	0.0	0.000000	0.000000	0.000000	0.0	0.0	0.000000	0.000000	0.0	0.0

5 rows × 9719 columns

This gives a similarity score for every pair of movies.

3.9 Which Movies Are Similar to The Godfather?

item_sim_df['Godfather, The (1972)'].sort_values(ascending = False)[:10]

title
Godfather, The (1972)                                    1.000000
Godfather: Part II, The (1974)                           0.821773
Goodfellas (1990)                                        0.664841
One Flew Over the Cuckoo's Nest (1975)                   0.620536
Star Wars: Episode IV - A New Hope (1977)                0.595317
Fargo (1996)                                             0.588614
Star Wars: Episode V - The Empire Strikes Back (1980)    0.586030
Fight Club (1999)                                        0.581279
Reservoir Dogs (1992)                                    0.579059
Pulp Fiction (1994)                                      0.575270
Name: Godfather, The (1972), dtype: float64

The Godfather: Part II makes sense, but Star Wars also shows up. Hard to say how good this is.

3.10 Which Movies Are Similar to Inception?

item_sim_df['Inception (2010)'].sort_values(ascending = False)[:10]

title
Inception (2010)                 1.000000
Dark Knight, The (2008)          0.727263
Inglourious Basterds (2009)      0.646103
Shutter Island (2010)            0.617736
Dark Knight Rises, The (2012)    0.617504
Fight Club (1999)                0.615417
Interstellar (2014)              0.608150
Up (2009)                        0.606173
Avengers, The (2012)             0.586504
Django Unchained (2012)          0.581342
Name: Inception (2010), dtype: float64

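Movies such as The Dark Knight and Interstellar come up, which do seem somewhat similar.

4. Summary

4.1 Summary

Recommendation systems are used in many fields.
This post walked through simple content-based and item-based recommendation exercises.
Because the people receiving recommendations do not give direct feedback here, it is hard to tell whether the recommendations are actually any good.
Still, recommendation algorithms are widely used at Netflix, YouTube, and elsewhere, and I think the field has as much room to grow as natural language processing.
More in-depth study is needed.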
