
Building a Recommendation System with Movie Data (Recommendations)

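1. Recommendation Systems


1.1 Recommendation Systems

  • Services such as YouTube and Melon now rely on recommendation systems
  • They also play an important role in online shopping malls and content platforms
  • Recommendation systems fall into two broad categories: content-based filtering and collaborative filtering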


1.2 Content-Based Filtering

  • When a user likes a particular item, recommend items that are similar to that item


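1.3 Nearest-Neighbor Collaborative Filtering

  • Uses accumulated user behavior data to predict ratings for items a user has not yet rated
  • User-based: "customers similar to you also bought these products"
  • Item-based: "other customers who chose this product also bought these products"
  • Item-based collaborative filtering is generally more accurate than user-based, for these reasons:
    • Liking a few of the same movies is not enough to conclude that two users share the same taste
    • Very famous movies are often watched regardless of taste
    • Users often do not leave ratings at all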
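
To make the "predict ratings for unrated items" idea concrete, here is a minimal sketch of item-based prediction. The rating matrix is made up for illustration, and the similarity-weighted average used in `predict_item_based` (a hypothetical helper, not part of this post's later code) is only one common choice.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: items); 0 means "not rated".
# All values are invented for illustration.
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

def cosine_sim_matrix(mat):
    """Pairwise cosine similarity between the rows of mat."""
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    unit = mat / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T

# Item-based: compare columns (items) by the ratings users gave them.
item_sim = cosine_sim_matrix(ratings.T)

def predict_item_based(user, item):
    """Similarity-weighted average of the ratings the user has already given."""
    rated = ratings[user] > 0
    weights = item_sim[item, rated]
    if weights.sum() == 0:
        return 0.0
    return float(weights @ ratings[user, rated] / weights.sum())

# User 0 has not rated item 2; item 2 is most similar to item 3, which user 0 rated low.
pred = predict_item_based(user=0, item=2)
print(round(pred, 2))
```

The prediction comes out low because the item most similar to item 2 is one this user disliked, which is the intuition behind the item-based bullet above.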


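1.4 Latent Factor Collaborative Filtering

  • Derives latent factors from the user-item rating matrix
  • The rating matrix is factorized into a user latent-factor matrix and an item latent-factor matrix; multiplying the two back together yields predicted ratings for items the user has not rated yet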
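
The factorize-then-multiply step can be sketched with plain NumPy. This is only an illustration under assumptions: the toy matrix, the number of latent factors `k`, the learning rate, and the regularization strength are all invented here, and stochastic gradient descent is just one of several ways to fit the factors.

```python
import numpy as np

# Toy user-item rating matrix; 0 means "not rated". Values are invented.
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])
observed = R > 0

rng = np.random.default_rng(0)
n_users, n_items = R.shape
k = 2  # number of latent factors (assumed)
P = rng.normal(scale=0.1, size=(n_users, k))  # user latent factors
Q = rng.normal(scale=0.1, size=(n_items, k))  # item latent factors

def rmse(P, Q):
    """Error measured on the observed ratings only."""
    err = (R - P @ Q.T)[observed]
    return float(np.sqrt(np.mean(err ** 2)))

rmse_before = rmse(P, Q)

lr, reg = 0.01, 0.01  # learning rate and L2 regularization (assumed)
for _ in range(2000):
    for u, i in zip(*np.nonzero(observed)):
        err = R[u, i] - P[u] @ Q[i]
        p_old = P[u].copy()
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_old - reg * Q[i])

rmse_after = rmse(P, Q)

# Multiplying the factors back gives a full matrix, including the unrated cells.
pred = P @ Q.T
print(round(rmse_before, 3), round(rmse_after, 3))
print(np.round(pred, 2))
```

The cells that were 0 in `R` now hold predicted ratings, which is what a latent factor recommender would rank on.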


2. Content-Based Filtering in Practice


2.1 The TMDB 5000 Movie Dataset

https://www.kaggle.com/tmdb/tmdb-movie-metadata

  • The TMDB 5000 movie dataset hosted on Kaggle
  • Contains information on 4,803 movies


2.2 Loading the Data

import pandas as pd
import numpy as np

movies = pd.read_csv('https://media.githubusercontent.com/media/hmkim312/datas/main/tmdb5000/tmdb_5000_movies.csv')
print(movies.shape)
movies.head()
(4803, 20)
|  | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count |
| 0 | 237000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | http://www.avatarmovie.com/ | 19995 | [{"id": 1463, "name": "culture clash"}, {"id":... | en | Avatar | In the 22nd century, a paraplegic Marine is di... | 150.437577 | [{"name": "Ingenious Film Partners", "id": 289... | [{"iso_3166_1": "US", "name": "United States o... | 2009-12-10 | 2787965087 | 162.0 | [{"iso_639_1": "en", "name": "English"}, {"iso... | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 |
| 1 | 300000000 | [{"id": 12, "name": "Adventure"}, {"id": 14, "... | http://disney.go.com/disneypictures/pirates/ | 285 | [{"id": 270, "name": "ocean"}, {"id": 726, "na... | en | Pirates of the Caribbean: At World's End | Captain Barbossa, long believed to be dead, ha... | 139.082615 | [{"name": "Walt Disney Pictures", "id": 2}, {"... | [{"iso_3166_1": "US", "name": "United States o... | 2007-05-19 | 961000000 | 169.0 | [{"iso_639_1": "en", "name": "English"}] | Released | At the end of the world, the adventure begins. | Pirates of the Caribbean: At World's End | 6.9 | 4500 |
| 2 | 245000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | http://www.sonypictures.com/movies/spectre/ | 206647 | [{"id": 470, "name": "spy"}, {"id": 818, "name... | en | Spectre | A cryptic message from Bond’s past sends him o... | 107.376788 | [{"name": "Columbia Pictures", "id": 5}, {"nam... | [{"iso_3166_1": "GB", "name": "United Kingdom"... | 2015-10-26 | 880674609 | 148.0 | [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},... | Released | A Plan No One Escapes | Spectre | 6.3 | 4466 |
| 3 | 250000000 | [{"id": 28, "name": "Action"}, {"id": 80, "nam... | http://www.thedarkknightrises.com/ | 49026 | [{"id": 849, "name": "dc comics"}, {"id": 853,... | en | The Dark Knight Rises | Following the death of District Attorney Harve... | 112.312950 | [{"name": "Legendary Pictures", "id": 923}, {"... | [{"iso_3166_1": "US", "name": "United States o... | 2012-07-16 | 1084939099 | 165.0 | [{"iso_639_1": "en", "name": "English"}] | Released | The Legend Ends | The Dark Knight Rises | 7.6 | 9106 |
| 4 | 260000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | http://movies.disney.com/john-carter | 49529 | [{"id": 818, "name": "based on novel"}, {"id":... | en | John Carter | John Carter is a war-weary, former military ca... | 43.926995 | [{"name": "Walt Disney Pictures", "id": 2}] | [{"iso_3166_1": "US", "name": "United States o... | 2012-03-07 | 284139100 | 132.0 | [{"iso_639_1": "en", "name": "English"}] | Released | Lost in our world, found in another. | John Carter | 6.1 | 2124 |


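2.3 Selecting Columns

# Keep only the columns used in this exercise. The .copy() makes movies_df an
# independent DataFrame, so later column assignments do not trigger SettingWithCopyWarning.
movies_df = movies[['id', 'title', 'genres', 'vote_average',
                    'vote_count', 'popularity', 'keywords', 'overview']].copy()
movies_df.head()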
|  | id | title | genres | vote_average | vote_count | popularity | keywords | overview |
| 0 | 19995 | Avatar | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | 7.2 | 11800 | 150.437577 | [{"id": 1463, "name": "culture clash"}, {"id":... | In the 22nd century, a paraplegic Marine is di... |
| 1 | 285 | Pirates of the Caribbean: At World's End | [{"id": 12, "name": "Adventure"}, {"id": 14, "... | 6.9 | 4500 | 139.082615 | [{"id": 270, "name": "ocean"}, {"id": 726, "na... | Captain Barbossa, long believed to be dead, ha... |
| 2 | 206647 | Spectre | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | 6.3 | 4466 | 107.376788 | [{"id": 470, "name": "spy"}, {"id": 818, "name... | A cryptic message from Bond’s past sends him o... |
| 3 | 49026 | The Dark Knight Rises | [{"id": 28, "name": "Action"}, {"id": 80, "nam... | 7.6 | 9106 | 112.312950 | [{"id": 849, "name": "dc comics"}, {"id": 853,... | Following the death of District Attorney Harve... |
| 4 | 49529 | John Carter | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | 6.1 | 2124 | 43.926995 | [{"id": 818, "name": "based on novel"}, {"id":... | John Carter is a war-weary, former military ca... |
  • Only the columns needed for the exercise are kept
  • id : unique movie id
  • title : movie title
  • genres : movie genres
  • vote_average : average rating
  • vote_count : number of votes
  • popularity : popularity score
  • keywords : keywords
  • overview : plot overview


2.4 A Caveat About the Data

movies_df[['genres']][:1].values
array([['[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]']],
      dtype=object)
  • In genres and keywords, each cell holds a list of dicts, but serialized as a single string


2.5 Data Stored as Strings

from ast import literal_eval

code = """(1,2, {'foo', 'bar'})"""
code, type(code)
("(1,2, {'foo', 'bar'})", str)
  • Like this example, genres and keywords are plain str values


literal_eval(code), type(literal_eval(code))
((1, 2, {'bar', 'foo'}), tuple)
  • literal_eval parses the string into the Python object it describes, here a tuple
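
As a side note, the genres and keywords strings in this dataset look like JSON, so `json.loads` from the standard library should parse them as well. The sketch below uses a shortened sample string (an assumption, not a full row) to compare the two parsers.

```python
import json
from ast import literal_eval

# Shortened sample in the same shape as a genres cell (illustrative only).
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed_ast = literal_eval(raw)   # parses Python literal syntax
parsed_json = json.loads(raw)    # parses JSON syntax

# For this input both parsers produce the same list of dicts.
names = [d["name"] for d in parsed_json]
print(parsed_ast == parsed_json, names)
```

The two differ on JSON-only literals such as `true`, `false`, and `null`, which `literal_eval` rejects, so `json.loads` may be the safer choice when the source is known to be JSON.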


2.6 Restoring genres and keywords to Lists of dicts

from ast import literal_eval
import warnings; warnings.filterwarnings('ignore')

movies_df['genres'] = movies_df['genres'].apply(literal_eval)
movies_df['keywords'] = movies_df['keywords'].apply(literal_eval)
movies_df
|  | id | title | genres | vote_average | vote_count | popularity | keywords | overview |
| 0 | 19995 | Avatar | [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam... | 7.2 | 11800 | 150.437577 | [{'id': 1463, 'name': 'culture clash'}, {'id':... | In the 22nd century, a paraplegic Marine is di... |
| 1 | 285 | Pirates of the Caribbean: At World's End | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | 6.9 | 4500 | 139.082615 | [{'id': 270, 'name': 'ocean'}, {'id': 726, 'na... | Captain Barbossa, long believed to be dead, ha... |
| 2 | 206647 | Spectre | [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam... | 6.3 | 4466 | 107.376788 | [{'id': 470, 'name': 'spy'}, {'id': 818, 'name... | A cryptic message from Bond’s past sends him o... |
| 3 | 49026 | The Dark Knight Rises | [{'id': 28, 'name': 'Action'}, {'id': 80, 'nam... | 7.6 | 9106 | 112.312950 | [{'id': 849, 'name': 'dc comics'}, {'id': 853,... | Following the death of District Attorney Harve... |
| 4 | 49529 | John Carter | [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam... | 6.1 | 2124 | 43.926995 | [{'id': 818, 'name': 'based on novel'}, {'id':... | John Carter is a war-weary, former military ca... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4798 | 9367 | El Mariachi | [{'id': 28, 'name': 'Action'}, {'id': 80, 'nam... | 6.6 | 238 | 14.269792 | [{'id': 5616, 'name': 'united states–mexico ba... | El Mariachi just wants to play his guitar and ... |
| 4799 | 72766 | Newlyweds | [{'id': 35, 'name': 'Comedy'}, {'id': 10749, '... | 5.9 | 5 | 0.642552 | [] | A newlywed couple's honeymoon is upended by th... |
| 4800 | 231617 | Signed, Sealed, Delivered | [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... | 7.0 | 6 | 1.444476 | [{'id': 248, 'name': 'date'}, {'id': 699, 'nam... | "Signed, Sealed, Delivered" introduces a dedic... |
| 4801 | 126186 | Shanghai Calling | [] | 5.7 | 7 | 0.857008 | [] | When ambitious New York attorney Sam is sent t... |
| 4802 | 25975 | My Date with Drew | [{'id': 99, 'name': 'Documentary'}] | 6.3 | 16 | 1.929883 | [{'id': 1523, 'name': 'obsession'}, {'id': 224... | Ever since the second grade when he first saw ... |

4803 rows × 8 columns

  • literal_eval converts keywords and genres from strings into lists of dicts


2.7 Using the dict Values as Features

movies_df['genres'] = movies_df['genres'].apply(lambda x : [y['name'] for y in x])
movies_df['keywords'] = movies_df['keywords'].apply(lambda x : [y['name'] for y in x])
movies_df[['genres', 'keywords']][:2]
|  | genres | keywords |
| 0 | [Action, Adventure, Fantasy, Science Fiction] | [culture clash, future, space war, space colon... |
| 1 | [Adventure, Fantasy, Action] | [ocean, drug abuse, exotic island, east india ... |
  • Each dict held id and name keys; a lambda with a list comprehension keeps only the name value so it can be used as a feature


2.8 Joining the genres Words into a Single String

movies_df['genres_literal'] = movies_df['genres'].apply(lambda x : (' ').join(x))
movies_df.head()
|  | id | title | genres | vote_average | vote_count | popularity | keywords | overview | genres_literal |
| 0 | 19995 | Avatar | [Action, Adventure, Fantasy, Science Fiction] | 7.2 | 11800 | 150.437577 | [culture clash, future, space war, space colon... | In the 22nd century, a paraplegic Marine is di... | Action Adventure Fantasy Science Fiction |
| 1 | 285 | Pirates of the Caribbean: At World's End | [Adventure, Fantasy, Action] | 6.9 | 4500 | 139.082615 | [ocean, drug abuse, exotic island, east india ... | Captain Barbossa, long believed to be dead, ha... | Adventure Fantasy Action |
| 2 | 206647 | Spectre | [Action, Adventure, Crime] | 6.3 | 4466 | 107.376788 | [spy, based on novel, secret agent, sequel, mi... | A cryptic message from Bond’s past sends him o... | Action Adventure Crime |
| 3 | 49026 | The Dark Knight Rises | [Action, Crime, Drama, Thriller] | 7.6 | 9106 | 112.312950 | [dc comics, crime fighter, terrorist, secret i... | Following the death of District Attorney Harve... | Action Crime Drama Thriller |
| 4 | 49529 | John Carter | [Action, Adventure, Science Fiction] | 6.1 | 2124 | 43.926995 | [based on novel, mars, medallion, space travel... | John Carter is a war-weary, former military ca... | Action Adventure Science Fiction |
  • join turns each genres list into a single space-separated string, stored in the new genres_literal column


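2.9 Applying CountVectorizer to genres

from sklearn.feature_extraction.text import CountVectorizer

# min_df=1 (the default) keeps every term; recent scikit-learn versions reject an integer min_df of 0.
# ngram_range=(1, 2) counts both single words and pairs of adjacent words.
count_vect = CountVectorizer(min_df=1, ngram_range=(1, 2))
genre_mat = count_vect.fit_transform(movies_df['genres_literal'])
print(genre_mat.shape)
(4803, 276)
  • Vectorizing the 4,803 movies yields a vocabulary of 276 features (single words plus two-word combinations)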
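
To see what kind of features `ngram_range=(1, 2)` produces, the sketch below imitates the tokenization in plain Python. It assumes CountVectorizer's default behavior (lowercasing, tokens of two or more word characters); `unigrams_and_bigrams` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import re

def unigrams_and_bigrams(text):
    """Approximate CountVectorizer(ngram_range=(1, 2)) features for one document."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

features = unigrams_and_bigrams("Action Adventure Fantasy Science Fiction")
print(features)

# The vocabulary is the union of features over all documents.
docs = [
    "Action Adventure Fantasy Science Fiction",
    "Adventure Fantasy Action",
    "Action Adventure Crime",
]
vocabulary = sorted({f for doc in docs for f in unigrams_and_bigrams(doc)})
print(len(vocabulary))
```

Note that "Science Fiction" is split into two words, and that bigrams such as "fantasy science" span two different genres, so the 276 features are not 276 genres.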


2.10 Cosine Similarity

from sklearn.metrics.pairwise import cosine_similarity

genre_sim = cosine_similarity(genre_mat, genre_mat)
print(genre_sim.shape)
print(genre_sim[:2])
(4803, 4803)
[[1.         0.59628479 0.4472136  ... 0.         0.         0.        ]
 [0.59628479 1.         0.4        ... 0.         0.         0.        ]]

  • Cosine similarity gives a genre-similarity score for every pair of movies
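
The first few values in the output above can be checked by hand. The sketch below rebuilds the unigram and bigram counts for the first three movies in plain Python (assuming CountVectorizer's default tokenization) and applies the cosine formula, dot product divided by the product of the vector norms. `ngram_counts` and `cosine` are helpers written for this illustration.

```python
import math
import re
from collections import Counter

def ngram_counts(text):
    """Unigram and bigram counts, approximating CountVectorizer(ngram_range=(1, 2))."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

avatar = ngram_counts("Action Adventure Fantasy Science Fiction")
pirates = ngram_counts("Adventure Fantasy Action")
spectre = ngram_counts("Action Adventure Crime")

sim_avatar_pirates = cosine(avatar, pirates)    # 4 shared features / (sqrt(9) * sqrt(5))
sim_avatar_spectre = cosine(avatar, spectre)    # 3 shared features / (sqrt(9) * sqrt(5))
sim_pirates_spectre = cosine(pirates, spectre)  # 2 shared features / (sqrt(5) * sqrt(5))
print(sim_avatar_pirates, sim_avatar_spectre, sim_pirates_spectre)
```

These reproduce 0.59628479, 0.4472136, and 0.4 from the printed rows, which shows the score depends only on how many genre words and word pairs two movies share.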


2.11 Sorting genre_sim in Descending Order

# argsort returns indices in ascending order; [:, ::-1] reverses each row to descending
genre_sim_sorted_ind = genre_sim.argsort()[:, ::-1]
print(genre_sim_sorted_ind[:1])
[[   0 3494  813 ... 3038 3037 2401]]
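
As a small illustration of the descending sort, here is the same `argsort()[:, ::-1]` pattern on a made-up 3×3 similarity matrix.

```python
import numpy as np

# Invented similarity matrix for three items.
sim = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.5],
    [0.9, 0.5, 1.0],
])

# Ascending argsort per row, then reverse the column order to get descending.
sorted_ind = sim.argsort()[:, ::-1]
print(sorted_ind)
```

Each row lists item indices from most to least similar. When several items tie (for example, movies with identical genre strings all scoring 1.0), their relative order is not something to rely on, so the query item is not guaranteed to come first in its own row.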


2.12 A Function That Returns Recommended Movies as a DataFrame

def find_sim_movie(df, sorted_ind, title_name, top_n=10):
    # Row(s) whose title matches title_name
    title_movie = df[df['title'] == title_name]

    # Index of that movie, then the indices of its top_n most similar movies
    title_index = title_movie.index.values
    similar_indexes = sorted_ind[title_index, :top_n]

    print(similar_indexes)
    similar_indexes = similar_indexes.reshape(-1)

    return df.iloc[similar_indexes]


2.13 Finding Similar Movies

similar_movies = find_sim_movie(movies_df, genre_sim_sorted_ind, 'The Godfather', 10)
similar_movies[['title', 'vote_average']]
[[2731 1243 3636 1946 2640 4065 1847 4217  883 3866]]
|  | title | vote_average |
| 2731 | The Godfather: Part II | 8.3 |
| 1243 | Mean Streets | 7.2 |
| 3636 | Light Sleeper | 5.7 |
| 1946 | The Bad Lieutenant: Port of Call - New Orleans | 6.0 |
| 2640 | Things to Do in Denver When You're Dead | 6.7 |
| 4065 | Mi America | 0.0 |
| 1847 | GoodFellas | 8.2 |
| 4217 | Kids | 6.8 |
| 883 | Catch Me If You Can | 7.7 |
| 3866 | City of God | 8.1 |
  • The Godfather: Part II looks right, but the ones below it are questionable
  • Genre similarity alone ignores quality: the list even includes a movie with a 0.0 rating


2.14 Checking vote_average

movies_df[['title', 'vote_average', 'vote_count']].sort_values('vote_average', ascending = False)[:10]
|  | title | vote_average | vote_count |
| 3519 | Stiff Upper Lips | 10.0 | 1 |
| 4247 | Me You and Five Bucks | 10.0 | 2 |
| 4045 | Dancer, Texas Pop. 81 | 10.0 | 1 |
| 4662 | Little Big Top | 10.0 | 1 |
| 3992 | Sardaarji | 9.5 | 2 |
| 2386 | One Man's Hero | 9.3 | 2 |
| 2970 | There Goes My Baby | 8.5 | 2 |
| 1881 | The Shawshank Redemption | 8.5 | 8205 |
| 2796 | The Prisoner of Zenda | 8.4 | 11 |
| 3337 | The Godfather | 8.4 | 5893 |
  • Looking at the rating next to the number of votes exposes the problem
  • Some movies have a 10.0 average from only one or two votes


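2.15 Weighting Ratings for Movie Selection

Weighted rating = (v / (v + m)) × R + (m / (v + m)) × C

  • v : number of votes cast for the individual movie
  • m : minimum number of votes required for the rating to count
  • R : average rating of the individual movie
  • C : average rating across all movies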


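2.16 Setting C to the Overall Mean Rating and m to the 60th Percentile of Vote Counts

C = movies_df['vote_average'].mean()
m = movies_df['vote_count'].quantile(0.6)
print('C:', round(C, 3), 'm:', round(m,3))
C: 6.092 m: 370.2
  • The overall mean rating is about 6.09, and the 60th percentile of vote counts is about 370
  • Movies with far fewer than 370 votes are pulled strongly toward the overall mean, so they effectively drop out of the top of the ranking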


2.17 A Function to Compute the Weighted Rating

def weighted_vote_average(record):
    # m and C are the module-level values computed in 2.16
    v = record['vote_count']
    R = record['vote_average']

    return ((v/(v+m)) * R) + ((m/(m+v))*C)
  • The weighting formula from 2.15 written as a function
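
A quick worked example shows how the formula behaves at the two extremes. The sketch below plugs in the rounded values of C and m printed in 2.16, so the results can differ from the post's table in the last decimals; `weighted_rating` is a standalone stand-in for the function above.

```python
C = 6.092   # overall mean rating, rounded from the output in 2.16
m = 370.2   # 60th percentile of vote_count, from the output in 2.16

def weighted_rating(v, R, m=m, C=C):
    """(v / (v + m)) * R + (m / (v + m)) * C"""
    return (v / (v + m)) * R + (m / (v + m)) * C

# A movie rated 10.0 by a single voter is pulled almost all the way to C.
one_vote = weighted_rating(v=1, R=10.0)

# The Shawshank Redemption (8205 votes, 8.5) keeps most of its own rating.
shawshank = weighted_rating(v=8205, R=8.5)

print(round(one_vote, 3), round(shawshank, 3))
```

With one vote the weight on the movie's own rating is 1 / 371.2, so the score lands near 6.10; with 8,205 votes the weight is about 0.957, giving roughly 8.396.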


2.18 Recomputing

movies_df['weighted_vote'] = movies_df.apply(weighted_vote_average, axis = 1)
movies_df.head()
|  | id | title | genres | vote_average | vote_count | popularity | keywords | overview | genres_literal | weighted_vote |
| 0 | 19995 | Avatar | [Action, Adventure, Fantasy, Science Fiction] | 7.2 | 11800 | 150.437577 | [culture clash, future, space war, space colon... | In the 22nd century, a paraplegic Marine is di... | Action Adventure Fantasy Science Fiction | 7.166301 |
| 1 | 285 | Pirates of the Caribbean: At World's End | [Adventure, Fantasy, Action] | 6.9 | 4500 | 139.082615 | [ocean, drug abuse, exotic island, east india ... | Captain Barbossa, long believed to be dead, ha... | Adventure Fantasy Action | 6.838594 |
| 2 | 206647 | Spectre | [Action, Adventure, Crime] | 6.3 | 4466 | 107.376788 | [spy, based on novel, secret agent, sequel, mi... | A cryptic message from Bond’s past sends him o... | Action Adventure Crime | 6.284091 |
| 3 | 49026 | The Dark Knight Rises | [Action, Crime, Drama, Thriller] | 7.6 | 9106 | 112.312950 | [dc comics, crime fighter, terrorist, secret i... | Following the death of District Attorney Harve... | Action Crime Drama Thriller | 7.541095 |
| 4 | 49529 | John Carter | [Action, Adventure, Science Fiction] | 6.1 | 2124 | 43.926995 | [based on novel, mars, medallion, space travel... | John Carter is a war-weary, former military ca... | Action Adventure Science Fiction | 6.098838 |
  • The function is applied to every row and stored in the new weighted_vote column


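2.19 Sorting by Weighted Rating

movies_df[['title', 'vote_average', 'weighted_vote', 'vote_count']].sort_values('weighted_vote', ascending = False)[:10]
|  | title | vote_average | weighted_vote | vote_count |
| 1881 | The Shawshank Redemption | 8.5 | 8.396052 | 8205 |
| 3337 | The Godfather | 8.4 | 8.263591 | 5893 |
| 662 | Fight Club | 8.3 | 8.216455 | 9413 |
| 3232 | Pulp Fiction | 8.3 | 8.207102 | 8428 |
| 65 | The Dark Knight | 8.2 | 8.136930 | 12002 |
| 1818 | Schindler's List | 8.3 | 8.126069 | 4329 |
| 3865 | Whiplash | 8.3 | 8.123248 | 4254 |
| 809 | Forrest Gump | 8.2 | 8.105954 | 7927 |
| 2294 | Spirited Away | 8.3 | 8.105867 | 3840 |
| 2731 | The Godfather: Part II | 8.3 | 8.079586 | 3338 |
  • Movies with a very high average from only a handful of votes no longer appear; the top weighted score is about 8.40, and every movie in the list has thousands of votes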


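2.20 Revising the Similar-Movie Function

def find_sim_movie(df, sorted_ind, title_name, top_n=10):
    title_movie = df[df['title'] == title_name]
    title_index = title_movie.index.values

    # Take twice as many candidates so enough remain after filtering and re-ranking
    similar_indexes = sorted_ind[title_index, :(top_n * 2)]
    similar_indexes = similar_indexes.reshape(-1)

    # Drop the query movie itself
    similar_indexes = similar_indexes[similar_indexes != title_index]

    # Re-rank the candidates by weighted rating and keep the top_n
    return df.iloc[similar_indexes].sort_values('weighted_vote', ascending=False)[:top_n]
  • Compared with the earlier function, this one takes 2 × top_n candidates by genre similarity, removes the query movie, and ranks the rest by weighted rating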


2.21 Finding Movies Similar to The Godfather

similar_movies = find_sim_movie(movies_df, genre_sim_sorted_ind, 'The Godfather', 10)
similar_movies[['title', 'vote_average', 'weighted_vote']]
|  | title | vote_average | weighted_vote |
| 2731 | The Godfather: Part II | 8.3 | 8.079586 |
| 1847 | GoodFellas | 8.2 | 7.976937 |
| 3866 | City of God | 8.1 | 7.759693 |
| 1663 | Once Upon a Time in America | 8.2 | 7.657811 |
| 883 | Catch Me If You Can | 7.7 | 7.557097 |
| 281 | American Gangster | 7.4 | 7.141396 |
| 4041 | This Is England | 7.4 | 6.739664 |
| 1149 | American Hustle | 6.8 | 6.717525 |
| 1243 | Mean Streets | 7.2 | 6.626569 |
| 2839 | Rounders | 6.9 | 6.530427 |
  • This looks better than before, presumably thanks to the weighting


3. Item-Based Nearest-Neighbor Collaborative Filtering


3.1 The MovieLens Data

https://grouplens.org/datasets/movielens/latest/

  • Data on users who rated movies and the ratings they gave
  • Download the roughly 1 MB "small" version


3.2 Loading the Data

import pandas as pd
import numpy as np

movies = pd.read_csv('https://media.githubusercontent.com/media/hmkim312/datas/main/ml-latest-small/movies.csv')
ratings = pd.read_csv('https://media.githubusercontent.com/media/hmkim312/datas/main/ml-latest-small/ratings.csv')

print(movies.shape)
print(ratings.shape)
(9742, 3)
(100836, 4)


movies.head()
   movieId  title                               genres
0  1        Toy Story (1995)                    Adventure|Animation|Children|Comedy|Fantasy
1  2        Jumanji (1995)                      Adventure|Children|Fantasy
2  3        Grumpier Old Men (1995)             Comedy|Romance
3  4        Waiting to Exhale (1995)            Comedy|Drama|Romance
4  5        Father of the Bride Part II (1995)  Comedy


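ratings.head()
|  | userId | movieId | rating | timestamp |
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |
  • The movies data has 9,742 rows with the movie title and genres
  • The ratings data has 100,836 rows, one per user-movie rating


  • These two raw tables need to be combined into a single user-by-movie rating matrix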


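3.3 Pivot Table

ratings = ratings[['userId', 'movieId', 'rating']]
ratings_matrix = ratings.pivot_table('rating', index = 'userId', columns = 'movieId')
ratings_matrix.head()
| userId \ movieId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 193565 | 193567 | 193571 | 193573 | 193579 | 193581 | 193583 | 193585 | 193587 | 193609 |
| 1 | 4.0 | NaN | 4.0 | NaN | NaN | 4.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

5 rows × 9724 columns

  • This gives each user's rating for each movie
  • The columns are movieId values, which do not say which movie is which, so the titles from the movies data need to be merged in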


3.4 Merging ratings and movies on movieId

rating_movies = pd.merge(ratings, movies, on = 'movieId')
rating_movies.head()
   userId  movieId  rating  title             genres
0  1       1        4.0     Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1  5       1        4.0     Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
2  7       1        4.5     Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
3  15      1        2.5     Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
4  17      1        4.5     Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
  • Now that each rating carries its title, the pivot can be redone


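3.5 Pivot Table, Take 2

ratings = ratings[['userId', 'movieId', 'rating']]
ratings_matrix = rating_movies.pivot_table('rating', index = 'userId', columns = 'title')
ratings_matrix.head()
| userId \ title | '71 (2014) | 'Hellboy': The Seeds of Creation (2004) | 'Round Midnight (1986) | 'Salem's Lot (2004) | 'Til There Was You (1997) | 'Tis the Season for Love (2015) | 'burbs, The (1989) | 'night Mother (1986) | (500) Days of Summer (2009) | *batteries not included (1987) | ... | Zulu (2013) | [REC] (2007) | [REC]² (2009) | [REC]³ 3 Génesis (2012) | anohana: The Flower We Saw That Day - The Movie (2013) | eXistenZ (1999) | xXx (2002) | xXx: State of the Union (2005) | ¡Three Amigos! (1986) | À nous la liberté (Freedom for Us) (1931) |
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

5 rows × 9719 columns

  • The matrix now shows the rating each user gave each movie, by title
  • There are 9,719 title columns versus 9,724 movieId columns, presumably because a few different movieId values share the same title
  • Movies a user has not rated appear as NaN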

3.6 Handling NaN Values

ratings_matrix = ratings_matrix.fillna(0)
ratings_matrix.head()
| userId \ title | '71 (2014) | 'Hellboy': The Seeds of Creation (2004) | 'Round Midnight (1986) | 'Salem's Lot (2004) | 'Til There Was You (1997) | 'Tis the Season for Love (2015) | 'burbs, The (1989) | 'night Mother (1986) | (500) Days of Summer (2009) | *batteries not included (1987) | ... | Zulu (2013) | [REC] (2007) | [REC]² (2009) | [REC]³ 3 Génesis (2012) | anohana: The Flower We Saw That Day - The Movie (2013) | eXistenZ (1999) | xXx (2002) | xXx: State of the Union (2005) | ¡Three Amigos! (1986) | À nous la liberté (Freedom for Us) (1931) |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

5 rows × 9719 columns

  • NaN values are replaced with 0 using fillna


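3.7 Transposing the Matrix

ratings_matrix_T = ratings_matrix.transpose()
ratings_matrix_T.head()
| title \ userId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 601 | 602 | 603 | 604 | 605 | 606 | 607 | 608 | 609 | 610 |
| '71 (2014) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 |
| 'Hellboy': The Seeds of Creation (2004) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 'Round Midnight (1986) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 'Salem's Lot (2004) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 'Til There Was You (1997) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

5 rows × 610 columns

  • cosine_similarity compares rows, so the matrix is transposed to put one movie on each row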
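
Why the transpose matters can be shown with a small NumPy sketch. The matrix is invented, and `cosine_rows` is a hypothetical helper that compares rows the same way `cosine_similarity` does.

```python
import numpy as np

# Invented ratings: 4 users (rows) x 3 items (columns), 0 = not rated.
ratings = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 4.0],
])

def cosine_rows(mat):
    """Cosine similarity between every pair of rows (assumes no all-zero rows)."""
    unit = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    return unit @ unit.T

user_sim = cosine_rows(ratings)     # rows are users -> user-user similarity
item_sim = cosine_rows(ratings.T)   # rows are items -> item-item similarity

print(user_sim.shape, item_sim.shape)
print(np.round(item_sim, 3))
```

Without the transpose the result is a user-by-user matrix; with it, items 0 and 1 (liked by the same users) come out far more similar to each other than to item 2.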


3.8 Similarity Results

from sklearn.metrics.pairwise import cosine_similarity

item_sim = cosine_similarity(ratings_matrix_T, ratings_matrix_T)
item_sim_df = pd.DataFrame(data = item_sim, index=ratings_matrix.columns, columns=ratings_matrix.columns)

print(item_sim_df.shape)
item_sim_df.head()
(9719, 9719)
| title | '71 (2014) | 'Hellboy': The Seeds of Creation (2004) | 'Round Midnight (1986) | 'Salem's Lot (2004) | 'Til There Was You (1997) | 'Tis the Season for Love (2015) | 'burbs, The (1989) | 'night Mother (1986) | (500) Days of Summer (2009) | *batteries not included (1987) | ... | Zulu (2013) | [REC] (2007) | [REC]² (2009) | [REC]³ 3 Génesis (2012) | anohana: The Flower We Saw That Day - The Movie (2013) | eXistenZ (1999) | xXx (2002) | xXx: State of the Union (2005) | ¡Three Amigos! (1986) | À nous la liberté (Freedom for Us) (1931) |
| '71 (2014) | 1.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.141653 | 0.0 | ... | 0.0 | 0.342055 | 0.543305 | 0.707107 | 0.0 | 0.0 | 0.139431 | 0.327327 | 0.0 | 0.0 |
| 'Hellboy': The Seeds of Creation (2004) | 0.0 | 1.000000 | 0.707107 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | ... | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 |
| 'Round Midnight (1986) | 0.0 | 0.707107 | 1.000000 | 0.000000 | 0.000000 | 0.0 | 0.176777 | 0.0 | 0.000000 | 0.0 | ... | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 |
| 'Salem's Lot (2004) | 0.0 | 0.000000 | 0.000000 | 1.000000 | 0.857493 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | ... | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 |
| 'Til There Was You (1997) | 0.0 | 0.000000 | 0.000000 | 0.857493 | 1.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | ... | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 |

5 rows × 9719 columns

  • This produces a similarity score for every pair of movies


3.9 Which Movies Are Similar to The Godfather?

item_sim_df['Godfather, The (1972)'].sort_values(ascending = False)[:10]
title
Godfather, The (1972)                                    1.000000
Godfather: Part II, The (1974)                           0.821773
Goodfellas (1990)                                        0.664841
One Flew Over the Cuckoo's Nest (1975)                   0.620536
Star Wars: Episode IV - A New Hope (1977)                0.595317
Fargo (1996)                                             0.588614
Star Wars: Episode V - The Empire Strikes Back (1980)    0.586030
Fight Club (1999)                                        0.581279
Reservoir Dogs (1992)                                    0.579059
Pulp Fiction (1994)                                      0.575270
Name: Godfather, The (1972), dtype: float64
  • The Godfather: Part II makes sense, but Star Wars? Hmm, I'm not so sure


3.10 Which Movies Are Similar to Inception?

item_sim_df['Inception (2010)'].sort_values(ascending = False)[:10]
title
Inception (2010)                 1.000000
Dark Knight, The (2008)          0.727263
Inglourious Basterds (2009)      0.646103
Shutter Island (2010)            0.617736
Dark Knight Rises, The (2012)    0.617504
Fight Club (1999)                0.615417
Interstellar (2014)              0.608150
Up (2009)                        0.606173
Avengers, The (2012)             0.586504
Django Unchained (2012)          0.581342
Name: Inception (2010), dtype: float64
  • The Dark Knight, Interstellar, and others that do feel somewhat similar show up


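4. Summary


4.1 Summary

  • Recommendation systems are used in many fields.
  • This post walked through simple content-based and item-based recommendation exercises.
  • Because the people receiving recommendations rarely give direct feedback, it is hard to know whether a recommendation system is actually working well.
  • Even so, recommendation algorithms are widely used at Netflix, YouTube, and elsewhere, and I think the field has as much room to grow as natural language processing.
  • More in-depth study is needed.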
This post is licensed under CC BY 4.0 by the author.