1. Good Books
1.1 Good Books data
https://www.kaggle.com/zygmunt/goodbooks-10k
- A 10k (10,000-book) dataset made up of the ratings, books, tags, book_tags, and to_read files
2. Recommender system practice
2.1 Data load
2.1.1 Books data
import numpy as np
import pandas as pd
books = pd.read_csv('https://media.githubusercontent.com/media/hmkim312/datas/main/goodbooks-10k/books.csv', encoding='ISO-8859-1')
books.head()
 | id | book_id | best_book_id | work_id | books_count | isbn | isbn13 | authors | original_publication_year | original_title | ... | ratings_count | work_ratings_count | work_text_reviews_count | ratings_1 | ratings_2 | ratings_3 | ratings_4 | ratings_5 | image_url | small_image_url |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2767052 | 2767052 | 2792775 | 272 | 439023483 | 9.780439e+12 | Suzanne Collins | 2008.0 | The Hunger Games | ... | 4780653 | 4942365 | 155254 | 66715 | 127936 | 560092 | 1481305 | 2706317 | https://images.gr-assets.com/books/1447303603m... | https://images.gr-assets.com/books/1447303603s... |
1 | 2 | 3 | 3 | 4640799 | 491 | 439554934 | 9.780440e+12 | J.K. Rowling, Mary GrandPré | 1997.0 | Harry Potter and the Philosopher's Stone | ... | 4602479 | 4800065 | 75867 | 75504 | 101676 | 455024 | 1156318 | 3011543 | https://images.gr-assets.com/books/1474154022m... | https://images.gr-assets.com/books/1474154022s... |
2 | 3 | 41865 | 41865 | 3212258 | 226 | 316015849 | 9.780316e+12 | Stephenie Meyer | 2005.0 | Twilight | ... | 3866839 | 3916824 | 95009 | 456191 | 436802 | 793319 | 875073 | 1355439 | https://images.gr-assets.com/books/1361039443m... | https://images.gr-assets.com/books/1361039443s... |
3 | 4 | 2657 | 2657 | 3275794 | 487 | 61120081 | 9.780061e+12 | Harper Lee | 1960.0 | To Kill a Mockingbird | ... | 3198671 | 3340896 | 72586 | 60427 | 117415 | 446835 | 1001952 | 1714267 | https://images.gr-assets.com/books/1361975680m... | https://images.gr-assets.com/books/1361975680s... |
4 | 5 | 4671 | 4671 | 245494 | 1356 | 743273567 | 9.780743e+12 | F. Scott Fitzgerald | 1925.0 | The Great Gatsby | ... | 2683664 | 2773745 | 51992 | 86236 | 197621 | 606158 | 936012 | 947718 | https://images.gr-assets.com/books/1490528560m... | https://images.gr-assets.com/books/1490528560s... |
5 rows × 23 columns
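- A CSV file containing information about each book
- books, ratings, and book_tags are read here with encoding='ISO-8859-1'
- ratings_1 through ratings_5 hold the number of 1-star through 5-star ratings each book received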
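Because ratings_1 through ratings_5 are counts, a mean star rating can be recovered from them as a weighted average. A minimal sketch on a made-up two-row table (toy_books and its numbers are assumptions for illustration, not rows from books.csv):

```python
import pandas as pd

# Toy stand-in for books.csv; the counts below are invented.
toy_books = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "ratings_1": [10, 0],
    "ratings_2": [0, 0],
    "ratings_3": [0, 5],
    "ratings_4": [0, 5],
    "ratings_5": [10, 0],
})

star_cols = [f"ratings_{s}" for s in range(1, 6)]
# Star value for each count column, indexed by column name so it aligns on axis=1.
stars = pd.Series(range(1, 6), index=star_cols)

# Weighted mean per row: sum(star * count) / sum(count).
total_votes = toy_books[star_cols].sum(axis=1)
weighted_sum = toy_books[star_cols].mul(stars, axis=1).sum(axis=1)
toy_books["mean_rating"] = weighted_sum / total_votes

print(toy_books[["title", "mean_rating"]])
```

The same two lines should work on the real books DataFrame, assuming the five count columns are present and no book has zero total votes.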
2.1.2 Ratings Data
ratings = pd.read_csv('https://media.githubusercontent.com/media/hmkim312/datas/main/goodbooks-10k/ratings.csv', encoding='ISO-8859-1')
ratings.head()
 | book_id | user_id | rating |
---|---|---|---|
0 | 1 | 314 | 5 |
1 | 1 | 439 | 3 |
2 | 1 | 588 | 5 |
3 | 1 | 1169 | 4 |
4 | 1 | 1185 | 4 |
- The ratings data holds a book_id, a user_id, and the rating that user gave the book
2.1.3 Book tags Data load
book_tags = pd.read_csv('https://media.githubusercontent.com/media/hmkim312/datas/main/goodbooks-10k/book_tags.csv', encoding='ISO-8859-1')
book_tags.head()
 | goodreads_book_id | tag_id | count |
---|---|---|---|
0 | 1 | 30574 | 167697 |
1 | 1 | 11305 | 37174 |
2 | 1 | 11557 | 34173 |
3 | 1 | 8717 | 12986 |
4 | 1 | 33114 | 12716 |
- Holds each book's id (goodreads_book_id) and tag id, along with a count column
2.1.4 Tags Data load
tags = pd.read_csv('https://media.githubusercontent.com/media/hmkim312/datas/main/goodbooks-10k/tags.csv')
tags.tail()
 | tag_id | tag_name |
---|---|---|
34247 | 34247 | Childrens |
34248 | 34248 | Favorites |
34249 | 34249 | Manga |
34250 | 34250 | SERIES |
34251 | 34251 | favourites |
- Maps each tag_id to its tag_name
2.1.5 To-read Data load
to_read = pd.read_csv('https://media.githubusercontent.com/media/hmkim312/datas/main/goodbooks-10k/to_read.csv')
to_read.head()
 | user_id | book_id |
---|---|---|
0 | 1 | 112 |
1 | 1 | 235 |
2 | 1 | 533 |
3 | 1 | 1198 |
4 | 1 | 1874 |
- Holds the ids of the books each user has marked as to-read (not the books they have already read)
2.2 Tag Data preprocessing
tags_join_Df = pd.merge(book_tags, tags, left_on='tag_id', right_on='tag_id', how = 'inner')
tags_join_Df.head()
 | goodreads_book_id | tag_id | count | tag_name |
---|---|---|---|---|
0 | 1 | 30574 | 167697 | to-read |
1 | 2 | 30574 | 24549 | to-read |
2 | 3 | 30574 | 496107 | to-read |
3 | 5 | 30574 | 11909 | to-read |
4 | 6 | 30574 | 298 | to-read |
- Merge tag_id and tag_name into the DataFrame that holds the book ids (book_tags), joining on tag_id
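A small sketch of what this inner merge does, on invented tables (toy_book_tags and toy_tags are assumptions). Since both sides name the key tag_id, on='tag_id' is equivalent to passing left_on and right_on, and a tag that no book uses drops out of the result:

```python
import pandas as pd

# Invented stand-ins for book_tags.csv and tags.csv.
toy_book_tags = pd.DataFrame({
    "goodreads_book_id": [1, 1, 2],
    "tag_id": [10, 11, 10],
    "count": [100, 50, 70],
})
toy_tags = pd.DataFrame({
    "tag_id": [10, 11, 12],
    "tag_name": ["fantasy", "classics", "unused"],
})

# Inner join: only tag_ids present on both sides survive, so tag 12 disappears.
joined = pd.merge(toy_book_tags, toy_tags, on="tag_id", how="inner")

print(joined)
```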
2.3 TF-IDF on authors
books['authors'][:3]
0 Suzanne Collins
1 J.K. Rowling, Mary GrandPré
2 Stephenie Meyer
Name: authors, dtype: object
- The books data has an authors column
from sklearn.feature_extraction.text import TfidfVectorizer
# min_df=1 (the default) keeps every term; recent scikit-learn releases reject the integer 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(books['authors'])
tfidf_matrix
<10000x14742 sparse matrix of type '<class 'numpy.float64'>'
with 43235 stored elements in Compressed Sparse Row format>
- Run TF-IDF on the author names in books, giving a 10000 x 14742 sparse matrix
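To see which features ngram_range=(1, 2) produces from author strings, here is a minimal sketch on three made-up entries (toy_authors is an assumption; stop_words is left out to keep the vocabulary easy to read). Note that the default token pattern only keeps tokens of two or more characters, so initials such as "J.R.R." contribute nothing:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented author strings in the same style as the authors column.
toy_authors = ["J.R.R. Tolkien", "J.R.R. Tolkien, Alan Lee", "Harper Lee"]

toy_tf = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
toy_matrix = toy_tf.fit_transform(toy_authors)

# Unigrams and bigrams that ended up in the vocabulary, in alphabetical order.
vocab = sorted(toy_tf.vocabulary_)
print(vocab)
print(toy_matrix.shape)  # (number of documents, number of features)
```

Bigrams are built across the comma as well ("tolkien alan"), because punctuation is discarded before n-grams are formed.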
2.4 Measuring similarity
from sklearn.metrics.pairwise import linear_kernel
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim
array([[1., 0., 0., ..., 0., 0., 0.],
[0., 1., 0., ..., 0., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 1., 0., 0.],
[0., 0., 0., ..., 0., 1., 0.],
[0., 0., 0., ..., 0., 0., 1.]])
- Use scikit-learn's linear_kernel to build a similarity matrix from the author TF-IDF matrix
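linear_kernel returns plain dot products. TfidfVectorizer L2-normalizes each row by default, so on TF-IDF output the dot product should coincide with the cosine similarity, which is why the result can be named cosine_sim. A minimal check on made-up strings (docs is an assumption, not the real data):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Invented "author" documents: two identical, two sharing one word.
docs = ["tolkien", "tolkien", "harper lee", "alan lee"]
X = TfidfVectorizer().fit_transform(docs)  # rows are L2-normalized by default

dot = linear_kernel(X, X)      # dot products only
cos = cosine_similarity(X, X)  # normalizes first, then takes dot products

print(np.allclose(dot, cos))
print(dot.round(3))
```

Identical author strings score 1, strings with no shared token score 0, and partial overlap lands in between.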
2.5 Which books are similar to The Hobbit?
title = books['title']
indices = pd.Series(books.index, index=books['title'])
indices['The Hobbit']
6
- The Hobbit is at index 6
- Pull row 6 and use it to look for similar books
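One thing to watch with a title-to-index Series: if any title appears more than once, the lookup returns a Series of positions instead of a single integer. Whether that happens in this dataset is not checked here; the sketch below uses an invented table (toy) to show the behavior and one possible way to keep only the first occurrence:

```python
import pandas as pd

# Invented titles, one of them duplicated.
toy = pd.DataFrame({"title": ["The Hobbit", "Emma", "Emma"]})
indices_all = pd.Series(toy.index, index=toy["title"])

unique_lookup = indices_all["The Hobbit"]  # a single integer
dup_lookup = indices_all["Emma"]           # a Series holding both positions

# Keep only the first position for each title.
indices_unique = indices_all[~indices_all.index.duplicated(keep="first")]

print(unique_lookup)
print(list(dup_lookup))
print(indices_unique["Emma"])
```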
cosine_sim[indices['The Hobbit']]
array([0., 0., 0., ..., 0., 0., 0.])
- This pulls the row of the similarity matrix at The Hobbit's index
cosine_sim[indices['The Hobbit']].shape
(10000,)
- There are 10,000 books in total
list(enumerate(cosine_sim[indices['The Hobbit']]))[:3]
[(0, 0.0), (1, 0.0), (2, 0.0)]
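- Take only The Hobbit's row from the similarity matrix, use enumerate to pair each column position (the other book's index) with its cosine similarity score as a tuple, and collect the tuples in a list
2.6 Indices of the most similar books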
sim_scores = list(enumerate(cosine_sim[indices['The Hobbit']]))
sim_scores = sorted(sim_scores, key = lambda x : x[1], reverse= True)
sim_scores[:3]
[(6, 1.0), (18, 1.0), (154, 1.0)]
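- Sort the book indices (the columns here) and cosine scores so the books most similar to The Hobbit come first
- Some books score a perfect 1: indices 18 and 154
- The leading (6, 1.0) is The Hobbit itself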
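The query book comes first here only because Python's sorted is stable and index 6 is the lowest index among the books tied at 1.0. For a query with a higher index than another tied book, position 0 would be that other book. A hedged alternative is to exclude the query by index rather than by position; the sketch below uses a made-up score row (scores, query_idx, and k are assumptions):

```python
import numpy as np

# Invented similarity row: positions 2 and 4 are tied at 1.0.
scores = np.array([0.0, 0.2, 1.0, 0.7, 1.0, 0.0])
query_idx = 2  # the book we are asking about
k = 3          # how many neighbours to keep

# Descending order; a stable sort keeps tied entries in index order.
order = np.argsort(-scores, kind="stable")
# Drop the query itself wherever it landed, then keep the top k.
order = order[order != query_idx][:k]

print(order.tolist())
```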
print('Title at index 6:', books['title'][6])
print('Title at index 18:', books['title'][18])
print('Title at index 154:', books['title'][154])
Title at index 6: The Hobbit
Title at index 18: The Fellowship of the Ring (The Lord of the Rings, #1)
Title at index 154: The Two Towers (The Lord of the Rings, #2)
- The books most similar to The Hobbit turn out to be The Lord of the Rings series
2.7 Finding similar books by author
sim_scores = sim_scores[1:11]
book_indices = [i[0] for i in sim_scores]
title.iloc[book_indices]
18 The Fellowship of the Ring (The Lord of the Ri...
154 The Two Towers (The Lord of the Rings, #2)
160 The Return of the King (The Lord of the Rings,...
188 The Lord of the Rings (The Lord of the Rings, ...
963 J.R.R. Tolkien 4-Book Boxed Set: The Hobbit an...
4975 Unfinished Tales of Númenor and Middle-Earth
2308 The Children of Húrin
610 The Silmarillion (Middle-Earth Universe)
8271 The Complete Guide to Middle-Earth
1128 The History of the Hobbit, Part One: Mr. Baggins
Name: title, dtype: object
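- The remaining results are also mostly Hobbit or Middle-earth titles, most likely because they share the same author
- That follows from the setup: TF-IDF was run on author names only, so every book with the same author string gets the same score (1)
2.8 Adding tags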
books_with_tags = pd.merge(books, tags_join_Df, left_on= 'book_id', right_on='goodreads_book_id', how = 'inner')
books_with_tags.head()
 | id | book_id | best_book_id | work_id | books_count | isbn | isbn13 | authors | original_publication_year | original_title | ... | ratings_2 | ratings_3 | ratings_4 | ratings_5 | image_url | small_image_url | goodreads_book_id | tag_id | count | tag_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2767052 | 2767052 | 2792775 | 272 | 439023483 | 9.780439e+12 | Suzanne Collins | 2008.0 | The Hunger Games | ... | 127936 | 560092 | 1481305 | 2706317 | https://images.gr-assets.com/books/1447303603m... | https://images.gr-assets.com/books/1447303603s... | 2767052 | 30574 | 11314 | to-read |
1 | 1 | 2767052 | 2767052 | 2792775 | 272 | 439023483 | 9.780439e+12 | Suzanne Collins | 2008.0 | The Hunger Games | ... | 127936 | 560092 | 1481305 | 2706317 | https://images.gr-assets.com/books/1447303603m... | https://images.gr-assets.com/books/1447303603s... | 2767052 | 11305 | 10836 | fantasy |
2 | 1 | 2767052 | 2767052 | 2792775 | 272 | 439023483 | 9.780439e+12 | Suzanne Collins | 2008.0 | The Hunger Games | ... | 127936 | 560092 | 1481305 | 2706317 | https://images.gr-assets.com/books/1447303603m... | https://images.gr-assets.com/books/1447303603s... | 2767052 | 11557 | 50755 | favorites |
3 | 1 | 2767052 | 2767052 | 2792775 | 272 | 439023483 | 9.780439e+12 | Suzanne Collins | 2008.0 | The Hunger Games | ... | 127936 | 560092 | 1481305 | 2706317 | https://images.gr-assets.com/books/1447303603m... | https://images.gr-assets.com/books/1447303603s... | 2767052 | 8717 | 35418 | currently-reading |
4 | 1 | 2767052 | 2767052 | 2792775 | 272 | 439023483 | 9.780439e+12 | Suzanne Collins | 2008.0 | The Hunger Games | ... | 127936 | 560092 | 1481305 | 2706317 | https://images.gr-assets.com/books/1447303603m... | https://images.gr-assets.com/books/1447303603s... | 2767052 | 33114 | 25968 | young-adult |
5 rows × 27 columns
- Merge the tag_id and tag_name table built earlier into the books DataFrame
2.9 TF-IDF on tags
# min_df=1 is the default; recent scikit-learn releases reject the integer 0
tf_tag = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix_tag = tf_tag.fit_transform(books_with_tags['tag_name'].head(10000))
cosine_sim_tag = linear_kernel(tfidf_matrix_tag, tfidf_matrix_tag)
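- The earlier TF-IDF used author names; this one uses tags. Note that books_with_tags has one row per (book, tag) pair, so head(10000) keeps the first 10,000 tag rows rather than one row per book, and row i of cosine_sim_tag does not correspond to row i of books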
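A minimal sketch of why row positions matter at this step, on made-up tables (toy_books, toy_tags, and their values are assumptions, not the real data). After the merge there is one row per (book, tag) pair, so taking the first N rows does not give one row per book, whereas grouping by book_id does:

```python
import pandas as pd

# Two invented books with three tags each.
toy_books = pd.DataFrame({"book_id": [100, 200], "title": ["A", "B"]})
toy_tags = pd.DataFrame({
    "goodreads_book_id": [100, 100, 100, 200, 200, 200],
    "tag_name": ["fantasy", "classics", "to-read", "romance", "to-read", "fiction"],
})

# One row per (book, tag) pair: 6 rows for 2 books.
exploded = pd.merge(toy_books, toy_tags, left_on="book_id", right_on="goodreads_book_id")

# Taking as many rows as there are books still only covers the first book.
first_n = exploded.head(len(toy_books))

# One document per book: join that book's tags into a single string.
per_book = exploded.groupby("book_id")["tag_name"].apply(" ".join)

print(first_n["book_id"].tolist())
print(per_book)
```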
2.10 A function that returns recommended books
title_tag = books['title']
indices_tag = pd.Series(books.index, index=books['title'])
def tags_recommendations(title):
    idx = indices_tag[title]  # row position of the requested title
    sim_scores = list(enumerate(cosine_sim_tag[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]  # skip the first entry, keep the next 10
    book_indices = [i[0] for i in sim_scores]
    return title_tag.iloc[book_indices]
- This time, write a function that takes a book title and returns recommended books
- sim_scores = sim_scores[1:11] keeps 10 results; it starts at 1 because position 0 is the input title itself
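The same idea can be written as one reusable function that takes the titles and the similarity matrix as arguments, so it is not tied to a particular global variable. This is a sketch under assumptions, not the original notebook's code: recommend, top_n, and the toy table are all hypothetical names and data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def recommend(title, titles, sim_matrix, top_n=10):
    """Return up to top_n titles most similar to `title`, excluding the book itself."""
    matches = np.flatnonzero(titles.to_numpy() == title)
    if len(matches) == 0:
        raise KeyError(f"unknown title: {title!r}")
    idx = matches[0]  # use the first match if the title repeats
    order = np.argsort(-sim_matrix[idx], kind="stable")  # most similar first
    order = order[order != idx][:top_n]                  # drop the query by index
    return titles.iloc[order]

# Invented four-book corpus to exercise the function.
toy = pd.DataFrame({
    "title": ["Hobbit", "Rings", "Emma", "Persuasion"],
    "text": ["tolkien fantasy", "tolkien fantasy epic",
             "austen romance", "austen romance classic"],
})
sim = linear_kernel(TfidfVectorizer().fit_transform(toy["text"]))

result = recommend("Rings", toy["title"], sim, top_n=1)
print(result.tolist())
```

Passing a different similarity matrix (authors, tags, or a combined corpus) reuses the same function, provided its rows line up with the rows of titles.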
2.11 Books similar to The Hobbit, found by tag
tags_recommendations('The Hobbit').head(20)
16 Catching Fire (The Hunger Games, #2)
31 Of Mice and Men
107 Confessions of a Shopaholic (Shopaholic, #1)
125 Dune (Dune Chronicles #1)
149 The Red Tent
206 One for the Money (Stephanie Plum, #1)
214 Ready Player One
231 The Gunslinger (The Dark Tower, #1)
253 Shiver (The Wolves of Mercy Falls, #1)
313 Inkheart (Inkworld, #1)
Name: title, dtype: object
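- Titles such as Catching Fire (The Hunger Games) and Dune appear, so the results look loosely like The Hobbit's fantasy genre, although unrelated titles such as Confessions of a Shopaholic are mixed in
2.12 Attaching all tag names to each book id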
temp_df = books_with_tags.groupby('book_id')['tag_name'].apply(' '.join).reset_index()
temp_df.head()
 | book_id | tag_name |
---|---|---|
0 | 1 | to-read fantasy favorites currently-reading yo... |
1 | 2 | to-read fantasy favorites currently-reading yo... |
2 | 3 | to-read fantasy favorites currently-reading yo... |
3 | 5 | to-read fantasy favorites currently-reading yo... |
4 | 6 | to-read fantasy young-adult fiction harry-pott... |
- All tag_name values for each book_id are gathered into a single string
2.13 Merge into Books
books = pd.merge(books, temp_df, on = 'book_id', how = 'inner')
books.head()
 | id | book_id | best_book_id | work_id | books_count | isbn | isbn13 | authors | original_publication_year | original_title | ... | work_ratings_count | work_text_reviews_count | ratings_1 | ratings_2 | ratings_3 | ratings_4 | ratings_5 | image_url | small_image_url | tag_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2767052 | 2767052 | 2792775 | 272 | 439023483 | 9.780439e+12 | Suzanne Collins | 2008.0 | The Hunger Games | ... | 4942365 | 155254 | 66715 | 127936 | 560092 | 1481305 | 2706317 | https://images.gr-assets.com/books/1447303603m... | https://images.gr-assets.com/books/1447303603s... | to-read fantasy favorites currently-reading yo... |
1 | 2 | 3 | 3 | 4640799 | 491 | 439554934 | 9.780440e+12 | J.K. Rowling, Mary GrandPré | 1997.0 | Harry Potter and the Philosopher's Stone | ... | 4800065 | 75867 | 75504 | 101676 | 455024 | 1156318 | 3011543 | https://images.gr-assets.com/books/1474154022m... | https://images.gr-assets.com/books/1474154022s... | to-read fantasy favorites currently-reading yo... |
2 | 3 | 41865 | 41865 | 3212258 | 226 | 316015849 | 9.780316e+12 | Stephenie Meyer | 2005.0 | Twilight | ... | 3916824 | 95009 | 456191 | 436802 | 793319 | 875073 | 1355439 | https://images.gr-assets.com/books/1361039443m... | https://images.gr-assets.com/books/1361039443s... | to-read fantasy favorites currently-reading yo... |
3 | 4 | 2657 | 2657 | 3275794 | 487 | 61120081 | 9.780061e+12 | Harper Lee | 1960.0 | To Kill a Mockingbird | ... | 3340896 | 72586 | 60427 | 117415 | 446835 | 1001952 | 1714267 | https://images.gr-assets.com/books/1361975680m... | https://images.gr-assets.com/books/1361975680s... | to-read favorites currently-reading young-adul... |
4 | 5 | 4671 | 4671 | 245494 | 1356 | 743273567 | 9.780743e+12 | F. Scott Fitzgerald | 1925.0 | The Great Gatsby | ... | 2773745 | 51992 | 86236 | 197621 | 606158 | 936012 | 947718 | https://images.gr-assets.com/books/1490528560m... | https://images.gr-assets.com/books/1490528560s... | to-read favorites currently-reading young-adul... |
5 rows × 24 columns
- Now a single column holds multiple tag names per book
2.14 Combining authors and tag names
books['corpus'] = (pd.Series(books[['authors', 'tag_name']]
.fillna('')
.values.tolist()
).str.join(' '))
books['corpus'][:3]
0 Suzanne Collins to-read fantasy favorites curr...
1 J.K. Rowling, Mary GrandPré to-read fantasy f...
2 Stephenie Meyer to-read fantasy favorites curr...
Name: corpus, dtype: object
- The corpus column holds the authors and the tags together
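The same concatenation can also be written with pandas string methods, which may read more directly than going through a list of lists. A small sketch on an invented table (toy is an assumption), including a missing author to show the fillna step:

```python
import pandas as pd

# Invented rows; the second one has no author.
toy = pd.DataFrame({
    "authors": ["Harper Lee", None],
    "tag_name": ["classics fiction", "fantasy"],
})

# Replace missing values with empty strings, join with a space, trim the ends.
toy["corpus"] = (
    toy["authors"].fillna("")
    .str.cat(toy["tag_name"].fillna(""), sep=" ")
    .str.strip()
)

print(toy["corpus"].tolist())
```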
2.15 Running TF-IDF
# min_df=1 is the default; recent scikit-learn releases reject the integer 0
tf_corpus = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix_corpus = tf_corpus.fit_transform(books['corpus'])
cosine_sim_corpus = linear_kernel(tfidf_matrix_corpus, tfidf_matrix_corpus)
titles = books['title']
indices = pd.Series(books.index, index=books['title'])
- Run TF-IDF on the combined authors and tag names
2.16 Writing the recommendation function
def corpus_recommendations(title):
    idx = indices[title]  # row position of the requested title
    sim_scores = list(enumerate(cosine_sim_corpus[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]  # skip the first entry, keep the next 10
    book_indices = [i[0] for i in sim_scores]
    return titles.iloc[book_indices]
2.17 Which books are similar?
corpus_recommendations('The Hobbit')
188 The Lord of the Rings (The Lord of the Rings, ...
154 The Two Towers (The Lord of the Rings, #2)
160 The Return of the King (The Lord of the Rings,...
18 The Fellowship of the Ring (The Lord of the Ri...
610 The Silmarillion (Middle-Earth Universe)
4975 Unfinished Tales of Númenor and Middle-Earth
2308 The Children of Húrin
963 J.R.R. Tolkien 4-Book Boxed Set: The Hobbit an...
465 The Hobbit: Graphic Novel
8271 The Complete Guide to Middle-Earth
Name: title, dtype: object
- The books similar to The Hobbit now look reasonable
corpus_recommendations('Twilight (Twilight, #1)')
51 Eclipse (Twilight, #3)
48 New Moon (Twilight, #2)
991 The Twilight Saga (Twilight, #1-4)
833 Midnight Sun (Twilight, #1.5)
731 The Short Second Life of Bree Tanner: An Eclip...
1618 The Twilight Saga Complete Collection (Twilig...
4087 The Twilight Saga: The Official Illustrated Gu...
2020 The Twilight Collection (Twilight, #1-3)
72 The Host (The Host, #1)
219 Twilight: The Complete Illustrated Movie Compa...
Name: title, dtype: object
- Books similar to Twilight
corpus_recommendations('Harry Potter and the Prisoner of Azkaban (Harry Potter, #3)')
1 Harry Potter and the Sorcerer's Stone (Harry P...
26 Harry Potter and the Half-Blood Prince (Harry ...
22 Harry Potter and the Chamber of Secrets (Harry...
24 Harry Potter and the Deathly Hallows (Harry Po...
23 Harry Potter and the Goblet of Fire (Harry Pot...
20 Harry Potter and the Order of the Phoenix (Har...
3752 Harry Potter Collection (Harry Potter, #1-6)
398 The Tales of Beedle the Bard
1285 Quidditch Through the Ages
421 Harry Potter Boxset (Harry Potter, #1-7)
Name: title, dtype: object
- Books similar to Harry Potter
corpus_recommendations('Romeo and Juliet')
352 Othello
769 Julius Caesar
124 Hamlet
153 Macbeth
247 A Midsummer Night's Dream
838 The Merchant of Venice
854 Twelfth Night
529 Much Ado About Nothing
713 King Lear
772 The Taming of the Shrew
Name: title, dtype: object
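- Books similar to Romeo and Juliet
3. Summary
3.1 Summary
- This was a recommender system built on book data with TF-IDF. Using only the authors or only the tags would have recommended just the books that share the same author or the same tags
- Combining both into a single column before running TF-IDF gave somewhat different results, but I am not sure whether this is the right approach or whether a better one exists
- Recommender systems seem hard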
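One possible alternative, sketched under assumptions rather than taken from the original notebook: compute the author and tag similarities separately and blend them with a weight, so the balance between the two signals is explicit instead of being decided by how many tokens each column contributes. The toy rows and the weight w_author = 0.3 are made up.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Invented books: two share an author, two share identical tags.
toy = pd.DataFrame({
    "title": ["Hobbit", "Rings", "Eragon"],
    "authors": ["tolkien", "tolkien", "paolini"],
    "tag_name": ["fantasy dragons", "fantasy epic", "fantasy dragons"],
})

# Separate similarity matrices, one per signal.
author_sim = linear_kernel(TfidfVectorizer().fit_transform(toy["authors"]))
tag_sim = linear_kernel(TfidfVectorizer().fit_transform(toy["tag_name"]))

w_author = 0.3  # hypothetical weight; 1.0 would reduce to author-only similarity
blended = w_author * author_sim + (1 - w_author) * tag_sim

# Similarity of "Hobbit" (row 0) to the other two books.
print(blended[0].round(3))
```

With this weight the book with identical tags outranks the book with the same author; raising w_author shifts the ranking back toward same-author results.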