Recommendation System with the Good Books Data

1. Good Books

1.1 Good Books data

https://www.kaggle.com/zygmunt/goodbooks-10k

The dataset covers 10k (10,000) books and comes as five files: ratings, books, tags, book_tags, and to_read

2. Recommendation system walkthrough

2.1 Data load

2.1.1 Books data

import numpy as np
import pandas as pd

books = pd.read_csv('https://media.githubusercontent.com/media/hmkim312/datas/main/goodbooks-10k/books.csv', encoding='ISO-8859-1')
books.head()

	id	book_id	best_book_id	work_id	books_count	isbn	isbn13	authors	original_publication_year	original_title	...	ratings_count	work_ratings_count	work_text_reviews_count	ratings_1	ratings_2	ratings_3	ratings_4	ratings_5	image_url	small_image_url
0	1	2767052	2767052	2792775	272	439023483	9.780439e+12	Suzanne Collins	2008.0	The Hunger Games	...	4780653	4942365	155254	66715	127936	560092	1481305	2706317	https://images.gr-assets.com/books/1447303603m...	https://images.gr-assets.com/books/1447303603s...
1	2	3	3	4640799	491	439554934	9.780440e+12	J.K. Rowling, Mary GrandPrÃ©	1997.0	Harry Potter and the Philosopher's Stone	...	4602479	4800065	75867	75504	101676	455024	1156318	3011543	https://images.gr-assets.com/books/1474154022m...	https://images.gr-assets.com/books/1474154022s...
2	3	41865	41865	3212258	226	316015849	9.780316e+12	Stephenie Meyer	2005.0	Twilight	...	3866839	3916824	95009	456191	436802	793319	875073	1355439	https://images.gr-assets.com/books/1361039443m...	https://images.gr-assets.com/books/1361039443s...
3	4	2657	2657	3275794	487	61120081	9.780061e+12	Harper Lee	1960.0	To Kill a Mockingbird	...	3198671	3340896	72586	60427	117415	446835	1001952	1714267	https://images.gr-assets.com/books/1361975680m...	https://images.gr-assets.com/books/1361975680s...
4	5	4671	4671	245494	1356	743273567	9.780743e+12	F. Scott Fitzgerald	1925.0	The Great Gatsby	...	2683664	2773745	51992	86236	197621	606158	936012	947718	https://images.gr-assets.com/books/1490528560m...	https://images.gr-assets.com/books/1490528560s...

5 rows × 23 columns

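A csv file with information about each book
The files are read here with encoding ISO-8859-1; note that garbled characters such as GrandPrÃ© in the output suggest the text is actually UTF-8
ratings_1 through ratings_5 hold the number of 1-star to 5-star ratings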
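
The author names in the output above contain sequences such as Ã©, which is the typical pattern of UTF-8 text decoded as ISO-8859-1. Below is a minimal sketch of how such a string could be repaired, assuming that is what happened. The helper fix_mojibake is hypothetical and not part of the original post.

```python
# Hypothetical helper (not from the original post): undo UTF-8 text that
# was decoded as ISO-8859-1 (Latin-1) by reversing the wrong decoding step.
def fix_mojibake(text: str) -> str:
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not recoverable this way; return the value unchanged.
        return text


garbled = "J.K. Rowling, Mary GrandPrÃ©"
print(fix_mojibake(garbled))
print(fix_mojibake("Suzanne Collins"))  # plain ASCII passes through unchanged
```

If the files are indeed UTF-8, reading them with encoding='utf-8' in pd.read_csv would avoid the repair step altogether.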

2.1.2 Ratings Data

ratings = pd.read_csv('https://media.githubusercontent.com/media/hmkim312/datas/main/goodbooks-10k/ratings.csv', encoding='ISO-8859-1')
ratings.head()

	book_id	user_id	rating
0	1	314	5
1	1	439	3
2	1	588	5
3	1	1169	4
4	1	1185	4

The ratings data contains book_id, user_id, and the rating that the user gave

2.1.3 Book tags Data load

book_tags = pd.read_csv('https://media.githubusercontent.com/media/hmkim312/datas/main/goodbooks-10k/book_tags.csv', encoding='ISO-8859-1')
book_tags.head()

	goodreads_book_id	tag_id	count
0	1	30574	167697
1	1	11305	37174
2	1	11557	34173
3	1	8717	12986
4	1	33114	12716

Contains the book id (goodreads_book_id), the tag id, and a count for each book-tag pair

2.1.4 Tags Data load

tags = pd.read_csv('https://media.githubusercontent.com/media/hmkim312/datas/main/goodbooks-10k/tags.csv')
tags.tail()

	tag_id	tag_name
34247	34247	Ｃhildrens
34248	34248	Ｆａｖｏｒｉｔｅｓ
34249	34249	Ｍａｎｇａ
34250	34250	ＳＥＲＩＥＳ
34251	34251	ｆａｖｏｕｒｉｔｅｓ

Contains each tag id and the tag name it maps to

2.1.5 To-read Data load

to_read = pd.read_csv('https://media.githubusercontent.com/media/hmkim312/datas/main/goodbooks-10k/to_read.csv')
to_read.head()

	user_id	book_id
0	1	112
1	1	235
2	1	533
3	1	1198
4	1	1874

Lists, for each user, the ids of the books that user has marked as to-read

2.2 Tag data preprocessing

tags_join_Df = pd.merge(book_tags, tags, left_on='tag_id', right_on='tag_id', how = 'inner')
tags_join_Df.head()

	goodreads_book_id	tag_id	count	tag_name
0	1	30574	167697	to-read
1	2	30574	24549	to-read
2	3	30574	496107	to-read
3	5	30574	11909	to-read
4	6	30574	298	to-read

Merge tag_id and tag_name into the data frame that holds the book ids (book_tags)
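
The inner join used here keeps only the tag_ids that appear in both frames. A small sketch on made-up data (the toy frames and their values are assumptions for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy stand-ins for book_tags and tags; the values are invented.
book_tags_toy = pd.DataFrame({
    "goodreads_book_id": [1, 1, 2],
    "tag_id": [10, 20, 10],
    "count": [5, 3, 7],
})
tags_toy = pd.DataFrame({
    "tag_id": [10, 20, 30],
    "tag_name": ["to-read", "fantasy", "unused-tag"],
})

# Inner join on the shared key: tag_id 30 has no match and is dropped.
joined = pd.merge(book_tags_toy, tags_toy, on="tag_id", how="inner")
print(joined)
```

Because both frames use the same key name, on='tag_id' is equivalent to passing left_on and right_on separately.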

2.3 Tfidf on authors

books['authors'][:3]

               Suzanne Collins
  J.K. Rowling, Mary GrandPrÃ©
               Stephenie Meyer
Name: authors, dtype: object

The books data has an authors column

from sklearn.feature_extraction.text import TfidfVectorizer

# min_df=1 (the default) behaves like 0 here; newer scikit-learn versions may reject an integer 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(books['authors'])
tfidf_matrix

<10000x14742 sparse matrix of type '<class 'numpy.float64'>'
	with 43235 stored elements in Compressed Sparse Row format>

Tfidf is computed on the author names in books
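
To see what the vectorizer turns author names into, here is a small sketch on made-up author strings (the list authors_toy is an assumption for illustration). With ngram_range=(1, 2) the vocabulary contains single words and adjacent word pairs; the default token pattern only keeps tokens of two or more characters, so initials such as "J" and "R" are dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented examples in the same style as the authors column.
authors_toy = [
    "J.R.R. Tolkien",
    "J.R.R. Tolkien, Christopher Tolkien",
    "Suzanne Collins",
]

vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), stop_words="english")
matrix = vec.fit_transform(authors_toy)

print(matrix.shape)             # (number of documents, vocabulary size)
print(sorted(vec.vocabulary_))  # unigrams and bigrams that were learned
```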

2.4 Measuring similarity

from sklearn.metrics.pairwise import linear_kernel

cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

scikit-learn's linear_kernel turns the Tfidf matrix built from author names into a similarity matrix
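
TfidfVectorizer L2-normalizes each row by default (norm='l2'), so the plain dot product computed by linear_kernel coincides with cosine similarity. A small check on made-up author strings:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Invented strings; documents 0 and 2 are identical on purpose.
docs = ["Harper Lee", "Stephenie Meyer", "Harper Lee"]
m = TfidfVectorizer().fit_transform(docs)

dot = linear_kernel(m, m)      # dot products of the normalized rows
cos = cosine_similarity(m, m)  # explicit cosine similarity

print(np.allclose(dot, cos))
print(np.round(dot, 2))
```

Rows 0 and 2 hold the same string and therefore score 1 against each other, while rows with no shared terms score 0.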

2.5 Which books are similar to The Hobbit?

title = books['title']
indices = pd.Series(books.index, index=books['title'])
indices['The Hobbit']

6

The index of The Hobbit is 6
Let's pull row 6 and use it to find similar books

cosine_sim[indices['The Hobbit']]

array([0., 0., 0., ..., 0., 0., 0.])

This fetches the row of the similarity matrix at The Hobbit's index

cosine_sim[indices['The Hobbit']].shape

(10000,)

The row has 10,000 entries, one per book

list(enumerate(cosine_sim[indices['The Hobbit']]))[:3]

[(0, 0.0), (1, 0.0), (2, 0.0)]

Take only The Hobbit's row from the similarity matrix, use enumerate to pair each column (the index of another book) with its cosine similarity score as a tuple, and collect the tuples in a list

2.6 Index of the most similar books

sim_scores = list(enumerate(cosine_sim[indices['The Hobbit']]))
sim_scores = sorted(sim_scores, key = lambda x : x[1], reverse= True)
sim_scores[:3]

[(6, 1.0), (18, 1.0), (154, 1.0)]

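The indices (here, columns) of the books most similar to The Hobbit are sorted by cosine score and printed
Some entries have a perfect score of 1: index 18 and index 154
Note that the first entry, (6, 1.0), is The Hobbit itself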

print('Title at index 6:', books['title'][6])
print('Title at index 18:', books['title'][18])
print('Title at index 154:', books['title'][154])

Title at index 6: The Hobbit
Title at index 18: The Fellowship of the Ring (The Lord of the Rings, #1)
Title at index 154: The Two Towers (The Lord of the Rings, #2)

The books most similar to The Hobbit turn out to be The Lord of the Rings series

2.7 Finding similar books by author

sim_scores = sim_scores[1:11]
book_indices = [i[0] for i in sim_scores]
title.iloc[book_indices]

    The Fellowship of the Ring (The Lord of the Ri...
          The Two Towers (The Lord of the Rings, #2)
   The Return of the King (The Lord of the Rings,...
   The Lord of the Rings (The Lord of the Rings, ...
   J.R.R. Tolkien 4-Book Boxed Set: The Hobbit an...
      Unfinished Tales of NÃºmenor and Middle-Earth
                             The Children of HÃºrin
            The Silmarillion (Middle-Earth Universe)
                 The Complete Guide to Middle-Earth
   The History of the Hobbit, Part One: Mr. Baggins
Name: title, dtype: object

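The remaining results are also mostly Hobbit-related books, most likely because they share the same author.
This follows from the setup: Tfidf was run on author names only, so every book with an identical author string gets the same score (1).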

2.8 Adding tags

books_with_tags = pd.merge(books, tags_join_Df, left_on= 'book_id', right_on='goodreads_book_id', how = 'inner')
books_with_tags.head()

	id	book_id	best_book_id	work_id	books_count	isbn	isbn13	authors	original_publication_year	original_title	...	ratings_2	ratings_3	ratings_4	ratings_5	image_url	small_image_url	goodreads_book_id	tag_id	count	tag_name
0	1	2767052	2767052	2792775	272	439023483	9.780439e+12	Suzanne Collins	2008.0	The Hunger Games	...	127936	560092	1481305	2706317	https://images.gr-assets.com/books/1447303603m...	https://images.gr-assets.com/books/1447303603s...	2767052	30574	11314	to-read
1	1	2767052	2767052	2792775	272	439023483	9.780439e+12	Suzanne Collins	2008.0	The Hunger Games	...	127936	560092	1481305	2706317	https://images.gr-assets.com/books/1447303603m...	https://images.gr-assets.com/books/1447303603s...	2767052	11305	10836	fantasy
2	1	2767052	2767052	2792775	272	439023483	9.780439e+12	Suzanne Collins	2008.0	The Hunger Games	...	127936	560092	1481305	2706317	https://images.gr-assets.com/books/1447303603m...	https://images.gr-assets.com/books/1447303603s...	2767052	11557	50755	favorites
3	1	2767052	2767052	2792775	272	439023483	9.780439e+12	Suzanne Collins	2008.0	The Hunger Games	...	127936	560092	1481305	2706317	https://images.gr-assets.com/books/1447303603m...	https://images.gr-assets.com/books/1447303603s...	2767052	8717	35418	currently-reading
4	1	2767052	2767052	2792775	272	439023483	9.780439e+12	Suzanne Collins	2008.0	The Hunger Games	...	127936	560092	1481305	2706317	https://images.gr-assets.com/books/1447303603m...	https://images.gr-assets.com/books/1447303603s...	2767052	33114	25968	young-adult

5 rows × 27 columns

The tag_id and tag_name frame built earlier is merged into the books data frame

2.9 Tfidf on tags

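# min_df=1 (the default) behaves like 0 here; newer scikit-learn versions may reject an integer 0
tf_tag = TfidfVectorizer(analyzer='word', ngram_range=(1,2), min_df=1, stop_words='english')
# Caution: books_with_tags has one row per (book, tag) pair, so head(10000)
# keeps the first 10,000 tag rows, not one row per book
tfidf_matrix_tag = tf_tag.fit_transform(books_with_tags['tag_name'].head(10000))
cosine_sim_tag = linear_kernel(tfidf_matrix_tag, tfidf_matrix_tag)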

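Tfidf was run on author names before; this time it is run on tags
Note that books_with_tags holds one row per (book, tag) pair, so head(10000) keeps the first 10,000 tag rows rather than one row per book, and row i of the resulting matrix does not correspond to book i in books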

2.10 A function that returns recommended books

title_tag = books['title']
indices_tag = pd.Series(books.index, index=books['title'])


def tags_recommendations(title):
    idx = indices_tag[title]
    sim_scores = list(enumerate(cosine_sim_tag[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]
    book_indices = [i[0] for i in sim_scores]
    return title_tag.iloc[book_indices]

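This time we write a function that takes a book title and returns recommended books
sim_scores = sim_scores[1:11] selects 10 entries; it starts at 1 because position 0 is expected to be the input title itself (this holds only when the input book sorts first among tied scores)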
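
Skipping position 0 removes the input book only if it happens to sort first. When another book with a lower index has the same top score, the slice would drop that book and keep the input instead. A minimal sketch of excluding the input by index rather than by position (top_k_excluding_self is a hypothetical helper, not part of the original post):

```python
import numpy as np


def top_k_excluding_self(sim_row, idx, k=10):
    """Return the indices of the k highest scores in sim_row, skipping idx."""
    sim_row = np.asarray(sim_row, dtype=float)
    # A stable sort on the negated scores keeps ties in ascending index order.
    order = np.argsort(-sim_row, kind="stable")
    order = order[order != idx]  # drop the query book by index, not position
    return order[:k].tolist()


# Toy row: book 2 is the query, and book 0 ties with it at a score of 1.0.
row = [1.0, 0.2, 1.0, 0.0, 0.7]
print(top_k_excluding_self(row, idx=2, k=3))  # [0, 4, 1]
```

On this toy row, sorting and slicing from position 1 would have returned the query book (index 2) and lost book 0.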

2.11 Books similar to The Hobbit, found by tag

tags_recommendations('The Hobbit').head(20)

           Catching Fire (The Hunger Games, #2)
                                Of Mice and Men
  Confessions of a Shopaholic (Shopaholic, #1)
                     Dune (Dune Chronicles #1)
                                  The Red Tent
        One for the Money (Stephanie Plum, #1)
                              Ready Player One
           The Gunslinger (The Dark Tower, #1)
        Shiver (The Wolves of Mercy Falls, #1)
                       Inkheart (Inkworld, #1)
Name: title, dtype: object

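Some fantasy-like titles such as the Hunger Games sequel and Dune appear, but so do unrelated ones (Of Mice and Men, Confessions of a Shopaholic). Because cosine_sim_tag was built from individual (book, tag) rows rather than one row per book, its row indices do not line up with books, so this list should not be read as genuine tag similarity.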

2.12 Attaching all tag names to each book id

temp_df = books_with_tags.groupby('book_id')['tag_name'].apply(' '.join).reset_index()
temp_df.head()

	book_id	tag_name
0	1	to-read fantasy favorites currently-reading yo...
1	2	to-read fantasy favorites currently-reading yo...
2	3	to-read fantasy favorites currently-reading yo...
3	5	to-read fantasy favorites currently-reading yo...
4	6	to-read fantasy young-adult fiction harry-pott...

All tag_name values that belong to a book id are gathered into a single string
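
The effect of the groupby can be seen on a tiny made-up pair table (the frame pairs and its values are assumptions for illustration). The pair table has one row per (book, tag), while the grouped result has exactly one row per book, which is the shape a per-book Tfidf matrix needs.

```python
import pandas as pd

# Invented (book, tag) pairs: two tags for book 1, three for book 2.
pairs = pd.DataFrame({
    "book_id": [1, 1, 2, 2, 2],
    "tag_name": ["fantasy", "classics", "fantasy", "young-adult", "fiction"],
})

# Join every tag of a book into one space-separated string.
per_book = pairs.groupby("book_id")["tag_name"].apply(" ".join).reset_index()

print(len(pairs), "pair rows ->", len(per_book), "book rows")
print(per_book)
```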

2.13 Merge into books

books = pd.merge(books, temp_df, on = 'book_id', how = 'inner')
books.head()

	id	book_id	best_book_id	work_id	books_count	isbn	isbn13	authors	original_publication_year	original_title	...	work_ratings_count	work_text_reviews_count	ratings_1	ratings_2	ratings_3	ratings_4	ratings_5	image_url	small_image_url	tag_name
0	1	2767052	2767052	2792775	272	439023483	9.780439e+12	Suzanne Collins	2008.0	The Hunger Games	...	4942365	155254	66715	127936	560092	1481305	2706317	https://images.gr-assets.com/books/1447303603m...	https://images.gr-assets.com/books/1447303603s...	to-read fantasy favorites currently-reading yo...
1	2	3	3	4640799	491	439554934	9.780440e+12	J.K. Rowling, Mary GrandPrÃ©	1997.0	Harry Potter and the Philosopher's Stone	...	4800065	75867	75504	101676	455024	1156318	3011543	https://images.gr-assets.com/books/1474154022m...	https://images.gr-assets.com/books/1474154022s...	to-read fantasy favorites currently-reading yo...
2	3	41865	41865	3212258	226	316015849	9.780316e+12	Stephenie Meyer	2005.0	Twilight	...	3916824	95009	456191	436802	793319	875073	1355439	https://images.gr-assets.com/books/1361039443m...	https://images.gr-assets.com/books/1361039443s...	to-read fantasy favorites currently-reading yo...
3	4	2657	2657	3275794	487	61120081	9.780061e+12	Harper Lee	1960.0	To Kill a Mockingbird	...	3340896	72586	60427	117415	446835	1001952	1714267	https://images.gr-assets.com/books/1361975680m...	https://images.gr-assets.com/books/1361975680s...	to-read favorites currently-reading young-adul...
4	5	4671	4671	245494	1356	743273567	9.780743e+12	F. Scott Fitzgerald	1925.0	The Great Gatsby	...	2773745	51992	86236	197621	606158	936012	947718	https://images.gr-assets.com/books/1490528560m...	https://images.gr-assets.com/books/1490528560s...	to-read favorites currently-reading young-adul...

5 rows × 24 columns

Now a single tag_name column holds all of a book's tag names

2.14 Combining authors and tag names

books['corpus'] = (pd.Series(books[['authors', 'tag_name']]
                            .fillna('')
                            .values.tolist()
                           ).str.join(' '))
books['corpus'][:3]

  Suzanne Collins to-read fantasy favorites curr...
  J.K. Rowling, Mary GrandPrÃ© to-read fantasy f...
  Stephenie Meyer to-read fantasy favorites curr...
Name: corpus, dtype: object

The corpus column now holds the authors and the tags together
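
The same column can also be built with plain string concatenation, which some readers may find easier to follow. A sketch on a made-up frame (toy and its values are assumptions for illustration):

```python
import pandas as pd

# Invented rows; the second one has a missing author on purpose.
toy = pd.DataFrame({
    "authors": ["Harper Lee", None],
    "tag_name": ["classics fiction", "fantasy"],
})

# Replace missing values with empty strings, then join with a space.
toy["corpus"] = toy["authors"].fillna("") + " " + toy["tag_name"].fillna("")
print(toy["corpus"].tolist())
```

A missing author leaves a leading space in the result, which the Tfidf tokenizer ignores.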

2.15 Running Tfidf

# min_df=1 (the default) behaves like 0 here; newer scikit-learn versions may reject an integer 0
tf_corpus = TfidfVectorizer(analyzer='word', ngram_range=(1,2), min_df=1, stop_words='english')
tfidf_matrix_corpus = tf_corpus.fit_transform(books['corpus'])
cosine_sim_corpus = linear_kernel(tfidf_matrix_corpus, tfidf_matrix_corpus)
titles = books['title']
indices = pd.Series(books.index, index=books['title'])

Tfidf is run on the combined authors and tag names

2.16 Writing the recommendation function

def corpus_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim_corpus[idx]))
    sim_scores = sorted(sim_scores, key = lambda x : x[1], reverse=True)
    sim_scores = sim_scores[1:11]
    book_indices = [i[0] for i in sim_scores]
    return titles.iloc[book_indices]
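
One thing to watch in this lookup: when a title appears more than once, indexing a Series by that title returns several positions instead of a single integer. The sketch below shows one way to handle that on made-up data; first_index is a hypothetical helper, and whether the dataset actually contains duplicated titles is not checked here.

```python
import pandas as pd

# Invented titles with one duplicate.
titles_toy = pd.Series(["The Hobbit", "Dune", "The Hobbit"])
indices_toy = pd.Series(titles_toy.index, index=titles_toy)


def first_index(title):
    """Return one integer row position for title, even if it is duplicated."""
    hit = indices_toy[title]
    # A duplicated title yields a Series of positions; take the first one.
    return int(hit.iloc[0]) if isinstance(hit, pd.Series) else int(hit)


print(first_index("Dune"), first_index("The Hobbit"))
```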

2.17 Which books are similar?

corpus_recommendations('The Hobbit')

   The Lord of the Rings (The Lord of the Rings, ...
          The Two Towers (The Lord of the Rings, #2)
   The Return of the King (The Lord of the Rings,...
    The Fellowship of the Ring (The Lord of the Ri...
            The Silmarillion (Middle-Earth Universe)
      Unfinished Tales of NÃºmenor and Middle-Earth
                             The Children of HÃºrin
   J.R.R. Tolkien 4-Book Boxed Set: The Hobbit an...
                           The Hobbit: Graphic Novel
                 The Complete Guide to Middle-Earth
Name: title, dtype: object

The books similar to The Hobbit now look reasonable.

corpus_recommendations('Twilight (Twilight, #1)')

                               Eclipse (Twilight, #3)
                              New Moon (Twilight, #2)
                  The Twilight Saga (Twilight, #1-4)
                       Midnight Sun (Twilight, #1.5)
   The Short Second Life of Bree Tanner: An Eclip...
  The Twilight Saga Complete Collection  (Twilig...
  The Twilight Saga: The Official Illustrated Gu...
           The Twilight Collection (Twilight, #1-3)
                              The Host (The Host, #1)
   Twilight: The Complete Illustrated Movie Compa...
Name: title, dtype: object

Books similar to Twilight

corpus_recommendations('Harry Potter and the Prisoner of Azkaban (Harry Potter, #3)')

     Harry Potter and the Sorcerer's Stone (Harry P...
    Harry Potter and the Half-Blood Prince (Harry ...
    Harry Potter and the Chamber of Secrets (Harry...
    Harry Potter and the Deathly Hallows (Harry Po...
    Harry Potter and the Goblet of Fire (Harry Pot...
    Harry Potter and the Order of the Phoenix (Har...
       Harry Potter Collection (Harry Potter, #1-6)
                        The Tales of Beedle the Bard
                         Quidditch Through the Ages
            Harry Potter Boxset (Harry Potter, #1-7)
Name: title, dtype: object

Books similar to Harry Potter

corpus_recommendations('Romeo and Juliet')

                    Othello
              Julius Caesar
                     Hamlet
                    Macbeth
  A Midsummer Night's Dream
     The Merchant of Venice
              Twelfth Night
     Much Ado About Nothing
                  King Lear
    The Taming of the Shrew
Name: title, dtype: object

Books similar to Romeo and Juliet

3. Summary

3.1 Summary

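This was a recommendation system built on book data with Tfidf. Using only authors or only tags would have recommended only books with the same author or the same tags.
Combining both into one column before running Tfidf gave somewhat different results, but it is not clear to me whether this is the right approach or whether better methods exist.
Recommendation systems seem hard.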
