
PCA and kNN with the Titanic Dataset

1. Titanic Data Preprocessing


1.1 Data load

import pandas as pd

titanic_url = 'https://github.com/hmkim312/datas/blob/main/titanic/titanic.xls?raw=true'
titanic = pd.read_excel(titanic_url)
titanic.head()
|   | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |


1.2 Creating a title column from the name

import re

title = []
for idx, dataset in titanic.iterrows():
    title.append(re.search(r'\,\s\w+(\s\w+)?\.', dataset['name']).group()[2:-1])
    
titanic['title'] = title
titanic.head()
|   | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO | Miss |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON | Master |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON | Miss |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON | Mr |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON | Mrs |
  • Creates a title column holding the honorific (Miss, Master, etc.) extracted from the name column
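The extraction pattern can be checked on its own. A quick sketch on two name formats: the honorific sits between ', ' and the trailing '.', and may span two words, as in 'the Countess'.

```python
import re

# Standalone check of the title-extraction pattern used above (a sketch):
# the honorific sits between ', ' and '.', and may be two words long.
pattern = r'\,\s\w+(\s\w+)?\.'

print(re.search(pattern, 'Allen, Miss. Elisabeth Walton').group()[2:-1])
# Miss
print(re.search(pattern, 'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)').group()[2:-1])
# the Countess
```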


1.3 Separating nobility from commoners

print(set(title))
{'Sir', 'Dr', 'Mme', 'Major', 'Col', 'Mlle', 'Don', 'Jonkheer', 'Rev', 'Mr', 'Master', 'Dona', 'Ms', 'Capt', 'Lady', 'Mrs', 'Miss', 'the Countess'}
  • Apart from Miss, Mr, Ms, etc., several noble titles appear; these are consolidated into single nobility labels


titanic['title'] = titanic['title'].replace('Mlle', 'Miss')
titanic['title'] = titanic['title'].replace('Ms', 'Miss')
titanic['title'] = titanic['title'].replace('Mme', 'Mrs')

Rare_f = ['Dona', 'Dr','Lady','the Countess']
Rare_m = ['Capt', 'Col','Don','Major','Rev','Sir','Jonkheer','Master']

for each in Rare_f:
    titanic['title'] = titanic['title'].replace(each, 'Rare_f')
    
for each in Rare_m:
    titanic['title'] = titanic['title'].replace(each, 'Rare_m')
    
titanic['title'].unique()
array(['Miss', 'Rare_m', 'Mr', 'Mrs', 'Rare_f'], dtype=object)
  • Mlle and Ms are replaced with Miss
  • Mme is replaced with Mrs
  • Dona, Dr, Lady, etc. are grouped under the female nobility label Rare_f
  • Capt, Col, Don, etc. are grouped under the male nobility label Rare_m
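The sequence of replace() calls above can also be written as a single dict-based replace. A sketch of an equivalent mapping on a small sample Series:

```python
import pandas as pd

# One-shot dict mapping, equivalent to the sequential replace() calls above (sketch)
title_map = {'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs',
             'Dona': 'Rare_f', 'Dr': 'Rare_f', 'Lady': 'Rare_f', 'the Countess': 'Rare_f',
             'Capt': 'Rare_m', 'Col': 'Rare_m', 'Don': 'Rare_m', 'Major': 'Rare_m',
             'Rev': 'Rare_m', 'Sir': 'Rare_m', 'Jonkheer': 'Rare_m', 'Master': 'Rare_m'}

titles = pd.Series(['Mlle', 'Dr', 'Master', 'Mr'])
print(titles.replace(title_map).tolist())
# ['Miss', 'Rare_f', 'Rare_m', 'Mr']
```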


1.4 Creating the gender column

from sklearn.preprocessing import LabelEncoder

le_sex = LabelEncoder()
le_sex.fit(titanic['sex'])
titanic['gender'] = le_sex.transform(titanic['sex'])

le_sex.classes_
array(['female', 'male'], dtype=object)
  • The female/male values in the sex column are label-encoded to 0 and 1
  • The model cannot consume the strings female and male directly, so this preprocessing maps them to 0 and 1
  • Note that 0 is not lower or worse than 1; the codes carry no order
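A minimal sketch of how LabelEncoder behaves: codes are assigned in sorted class order ('female' before 'male'), and inverse_transform maps them back.

```python
from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns codes in sorted class order ('female' -> 0, 'male' -> 1)
# and can recover the strings with inverse_transform (a minimal sketch)
le = LabelEncoder()
codes = le.fit_transform(['male', 'female', 'female', 'male'])

print(codes.tolist())                        # [1, 0, 0, 1]
print(le.inverse_transform(codes).tolist())  # ['male', 'female', 'female', 'male']
```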


1.5 Creating the grade column

le_grade = LabelEncoder()
le_grade.fit(titanic['title'])
titanic['grade'] = le_grade.transform(titanic['title'])

le_grade.classes_
array(['Miss', 'Mr', 'Mrs', 'Rare_f', 'Rare_m'], dtype=object)
  • Likewise, the title values (Miss, Mr, Mrs, Rare_f, Rare_m) are label-encoded


1.6 Dropping nulls

titanic = titanic[titanic['age'].notnull()]
titanic = titanic[titanic['fare'].notnull()]
titanic.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1045 entries, 0 to 1308
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1045 non-null   int64  
 1   survived   1045 non-null   int64  
 2   name       1045 non-null   object 
 3   sex        1045 non-null   object 
 4   age        1045 non-null   float64
 5   sibsp      1045 non-null   int64  
 6   parch      1045 non-null   int64  
 7   ticket     1045 non-null   object 
 8   fare       1045 non-null   float64
 9   cabin      272 non-null    object 
 10  embarked   1043 non-null   object 
 11  boat       417 non-null    object 
 12  body       119 non-null    float64
 13  home.dest  685 non-null    object 
 14  title      1045 non-null   object 
 15  gender     1045 non-null   int64  
 16  grade      1045 non-null   int64  
dtypes: float64(3), int64(6), object(8)
memory usage: 147.0+ KB
  • Rows where age or fare is null are removed
  • The remaining columns with nulls are not used in the model
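The two notnull() filters above can also be done in one call with dropna(subset=...). A sketch on a tiny frame:

```python
import numpy as np
import pandas as pd

# dropna(subset=...) performs the two notnull() filters above in one call (sketch)
df = pd.DataFrame({'age':  [29.0, np.nan, 2.0],
                   'fare': [211.3, 151.5, np.nan]})

print(len(df.dropna(subset=['age', 'fare'])))  # 1: only the fully populated row survives
```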


2. PCA


2.1 Data split

from sklearn.model_selection import train_test_split

X = titanic[['pclass', 'age', 'sibsp', 'parch', 'fare', 'gender', 'grade']].astype('float')

y = titanic['survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 13)
  • Only the 'pclass', 'age', 'sibsp', 'parch', 'fare', 'gender', and 'grade' columns are kept as the feature matrix X
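One optional refinement (not used in the split above): with an imbalanced label like survived, passing stratify=y keeps the class ratio identical in both splits. A sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# With an imbalanced label, stratify=y preserves the class ratio in both splits
# (a sketch on toy data; the split above does not stratify)
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=13, stratify=y)
print(y_te.mean())  # 0.2, the same positive rate as the full data
```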


2.2 PCA helper functions

from sklearn.decomposition import PCA

def get_pca_data(ss_data, n_components = 2):
    pca = PCA(n_components = n_components)
    pca.fit(ss_data)
    
    return pca.transform(ss_data), pca
  • A helper that fits a PCA and returns the transformed data along with the fitted model


def get_pd_from_pca(pca_data, col_num):
    cols = ['pca_'+str(n) for n in range(col_num)]
    return pd.DataFrame(pca_data, columns = cols)
  • A helper that wraps the PCA output in a DataFrame


import numpy as np

def print_variance_ratio(pca, only_sum = False):
    if only_sum == False:
        print('variance_ratio :', pca.explained_variance_ratio_)
    print('sum of variance_ratio: ', np.sum(pca.explained_variance_ratio_))
  • A helper that prints the explained variance ratio of a fitted PCA


2.3 Applying PCA (2 components)

pca_data, pca = get_pca_data(X_train, n_components=2)
print_variance_ratio(pca)
variance_ratio : [0.93577394 0.06326916]
sum of variance_ratio:  0.9990431009511274
  • Two components already explain 99% of the variance
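One caveat: this PCA runs on unscaled data, so the large-variance fare column is likely to dominate the first component. A toy illustration (not the Titanic data) of how standardizing first changes the picture:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy illustration (not the Titanic data): a large-scale feature dominates
# unscaled PCA, while standardizing spreads the variance across components
rng = np.random.default_rng(13)
X = np.column_stack([rng.normal(0, 100, 500),  # fare-like scale
                     rng.normal(0, 1, 500)])   # unit-scale feature

print(PCA(n_components=2).fit(X).explained_variance_ratio_)

X_std = StandardScaler().fit_transform(X)
print(PCA(n_components=2).fit(X_std).explained_variance_ratio_)
```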


2.4 Visualization

import matplotlib.pyplot as plt
import seaborn as sns

pca_columns = ['pca_1', 'pca_2']
pca_pd = pd.DataFrame(pca_data, columns=pca_columns)
pca_pd['survived'] = y_train.values

sns.pairplot(pca_pd, hue='survived', height=5,
             x_vars=['pca_1'], y_vars=['pca_2'])

plt.show()

  • Survivors and non-survivors do not separate clearly in this projection


2.5 Applying PCA (3 components)

pca_data, pca = get_pca_data(X_train, n_components=3)
print_variance_ratio(pca)
variance_ratio : [9.35773938e-01 6.32691630e-02 4.00903990e-04]
sum of variance_ratio:  0.9994440049413533


2.6 Building the DataFrame

pca_pd = get_pd_from_pca(pca_data, 3)

pca_pd['survived'] = y_train.values
pca_pd.head()
|   | pca_0 | pca_1 | pca_2 | survived |
|---|---|---|---|---|
| 0 | -28.763184 | 4.479379 | -0.451531 | 0 |
| 1 | 41.587362 | 22.084594 | 0.011834 | 0 |
| 2 | -19.598979 | -10.999936 | 0.558167 | 0 |
| 3 | -28.232483 | -6.559632 | -1.349217 | 1 |
| 4 | -29.055717 | -1.510811 | -0.538886 | 0 |
  • The data is reduced to three components


2.7 Visualization

from mpl_toolkits.mplot3d import Axes3D

markers = ['^', 'o']

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

for i, marker in enumerate(markers):
    x_axis_data = pca_pd[pca_pd['survived'] == i]['pca_0']
    y_axis_data = pca_pd[pca_pd['survived'] == i]['pca_1']
    z_axis_data = pca_pd[pca_pd['survived'] == i]['pca_2']

    ax.scatter(x_axis_data, y_axis_data, z_axis_data,
               s=20, alpha=0.5, marker=marker)
    
ax.view_init(30, 80)
plt.show()

2.8 Building a Pipeline

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

estimators = [('scaler', StandardScaler()),
              ('pca', PCA(n_components=3)),
              ('clf', KNeighborsClassifier(n_neighbors=20))]

pipe = Pipeline(estimators)
pipe.fit(X_train, y_train)

pred = pipe.predict(X_test)
print(accuracy_score(y_test, pred))
0.7703349282296651
  • A pipeline is built from StandardScaler, PCA, and kNN
  • Accuracy comes out to about 0.77
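A hypothetical next step: instead of fixing n_neighbors at 20, search over it with GridSearchCV. This sketch runs on synthetic data so it stays self-contained; on the real split it would take X_train and y_train instead.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical extension: cross-validated search over n_neighbors
# (synthetic data here, so the sketch is runnable on its own)
X, y = make_classification(n_samples=300, n_features=7, random_state=13)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('pca', PCA(n_components=3)),
                 ('clf', KNeighborsClassifier())])

grid = GridSearchCV(pipe, {'clf__n_neighbors': [5, 10, 20, 40]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```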


2.9 Survival probability for DiCaprio and Winslet

# Feature order: [pclass, age, sibsp, parch, fare, gender, grade]
decaprio = np.array([[3, 18, 0, 0, 5, 1, 1]])
print('Decaprio : ', pipe.predict_proba(decaprio)[0, 1])

winslet = np.array([[1, 16, 1, 1, 100, 0, 3]])
print('Winslet : ', pipe.predict_proba(winslet)[0, 1])
Decaprio :  0.05
Winslet :  0.85
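For kNN, predict_proba is simply the fraction of the k nearest neighbors in each class, so the 0.85 above means 17 of the 20 neighbors survived. A minimal sketch with k=5:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# For kNN, predict_proba is the fraction of the k nearest neighbors in each
# class; with k=5 the probabilities are multiples of 0.2 (a minimal sketch)
X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1]])
y = np.array([0, 0, 0, 1, 1])

clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(clf.predict_proba([[0.15]])[0, 1])  # 0.4 (2 of the 5 neighbors are class 1)
```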
This post is licensed under CC BY 4.0 by the author.