1. Titanic EDA
1.1 Loading the data
import pandas as pd
titanic = pd.read_excel('https://github.com/hmkim312/datas/blob/main/titanic/titanic.xls?raw=true')
titanic.head()
 | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
- The Titanic was the largest passenger liner of its day, sailing from Britain to New York, USA, on its 1912 maiden voyage.
- The dataset file is hosted at the GitHub link used above.
- pclass: cabin class
- survived: survival (0 = died, 1 = survived)
- sex: sex
- name: passenger name
- age: age
- sibsp: number of siblings or spouses aboard
- parch: number of parents or children aboard
- fare: fare paid
- boat: lifeboat number used during the escape
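Before plotting, it helps to see how complete each column is; a minimal check using only what has been loaded above:

```python
# Quick structure check: row/column counts and the columns with the most missing values
print(titanic.shape)
print(titanic.isnull().sum().sort_values(ascending=False).head(10))
```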
1.2 Survival overview
import matplotlib.pyplot as plt
import seaborn as sns
f, ax = plt.subplots(1, 2, figsize=(18, 8))
titanic['survived'].value_counts().plot.pie(
explode=[0, 0.1], autopct='%1.1f%%', ax=ax[0], shadow=True)
ax[0].set_title('Pie plot - Survived')
ax[0].set_ylabel('')
sns.countplot(x='survived', data=titanic, ax=ax[1])
ax[1].set_title('Count plot - Survived')
plt.show()
- The overall survival rate is 38.2%.
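The same figure can be read off numerically; a one-line sketch:

```python
# The 38.2% survival rate, computed directly from the survived column
print(titanic['survived'].value_counts(normalize=True))
```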
1.3 Survival by sex
f, ax = plt.subplots(1, 2, figsize=(18, 8))
sns.countplot(x='sex', data=titanic, ax=ax[0])
ax[0].set_title('Count of Passengers by Sex')
ax[0].set_ylabel('')
sns.countplot(x='sex', hue='survived', data=titanic, ax=ax[1])
ax[1].set_title('Sex: Survived vs. Not Survived')
plt.show()
- More men than women were aboard, but men's survival probability is clearly lower.
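To put a number on that gap, a short sketch of the survival rate per sex:

```python
# Mean of the 0/1 survived flag for each sex
print(titanic.groupby('sex')['survived'].mean())
```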
1.4 Survival rate by economic status (class)
pd.crosstab(titanic['pclass'], titanic['survived'], margins=True)
pclass \ survived | 0 | 1 | All |
---|---|---|---|
1 | 123 | 200 | 323 |
2 | 158 | 119 | 277 |
3 | 528 | 181 | 709 |
All | 809 | 500 | 1309 |
- First-class passengers had a much higher chance of survival.
- Women's survival rate is also high.
- So, were there many women in first class? (checked with the crosstab sketch below)
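One way to check that question directly, sketched with a second crosstab:

```python
# Passenger counts by class and sex, to see whether first class skewed female
print(pd.crosstab(titanic['pclass'], titanic['sex'], margins=True))
```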
1.5 Sex distribution by cabin class
grid = sns.FacetGrid(titanic, row = 'pclass', col = 'sex', height = 4, aspect=2)
grid.map(plt.hist, 'age', alpha =0.8, bins = 20)
grid.add_legend()
plt.show()
- Third class was predominantly male, especially men in their twenties.
1.6 Passengers by age
import plotly.express as px
fig = px.histogram(titanic, x = 'age')
fig.show()
- Children and passengers in their twenties and thirties make up a large share.
1.7 Survival rate by cabin class and age
grid = sns.FacetGrid(titanic, col = 'survived', row = 'pclass', height = 4, aspect = 2)
grid.map(plt.hist, 'age', alpha = .5, bins = 20)
grid.add_legend()
plt.show()
- The higher the cabin class, the higher the survival rate.
1.8 Binning age into five categories
titanic['age_cat'] = pd.cut(titanic['age'], bins=[0, 7, 15, 30, 60, 100],
include_lowest=True, labels=['baby', 'teen', 'young', 'adult', 'old'])
titanic.head()
 | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | age_cat |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO | young |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON | baby |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON | baby |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON | young |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON | young |
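A quick count of each new age band, as a small sanity-check sketch:

```python
# Number of passengers in each age category created by pd.cut
print(titanic['age_cat'].value_counts())
```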
1.9 Survival by age group, sex, and class at a glance
plt.figure(figsize=(12, 4))
plt.subplot(131)
sns.barplot(x='pclass', y='survived', data=titanic)
plt.subplot(132)
sns.barplot(x='age_cat', y='survived', data=titanic)
plt.subplot(133)
sns.barplot(x='sex', y='survived', data=titanic)
plt.subplots_adjust(top=1, bottom=0.1, left=0.1,
right=1, hspace=0.5, wspace=0.5)
plt.show()
- Being young, female, and in first class looks like it gave a clear survival advantage.
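The same comparison in table form, as a short sketch:

```python
# Mean survival rate for every (sex, pclass) combination
print(titanic.groupby(['sex', 'pclass'])['survived'].mean().unstack())
```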
1.10 Survival by age for men and women
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14, 6))
women = titanic[titanic['sex'] == 'female']
men = titanic[titanic['sex'] == 'male']
# Age histograms of survivors vs. non-survivors, one panel per sex
# (sns.histplot is the current equivalent of the deprecated sns.distplot with kde=False)
ax = sns.histplot(women[women['survived'] == 1]['age'], bins=20,
                  label='survived', ax=axes[0])
ax = sns.histplot(women[women['survived'] == 0]['age'], bins=40,
                  label='not survived', ax=axes[0])
ax.legend()
ax.set_title('Female')
ax = sns.histplot(men[men['survived'] == 1]['age'], bins=18,
                  label='survived', ax=axes[1])
ax = sns.histplot(men[men['survived'] == 0]['age'], bins=40,
                  label='not survived', ax=axes[1])
ax.legend()
ax.set_title('Male')
Text(0.5, 1.0, 'Male')
1.11 Checking social status from passenger names
for idx, dataset in titanic.iterrows():
    print(dataset['name'])
Allen, Miss. Elisabeth Walton
Allison, Master. Hudson Trevor
Allison, Miss. Helen Loraine
Allison, Mr. Hudson Joshua Creighton
Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
...
1.13 Extracting a title column for social status
import re
title = []
for idx, dataset in titanic.iterrows():
    tmp = dataset['name']
    title.append(re.search(r',\s\w+(\s\w+)?\.', tmp).group()[2:-1])
titanic['title'] = title
titanic.head()
 | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | age_cat | title |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO | young | Miss |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON | baby | Master |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON | baby | Miss |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON | young | Mr |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON | young | Mrs |
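For reference, the same title could also be extracted without an explicit loop; a hedged sketch using pandas string methods with an equivalent regex:

```python
# Vectorized alternative to the iterrows() loop: capture the word(s) between ", " and "."
titles = titanic['name'].str.extract(r',\s*([\w\s]+?)\.', expand=False).str.strip()
print(titles.value_counts().head())
```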
1.14 Titles by sex
pd.crosstab(titanic['title'], titanic['sex'])
title \ sex | female | male |
---|---|---|
Capt | 0 | 1 |
Col | 0 | 4 |
Don | 0 | 1 |
Dona | 1 | 0 |
Dr | 1 | 7 |
Jonkheer | 0 | 1 |
Lady | 1 | 0 |
Major | 0 | 2 |
Master | 0 | 61 |
Miss | 260 | 0 |
Mlle | 2 | 0 |
Mme | 1 | 0 |
Mr | 0 | 757 |
Mrs | 197 | 0 |
Ms | 2 | 0 |
Rev | 0 | 8 |
Sir | 0 | 1 |
the Countess | 1 | 0 |
1.15 Preprocessing the title column
titanic['title'] = titanic['title'].replace('Mlle', 'Miss')
titanic['title'] = titanic['title'].replace('Ms', 'Miss')
titanic['title'] = titanic['title'].replace('Mme', 'Mrs')
Rare_f = ['Dona', 'Dr', 'Lady', 'the Countess']
Rare_m = ['Capt', 'Col', 'Don', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Master']
for each in Rare_f:
    titanic['title'] = titanic['title'].replace(each, 'Rare_f')
for each in Rare_m:
    titanic['title'] = titanic['title'].replace(each, 'Rare_m')
titanic['title'].unique()
array(['Miss', 'Rare_m', 'Mr', 'Mrs', 'Rare_f'], dtype=object)
- Noble and other uncommon titles are consolidated into Miss, Mrs, Mr, Rare_m, and Rare_f (an equivalent mapping-based version is sketched below).
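The mapping-based version mentioned above, sketched as a single replace call that produces the same groups:

```python
# Equivalent dictionary-based cleanup of the title column
title_map = {'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs',
             'Dona': 'Rare_f', 'Dr': 'Rare_f', 'Lady': 'Rare_f', 'the Countess': 'Rare_f',
             'Capt': 'Rare_m', 'Col': 'Rare_m', 'Don': 'Rare_m', 'Major': 'Rare_m',
             'Rev': 'Rare_m', 'Sir': 'Rare_m', 'Jonkheer': 'Rare_m', 'Master': 'Rare_m'}
titanic['title'] = titanic['title'].replace(title_map)
```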
1.16 Nobles survived more often than expected
titanic[['title','survived']].groupby(['title'], as_index=False).mean()
 | title | survived |
---|---|---|
0 | Miss | 0.678030 |
1 | Mr | 0.162483 |
2 | Mrs | 0.787879 |
3 | Rare_f | 0.636364 |
4 | Rare_m | 0.443038 |
- The survival rates for the rare (noble) titles are also on the high side.
2. Predicting survivors with machine learning
2.1 Checking the structure
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pclass 1309 non-null int64
1 survived 1309 non-null int64
2 name 1309 non-null object
3 sex 1309 non-null object
4 age 1046 non-null float64
5 sibsp 1309 non-null int64
6 parch 1309 non-null int64
7 ticket 1309 non-null object
8 fare 1308 non-null float64
9 cabin 295 non-null object
10 embarked 1307 non-null object
11 boat 486 non-null object
12 body 121 non-null float64
13 home.dest 745 non-null object
14 age_cat 1046 non-null category
15 title 1309 non-null object
dtypes: category(1), float64(3), int64(4), object(8)
memory usage: 155.0+ KB
- Null handling and label encoding look necessary.
2.2 Converting the sex column to numbers
titanic['sex'].unique()
array(['female', 'male'], dtype=object)
2.3 Using LabelEncoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(titanic['sex'])
titanic['gender'] = le.transform(titanic['sex'])
titanic.head()
 | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | age_cat | title | gender |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO | young | Miss | 0 |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON | baby | Rare_m | 1 |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON | baby | Miss | 0 |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON | young | Mr | 1 |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON | young | Mrs | 0 |
- A new gender column holds the numerically encoded sex values.
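To confirm which number was assigned to which label, a quick sketch reading the encoder's learned classes:

```python
# The mapping LabelEncoder learned: female -> 0, male -> 1
print(dict(zip(le.classes_, le.transform(le.classes_))))
```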
2.4 Dropping missing values
titanic = titanic[titanic['age'].notnull()]
titanic = titanic[titanic['fare'].notnull()]
titanic.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1045 entries, 0 to 1308
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pclass 1045 non-null int64
1 survived 1045 non-null int64
2 name 1045 non-null object
3 sex 1045 non-null object
4 age 1045 non-null float64
5 sibsp 1045 non-null int64
6 parch 1045 non-null int64
7 ticket 1045 non-null object
8 fare 1045 non-null float64
9 cabin 272 non-null object
10 embarked 1043 non-null object
11 boat 417 non-null object
12 body 119 non-null float64
13 home.dest 685 non-null object
14 age_cat 1045 non-null category
15 title 1045 non-null object
16 gender 1045 non-null int64
dtypes: category(1), float64(3), int64(5), object(8)
memory usage: 140.0+ KB
- Rows with missing age or fare are removed, leaving 1045 entries.
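The two notnull() filters can also be written as a single dropna call; an equivalent one-line sketch:

```python
# Drop rows where either age or fare is missing
titanic = titanic.dropna(subset=['age', 'fare'])
```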
2.5 Correlation
correlation_matrix = titanic.corr(numeric_only=True).round(1)
sns.heatmap(data=correlation_matrix, annot=True, cmap='bwr')
plt.show()
- Among the numeric columns, survived correlates most strongly with gender and pclass.
2.6 Selecting features and splitting the data
from sklearn.model_selection import train_test_split
X = titanic[['pclass', 'age', 'sibsp', 'parch', 'fare','gender']]
y = titanic['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 13)
- Only pclass, age, sibsp, parch, fare, and gender are used as predictors.
2.7 DecisionTree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
dt = DecisionTreeClassifier(max_depth= 4, random_state= 13)
dt.fit(X_train, y_train)
pred = dt.predict(X_test)
print(accuracy_score(y_test, pred))
0.7655502392344498
- The decision tree reaches an accuracy of about 0.766.
- Not as high as hoped.
2.8 LogisticRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
lr = LogisticRegression(random_state= 13, solver='liblinear')
lr.fit(X_train, y_train)
pred = lr.predict(X_test)
print(accuracy_score(y_test, pred))
0.7511961722488039
- Logistic regression comes in at about 0.751.
2.9 RandomForest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
rf = RandomForestClassifier(random_state= 13, n_estimators= 100, max_depth=4)
rf.fit(X_train, y_train)
pred = rf.predict(X_test)
print(accuracy_score(y_test, pred))
0.7799043062200957
- The random forest ensemble scores higher than the other two, at about 0.780.
2.10 Building a pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
estimators = [('scaler', StandardScaler()),
('clf', RandomForestClassifier(random_state=13))]
pipe = Pipeline(estimators)
- The pipeline combines a StandardScaler with a RandomForestClassifier.
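Before tuning anything, the pipeline can be fit as-is for a quick baseline; a sketch assuming the same train/test split from section 2.6:

```python
# Scaler + default random forest, evaluated on the held-out test split
pipe.fit(X_train, y_train)
print(accuracy_score(y_test, pipe.predict(X_test)))
```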
2.11 Grid search for the best hyperparameters
from sklearn.model_selection import GridSearchCV
params = [{
'clf__max_depth': [6, 8, 10, 100],
'clf__n_estimators': [50, 100, 200, 1000]
}]
gridsearch = GridSearchCV(
estimator=pipe, param_grid=params, return_train_score=True, cv=5, verbose=2)
gridsearch.fit(X, y)
Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV] clf__max_depth=6, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=6, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=6, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=6, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=6, clf__n_estimators=50 ..........................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[CV] ........... clf__max_depth=6, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=6, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=6, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=6, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=6, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=6, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=6, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=6, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=6, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=6, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=6, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=6, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=6, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=6, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=6, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=6, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=6, clf__n_estimators=1000, total= 1.0s
[CV] clf__max_depth=6, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=6, clf__n_estimators=1000, total= 1.1s
[CV] clf__max_depth=6, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=6, clf__n_estimators=1000, total= 1.0s
[CV] clf__max_depth=6, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=6, clf__n_estimators=1000, total= 1.1s
[CV] clf__max_depth=6, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=6, clf__n_estimators=1000, total= 1.1s
[CV] clf__max_depth=8, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=8, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=8, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=8, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=8, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=8, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=8, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=8, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=8, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=8, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=8, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=8, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=8, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=8, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=8, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=8, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=8, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=8, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=8, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=8, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=8, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=8, clf__n_estimators=1000, total= 1.1s
[CV] clf__max_depth=8, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=8, clf__n_estimators=1000, total= 1.1s
[CV] clf__max_depth=8, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=8, clf__n_estimators=1000, total= 1.1s
[CV] clf__max_depth=8, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=8, clf__n_estimators=1000, total= 1.1s
[CV] clf__max_depth=8, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=8, clf__n_estimators=1000, total= 1.1s
[CV] clf__max_depth=10, clf__n_estimators=50 .........................
[CV] .......... clf__max_depth=10, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=10, clf__n_estimators=50 .........................
[CV] .......... clf__max_depth=10, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=10, clf__n_estimators=50 .........................
[CV] .......... clf__max_depth=10, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=10, clf__n_estimators=50 .........................
[CV] .......... clf__max_depth=10, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=10, clf__n_estimators=50 .........................
[CV] .......... clf__max_depth=10, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=10, clf__n_estimators=100 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=10, clf__n_estimators=100 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=10, clf__n_estimators=100 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=10, clf__n_estimators=100 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=10, clf__n_estimators=100 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=10, clf__n_estimators=200 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=10, clf__n_estimators=200 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=10, clf__n_estimators=200 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=10, clf__n_estimators=200 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=10, clf__n_estimators=200 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=10, clf__n_estimators=1000 .......................
[CV] ........ clf__max_depth=10, clf__n_estimators=1000, total= 1.2s
[CV] clf__max_depth=10, clf__n_estimators=1000 .......................
[CV] ........ clf__max_depth=10, clf__n_estimators=1000, total= 1.2s
[CV] clf__max_depth=10, clf__n_estimators=1000 .......................
[CV] ........ clf__max_depth=10, clf__n_estimators=1000, total= 1.2s
[CV] clf__max_depth=10, clf__n_estimators=1000 .......................
[CV] ........ clf__max_depth=10, clf__n_estimators=1000, total= 1.2s
[CV] clf__max_depth=10, clf__n_estimators=1000 .......................
[CV] ........ clf__max_depth=10, clf__n_estimators=1000, total= 1.2s
[CV] clf__max_depth=100, clf__n_estimators=50 ........................
[CV] ......... clf__max_depth=100, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=100, clf__n_estimators=50 ........................
[CV] ......... clf__max_depth=100, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=100, clf__n_estimators=50 ........................
[CV] ......... clf__max_depth=100, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=100, clf__n_estimators=50 ........................
[CV] ......... clf__max_depth=100, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=100, clf__n_estimators=50 ........................
[CV] ......... clf__max_depth=100, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=100, clf__n_estimators=100 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=100, clf__n_estimators=100 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=100, clf__n_estimators=100 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=100, clf__n_estimators=100 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=100, clf__n_estimators=100 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=100, clf__n_estimators=200 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=100, clf__n_estimators=200 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=100, clf__n_estimators=200 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=100, clf__n_estimators=200 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=100, clf__n_estimators=200 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=100, clf__n_estimators=1000 ......................
[CV] ....... clf__max_depth=100, clf__n_estimators=1000, total= 1.2s
[CV] clf__max_depth=100, clf__n_estimators=1000 ......................
[CV] ....... clf__max_depth=100, clf__n_estimators=1000, total= 1.2s
[CV] clf__max_depth=100, clf__n_estimators=1000 ......................
[CV] ....... clf__max_depth=100, clf__n_estimators=1000, total= 1.2s
[CV] clf__max_depth=100, clf__n_estimators=1000 ......................
[CV] ....... clf__max_depth=100, clf__n_estimators=1000, total= 1.2s
[CV] clf__max_depth=100, clf__n_estimators=1000 ......................
[CV] ....... clf__max_depth=100, clf__n_estimators=1000, total= 1.3s
[Parallel(n_jobs=1)]: Done 80 out of 80 | elapsed: 35.1s finished
GridSearchCV(cv=5,
estimator=Pipeline(steps=[('scaler', StandardScaler()),
('clf',
RandomForestClassifier(random_state=13))]),
param_grid=[{'clf__max_depth': [6, 8, 10, 100],
'clf__n_estimators': [50, 100, 200, 1000]}],
return_train_score=True, verbose=2)
- GridSearchCV is run on the pipeline with 5-fold cross-validation.
- max_depth values of 6, 8, 10, 100 and n_estimators values of 50, 100, 200, 1000 are searched.
- When grid-searching a pipeline, each parameter name must be prefixed with the step name and a double underscore, e.g. clf__max_depth (see the sketch below).
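A small sketch of how to list the valid, double-underscore-prefixed parameter names straight from the pipeline:

```python
# Grid-search keys for the 'clf' step, e.g. clf__max_depth, clf__n_estimators, ...
print([k for k in pipe.get_params().keys() if k.startswith('clf__')][:5])
```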
2.12 Comparing the candidate models
score_df = pd.DataFrame(gridsearch.cv_results_)
score_df[['params', 'rank_test_score', 'mean_train_score',
'mean_test_score', 'std_train_score']]
 | params | rank_test_score | mean_train_score | mean_test_score | std_train_score |
---|---|---|---|---|---|
0 | {'clf__max_depth': 6, 'clf__n_estimators': 50} | 4 | 0.861962 | 0.691866 | 0.013440 |
1 | {'clf__max_depth': 6, 'clf__n_estimators': 100} | 2 | 0.860526 | 0.694737 | 0.014366 |
2 | {'clf__max_depth': 6, 'clf__n_estimators': 200} | 3 | 0.859091 | 0.692823 | 0.013937 |
3 | {'clf__max_depth': 6, 'clf__n_estimators': 1000} | 1 | 0.859809 | 0.704306 | 0.013605 |
4 | {'clf__max_depth': 8, 'clf__n_estimators': 50} | 8 | 0.898325 | 0.684211 | 0.011597 |
5 | {'clf__max_depth': 8, 'clf__n_estimators': 100} | 5 | 0.900239 | 0.688038 | 0.012491 |
6 | {'clf__max_depth': 8, 'clf__n_estimators': 200} | 6 | 0.898804 | 0.686124 | 0.013247 |
7 | {'clf__max_depth': 8, 'clf__n_estimators': 1000} | 6 | 0.899043 | 0.686124 | 0.012985 |
8 | {'clf__max_depth': 10, 'clf__n_estimators': 50} | 12 | 0.933971 | 0.664115 | 0.013730 |
9 | {'clf__max_depth': 10, 'clf__n_estimators': 100} | 11 | 0.933971 | 0.666029 | 0.013264 |
10 | {'clf__max_depth': 10, 'clf__n_estimators': 200} | 9 | 0.932775 | 0.670813 | 0.012417 |
11 | {'clf__max_depth': 10, 'clf__n_estimators': 1000} | 9 | 0.935646 | 0.670813 | 0.014540 |
12 | {'clf__max_depth': 100, 'clf__n_estimators': 50} | 13 | 0.981100 | 0.648804 | 0.004102 |
13 | {'clf__max_depth': 100, 'clf__n_estimators': 100} | 14 | 0.981818 | 0.645933 | 0.003960 |
14 | {'clf__max_depth': 100, 'clf__n_estimators': 200} | 16 | 0.981818 | 0.643062 | 0.003960 |
15 | {'clf__max_depth': 100, 'clf__n_estimators': 1... | 14 | 0.981818 | 0.645933 | 0.003960 |
- The cross-validation results are collected into a DataFrame to compare the candidates.
2.13 Best Model
gridsearch.best_estimator_
Pipeline(steps=[('scaler', StandardScaler()),
('clf',
RandomForestClassifier(max_depth=6, n_estimators=1000,
random_state=13))])
- Checking the best model: max_depth=6 with n_estimators=1000.
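The winning combination can also be read off directly; a short sketch (it should correspond to the rank-1 row of the table above, mean test score of roughly 0.704):

```python
# Best hyperparameters and their mean cross-validated accuracy
print(gridsearch.best_params_)
print(gridsearch.best_score_)
```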
2.14 Re-checking on the test data
pred = gridsearch.best_estimator_.predict(X_test)
print(accuracy_score(y_test, pred))
0.8325358851674641
- The best model found by the grid search scores about 0.83 on the test split. Note that the grid search above was fit on all of X and y, so X_test was seen during hyperparameter selection and this figure is likely somewhat optimistic.
3. What are DiCaprio's and Winslet's survival probabilities?
3.1 DiCaprio
import numpy as np
decaprio = np.array([[3, 18, 0, 0, 5, 1]])
print('Decaprio :', gridsearch.best_estimator_.predict_proba(decaprio)[0, 1])
Decaprio : 0.16496996405863845
- DiCaprio's character is entered as: 3rd class, 18 years old, no siblings or spouse aboard, no parents or children aboard, $5 fare, male.
- His predicted survival probability comes out to about 16%.
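Since the best estimator was fit on a DataFrame, recent scikit-learn versions warn when a bare NumPy array is passed in; a sketch of the same prediction using the training column names from section 2.6:

```python
# Same query as above, wrapped in a DataFrame with the original feature names
sample = pd.DataFrame([[3, 18, 0, 0, 5, 1]],
                      columns=['pclass', 'age', 'sibsp', 'parch', 'fare', 'gender'])
print(gridsearch.best_estimator_.predict_proba(sample)[0, 1])
```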
3.2 Winslet
import numpy as np
winslet = np.array([[1, 16, 1, 1, 100, 0]])
print('Winslet :', gridsearch.best_estimator_.predict_proba(winslet)[0, 1])
Winslet : 0.9628936507308983
- Winslet's character is entered as: 1st class, 16 years old, 1 sibling/spouse aboard, 1 parent/child aboard, $100 fare, female.
- Her predicted survival probability comes out to about 96%.