1. Titanic EDA
1.1 Loading the data
import pandas as pd
titanic = pd.read_excel('https://github.com/hmkim312/datas/blob/main/titanic/titanic.xls?raw=true')
titanic.head()
 | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
- The Titanic was the largest passenger liner of its day, sailing from Britain to New York, USA, on its 1912 maiden voyage.
- The dataset file is hosted at the GitHub link used above.
- pclass: cabin class
- survived: survival (0 = died, 1 = survived)
- sex: sex
- name: passenger name
- age: age
- sibsp: number of siblings or spouses aboard
- parch: number of parents or children aboard
- fare: fare paid
- boat: lifeboat number used during the escape
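Before plotting, it helps to see how complete each column is; a minimal check using only what has been loaded above:

```python
# Quick structure check: row/column counts and the columns with the most missing values
print(titanic.shape)
print(titanic.isnull().sum().sort_values(ascending=False).head(10))
```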
1.2 Survival overview
import matplotlib.pyplot as plt
import seaborn as sns
f, ax = plt.subplots(1, 2, figsize=(18, 8))
titanic['survived'].value_counts().plot.pie(
explode=[0, 0.1], autopct='%1.1f%%', ax=ax[0], shadow=True)
ax[0].set_title('Pie plot - Survived')
ax[0].set_ylabel('')
sns.countplot(x='survived', data=titanic, ax=ax[1])
ax[1].set_title('Count plot - Survived')
plt.show()
- The overall survival rate is 38.2%.
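The same figure can be read off numerically; a one-line sketch:

```python
# The 38.2% survival rate, computed directly from the survived column
print(titanic['survived'].value_counts(normalize=True))
```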
1.3 Survival by sex
f, ax = plt.subplots(1, 2, figsize=(18, 8))
sns.countplot(x='sex', data=titanic, ax=ax[0])
ax[0].set_title('Count of Passengers by Sex')
ax[0].set_ylabel('')
sns.countplot(x='sex', hue='survived', data=titanic, ax=ax[1])
ax[1].set_title('Sex: Survived vs. Not Survived')
plt.show()
- More men than women were aboard, but men's survival probability is clearly lower.
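To put a number on that gap, a short sketch of the survival rate per sex:

```python
# Mean of the 0/1 survived flag for each sex
print(titanic.groupby('sex')['survived'].mean())
```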
1.4 Survival rate by economic status (class)
pd.crosstab(titanic['pclass'], titanic['survived'], margins=True)
pclass \ survived | 0 | 1 | All |
---|---|---|---|
1 | 123 | 200 | 323 |
2 | 158 | 119 | 277 |
3 | 528 | 181 | 709 |
All | 809 | 500 | 1309 |
- First-class passengers had a much higher chance of survival.
- Women's survival rate is also high.
- So, were there many women in first class? (checked with the crosstab sketch below)
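One way to check that question directly, sketched with a second crosstab:

```python
# Passenger counts by class and sex, to see whether first class skewed female
print(pd.crosstab(titanic['pclass'], titanic['sex'], margins=True))
```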
1.5 Sex distribution by cabin class
grid = sns.FacetGrid(titanic, row = 'pclass', col = 'sex', height = 4, aspect=2)
grid.map(plt.hist, 'age', alpha =0.8, bins = 20)
grid.add_legend()
plt.show()
- Third class was predominantly male, especially men in their twenties.
1.6 Passengers by age
import plotly.express as px
fig = px.histogram(titanic, x = 'age')
fig.show()
- Children and passengers in their twenties and thirties make up a large share.
1.7 Survival rate by cabin class and age
grid = sns.FacetGrid(titanic, col = 'survived', row = 'pclass', height = 4, aspect = 2)
grid.map(plt.hist, 'age', alpha = .5, bins = 20)
grid.add_legend()
plt.show()
- The higher the cabin class, the higher the survival rate.
1.8 Binning age into five categories
titanic['age_cat'] = pd.cut(titanic['age'], bins=[0, 7, 15, 30, 60, 100],
include_lowest=True, labels=['baby', 'teen', 'young', 'adult', 'old'])
titanic.head()
 | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | age_cat |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO | young |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON | baby |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON | baby |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON | young |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON | young |
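A quick count of each new age band, as a small sanity-check sketch:

```python
# Number of passengers in each age category created by pd.cut
print(titanic['age_cat'].value_counts())
```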
1.9 Survival by age group, sex, and class at a glance
plt.figure(figsize=(12, 4))
plt.subplot(131)
sns.barplot(x='pclass', y='survived', data=titanic)
plt.subplot(132)
sns.barplot(x='age_cat', y='survived', data=titanic)
plt.subplot(133)
sns.barplot(x='sex', y='survived', data=titanic)
plt.subplots_adjust(top=1, bottom=0.1, left=0.1,
right=1, hspace=0.5, wspace=0.5)
plt.show()
- Being young, female, and in first class looks like it gave a clear survival advantage.
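The same comparison in table form, as a short sketch:

```python
# Mean survival rate for every (sex, pclass) combination
print(titanic.groupby(['sex', 'pclass'])['survived'].mean().unstack())
```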
1.10 Survival by age for men and women
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14, 6))
women = titanic[titanic['sex'] == 'female']
men = titanic[titanic['sex'] == 'male']
# Age histograms of survivors vs. non-survivors, one panel per sex
# (sns.histplot is the current equivalent of the deprecated sns.distplot with kde=False)
ax = sns.histplot(women[women['survived'] == 1]['age'], bins=20,
                  label='survived', ax=axes[0])
ax = sns.histplot(women[women['survived'] == 0]['age'], bins=40,
                  label='not survived', ax=axes[0])
ax.legend()
ax.set_title('Female')
ax = sns.histplot(men[men['survived'] == 1]['age'], bins=18,
                  label='survived', ax=axes[1])
ax = sns.histplot(men[men['survived'] == 0]['age'], bins=40,
                  label='not survived', ax=axes[1])
ax.legend()
ax.set_title('Male')
Text(0.5, 1.0, 'Male')
1.11 Checking social status from passenger names
for idx, dataset in titanic.iterrows():
    print(dataset['name'])
Allen, Miss. Elisabeth Walton
Allison, Master. Hudson Trevor
Allison, Miss. Helen Loraine
Allison, Mr. Hudson Joshua Creighton
Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
...
1.13 Extracting a title column for social status
import re
title = []
for idx, dataset in titanic.iterrows():
    tmp = dataset['name']
    title.append(re.search(r',\s\w+(\s\w+)?\.', tmp).group()[2:-1])
titanic['title'] = title
titanic.head()
 | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | age_cat | title |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO | young | Miss |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON | baby | Master |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON | baby | Miss |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON | young | Mr |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON | young | Mrs |
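For reference, the same title could also be extracted without an explicit loop; a hedged sketch using pandas string methods with an equivalent regex:

```python
# Vectorized alternative to the iterrows() loop: capture the word(s) between ", " and "."
titles = titanic['name'].str.extract(r',\s*([\w\s]+?)\.', expand=False).str.strip()
print(titles.value_counts().head())
```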
1.14 Titles by sex
pd.crosstab(titanic['title'], titanic['sex'])
title \ sex | female | male |
---|---|---|
Capt | 0 | 1 |
Col | 0 | 4 |
Don | 0 | 1 |
Dona | 1 | 0 |
Dr | 1 | 7 |
Jonkheer | 0 | 1 |
Lady | 1 | 0 |
Major | 0 | 2 |
Master | 0 | 61 |
Miss | 260 | 0 |
Mlle | 2 | 0 |
Mme | 1 | 0 |
Mr | 0 | 757 |
Mrs | 197 | 0 |
Ms | 2 | 0 |
Rev | 0 | 8 |
Sir | 0 | 1 |
the Countess | 1 | 0 |
1.15 Preprocessing the title column
titanic['title'] = titanic['title'].replace('Mlle', 'Miss')
titanic['title'] = titanic['title'].replace('Ms', 'Miss')
titanic['title'] = titanic['title'].replace('Mme', 'Mrs')
Rare_f = ['Dona', 'Dr', 'Lady', 'the Countess']
Rare_m = ['Capt', 'Col', 'Don', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Master']
for each in Rare_f:
    titanic['title'] = titanic['title'].replace(each, 'Rare_f')
for each in Rare_m:
    titanic['title'] = titanic['title'].replace(each, 'Rare_m')
titanic['title'].unique()
array(['Miss', 'Rare_m', 'Mr', 'Mrs', 'Rare_f'], dtype=object)
- Noble and other uncommon titles are consolidated into Miss, Mrs, Mr, Rare_m, and Rare_f (an equivalent mapping-based version is sketched below).
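The mapping-based version mentioned above, sketched as a single replace call that produces the same groups:

```python
# Equivalent dictionary-based cleanup of the title column
title_map = {'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs',
             'Dona': 'Rare_f', 'Dr': 'Rare_f', 'Lady': 'Rare_f', 'the Countess': 'Rare_f',
             'Capt': 'Rare_m', 'Col': 'Rare_m', 'Don': 'Rare_m', 'Major': 'Rare_m',
             'Rev': 'Rare_m', 'Sir': 'Rare_m', 'Jonkheer': 'Rare_m', 'Master': 'Rare_m'}
titanic['title'] = titanic['title'].replace(title_map)
```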
1.16 Nobles survived more often than expected
titanic[['title','survived']].groupby(['title'], as_index=False).mean()
 | title | survived |
---|---|---|
0 | Miss | 0.678030 |
1 | Mr | 0.162483 |
2 | Mrs | 0.787879 |
3 | Rare_f | 0.636364 |
4 | Rare_m | 0.443038 |
- The survival rates for the rare (noble) titles are also on the high side.
2. Predicting survivors with machine learning
2.1 Checking the structure
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pclass 1309 non-null int64
1 survived 1309 non-null int64
2 name 1309 non-null object
3 sex 1309 non-null object
4 age 1046 non-null float64
5 sibsp 1309 non-null int64
6 parch 1309 non-null int64
7 ticket 1309 non-null object
8 fare 1308 non-null float64
9 cabin 295 non-null object
10 embarked 1307 non-null object
11 boat 486 non-null object
12 body 121 non-null float64
13 home.dest 745 non-null object
14 age_cat 1046 non-null category
15 title 1309 non-null object
dtypes: category(1), float64(3), int64(4), object(8)
memory usage: 155.0+ KB
- Null handling and label encoding look necessary.
2.2 Converting the sex column to numbers
titanic['sex'].unique()
array(['female', 'male'], dtype=object)
2.3 Using LabelEncoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(titanic['sex'])
titanic['gender'] = le.transform(titanic['sex'])
titanic.head()
 | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | age_cat | title | gender |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO | young | Miss | 0 |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON | baby | Rare_m | 1 |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON | baby | Miss | 0 |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON | young | Mr | 1 |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON | young | Mrs | 0 |
- A new gender column holds the numerically encoded sex values.
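To confirm which number was assigned to which label, a quick sketch reading the encoder's learned classes:

```python
# The mapping LabelEncoder learned: female -> 0, male -> 1
print(dict(zip(le.classes_, le.transform(le.classes_))))
```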
2.4 Dropping missing values
titanic = titanic[titanic['age'].notnull()]
titanic = titanic[titanic['fare'].notnull()]
titanic.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1045 entries, 0 to 1308
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pclass 1045 non-null int64
1 survived 1045 non-null int64
2 name 1045 non-null object
3 sex 1045 non-null object
4 age 1045 non-null float64
5 sibsp 1045 non-null int64
6 parch 1045 non-null int64
7 ticket 1045 non-null object
8 fare 1045 non-null float64
9 cabin 272 non-null object
10 embarked 1043 non-null object
11 boat 417 non-null object
12 body 119 non-null float64
13 home.dest 685 non-null object
14 age_cat 1045 non-null category
15 title 1045 non-null object
16 gender 1045 non-null int64
dtypes: category(1), float64(3), int64(5), object(8)
memory usage: 140.0+ KB
- Rows with missing age or fare are removed, leaving 1045 entries.
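The two notnull() filters can also be written as a single dropna call; an equivalent one-line sketch:

```python
# Drop rows where either age or fare is missing
titanic = titanic.dropna(subset=['age', 'fare'])
```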
2.5 Correlation
correlation_matrix = titanic.corr(numeric_only=True).round(1)
sns.heatmap(data=correlation_matrix, annot=True, cmap='bwr')
plt.show()
- Among the numeric columns, survived correlates most strongly with gender and pclass.
2.6 Selecting features and splitting the data
from sklearn.model_selection import train_test_split
X = titanic[['pclass', 'age', 'sibsp', 'parch', 'fare','gender']]
y = titanic['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 13)
- Only pclass, age, sibsp, parch, fare, and gender are used as predictors.
2.7 DecisionTree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
dt = DecisionTreeClassifier(max_depth= 4, random_state= 13)
dt.fit(X_train, y_train)
pred = dt.predict(X_test)
print(accuracy_score(y_test, pred))
0.7655502392344498
- The decision tree reaches an accuracy of about 0.766.
- Not as high as hoped.
2.8 LogisticRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
lr = LogisticRegression(random_state= 13, solver='liblinear')
lr.fit(X_train, y_train)
pred = lr.predict(X_test)
print(accuracy_score(y_test, pred))
0.7511961722488039
- Logistic regression comes in at about 0.751.
2.9 RandomForest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
rf = RandomForestClassifier(random_state= 13, n_estimators= 100, max_depth=4)
rf.fit(X_train, y_train)
pred = rf.predict(X_test)
print(accuracy_score(y_test, pred))
0.7799043062200957
- The random forest ensemble scores higher than the other two, at about 0.780.
2.10 Building a pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
estimators = [('scaler', StandardScaler()),
('clf', RandomForestClassifier(random_state=13))]
pipe = Pipeline(estimators)
- The pipeline combines a StandardScaler with a RandomForestClassifier.
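Before tuning anything, the pipeline can be fit as-is for a quick baseline; a sketch assuming the same train/test split from section 2.6:

```python
# Scaler + default random forest, evaluated on the held-out test split
pipe.fit(X_train, y_train)
print(accuracy_score(y_test, pipe.predict(X_test)))
```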
2.11 Grid search for the best hyperparameters
from sklearn.model_selection import GridSearchCV
params = [{
'clf__max_depth': [6, 8, 10, 100],
'clf__n_estimators': [50, 100, 200, 1000]
}]
gridsearch = GridSearchCV(
estimator=pipe, param_grid=params, return_train_score=True, cv=5, verbose=2)
gridsearch.fit(X, y)
Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV] clf__max_depth=6, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=6, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=6, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=6, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=6, clf__n_estimators=50 ..........................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[CV] ........... clf__max_depth=6, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=6, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=6, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=6, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=6, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=6, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=6, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=6, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=6, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=6, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=6, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=6, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=6, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=6, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=6, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=6, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=6, clf__n_estimators=1000, total= 1.0s
[CV] clf__max_depth=6, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=6, clf__n_estimators=1000, total= 1.1s
[CV] clf__max_depth=6, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=6, clf__n_estimators=1000, total= 1.0s
[CV] clf__max_depth=6, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=6, clf__n_estimators=1000, total= 1.1s
[CV] clf__max_depth=6, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=6, clf__n_estimators=1000, total= 1.1s
[CV] clf__max_depth=8, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=8, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=8, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=8, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=8, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=8, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=8, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=8, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=8, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=8, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=8, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=8, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=8, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=8, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=8, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=8, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=8, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=8, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=8, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=8, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=8, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=8, clf__n_estimators=1000, total= 1.1s
[CV] clf__max_depth=8, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=8, clf__n_estimators=1000, total= 1.1s
[CV] clf__max_depth=8, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=8, clf__n_estimators=1000, total= 1.1s
[CV] clf__max_depth=8, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=8, clf__n_estimators=1000, total= 1.1s
[CV] clf__max_depth=8, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=8, clf__n_estimators=1000, total= 1.1s
[CV] clf__max_depth=10, clf__n_estimators=50 .........................
[CV] .......... clf__max_depth=10, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=10, clf__n_estimators=50 .........................
[CV] .......... clf__max_depth=10, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=10, clf__n_estimators=50 .........................
[CV] .......... clf__max_depth=10, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=10, clf__n_estimators=50 .........................
[CV] .......... clf__max_depth=10, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=10, clf__n_estimators=50 .........................
[CV] .......... clf__max_depth=10, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=10, clf__n_estimators=100 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=10, clf__n_estimators=100 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=10, clf__n_estimators=100 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=10, clf__n_estimators=100 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=10, clf__n_estimators=100 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=10, clf__n_estimators=200 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=10, clf__n_estimators=200 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=10, clf__n_estimators=200 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=10, clf__n_estimators=200 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=10, clf__n_estimators=200 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=10, clf__n_estimators=1000 .......................
[CV] ........ clf__max_depth=10, clf__n_estimators=1000, total= 1.2s
[CV] clf__max_depth=10, clf__n_estimators=1000 .......................
[CV] ........ clf__max_depth=10, clf__n_estimators=1000, total= 1.2s
[CV] clf__max_depth=10, clf__n_estimators=1000 .......................
[CV] ........ clf__max_depth=10, clf__n_estimators=1000, total= 1.2s
[CV] clf__max_depth=10, clf__n_estimators=1000 .......................
[CV] ........ clf__max_depth=10, clf__n_estimators=1000, total= 1.2s
[CV] clf__max_depth=10, clf__n_estimators=1000 .......................
[CV] ........ clf__max_depth=10, clf__n_estimators=1000, total= 1.2s
[CV] clf__max_depth=100, clf__n_estimators=50 ........................
[CV] ......... clf__max_depth=100, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=100, clf__n_estimators=50 ........................
[CV] ......... clf__max_depth=100, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=100, clf__n_estimators=50 ........................
[CV] ......... clf__max_depth=100, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=100, clf__n_estimators=50 ........................
[CV] ......... clf__max_depth=100, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=100, clf__n_estimators=50 ........................
[CV] ......... clf__max_depth=100, clf__n_estimators=50, total= 0.1s
[CV] clf__max_depth=100, clf__n_estimators=100 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=100, clf__n_estimators=100 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=100, clf__n_estimators=100 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=100, clf__n_estimators=100 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=100, clf__n_estimators=100 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=100, total= 0.1s
[CV] clf__max_depth=100, clf__n_estimators=200 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=100, clf__n_estimators=200 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=100, clf__n_estimators=200 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=100, clf__n_estimators=200 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=100, clf__n_estimators=200 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=200, total= 0.2s
[CV] clf__max_depth=100, clf__n_estimators=1000 ......................
[CV] ....... clf__max_depth=100, clf__n_estimators=1000, total= 1.2s
[CV] clf__max_depth=100, clf__n_estimators=1000 ......................
[CV] ....... clf__max_depth=100, clf__n_estimators=1000, total= 1.2s
[CV] clf__max_depth=100, clf__n_estimators=1000 ......................
[CV] ....... clf__max_depth=100, clf__n_estimators=1000, total= 1.2s
[CV] clf__max_depth=100, clf__n_estimators=1000 ......................
[CV] ....... clf__max_depth=100, clf__n_estimators=1000, total= 1.2s
[CV] clf__max_depth=100, clf__n_estimators=1000 ......................
[CV] ....... clf__max_depth=100, clf__n_estimators=1000, total= 1.3s
[Parallel(n_jobs=1)]: Done 80 out of 80 | elapsed: 35.1s finished
GridSearchCV(cv=5,
estimator=Pipeline(steps=[('scaler', StandardScaler()),
('clf',
RandomForestClassifier(random_state=13))]),
param_grid=[{'clf__max_depth': [6, 8, 10, 100],
'clf__n_estimators': [50, 100, 200, 1000]}],
return_train_score=True, verbose=2)
- GridSearchCV is run on the pipeline with 5-fold cross-validation.
- max_depth values of 6, 8, 10, 100 and n_estimators values of 50, 100, 200, 1000 are searched.
- When grid-searching a pipeline, each parameter name must be prefixed with the step name and a double underscore, e.g. clf__max_depth (see the sketch below).
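A small sketch of how to list the valid, double-underscore-prefixed parameter names straight from the pipeline:

```python
# Grid-search keys for the 'clf' step, e.g. clf__max_depth, clf__n_estimators, ...
print([k for k in pipe.get_params().keys() if k.startswith('clf__')][:5])
```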
2.12 Comparing the candidate models
score_df = pd.DataFrame(gridsearch.cv_results_)
score_df[['params', 'rank_test_score', 'mean_train_score',
'mean_test_score', 'std_train_score']]
 | params | rank_test_score | mean_train_score | mean_test_score | std_train_score |
---|---|---|---|---|---|
0 | {'clf__max_depth': 6, 'clf__n_estimators': 50} | 4 | 0.861962 | 0.691866 | 0.013440 |
1 | {'clf__max_depth': 6, 'clf__n_estimators': 100} | 2 | 0.860526 | 0.694737 | 0.014366 |
2 | {'clf__max_depth': 6, 'clf__n_estimators': 200} | 3 | 0.859091 | 0.692823 | 0.013937 |
3 | {'clf__max_depth': 6, 'clf__n_estimators': 1000} | 1 | 0.859809 | 0.704306 | 0.013605 |
4 | {'clf__max_depth': 8, 'clf__n_estimators': 50} | 8 | 0.898325 | 0.684211 | 0.011597 |
5 | {'clf__max_depth': 8, 'clf__n_estimators': 100} | 5 | 0.900239 | 0.688038 | 0.012491 |
6 | {'clf__max_depth': 8, 'clf__n_estimators': 200} | 6 | 0.898804 | 0.686124 | 0.013247 |
7 | {'clf__max_depth': 8, 'clf__n_estimators': 1000} | 6 | 0.899043 | 0.686124 | 0.012985 |
8 | {'clf__max_depth': 10, 'clf__n_estimators': 50} | 12 | 0.933971 | 0.664115 | 0.013730 |
9 | {'clf__max_depth': 10, 'clf__n_estimators': 100} | 11 | 0.933971 | 0.666029 | 0.013264 |
10 | {'clf__max_depth': 10, 'clf__n_estimators': 200} | 9 | 0.932775 | 0.670813 | 0.012417 |
11 | {'clf__max_depth': 10, 'clf__n_estimators': 1000} | 9 | 0.935646 | 0.670813 | 0.014540 |
12 | {'clf__max_depth': 100, 'clf__n_estimators': 50} | 13 | 0.981100 | 0.648804 | 0.004102 |
13 | {'clf__max_depth': 100, 'clf__n_estimators': 100} | 14 | 0.981818 | 0.645933 | 0.003960 |
14 | {'clf__max_depth': 100, 'clf__n_estimators': 200} | 16 | 0.981818 | 0.643062 | 0.003960 |
15 | {'clf__max_depth': 100, 'clf__n_estimators': 1... | 14 | 0.981818 | 0.645933 | 0.003960 |
- The cross-validation results are collected into a DataFrame to compare the candidates.
2.13 Best Model
gridsearch.best_estimator_
Pipeline(steps=[('scaler', StandardScaler()),
('clf',
RandomForestClassifier(max_depth=6, n_estimators=1000,
random_state=13))])
- Checking the best model: max_depth=6 with n_estimators=1000.
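The winning combination can also be read off directly; a short sketch (it should correspond to the rank-1 row of the table above, mean test score of roughly 0.704):

```python
# Best hyperparameters and their mean cross-validated accuracy
print(gridsearch.best_params_)
print(gridsearch.best_score_)
```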
2.14 Re-checking on the test data
pred = gridsearch.best_estimator_.predict(X_test)
print(accuracy_score(y_test, pred))
0.8325358851674641
- The best model found by the grid search scores about 0.83 on the test split. Note that the grid search above was fit on all of X and y, so X_test was seen during hyperparameter selection and this figure is likely somewhat optimistic.
3. What are DiCaprio's and Winslet's survival probabilities?
3.1 DiCaprio
import numpy as np
decaprio = np.array([[3, 18, 0, 0, 5, 1]])
print('Decaprio :', gridsearch.best_estimator_.predict_proba(decaprio)[0, 1])
Decaprio : 0.16496996405863845
- DiCaprio's character is entered as: 3rd class, 18 years old, no siblings or spouse aboard, no parents or children aboard, $5 fare, male.
- His predicted survival probability comes out to about 16%.
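Since the best estimator was fit on a DataFrame, recent scikit-learn versions warn when a bare NumPy array is passed in; a sketch of the same prediction using the training column names from section 2.6:

```python
# Same query as above, wrapped in a DataFrame with the original feature names
sample = pd.DataFrame([[3, 18, 0, 0, 5, 1]],
                      columns=['pclass', 'age', 'sibsp', 'parch', 'fare', 'gender'])
print(gridsearch.best_estimator_.predict_proba(sample)[0, 1])
```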
3.2 Winslet
import numpy as np
winslet = np.array([[1, 16, 1, 1, 100, 0]])
print('Winslet :', gridsearch.best_estimator_.predict_proba(winslet)[0, 1])
Winslet : 0.9628936507308983
- Winslet's character is entered as: 1st class, 16 years old, 1 sibling/spouse aboard, 1 parent/child aboard, $100 fare, female.
- Her predicted survival probability comes out to about 96%.