
Porto Seguro’s Safe Driver Prediction: Data Preparation & Exploration

1. Porto Seguro Safe Driver Prediction


1.1 Porto Seguro Safe Driver Prediction


2. Introduction


2.1 Introduction

  • This notebook aims to extract good insights from the data of the Porto Seguro competition. Along the way it provides some tips and tricks for preparing the data for modeling. The notebook consists of the following main sections.


2.2 Sections

  • Loading packages
  • Visual inspection of your data
  • Defining the metadata
  • Descriptive statistics
  • Handling imbalanced classes
  • Data quality checks
  • Exploratory data visualization
  • Feature engineering
  • Feature selection
  • Feature scaling


2.3 Source

  • Based on the Kaggle kernel "Data Preparation & Exploration" by Bert Carremans.


3. Loading packages


3.1 Loading packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# sklearn.preprocessing.Imputer was removed in newer scikit-learn; use SimpleImputer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectFromModel
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier

pd.set_option('display.max_columns', 100)


4. Visual inspection of your data


4.1 Data Load

train = pd.read_csv('https://media.githubusercontent.com/media/hmkim312/datas/main/porto-seguro-safe-driver-prediction/train.csv')
test = pd.read_csv('https://media.githubusercontent.com/media/hmkim312/datas/main/porto-seguro-safe-driver-prediction/test.csv')


4.2 Data at first sight

train.head()
(output: first 5 rows × 59 columns of train — id, target, ps_ind_*, ps_reg_*, ps_car_*, ps_calc_*; omitted here)
train.tail()
(output: last 5 rows × 59 columns of train; omitted here)
  • Features that belong to similar groupings are tagged as such in the feature name (e.g., ind, reg, car, calc).
  • Feature names include the postfix bin for binary features and cat for categorical features.
  • Features without these designations are either continuous or ordinal.
  • A value of -1 indicates that the feature was missing from that observation.
  • The target column signifies whether or not a claim was filed for that policy holder.
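
A quick way to verify this naming convention programmatically (a small sketch; it only assumes train is loaded as above):

from collections import Counter

# Count the feature groups (ind, reg, car, calc) and the suffixes (bin, cat)
groups = Counter(c.split('_')[1] for c in train.columns if c.startswith('ps_'))
suffixes = Counter('bin' if c.endswith('_bin') else 'cat' if c.endswith('_cat') else 'none'
                   for c in train.columns if c.startswith('ps_'))
print(groups)    # Counter({'calc': 20, 'ind': 18, 'car': 16, 'reg': 3})
print(suffixes)  # Counter({'none': 26, 'bin': 17, 'cat': 14})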


train.shape
(595212, 59)
  • The train data consists of 595,212 rows and 59 columns.


train = train.drop_duplicates()
train.shape
(595212, 59)
  • We ran drop_duplicates to check for duplicate rows; the shape is unchanged, so there are none.


test.shape
(892816, 58)
  • The test data has one column fewer than the train data, but that is just the target column.


train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 595212 entries, 0 to 595211
Data columns (total 59 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   id              595212 non-null  int64  
 1   target          595212 non-null  int64  
 2   ps_ind_01       595212 non-null  int64  
 3   ps_ind_02_cat   595212 non-null  int64  
 4   ps_ind_03       595212 non-null  int64  
 5   ps_ind_04_cat   595212 non-null  int64  
 6   ps_ind_05_cat   595212 non-null  int64  
 7   ps_ind_06_bin   595212 non-null  int64  
 8   ps_ind_07_bin   595212 non-null  int64  
 9   ps_ind_08_bin   595212 non-null  int64  
 10  ps_ind_09_bin   595212 non-null  int64  
 11  ps_ind_10_bin   595212 non-null  int64  
 12  ps_ind_11_bin   595212 non-null  int64  
 13  ps_ind_12_bin   595212 non-null  int64  
 14  ps_ind_13_bin   595212 non-null  int64  
 15  ps_ind_14       595212 non-null  int64  
 16  ps_ind_15       595212 non-null  int64  
 17  ps_ind_16_bin   595212 non-null  int64  
 18  ps_ind_17_bin   595212 non-null  int64  
 19  ps_ind_18_bin   595212 non-null  int64  
 20  ps_reg_01       595212 non-null  float64
 21  ps_reg_02       595212 non-null  float64
 22  ps_reg_03       595212 non-null  float64
 23  ps_car_01_cat   595212 non-null  int64  
 24  ps_car_02_cat   595212 non-null  int64  
 25  ps_car_03_cat   595212 non-null  int64  
 26  ps_car_04_cat   595212 non-null  int64  
 27  ps_car_05_cat   595212 non-null  int64  
 28  ps_car_06_cat   595212 non-null  int64  
 29  ps_car_07_cat   595212 non-null  int64  
 30  ps_car_08_cat   595212 non-null  int64  
 31  ps_car_09_cat   595212 non-null  int64  
 32  ps_car_10_cat   595212 non-null  int64  
 33  ps_car_11_cat   595212 non-null  int64  
 34  ps_car_11       595212 non-null  int64  
 35  ps_car_12       595212 non-null  float64
 36  ps_car_13       595212 non-null  float64
 37  ps_car_14       595212 non-null  float64
 38  ps_car_15       595212 non-null  float64
 39  ps_calc_01      595212 non-null  float64
 40  ps_calc_02      595212 non-null  float64
 41  ps_calc_03      595212 non-null  float64
 42  ps_calc_04      595212 non-null  int64  
 43  ps_calc_05      595212 non-null  int64  
 44  ps_calc_06      595212 non-null  int64  
 45  ps_calc_07      595212 non-null  int64  
 46  ps_calc_08      595212 non-null  int64  
 47  ps_calc_09      595212 non-null  int64  
 48  ps_calc_10      595212 non-null  int64  
 49  ps_calc_11      595212 non-null  int64  
 50  ps_calc_12      595212 non-null  int64  
 51  ps_calc_13      595212 non-null  int64  
 52  ps_calc_14      595212 non-null  int64  
 53  ps_calc_15_bin  595212 non-null  int64  
 54  ps_calc_16_bin  595212 non-null  int64  
 55  ps_calc_17_bin  595212 non-null  int64  
 56  ps_calc_18_bin  595212 non-null  int64  
 57  ps_calc_19_bin  595212 non-null  int64  
 58  ps_calc_20_bin  595212 non-null  int64  
dtypes: float64(10), int64(49)
memory usage: 267.9 MB
  • The 14 categorical variables (cat) will need dummy variables later; the binary variables (bin) are already 0/1, so they need no dummies.
  • Every column is of dtype float64 or int64.
  • No nulls show up, because missing values were encoded as -1.
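
A one-line sanity check of that last point (a small sketch; -1 is the sentinel value, so pandas reports no nulls):

print(train.isnull().values.any())       # False: no NaNs anywhere
print((train == -1).any(axis=0).sum())   # number of columns containing the -1 sentinel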


5. Defining the metadata


5.1 Metadata

data = []
for f in train.columns:
    # Defining the role
    if f == 'target':
        role = 'target'
    elif f == 'id':
        role = 'id'
    else:
        role = 'input'

    # Defining the level
    if 'bin' in f or f == 'target':
        level = 'binary'
    elif 'cat' in f or f == 'id':
        level = 'nominal'
    elif train[f].dtype == float:
        level = 'interval'
    elif train[f].dtype == int:
        level = 'ordinal'
        
    # Initialize keep to True for all variables except for id
    
    keep = True
    if f == 'id':
        keep = False
        
    # Defining the data type
    
    dtype = train[f].dtype
    
    # Creating a dict that contains all the metadata for the variable
    
    f_dict = {
        'varname' : f,
        'role' : role,
        'level' : level,
        'keep' : keep,
        'dtype' : dtype
    }
    data.append(f_dict)
    
meta = pd.DataFrame(data, columns = ['varname', 'role', 'level', 'keep', 'dtype'])
meta.set_index('varname', inplace = True)
meta
               role    level     keep   dtype
varname
id             id      nominal   False  int64
target         target  binary    True   int64
ps_ind_01      input   ordinal   True   int64
ps_ind_02_cat  input   nominal   True   int64
ps_ind_03      input   ordinal   True   int64
ps_ind_04_cat  input   nominal   True   int64
ps_ind_05_cat  input   nominal   True   int64
ps_ind_06_bin  input   binary    True   int64
ps_ind_07_bin  input   binary    True   int64
ps_ind_08_bin  input   binary    True   int64
ps_ind_09_bin  input   binary    True   int64
ps_ind_10_bin  input   binary    True   int64
ps_ind_11_bin  input   binary    True   int64
ps_ind_12_bin  input   binary    True   int64
ps_ind_13_bin  input   binary    True   int64
ps_ind_14      input   ordinal   True   int64
ps_ind_15      input   ordinal   True   int64
ps_ind_16_bin  input   binary    True   int64
ps_ind_17_bin  input   binary    True   int64
ps_ind_18_bin  input   binary    True   int64
ps_reg_01      input   interval  True   float64
ps_reg_02      input   interval  True   float64
ps_reg_03      input   interval  True   float64
ps_car_01_cat  input   nominal   True   int64
ps_car_02_cat  input   nominal   True   int64
ps_car_03_cat  input   nominal   True   int64
ps_car_04_cat  input   nominal   True   int64
ps_car_05_cat  input   nominal   True   int64
ps_car_06_cat  input   nominal   True   int64
ps_car_07_cat  input   nominal   True   int64
ps_car_08_cat  input   nominal   True   int64
ps_car_09_cat  input   nominal   True   int64
ps_car_10_cat  input   nominal   True   int64
ps_car_11_cat  input   nominal   True   int64
ps_car_11      input   ordinal   True   int64
ps_car_12      input   interval  True   float64
ps_car_13      input   interval  True   float64
ps_car_14      input   interval  True   float64
ps_car_15      input   interval  True   float64
ps_calc_01     input   interval  True   float64
ps_calc_02     input   interval  True   float64
ps_calc_03     input   interval  True   float64
ps_calc_04     input   ordinal   True   int64
ps_calc_05     input   ordinal   True   int64
ps_calc_06     input   ordinal   True   int64
ps_calc_07     input   ordinal   True   int64
ps_calc_08     input   ordinal   True   int64
ps_calc_09     input   ordinal   True   int64
ps_calc_10     input   ordinal   True   int64
ps_calc_11     input   ordinal   True   int64
ps_calc_12     input   ordinal   True   int64
ps_calc_13     input   ordinal   True   int64
ps_calc_14     input   ordinal   True   int64
ps_calc_15_bin input   binary    True   int64
ps_calc_16_bin input   binary    True   int64
ps_calc_17_bin input   binary    True   int64
ps_calc_18_bin input   binary    True   int64
ps_calc_19_bin input   binary    True   int64
ps_calc_20_bin input   binary    True   int64
  • To support visualization, analysis, and modeling, we store metadata about the variables in a DataFrame:
    • role: input, id, target
    • level: nominal, interval, ordinal, binary
    • keep: True or False
    • dtype: int, float, str
meta[(meta.level == 'nominal') & (meta.keep == True)].index
Index(['ps_ind_02_cat', 'ps_ind_04_cat', 'ps_ind_05_cat', 'ps_car_01_cat',
       'ps_car_02_cat', 'ps_car_03_cat', 'ps_car_04_cat', 'ps_car_05_cat',
       'ps_car_06_cat', 'ps_car_07_cat', 'ps_car_08_cat', 'ps_car_09_cat',
       'ps_car_10_cat', 'ps_car_11_cat'],
      dtype='object', name='varname')
  • The list of nominal variables whose keep flag is True (i.e., not dropped).


pd.DataFrame({'count' : meta.groupby(['role', 'level'])['role'].size()}).reset_index()
   role    level     count
0  id      nominal       1
1  input   binary       17
2  input   interval     10
3  input   nominal      14
4  input   ordinal      16
5  target  binary        1
  • Grouping by role and level gives the variable count per combination as a DataFrame.


6. Descriptive statistics


6.1 Descriptive statistics

  • We can call describe on the DataFrame. It is meaningless for categorical variables, though, so we apply it to the real-valued variables to get the mean, standard deviation, and so on.


6.2 Interval variables

v = meta[(meta.level == 'interval') & (meta.keep == True)].index
train[v].describe()
(count = 595212 for every variable; table transposed for readability)

            mean      std       min        25%       50%       75%       max
ps_reg_01   0.610991  0.287643   0.000000  0.400000  0.700000  0.900000  0.900000
ps_reg_02   0.439184  0.404264   0.000000  0.200000  0.300000  0.600000  1.800000
ps_reg_03   0.551102  0.793506  -1.000000  0.525000  0.720677  1.000000  4.037945
ps_car_12   0.379945  0.058327  -1.000000  0.316228  0.374166  0.400000  1.264911
ps_car_13   0.813265  0.224588   0.250619  0.670867  0.765811  0.906190  3.720626
ps_car_14   0.276256  0.357154  -1.000000  0.333167  0.368782  0.396485  0.636396
ps_car_15   3.065899  0.731366   0.000000  2.828427  3.316625  3.605551  3.741657
ps_calc_01  0.449756  0.287198   0.000000  0.200000  0.500000  0.700000  0.900000
ps_calc_02  0.449589  0.286893   0.000000  0.200000  0.400000  0.700000  0.900000
ps_calc_03  0.449849  0.287153   0.000000  0.200000  0.500000  0.700000  0.900000
  • We ran describe on the variables whose level is interval.
  • Among the reg variables, only ps_reg_03 contains -1 (missing) values.
  • Among the car variables, ps_car_12 and ps_car_14 contain -1 (missing) values.
  • The calc variables contain no -1 (missing) values.
  • The min-max range differs per variable, so scaling should be applied.
  • Overall, the ranges of the interval variables are not that large.


6.3 Ordinal variables

v = meta[(meta.level == 'ordinal') & (meta.keep == True)].index
train[v].describe()
(count = 595212 for every variable; table transposed for readability)

            mean      std       min   25%   50%   75%   max
ps_ind_01   1.900378  1.983789   0.0   0.0   1.0   3.0   7.0
ps_ind_03   4.423318  2.699902   0.0   2.0   4.0   6.0  11.0
ps_ind_14   0.012451  0.127545   0.0   0.0   0.0   0.0   4.0
ps_ind_15   7.299922  3.546042   0.0   5.0   7.0  10.0  13.0
ps_car_11   2.346072  0.832548  -1.0   2.0   3.0   3.0   3.0
ps_calc_04  2.372081  1.117219   0.0   2.0   2.0   3.0   5.0
ps_calc_05  1.885886  1.134927   0.0   1.0   2.0   3.0   6.0
ps_calc_06  7.689445  1.334312   0.0   7.0   8.0   9.0  10.0
ps_calc_07  3.005823  1.414564   0.0   2.0   3.0   4.0   9.0
ps_calc_08  9.225904  1.459672   2.0   8.0   9.0  10.0  12.0
ps_calc_09  2.339034  1.246949   0.0   1.0   2.0   3.0   7.0
ps_calc_10  8.433590  2.904597   0.0   6.0   8.0  10.0  25.0
ps_calc_11  5.441382  2.332871   0.0   4.0   5.0   7.0  19.0
ps_calc_12  1.441918  1.202963   0.0   1.0   1.0   2.0  10.0
ps_calc_13  2.872288  1.694887   0.0   2.0   3.0   4.0  13.0
ps_calc_14  7.539026  2.746652   0.0   6.0   7.0   9.0  23.0
  • Only ps_car_11 contains -1 (missing) values.
  • The min-max ranges all differ, so scaling is needed here as well.


6.4 Binary variables

v = meta[(meta.level == 'binary') & (meta.keep == True)].index
train[v].describe()
(count = 595212 for every variable; table transposed for readability)

                mean      std       min   25%   50%   75%   max
target          0.036448  0.187401  0.0   0.0   0.0   0.0   1.0
ps_ind_06_bin   0.393742  0.488579  0.0   0.0   0.0   1.0   1.0
ps_ind_07_bin   0.257033  0.436998  0.0   0.0   0.0   1.0   1.0
ps_ind_08_bin   0.163921  0.370205  0.0   0.0   0.0   0.0   1.0
ps_ind_09_bin   0.185304  0.388544  0.0   0.0   0.0   0.0   1.0
ps_ind_10_bin   0.000373  0.019309  0.0   0.0   0.0   0.0   1.0
ps_ind_11_bin   0.001692  0.041097  0.0   0.0   0.0   0.0   1.0
ps_ind_12_bin   0.009439  0.096693  0.0   0.0   0.0   0.0   1.0
ps_ind_13_bin   0.000948  0.030768  0.0   0.0   0.0   0.0   1.0
ps_ind_16_bin   0.660823  0.473430  0.0   0.0   1.0   1.0   1.0
ps_ind_17_bin   0.121081  0.326222  0.0   0.0   0.0   0.0   1.0
ps_ind_18_bin   0.153446  0.360417  0.0   0.0   0.0   0.0   1.0
ps_calc_15_bin  0.122427  0.327779  0.0   0.0   0.0   0.0   1.0
ps_calc_16_bin  0.627840  0.483381  0.0   0.0   1.0   1.0   1.0
ps_calc_17_bin  0.554182  0.497056  0.0   0.0   1.0   1.0   1.0
ps_calc_18_bin  0.287182  0.452447  0.0   0.0   0.0   1.0   1.0
ps_calc_19_bin  0.349024  0.476662  0.0   0.0   0.0   1.0   1.0
ps_calc_20_bin  0.153318  0.360295  0.0   0.0   0.0   0.0   1.0
  • In the train data, the share of records with target = 1 is 3.645%; the classes are strongly imbalanced.
  • In other words, the vast majority of records have target = 0.


7. Handling imbalanced classes


7.1 Handling imbalanced classes

  • The share of records with target = 1 is far too small. That means a model could predict target = 0 for everything and only be wrong on the few 1s (see the quick check below).
  • To address this we can oversample the records with target = 1 or undersample the records with target = 0.
  • Here we will undersample.
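
To make the imbalance concrete, here is a quick check (a sketch on the train data loaded above) of how the trivial all-zeros predictor scores on plain accuracy:

# Predicting target = 0 for every record is already ~96.4% "accurate",
# which is exactly why plain accuracy is useless on this data.
baseline_acc = (train['target'] == 0).mean()
print(f'All-zeros baseline accuracy: {baseline_acc:.4f}')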


7.2 UnderSampling

desired_apriori=0.10

# Get the indices per target value
idx_0 = train[train.target == 0].index
idx_1 = train[train.target == 1].index

# Get original number of records per target value
nb_0 = len(train.loc[idx_0])
nb_1 = len(train.loc[idx_1])

# Calculate the undersampling rate and resulting number of records with target=0
undersampling_rate = ((1-desired_apriori)*nb_1)/(nb_0*desired_apriori)
undersampled_nb_0 = int(undersampling_rate*nb_0)
print('Rate to undersample records with target=0: {}'.format(undersampling_rate))
print('Number of records with target=0 after undersampling: {}'.format(undersampled_nb_0))

# Randomly select records with target=0 to get at the desired a priori
undersampled_idx = shuffle(idx_0, random_state=37, n_samples=undersampled_nb_0)

# Construct list with remaining indices
idx_list = list(undersampled_idx) + list(idx_1)

# Return undersample data frame
train = train.loc[idx_list].reset_index(drop=True)
under_rate = train['target'].sum() / train['target'].count()
print(f'Proportion of target = 1 after undersampling: {under_rate:.2f}')
Rate to undersample records with target=0: 0.34043569687437886
Number of records with target=0 after undersampling: 195246
Proportion of target = 1 after undersampling: 0.10
  • undersampling_rate is the fraction of target = 0 records to keep so that target = 1 reaches the desired share.
  • desired_apriori = 0.10 is the proportion of target = 1 records after undersampling.
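
The rate follows from requiring nb_1 / (rate * nb_0 + nb_1) = desired_apriori, which solves to rate = (1 - desired_apriori) * nb_1 / (desired_apriori * nb_0). A quick numeric check, using the class counts implied by the output above:

nb_0, nb_1 = 573518, 21694               # target = 0 / target = 1 counts before undersampling
rate = (1 - 0.10) * nb_1 / (0.10 * nb_0)
print(rate)                              # 0.34043569687437886, as printed above
print(nb_1 / (int(rate * nb_0) + nb_1))  # ~0.10, the desired a priori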


8. Data Quality Checks


8.1 Checking missing values

vars_with_missing = []

for f in train.columns:
    missings = train[train[f] == -1][f].count()
    if missings > 0:
        vars_with_missing.append(f)
        missings_perc = missings / train.shape[0]
        
        print(f'Variable {f} has {missings} records ({missings_perc:.2%}) with missing values')
print(f'In total, there are {len(vars_with_missing)} variables with missing values')
Variable ps_ind_02_cat has 103 records (0.05%) with missing values
Variable ps_ind_04_cat has 51 records (0.02%) with missing values
Variable ps_ind_05_cat has 2256 records (1.04%) with missing values
Variable ps_reg_03 has 38580 records (17.78%) with missing values
Variable ps_car_01_cat has 62 records (0.03%) with missing values
Variable ps_car_02_cat has 2 records (0.00%) with missing values
Variable ps_car_03_cat has 148367 records (68.39%) with missing values
Variable ps_car_05_cat has 96026 records (44.26%) with missing values
Variable ps_car_07_cat has 4431 records (2.04%) with missing values
Variable ps_car_09_cat has 230 records (0.11%) with missing values
Variable ps_car_11 has 1 records (0.00%) with missing values
Variable ps_car_14 has 15726 records (7.25%) with missing values
In total, there are 12 variables with missing values
  • For each variable we count the -1 (missing) values and report their share.
  • Some variables have more missing values than expected: ps_reg_03, ps_car_03_cat, ps_car_05_cat, ...
  • In total, 12 variables have missing values.


# Dropping the variables with too many missing values
vars_to_drop = ['ps_car_03_cat', 'ps_car_05_cat']
train.drop(vars_to_drop, inplace=True, axis=1)
meta.loc[(vars_to_drop),'keep'] = False  # Updating the meta

# Imputing with the mean or mode
mean_imp = SimpleImputer(missing_values=-1, strategy='mean')
mode_imp = SimpleImputer(missing_values=-1, strategy='most_frequent')
train['ps_reg_03'] = mean_imp.fit_transform(train[['ps_reg_03']]).ravel()
train['ps_car_12'] = mean_imp.fit_transform(train[['ps_car_12']]).ravel()
train['ps_car_14'] = mean_imp.fit_transform(train[['ps_car_14']]).ravel()
train['ps_car_11'] = mode_imp.fit_transform(train[['ps_car_11']]).ravel()
  • For the other categorical variables with missing values, we leave the -1 as its own category value.
  • ps_reg_03 (continuous): about 18% missing, replaced with the mean.
  • ps_car_11 (ordinal): a handful of missing values, replaced with the mode.
  • ps_car_12 (continuous): a single missing value, replaced with the mean.
  • ps_car_14 (continuous): about 7% missing, replaced with the mean.


8.2 Checking the cardinality of the categorical variables

v = meta[(meta['level'] == 'nominal') & (meta['keep'])].index

for f in v:
    dist_values = train[f].value_counts().shape[0]
    print(f'Variable {f} has {dist_values} distinct values')
Variable ps_ind_02_cat has 5 distinct values
Variable ps_ind_04_cat has 3 distinct values
Variable ps_ind_05_cat has 8 distinct values
Variable ps_car_01_cat has 13 distinct values
Variable ps_car_02_cat has 3 distinct values
Variable ps_car_04_cat has 10 distinct values
Variable ps_car_06_cat has 18 distinct values
Variable ps_car_07_cat has 3 distinct values
Variable ps_car_08_cat has 2 distinct values
Variable ps_car_09_cat has 6 distinct values
Variable ps_car_10_cat has 3 distinct values
Variable ps_car_11_cat has 104 distinct values
  • Cardinality is the number of distinct values a variable takes.
  • Since we will create dummy variables from the categorical variables, we need to check for variables with many distinct values; those would generate many dummy columns and should be handled differently.
  • ps_car_11_cat has 104 distinct values, which would mean 103 dummy columns, so we target-encode it instead.


# Script by https://www.kaggle.com/ogrellier
# Code: https://www.kaggle.com/ogrellier/python-target-encoding-for-categorical-features
def add_noise(series, noise_level):
    return series * (1 + noise_level * np.random.randn(len(series)))

def target_encode(trn_series=None, 
                  tst_series=None, 
                  target=None, 
                  min_samples_leaf=1, 
                  smoothing=1,
                  noise_level=0):
    """
    Smoothing is computed like in the following paper by Daniele Micci-Barreca
    https://kaggle2.blob.core.windows.net/forum-message-attachments/225952/7441/high%20cardinality%20categoricals.pdf
    trn_series : training categorical feature as a pd.Series
    tst_series : test categorical feature as a pd.Series
    target : target data as a pd.Series
    min_samples_leaf (int) : minimum samples to take category average into account
    smoothing (int) : smoothing effect to balance categorical average vs prior  
    """ 
    assert len(trn_series) == len(target)
    assert trn_series.name == tst_series.name
    temp = pd.concat([trn_series, target], axis=1)
    # Compute target mean 
    averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
    # Compute smoothing
    smoothing = 1 / (1 + np.exp(-(averages["count"] - min_samples_leaf) / smoothing))
    # Apply average function to all target data
    prior = target.mean()
    # The bigger the count the less full_avg is taken into account
    averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
    averages.drop(["mean", "count"], axis=1, inplace=True)
    # Apply averages to trn and tst series
    ft_trn_series = pd.merge(
        trn_series.to_frame(trn_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=trn_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index so restore it
    ft_trn_series.index = trn_series.index 
    ft_tst_series = pd.merge(
        tst_series.to_frame(tst_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=tst_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index so restore it
    ft_tst_series.index = tst_series.index
    return add_noise(ft_trn_series, noise_level), add_noise(ft_tst_series, noise_level)
train_encoded, test_encoded = target_encode(train['ps_car_11_cat'],
                                            test['ps_car_11_cat'],
                                            target=train.target,
                                            min_samples_leaf=100,
                                            smoothing=10,
                                            noise_level=0.01)

train['ps_car_11_cat_te'] = train_encoded
train.drop('ps_car_11_cat', axis=1, inplace=True)
meta.loc['ps_car_11_cat', 'keep'] = False  # Updating the meta
test['ps_car_11_cat_te'] = test_encoded
test.drop('ps_car_11_cat', axis=1, inplace=True)
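
To see what min_samples_leaf and smoothing do: the weight s = 1 / (1 + exp(-(n - min_samples_leaf) / smoothing)) slides the encoding from the global prior toward the category mean as the category count n grows. A tiny illustration with made-up numbers (the prior and category mean here are hypothetical):

prior, cat_mean = 0.10, 0.30                 # hypothetical global prior and category mean
for n in [10, 100, 1000]:                    # category counts
    s = 1 / (1 + np.exp(-(n - 100) / 10))    # min_samples_leaf=100, smoothing=10, as above
    print(n, round(prior * (1 - s) + cat_mean * s, 3))
# 10   -> 0.1   (a rare category stays at the prior)
# 100  -> 0.2   (exactly halfway at n = min_samples_leaf)
# 1000 -> 0.3   (a frequent category keeps its own mean)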


9. Exploratory Data Visualization


9.1 Categorical variables

v = meta[(meta.level == 'nominal') & (meta.keep)].index

for f in v:
    fig, ax = plt.subplots(figsize = (20,10))
    
    # Calculate the percentage of target = 1 per category value
    cat_perc = train[[f, 'target']].groupby([f], as_index = False).mean()
    cat_perc.sort_values(by = 'target', ascending = False, inplace = True)
    
    # Bar plot
    # Order the bars descending on target mean
    sns.barplot(ax = ax, x = f, y = 'target', data = cat_perc, order= cat_perc[f])
    plt.title(f'barplot of {f}', fontsize = 18)
    plt.ylabel('% Target', fontsize = 18)
    plt.xlabel(f, fontsize = 18)
    plt.tick_params(axis = 'both', which = 'major', labelsize = 18)
    plt.show()
(Output: one bar plot per nominal variable, showing the share of target = 1 for each category value.)

  • Let us look at the share of customers with target = 1 per value of each categorical variable.
  • As the variables with missing values show, it can be a good idea to keep the missing value -1 as a separate category instead of replacing it.
  • Customers with a missing value appear to have a much higher (and in some cases much lower) probability of filing an insurance claim.


9.2 Interval variables

def corr_heatmap(v):
    correlations = train[v].corr()

    # Create color map ranging between two colors
    cmap = sns.diverging_palette(220, 10, as_cmap=True)

    fig, ax = plt.subplots(figsize=(10,10))
    sns.heatmap(correlations, cmap=cmap, vmax=1.0, center=0, fmt='.2f',
                square=True, linewidths=.5, annot=True, cbar_kws={"shrink": .75})
    plt.show();
    
v = meta[(meta.level == 'interval') & (meta.keep)].index
corr_heatmap(v)

  • We check the correlations between the interval variables.
  • A heatmap is a good way to visualize correlations between variables.
  • The following variables are strongly correlated:
    • ps_reg_02 and ps_reg_03 (0.7)
    • ps_car_12 and ps_car_13 (0.67)
    • ps_car_12 and ps_car_14 (0.58)
    • ps_car_13 and ps_car_15 (0.67)
  • Seaborn has some handy plots to visualize the (linear) relationship between variables; a pairplot could show all of them at once.
  • But since the heatmap already narrowed it down to a few correlated pairs, we will look at each highly correlated pair separately (a pairplot sketch is included below for reference).
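
For reference, the pairplot alternative would look like this (a sketch on a small random sample; the column list is chosen from the correlated pairs above):

cols = ['ps_reg_02', 'ps_reg_03', 'ps_car_12', 'ps_car_13', 'ps_car_14', 'ps_car_15']
sns.pairplot(train[cols].sample(n=5000, random_state=0))
plt.show()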


s = train.sample(frac = 0.1)
  • Note: to speed things up, we take a 10% sample of the train data.


sns.lmplot(x='ps_reg_02', y='ps_reg_03', data=s, hue='target',
           palette='Set1', scatter_kws={'alpha': 0.3})
plt.show()

  • As the regression line shows, there is a linear relationship between ps_reg_02 and ps_reg_03.
  • Thanks to the hue parameter we can see that the regression lines for target = 0 and target = 1 are the same.


sns.lmplot(x='ps_car_12', y='ps_car_13', data=s, hue='target',
           palette='Set1', scatter_kws={'alpha': 0.3})
plt.show()

  • The linear relationship between ps_car_12 and ps_car_13.


sns.lmplot(x='ps_car_12', y='ps_car_14', data=s, hue='target',
           palette='Set1', scatter_kws={'alpha': 0.3})
plt.show()

  • The linear relationship between ps_car_12 and ps_car_14.


sns.lmplot(x='ps_car_15', y='ps_car_13', data=s, hue='target',
           palette='Set1', scatter_kws={'alpha': 0.3})
plt.show()

  • The linear relationship between ps_car_15 and ps_car_13.


  • We could run PCA (principal component analysis) on these variables to reduce the dimensionality.
  • But since the number of correlated variables is small, we will let the model do the heavy lifting.
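
For completeness, the PCA route mentioned above could look like this (a sketch; n_components=2 is an arbitrary choice):

from sklearn.decomposition import PCA

v = meta[(meta.level == 'interval') & (meta.keep)].index
pca = PCA(n_components=2)
pca.fit(StandardScaler().fit_transform(train[v]))
print(pca.explained_variance_ratio_)  # share of variance captured by each component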


9.3 Checking the correlations between ordinal variables

v = meta[(meta.level == 'ordinal') & (meta.keep)].index
corr_heatmap(v)

  • For the ordinal variables we do not see many strong correlations.
  • We could, however, look at how the distributions behave when grouped by the target value, as in the sketch below.
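
One possible way to do that (a sketch, shown for a single variable):

v = meta[(meta.level == 'ordinal') & (meta.keep)].index
for f in v[:1]:                       # e.g. only ps_ind_01; drop the slice to plot them all
    sns.boxplot(x='target', y=f, data=train)
    plt.show()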


10. Feature engineering


10.1 Creating dummy variables

v = meta[(meta.level == 'nominal') & (meta.keep)].index
print(f'Before dummification we have {train.shape[1]} variables in train.')
train = pd.get_dummies(train, columns= v, drop_first= True)
print(f'After dummification we have {train.shape[1]} variables in train.')
Before dummification we have 57 variables in train.
After dummification we have 109 variables in train.
  • The values of categorical variables do not represent any order or magnitude; for instance, category 2 is not twice category 1.
  • We therefore create dummy variables to deal with them.
  • We drop the first dummy of each variable, since its information can be derived from the remaining dummies of the same original variable.
  • In total, 52 dummy variables were created (109 - 57).
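
A toy example of what drop_first does (three categories become two dummy columns; the dropped first category is the implicit all-zeros row):

# 'a' gets no column of its own: rows where both b and c are 0/False mean 'a'
print(pd.get_dummies(pd.Series(list('abcb')), drop_first=True))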


10.2 Creating interaction variables

v = meta[(meta.level == 'interval') & (meta.keep)].index

poly = PolynomialFeatures(degree = 2, interaction_only= False, include_bias= False)
# get_feature_names was removed in newer scikit-learn; use get_feature_names_out there
interactions = pd.DataFrame(data = poly.fit_transform(train[v]), columns=poly.get_feature_names(v))
interactions.drop(v, axis = 1, inplace = True) # Remove the original columns

# Concat the interaction variables to the train data
print(f'Before creating interactions we have {train.shape[1]} variables in train.')

train = pd.concat([train, interactions], axis = 1)

print(f'After creating interactions we have {train.shape[1]} variables in train.')
Before creating interactions we have 109 variables in train.
After creating interactions we have 164 variables in train.
  • The get_feature_names method makes it easy to name and append the interaction variables.
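
The 55 new columns (164 - 109) are the 10 squares plus the 45 pairwise products of the interval variables, with names built from the inputs:

print(list(interactions.columns[:3]))
# e.g. ['ps_reg_01^2', 'ps_reg_01 ps_reg_02', 'ps_reg_01 ps_reg_03']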


11. Feature selection


11.1 Removing features with low or zero variance

selector = VarianceThreshold(threshold=0.01)
selector.fit(train.drop(['id', 'target'], axis = 1)) # Fit to train without id and target variables

v = train.drop(['id', 'target'], axis = 1).columns[~selector.get_support()]  # features below the threshold
print(f'{len(v)} variables have too low variance.')
print(f'These variables are {list(v)}')
28 variables have too low variance.
These variables are ['ps_ind_10_bin', 'ps_ind_11_bin', 'ps_ind_12_bin', 'ps_ind_13_bin', 'ps_car_12', 'ps_car_14', 'ps_car_11_cat_te', 'ps_ind_05_cat_2', 'ps_ind_05_cat_5', 'ps_car_01_cat_1', 'ps_car_01_cat_2', 'ps_car_04_cat_3', 'ps_car_04_cat_4', 'ps_car_04_cat_5', 'ps_car_04_cat_6', 'ps_car_04_cat_7', 'ps_car_06_cat_2', 'ps_car_06_cat_5', 'ps_car_06_cat_8', 'ps_car_06_cat_12', 'ps_car_06_cat_16', 'ps_car_06_cat_17', 'ps_car_09_cat_4', 'ps_car_10_cat_1', 'ps_car_10_cat_2', 'ps_car_12^2', 'ps_car_12 ps_car_14', 'ps_car_14^2']
  • Here we remove features with zero or very low variance.
  • Sklearn has a handy method for this: VarianceThreshold. By default it removes features with zero variance.
  • Since the previous steps showed there are no zero-variance variables, that alone would not remove anything here.
  • Removing features with a variance below 1%, however, drops 28 variables.
  • Selecting on variance would thus cost us rather many variables (28). But since we do not have that many to begin with, we will let the classifier choose.
  • For data sets with many more variables, this step can reduce processing time.
  • Sklearn also comes with other feature selection methods.
  • One of them is SelectFromModel, which lets another classifier pick the best features and continues with those.
  • Below we do that with a Random Forest.


11.2 Selecting features with a Random Forest and SelectFromModel

X_train = train.drop(['id', 'target'], axis = 1)
y_train = train['target']

feat_labels = X_train.columns

rf = RandomForestClassifier(n_estimators= 1000, random_state= 0, n_jobs= -1)
rf.fit(X_train, y_train)
importances = rf.feature_importances_

indices = np.argsort(rf.feature_importances_)[::-1]

for f in range(X_train.shape[1]):
    print('%2d) %-*s %f' % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))
 1) ps_car_11_cat_te               0.021062
 2) ps_car_13^2                    0.017319
 3) ps_car_13                      0.017288
 4) ps_car_12 ps_car_13            0.017244
 5) ps_car_13 ps_car_14            0.017148
 6) ps_reg_03 ps_car_13            0.017067
 7) ps_car_13 ps_car_15            0.016812
 8) ps_reg_01 ps_car_13            0.016788
 9) ps_reg_03 ps_car_14            0.016261
10) ps_reg_03 ps_car_12            0.015580
11) ps_reg_03 ps_car_15            0.015165
12) ps_car_14 ps_car_15            0.015012
13) ps_car_13 ps_calc_01           0.014751
14) ps_car_13 ps_calc_03           0.014726
15) ps_car_13 ps_calc_02           0.014673
16) ps_reg_02 ps_car_13            0.014671
17) ps_reg_01 ps_reg_03            0.014666
18) ps_reg_01 ps_car_14            0.014455
19) ps_reg_03^2                    0.014283
20) ps_reg_03                      0.014255
21) ps_reg_03 ps_calc_02           0.013804
22) ps_reg_03 ps_calc_03           0.013758
23) ps_reg_03 ps_calc_01           0.013711
24) ps_calc_10                     0.013696
25) ps_car_14 ps_calc_02           0.013633
26) ps_car_14 ps_calc_01           0.013542
27) ps_car_14 ps_calc_03           0.013499
28) ps_calc_14                     0.013363
29) ps_car_12 ps_car_14            0.012968
30) ps_ind_03                      0.012923
31) ps_car_14                      0.012806
32) ps_car_14^2                    0.012734
33) ps_reg_02 ps_car_14            0.012671
34) ps_calc_11                     0.012585
35) ps_reg_02 ps_reg_03            0.012559
36) ps_ind_15                      0.012153
37) ps_car_12 ps_car_15            0.010944
38) ps_car_15 ps_calc_03           0.010888
39) ps_car_15 ps_calc_02           0.010879
40) ps_car_15 ps_calc_01           0.010851
41) ps_calc_13                     0.010479
42) ps_car_12 ps_calc_01           0.010467
43) ps_car_12 ps_calc_03           0.010340
44) ps_car_12 ps_calc_02           0.010287
45) ps_reg_02 ps_car_15            0.010213
46) ps_reg_01 ps_car_15            0.010201
47) ps_calc_02 ps_calc_03          0.010092
48) ps_calc_01 ps_calc_03          0.010010
49) ps_calc_01 ps_calc_02          0.010005
50) ps_calc_07                     0.009837
51) ps_calc_08                     0.009801
52) ps_reg_01 ps_car_12            0.009480
53) ps_reg_02 ps_calc_01           0.009281
54) ps_reg_02 ps_car_12            0.009270
55) ps_reg_02 ps_calc_03           0.009218
56) ps_reg_02 ps_calc_02           0.009210
57) ps_reg_01 ps_calc_03           0.009043
58) ps_reg_01 ps_calc_01           0.009036
59) ps_calc_06                     0.009021
60) ps_reg_01 ps_calc_02           0.008985
61) ps_calc_09                     0.008808
62) ps_ind_01                      0.008519
63) ps_calc_05                     0.008296
64) ps_calc_04                     0.008122
65) ps_calc_12                     0.008066
66) ps_reg_01 ps_reg_02            0.008024
67) ps_car_15^2                    0.006172
68) ps_car_15                      0.006147
69) ps_calc_01                     0.005971
70) ps_calc_03^2                   0.005967
71) ps_calc_03                     0.005955
72) ps_calc_02                     0.005949
73) ps_calc_01^2                   0.005949
74) ps_calc_02^2                   0.005930
75) ps_car_12                      0.005373
76) ps_car_12^2                    0.005366
77) ps_reg_02^2                    0.005007
78) ps_reg_02                      0.004993
79) ps_reg_01                      0.004152
80) ps_reg_01^2                    0.004116
81) ps_car_11                      0.003787
82) ps_ind_05_cat_0                0.003570
83) ps_ind_17_bin                  0.002847
84) ps_calc_17_bin                 0.002692
85) ps_calc_16_bin                 0.002611
86) ps_calc_19_bin                 0.002534
87) ps_calc_18_bin                 0.002485
88) ps_ind_16_bin                  0.002397
89) ps_ind_04_cat_0                0.002387
90) ps_car_01_cat_11               0.002376
91) ps_ind_04_cat_1                0.002370
92) ps_ind_07_bin                  0.002327
93) ps_car_09_cat_2                0.002292
94) ps_ind_02_cat_1                0.002249
95) ps_car_09_cat_0                0.002115
96) ps_car_01_cat_7                0.002103
97) ps_ind_02_cat_2                0.002093
98) ps_calc_20_bin                 0.002081
99) ps_ind_06_bin                  0.002042
100) ps_calc_15_bin                 0.001985
101) ps_car_06_cat_1                0.001983
102) ps_car_07_cat_1                0.001971
103) ps_ind_08_bin                  0.001952
104) ps_car_09_cat_1                0.001833
105) ps_car_06_cat_11               0.001810
106) ps_ind_09_bin                  0.001731
107) ps_ind_18_bin                  0.001718
108) ps_car_01_cat_10               0.001593
109) ps_car_01_cat_9                0.001580
110) ps_car_06_cat_14               0.001549
111) ps_car_01_cat_6                0.001547
112) ps_car_01_cat_4                0.001545
113) ps_ind_05_cat_6                0.001502
114) ps_ind_02_cat_3                0.001437
115) ps_car_07_cat_0                0.001388
116) ps_car_08_cat_1                0.001345
117) ps_car_01_cat_8                0.001335
118) ps_car_02_cat_1                0.001329
119) ps_car_02_cat_0                0.001314
120) ps_car_06_cat_4                0.001232
121) ps_ind_05_cat_4                0.001212
122) ps_car_01_cat_5                0.001151
123) ps_ind_02_cat_4                0.001149
124) ps_car_06_cat_6                0.001111
125) ps_car_06_cat_10               0.001066
126) ps_ind_05_cat_2                0.001025
127) ps_car_04_cat_1                0.001017
128) ps_car_06_cat_7                0.000991
129) ps_car_04_cat_2                0.000979
130) ps_car_01_cat_3                0.000899
131) ps_car_09_cat_3                0.000879
132) ps_car_01_cat_0                0.000872
133) ps_car_06_cat_15               0.000851
134) ps_ind_14                      0.000846
135) ps_car_06_cat_9                0.000796
136) ps_ind_05_cat_1                0.000740
137) ps_car_06_cat_3                0.000706
138) ps_car_10_cat_1                0.000700
139) ps_ind_12_bin                  0.000689
140) ps_ind_05_cat_3                0.000671
141) ps_car_09_cat_4                0.000631
142) ps_car_01_cat_2                0.000562
143) ps_car_04_cat_8                0.000561
144) ps_car_06_cat_17               0.000511
145) ps_car_06_cat_16               0.000481
146) ps_car_04_cat_9                0.000433
147) ps_car_06_cat_12               0.000422
148) ps_car_06_cat_13               0.000385
149) ps_car_01_cat_1                0.000379
150) ps_ind_05_cat_5                0.000305
151) ps_car_06_cat_5                0.000283
152) ps_ind_11_bin                  0.000218
153) ps_car_04_cat_6                0.000207
154) ps_ind_13_bin                  0.000148
155) ps_car_04_cat_3                0.000146
156) ps_car_06_cat_2                0.000137
157) ps_car_06_cat_8                0.000099
158) ps_car_04_cat_5                0.000098
159) ps_car_04_cat_7                0.000082
160) ps_ind_10_bin                  0.000072
161) ps_car_10_cat_2                0.000062
162) ps_car_04_cat_4                0.000044
  • Here we select features based on the feature importances of a random forest.
  • With Sklearn's SelectFromModel you can specify how many variables to keep.
  • You can set a threshold on the feature importance level manually.
  • But we will simply keep the top 50% of variables.
  • The code above is taken from Sebastian Raschka's GitHub repository.


sfm = SelectFromModel(rf, threshold='median', prefit=True)
print(f'Number of features before selection : {X_train.shape[1]}')

n_features = sfm.transform(X_train).shape[1]
print(f'Number of features after selection : {n_features}')
selected_vars = list(feat_labels[sfm.get_support()])
Number of features before selection : 162
Number of features after selection : 81
  • With SelectFromModel we specify the prefit classifier and the threshold on the feature importances to use.
  • With its get_support method we can then limit the number of variables in the train data.


train = train[selected_vars + ['target']]
  • Keep only the selected variables in train, plus the target column.


12. Feature scaling


12.1 Feature scaling

scaler = StandardScaler()
scaler.fit_transform(train.drop(['target'], axis = 1))
array([[-0.45941104, -1.26665356,  1.05087653, ..., -0.72553616,
        -1.01071913, -1.06173767],
       [ 1.55538958,  0.95034274, -0.63847299, ..., -1.06120876,
        -1.01071913,  0.27907892],
       [ 1.05168943, -0.52765479, -0.92003125, ...,  1.95984463,
        -0.56215309, -1.02449277],
       ...,
       [-0.9631112 ,  0.58084336,  0.48776003, ..., -0.46445747,
         0.18545696,  0.27907892],
       [-0.9631112 , -0.89715418, -1.48314775, ..., -0.91202093,
        -0.41263108,  0.27907892],
       [-0.45941104, -1.26665356,  1.61399304, ...,  0.28148164,
        -0.11358706, -0.72653353]])
  • We can apply a StandardScaler to the train data.
  • Some classifiers perform better when this is done.
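
A common pattern (a sketch, with LogisticRegression as a stand-in classifier) is to put the scaler in a Pipeline, so that during cross-validation it is re-fit on the training folds only and does not leak information:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
# model.fit(train.drop('target', axis=1), train['target'])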


13. Conclusion


13.1 Conclusion

  • An EDA notebook for Porto Seguro's Safe Driver Prediction.
  • This is a study transcription of a Kaggle kernel.
This post is licensed under CC BY 4.0 by the author.