Titanic

Predicting which kinds of passengers survived, using the "Titanic: Machine Learning from Disaster" dataset from Kaggle.

์ด ํ”„๋กœ์ ํŠธ๋Š” Ubuntu 18.04LTS ํ™˜๊ฒฝ์—์„œ ์ œ์ž‘ํ–ˆ์Šต๋‹ˆ๋‹ค.

Index

  • Requirement
  • Installation
  • Experiments

Requirement

  • Ubuntu 18.04 LTS
  • Anaconda (conda) with Python 3.7
  • scikit-learn, pandas, jupyter
  • git
  • kaggle CLI

Installation

$ conda create -n titanic python=3.7 scikit-learn pandas jupyter
$ conda activate titanic
$ conda list
  1. Create a new conda environment named titanic to work in
    • Python 3.7, scikit-learn, pandas, and jupyter are installed at creation time
  2. Activate the titanic environment with conda activate titanic
  3. Check that the packages were installed correctly with conda list

First, create the GitHub repository where this project will be developed and saved.

ํšŒ์›๊ฐ€์ž… - ์šฐ์ธก์ƒ๋‹จ ํ”„๋กœํ•„ ์‚ฌ์ง„ - Your repositories - new๋ฅผ ๋ˆ„๋ฅธ๋‹ค์Œ Repository name์„ ์„ค์ •ํ•˜๊ณ , Repository๋ฅผ Public์œผ๋กœ ํ• ์ง€, Private๋กœ ํ• ์ง€ ์„ ํƒํ•œ ๋‹ค์Œ Create repository๋ฅผ ๋ˆ„๋ฅด๋ฉด ์‰ฝ๊ฒŒ ์ƒ์„ฑ๋œ๋‹ค.

I created a Private repository named titanic.

๊ทธ ๋‹ค์Œ์—๋Š” ํ„ฐ๋ฏธ๋„ ํ™˜๊ฒฝ์—์„œ Git์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก ๋‹ค์šด๋กœ๋“œํ•œ๋‹ค.

$ sudo apt install git
$ git --version
  • Installing git on Ubuntu is very simple.
  • Once the installation finishes, run git --version to confirm it installed correctly and to check the version.

Go back to the repository page and press the green Clone button; a panel like the following appears.

Clicking the highlighted part copies the repository's web URL.

Back in the terminal, enter the commands below.

$ git clone https://github.com/Kojungbeom/titanic.git
$ cd titanic
  • Running git clone with the GitHub repository's web URL creates a folder named after the repository and downloads that repository's contents into it.
$ pip install kaggle
  • pip install kaggle installs the Kaggle CLI.


  • On the Kaggle homepage: sign up, log in, click the profile image at the top right, then open My Account.

Click Create New API Token and download kaggle.json. Then create a folder named .kaggle under the home directory (here, /home/ines) and move the downloaded kaggle.json into it.

$ chmod 600 "path to kaggle.json"
$ kaggle competitions list
# Home ๋””๋ ‰ํ† ๋ฆฌ๋กœ ์ด๋™ํ•œ๋‹ค.
$ cd
$ cd titanic
$ kaggle competitions download -c titanic
  • In the terminal, go into ~/.kaggle and run chmod 600 on kaggle.json so that only the owner can read and write it. (There are reference blog posts with the details.)
  • Run kaggle competitions list and check that the competition list prints correctly.
  • The titanic dataset was downloaded into my GitHub repository folder.
  • Unzip it and delete the .zip file.
  • The unzipped folder has the same name as the repository, so rename it to dataset to avoid confusion.
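The unzip-and-delete step can also be scripted. A minimal sketch using only the standard library (the archive name titanic.zip is what the Kaggle CLI produces for this competition; the helper name is illustrative):

```python
import os
import zipfile

def extract_and_remove(zip_path, target_dir):
    """Extract a competition zip into target_dir, then delete the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target_dir)   # e.g. train.csv and test.csv land in target_dir
    os.remove(zip_path)             # the .zip is no longer needed

# extract_and_remove("titanic.zip", "dataset")
```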

Experiments

์œ„์—์„œ ๋งŒ๋“ค์—ˆ๋˜ home directory์— cloneํ•ด๋†จ๋˜ ํด๋” titanic์— src๋ผ๋Š” ํด๋”๋ฅผ ๋งŒ๋“ค๊ณ  jupyter notebook์„ ์‹คํ–‰ํ•œ๋‹ค.

$ cd titanic
$ mkdir src
$ jupyter notebook

๋‹ค์Œ๊ณผ ๊ฐ™์€ ์›น์‚ฌ์ดํŠธ๊ฐ€ ์—ด๋ฆฌ๋ฉด ์„ฑ๊ณต

srcํด๋”๋กœ ๋“ค์–ด๊ฐ€์„œ ์ƒˆ๋กœ์šด jupyter notebook ํŒŒ์ผ์„ ๋งŒ๋“ ๋‹ค.

Tip) jupyter notebook์€ ๊ฐœ๋ฐœํ•˜๊ธฐ ์ข‹์€ ํ™˜๊ฒฝ์„ ์ œ๊ณตํ•˜๊ธฐ ๋•Œ๋ฌธ์—, ๋” ํŽธ๋ฆฌํ•˜๊ฒŒ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก ๋‹จ์ถ•ํ‚ค๋ฅผ ์ตํžˆ๋Š” ๊ฒƒ์„ ์ถ”์ฒœํ•œ๋‹ค! ์ฐธ๊ณ ํ•  ๋ธ”๋กœ๊ทธ

์ž ์ด์ œ ์ •๋ง ์‹œ์ž‘์ด๋‹ค.

์šฐ์„  ์ฒ˜์Œ์—๋Š” ์‚ฌ์šฉํ•  library๋ฅผ importํ•ด์ฃผ๋Š”๊ฒŒ ์ผ๋ฐ˜์ ์ด๋‹ค.

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
  • numpy turns the data into array types for training, testing, and prediction.
  • pandas makes it easy to load and manipulate the data.
  • matplotlib is used to visualize the data.

For now, only these three are imported. Before thinking about which algorithm to train the model with, the data itself has to be understood first.

๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ณ  ์‚ดํŽด๋ณด๊ธฐ

train_data = pd.read_csv("path to train.csv")
test_data = pd.read_csv("path to test.csv")
train_data.head()

train_data.info()
test_data.info()

train_data.describe()

Creating a new Title column

(Based on a reference kernel.)

def makeTitle(data):
    data['Title'] = data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

makeTitle(test_data)
makeTitle(train_data)

Honorifics like Mr and Mrs were extracted from the names, and a new Title column was added to both train_data and test_data. Checking with value_counts() gave the following result.
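The extraction pattern captures the word immediately preceding a period in each name. The same regex can be checked with the standard library's re module (sample names are from the dataset's format):

```python
import re

# the same pattern passed to str.extract above
TITLE_RE = re.compile(r' ([A-Za-z]+)\.')

def extract_title(name):
    m = TITLE_RE.search(name)
    return m.group(1) if m else None

print(extract_title("Braund, Mr. Owen Harris"))                        # Mr
print(extract_title("Cumings, Mrs. John Bradley (Florence Briggs)"))   # Mrs
```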

๋‹ค์Œ์—๋Š” Train_data์—๋งŒ ์žˆ๋Š” Survived column์„ drop์‹œํ‚ค๊ณ  test_data ์™€ ๊ฒฐํ•ฉ์‹œ์ผœ์„œ ๊ฐ™์ด ๊ฐ€๊ณตํ•œ๋‹ค.

# ๋ฐ์ดํ„ฐ ํ•ฉ์ณ์„œ ๊ฐ€๊ณตํ•˜๊ธฐ
all_data = pd.concat([train_data, test_data], axis=0)
all_data.info()

According to the output above, Age, Fare, Embarked, and Cabin contain missing values!

# ๊ฒฐ์ธก๊ฐ’์„ ์ฑ„์šฐ๊ธฐ์œ„ํ•œ ํ‰๊ท ๊ฐ’ ์ค€๋น„
mr_mean = all_data[train_data['Title'] == 'Mr']['Age'].mean()
miss_mean = all_data[train_data['Title'] == 'Miss']['Age'].mean()
mrs_mean = all_data[train_data['Title'] == 'Mrs']['Age'].mean()
master_mean = all_data[train_data['Title'] == 'Master']['Age'].mean()
dr_mean = all_data[train_data['Title'] == 'Dr']['Age'].mean()
rev_mean = all_data[train_data['Title'] == 'Rev']['Age'].mean()
major_mean = all_data[train_data['Title'] == 'Major']['Age'].mean()
mlle_mean = all_data[train_data['Title'] == 'Mlle']['Age'].mean()
col_mean = all_data[train_data['Title'] == 'Col']['Age'].mean()
age_mean = all_data['Age'].mean()

Age ์—ด์˜ ๋น„์–ด์žˆ๋Š”๊ณณ์€ ์–ด๋–ป๊ฒŒ ์ฑ„์›Œ์ค„๊นŒ ํ•˜๋‹ค๊ฐ€, ๋งŒ๋“ค์–ด๋‚ธ Title data๋ฅผ ๊ฐ€์ง€๊ณ  ๊ฐ Title์˜ ๋‚˜์ด์˜ ํ‰๊ท ์œผ๋กœ ๋„ฃ์–ด์ฃผ๋Š”๊ฒŒ ์ข‹๊ฒ ๋‹ค๊ณ  ์ƒ๊ฐํ•˜์—ฌ, ์กด์žฌํ•˜๋Š” Age๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•ด์„œ ๊ฐ๊ฐ์˜ ํ‰๊ท ์„ ๊ตฌํ•ด๋ƒˆ๋‹ค. ํ•˜๋‚˜๋ฐ–์— ์—†๋Š” Titleํ•ญ๋ชฉ์— ๋Œ€ํ•ด์„œ๋Š” ๊ตณ์ด ํ‰๊ท ์„ ๋งŒ๋“ค์ง€ ์•Š์•˜๋‹ค.

all_array_data = np.array(all_data)

# NaN entries have to be detected with np.isnan rather than ==
title_means = {
    'Mr': mr_mean, 'Miss': miss_mean, 'Mrs': mrs_mean,
    'Master': master_mean, 'Dr': dr_mean, 'Rev': rev_mean,
    'Major': major_mean, 'Mlle': mlle_mean, 'Col': col_mean,
}
for row in all_array_data:
    if np.isnan(row[4]):                                     # column 4 is Age
        # titles without their own mean fall back to the overall mean
        row[4] = round(title_means.get(row[11], age_mean))   # column 11 is Title

Converting to an array and processing it dropped all the column names, so they are defined again and a new dataframe is built.

column_list = ['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Title']
new_all_data = pd.DataFrame(all_array_data, columns=column_list)

# Fill the remaining missing values with the value directly above them (a forward fill)
new_all_data['Embarked'] = new_all_data['Embarked'].ffill()
new_all_data['Fare'] = new_all_data['Fare'].ffill()

# Check that everything has been filled in
new_all_data.info()

์œ„์— ๋ณด์ด๋Š” ๊ฒƒ ์ฒ˜๋Ÿผ ๊ฒฐ์ธก๊ฐ’๋“ค์€ ์ „๋ถ€ ์—†์–ด์กŒ๋‹ค.

๋ฐ์ดํ„ฐ ๋งคํ•‘

์ด์ œ Training์— ์‚ฌ์šฉํ•  ๋ฐ์ดํ„ฐ๋ฅผ ๋งคํ•‘ํ•˜๊ณ , ๋‚˜๋จธ์ง€๋Š” drop์‹œํ‚ค๋Š” ์ž‘์—…์ด ํ•„์š”ํ•˜๋‹ค

# Split ages into decade bins
for data in [new_all_data]:
    data.loc[data['Age'] <= 10, 'Age'] = 0
    data.loc[(data['Age'] > 10) & (data['Age'] <= 20), 'Age'] = 1
    data.loc[(data['Age'] > 20) & (data['Age'] <= 30), 'Age'] = 2
    data.loc[(data['Age'] > 30) & (data['Age'] <= 40), 'Age'] = 3
    data.loc[(data['Age'] > 40) & (data['Age'] <= 50), 'Age'] = 4
    data.loc[(data['Age'] > 50) & (data['Age'] <= 60), 'Age'] = 5
    data.loc[(data['Age'] > 60) & (data['Age'] <= 70), 'Age'] = 6
    data.loc[data['Age'] > 70, 'Age'] = 7
  • Age was split into intervals, one per decade.
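The eight rules above bin by decade with right-inclusive boundaries, which is ceil(age / 10) - 1 clamped to [0, 7]. A small plain-Python sketch of the same mapping (the helper name is illustrative):

```python
import math

def age_bin(age):
    """Map an age to the 0-7 decade bins used above: (0,10]->0, (10,20]->1, ..., >70 -> 7."""
    return min(max(math.ceil(age / 10) - 1, 0), 7)

print(age_bin(34))    # 3
print(age_bin(10))    # 0  (boundaries are right-inclusive)
print(age_bin(80))    # 7
```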
# Build boundaries from the per-class mean fares and map Fare with them
p1 = new_all_data[new_all_data['Pclass']==1]
p2 = new_all_data[new_all_data['Pclass']==2]
p3 = new_all_data[new_all_data['Pclass']==3]
p1_mean = p1['Fare'].mean()
p2_mean = p2['Fare'].mean()
p3_mean = p3['Fare'].mean()
r1 = (p2_mean - p3_mean) / 2
r2 = (p1_mean - p2_mean) / 2

for data in [new_all_data]:
    data.loc[data['Fare'] <= p3_mean+r1, 'Fare'] = 0
    data.loc[(data['Fare'] > p3_mean+r1) & (data['Fare'] <= p2_mean+r2), 'Fare'] = 1
    data.loc[data['Fare'] > p2_mean+r2, 'Fare'] = 2
  • Fare correlates with Pclass, so the mean Fare of each Pclass was used to build the interval boundaries.
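Each cut point above is the midpoint between two adjacent class means: p3_mean + r1 equals (p3_mean + p2_mean) / 2, and p2_mean + r2 equals (p2_mean + p1_mean) / 2. A sketch of the same rule, with made-up means rather than the actual dataset values:

```python
def fare_bin(fare, p1_mean, p2_mean, p3_mean):
    """Bin a fare using midpoints between per-class mean fares (same rule as above)."""
    cut_low = (p3_mean + p2_mean) / 2    # == p3_mean + (p2_mean - p3_mean) / 2
    cut_high = (p2_mean + p1_mean) / 2   # == p2_mean + (p1_mean - p2_mean) / 2
    if fare <= cut_low:
        return 0
    if fare <= cut_high:
        return 1
    return 2

# illustrative means only
print(fare_bin(15.0, p1_mean=84.0, p2_mean=21.0, p3_mean=13.0))   # 0
print(fare_bin(30.0, p1_mean=84.0, p2_mean=21.0, p3_mean=13.0))   # 1
```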
# Map Sex
for data in [new_all_data]:
    data.loc[data['Sex'] == 'male', 'Sex'] = 0
    data.loc[data['Sex'] == 'female', 'Sex'] = 1
  • Sex was simply mapped to 0 and 1.
# Map Embarked
for data in [new_all_data]:
    data.loc[data['Embarked'] == 'S', 'Embarked'] = 0
    data.loc[data['Embarked'] == 'C', 'Embarked'] = 1
    data.loc[data['Embarked'] == 'Q', 'Embarked'] = 2
  • Embarked was likewise mapped to three values.
# ํƒ€์ดํ‹€์„ ๋งคํ•‘, ๊ฐœ์ˆ˜๊ฐ€ ์ž‘์€ Title๋“ค์€ ๋‚˜๋จธ์ง€์™€ ๋ฌถ์–ด์„œ ์ฒ˜๋ฆฌ
for data in [new_all_data]:
    data.loc[data['Title'] == 'Mr', 'Title'] = 0,
    data.loc[data['Title'] == 'Miss', 'Title'] = 1,
    data.loc[data['Title'] == 'Mrs', 'Title'] = 2
    data.loc[data['Title'] == 'Master', 'Title'] = 3,
    data.loc[data['Title'] == 'Dr', 'Title'] = 4,
    data.loc[data['Title'] == 'Rev', 'Title'] = 4,
    data.loc[data['Title'] == 'Col', 'Title'] = 4,
    data.loc[data['Title'] == 'Major', 'Title'] = 4,
    data.loc[data['Title'] == 'Sir', 'Title'] = 4,
    data.loc[data['Title'] == 'Countess', 'Title'] = 4,
    data.loc[data['Title'] == 'Don', 'Title'] = 4,
    data.loc[data['Title'] == 'Jonkheer', 'Title'] = 4,
    data.loc[data['Title'] == 'Lady', 'Title'] = 4,
    data.loc[data['Title'] == 'Ms', 'Title'] = 4,
    data.loc[data['Title'] == 'Capt', 'Title'] = 4,
    data.loc[data['Title'] == 'Mme', 'Title'] = 4
    data.loc[data['Title'] == 'Mlle', 'Title'] = 4
    data.loc[data['Title'] == 'Dona', 'Title'] = 4
  • ์ฒ˜์Œ์— ๋งŒ๋“ค์—ˆ๋˜ Title๊ฐ™์€ ๊ฒฝ์šฐ๋Š” ๊ฐ’์ด ๋งŽ์€ ๊ฒƒ๋“ค์€ ๋‹จ๋…์œผ๋กœ ๋งคํ•‘ํ•˜๊ณ , ๋‚˜๋จธ์ง€ ๊ฐ’์ด ์ ์€ ๊ฒƒ๋“ค์€ ๋ฌถ์–ด์„œ ํ•˜๋‚˜์˜ Class๋กœ ์ฒ˜๋ฆฌํ•˜์˜€๋‹ค.
new_all_data.head()

Now remove the features that will not be used for training.

# Mapping is done, so drop the remaining columns that do not look useful
drop_list = ['Ticket', 'SibSp', 'Parch', 'Name', 'Cabin', 'PassengerId']

# Split the combined data back into train and test parts
final_train_data = new_all_data[new_all_data['PassengerId']<=891]
final_test_data = new_all_data[new_all_data['PassengerId']>891]

final_train_data = final_train_data.drop(drop_list, axis=1)
final_test_data = final_test_data.drop(drop_list, axis=1)

final_train_data.info()
final_test_data.info()

Training

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

์–ด๋–ค ๋ชจ๋ธ์„ ์“ธ๊นŒํ•˜๋‹ค๊ฐ€ ์—ฌ๋Ÿฌ๊ฐ€์ง€๋ฅผ ์จ๋ณด๊ณ  ์„ฑ๋Šฅ์„ ๋น„๊ตํ•œ ํ›„, ๊ฐ€์žฅ ์ข‹์€ Model๋กœ ์‚ฌ์šฉํ•˜๊ธฐ๋กœ ํ–ˆ๋‹ค.

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold

rf_clf2 = RandomForestClassifier(random_state=42)
param_grid = {
    'n_estimators': [200, 300, 400, 500],
    'max_depth': [6, 20, 30, 40],
    'min_samples_leaf' : [3, 5, 7, 10],
    'min_samples_split' : [2, 3, 5, 10]
}
kf = KFold(n_splits=10, shuffle=True, random_state=42)
rf_grid = GridSearchCV(rf_clf2, param_grid=param_grid, scoring='accuracy', cv=kf)
# (label is defined below, from the training data's Survived column)
rf_grid.fit(final_train_data, label)
rf_grid.best_params_

svm_clf = SVC()
param_svm_grid = {
    'degree': [1, 10, 20, 30],   # degree only matters for the 'poly' kernel; SVC defaults to 'rbf'
    'C' : [0.1, 10, 20, 30, 40, 70, 100]
}
svm_grid = GridSearchCV(svm_clf, param_grid=param_svm_grid, scoring='accuracy')
svm_grid.fit(final_train_data, label)
svm_grid.best_params_


gbrt = GradientBoostingClassifier(random_state=42)
param_gbrt_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15, 20],
    'min_samples_leaf' : [3,5,7,9],
    'min_samples_split' : [2,4,6,8],
    'learning_rate' : [0.1, 1]
}
gbrt_grid = GridSearchCV(gbrt, param_grid=param_gbrt_grid, scoring='accuracy')
gbrt_grid.fit(final_train_data, label)
gbrt_grid.best_params_
  • Before training, GridSearchCV was used to find good hyperparameters for each model on this dataset.
# Survived exists only in the training data
label_data = pd.read_csv("path to train.csv")

label = label_data['Survived']

knn_clf = KNeighborsClassifier(n_neighbors = 8)
score = cross_val_score(knn_clf, final_train_data, label, cv=5, scoring='accuracy')
print(score)

# Hyperparameters taken from the GridSearch results
rf_clf = RandomForestClassifier(n_estimators=400, max_depth=20, min_samples_leaf=5, min_samples_split=2)
score = cross_val_score(rf_clf, final_train_data, label, cv=5, scoring='accuracy')
print(score)

# Hyperparameters taken from the GridSearch results
gbrt = GradientBoostingClassifier(learning_rate=0.1, max_depth=5, min_samples_leaf=9, min_samples_split=2, n_estimators=100)
gbrt.fit(final_train_data, label)
gbrt.score(final_train_data, label)

# Hyperparameters taken from the GridSearch results
svm_clf = SVC(C=10, degree=1)
score = cross_val_score(svm_clf, final_train_data, label, cv=5, scoring='accuracy')
print(score)
  • label์„ ์ƒ์„ฑํ•ด์ฃผ๊ณ , ๋ชจ๋ธ๋“ค์„ Cross_val_score์„ ์ด์šฉํ•ด์„œ ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•œ๋‹ค.

๊ฐ๊ฐ์˜ ๋ชจ๋ธ์ด ๋ฐ›์€ ์ ์ˆ˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์•˜๋‹ค.

KNN
[0.81005587 0.78651685 0.76404494 0.78651685 0.80898876]

Random Forest
[0.81564246 0.80337079 0.79775281 0.79213483 0.80337079]

GradientBoostingClassifier (training-set accuracy from .score, not cross-validation)
[0.8529741863075196]

SVM
[0.82681564 0.82022472 0.83146067 0.78651685 0.86516854]
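Since cross_val_score returns one accuracy per fold, the single number to compare models by is the fold mean; for the SVM scores above:

```python
# fold accuracies reported above for the SVM model
svm_scores = [0.82681564, 0.82022472, 0.83146067, 0.78651685, 0.86516854]

mean_acc = sum(svm_scores) / len(svm_scores)
print(round(mean_acc, 4))  # 0.826
```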

Creating submission files with SVM, Random Forest, and GradientBoosting

A submission file was created for each of svm, Random Forest, and GradientBoosting, and each was submitted.

svm_clf.fit(final_train_data, label)
predictions = svm_clf.predict(final_test_data)

Now the predictions are used to build the submission files submission.csv, submission_rf.csv, and submission_gbf.csv.

submission = pd.DataFrame({"PassengerId" : test_data['PassengerId'],
                          "Survived" : predictions})
submission.to_csv('submission.csv', index=False)
submission = pd.read_csv('submission.csv')
submission.head()

rf_clf.fit(final_train_data, label)
predictions_rf = rf_clf.predict(final_test_data)
submission_rf = pd.DataFrame({"PassengerId" : test_data['PassengerId'],
                          "Survived" : predictions_rf})
submission_rf.to_csv('submission_rf.csv', index=False)
submission_rf = pd.read_csv('submission_rf.csv')
submission_rf.head()

gbrt.fit(final_train_data, label)
predictions_gbf = gbrt.predict(final_test_data)
submission_gbf = pd.DataFrame({"PassengerId" : test_data['PassengerId'],
                          "Survived" : predictions_gbf})
submission_gbf.to_csv('submission_gbf.csv', index=False)
submission_gbf = pd.read_csv('submission_gbf.csv')
submission_gbf.head()

์œ„์˜ ์ฝ”๋“œ๋ฅผ ์‹คํ–‰ํ•˜๋ฉด ์ž์‹ ์˜ ์ฝ”๋“œ๊ฐ€ ์žˆ๋Š” ํด๋”์— submission.csv๊ฐ€ ๋งŒ๋“ค์–ด์ง„๊ฑธ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค!

Submitting the submission files

  • ์—ฌ๊ธฐ์—์„œ Submit Predictions ํด๋ฆญํ•œ ๋’ค, ํŒŒ์ผ์„ ์˜ฌ๋ฆฌ๊ณ , Make Submission์„ ํด๋ฆญํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ ์ˆ˜๋ฅผ ๋ณผ ์ˆ˜๊ฐ€ ์žˆ๋‹ค.

Uploading and testing all three gave the following results.

Gradient Boosting Score
-> 0.75598

Random Forest Score
-> 0.7751

SVM Score
-> 0.78468
