Notes summarizing a Fast Campus lecture.

Link

Chapter 2

Predicting each customer's yearly amount spent with Linear Regression

2-1 Loading modules and data

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("data/ecommerce.csv")

2-2 Checking the data's characteristics

Check the first rows of the data

data.head()
Email Address Avatar Avg. Session Length Time on App Time on Website Length of Membership Yearly Amount Spent
0 mstephenson@fernandez.com 835 Frank Tunnel\nWrightmouth, MI 82180-9605 Violet 34.497268 12.655651 39.577668 4.082621 587.951054
1 hduke@hotmail.com 4547 Archer Common\nDiazchester, CA 06566-8576 DarkGreen 31.926272 11.109461 37.268959 2.664034 392.204933
2 pallen@yahoo.com 24645 Valerie Unions Suite 582\nCobbborough, D... Bisque 33.000915 11.330278 37.110597 4.104543 487.547505
3 riverarebecca@gmail.com 1414 David Throughway\nPort Jason, OH 22070-1220 SaddleBrown 34.305557 13.717514 36.721283 3.120179 581.852344
4 mstephens@davidson-herman.com 14023 Rodriguez Passage\nPort Jacobville, PR 3... MediumAquaMarine 33.330673 12.795189 37.536653 4.446308 599.406092
data.head(10)
Email Address Avatar Avg. Session Length Time on App Time on Website Length of Membership Yearly Amount Spent
0 mstephenson@fernandez.com 835 Frank Tunnel\nWrightmouth, MI 82180-9605 Violet 34.497268 12.655651 39.577668 4.082621 587.951054
1 hduke@hotmail.com 4547 Archer Common\nDiazchester, CA 06566-8576 DarkGreen 31.926272 11.109461 37.268959 2.664034 392.204933
2 pallen@yahoo.com 24645 Valerie Unions Suite 582\nCobbborough, D... Bisque 33.000915 11.330278 37.110597 4.104543 487.547505
3 riverarebecca@gmail.com 1414 David Throughway\nPort Jason, OH 22070-1220 SaddleBrown 34.305557 13.717514 36.721283 3.120179 581.852344
4 mstephens@davidson-herman.com 14023 Rodriguez Passage\nPort Jacobville, PR 3... MediumAquaMarine 33.330673 12.795189 37.536653 4.446308 599.406092
5 alvareznancy@lucas.biz 645 Martha Park Apt. 611\nJeffreychester, MN 6... FloralWhite 33.871038 12.026925 34.476878 5.493507 637.102448
6 katherine20@yahoo.com 68388 Reyes Lights Suite 692\nJosephbury, WV 9... DarkSlateBlue 32.021596 11.366348 36.683776 4.685017 521.572175
7 awatkins@yahoo.com Unit 6538 Box 8980\nDPO AP 09026-4941 Aqua 32.739143 12.351959 37.373359 4.434273 549.904146
8 vchurch@walter-martinez.com 860 Lee Key\nWest Debra, SD 97450-0495 Salmon 33.987773 13.386235 37.534497 3.273434 570.200409
9 bonnie69@lin.biz PSC 2734, Box 5255\nAPO AA 98456-7482 Brown 31.936549 11.814128 37.145168 3.202806 427.199385

Check the last rows of the data

data.tail()
Email Address Avatar Avg. Session Length Time on App Time on Website Length of Membership Yearly Amount Spent
495 lewisjessica@craig-evans.com 4483 Jones Motorway Suite 872\nLake Jamiefurt,... Tan 33.237660 13.566160 36.417985 3.746573 573.847438
496 katrina56@gmail.com 172 Owen Divide Suite 497\nWest Richard, CA 19320 PaleVioletRed 34.702529 11.695736 37.190268 3.576526 529.049004
497 dale88@hotmail.com 0787 Andrews Ranch Apt. 633\nSouth Chadburgh, ... Cornsilk 32.646777 11.499409 38.332576 4.958264 551.620145
498 cwilson@hotmail.com 680 Jennifer Lodge Apt. 808\nBrendachester, TX... Teal 33.322501 12.391423 36.840086 2.336485 456.469510
499 hannahwilson@davidson.com 49791 Rachel Heights Apt. 898\nEast Drewboroug... DarkMagenta 33.715981 12.418808 35.771016 2.735160 497.778642

Check the data info

data.info()
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Email                 500 non-null    object 
 1   Address               500 non-null    object 
 2   Avatar                500 non-null    object 
 3   Avg. Session Length   500 non-null    float64
 4   Time on App           500 non-null    float64
 5   Time on Website       500 non-null    float64
 6   Length of Membership  500 non-null    float64
 7   Yearly Amount Spent   500 non-null    float64
dtypes: float64(5), object(3)
memory usage: 31.4+ KB

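The non-null count of every column equals the total number of rows (500), so there are no missing values.

Check the overall summary statistics of the data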

data.describe()
| | Avg. Session Length | Time on App | Time on Website | Length of Membership | Yearly Amount Spent |
|---|---|---|---|---|---|
| count | 500.000000 | 500.000000 | 500.000000 | 500.000000 | 500.000000 |
| mean | 33.053194 | 12.052488 | 37.060445 | 3.533462 | 499.314038 |
| std | 0.992563 | 0.994216 | 1.010489 | 0.999278 | 79.314782 |
| min | 29.532429 | 8.508152 | 33.913847 | 0.269901 | 256.670582 |
| 25% | 32.341822 | 11.388153 | 36.349257 | 2.930450 | 445.038277 |
| 50% | 33.082008 | 11.983231 | 37.069367 | 3.533975 | 498.887875 |
| 75% | 33.711985 | 12.753850 | 37.716432 | 4.126502 | 549.313828 |
| max | 36.139662 | 15.126994 | 40.005182 | 6.922689 | 765.518462 |

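describe() shows the scale of each column.

Scale: the range from min to max

It can also help detect outliers.

If max is far above the 75% value (or min far below the 25% value) relative to the spread of the rest of the data, the column may contain outliers.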
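The quartile comparison above can be turned into a concrete check. A minimal sketch on a hypothetical toy column; the 1.5 * IQR threshold is a common rule of thumb assumed here, not something taken from the lecture.

```python
import pandas as pd

# Hypothetical toy column; the last value is deliberately extreme.
spend = pd.Series([10.0, 11.0, 12.0, 12.5, 13.0, 13.5, 14.0, 40.0], name="spend")

stats = spend.describe()
q1, q3 = stats["25%"], stats["75%"]
iqr = q3 - q1  # spread of the middle 50% of the values

# Assumed rule of thumb: flag values more than 1.5 * IQR outside the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = spend[(spend < lower) | (spend > upper)]
print(outliers)
```

Only the value 40.0 is flagged, which matches the intuition of a max far above the 75% value.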

2-3 Handling unnecessary variables

Selecting a column

data['Length of Membership']
0      4.082621
1      2.664034
2      4.104543
3      3.120179
4      4.446308
         ...   
495    3.746573
496    3.576526
497    4.958264
498    2.336485
499    2.735160
Name: Length of Membership, Length: 500, dtype: float64

To select two or more columns, wrap the names in a second pair of brackets

data[['Length of Membership', 'Yearly Amount Spent']]
Length of Membership Yearly Amount Spent
0 4.082621 587.951054
1 2.664034 392.204933
2 4.104543 487.547505
3 3.120179 581.852344
4 4.446308 599.406092
... ... ...
495 3.746573 573.847438
496 3.576526 529.049004
497 4.958264 551.620145
498 2.336485 456.469510
499 2.735160 497.778642

500 rows × 2 columns

data = data[['Avg. Session Length', 'Time on App', 'Time on Website', 'Length of Membership', 'Yearly Amount Spent']]
data.head()
Avg. Session Length Time on App Time on Website Length of Membership Yearly Amount Spent
0 34.497268 12.655651 39.577668 4.082621 587.951054
1 31.926272 11.109461 37.268959 2.664034 392.204933
2 33.000915 11.330278 37.110597 4.104543 487.547505
3 34.305557 13.717514 36.721283 3.120179 581.852344
4 33.330673 12.795189 37.536653 4.446308 599.406092

2-4 Train Test Split

Definition: splitting the data into a training set and a test set

Idea: we need to check whether the model handles data it did not see during training

Splitting the data

from sklearn.model_selection import train_test_split

General usage pattern (X holds the independent variables, y the dependent variable)

X_train, X_test, y_train, y_test = train_test_split(X, y)

Define X and y

X = data[['Avg. Session Length', 'Time on App', 'Time on Website', 'Length of Membership']]
y = data['Yearly Amount Spent']

Split the data

The split ratio can be specified; by default 25% of the rows go to the test set (test_size=0.25)

X_train, X_test, y_train, y_test = train_test_split(X, y)

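When choosing the ratio, make sure enough training data remains.

test_size is the fraction of rows assigned to the test set; random_state is the random seed, so the same value reproduces the same split.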

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)
X_train
Avg. Session Length Time on App Time on Website Length of Membership
205 34.967610 13.919494 37.952013 5.066697
404 32.278443 12.527472 36.688367 3.531402
337 31.827979 12.461147 37.428997 2.974737
440 33.200616 11.965980 36.831536 3.549036
55 33.925297 11.588655 35.252242 3.392050
... ... ... ... ...
343 32.302748 12.815393 37.957810 4.615426
359 32.054262 13.149670 37.650400 4.195614
323 32.762456 10.952353 37.646292 4.019470
280 32.271848 13.485009 37.550880 3.086337
8 33.987773 13.386235 37.534497 3.273434

400 rows × 4 columns
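To make the split sizes concrete, here is a small sketch on hypothetical arrays (100 rows, 4 features) rather than the lecture data. It shows the 25% default and the effect of test_size=0.2.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 4 features.
X = np.arange(400).reshape(100, 4)
y = np.arange(100)

# No test_size given: 25% of the rows go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
print(X_tr.shape, X_te.shape)

# test_size=0.2 keeps 80% of the rows for training.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_tr2.shape, X_te2.shape)
```

With 500 rows and test_size=0.2, the same arithmetic gives the 400 training rows shown above.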

2-5 Linear Regression Model 생성

import statsmodels.api as sm
model = sm.OLS(y_train, X_train)

Training

model = model.fit()

Check the Linear Regression report

model.summary()
OLS Regression Results

| | | | |
|---|---|---|---|
| Dep. Variable | Yearly Amount Spent | R-squared (uncentered) | 0.998 |
| Model | OLS | Adj. R-squared (uncentered) | 0.998 |
| Method | Least Squares | F-statistic | 4.798e+04 |
| Date | Wed, 23 Mar 2022 | Prob (F-statistic) | 0.00 |
| Time | 18:18:14 | Log-Likelihood | -1820.0 |
| No. Observations | 400 | AIC | 3648. |
| Df Residuals | 396 | BIC | 3664. |
| Df Model | 4 | | |
| Covariance Type | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Avg. Session Length | 11.9059 | 0.869 | 13.696 | 0.000 | 10.197 | 13.615 |
| Time on App | 34.3257 | 1.121 | 30.610 | 0.000 | 32.121 | 36.530 |
| Time on Website | -14.1405 | 0.812 | -17.405 | 0.000 | -15.738 | -12.543 |
| Length of Membership | 61.0149 | 1.144 | 53.318 | 0.000 | 58.765 | 63.265 |

| | | | |
|---|---|---|---|
| Omnibus | 0.490 | Durbin-Watson | 1.987 |
| Prob(Omnibus) | 0.783 | Jarque-Bera (JB) | 0.606 |
| Skew | -0.022 | Prob(JB) | 0.739 |
| Kurtosis | 2.814 | Cond. No. | 55.4 |


Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

R-squared

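  • The larger the value, the better the model fits

  • Adj. R-squared is the more appropriate criterion

  • Plain R-squared never decreases when independent variables are added, even uninformative ones

  • Adj. R-squared applies a penalty for the number of independent variables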

Coef

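  • The influence of a variable (strength and direction)

  • The slope: how much the prediction changes when the variable increases by 1, with the other variables held fixed

  • Coefficients of variables on different scales may not be fairly comparable (e.g. 1 won of salary vs. 1 year of time)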

P-value

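  • A measure of whether the estimated coefficient can be trusted (its statistical significance)

  • Ranges from 0 to 1

  • 0.05 or below is conventionally considered acceptable: the coefficient is treated as significantly different from 0

  • Above 0.05, the estimated effect of that variable is judged unreliable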

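SST: the sum of squared differences between the actual values and their mean (total variation)

SSE: the sum of squared differences between the actual values and the predicted values (unexplained error)

SSR: the sum of squared differences between the predicted values and the mean (variation explained by the model)

R-Squared = SSR/SST = 1 - SSE/SST (for a model with an intercept; the uncentered version reported above measures the sums of squares around 0 instead of the mean)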
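The three sums of squares can be checked numerically. A minimal sketch on hypothetical synthetic data, using a NumPy least-squares fit with an intercept, for which SST = SSR + SSE holds.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: y = 3 + 2*x + noise.
x = rng.normal(size=100)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5, size=100)

# Least-squares fit with an intercept column.
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ coef

sst = np.sum((y - y.mean()) ** 2)      # actual vs. mean
sse = np.sum((y - y_hat) ** 2)         # actual vs. predicted
ssr = np.sum((y_hat - y.mean()) ** 2)  # predicted vs. mean

print(np.isclose(sst, ssr + sse))
print(ssr / sst, 1 - sse / sst)  # both expressions give the same R-squared
```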

2-6 Predicting and evaluating with the model

Prediction

pred = model.predict(X_test)

Visualization

plt.figure(figsize=(10, 10))
sns.scatterplot(x=y_test, y=pred)

MSE(Mean Squared Error)

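MSE (Mean Squared Error): the mean of the squared differences between the predicted values and the actual test values

Problem

  • If the raw errors are simply summed and averaged, positive and negative errors cancel out, so the result does not reflect the real size of the error.

Solutions

  1. Absolute value (this gives MAE, Mean Absolute Error)

  2. Square (this gives MSE)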

from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, pred)
482.28901390889246

This number alone does not tell us whether the model is good or bad.

It becomes meaningful when compared with the value from another model.

RMSE(Root Mean Squared Error)

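Because MSE averages squared errors, its value is in squared units of y and looks too large compared with the original scale of y.

Taking the square root brings it back to the scale of y.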

np.sqrt(mean_squared_error(y_test, pred))
21.961079525125637

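As with MSE, this value alone does not tell us whether the model is good or bad, although it is in the same units as y and can be read against typical y values (the mean Yearly Amount Spent is about 499).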
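The cancellation problem and the two fixes can be seen on a tiny example. A minimal sketch with hypothetical values, computing the metrics by hand with NumPy instead of sklearn.

```python
import numpy as np

# Hypothetical actual and predicted values.
y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 30.0])
err = y_pred - y_true  # signed errors: +2, -2, 0

mean_err = err.mean()       # signed errors cancel out to 0
mae = np.abs(err).mean()    # fix 1: absolute value
mse = (err ** 2).mean()     # fix 2: square
rmse = np.sqrt(mse)         # back on the scale of y

print(mean_err, mae, mse, rmse)
```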

2-7 Numpy and Pandas

Operations on NumPy arrays are fast

NumPy and pandas are interoperable

Numpy

a = [1, 2, 3]
b = [4, 5, 6]
np.array(a)
array([1, 2, 3])
np.array([a, b])
array([[1, 2, 3],
       [4, 5, 6]])

Pandas

pd.DataFrame([a, b])
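| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |

Column names and row labels can also be set

pd.DataFrame([a, b], columns=['a','b','c'], index=['x','y'])
| | a | b | c |
|---|---|---|---|
| x | 1 | 2 | 3 |
| y | 4 | 5 | 6 |

A Series is one-dimensional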

pd.Series(a)
0    1
1    2
2    3
dtype: int64
data = pd.read_csv('data/eCommerce.csv')

Selecting a single column from a DataFrame returns a Series

type(data)
pandas.core.frame.DataFrame
type(data['Yearly Amount Spent'])
pandas.core.series.Series

A Series can be converted to a DataFrame.

pd.DataFrame(data['Yearly Amount Spent'])
Yearly Amount Spent
0 587.951054
1 392.204933
2 487.547505
3 581.852344
4 599.406092
... ...
495 573.847438
496 529.049004
497 551.620145
498 456.469510
499 497.778642

500 rows × 1 columns

Interoperability between pandas and NumPy

pd.DataFrame(np.array([a, b]))
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |
np.array(data)
array([['mstephenson@fernandez.com',
        '835 Frank Tunnel\nWrightmouth, MI 82180-9605', 'Violet', ...,
        39.57766801952616, 4.0826206329529615, 587.9510539684005],
       ['hduke@hotmail.com',
        '4547 Archer Common\nDiazchester, CA 06566-8576', 'DarkGreen',
        ..., 37.268958868297744, 2.66403418213262, 392.2049334443264],
       ['pallen@yahoo.com',
        '24645 Valerie Unions Suite 582\nCobbborough, DC 99414-7564',
        'Bisque', ..., 37.110597442120856, 4.104543202376424,
        487.54750486747207],
       ...,
       ['dale88@hotmail.com',
        '0787 Andrews Ranch Apt. 633\nSouth Chadburgh, TN 56128',
        'Cornsilk', ..., 38.33257633196044, 4.958264472618699,
        551.6201454762477],
       ['cwilson@hotmail.com',
        '680 Jennifer Lodge Apt. 808\nBrendachester, TX 05000-5873',
        'Teal', ..., 36.84008572976701, 2.336484668112853,
        456.469510066298],
       ['hannahwilson@davidson.com',
        '49791 Rachel Heights Apt. 898\nEast Drewborough, OR 55919-9528',
        'DarkMagenta', ..., 35.771016191612965, 2.7351595670822757,
        497.7786422156802]], dtype=object)
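The dtype=object in the output above comes from mixing text and numeric columns in one array. A minimal sketch on a hypothetical small DataFrame, showing that selecting only the numeric columns first gives a float array; DataFrame.to_numpy() is used here as an alternative to np.array(data).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one text column and two numeric columns.
df = pd.DataFrame({"name": ["a", "b"], "x": [1.5, 2.5], "y": [3.0, 4.0]})

mixed = df.to_numpy()                # text + numbers -> generic object array
numeric = df[["x", "y"]].to_numpy()  # numbers only -> float64 array

print(mixed.dtype)
print(numeric.dtype)
print(numeric.sum())  # numeric arrays support fast vectorized math
```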

Pandas Indexing

data['Yearly Amount Spent']
0      587.951054
1      392.204933
2      487.547505
3      581.852344
4      599.406092
          ...    
495    573.847438
496    529.049004
497    551.620145
498    456.469510
499    497.778642
Name: Yearly Amount Spent, Length: 500, dtype: float64

Two or more columns need a second pair of []: the inner brackets build a list, so the column names are passed as a single object

data[['Time on App', 'Time on Website']]
Time on App Time on Website
0 12.655651 39.577668
1 11.109461 37.268959
2 11.330278 37.110597
3 13.717514 36.721283
4 12.795189 37.536653
... ... ...
495 13.566160 36.417985
496 11.695736 37.190268
497 11.499409 38.332576
498 12.391423 36.840086
499 12.418808 35.771016

500 rows × 2 columns
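The bracket rule can be checked directly on the return types. A minimal sketch on a hypothetical three-column DataFrame.

```python
import pandas as pd

# Hypothetical frame for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

one = df["a"]           # a single label returns a Series
one_col_df = df[["a"]]  # a list with one label returns a DataFrame

cols = ["a", "c"]       # the inner brackets are just a Python list
two = df[cols]          # same as df[["a", "c"]]

print(type(one).__name__, type(one_col_df).__name__, two.shape)
```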

Removing specific columns

By default, drop looks up the given labels in the rows (axis=0).

To remove columns, pass the option axis=1.

data.drop('Yearly Amount Spent', axis=1)
Email Address Avatar Avg. Session Length Time on App Time on Website Length of Membership
0 mstephenson@fernandez.com 835 Frank Tunnel\nWrightmouth, MI 82180-9605 Violet 34.497268 12.655651 39.577668 4.082621
1 hduke@hotmail.com 4547 Archer Common\nDiazchester, CA 06566-8576 DarkGreen 31.926272 11.109461 37.268959 2.664034
2 pallen@yahoo.com 24645 Valerie Unions Suite 582\nCobbborough, D... Bisque 33.000915 11.330278 37.110597 4.104543
3 riverarebecca@gmail.com 1414 David Throughway\nPort Jason, OH 22070-1220 SaddleBrown 34.305557 13.717514 36.721283 3.120179
4 mstephens@davidson-herman.com 14023 Rodriguez Passage\nPort Jacobville, PR 3... MediumAquaMarine 33.330673 12.795189 37.536653 4.446308
... ... ... ... ... ... ... ...
495 lewisjessica@craig-evans.com 4483 Jones Motorway Suite 872\nLake Jamiefurt,... Tan 33.237660 13.566160 36.417985 3.746573
496 katrina56@gmail.com 172 Owen Divide Suite 497\nWest Richard, CA 19320 PaleVioletRed 34.702529 11.695736 37.190268 3.576526
497 dale88@hotmail.com 0787 Andrews Ranch Apt. 633\nSouth Chadburgh, ... Cornsilk 32.646777 11.499409 38.332576 4.958264
498 cwilson@hotmail.com 680 Jennifer Lodge Apt. 808\nBrendachester, TX... Teal 33.322501 12.391423 36.840086 2.336485
499 hannahwilson@davidson.com 49791 Rachel Heights Apt. 898\nEast Drewboroug... DarkMagenta 33.715981 12.418808 35.771016 2.735160

500 rows × 7 columns

Likewise, to drop two or more columns, pass the names as a list in [].

data.drop(['Email', 'Yearly Amount Spent'], axis=1)
Address Avatar Avg. Session Length Time on App Time on Website Length of Membership
0 835 Frank Tunnel\nWrightmouth, MI 82180-9605 Violet 34.497268 12.655651 39.577668 4.082621
1 4547 Archer Common\nDiazchester, CA 06566-8576 DarkGreen 31.926272 11.109461 37.268959 2.664034
2 24645 Valerie Unions Suite 582\nCobbborough, D... Bisque 33.000915 11.330278 37.110597 4.104543
3 1414 David Throughway\nPort Jason, OH 22070-1220 SaddleBrown 34.305557 13.717514 36.721283 3.120179
4 14023 Rodriguez Passage\nPort Jacobville, PR 3... MediumAquaMarine 33.330673 12.795189 37.536653 4.446308
... ... ... ... ... ... ...
495 4483 Jones Motorway Suite 872\nLake Jamiefurt,... Tan 33.237660 13.566160 36.417985 3.746573
496 172 Owen Divide Suite 497\nWest Richard, CA 19320 PaleVioletRed 34.702529 11.695736 37.190268 3.576526
497 0787 Andrews Ranch Apt. 633\nSouth Chadburgh, ... Cornsilk 32.646777 11.499409 38.332576 4.958264
498 680 Jennifer Lodge Apt. 808\nBrendachester, TX... Teal 33.322501 12.391423 36.840086 2.336485
499 49791 Rachel Heights Apt. 898\nEast Drewboroug... DarkMagenta 33.715981 12.418808 35.771016 2.735160

500 rows × 6 columns

drop does not modify the original data; it returns a new DataFrame.

data
Email Address Avatar Avg. Session Length Time on App Time on Website Length of Membership Yearly Amount Spent
0 mstephenson@fernandez.com 835 Frank Tunnel\nWrightmouth, MI 82180-9605 Violet 34.497268 12.655651 39.577668 4.082621 587.951054
1 hduke@hotmail.com 4547 Archer Common\nDiazchester, CA 06566-8576 DarkGreen 31.926272 11.109461 37.268959 2.664034 392.204933
2 pallen@yahoo.com 24645 Valerie Unions Suite 582\nCobbborough, D... Bisque 33.000915 11.330278 37.110597 4.104543 487.547505
3 riverarebecca@gmail.com 1414 David Throughway\nPort Jason, OH 22070-1220 SaddleBrown 34.305557 13.717514 36.721283 3.120179 581.852344
4 mstephens@davidson-herman.com 14023 Rodriguez Passage\nPort Jacobville, PR 3... MediumAquaMarine 33.330673 12.795189 37.536653 4.446308 599.406092
... ... ... ... ... ... ... ... ...
495 lewisjessica@craig-evans.com 4483 Jones Motorway Suite 872\nLake Jamiefurt,... Tan 33.237660 13.566160 36.417985 3.746573 573.847438
496 katrina56@gmail.com 172 Owen Divide Suite 497\nWest Richard, CA 19320 PaleVioletRed 34.702529 11.695736 37.190268 3.576526 529.049004
497 dale88@hotmail.com 0787 Andrews Ranch Apt. 633\nSouth Chadburgh, ... Cornsilk 32.646777 11.499409 38.332576 4.958264 551.620145
498 cwilson@hotmail.com 680 Jennifer Lodge Apt. 808\nBrendachester, TX... Teal 33.322501 12.391423 36.840086 2.336485 456.469510
499 hannahwilson@davidson.com 49791 Rachel Heights Apt. 898\nEast Drewboroug... DarkMagenta 33.715981 12.418808 35.771016 2.735160 497.778642

500 rows × 8 columns

Modifying the original with drop

  1. Assign the result back to data

  2. Add the inplace=True option

data.drop(['Email', 'Yearly Amount Spent'], axis=1, inplace=True)
data
Address Avatar Avg. Session Length Time on App Time on Website Length of Membership
0 835 Frank Tunnel\nWrightmouth, MI 82180-9605 Violet 34.497268 12.655651 39.577668 4.082621
1 4547 Archer Common\nDiazchester, CA 06566-8576 DarkGreen 31.926272 11.109461 37.268959 2.664034
2 24645 Valerie Unions Suite 582\nCobbborough, D... Bisque 33.000915 11.330278 37.110597 4.104543
3 1414 David Throughway\nPort Jason, OH 22070-1220 SaddleBrown 34.305557 13.717514 36.721283 3.120179
4 14023 Rodriguez Passage\nPort Jacobville, PR 3... MediumAquaMarine 33.330673 12.795189 37.536653 4.446308
... ... ... ... ... ... ...
495 4483 Jones Motorway Suite 872\nLake Jamiefurt,... Tan 33.237660 13.566160 36.417985 3.746573
496 172 Owen Divide Suite 497\nWest Richard, CA 19320 PaleVioletRed 34.702529 11.695736 37.190268 3.576526
497 0787 Andrews Ranch Apt. 633\nSouth Chadburgh, ... Cornsilk 32.646777 11.499409 38.332576 4.958264
498 680 Jennifer Lodge Apt. 808\nBrendachester, TX... Teal 33.322501 12.391423 36.840086 2.336485
499 49791 Rachel Heights Apt. 898\nEast Drewboroug... DarkMagenta 33.715981 12.418808 35.771016 2.735160

500 rows × 6 columns
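The difference between the returned copy and the original can be seen in a few lines. A minimal sketch on a hypothetical DataFrame; the columns= keyword used in the first call is an equivalent spelling of axis=1.

```python
import pandas as pd

# Hypothetical frame for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

dropped = df.drop(columns=["a"])  # same as df.drop(["a"], axis=1)
original_cols = list(df.columns)  # df itself still has all three columns

# Option 1 from the list above: assign the result back.
df = df.drop(["a", "b"], axis=1)

print(original_cols, list(dropped.columns), list(df.columns))
```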

Removing specific rows

df = data.head(10)
df.index = ['a', 'b','c','d','e','f','g','h','i','j']
df
Address Avatar Avg. Session Length Time on App Time on Website Length of Membership
a 835 Frank Tunnel\nWrightmouth, MI 82180-9605 Violet 34.497268 12.655651 39.577668 4.082621
b 4547 Archer Common\nDiazchester, CA 06566-8576 DarkGreen 31.926272 11.109461 37.268959 2.664034
c 24645 Valerie Unions Suite 582\nCobbborough, D... Bisque 33.000915 11.330278 37.110597 4.104543
d 1414 David Throughway\nPort Jason, OH 22070-1220 SaddleBrown 34.305557 13.717514 36.721283 3.120179
e 14023 Rodriguez Passage\nPort Jacobville, PR 3... MediumAquaMarine 33.330673 12.795189 37.536653 4.446308
f 645 Martha Park Apt. 611\nJeffreychester, MN 6... FloralWhite 33.871038 12.026925 34.476878 5.493507
g 68388 Reyes Lights Suite 692\nJosephbury, WV 9... DarkSlateBlue 32.021596 11.366348 36.683776 4.685017
h Unit 6538 Box 8980\nDPO AP 09026-4941 Aqua 32.739143 12.351959 37.373359 4.434273
i 860 Lee Key\nWest Debra, SD 97450-0495 Salmon 33.987773 13.386235 37.534497 3.273434
j PSC 2734, Box 5255\nAPO AA 98456-7482 Brown 31.936549 11.814128 37.145168 3.202806

Use loc to select a row by its label

df.loc['d']
Address                 1414 David Throughway\nPort Jason, OH 22070-1220
Avatar                                                       SaddleBrown
Avg. Session Length                                            34.305557
Time on App                                                    13.717514
Time on Website                                                36.721283
Length of Membership                                            3.120179
Name: d, dtype: object

Use slicing to select several rows; with loc the end label is included

df.loc['d':'h']
Address Avatar Avg. Session Length Time on App Time on Website Length of Membership
d 1414 David Throughway\nPort Jason, OH 22070-1220 SaddleBrown 34.305557 13.717514 36.721283 3.120179
e 14023 Rodriguez Passage\nPort Jacobville, PR 3... MediumAquaMarine 33.330673 12.795189 37.536653 4.446308
f 645 Martha Park Apt. 611\nJeffreychester, MN 6... FloralWhite 33.871038 12.026925 34.476878 5.493507
g 68388 Reyes Lights Suite 692\nJosephbury, WV 9... DarkSlateBlue 32.021596 11.366348 36.683776 4.685017
h Unit 6538 Box 8980\nDPO AP 09026-4941 Aqua 32.739143 12.351959 37.373359 4.434273

Use iloc to select by integer position

The end position is not included

df.iloc[3:7]
Address Avatar Avg. Session Length Time on App Time on Website Length of Membership
d 1414 David Throughway\nPort Jason, OH 22070-1220 SaddleBrown 34.305557 13.717514 36.721283 3.120179
e 14023 Rodriguez Passage\nPort Jacobville, PR 3... MediumAquaMarine 33.330673 12.795189 37.536653 4.446308
f 645 Martha Park Apt. 611\nJeffreychester, MN 6... FloralWhite 33.871038 12.026925 34.476878 5.493507
g 68388 Reyes Lights Suite 692\nJosephbury, WV 9... DarkSlateBlue 32.021596 11.366348 36.683776 4.685017
df.iloc[3:]
Address Avatar Avg. Session Length Time on App Time on Website Length of Membership
d 1414 David Throughway\nPort Jason, OH 22070-1220 SaddleBrown 34.305557 13.717514 36.721283 3.120179
e 14023 Rodriguez Passage\nPort Jacobville, PR 3... MediumAquaMarine 33.330673 12.795189 37.536653 4.446308
f 645 Martha Park Apt. 611\nJeffreychester, MN 6... FloralWhite 33.871038 12.026925 34.476878 5.493507
g 68388 Reyes Lights Suite 692\nJosephbury, WV 9... DarkSlateBlue 32.021596 11.366348 36.683776 4.685017
h Unit 6538 Box 8980\nDPO AP 09026-4941 Aqua 32.739143 12.351959 37.373359 4.434273
i 860 Lee Key\nWest Debra, SD 97450-0495 Salmon 33.987773 13.386235 37.534497 3.273434
j PSC 2734, Box 5255\nAPO AA 98456-7482 Brown 31.936549 11.814128 37.145168 3.202806
df.iloc[:3]
Address Avatar Avg. Session Length Time on App Time on Website Length of Membership
a 835 Frank Tunnel\nWrightmouth, MI 82180-9605 Violet 34.497268 12.655651 39.577668 4.082621
b 4547 Archer Common\nDiazchester, CA 06566-8576 DarkGreen 31.926272 11.109461 37.268959 2.664034
c 24645 Valerie Unions Suite 582\nCobbborough, D... Bisque 33.000915 11.330278 37.110597 4.104543

iloc can also index columns.

df.iloc[1:4, 0:3]
Address Avatar Avg. Session Length
b 4547 Archer Common\nDiazchester, CA 06566-8576 DarkGreen 31.926272
c 24645 Valerie Unions Suite 582\nCobbborough, D... Bisque 33.000915
d 1414 David Throughway\nPort Jason, OH 22070-1220 SaddleBrown 34.305557
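The inclusive and exclusive slice ends, and row removal by label, can be compared side by side. A minimal sketch on a hypothetical one-column DataFrame with the same 'a' to 'j' index as above.

```python
import pandas as pd

# Hypothetical frame with labels a..j, mirroring the index used above.
df = pd.DataFrame({"v": range(10)}, index=list("abcdefghij"))

by_label = df.loc["d":"h"]  # label slice: the end label h is included
by_pos = df.iloc[3:7]       # position slice: position 7 is excluded

# drop works on row labels by default (axis=0) and returns a new frame.
without_d = df.drop("d")

print(list(by_label.index), list(by_pos.index), len(without_d))
```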

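2-8 How Linear Regression works

Many lines could be drawn through the data; Linear Regression finds the one that explains the data best.

That line is the one that minimizes the mean of the squared errors between the predicted and actual values.

Gradient Descent is one iterative way to search for this line. The OLS model fitted earlier does not need it, because ordinary least squares has a closed-form solution.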
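To illustrate how an iterative search reaches the same line as a direct least-squares solve, here is a minimal sketch on hypothetical synthetic data. The learning rate and the number of steps are arbitrary choices for this toy problem, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: y = 1 + 3*x + noise.
x = rng.normal(size=200)
y = 1.0 + 3.0 * x + rng.normal(scale=0.1, size=200)
A = np.column_stack([np.ones_like(x), x])  # intercept column + feature

# Direct least-squares solution.
w_exact, *_ = np.linalg.lstsq(A, y, rcond=None)

# Gradient descent on the mean squared error.
w = np.zeros(2)
lr = 0.1  # assumed step size
for _ in range(1000):
    grad = 2.0 / len(y) * A.T @ (A @ w - y)  # gradient of the MSE w.r.t. w
    w -= lr * grad

print(w_exact, w)
```

Both approaches end at practically the same intercept and slope; gradient descent becomes the practical choice mainly when no closed form is available or the data is very large.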

2-9 Building the formula

model.summary()
OLS Regression Results

| | | | |
|---|---|---|---|
| Dep. Variable | Yearly Amount Spent | R-squared (uncentered) | 0.998 |
| Model | OLS | Adj. R-squared (uncentered) | 0.998 |
| Method | Least Squares | F-statistic | 4.798e+04 |
| Date | Wed, 23 Mar 2022 | Prob (F-statistic) | 0.00 |
| Time | 20:11:49 | Log-Likelihood | -1820.0 |
| No. Observations | 400 | AIC | 3648. |
| Df Residuals | 396 | BIC | 3664. |
| Df Model | 4 | | |
| Covariance Type | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Avg. Session Length | 11.9059 | 0.869 | 13.696 | 0.000 | 10.197 | 13.615 |
| Time on App | 34.3257 | 1.121 | 30.610 | 0.000 | 32.121 | 36.530 |
| Time on Website | -14.1405 | 0.812 | -17.405 | 0.000 | -15.738 | -12.543 |
| Length of Membership | 61.0149 | 1.144 | 53.318 | 0.000 | 58.765 | 63.265 |

| | | | |
|---|---|---|---|
| Omnibus | 0.490 | Durbin-Watson | 1.987 |
| Prob(Omnibus) | 0.783 | Jarque-Bera (JB) | 0.606 |
| Skew | -0.022 | Prob(JB) | 0.739 |
| Kurtosis | 2.814 | Cond. No. | 55.4 |


Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Building the formula from the summary

y = sum(coef * column)

y = 11.9059 * Avg. Session Length + 34.3257 * Time on App + ...
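The formula can be evaluated by hand. A minimal sketch that plugs the four coefficients from the summary into the sum and applies it to the first row shown in data.head(); the dictionary names are only illustrative.

```python
# Coefficients copied from the coef column of the summary.
coef = {
    "Avg. Session Length": 11.9059,
    "Time on App": 34.3257,
    "Time on Website": -14.1405,
    "Length of Membership": 61.0149,
}

# Feature values of row 0 from data.head().
row0 = {
    "Avg. Session Length": 34.497268,
    "Time on App": 12.655651,
    "Time on Website": 39.577668,
    "Length of Membership": 4.082621,
}

# y = sum(coef * column); there is no intercept term in this model.
pred = sum(coef[name] * row0[name] for name in coef)
print(round(pred, 2))  # roughly 534.59; the actual value in the table is 587.951054
```

model.predict applied to the same row should give nearly the same number, up to the rounding of the displayed coefficients.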

Chapter 3

Predicting each customer's ad response with Logistic Regression

Logistic Regression is used for binary classification
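As background on how a regression yields a binary class: logistic regression passes a linear score through the sigmoid function to obtain a probability, then applies a threshold. A minimal sketch with hypothetical scores; the 0.5 threshold is the usual default, assumed here.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (w . x + b) for three customers.
scores = np.array([-2.0, 0.0, 2.0])
probs = sigmoid(scores)

# Assumed decision rule: predict class 1 when the probability is at least 0.5.
labels = (probs >= 0.5).astype(int)
print(probs, labels)
```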

import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

Loading and checking the data

data = pd.read_csv('data/advertising.csv')
data.head()
Daily Time Spent on Site Age Area Income Daily Internet Usage Ad Topic Line City Male Country Timestamp Clicked on Ad
0 68.95 NaN 61833.90 256.09 Cloned 5thgeneration orchestration Wrightburgh 0 Tunisia 3/27/2016 0:53 0
1 80.23 31.0 68441.85 193.77 Monitored national standardization West Jodi 1 Nauru 4/4/2016 1:39 0
2 69.47 26.0 59785.94 236.50 Organic bottom-line service-desk Davidton 0 San Marino 3/13/2016 20:35 0
3 74.15 29.0 54806.18 245.89 Triple-buffered reciprocal time-frame West Terrifurt 1 Italy 1/10/2016 2:31 0
4 68.37 35.0 73889.99 225.58 Robust logistical utilization South Manuel 0 Iceland 6/3/2016 3:36 0

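Column meanings

  • Daily Time Spent on Site
    • Time spent on the site
  • Age
    • Age
  • Area Income
    • An individual's income cannot be identified, so this is the average income of the user's area
  • Daily Internet Usage
    • Daily internet usage time
  • Ad Topic Line
    • Description of the ad
  • City
    • The user's city
  • Male
    • Sex (0 for female, 1 for male)
  • Country
    • The user's country
  • Timestamp
    • Time-related information
  • Clicked on Ad
    • 0 if the ad was not clicked, 1 if it was

NaN denotes an empty (missing) value.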

data.tail()
Daily Time Spent on Site Age Area Income Daily Internet Usage Ad Topic Line City Male Country Timestamp Clicked on Ad
995 72.97 30.0 71384.57 208.58 Fundamental modular algorithm Duffystad 1 Lebanon 2/11/2016 21:49 1
996 51.30 45.0 67782.17 134.42 Grass-roots cohesive monitoring New Darlene 1 Bosnia and Herzegovina 4/22/2016 2:07 1
997 51.63 51.0 42415.72 120.37 Expanded intangible solution South Jessica 1 Mongolia 2/1/2016 17:24 1
998 55.55 19.0 41920.79 187.95 Proactive bandwidth-monitored policy West Steven 0 Guatemala 3/24/2016 2:35 0
999 45.01 26.0 29875.80 178.35 Virtual 5thgeneration emulation Ronniemouth 0 Brazil 6/3/2016 21:43 1
data.info()
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Daily Time Spent on Site  1000 non-null   float64
 1   Age                       916 non-null    float64
 2   Area Income               1000 non-null   float64
 3   Daily Internet Usage      1000 non-null   float64
 4   Ad Topic Line             1000 non-null   object 
 5   City                      1000 non-null   object 
 6   Male                      1000 non-null   int64  
 7   Country                   1000 non-null   object 
 8   Timestamp                 1000 non-null   object 
 9   Clicked on Ad             1000 non-null   int64  
dtypes: float64(4), int64(2), object(4)
memory usage: 78.2+ KB

Clicked on Ad is the dependent variable, the value we want to predict

Only Age has 916 non-null values, which means 84 of its 1000 values are missing

data.describe()
| | Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Male | Clicked on Ad |
|---|---|---|---|---|---|---|
| count | 1000.000000 | 916.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.00000 |
| mean | 65.000200 | 36.128821 | 55000.000080 | 180.000100 | 0.481000 | 0.50000 |
| std | 15.853615 | 9.018548 | 13414.634022 | 43.902339 | 0.499889 | 0.50025 |
| min | 32.600000 | 19.000000 | 13996.500000 | 104.780000 | 0.000000 | 0.00000 |
| 25% | 51.360000 | 29.000000 | 47031.802500 | 138.830000 | 0.000000 | 0.00000 |
| 50% | 68.215000 | 35.000000 | 57012.300000 | 183.130000 | 0.000000 | 0.50000 |
| 75% | 78.547500 | 42.000000 | 65470.635000 | 218.792500 | 1.000000 | 1.00000 |
| max | 91.430000 | 61.000000 | 79484.800000 | 269.960000 | 1.000000 | 1.00000 |

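For Area Income, the gap between min and 25% is much larger than the gap between 75% and max, so the distribution has a long left tail (it is left-skewed); the mean (about 55,000) lying below the median (about 57,012) points the same way.

The mean of Male is 0.481, so about 48% of the users are male.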
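The reading of Area Income can be double-checked with two quick indicators: a mean below the median and a negative skewness. A minimal sketch on a hypothetical series with one very low value; Series.skew() is the pandas sample skewness.

```python
import pandas as pd

# Hypothetical values: mostly around 9-11 with one very low value on the left.
s = pd.Series([1.0, 8.0, 9.0, 9.0, 10.0, 10.0, 10.0, 11.0])

mean_below_median = s.mean() < s.median()  # the long left tail pulls the mean down
skewness = s.skew()                        # negative for a left-skewed distribution

print(mean_below_median, skewness)
```

On the lecture data the same check would be data['Area Income'].skew().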

Visualization

sns.displot(data['Area Income'])
sns.displot(data['Age'])

Checking the number of unique values

data['Country'].nunique()
237
data['City'].nunique()
969
data['Ad Topic Line'].nunique()
1000

Checking and handling missing values

Missing Value

  • na

  • NaN

  • Null

Checking for missing values

data.isna()
Daily Time Spent on Site Age Area Income Daily Internet Usage Ad Topic Line City Male Country Timestamp Clicked on Ad
0 False True False False False False False False False False
1 False False False False False False False False False False
2 False False False False False False False False False False
3 False False False False False False False False False False
4 False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ...
995 False False False False False False False False False False
996 False False False False False False False False False False
997 False False False False False False False False False False
998 False False False False False False False False False False
999 False False False False False False False False False False

1000 rows × 10 columns

data.isna().sum()
Daily Time Spent on Site     0
Age                         84
Area Income                  0
Daily Internet Usage         0
Ad Topic Line                0
City                         0
Male                         0
Country                      0
Timestamp                    0
Clicked on Ad                0
dtype: int64
data.isna().sum() / len(data)
Daily Time Spent on Site    0.000
Age                         0.084
Area Income                 0.000
Daily Internet Usage        0.000
Ad Topic Line               0.000
City                        0.000
Male                        0.000
Country                     0.000
Timestamp                   0.000
Clicked on Ad               0.000
dtype: float64
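The ratio above can also be obtained in one step, and the missing values can then be handled. A minimal sketch on a hypothetical small DataFrame; filling with the column mean and dropping rows are two common options assumed here, and these notes do not say which one the lecture chooses.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing Age, mirroring the first row above.
df = pd.DataFrame({"Age": [np.nan, 31.0, 26.0, 29.0], "Male": [0, 1, 0, 1]})

# isna().mean() gives the same ratio as isna().sum() / len(df).
ratio = df.isna().mean()

# Option A (assumption): fill missing ages with the column mean.
filled = df.fillna({"Age": df["Age"].mean()})

# Option B (assumption): drop every row that contains a missing value.
dropped = df.dropna()

print(ratio["Age"], filled["Age"].tolist(), len(dropped))
```

With only 8.4% of Age missing, either option keeps most of the data; dropping rows loses the other columns of those rows as well.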