Notes from a Fast Campus lecture

Chapter 2

Predicting each customer's yearly spending with Linear Regression

2-1 Loading modules and data

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv("data/ecommerce.csv")

2-2 Inspecting the data

Checking the first rows

data.head()

	Email	Address	Avatar	Avg. Session Length	Time on App	Time on Website	Length of Membership	Yearly Amount Spent
0	mstephenson@fernandez.com	835 Frank Tunnel\nWrightmouth, MI 82180-9605	Violet	34.497268	12.655651	39.577668	4.082621	587.951054
1	hduke@hotmail.com	4547 Archer Common\nDiazchester, CA 06566-8576	DarkGreen	31.926272	11.109461	37.268959	2.664034	392.204933
2	pallen@yahoo.com	24645 Valerie Unions Suite 582\nCobbborough, D...	Bisque	33.000915	11.330278	37.110597	4.104543	487.547505
3	riverarebecca@gmail.com	1414 David Throughway\nPort Jason, OH 22070-1220	SaddleBrown	34.305557	13.717514	36.721283	3.120179	581.852344
4	mstephens@davidson-herman.com	14023 Rodriguez Passage\nPort Jacobville, PR 3...	MediumAquaMarine	33.330673	12.795189	37.536653	4.446308	599.406092

data.head(10)

	Email	Address	Avatar	Avg. Session Length	Time on App	Time on Website	Length of Membership	Yearly Amount Spent
0	mstephenson@fernandez.com	835 Frank Tunnel\nWrightmouth, MI 82180-9605	Violet	34.497268	12.655651	39.577668	4.082621	587.951054
1	hduke@hotmail.com	4547 Archer Common\nDiazchester, CA 06566-8576	DarkGreen	31.926272	11.109461	37.268959	2.664034	392.204933
2	pallen@yahoo.com	24645 Valerie Unions Suite 582\nCobbborough, D...	Bisque	33.000915	11.330278	37.110597	4.104543	487.547505
3	riverarebecca@gmail.com	1414 David Throughway\nPort Jason, OH 22070-1220	SaddleBrown	34.305557	13.717514	36.721283	3.120179	581.852344
4	mstephens@davidson-herman.com	14023 Rodriguez Passage\nPort Jacobville, PR 3...	MediumAquaMarine	33.330673	12.795189	37.536653	4.446308	599.406092
5	alvareznancy@lucas.biz	645 Martha Park Apt. 611\nJeffreychester, MN 6...	FloralWhite	33.871038	12.026925	34.476878	5.493507	637.102448
6	katherine20@yahoo.com	68388 Reyes Lights Suite 692\nJosephbury, WV 9...	DarkSlateBlue	32.021596	11.366348	36.683776	4.685017	521.572175
7	awatkins@yahoo.com	Unit 6538 Box 8980\nDPO AP 09026-4941	Aqua	32.739143	12.351959	37.373359	4.434273	549.904146
8	vchurch@walter-martinez.com	860 Lee Key\nWest Debra, SD 97450-0495	Salmon	33.987773	13.386235	37.534497	3.273434	570.200409
9	bonnie69@lin.biz	PSC 2734, Box 5255\nAPO AA 98456-7482	Brown	31.936549	11.814128	37.145168	3.202806	427.199385

Checking the last rows

data.tail()

	Email	Address	Avatar	Avg. Session Length	Time on App	Time on Website	Length of Membership	Yearly Amount Spent
495	lewisjessica@craig-evans.com	4483 Jones Motorway Suite 872\nLake Jamiefurt,...	Tan	33.237660	13.566160	36.417985	3.746573	573.847438
496	katrina56@gmail.com	172 Owen Divide Suite 497\nWest Richard, CA 19320	PaleVioletRed	34.702529	11.695736	37.190268	3.576526	529.049004
497	dale88@hotmail.com	0787 Andrews Ranch Apt. 633\nSouth Chadburgh, ...	Cornsilk	32.646777	11.499409	38.332576	4.958264	551.620145
498	cwilson@hotmail.com	680 Jennifer Lodge Apt. 808\nBrendachester, TX...	Teal	33.322501	12.391423	36.840086	2.336485	456.469510
499	hannahwilson@davidson.com	49791 Rachel Heights Apt. 898\nEast Drewboroug...	DarkMagenta	33.715981	12.418808	35.771016	2.735160	497.778642

Checking the data info

data.info()

RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Email                 500 non-null    object 
 1   Address               500 non-null    object 
 2   Avatar                500 non-null    object 
 3   Avg. Session Length   500 non-null    float64
 4   Time on App           500 non-null    float64
 5   Time on Website       500 non-null    float64
 6   Length of Membership  500 non-null    float64
 7   Yearly Amount Spent   500 non-null    float64
dtypes: float64(5), object(3)
memory usage: 31.4+ KB

The non-null count equals the total number of rows (500) for every column, so there are no missing values.

Checking the summary statistics

data.describe()

	Avg. Session Length	Time on App	Time on Website	Length of Membership	Yearly Amount Spent
count	500.000000	500.000000	500.000000	500.000000	500.000000
mean	33.053194	12.052488	37.060445	3.533462	499.314038
std	0.992563	0.994216	1.010489	0.999278	79.314782
min	29.532429	8.508152	33.913847	0.269901	256.670582
25%	32.341822	11.388153	36.349257	2.930450	445.038277
50%	33.082008	11.983231	37.069367	3.533975	498.887875
75%	33.711985	12.753850	37.716432	4.126502	549.313828
max	36.139662	15.126994	40.005182	6.922689	765.518462

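This shows the scale of each column.

Scale: the range from min to max

It can also hint at outliers.

If there is a very large gap between the 75% value and max (or between min and the 25% value), the column may contain outliers.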
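The quartile comparison above can be made concrete with the interquartile range (IQR) rule. This is only a sketch and not part of the lecture: the values are made up, and the 1.5 × IQR threshold is a common rule of thumb rather than a fixed standard.

```python
import pandas as pd

# Hypothetical values; 40.0 is deliberately far from the rest.
values = pd.Series([10.0, 11.0, 12.0, 12.5, 13.0, 13.5, 14.0, 40.0])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Rule of thumb: flag points more than 1.5 * IQR outside the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers)
```

Only the last value falls outside the bounds here, which matches the intuition of a max that sits far from the 75% value.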

2-3 Dropping unneeded variables

Selecting a column

data['Length of Membership']

0      4.082621
1      2.664034
2      4.104543
3      3.120179
4      4.446308
         ...   
495    3.746573
496    3.576526
497    4.958264
498    2.336485
499    2.735160
Name: Length of Membership, Length: 500, dtype: float64

To select two or more columns, wrap the names in one more pair of brackets

data[['Length of Membership', 'Yearly Amount Spent']]

	Length of Membership	Yearly Amount Spent
0	4.082621	587.951054
1	2.664034	392.204933
2	4.104543	487.547505
3	3.120179	581.852344
4	4.446308	599.406092
...	...	...
495	3.746573	573.847438
496	3.576526	529.049004
497	4.958264	551.620145
498	2.336485	456.469510
499	2.735160	497.778642

500 rows × 2 columns

data = data[['Avg. Session Length', 'Time on App', 'Time on Website', 'Length of Membership', 'Yearly Amount Spent']]

data.head()

	Avg. Session Length	Time on App	Time on Website	Length of Membership	Yearly Amount Spent
0	34.497268	12.655651	39.577668	4.082621	587.951054
1	31.926272	11.109461	37.268959	2.664034	392.204933
2	33.000915	11.330278	37.110597	4.104543	487.547505
3	34.305557	13.717514	36.721283	3.120179	581.852344
4	33.330673	12.795189	37.536653	4.446308	599.406092

2-4 Train Test Split

Definition: splitting the data into a training set and a test set

Idea: we need to be able to check whether the model handles data it did not see during training

Splitting the data

from sklearn.model_selection import train_test_split

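The general call looks like this, where X holds the independent variables and y is the dependent variable (both are defined just below)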

X_train, X_test, y_train, y_test = train_test_split(X, y)

Defining X and y

X = data[['Avg. Session Length', 'Time on App', 'Time on Website', 'Length of Membership']]

y = data['Yearly Amount Spent']

Splitting the data

The split ratio can be specified; by default, 25% of the rows go to the test set (test_size=0.25)

X_train, X_test, y_train, y_test = train_test_split(X, y)

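When choosing the ratio, make sure that enough training data remains.

test_size is the fraction of rows assigned to the test set, and random_state is the random seed: the same value reproduces the same split.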

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

X_train

	Avg. Session Length	Time on App	Time on Website	Length of Membership
205	34.967610	13.919494	37.952013	5.066697
404	32.278443	12.527472	36.688367	3.531402
337	31.827979	12.461147	37.428997	2.974737
440	33.200616	11.965980	36.831536	3.549036
55	33.925297	11.588655	35.252242	3.392050
...	...	...	...	...
343	32.302748	12.815393	37.957810	4.615426
359	32.054262	13.149670	37.650400	4.195614
323	32.762456	10.952353	37.646292	4.019470
280	32.271848	13.485009	37.550880	3.086337
8	33.987773	13.386235	37.534497	3.273434

400 rows × 4 columns
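To see the effect of test_size and random_state in isolation, here is a small sketch on made-up data. The array names (X_demo, y_demo) and the random values are assumptions for illustration; only the shape (500 rows, 4 features) mirrors the e-commerce data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data with the same shape as X and y above.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = rng.normal(size=500)

# test_size=0.2 sends 100 of the 500 rows to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state gives the same split.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100
)
print(np.array_equal(X_te, X_te_again))

# Without test_size, the default of 0.25 applies.
_, X_te_default, _, _ = train_test_split(X_demo, y_demo, random_state=100)
print(len(X_te_default))
```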

2-5 Building the Linear Regression model

import statsmodels.api as sm

model = sm.OLS(y_train, X_train)

Training

model = model.fit()

Checking the Linear Regression report

model.summary()

OLS Regression Results

Dep. Variable:	Yearly Amount Spent	R-squared (uncentered):	0.998
Model:	OLS	Adj. R-squared (uncentered):	0.998
Method:	Least Squares	F-statistic:	4.798e+04
Date:	Wed, 23 Mar 2022	Prob (F-statistic):	0.00
Time:	18:18:14	Log-Likelihood:	-1820.0
No. Observations:	400	AIC:	3648.
Df Residuals:	396	BIC:	3664.
Df Model:	4
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Avg. Session Length	11.9059	0.869	13.696	0.000	10.197	13.615
Time on App	34.3257	1.121	30.610	0.000	32.121	36.530
Time on Website	-14.1405	0.812	-17.405	0.000	-15.738	-12.543
Length of Membership	61.0149	1.144	53.318	0.000	58.765	63.265

Omnibus:	0.490	Durbin-Watson:	1.987
Prob(Omnibus):	0.783	Jarque-Bera (JB):	0.606
Skew:	-0.022	Prob(JB):	0.739
Kurtosis:	2.814	Cond. No.	55.4

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

R-squared

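The larger the value, the better the model fits
Adj. R-squared is the more appropriate criterion
Plain R-squared cannot decrease as the number of independent variables grows
Adj. R-squared applies a penalty for the number of independent variables
The report above shows the uncentered version (see note [1]: the model has no constant term), so 0.998 is not directly comparable to the usual R-squared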
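The penalty can be written out as a small function. This is a sketch of the textbook formula for a model with a constant term; the function name and the example numbers are hypothetical, and the uncentered report above (a model without a constant) uses a slightly different variant.

```python
def adjusted_r2(r2, n_obs, n_features):
    """Textbook adjusted R-squared for a model with a constant term."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# Same plain R-squared, different numbers of independent variables.
few = adjusted_r2(0.90, n_obs=100, n_features=4)
many = adjusted_r2(0.90, n_obs=100, n_features=40)
print(few, many)
```

With the plain R-squared held at 0.90, the version with 40 variables ends up with a lower adjusted value than the one with 4, which is the correction described above.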

Coef

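The influence of each variable (strength and direction)
The slope: how much y changes when the variable increases by 1
When variables are on different scales, comparing coefficients may not be fair (e.g. 1 won of salary vs. 1 year of time)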

P-value

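A measure of whether the result can be trusted
Ranges from 0 to 1
Below 0.05 is generally considered acceptable
At 0.05 or above, the coefficient is not considered statistically significant, so the variable's effect cannot be trusted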

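SST: sum of squared differences between the actual values and their mean

SSE: sum of squared differences between the actual values and the predicted values

SSR: sum of squared differences between the predicted values and the mean

R-Squared = SSR/SST = 1 - SSE/SST (for a model with a constant term)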
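The three sums can be checked numerically. The sketch below uses made-up points and fits a line with an intercept via np.polyfit (not the statsmodels model above), since the identity SST = SSR + SSE relies on the model having a constant term.

```python
import numpy as np

# Hypothetical data, roughly on a line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares line with an intercept; polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

sst = np.sum((y - y.mean()) ** 2)       # actual vs. mean
sse = np.sum((y - y_hat) ** 2)          # actual vs. predicted
ssr = np.sum((y_hat - y.mean()) ** 2)   # predicted vs. mean

r2 = ssr / sst
print(np.isclose(sst, ssr + sse), np.isclose(r2, 1 - sse / sst))
```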

2-6 Predicting and evaluating with the model

Prediction

pred = model.predict(X_test)

Visualization

plt.figure(figsize=(10, 10))
sns.scatterplot(x=y_test, y=pred)

MSE(Mean Squared Error)

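MSE (Mean Squared Error): the mean of the squared differences between the predicted values and the actual test values

Problem

If the raw errors are simply summed and averaged, positive and negative errors cancel each other out, so the result does not reflect the real size of the error.

Solutions

Take the absolute value
Square the errors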

from sklearn.metrics import mean_squared_error

mean_squared_error(y_test, pred)

482.28901390889246

This number alone does not tell us whether the model is good or bad.

It becomes meaningful when compared with the value from another model.

RMSE(Root Mean Squared Error)

Because MSE squares the errors, the result is in squared units and looks far too large compared with the original scale of y.

Taking the square root brings it back to the scale of y.

np.sqrt(mean_squared_error(y_test, pred))

21.961079525125637

As with MSE, this single value is not enough to judge whether the model is good or bad.
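The cancellation problem and the two fixes can be reproduced by hand with numpy. The numbers below are made up for illustration; the MAE line corresponds to the absolute-value option, and the MSE and RMSE lines to the squaring option.

```python
import numpy as np

# Hypothetical actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 6.0, 9.0, 7.0])

errors = y_true - y_pred            # [1, -1, -2, 2]
mean_error = errors.mean()          # signs cancel: 0.0
mae = np.abs(errors).mean()         # absolute value: 1.5
mse = (errors ** 2).mean()          # squaring: 2.5
rmse = np.sqrt(mse)                 # back on the scale of y

print(mean_error, mae, mse, rmse)
```

Every prediction here is off by 1 or 2, yet the plain mean error is 0, which is why the plain mean is not used as a metric.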

2-7 Numpy and Pandas

Numpy arrays are fast for numerical operations

Numpy and Pandas are interoperable

Numpy

a = [1, 2, 3]
b = [4, 5, 6]

np.array(a)

array([1, 2, 3])

np.array([a, b])

array([[1, 2, 3],
       [4, 5, 6]])

Pandas

pd.DataFrame([a, b])

	0	1	2
0	1	2	3
1	4	5	6

Column names and row labels can also be set

pd.DataFrame([a, b], columns=['a','b','c'], index=['x','y'])

	a	b	c
x	1	2	3
y	4	5	6

A Series is one-dimensional

pd.Series(a)

0    1
1    2
2    3
dtype: int64

data = pd.read_csv('data/eCommerce.csv')

Selecting a single column from a DataFrame gives a Series

type(data)

pandas.core.frame.DataFrame

type(data['Yearly Amount Spent'])

pandas.core.series.Series

A Series can be converted to a DataFrame.

pd.DataFrame(data['Yearly Amount Spent'])

	Yearly Amount Spent
0	587.951054
1	392.204933
2	487.547505
3	581.852344
4	599.406092
...	...
495	573.847438
496	529.049004
497	551.620145
498	456.469510
499	497.778642

500 rows × 1 columns

Converting between Pandas and Numpy

pd.DataFrame(np.array([a, b]))

	0	1	2
0	1	2	3
1	4	5	6

np.array(data)

array([['mstephenson@fernandez.com',
        '835 Frank Tunnel\nWrightmouth, MI 82180-9605', 'Violet', ...,
        39.57766801952616, 4.0826206329529615, 587.9510539684005],
       ['hduke@hotmail.com',
        '4547 Archer Common\nDiazchester, CA 06566-8576', 'DarkGreen',
        ..., 37.268958868297744, 2.66403418213262, 392.2049334443264],
       ['pallen@yahoo.com',
        '24645 Valerie Unions Suite 582\nCobbborough, DC 99414-7564',
        'Bisque', ..., 37.110597442120856, 4.104543202376424,
        487.54750486747207],
       ...,
       ['dale88@hotmail.com',
        '0787 Andrews Ranch Apt. 633\nSouth Chadburgh, TN 56128',
        'Cornsilk', ..., 38.33257633196044, 4.958264472618699,
        551.6201454762477],
       ['cwilson@hotmail.com',
        '680 Jennifer Lodge Apt. 808\nBrendachester, TX 05000-5873',
        'Teal', ..., 36.84008572976701, 2.336484668112853,
        456.469510066298],
       ['hannahwilson@davidson.com',
        '49791 Rachel Heights Apt. 898\nEast Drewborough, OR 55919-9528',
        'DarkMagenta', ..., 35.771016191612965, 2.7351595670822757,
        497.7786422156802]], dtype=object)

Pandas Indexing

data['Yearly Amount Spent']

0      587.951054
1      392.204933
2      487.547505
3      581.852344
4      599.406092
          ...    
495    573.847438
496    529.049004
497    551.620145
498    456.469510
499    497.778642
Name: Yearly Amount Spent, Length: 500, dtype: float64

Selecting two or more columns needs another pair of [] because the column names are passed as a single list object

data[['Time on App', 'Time on Website']]

	Time on App	Time on Website
0	12.655651	39.577668
1	11.109461	37.268959
2	11.330278	37.110597
3	13.717514	36.721283
4	12.795189	37.536653
...	...	...
495	13.566160	36.417985
496	11.695736	37.190268
497	11.499409	38.332576
498	12.391423	36.840086
499	12.418808	35.771016

500 rows × 2 columns

Removing specific columns

By default, drop looks for the given label among the rows (axis=0) and removes it.

To remove a column, pass axis=1 as an option.

data.drop('Yearly Amount Spent', axis=1)

	Email	Address	Avatar	Avg. Session Length	Time on App	Time on Website	Length of Membership
0	mstephenson@fernandez.com	835 Frank Tunnel\nWrightmouth, MI 82180-9605	Violet	34.497268	12.655651	39.577668	4.082621
1	hduke@hotmail.com	4547 Archer Common\nDiazchester, CA 06566-8576	DarkGreen	31.926272	11.109461	37.268959	2.664034
2	pallen@yahoo.com	24645 Valerie Unions Suite 582\nCobbborough, D...	Bisque	33.000915	11.330278	37.110597	4.104543
3	riverarebecca@gmail.com	1414 David Throughway\nPort Jason, OH 22070-1220	SaddleBrown	34.305557	13.717514	36.721283	3.120179
4	mstephens@davidson-herman.com	14023 Rodriguez Passage\nPort Jacobville, PR 3...	MediumAquaMarine	33.330673	12.795189	37.536653	4.446308
...	...	...	...	...	...	...	...
495	lewisjessica@craig-evans.com	4483 Jones Motorway Suite 872\nLake Jamiefurt,...	Tan	33.237660	13.566160	36.417985	3.746573
496	katrina56@gmail.com	172 Owen Divide Suite 497\nWest Richard, CA 19320	PaleVioletRed	34.702529	11.695736	37.190268	3.576526
497	dale88@hotmail.com	0787 Andrews Ranch Apt. 633\nSouth Chadburgh, ...	Cornsilk	32.646777	11.499409	38.332576	4.958264
498	cwilson@hotmail.com	680 Jennifer Lodge Apt. 808\nBrendachester, TX...	Teal	33.322501	12.391423	36.840086	2.336485
499	hannahwilson@davidson.com	49791 Rachel Heights Apt. 898\nEast Drewboroug...	DarkMagenta	33.715981	12.418808	35.771016	2.735160

500 rows × 7 columns

Likewise, when dropping two or more columns, pass the names in a list with [].

data.drop(['Email', 'Yearly Amount Spent'], axis=1)

	Address	Avatar	Avg. Session Length	Time on App	Time on Website	Length of Membership
0	835 Frank Tunnel\nWrightmouth, MI 82180-9605	Violet	34.497268	12.655651	39.577668	4.082621
1	4547 Archer Common\nDiazchester, CA 06566-8576	DarkGreen	31.926272	11.109461	37.268959	2.664034
2	24645 Valerie Unions Suite 582\nCobbborough, D...	Bisque	33.000915	11.330278	37.110597	4.104543
3	1414 David Throughway\nPort Jason, OH 22070-1220	SaddleBrown	34.305557	13.717514	36.721283	3.120179
4	14023 Rodriguez Passage\nPort Jacobville, PR 3...	MediumAquaMarine	33.330673	12.795189	37.536653	4.446308
...	...	...	...	...	...	...
495	4483 Jones Motorway Suite 872\nLake Jamiefurt,...	Tan	33.237660	13.566160	36.417985	3.746573
496	172 Owen Divide Suite 497\nWest Richard, CA 19320	PaleVioletRed	34.702529	11.695736	37.190268	3.576526
497	0787 Andrews Ranch Apt. 633\nSouth Chadburgh, ...	Cornsilk	32.646777	11.499409	38.332576	4.958264
498	680 Jennifer Lodge Apt. 808\nBrendachester, TX...	Teal	33.322501	12.391423	36.840086	2.336485
499	49791 Rachel Heights Apt. 898\nEast Drewboroug...	DarkMagenta	33.715981	12.418808	35.771016	2.735160

500 rows × 6 columns

drop does not modify the original data.

data

	Email	Address	Avatar	Avg. Session Length	Time on App	Time on Website	Length of Membership	Yearly Amount Spent
0	mstephenson@fernandez.com	835 Frank Tunnel\nWrightmouth, MI 82180-9605	Violet	34.497268	12.655651	39.577668	4.082621	587.951054
1	hduke@hotmail.com	4547 Archer Common\nDiazchester, CA 06566-8576	DarkGreen	31.926272	11.109461	37.268959	2.664034	392.204933
2	pallen@yahoo.com	24645 Valerie Unions Suite 582\nCobbborough, D...	Bisque	33.000915	11.330278	37.110597	4.104543	487.547505
3	riverarebecca@gmail.com	1414 David Throughway\nPort Jason, OH 22070-1220	SaddleBrown	34.305557	13.717514	36.721283	3.120179	581.852344
4	mstephens@davidson-herman.com	14023 Rodriguez Passage\nPort Jacobville, PR 3...	MediumAquaMarine	33.330673	12.795189	37.536653	4.446308	599.406092
...	...	...	...	...	...	...	...	...
495	lewisjessica@craig-evans.com	4483 Jones Motorway Suite 872\nLake Jamiefurt,...	Tan	33.237660	13.566160	36.417985	3.746573	573.847438
496	katrina56@gmail.com	172 Owen Divide Suite 497\nWest Richard, CA 19320	PaleVioletRed	34.702529	11.695736	37.190268	3.576526	529.049004
497	dale88@hotmail.com	0787 Andrews Ranch Apt. 633\nSouth Chadburgh, ...	Cornsilk	32.646777	11.499409	38.332576	4.958264	551.620145
498	cwilson@hotmail.com	680 Jennifer Lodge Apt. 808\nBrendachester, TX...	Teal	33.322501	12.391423	36.840086	2.336485	456.469510
499	hannahwilson@davidson.com	49791 Rachel Heights Apt. 898\nEast Drewboroug...	DarkMagenta	33.715981	12.418808	35.771016	2.735160	497.778642

500 rows × 8 columns

Modifying the original with drop

Assign the result back to data
Or add the inplace=True option

data.drop(['Email', 'Yearly Amount Spent'], axis=1, inplace=True)

data

	Address	Avatar	Avg. Session Length	Time on App	Time on Website	Length of Membership
0	835 Frank Tunnel\nWrightmouth, MI 82180-9605	Violet	34.497268	12.655651	39.577668	4.082621
1	4547 Archer Common\nDiazchester, CA 06566-8576	DarkGreen	31.926272	11.109461	37.268959	2.664034
2	24645 Valerie Unions Suite 582\nCobbborough, D...	Bisque	33.000915	11.330278	37.110597	4.104543
3	1414 David Throughway\nPort Jason, OH 22070-1220	SaddleBrown	34.305557	13.717514	36.721283	3.120179
4	14023 Rodriguez Passage\nPort Jacobville, PR 3...	MediumAquaMarine	33.330673	12.795189	37.536653	4.446308
...	...	...	...	...	...	...
495	4483 Jones Motorway Suite 872\nLake Jamiefurt,...	Tan	33.237660	13.566160	36.417985	3.746573
496	172 Owen Divide Suite 497\nWest Richard, CA 19320	PaleVioletRed	34.702529	11.695736	37.190268	3.576526
497	0787 Andrews Ranch Apt. 633\nSouth Chadburgh, ...	Cornsilk	32.646777	11.499409	38.332576	4.958264
498	680 Jennifer Lodge Apt. 808\nBrendachester, TX...	Teal	33.322501	12.391423	36.840086	2.336485
499	49791 Rachel Heights Apt. 898\nEast Drewboroug...	DarkMagenta	33.715981	12.418808	35.771016	2.735160

500 rows × 6 columns
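The two ways of keeping the result can be compared side by side on a tiny frame. The DataFrame and its column names below are hypothetical; drop, axis, columns, and inplace are the real pandas parameters.

```python
import pandas as pd

# Hypothetical three-column frame.
df_demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Plain drop returns a new DataFrame and leaves df_demo untouched.
dropped = df_demo.drop("c", axis=1)
cols_after_plain_drop = list(df_demo.columns)   # still a, b, c

# Option 1: assign the result back.
df_demo = df_demo.drop(["b"], axis=1)

# Option 2: modify in place (columns=... is equivalent to axis=1).
df_demo.drop(columns=["c"], inplace=True)

print(list(dropped.columns), cols_after_plain_drop, list(df_demo.columns))
```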

Removing specific rows

df = data.head(10)

df.index = ['a', 'b','c','d','e','f','g','h','i','j']

df

	Address	Avatar	Avg. Session Length	Time on App	Time on Website	Length of Membership
a	835 Frank Tunnel\nWrightmouth, MI 82180-9605	Violet	34.497268	12.655651	39.577668	4.082621
b	4547 Archer Common\nDiazchester, CA 06566-8576	DarkGreen	31.926272	11.109461	37.268959	2.664034
c	24645 Valerie Unions Suite 582\nCobbborough, D...	Bisque	33.000915	11.330278	37.110597	4.104543
d	1414 David Throughway\nPort Jason, OH 22070-1220	SaddleBrown	34.305557	13.717514	36.721283	3.120179
e	14023 Rodriguez Passage\nPort Jacobville, PR 3...	MediumAquaMarine	33.330673	12.795189	37.536653	4.446308
f	645 Martha Park Apt. 611\nJeffreychester, MN 6...	FloralWhite	33.871038	12.026925	34.476878	5.493507
g	68388 Reyes Lights Suite 692\nJosephbury, WV 9...	DarkSlateBlue	32.021596	11.366348	36.683776	4.685017
h	Unit 6538 Box 8980\nDPO AP 09026-4941	Aqua	32.739143	12.351959	37.373359	4.434273
i	860 Lee Key\nWest Debra, SD 97450-0495	Salmon	33.987773	13.386235	37.534497	3.273434
j	PSC 2734, Box 5255\nAPO AA 98456-7482	Brown	31.936549	11.814128	37.145168	3.202806

Use loc to select a row by its label

df.loc['d']

Address                 1414 David Throughway\nPort Jason, OH 22070-1220
Avatar                                                       SaddleBrown
Avg. Session Length                                            34.305557
Time on App                                                    13.717514
Time on Website                                                36.721283
Length of Membership                                            3.120179
Name: d, dtype: object

To select several rows, use a label slice; with loc the end label is included

df.loc['d':'h']

	Address	Avatar	Avg. Session Length	Time on App	Time on Website	Length of Membership
d	1414 David Throughway\nPort Jason, OH 22070-1220	SaddleBrown	34.305557	13.717514	36.721283	3.120179
e	14023 Rodriguez Passage\nPort Jacobville, PR 3...	MediumAquaMarine	33.330673	12.795189	37.536653	4.446308
f	645 Martha Park Apt. 611\nJeffreychester, MN 6...	FloralWhite	33.871038	12.026925	34.476878	5.493507
g	68388 Reyes Lights Suite 692\nJosephbury, WV 9...	DarkSlateBlue	32.021596	11.366348	36.683776	4.685017
h	Unit 6538 Box 8980\nDPO AP 09026-4941	Aqua	32.739143	12.351959	37.373359	4.434273

Use iloc to select by integer position

The end position is not included

df.iloc[3:7]

	Address	Avatar	Avg. Session Length	Time on App	Time on Website	Length of Membership
d	1414 David Throughway\nPort Jason, OH 22070-1220	SaddleBrown	34.305557	13.717514	36.721283	3.120179
e	14023 Rodriguez Passage\nPort Jacobville, PR 3...	MediumAquaMarine	33.330673	12.795189	37.536653	4.446308
f	645 Martha Park Apt. 611\nJeffreychester, MN 6...	FloralWhite	33.871038	12.026925	34.476878	5.493507
g	68388 Reyes Lights Suite 692\nJosephbury, WV 9...	DarkSlateBlue	32.021596	11.366348	36.683776	4.685017

df.iloc[3:]

	Address	Avatar	Avg. Session Length	Time on App	Time on Website	Length of Membership
d	1414 David Throughway\nPort Jason, OH 22070-1220	SaddleBrown	34.305557	13.717514	36.721283	3.120179
e	14023 Rodriguez Passage\nPort Jacobville, PR 3...	MediumAquaMarine	33.330673	12.795189	37.536653	4.446308
f	645 Martha Park Apt. 611\nJeffreychester, MN 6...	FloralWhite	33.871038	12.026925	34.476878	5.493507
g	68388 Reyes Lights Suite 692\nJosephbury, WV 9...	DarkSlateBlue	32.021596	11.366348	36.683776	4.685017
h	Unit 6538 Box 8980\nDPO AP 09026-4941	Aqua	32.739143	12.351959	37.373359	4.434273
i	860 Lee Key\nWest Debra, SD 97450-0495	Salmon	33.987773	13.386235	37.534497	3.273434
j	PSC 2734, Box 5255\nAPO AA 98456-7482	Brown	31.936549	11.814128	37.145168	3.202806

df.iloc[:3]

	Address	Avatar	Avg. Session Length	Time on App	Time on Website	Length of Membership
a	835 Frank Tunnel\nWrightmouth, MI 82180-9605	Violet	34.497268	12.655651	39.577668	4.082621
b	4547 Archer Common\nDiazchester, CA 06566-8576	DarkGreen	31.926272	11.109461	37.268959	2.664034
c	24645 Valerie Unions Suite 582\nCobbborough, D...	Bisque	33.000915	11.330278	37.110597	4.104543

iloc can also index columns.

df.iloc[1:4, 0:3]

	Address	Avatar	Avg. Session Length
b	4547 Archer Common\nDiazchester, CA 06566-8576	DarkGreen	31.926272
c	24645 Valerie Unions Suite 582\nCobbborough, D...	Bisque	33.000915
d	1414 David Throughway\nPort Jason, OH 22070-1220	SaddleBrown	34.305557
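The difference in how the two accessors treat the end of a slice is easy to mix up, so here is a minimal sketch on a made-up frame (the column name v and the labels are hypothetical).

```python
import pandas as pd

# Hypothetical frame with string labels as the index.
small = pd.DataFrame({"v": [10, 20, 30, 40, 50]}, index=list("abcde"))

by_label = small.loc["b":"d"]    # label slice: end label "d" is included
by_position = small.iloc[1:3]    # position slice: position 3 is excluded

print(by_label.index.tolist())
print(by_position.index.tolist())
```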

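2-8 How Linear Regression works

Many candidate lines can be drawn to represent the data; Linear Regression finds the one that explains the data best.

The line is found by looking for the one with the smallest mean squared error between the predicted and actual values.

Gradient Descent is one method that can be used to find this line.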
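As an illustration of the idea, the sketch below runs plain gradient descent on the mean squared error for a one-variable line and compares the result with numpy's direct least-squares fit. The data, learning rate, and iteration count are assumptions chosen for this toy case, and this is not how the statsmodels model above was fitted.

```python
import numpy as np

# Hypothetical data around the line y = 3x + 2.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=200)

w, b = 0.0, 0.0      # start from a flat line
lr = 0.5             # learning rate (assumed; works for this scale of x)

for _ in range(2000):
    y_hat = w * x + b
    # Gradients of mean((y_hat - y)^2) with respect to w and b.
    grad_w = 2 * np.mean((y_hat - y) * x)
    grad_b = 2 * np.mean(y_hat - y)
    w -= lr * grad_w
    b -= lr * grad_b

# Direct least-squares solution for comparison.
w_ls, b_ls = np.polyfit(x, y, 1)
print(w, b, w_ls, b_ls)
```

Both approaches land on essentially the same slope and intercept; gradient descent just gets there by repeatedly stepping downhill on the error.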

2-9 Building the equation

model.summary()

OLS Regression Results

Dep. Variable:	Yearly Amount Spent	R-squared (uncentered):	0.998
Model:	OLS	Adj. R-squared (uncentered):	0.998
Method:	Least Squares	F-statistic:	4.798e+04
Date:	Wed, 23 Mar 2022	Prob (F-statistic):	0.00
Time:	20:11:49	Log-Likelihood:	-1820.0
No. Observations:	400	AIC:	3648.
Df Residuals:	396	BIC:	3664.
Df Model:	4
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Avg. Session Length	11.9059	0.869	13.696	0.000	10.197	13.615
Time on App	34.3257	1.121	30.610	0.000	32.121	36.530
Time on Website	-14.1405	0.812	-17.405	0.000	-15.738	-12.543
Length of Membership	61.0149	1.144	53.318	0.000	58.765	63.265

Omnibus:	0.490	Durbin-Watson:	1.987
Prob(Omnibus):	0.783	Jarque-Bera (JB):	0.606
Skew:	-0.022	Prob(JB):	0.739
Kurtosis:	2.814	Cond. No.	55.4

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Building the equation from the summary

y = sum(coef * column)

y = 11.9059 * Avg. Session Length + 34.3257 * Time on App + ...
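The equation can be evaluated by hand for a single customer. The sketch below copies the four coefficients from the summary table and the first row of data.head(); the variable names are hypothetical. In statsmodels the same coefficients should also be available from the fitted result as model.params.

```python
import pandas as pd

# Coefficients copied from the coef column of the summary above.
coef = pd.Series({
    "Avg. Session Length": 11.9059,
    "Time on App": 34.3257,
    "Time on Website": -14.1405,
    "Length of Membership": 61.0149,
})

# Feature values of row 0 from data.head().
row = pd.Series({
    "Avg. Session Length": 34.497268,
    "Time on App": 12.655651,
    "Time on Website": 39.577668,
    "Length of Membership": 4.082621,
})

# y = sum(coef * column); pandas aligns the two Series by label.
y_hat = (coef * row).sum()
print(y_hat)
```

The result can be compared with that row's actual Yearly Amount Spent of 587.951054 to get a feel for the size of a single prediction error.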

Chapter 3

Predicting each customer's ad response rate with Logistic Regression

Logistic Regression is used for binary classification
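What makes it binary is that a linear score is squashed into a probability between 0 and 1 and then thresholded. The sketch below shows only that step with made-up scores; the 0.5 threshold is the usual default but is an assumption here, and no model is fitted.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores for three customers.
scores = np.array([-2.0, 0.0, 2.0])
prob = sigmoid(scores)

# Classify as 1 (clicked) when the probability reaches 0.5.
label = (prob >= 0.5).astype(int)
print(prob, label)
```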

import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

Loading and inspecting the data

data = pd.read_csv('data/advertising.csv')

data.head()

	Daily Time Spent on Site	Age	Area Income	Daily Internet Usage	Ad Topic Line	City	Male	Country	Timestamp
0	68.95	NaN	61833.90	256.09	Cloned 5thgeneration orchestration	Wrightburgh	0	Tunisia	3/27/2016 0:53
1	80.23	31.0	68441.85	193.77	Monitored national standardization	West Jodi	1	Nauru	4/4/2016 1:39
2	69.47	26.0	59785.94	236.50	Organic bottom-line service-desk	Davidton	0	San Marino	3/13/2016 20:35
3	74.15	29.0	54806.18	245.89	Triple-buffered reciprocal time-frame	West Terrifurt	1	Italy	1/10/2016 2:31
4	68.37	35.0	73889.99	225.58	Robust logistical utilization	South Manuel	0	Iceland	6/3/2016 3:36

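Column meanings

Daily Time Spent on Site
- Time spent on the site
Age
- Age
Area Income
- Individual income cannot be identified, so this is the average income of the user's area
Daily Internet Usage
- Daily internet usage time
Ad Topic Line
- Description of the ad
City
- User's city
Male
- Gender (0 for female, 1 for male)
Country
- User's country
Timestamp
- Time-related information
Clicked on Ad
- 0 if the ad was not clicked, 1 if it was

NaN marks an empty value.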

data.tail()

	Daily Time Spent on Site	Age	Area Income	Daily Internet Usage	Ad Topic Line	City	Male	Country	Timestamp	Clicked on Ad
995	72.97	30.0	71384.57	208.58	Fundamental modular algorithm	Duffystad	1	Lebanon	2/11/2016 21:49	1
996	51.30	45.0	67782.17	134.42	Grass-roots cohesive monitoring	New Darlene	1	Bosnia and Herzegovina	4/22/2016 2:07	1
997	51.63	51.0	42415.72	120.37	Expanded intangible solution	South Jessica	1	Mongolia	2/1/2016 17:24	1
998	55.55	19.0	41920.79	187.95	Proactive bandwidth-monitored policy	West Steven	0	Guatemala	3/24/2016 2:35	0
999	45.01	26.0	29875.80	178.35	Virtual 5thgeneration emulation	Ronniemouth	0	Brazil	6/3/2016 21:43	1

data.info()

RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Daily Time Spent on Site  1000 non-null   float64
 1   Age                       916 non-null    float64
 2   Area Income               1000 non-null   float64
 3   Daily Internet Usage      1000 non-null   float64
 4   Ad Topic Line             1000 non-null   object 
 5   City                      1000 non-null   object 
 6   Male                      1000 non-null   int64  
 7   Country                   1000 non-null   object 
 8   Timestamp                 1000 non-null   object 
 9   Clicked on Ad             1000 non-null   int64  
dtypes: float64(4), int64(2), object(4)
memory usage: 78.2+ KB

Clicked on Ad is the dependent variable, the value we want to predict

Only Age has 916 non-null values, which means it contains missing values

data.describe()

	Daily Time Spent on Site	Age	Area Income	Daily Internet Usage	Male	Clicked on Ad
count	1000.000000	916.000000	1000.000000	1000.000000	1000.000000	1000.00000
mean	65.000200	36.128821	55000.000080	180.000100	0.481000	0.50000
std	15.853615	9.018548	13414.634022	43.902339	0.499889	0.50025
min	32.600000	19.000000	13996.500000	104.780000	0.000000	0.00000
25%	51.360000	29.000000	47031.802500	138.830000	0.000000	0.00000
50%	68.215000	35.000000	57012.300000	183.130000	0.000000	0.50000
75%	78.547500	42.000000	65470.635000	218.792500	1.000000	1.00000
max	91.430000	61.000000	79484.800000	269.960000	1.000000	1.00000

For Area Income, the gap between min and the 25% value is large, which suggests a long tail on the low-income (left) side.

The mean of Male is 0.48, so about 48% of the users are male.

Visualization

sns.displot(data['Area Income'])

sns.displot(data['Age'])

Counting unique values

data['Country'].nunique()

data['City'].nunique()

data['Ad Topic Line'].nunique()

Checking and handling Missing Values

Missing Value

na
NaN
Null

Checking Missing Values

data.isna()

	Daily Time Spent on Site	Age	Area Income	Daily Internet Usage	Ad Topic Line	City	Male	Country	Timestamp	Clicked on Ad
0	False	True	False	False	False	False	False	False	False	False
1	False	False	False	False	False	False	False	False	False	False
2	False	False	False	False	False	False	False	False	False	False
3	False	False	False	False	False	False	False	False	False	False
4	False	False	False	False	False	False	False	False	False	False
...	...	...	...	...	...	...	...	...	...	...
995	False	False	False	False	False	False	False	False	False	False
996	False	False	False	False	False	False	False	False	False	False
997	False	False	False	False	False	False	False	False	False	False
998	False	False	False	False	False	False	False	False	False	False
999	False	False	False	False	False	False	False	False	False	False

1000 rows × 10 columns

data.isna().sum()

Daily Time Spent on Site     0
Age                         84
Area Income                  0
Daily Internet Usage         0
Ad Topic Line                0
City                         0
Male                         0
Country                      0
Timestamp                    0
Clicked on Ad                0
dtype: int64

data.isna().sum() / len(data)

Daily Time Spent on Site    0.000
Age                         0.084
Area Income                 0.000
Daily Internet Usage        0.000
Ad Topic Line               0.000
City                        0.000
Male                        0.000
Country                     0.000
Timestamp                   0.000
Clicked on Ad               0.000
dtype: float64
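The notes stop before the handling step, so the following is only a sketch of two common options, not necessarily what the lecture does: dropping the rows with a missing Age, or filling them with the column mean. The small frame is made up; dropna and fillna are the real pandas methods.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing Age, like row 0 of the ad data.
demo = pd.DataFrame({
    "Age": [np.nan, 31.0, 26.0, 29.0, 35.0],
    "Male": [0, 1, 0, 1, 0],
})

# Option 1: drop every row that has a missing value.
dropped_rows = demo.dropna()

# Option 2: fill the missing Age with the mean of the observed ages.
filled = demo.fillna({"Age": demo["Age"].mean()})

print(len(dropped_rows), filled["Age"].tolist())
```

With about 8% of Age missing, dropping rows loses a noticeable share of the data, which is one reason filling is often considered instead.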
